Source: https://comments.coar-repositories.org/comments-by-commenter/
At the University of Ottawa (Canada), we have a DSpace 6.2 repository containing a mix of French and English resources. Currently, our repository adds 'en_us' by default to any metadata field tagged with xml:lang, even if the metadata language is French. Do you have any recommendations for sorting through the records, identifying metadata with the wrong language code, and changing it?
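One possible approach, sketched very roughly here: export the metadata to a flat file and compare each declared language code against the language detected from the value itself. The CSV column names (field, value, lang) and the use of the langdetect library are illustrative assumptions, not part of DSpace.

# Rough sketch: flag metadata values whose declared language tag disagrees
# with the language detected from the text itself.
# Assumes a CSV export with columns: field, value, lang (illustrative only).
import csv
from langdetect import detect  # pip install langdetect

def flag_language_mismatches(path):
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            text = row.get('value') or ''
            declared = (row.get('lang') or '').lower().replace('_', '-').split('-')[0]
            if not declared or len(text) < 20:
                continue  # no declared code, or too short to detect reliably
            try:
                detected = detect(text)  # e.g. 'en', 'fr'
            except Exception:
                continue  # detection can fail on numbers, names, mixed text
            if detected != declared:
                print(f"{row.get('field')}: declared={row.get('lang')}, "
                      f"detected={detected}: {text[:60]}")

flag_language_mismatches('metadata_export.csv')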
This seems aspirational but vague. How would we go about integrating Wikidata in DSpace?
Could potentially be problematic in implementation if the automatically assigned codes did not accurately describe the content uploaded to the repository. Users are more likely to accurately classify their work than library staff or an AI.
Seems difficult to implement as content creators are unlikely to complete this step unless we make it compulsory. Since submission of pre-/post-prints and OA articles to our repository is voluntary, attempting to implement such a policy would likely deter submissions. In the case of graduate students, who are already required to submit their theses, this step would necessitate significant manual review. We could, however, provide a section in our submission guidelines and help text on how to add this information to encourage our submitters.
It should be noted that Wikidata concept labels keep changing. In our repository, depositar, we only store and expose the identifier itself (e.g. "Q11030"). We then query the MediaWiki API to get the latest multilingual labels of a Wikidata vocabulary term. We think it would be better to store and expose both (1) the latest label and (2) the (old) label at the time the keyword was assigned.
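For illustration, a minimal sketch of that kind of lookup against the standard Wikidata wbgetentities endpoint (caching and error handling omitted):

# Minimal sketch: fetch the current multilingual labels of a Wikidata item
# (e.g. Q11030) through the MediaWiki API.
import requests

def wikidata_labels(qid, languages=('en', 'fr', 'zh')):
    response = requests.get(
        'https://www.wikidata.org/w/api.php',
        params={
            'action': 'wbgetentities',
            'ids': qid,
            'props': 'labels',
            'languages': '|'.join(languages),
            'format': 'json',
        },
        timeout=10,
    )
    labels = response.json()['entities'][qid]['labels']
    return {lang: entry['value'] for lang, entry in labels.items()}

# Returns the latest labels only; the label stored at assignment time may differ.
print(wikidata_labels('Q11030'))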
In relation to this topic, if a document has a complete translation, should one item be created for each version, with the versions then connected through relation metadata? Or what would be the recommendation for those cases?
Generally, I would like definitions of the terms used. Who are "repository managers"? What is a "repository"? What is the intended audience? I can make inferences based on the content of the recommendations and the listed repository platforms (DSpace, Dataverse), but I would like it to be more explicit. That way we can interpret these recommendations for all of our use cases here.
I would recommend broadening this to explore and use multilingual authorities more generally, when possible. For example, revising to "explore possibilities of integrating multilingual authorities, such as Wikidata or INSERT OTHER EXAMPLE HERE, in your repository." It would make sense to encourage cross-pollination of different languages' resources, and to acknowledge that authority work exists in institutions across the globe. Wikidata and other resources like VIAF do often aggregate a few of these, but not all (generally those that have the time/support to contribute to Wikidata). To more fully support linguistic diversity and languages that may have more distributed support, it would be best to recommend authorities in general.
Curious about singling out ORCID, as that is mostly used for living people/researchers. I get the feeling these recommendations are more about scientific/data repositories and not for other digital collections that contain more archival or historical data. Recording identifiers from name authorities when they exist is already an accepted convention. Maybe just making the audience for these recommendations clear in the introduction would help?
More clarity on what this recommendation is asking for would be helpful. Does this refer to using AI or NLP/NER models? Or does this refer to simply having a robust taxonomy manager within one’s repository? A predictive text feature is simple to implement, but relies on existing content within a site. If AI or NLP/NER models are being suggested, more text should be spent on setting up best practices for their data modeling (bad data in just churns out poor metadata.)
I agree with the ISO recommendation to increase support for using ISO 639-3 codes, but this then appears to be at odds with the first set of recommendations for repository managers. Does this recommendation exist because of a lack of support (screen readers and IIIF not supporting 3-letter codes), with ISO 3-letter codes actually desired over ISO 2-letter codes? It would be helpful for this to be stated more clearly. Appendix 4 is also unclear on this. Maybe some sort of decision tree format in the recommendation text?
Link is now corrected to the one Iryna provided, Sadie.
[This comment is not related to paragraph 3, but actually to paragraph 4, which cannot be commented on (nor can §5 or §6).]
“This document presents the results of the task force work focusing on identifying good practices for … licenses, …”
What are the good practices for *licenses* in this version? Have I missed them?
I would suggest recommending a specific order for core metadata fields like title in case there is more than one title in the metadata, for example the original title plus a title translated into English (or any other language) for more visibility and discoverability.
I would suggest having the original title first, followed by all additional/alternative titles. Thus, taking up your example:
<dc:title xml:lang="fr">Libre Accès</dc:title>
<dc:title xml:lang="en">Open Access</dc:title>
The first dc:title has the 'fr' language attribute because the described record is in French, and the second dc:title has the 'en' attribute because the submitter wants to make the record more visible and better harvested at an international level.
Use case:
Record https://hal.science/hal-03130990 describes a book in French. The main title is in French (Les grands discours à L'Unesco de 1945 à nos jours), but there is an additional title in English (Great speeches at Unesco from 1945 to nowadays) that can be displayed by switching from 'fr' to 'en'. However, in the OAI output, the English title comes first: https://api.archives-ouvertes.fr/oai/hal/?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:HAL:hal-03130990v1 There is no guarantee that aggregators and harvesters will index and display both titles. There is a risk that only the first title provided is displayed (here the English one), while the described record is in French. I have encountered this issue several times already.
I would avoid using the phrase "non-official languages". What is the point of being an official language or not? What about using "local", "regional" or "less widespread" languages instead? I would also be careful about providing any language examples here (just to spare susceptibilities 😉 ).
As you say, the MARC 21 format uses ISO 639-2 language codes, and I find it a bit odd that the group recommends using ISO 639-1 encoding in MARCXML. This looks somewhat contradictory to me. But if you do, I would suggest adding subfield "2" (Source of code) to be compliant with the MARC format. Thus, taking your example:
Possibility 1:
<datafield tag="041" ind1=" " ind2="7">
<subfield code="a">it</subfield>
<subfield code="a">en</subfield>
<subfield code="a">fr</subfield>
<subfield code="2">iso639-1</subfield>
</datafield>
Possibility 2, since field 041 can be repeated:
<datafield tag="041" ind1=" " ind2="7">
<subfield code="a">it</subfield>
<subfield code="2">iso639-1</subfield>
</datafield>
<datafield tag="041" ind1=" " ind2="7">
<subfield code="a">en</subfield>
<subfield code="2">iso639-1</subfield>
</datafield>
<datafield tag="041" ind1=" " ind2="7">
<subfield code="a">fr</subfield>
<subfield code="2">iso639-1</subfield>
</datafield>
Remarks:
Note that the second indicator has been set to "7" (= Source specified in subfield $2).
The first indicator set to "0" (as in your example) means that the item is not a translation and does not include any translation, which is somewhat surprising given the three language codes you provide (although not impossible at all).
MARC control field 008 can only accept MARC codes (ISO 639-2) in positions 35-37. If only a non-MARC code is used in 041 to express the language of an item, field 008/35-37 should be coded with three fill characters (|||), not with the alpha-2 code!
Keep in mind also that NISO ODI recommends, on its side, always using the list of MARC language codes (https://www.loc.gov/marc/languages/language_code.html), thus ISO 639-2 (alpha-3) codes.
NISO RP-19-2020, Open Discovery Initiative: Promoting Transparency in Discovery https://www.niso.org/publications/rp-19-2020-odi
We are particularly keen to note the importance of inclusion of Indigenous languages in these discussions. Alongside the use of text in a specific language is also the need to identify the Indigenous country that the work is associated with. There is work ongoing for this in Australia.
Presumably this is an area where AI could actually usefully be employed, though it is not my area of expertise.
We are concerned that this will never happen at any useful scale below the very high level. For example, it is already hard enough to get people to change US English to Australian English in a familiar word processing tool such as Word.
Sometimes the names of Indigenous Australians will include their traditional country, and that needs to be captured.
The issue of controlled vocabularies not being comprehensive is especially acute in relation to First Nations people in Australia.
Thank you, FRANÇOIS. We have this sentence in the Recommendations for repository managers on translated content: "Promote the use of (re)translation-friendly licences to encourage translation of newly produced content and retranslation, as well as promote translation crediting (e.g. CC-BY)" – see https://hal-lara.archives-ouvertes.fr/OUVRIR-LA-SCIENCE/hal-03640511
Thank you. It looks like some kind of formatting issue; the real link in the document is this one: https://github.com/dcmi/pids_in_dc/blob/master/proposal/The_Association_of_Persistent_Identifiers_with_Literals_in_XML-formatted_Metadata_using_Dublin.md
Thanks a lot Susanna, I agree with you. This is what we suggested in our blog post about machine translation: “This document/This material is a machine translation [of : [citation of original]] from [source language code] into [target language code]. This machine translation has not been reviewed or edited and is provided “as is” for the sole purpose of assisting users in understanding at least part of the subject matter of the original content expressed in [source language]. This provision does not imply a guarantee of correctness and accuracy of the said machine translation [in target language] by any natural or legal person in any part of this translation. [Consequently, the provision of this translation shall not give rise to any liability on the part of any person to any other person in the event that this translation is used for any purpose whatsoever.] Users of this machine translation are expressly invited to have it checked, revised or edited by a professional translator or relevant expert.” https://www.coar-repositories.org/news-updates/is-there-a-case-for-accepting-machine-translated-scholarly-content-in-repositories/. And this is one of the real cases where the author used machine translation https://zenodo.org/record/7935017
As Devon Murphy (and above for paragraph 6 Raina Heaton) notes, ISO 639-3 three-letter codes are the more viable standard for much greater language inclusion, including more than 7800 individual languages (Living, Extinct, Historical, Ancient and Constructed), as compared with the less than 200 unambiguous individual languages included in 639-1. Thus the vast majority of languages are only handled in 639-3, though it is true that the vast majority of resources are currently covered by Part 1 as the majority of resources are in a relatively small number of major languages. The rubric for deciding which tag to use when using BCP 47 (RFC 5646) syntax is that if there is a correct 639-1, 2-letter code, it should be preferred. The IANA Subtag Registry actually lists the authoritative primary subtags, omitting the 639-2 and 639-3 identifiers that are exact equivalents of 639-1 identifiers. However, if the metadata architecture of a repository is such that the language, geographic, script, and variant components of metadata are separate elements, there is no benefit in preferring two letter identifiers (ISO 639-1) over three letter identifiers.
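As a rough illustration of that rubric (a sketch only; the pycountry library used here is an assumption for illustration, not something the draft prescribes):

# Sketch: choose the BCP 47 primary language subtag for a given ISO 639-3 code,
# preferring the two-letter ISO 639-1 identifier when one exists (per RFC 5646)
# and falling back to the three-letter ISO 639-3 identifier otherwise.
import pycountry  # pip install pycountry

def primary_language_subtag(iso639_3_code):
    lang = pycountry.languages.get(alpha_3=iso639_3_code)
    if lang is None:
        raise ValueError(f"Unknown ISO 639-3 code: {iso639_3_code}")
    # alpha_2 is only present for the roughly 180 languages that have a 639-1 code
    return getattr(lang, 'alpha_2', lang.alpha_3)

print(primary_language_subtag('fra'))  # -> 'fr'  (a 639-1 code exists, so it is preferred)
print(primary_language_subtag('grc'))  # -> 'grc' (Ancient Greek has no 639-1 code)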
It is inappropriate to cite a long-obsolete RFC, RFC 1766, as an authority for this section. BCP 47 currently points to RFC 5646, the third RFC in the sequence since RFC 1766 (https://datatracker.ietf.org/doc/html/rfc5646 or https://www.ietf.org/rfc/rfc5646.txt). The single-letter 'i' subtag prefix has not been included in the RFC since RFC 3066 (which followed RFC 1766 and became obsolete in 2006) and is no longer in use for new subtags assigned by IANA. All 12 instances of IANA-declared i-__ subtags were grandfathered into the current registry, and 11 of the 12 have since been deprecated. This approach should not be included as a recommendation.
In DSpace 7, the value-pairs set for languages can include whatever languages and language identifiers are desired, as this is customized in the submission-forms.xml file. It can include three-letter identifiers, if there are languages with three-letter identifiers present in the target material for a collection.
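For example, a language value-pairs block might look roughly like the following (an illustrative fragment only; the value-pairs-name and dc-term attributes must match whatever your local submission-forms.xml and metadata schema actually use):

<!-- illustrative fragment of submission-forms.xml; adapt names to your local configuration -->
<value-pairs value-pairs-name="common_iso_languages" dc-term="language_iso">
  <pair>
    <displayed-value>English</displayed-value>
    <stored-value>en</stored-value>
  </pair>
  <pair>
    <displayed-value>French</displayed-value>
    <stored-value>fr</stored-value>
  </pair>
  <pair>
    <displayed-value>Cherokee (ISO 639-3)</displayed-value>
    <stored-value>chr</stored-value>
  </pair>
</value-pairs>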
The paragraph’s middle sentences are not clearly worded.
It is odd to move between sections that are repository specific, then this section that is topic specific, then back to three more repository specific sections. Paragraphs 30, 31 here fit better in the DSpace and Dataverse sections, respectively.
We would recommend that the introduction and recommendations be translated into multiple languages and shared on the COAR website to promote and encourage multilingualism.
I corrected the descriptions about WEKO3.

Before: WEKO3 is a cloud-based repository system supported by JPCOAR (Japan Consortium for Open Access Repositories). It is developed based on INVENIO by CERN.
After: WEKO3 is a repository software developed by NII (National Institute of Informatics, Japan) based on INVENIO by CERN. This software operates JAIRO Cloud, a cloud-based repository system, which is supported by JPCOAR (Japan Consortium for Open Access Repositories) and NII.

Before: Specifically, ISO-639-3 is acceptable as the language of the text and for a language attribute of other metadata elements, ISO-639-1 is acceptable.
After: Specifically, ISO-639-3 is acceptable as the language of the text, and for a language attribute of other metadata elements, ISO-639-1 is acceptable.

Before: With each field, you can add a language tag in the form of a two-character ISO using the dropdown menu.
After: With WEKO3, you can add the language tag in the form of a two-character ISO using the dropdown menu, checkbox, and radio button.
(This is actually a comment on Appendix 3.) There is a correction regarding the URLs as follows.

JPCOAR Metadata Schema 2.0 Draft
https://schema.irdb.nii.ac.jp/ja/schema/2.0-draft/14
https://schema.irdb.nii.ac.jp/ja/schema/2.0-draft/1
–>
JPCOAR Metadata Schema 2.0
https://schema.irdb.nii.ac.jp/en/schema/2.0/14
https://schema.irdb.nii.ac.jp/en/schema/2.0/1
The xml:lang attribute is specific to XML. I suggest making this recommendation agnostic to the metadata representation, or give xml:lang as an example (“e.g.,”) here.
This wording “Because of the cardinality [0-n]” is confusing, because xml:lang is a non-repeatable [0-1] attribute when applied to a specific element. I think the recommendation is clear from the below example – multiple languages should be in separate metadata fields, not combined – but this sentence almost suggests that xml:lang could be repeated within the same element.
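For example, each language would go into its own occurrence of the element, each occurrence carrying a single xml:lang value (an illustrative Dublin Core fragment, not taken from the draft):

<dc:description xml:lang="en">Abstract in English.</dc:description>
<dc:description xml:lang="fr">Résumé en français.</dc:description>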
We always respect the language of the resource. If, for example, the resource is in Spanish, in the case of keywords, we use qualifiers to differentiate the metadata in different languages:
dc.subject.keywordseng: institutional repository
dc.subject.keywordseng: interoperability
dc.subject.keywordsspa: metadatos
dc.subject.keywordsspa: repositorio institucional
dc.subject.keywordsspa: interoperabilidad
https://ru.micisan.unam.mx/handle/123456789/22232?show=full
We declare the original language of the resource at the item level. All the metadata is described in English as an international language, following the agreement of the European Commission. But we have two user interfaces, one in Spanish (our language) and one in English in order to increase international visibility (the English interface is currently being updated).
Declare the language of the metadata (xml:lang attribute)
We used ISO639-3, according to OpenAIRE Guidelines v3: “Recommended: ISO 639-x, where x can be 1,2 or 3. Best Practice: we use ISO 639-3 and by doing so we follow: http://www.sil.org/iso639-3/”.
Totally agree.
ORCID or ISNI help to identify authors or creators nowadays, but the use of authority catalogues should also be included.
Besides the recognition of the translator, we encourage recognizing all the professionals involved in the editorial process who are mentioned in the resource; it takes a lot of time, but it is worth it. This is an example:
dc.contributor.assistanttotheeditorinchief: Cruz Salas, Minerva
dc.contributor.businessmanager: Zempoalteca Quintana, Mario
dc.contributor.copyeditorandtranslator: Dashner Monk, Heather
dc.contributor.designer: Pérez Ramírez, Patricia
dc.contributor.editorinchief: Jiménez, Teresa Andreu
dc.contributor.layout: Álvarez Sotelo, María Elena
dc.contributor.salesandcirculationmanager: Creamer Tejeda, Cynthia
dc.contributor.translator: Fernández Hall, María Cristina
In fact, the RECOLECTA evaluation for repositories considers this point: "4.5.- Existe un campo específico para indicar la descripción de la colaboración. En este campo se registra la entidad o persona responsable de coordinar, corregir, comentar o, en general, contribuir de alguna otra manera al desarrollo del recurso" (4.5.- There is a specific field to indicate the description of the collaboration. This field records the entity or person responsible for coordinating, correcting, commenting or, in general, contributing in some other way to the development of the resource.)
https://calidadrevistas.fecyt.es/sites/default/files/informes/2021guiaevaluacionrecolecta_vf.pdf
In linguistics and language studies, ISO 639-3 (3-letter) codes are a standard. First, most languages do not have 2-letter codes, and when they do, the codes are often confusing because they do not correspond to individual languages (e.g. cr for 'Cree', ms for 'Malay', or zh for 'Chinese'). This obscures exactly the type of diversity we hope to promote. Linguists and language archives are also increasingly using glottocodes (https://content.iospress.com/articles/semantic-web/sw212843) for "languoids", since what gets to "count" as a language is largely political. Consider having an optional field to include those as well.
The link in this paragraph makes it seem like it will go to a set of recommendations, but that does not seem to be where the URL directs you.
Is this suggesting this language be included in a cover page or README supplemental file during download? Could you make it clearer how the export options could include this information?
First, thank you for putting these recommendations together. Overall, I find them to be extremely helpful.
I think the presentation of the recommendations on this page would benefit from a different arrangement to make it easier to find information on the relevant recommendations. Since almost all of the bulleted recommendations have overlap between the 3 categories of people (1. repository managers, 2. repository software/platform developers, and 3. content creators), I recommend consolidating the recommendations to remove overlap, explaining them in prose (as you have done here), and then adding a table to indicate how each recommendation is relevant to the 3 categories of people. For example, the condensed recommendations would be the rows, and the columns would represent the 3 categories of people. Individual cells would have e.g., check marks to indicate who needs to be concerned with each specific recommendation. Then I recommend reorganizing the rest of the document so that it is organized around the actual recommendations and not the categories of people. In the current organization, for many of the sections, I had a hard time figuring out exactly which of the recommendations was the focus of the different sections, and I kept having to go back to the recommendations to try to figure it out.
This comment is actually for the text between examples 2 and 3, but that paragraph is not tagged for comments.
In general I agree with this; however, at The Archive of Indigenous Languages of Latin America (AILLA), we have found that this is not optimal in certain specific use cases dealing with educational materials or articles on "linguistic typology" or "linguistic diversity", which frequently include a single example or a few examples of a grammatical phenomenon in many different languages. We have found that when users are searching for content in a particular language, they get frustrated when they land on such examples in our repository, because the examples either are not written in the language for which they are searching, or do not have enough examples in or information about that language to be useful. This is, of course, an edge case, but one to be considered. See e.g. Terrence Kaufman's archived course on American Indian Languages https://ailla.utexas.org/islandora/object/ailla%3A137495.
The guidelines are not clear on why the xml:lang attribute should be used instead of one more applicable to the metadata schema of the repository. The links are not helpful for sussing this out.
Here’s a MODS example from AILLA, where we use ISO 639-3 language codes:
<titleInfo lang="eng">
<title>Iskonawa Oral Tradition</title>
</titleInfo>
<titleInfo lang="spa">
<title>Tradición Oral Iskonawa</title>
</titleInfo>
AILLA uses ISO 639-3 language codes, which are the most specific and usually the best option for Indigenous and minority languages. For languages that do not have an assigned 639-3 code, we assign a local code, but we are considering moving to assigning the code "mis" (uncoded languages) to these languages instead.
This comment is meant for the paragraph above, which does not have a comment bubble. In the sentence "This approach does not ensure consistency, not does it reveal hierarchical relations among terms", the second instance of "not" should actually be "nor".
The linked text in the first paragraph ("using controlled terms …") works and leads to the COAR Controlled Vocabularies Implementation Guide, but many of the links on that page are broken, or the server was unavailable when I tried to access them.
This comment is for bullet #4 (“It is not automated;”) in the above list where there is no comment bubble.
What does this mean? Please add more information to clarify what it would mean for this list to be automated. Automated how? For what purpose?
AILLA has not needed to transliterate a non-Roman script yet, but we are expecting a deposit of modern (21st-century) texts written in Mayan hieroglyphics by native speakers of several different Mayan languages. I do not think that UTF-8 can handle this script, as it is not yet encoded in Unicode. This deposit presents new challenges for us, e.g. in determining which language code to use to classify the texts (we plan to use the code for the native language of the author), in rendering the glyphs (it might not be possible to do this in the metadata), and in translating the content of the writing into Spanish and English (we must rely on the content creators to do that). I am sure there will be additional challenges that we have not foreseen.
This statement implies that a translation is always made, or at least revised, by a human, but texts, and metadata especially, are increasingly machine translated with very limited human supervision. How can we make machine-translated content, and the associated risk of error, clearly identifiable? Maybe we should at least make a distinction between the two cases below:
1. For human translation and human-revised machine translation of publications -> "This material titled '[translated title]' is an integral (partial) translation into [language name – standard language code] dated [DD-MM-YYYY] by [translator(s)' name(s)] of '[original title]' by [author(s)' name(s)] in [language name – standard language code] as published in [publication details]/retrieved from [DOI, other PID resolver or URL]."
2. For raw machine translation, and especially for metadata -> "This material is a machine translation into [language name – standard language code] dated [DD-MM-YYYY] of '[original title]' by [author(s)' name(s)] in [language name – standard language code] as published in [publication details]/retrieved from [DOI, other PID resolver or URL]." NB: It should be noted that this kind of description omits translation credits because the content is actually generated by a machine without human intervention, but it raises the question of who is actually responsible for this machine-translated content.