Actions for repository managers: Correct labelling of languages
When the language of the resource is correctly attributed, it allows discovery and indexing services to properly process and parse the text. Indexing involves text analysis practices such as stemming, lemmatization (grouping together the inflected forms of a word so they can be analysed as a single item), and the appropriate treatment of stop-words, all of which are language specific. Including the language tag enables information seekers, aggregators, and other discovery services to correctly identify the language of the full text and treat items accordingly.
Declare the language of the resource at the item level
Declaring the primary language of the document is considered mandatory. The language metadata must be encoded using an ISO 639 language code.
If the document has only one language, language metadata identifies the primary language of the resource. Attribution of the primary language of the resource must be done at the item level.
Example 1: language in simple Dublin Core XML with ISO-639-1 encoding
<dc:language>en</dc:language>
Example 2: language in MODS with ISO 639-2 encoding
<language>
<languageTerm authority="iso639-2b" type="code" authorityURI="http://id.loc.gov/vocabulary/iso639-2" valueURI="http://id.loc.gov/vocabulary/iso639-2/eng">eng</languageTerm>
</language>
If the whole document, or parts of it, contains more than one language, the language metadata should be repeated so that each language is declared.
Example 3: bilingual (French/English) document in simple Dublin Core XML with ISO-639-1 encoding
<dc:language>en</dc:language>
<dc:language>fr</dc:language>
Example 4: bilingual (French/English) document in MODS with ISO 639-2 encoding
<language>
<languageTerm authority="iso639-2b" type="code" authorityURI="http://id.loc.gov/vocabulary/iso639-2" valueURI="http://id.loc.gov/vocabulary/iso639-2/eng">eng</languageTerm>
</language>
<language>
<languageTerm authority="iso639-2b" type="code" authorityURI="http://id.loc.gov/vocabulary/iso639-2" valueURI="http://id.loc.gov/vocabulary/iso639-2/fre">fre</languageTerm>
</language>
See Appendix 2 for more implementation examples following metadata standards and guidelines.
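As a rough illustration of how an aggregator or repository manager might check item-level language declarations, the following sketch reads every dc:language element from a simple Dublin Core record and separates values that look like ISO 639 codes from values that need review. The record wrapper, the sample data and the function name item_languages are assumptions made for this example; the check only tests the shape of the value (two or three lowercase letters), not whether the code is actually assigned.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Shape check only: two or three lowercase letters. This does not confirm
# that the code is actually assigned in ISO 639.
CODE_SHAPE = re.compile(r"^[a-z]{2,3}$")

# Hypothetical record used only for this example.
SAMPLE_RECORD = f"""
<record xmlns:dc="{DC_NS}">
  <dc:title>Open Access</dc:title>
  <dc:language>en</dc:language>
  <dc:language>fr</dc:language>
  <dc:language>English</dc:language>
</record>
"""


def item_languages(record_xml):
    """Return (code_like, needs_review) lists of dc:language values."""
    root = ET.fromstring(record_xml)
    code_like, needs_review = [], []
    for element in root.iter(f"{{{DC_NS}}}language"):
        value = (element.text or "").strip()
        if CODE_SHAPE.match(value):
            code_like.append(value)
        else:
            needs_review.append(value)
    return code_like, needs_review


if __name__ == "__main__":
    codes, flagged = item_languages(SAMPLE_RECORD)
    print("codes:", codes)
    print("needs review:", flagged)
```

A fuller check would compare each value against the ISO 639 code tables rather than a pattern.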
Declare the language of the metadata (xml:lang attribute)
Use the xml:lang attribute to indicate the language of a metadata field. Because the cardinality is [0-n], the same element can be repeated with a different xml:lang value for each language, which makes this more accurate than the dc:language element for describing the language of the metadata.
Declare the language of the metadata even when it is English. This may seem like additional effort, but it is worthwhile: aggregators cannot assume the language of the metadata.
The xml:lang attribute/subproperty is described in the XML specification (https://www.w3.org/TR/xml/#sec-lang-tag). The values of the attribute are language identifiers as defined by [IETF BCP 47], Tags for the Identification of Languages.
How to attribute language when there is more than one language in the metadata fields
Use the xml:lang attribute to indicate the language of each metadata field.
<datacite:titles>
<datacite:title xml:lang="en">Open Access</datacite:title>
<datacite:title xml:lang="pl">Otwarty Dostęp</datacite:title>
</datacite:titles>
<dc:title xml:lang="en">Open Access</dc:title>
<dc:title xml:lang="fr">Libre Accès</dc:title>
See Appendix 3 for more implementation examples following metadata standards and guidelines.
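To show why declaring xml:lang helps consumers of the metadata, here is a minimal sketch of how a harvester might pick a title in a preferred language. It uses Python's standard xml.etree.ElementTree, which exposes xml:lang under the expanded XML namespace name. The record wrapper, the untagged third title and the function names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# The xml: prefix is always bound to this namespace, so ElementTree exposes
# xml:lang under this expanded attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical record used only for this example.
SAMPLE_RECORD = f"""
<record xmlns:dc="{DC_NS}">
  <dc:title xml:lang="en">Open Access</dc:title>
  <dc:title xml:lang="fr">Libre Accès</dc:title>
  <dc:title>Untagged title</dc:title>
</record>
"""


def titles_by_language(record_xml):
    """Group dc:title values by their xml:lang value (None if undeclared)."""
    root = ET.fromstring(record_xml)
    titles = {}
    for element in root.iter(f"{{{DC_NS}}}title"):
        lang = element.get(XML_LANG)
        titles.setdefault(lang, []).append(element.text)
    return titles


def pick_title(record_xml, preferred):
    """Return the first title whose language appears in `preferred`, in order."""
    titles = titles_by_language(record_xml)
    for lang in preferred:
        if lang in titles:
            return titles[lang][0]
    return None  # no title declared in any preferred language


if __name__ == "__main__":
    print(pick_title(SAMPLE_RECORD, ["pl", "fr", "en"]))
```

The untagged title ends up under the key None: without a declaration, the consumer has no reliable way to know which language it is in.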
Use two-letter language codes whenever they are available, and three-letter codes if necessary (ISO 639)
IANA recommends using two-letter codes whenever they are available, and three-letter codes if necessary (e.g. if no two-letter code exists):
http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry and https://en.wikipedia.org/wiki/IETF_language_tag.
According to the “Internet Official Protocol Standards” IETF RFC 1766, Tags for the Identification of Languages: “The language tag is composed of 1 or more parts: A primary language tag and a (possibly empty) series of subtags.
In the primary language tag all 2-letter tags are interpreted according to ISO standard 639, “Code for the representation of names of languages” [ISO 639].
The information in the subtag may for instance be:
Using language codes can also be practical for historical or non-official languages (e.g. Latin, Walloon, etc.). Examples in Walloon:
https://orbi.uliege.be/handle/2268/28421
https://orbi.uliege.be/handle/2268/28419
See more about ISO 639-1, ISO 639-2 and ISO 639-3 and language tags in Appendix 4.
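The rule "two-letter code when available, three-letter code otherwise" can be sketched as a small lookup. The two tables below are deliberately tiny, illustrative subsets chosen for this example (including Latin and Walloon from the text above); a real implementation would load the full IANA Language Subtag Registry or the ISO 639 code tables. The function name preferred_code is hypothetical.

```python
# Illustrative subset only: three-letter codes that have a two-letter
# ISO 639-1 equivalent.
THREE_TO_TWO = {
    "eng": "en",
    "fre": "fr",  # ISO 639-2 bibliographic code
    "fra": "fr",  # ISO 639-2 terminology code
    "pol": "pl",
    "lat": "la",  # Latin
    "wln": "wa",  # Walloon
}

# Illustrative subset only: three-letter codes with no two-letter equivalent.
THREE_LETTER_ONLY = {
    "gsw",  # Swiss German
    "nah",  # Nahuatl languages
}


def preferred_code(code):
    """Return the two-letter code if one exists, else the three-letter code."""
    code = code.strip().lower()
    if code in THREE_TO_TWO.values():
        return code  # already a known two-letter code
    if code in THREE_TO_TWO:
        return THREE_TO_TWO[code]
    if code in THREE_LETTER_ONLY:
        return code
    # Refuse to guess for codes outside the illustrative tables.
    raise ValueError(f"code not in the example tables: {code!r}")


if __name__ == "__main__":
    for raw in ["eng", "FRE", "wln", "gsw", "la"]:
        print(raw, "->", preferred_code(raw))
```

Raising an error for unknown values, rather than passing them through, keeps unverified codes out of the exported metadata.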
Implementing recommendations in repository software platforms
Dataverse
Dataverse is an open-source data repository platform developed by IQSS at Harvard University. A strong Dataverse community helps to improve its basic functionality and develop it further. DANS-KNAW delivered a production-ready (Docker/k8s) Dataverse repository for the European Open Science Cloud (EOSC) communities CESSDA, CLARIN and DARIAH. To address the challenges of integrating heterogeneous and multilingual datasets, DANS-KNAW introduced support for external controlled vocabularies: the CESSDA Metadata Model connected to the Skosmos framework, the CLARIN Component MetaData Infrastructure, and the European Language Social Science Thesaurus (ELSST), hosted by CESSDA and ODISSEI in Skosmos (CESSDA has an updated version with more language properties).
DSpace
DSpace 7 allows a language attribute to be set for any metadata field. The dropdown menu offers two-character ISO codes, but the field also accepts free text, so different variations appear in practice. For example, more than one language option can be added for an item, and for several fields, but it is not possible to specify which field a given language refers to. Earlier versions of DSpace required workarounds for this. See Appendix 5 on how to fix language code inconsistencies in repositories running on previous versions of DSpace.
However, the language of the metadata is not exposed through OAI-PMH; this remains a request to the software developers.
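Where free-text entry has produced variants of the same language tag, one possible clean-up step is to bring them into the conventional BCP 47 form (lowercase language, title-case script, uppercase region, hyphen separators). The sketch below only normalizes separators and letter case; it does not validate subtags against the registry, and it is not the procedure from Appendix 5. The function name normalize_tag and the sample values are hypothetical.

```python
import re


def normalize_tag(raw):
    """Normalize separators and letter case of a language tag.

    Follows the BCP 47 formatting convention: subtags are lowercase, except
    two-letter subtags (uppercase) and four-letter alphabetic subtags
    (title case) that are not first and do not follow a singleton.
    """
    parts = [p for p in re.split(r"[-_]", raw.strip()) if p]
    if not parts:
        raise ValueError("empty language value")
    normalized = [parts[0].lower()]
    seen_singleton = len(parts[0]) == 1
    for part in parts[1:]:
        if seen_singleton:
            fixed = part.lower()
        elif len(part) == 2:
            fixed = part.upper()
        elif len(part) == 4 and part.isalpha():
            fixed = part.title()
        else:
            fixed = part.lower()
        if len(part) == 1:
            seen_singleton = True
        normalized.append(fixed)
    return "-".join(normalized)


if __name__ == "__main__":
    for raw in ["en_US", "EN", "zh-hant-tw", "en-ca-x-CA"]:
        print(repr(raw), "->", normalize_tag(raw))
```

Values that are not tags at all, such as a language name typed in full, still need a separate mapping step or manual review.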
Multiple repository interface languages
If the repository software supports multiple interface languages, it is recommended to set up the user interface in the native language(s) of the target group as well as in English.
DSpace provides support for multiple interface languages. The text displayed on the interface is called “messages”, and the messages files (language packs) are contributed and managed by the community outside the core DSpace project to allow more regular updates and releases. Users can modify community translations or create their own and commit them to the dspace-api-lang project on GitHub. Apart from messages, it is possible to localize other elements, such as help pages, input forms and email templates. Instructions on how to enable the interface in multiple languages are available in the DSpace documentation. DSpace 7 makes a step forward towards facilitating UI translations: https://wiki.lyrasis.org/pages/viewpage.a:
Dataverse supports multilingual user interfaces and relies on community translations done by volunteers. Major progress towards creating a directory of language packs was made within the Social Sciences and Humanities Open Cloud (SSHOC) project and the online tool Weblate was designed to facilitate new translations. A user guide for Weblate is also available: https://doi.org/10.5281/zenodo.4807371.
EPrints
EPrints provides support for multiple interface languages, using language-specific folders of “phrases” and other files. By default, EPrints only comes packaged with English language phrases, but the community has shared many translations in the EPrints Bazaar and on EPrints Files. EPrints uses two-letter ISO language codes to name sub-directories of phrases and other types of language-specific directories, for example:
- lib/lang/en/phrases/
- lib/lang/fr/static/
- lib/lang/de/templates/
EPrints subject metadata is designed to accommodate multilingual labels, so subject labels can be displayed according to what language the user has set for the interface.
EPrints is designed to default to English phrases; if it has missing phrases for another declared interface language, it will use the English language phrases until the missing phrases are added. There is a technical wiki page about translations but it may be out of date as it has only been edited a few times in the last few years.
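The fallback behaviour described above can be pictured with a short sketch. This is not EPrints code (EPrints is written in Perl and stores phrases in files under directories such as lib/lang/en/phrases/); it is a simplified Python model, with made-up phrase identifiers and an in-memory dictionary, of the general idea of falling back to English when a translation is missing.

```python
# Hypothetical phrase tables keyed by two-letter language code.
PHRASES = {
    "en": {"search": "Search", "deposit": "Deposit an item"},
    "fr": {"search": "Rechercher"},  # "deposit" has not been translated yet
}

DEFAULT_LANG = "en"


def phrase(phrase_id, lang):
    """Look up a phrase in `lang`, falling back to the default language."""
    for candidate in (lang, DEFAULT_LANG):
        text = PHRASES.get(candidate, {}).get(phrase_id)
        if text is not None:
            return text
    raise KeyError(phrase_id)  # missing even in the default language


if __name__ == "__main__":
    print(phrase("search", "fr"))   # translated phrase
    print(phrase("deposit", "fr"))  # falls back to English
```

With this kind of fallback, a partially translated interface stays usable, at the cost of showing mixed languages until the translation is complete.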
EPrints can be extended to declare language information at the item or file level, but this is not in place by default. Similarly, the EPrints XML export plugins, embedded metadata and OAI-PMH interface code could be extended to define xml:lang attributes, but they do not do this by default.
OSF – Open Science Framework
New metadata enhancements on OSF for all OSF Projects, Registrations, and Preprints now include the language of materials; more details are available in New OSF Metadata to Support Data Sharing Policy Compliance.
TIND IR
https://www.tind.io/ir
The TIND IR is a MARC-based repository. That means that the easiest way to include information about multilingual content is through the 041 field and relevant subfields (https://www.loc.gov/marc/bibliographic/bd041.html). While the language codes used for cataloguing (https://www.loc.gov/marc/languages/language_name.html) do not conform to the recommendation of this group, the processes and details of repository entries are more flexible and should probably instead use the language code methods described in this recommendation. The use of subfields allows for granular declaration of item language, summary language, table of contents language, and more.
The XML might look something like the following:
<datafield tag="041" ind1="0" ind2=" ">
<subfield code="a">it</subfield>
<subfield code="a">en</subfield>
<subfield code="a">fr</subfield>
</datafield>
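As a sketch of how the granularity of the 041 subfields could be used, the following reads a 041 datafield and groups the language codes by role. In MARC 21, subfield a holds the language of the text, b the language of the summary or abstract, and f the language of the table of contents. The sample field is a variation on the example above made up for this illustration, it omits the MARCXML namespace that real records normally carry, and the function name languages_by_role is hypothetical.

```python
import xml.etree.ElementTree as ET

# MARC 21 field 041 subfields covered by this example.
ROLES = {
    "a": "text",
    "b": "summary or abstract",
    "f": "table of contents",
}

# Hypothetical field; real MARCXML usually declares the
# http://www.loc.gov/MARC21/slim namespace, omitted here for brevity.
SAMPLE_FIELD = """
<datafield tag="041" ind1="0" ind2=" ">
  <subfield code="a">it</subfield>
  <subfield code="a">en</subfield>
  <subfield code="b">fr</subfield>
</datafield>
"""


def languages_by_role(field_xml):
    """Group the language codes of a 041 field by what they describe."""
    field = ET.fromstring(field_xml)
    if field.get("tag") != "041":
        raise ValueError("not a 041 field")
    grouped = {}
    for subfield in field.findall("subfield"):
        code = subfield.get("code")
        role = ROLES.get(code, f"subfield ${code}")
        grouped.setdefault(role, []).append((subfield.text or "").strip())
    return grouped


if __name__ == "__main__":
    for role, codes in languages_by_role(SAMPLE_FIELD).items():
        print(f"{role}: {', '.join(codes)}")
```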
WEKO 3
WEKO3 is a cloud-based repository system supported by JPCOAR (Japan Consortium for Open Access Repositories). It is built on INVENIO, which is developed by CERN. WEKO3 supports the JPCOAR metadata schema by default, and a language attribute can be added to any metadata element as long as the schema allows it. Specifically, ISO-639-3 is accepted for the language of the text, and ISO-639-1 is accepted for the language attribute of other metadata elements. For each field, a language tag can be added as a two-character ISO code using the dropdown menu.
This comment is actually for the text between examples 2 and 3, but that paragraph is not tagged for comments.
In general I agree with this; however, at The Archive of Indigenous Languages of Latin America (AILLA), we have found that this is not optimal in certain specific use cases dealing with educational materials or articles on “linguistic typology” or “linguistic diversity”, which frequently include one or a few examples of a grammatical phenomenon in many different languages. We have found that when users are searching for content in a particular language, they get frustrated when they land on such examples in our repository, either because the materials are not written in the language they are searching for, or because they do not contain enough examples in, or information about, that language to be useful. This is, of course, an edge case, but one to be considered. See e.g. Terrence Kaufman’s archived course on American Indian Languages https://ailla.utexas.org/islandora/object/ailla%3A137495.