Some of the use cases that are driving the recommended practices are as follows:
1. As a non-English institution, I receive documents in English in my repository that I need to describe.
|When a new English document is submitted to the repository, it needs to be described with different metadata fields in different languages (e.g. abstracts, titles, keywords, document type) and using non-English controlled vocabularies.|
|Example: Hokkaido University uses the JPCOAR metadata schema. Metadata in different languages is put in the same metadata field but distinguished by the language attribute, e.g. dc.description.abstract and dc.subject, https://eprints.lib.hokudai.ac.jp/dspace/handle/2115/79104?mode=full&submit_simple=Show+full+item+record – a language column on the right side of the page shows the ISO language code of the metadata. When journal articles are deposited, all metadata from the published version is included without translation from the original (Japanese-language journals typically provide abstracts and keywords in English as well, while the full text is in Japanese). Abstracts are stored in the metadata with the language attribute embedded, and authors’ names are given in the language of the article. A scheme for marking multilingual metadata therefore exists, but there are concerns about discoverability and about which metadata is most suitable.|
2. As a repository manager, I often deal with articles, theses or dissertations that are written in more than one language.
|All theses and dissertations are submitted in French, but many contain articles inserted as chapters in the language in which they were written.|
|Example: At ULiège, if a document is available in different languages, each language version is made available as a different record with metadata in different languages. Example of the same document in two different languages, for which two different records exist: https://orbi.uliege.be/handle/2268/170862 and https://orbi.uliege.be/handle/2268/170863. But there is only one language attribute for the record.|
3. As an author, I would like to see my articles written in different languages in one record, for statistics and for reporting.
|All articles in different languages are deposited in one item and need to be described properly.|
|Example: At the Open University of Catalonia, articles in different languages used to have two separate records. Now, at the request of authors, translations are kept together in one record or even in the same file, which simplifies citation tracking and increases visibility. However, this might cause issues for content aggregators and indexing services.|
4. As a repository manager, I want to provide submission fields in different languages.
|[THIS MAY BE SPECIFIC TO DSPACE]. When configuring submission forms, the labels and help/instructions for each field can only be written in one language. Multilingualism can only be achieved by typing the label in each language in the same field (Author/Auteur).|
5. As a repository manager, I want to have a collection name and description in more than one language.
|Currently only one language is allowed for a collection name and description.|
|Example: [THIS MAY BE SPECIFIC TO DSPACE]. It would be useful if introductory texts (HTML) of communities/collections could be presented in multiple languages. This could be accomplished quite easily using CSS and named divs. Unfortunately, HTML attributes such as id and style seem to be removed from the HTML output, i.e. <div id="swedish">text</div> is transformed to <div>text</div> in the UI. |
As collections and communities are items in DSpace (and thus have their own metadata), one way to solve this problem might be to allow language selection at the metadata level, as is already possible for object metadata (e.g. abstracts).
A simple and quick workaround for the bilingual issue of collections/communities in DSpace is to use a delimiter, such as the bar |, between the two texts describing these entities and their metadata fields as needed. All that is required is to split the text at viewing time so that only the text in the currently active language is displayed. The Arabic version of the communities/collections list is available here: https://repo-nu.maktabat-online.com/community-list. When the language is switched to the English interface, using the world icon at the top, they all appear in English. The same approach has been applied to the facet elements, so that controlled values such as names of formats/types, universities/colleges/departments, entities, etc. appear in multiple languages.
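The delimiter workaround can be illustrated with a minimal Python sketch. It is not the code used by the repository mentioned above: the function name `localized`, the segment order (Arabic first, English second) and the sample collection name are assumptions made for illustration only.

```python
def localized(text, active_lang, order=("ar", "en"), delimiter="|"):
    """Return the segment of a delimiter-joined string for the active language.

    `order` is an assumed convention: the position of each language code
    matches the position of its text segment in the stored value.
    """
    parts = [part.strip() for part in text.split(delimiter)]
    try:
        index = order.index(active_lang)
    except ValueError:
        index = 0  # unknown interface language: fall back to the first segment
    # Fall back to the first segment when the value has no text for this language.
    if index < len(parts) and parts[index]:
        return parts[index]
    return parts[0]


# Hypothetical collection name stored as "Arabic | English".
stored_name = "كلية العلوم | Faculty of Science"
print(localized(stored_name, "en"))
print(localized(stored_name, "ar"))
```

A value that contains no delimiter is returned unchanged for every interface language, so monolingual names keep working.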
6. As a repository manager, I want to be able to manage labels in my language efficiently.
|In open source multilingual software (OJS, DSpace, EPrints, etc.), English labels are mandatory when new features are developed. Updates for other languages often lag behind and are handled afterwards by the community or sometimes locally. Translating new software functionality is a challenge.|
|Examples: The ZORA (Zurich Open Repository and Archive) EPrints repository, https://www.zora.uzh.ch/, has a German version of the interface. |
CSpace in China includes a metadata schema and interface in different languages, but repository managers still have challenges describing content in repositories.
It’s usually up to the users to select language tags and users are trained on how to deposit multilingual content.
The interface languages of the repositories developed by the University of Belgrade Computer Centre (Serbia) include English and Serbian (in two alphabets: Cyrillic and Latin), e.g. https://dais.sanu.ac.rs/. As the users were not satisfied with the available translations, the development team devised an in-house web application to facilitate translation: https://trapist.rcub.bg.ac.rs/DESI/. The application allows adding, removing and changing selected labels in individual or in all repositories. Changes are propagated to the repositories within 24 hours.
7. As a repository manager, I want to offer metadata translation into English, e.g. abstracts, titles and subjects.
|Some metadata needs to be translated into English using machine translation tools.|
|Examples: The Google translation API (https://cloud.google.com/translate) is used for translating abstracts, titles and subjects. |
This could also be achieved by recommending or requiring at least minimum metadata in English in user guidelines. In the Digital Archive of the Serbian Academy of Sciences and Arts, providing at least a brief description and keywords in English is recommended, as this improves content discoverability: https://repowiki.rcub.bg.ac.rs/index.php/DAIS_-_Digital_Archive_of_the_Serbian_Academy_of_Sciences_and_Arts:_Metadata.
8. As a national repository, I need to deposit items in all languages of the country.
|Content is available in local languages, but some of them have no language code or aren’t in Unicode, and no controlled vocabularies exist in those languages.|
|Example: In Nepal, only titles are added in Nepali; the rest of the metadata is in English. Keywords in Nepali are not standardized consistently, and there are no controlled vocabularies. Many local languages aren’t in Unicode, and romanized words are sometimes used, e.g. किताब may appear as kitaba (romanized) or as "a book" (translated). This creates issues for Google Scholar indexing, which would like to see metadata in the language of the article.|
9. As a repository manager, I would like to expose the language of the metadata in OAI-PMH.
|Currently there is no exposure for the language of the metadata in OAI-PMH.|
|Wish list: Repositories should use metadata language tags consistently and deliberately to ensure that incorrect language information isn’t exposed. A language attribute should also be exportable, including via OAI-PMH. Another option could be a proactive approach by repositories: regularly (e.g. on a monthly basis) extracting metadata reference sheets and making them openly available to expose the language values.|
10. As an aggregator and discovery system, I want to know the language of the full-text document I am indexing, so I can assist users in finding content in their preferred language.
|There are issues with indexing content at the aggregator level (Solr, VuFind, etc.) because there is no way to separate the indexes by language and use language-specific tools to enrich the search experience.|
Most regional repositories’ metadata does not properly separate multilingual information. Mixed languages can even be found within a single textual metadata field.
Keywords and descriptors are in multiple languages without proper identification, and hundreds of repositories use different vocabularies even within the same language. Some ideas were discussed around implementing automatic classifiers to tag repository metadata with normalized vocabularies for the region.
|Examples: LA Referencia is developing a language detection tool (using different Python libraries for natural language processing) to separate languages in textual metadata fields in order to improve metadata at the aggregator level. The idea is to add proper xml:lang tags to every textual metadata field. The indexing process would use this tagging to generate separate indexes. However, dealing with different languages in the search UI remains a complex problem. |
CORE seems to use a language detection tool. Distinguishing among Bosnian, Croatian, Montenegrin and Serbian is a challenge, as these languages are very similar. As a result, language tags in CORE are usually incorrect for these languages. Using a common tag for the BCMS languages would be one solution to this problem.
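The idea of adding xml:lang tags to untagged textual fields can be sketched in Python as follows. This is only an illustration under stated assumptions, not the LA Referencia tool or the detector used by CORE: `toy_detect` is a deliberately naive stop-word counter standing in for a real language detection library, the list of textual fields is hypothetical, and the choice of `hbs` (the ISO 639-3 macrolanguage code for Serbo-Croatian) as the common BCMS tag is an assumption, since the text above does not name a specific tag.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumption: these are the fields treated as "textual" in this sketch.
TEXT_FIELDS = {"title", "description", "subject"}

# Assumption: closely related languages are collapsed into one common tag.
BCMS_CODES = {"bs", "hr", "sr", "cnr"}
BCMS_GROUP_TAG = "hbs"

# Toy stand-in for a real detector: counts a few stop words per language.
STOPWORDS = {
    "en": {"the", "and", "of", "in", "is"},
    "es": {"el", "la", "de", "y", "los", "en"},
    "pt": {"o", "a", "de", "e", "os", "em", "não"},
}


def toy_detect(text):
    """Return the language with the most stop-word hits, or None if no hits."""
    words = set(text.lower().split())
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def tag_languages(record, detect=toy_detect):
    """Add xml:lang to untagged textual Dublin Core elements, in place."""
    for field in TEXT_FIELDS:
        for element in record.iter(f"{{{DC_NS}}}{field}"):
            text = (element.text or "").strip()
            if not text or element.get(XML_LANG):
                continue  # skip empty values and keep existing tags untouched
            lang = detect(text)
            if lang:
                tag = BCMS_GROUP_TAG if lang in BCMS_CODES else lang
                element.set(XML_LANG, tag)
    return record


# Hypothetical record with untagged fields in two languages.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>The impact of climate change in the region</dc:title>
  <dc:description>El impacto de los cambios en la región</dc:description>
  <dc:identifier>hypothetical-id-1</dc:identifier>
</record>"""

tagged = tag_languages(ET.fromstring(SAMPLE))
print(ET.tostring(tagged, encoding="unicode"))
```

Passing the detector as a parameter keeps the tagging logic independent of any particular library, so a production detector could be swapped in without changing `tag_languages`.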
11. As an aggregator, I would like to index content correctly and assist users in finding content in their languages.
|The OpenAIRE Institutional and thematic Repository Guidelines (for aggregating repository content) encourage the use of the xml:lang attribute to indicate the language of the metadata. The OpenAIRE aggregator supports this attribute.|
|Example: <dc:description>Foreword [by] Hazel Anderson; Introduction; The scientific heresy: transformation of a society; Consciousness as causal reality [etc]</dc:description>|
<dc:description xml:lang="en-US">A number of problems in quantum state and system identification are addressed.</dc:description>
OpenAIRE supports the XML language tag, and the aggregator checks metadata for language, e.g. in subjects, titles and abstracts/descriptions, but not in names, for which ORCID is recommended. See the OpenAIRE I+T guidelines for Title https://openaire-guidelines-for-literature-repository-managers.readthedocs.io/en/latest/field_title.html#dci-title and Description https://openaire-guidelines-for-literature-repository-managers.readthedocs.io/en/latest/field_description.html#attribute-lang-o
OpenAIRE also allows multiple values for the language of the content resource: https://openaire-guidelines-for-literature-repository-managers.readthedocs.io/en/v4.0.0/field_language.html. Action: promote this to repositories.
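To show how an aggregator might consume the xml:lang attribute, here is a minimal Python sketch that groups description values by their language tag. It is not OpenAIRE code: the wrapper element `<metadata>`, the function name `values_by_language` and the use of `und` (the BCP 47 subtag for an undetermined language) for untagged values are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DC_NS = "http://purl.org/dc/elements/1.1/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The two descriptions from the example above, in a hypothetical wrapper element.
SAMPLE = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description>Foreword [by] Hazel Anderson; Introduction</dc:description>
  <dc:description xml:lang="en-US">A number of problems in quantum state and system identification are addressed.</dc:description>
</metadata>"""


def values_by_language(xml_text, field="description"):
    """Group the values of one Dublin Core field by their xml:lang tag."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for element in root.iter(f"{{{DC_NS}}}{field}"):
        # Untagged values are filed under "und" rather than guessed.
        lang = element.get(XML_LANG, "und")
        grouped[lang].append((element.text or "").strip())
    return dict(grouped)


for lang, values in values_by_language(SAMPLE).items():
    print(lang, len(values))
```

Keeping untagged values in a separate bucket makes it visible how much of the metadata still lacks language information.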
12. As a researcher, I want to know what research is out there in other languages. This could also be a use case for a patient, etc.
|Translating abstracts and making them available, and offering an option to search by keywords in many languages, could be some of the solutions. Deep learning tools have started to offer this, e.g. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/abs/1810.04805).|
|Examples: BASE offers multilingual search at https://www.base-search.net/Search/Advanced when the search term is included in the Eurovoc Thesaurus or Agrovoc Thesaurus (example: a search for climatology). Wikidata and Abstract Wikipedia provide information independent of language: https://meta.wikimedia.org/wiki/Abstract_Wikipedia.|
13. As a digital preservation librarian or archivist, I need to know how to include natural language information in technical and descriptive metadata so that digital archival documents can be effectively indexed for retrieval and access.
|Documented best practices for the inclusion of natural language information using digital preservation metadata standards such as METS and PREMIS facilitate increased accessibility, inclusion and diversity of digital archives.|
|Examples: Language is required information for effective indexing for retrieval of text (word stemming, stop words), video and audio content (speech-to-text allows for retrieval/indexing, subtitling of audio and video for accessibility). |
Language metadata can be included using Dublin Core’s <dc:language> tag as a part of the Internal Descriptive Metadata (mdWrap) of a METS file.
Language metadata can be included as one of the <significantProperties> of semantic units in PREMIS.
For text documents, language metadata can be included using textMD (https://www.loc.gov/standards/textMD/), most commonly as an extension schema used within the METS administrative metadata section. Language can also be included as part of a standalone textMD document within the PREMIS element <objectCharacteristicsExtension>.
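As an illustration of the first option above, the following Python sketch builds a METS descriptive metadata section that wraps Dublin Core language elements in mdWrap. It is a sketch under assumptions rather than a documented best practice: the function name `build_dmd_sec`, the section ID `DMD1` and the language codes are hypothetical example values.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Register prefixes so the serialized XML uses mets: and dc: instead of ns0:, ns1:.
ET.register_namespace("mets", METS_NS)
ET.register_namespace("dc", DC_NS)


def build_dmd_sec(section_id, languages):
    """Build a METS dmdSec wrapping one dc:language element per language code."""
    dmd_sec = ET.Element(f"{{{METS_NS}}}dmdSec", {"ID": section_id})
    md_wrap = ET.SubElement(dmd_sec, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "DC"})
    xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")
    for code in languages:
        ET.SubElement(xml_data, f"{{{DC_NS}}}language").text = code
    return dmd_sec


# Hypothetical example: a thesis in French that contains chapters in English.
dmd = build_dmd_sec("DMD1", ["fra", "eng"])
print(ET.tostring(dmd, encoding="unicode"))
```

Repeating the dc:language element, rather than joining codes in one value, keeps each language individually addressable for indexing.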
14. As a user, when submitting or browsing content, I want to be able to use an interface in my own language.
|Repository interface is available in different languages.|
|Examples: At the Open University of Catalonia, the repository has three language interfaces for the repository end-user: https://openaccess.uoc.edu/. Each language interface has metadata field names in the same language, e.g. Autor in Catalan and Spanish, Author in English. In all institutional repositories developed by the University of Belgrade Computer Centre, the end-user interface is available in English and Serbian (both Cyrillic and Latin). However, the labels and help in the input form are available only in Serbian because it is not possible to align them with the interface language in DSpace.|
15. As an English-language institution, I use a catalog to describe content in my repository, in English and other languages.
|Content is entered in its native language, but findability might be an issue.|
|Example: At Berkeley Law, a MARC-based system is used for describing content. Since non-English material makes up less than 1-3% of the content, there is no expectation that searching with non-English terms will return any results unless the user is looking for something specific. Subject terms aren’t used in the repository, but using them seems like an easy way to increase accessibility in other languages. The catalog and repository are linked, and search is available in many languages. The catalogers speak many languages and are capable of cataloging in non-English languages, but most cataloging is still done in English and aimed at single-language speakers.|
16. As an institution that supports a lot of translations, I would like to credit translators when depositing translated items in the repository.
|Translators could be credited using taxonomies, e.g. the CRediT taxonomy, which is currently available only in English; it would be good to have official translations into other languages. Two ‘unofficial’ French translations exist: see https://coop-ist.cirad.fr/etre-auteur/reconnaitre-tous-les-contributeurs/3-la-taxonomie-credit-pour-identifier-toutes-les-contributions and https://www.redactionmedicale.fr/2018/03/la-taxonomie-credit-devrait-etre-utilisee-par-les-revues-francaises-pour-decrire-la-contribution-des. Translators are acknowledged in the institutional repository (e.g. as contributors with names and roles), but this is not the case for some other archives, e.g. preprint archives.|
|Example: ULiège repository has a translator metadata field, e.g. see here https://orbi.uliege.be/handle/2268/290642.|
17. As a translator, I would like to know whether a translation exists.
|As a translator, I need to know whether a translation exists in three situations. (1) For a quotation embedded in a source document: the quotation appears in the source language, but I need to check whether a target-language version (original or translated) of the quoted text exists (with a reference in the notes or bibliography of the source document) before deciding whether to translate the quotation myself or reuse the existing translation. (2) To use text about the same topic as my assignment: I may need to build a corpus of similar documents in the source and target languages to use in concordancing software, which allows searching for text strings (words, terms, phrases) in one language and retrieving them in both languages. Through desktop research I may look for a collection of documents with their translations in the target language and then process them in alignment software to obtain aligned files for words/phrases. (3) To build alignments, either a) to feed into CAT (computer-aided translation) systems, or b) to feed into the learning modules of MT (machine translation) systems.|
|Example: In all these cases, recording documents with proper metadata that designates the original/translation status and points to the matching counterpart(s) might help the desktop searches described above, provided the metadata is interoperable with search engines, library catalogs, repositories and CRIS systems. This is also relevant for journal editors, terminologists, text miners and language technologists. To facilitate their work, we need interoperability and interconnections between different systems. |
Translate Science is building such a tool and that is why we need good language metadata from repositories.
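The kind of metadata described above, a record that states its language and points to the original it translates, can be sketched with a small Python data structure. This is not the Translate Science tool or any repository's actual schema: the class `Record`, the field `translation_of`, the function `find_translation` and all identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    identifier: str
    language: str
    # Identifier of the original record if this record is a translation.
    translation_of: Optional[str] = None


def find_translation(records, original_id, target_language):
    """Return the record translating `original_id` into `target_language`, if any."""
    for record in records:
        if record.translation_of == original_id and record.language == target_language:
            return record
    return None


# Hypothetical records: a French original and its English translation.
records = [
    Record("rec-1", "fr"),
    Record("rec-2", "en", translation_of="rec-1"),
]

match = find_translation(records, "rec-1", "en")
print(match.identifier if match else "no translation recorded")
```

With only a language tag and one pointer per record, a translator's question ("does a translation of this document exist in my target language?") becomes a simple lookup.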