Appendix 1. Use cases and challenges
Some of the use cases that are driving the recommended practices are as follows:
1. As a non-English institution, I receive documents in English in my repository that I need to describe.
When a new English document is submitted to the repository, it needs to be described with metadata fields in different languages (e.g. abstracts, titles, keywords, document type) and with non-English controlled vocabularies.
Example: Hokkaido University uses the JPCOAR metadata schema. Metadata in different languages is put in the same metadata field but distinguished by the language attribute, e.g. dc.description.abstract and dc.subject: https://eprints.lib.hokudai.ac.jp/dspace/handle/2115/79104?mode=full&submit_simple=Show+full+item+record (a language column on the right side of the page shows the ISO language code of the metadata). When journal articles are deposited, all metadata from the published version is included without translation from the original; in Japanese-language journals, abstracts and keywords are typically also written in English while the full text is in Japanese. Abstracts are included in the metadata with the language attribute embedded, and authors' names are given in the language of the article. There is thus at least a scheme for marking multilingual metadata, but there are concerns about discoverability and about which metadata is most suitable.
2. As a repository manager, I often deal with articles, theses or dissertations that are written in more than one language.
All theses and dissertations are submitted in French, but many contain articles inserted as chapters in the language they were written in.
Example: At ULiège, if a document is available in different languages, each language version is made available as a separate record with metadata in different languages. For example, two records exist for the same document in two languages: https://orbi.uliege.be/handle/2268/170862 and https://orbi.uliege.be/handle/2268/170863. However, there is only one language attribute per record.
3. As an author, I would like to see my articles written in different languages in one record, for statistics and for reporting.
All articles in different languages are deposited in one item and need to be described properly.
Example: At the Open University of Catalonia, there used to be two separate records for articles in different languages. Now, at the request of authors, translations are kept together in one record or even in the same file, which simplifies citation tracking and increases visibility. However, this might cause issues for content aggregators and indexing services.
4. As a repository manager, I want to provide submission fields in different languages.
[THIS MAY BE SPECIFIC TO DSPACE]. When configuring submission forms, the labels and help/instructions for each field can only be written in one language. Multilingualism can only be achieved by typing the label in each language in the same field (Author/Auteur).
5. As a repository manager, I want to have a collection name and description in more than one language.
Currently only one language is allowed for a collection name and description.
Example: [THIS MAY BE SPECIFIC TO DSPACE]. It would be nice if introductory texts (HTML) etc. of communities/collections could be presented in multiple languages. This could quite easily be accomplished by using CSS and named divs. Unfortunately, HTML attributes such as id and style seem to be removed in the HTML output, i.e. <div id="swedish">text</div> is transformed to <div>text</div> in the UI. As collections and communities are items in DSpace (and thus have their own metadata), one way to solve this problem might be to allow language selection at the metadata level, as is already possible for object metadata (e.g. abstracts). A simple and quick workaround for the bilingual issue of collections/communities in DSpace is to use a delimiter, such as the bar |, between the two texts describing these entities and their metadata fields as needed. All that is required is to split the text at viewing time so that only the text in the currently active language is displayed. The Arabic version of the communities/collections list can be seen here: https://repo-nu.maktabat-online.com/community-list. When the interface language is switched to English, using the world icon at the top, all entries appear in English. The same approach has been applied to the facet elements, where controlled values such as names of formats/types, universities/colleges/departments, entities, etc. now appear in multiple languages.
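The delimiter workaround described above can be sketched in a few lines of Python. This is only an illustration of the idea of splitting at viewing time, not the code used by the repository mentioned; the function name `pick_language` and the assumption that values are always stored in a fixed order (Arabic first, then English) are hypothetical.

```python
def pick_language(text, lang, order=("ar", "en"), sep="|"):
    """Return the part of a delimiter-separated label matching `lang`.

    Assumption (hypothetical): every multilingual value stores its
    language versions in the same fixed order, given by `order`.
    """
    parts = [part.strip() for part in text.split(sep)]
    if lang in order:
        index = order.index(lang)
        # Use the requested language only if that slot exists and is non-empty.
        if index < len(parts) and parts[index]:
            return parts[index]
    # Fall back to the first text, e.g. for values stored in one language only.
    return parts[0]


label = "كلية الهندسة | College of Engineering"
print(pick_language(label, "en"))
```

A limitation of this approach is that the delimiter can no longer be used as an ordinary character in names and descriptions, and harvesters that read the raw metadata still receive the combined string.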
6. As a repository manager, I want to be able to manage labels in my language efficiently.
In open source multilingual software (OJS, DSpace, EPrints, etc.), English labels are mandatory when new features are developed. Updates in other languages often lag behind and are managed afterwards by the community or sometimes locally. Translating new software functionality is a challenge.
Examples: The ZORA (Zurich Open Repository and Archive) EPrints repository, https://www.zora.uzh.ch/, has a German version of the interface. CSpace in China includes a metadata schema and interface in different languages, but repository managers still have challenges describing content in repositories. It's usually up to the users to select language tags, and users are trained on how to deposit multilingual content. The interface languages of the repositories developed by the University of Belgrade Computer Centre (Serbia) include English and Serbian (in two alphabets: Cyrillic and Latin), e.g. https://dais.sanu.ac.rs/. As the users were not satisfied with the available translations, the development team devised an in-house web application to facilitate translation: https://trapist.rcub.bg.ac.rs/DESI/. The application allows adding, removing and changing selected labels in individual repositories or in all of them. Changes are propagated to the repositories within 24 hours.
7. As a repository manager, I want to offer metadata translation into English, e.g. for abstracts, titles and subjects.
Some metadata needs to be translated into English using machine translation tools.
Examples: A Google translation API (https://cloud.google.com/translate) is used for translating abstracts, titles and subjects. This could also be achieved by recommending or requiring at least minimum metadata in English in user guidelines. In the Digital Archive of the Serbian Academy of Sciences and Arts, providing at least a brief description and keywords in English is recommended, as this improves content discoverability: https://repowiki.rcub.bg.ac.rs/index.php/DAIS_-_Digital_Archive_of_the_Serbian_Academy_of_Sciences_and_Arts:_Metadata.
8. As a national repository, I need to deposit items in all languages of the country.
Content is available in local languages, but some of these languages don't have a language code or aren't in Unicode, and there are no controlled vocabularies in them.
Example: In Nepal, only titles are added in Nepali and the rest of the metadata is in English. Keywords in Nepali are not standardized consistently and there are no controlled vocabularies. Many local languages aren't in Unicode and sometimes romanized words are used, e.g. किताब, kitaba (romanized) and "a book" (in translated form). This creates issues for Google Scholar indexing, which would like to see metadata in the language of the article.
9. As a repository manager, I would like to expose the language of the metadata in OAI-PMH.
Currently there is no exposure for the language of the metadata in OAI-PMH.
Wish list: Repositories should use metadata language tags consistently and deliberately to ensure that incorrect language information isn't exposed. A language attribute should also be exportable, including via OAI-PMH. Another option could be a proactive approach by repositories: downloading, e.g. on a monthly basis, an extraction of metadata reference sheets and making them openly available to expose the language values.
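As a rough illustration of what an exportable language attribute could look like, the sketch below serializes Dublin Core values with an xml:lang attribute inside an oai_dc container, using only the Python standard library. The function name `build_oai_dc`, the input structure and the sample values are assumptions made for this example; a real repository would generate such records from its own metadata store.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Use readable prefixes instead of auto-generated ns0, ns1, ...
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def build_oai_dc(fields):
    """Serialize (element, text, language) triples as an oai_dc record.

    `language` may be None when the language of a value is unknown or
    not applicable; in that case no xml:lang attribute is written.
    """
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, text, language in fields:
        element = ET.SubElement(root, f"{{{DC}}}{name}")
        element.text = text
        if language:
            element.set(XML_LANG, language)
    return ET.tostring(root, encoding="unicode")


record = build_oai_dc([
    ("title", "Climate change", "en"),
    ("title", "Changement climatique", "fr"),
    ("identifier", "hypothetical-id-001", None),  # made-up identifier
])
print(record)
```

Leaving the attribute out when the language is unknown, rather than writing a default value, is one way to avoid exposing incorrect language information.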
10. As an aggregator and discovery system, I want to know the language of the full-text document I am indexing, so I can assist users in finding content in their preferred language.
There are issues with indexing content at aggregator level (Solr, VuFind, etc.) because there is no way to separate the indexes by language and use language-specific tools to enrich the search experience. The metadata of most regional repositories does not properly separate multilingual information; mixed languages can even be found within a single textual metadata field. Keywords and descriptors are in multiple languages without proper identification, and hundreds of repositories use different vocabularies even in the same language. Some ideas were discussed around implementing automatic classifiers to tag repository metadata with normalized vocabularies for the region.
Examples: LA Referencia is developing a language detection tool (using different Python libraries for natural language processing) to separate languages in textual metadata fields in order to improve metadata at aggregator level. The idea is to add proper xml:lang tags to every textual metadata field. This tagging would be used by the indexing process to generate separate indexes, although dealing with different languages in the search UI remains a complex problem. CORE seems to use a language detection tool. Distinguishing among Bosnian, Croatian, Montenegrin and Serbian is a challenge, as these languages are very similar; as a result, language tags in CORE are usually incorrect for these languages. Using a common tag for the BCMS languages would be a solution to this problem.
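The tagging step described above can be illustrated with a deliberately simple sketch. It does not reproduce the LA Referencia tool: a toy stopword count stands in for a real language detection library, and the word lists, the `min_hits` threshold and the function names are all assumptions made for this example. The point it shows is that a value is left untagged when the evidence is weak or tied, which matters for short keywords and for closely related languages.

```python
import re

# Tiny illustrative stopword lists; a real tool would use an NLP library.
STOPWORDS = {
    "en": {"the", "and", "of", "in", "is", "to", "for"},
    "es": {"el", "la", "los", "de", "y", "en", "para", "que"},
    "pt": {"o", "os", "do", "da", "e", "em", "para", "que", "uma"},
}


def guess_language(text, min_hits=2):
    """Return a language code, or None when the guess is not reliable."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    ranked = sorted(scores.values(), reverse=True)
    # Refuse to guess on weak evidence or on a tie between languages.
    if ranked[0] < min_hits or ranked[0] == ranked[1]:
        return None
    return max(scores, key=scores.get)


def tag_values(values):
    """Pair each metadata value with the xml:lang value to assign (or None)."""
    return [(value, guess_language(value)) for value in values]


for value, lang in tag_values([
    "The impact of climate change in the Andes",
    "El impacto de la deforestación en los bosques",
    "DNA",
]):
    print(lang, "|", value)
```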
11. As an aggregator, I would like to index content correctly and assist users in finding content in their languages.
The OpenAIRE Institutional and Thematic Repository Guidelines (for aggregating repository content) encourage the use of the xml:lang attribute to indicate the language of the metadata. The OpenAIRE aggregator supports the xml:lang attribute.
Example:
<dc:description>Foreword [by] Hazel Anderson; Introduction; The scientific heresy: transformation of a society; Consciousness as causal reality [etc]</dc:description>
<dc:description xml:lang="en-US">A number of problems in quantum state and system identification are addressed.</dc:description>
OpenAIRE supports the xml:lang attribute, and the aggregator conducts metadata checks for language, e.g. in subjects, titles and abstracts/descriptions, but not in names (ORCID is recommended for names). See the OpenAIRE Institutional and Thematic (I+T) guidelines for Title (https://openaire-guidelines-for-literature-repository-managers.readthedocs.io/en/latest/field_title.html#dci-title) and Description (https://openaire-guidelines-for-literature-repository-managers.readthedocs.io/en/latest/field_description.html#attribute-lang-o). OpenAIRE also allows multiple languages in the field that states which language the content resource has (https://openaire-guidelines-for-literature-repository-managers.readthedocs.io/en/v4.0.0/field_language.html). Action: promote this to repositories.
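On the consuming side, an aggregator could read the xml:lang attribute and group values by language before indexing. The following is a minimal sketch using the Python standard library, not OpenAIRE's actual processing; the `<record>` wrapper element and the function name are assumptions for this example, and values without a tag are filed under "und" (the ISO 639-2 code for undetermined).

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
DC = "{http://purl.org/dc/elements/1.1/}"

# The <record> wrapper is hypothetical; the two descriptions come from the example above.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description>Foreword [by] Hazel Anderson; Introduction</dc:description>
  <dc:description xml:lang="en-US">A number of problems in quantum state and system identification are addressed.</dc:description>
</record>"""


def descriptions_by_language(xml_text):
    """Group dc:description values by their xml:lang attribute."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for element in root.iter(f"{DC}description"):
        language = element.get(XML_LANG, "und")  # untagged values stay separate
        grouped.setdefault(language, []).append((element.text or "").strip())
    return grouped


grouped = descriptions_by_language(SAMPLE)
for language, texts in grouped.items():
    print(language, texts)
```

Keeping untagged values in their own group lets an indexer choose whether to run language detection on them or to index them without language-specific processing.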
12. As a researcher, I want to know what research is out there in other languages. Could also be a use case for a patient, etc.
Translating abstracts and making them available, and offering an option to search by keywords in many languages, could be some of the solutions; deep learning tools have started to offer this, e.g. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/abs/1810.04805).
Examples: BASE multilingual search, https://www.base-search.net/Search/Advanced (applies when the search term is included in the Eurovoc Thesaurus or Agrovoc Thesaurus; example search: climatology). Wikidata and Abstract Wikipedia provide information independent of language: https://meta.wikimedia.org/wiki/Abstract_Wikipedia.
13. As a digital preservation librarian or archivist, I need to know how to include natural language information in technical and descriptive metadata so that digital archival documents can be effectively indexed for retrieval and access.
Documented best practices for the inclusion of natural language information using digital preservation metadata standards such as METS and PREMIS facilitate increased accessibility, inclusion and diversity of digital archives.
Examples: Language is required information for effective indexing and retrieval of text (word stemming, stop words) and of video and audio content (speech-to-text allows for retrieval/indexing, and subtitling of audio and video supports accessibility). Language metadata can be included using Dublin Core's <dc:language> tag as a part of the Internal Descriptive Metadata (mdWrap) of a METS file. Language metadata can be included as one of the <significantProperties> of semantic units in PREMIS. For text documents, language metadata can be included using textMD (https://www.loc.gov/standards/textMD/), most commonly as an extension schema used within the METS administrative metadata section. Language can also be included as part of a standalone textMD document within the PREMIS element <objectCharacteristicsExtension>.
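The first option above, <dc:language> wrapped in the descriptive metadata section of a METS file, might look like the following sketch. It builds only a dmdSec fragment, not a complete or validated METS document; the function name, the section ID and the choice of three-letter language codes are assumptions for this example.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DC = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("mets", METS)
ET.register_namespace("dc", DC)


def dmd_section(section_id, title, languages):
    """Build a METS dmdSec that wraps Dublin Core, one dc:language per code."""
    dmd = ET.Element(f"{{{METS}}}dmdSec", {"ID": section_id})
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", {"MDTYPE": "DC"})
    data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    ET.SubElement(data, f"{{{DC}}}title").text = title
    for code in languages:
        # A document written in several languages gets one element per language.
        ET.SubElement(data, f"{{{DC}}}language").text = code
    return dmd


section = dmd_section("dmd1", "Example thesis", ["fra", "eng"])
xml_text = ET.tostring(section, encoding="unicode")
print(xml_text)
```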
14. As a user, when submitting or browsing content, I want to be able to use an interface in my own language.
The repository interface is available in different languages.
Examples: At the Open University of Catalonia, the repository has three interface languages for the end user: https://openaccess.uoc.edu/. Each language interface has metadata field names in the same language, e.g. Autor in Catalan and Spanish, Author in English. In all institutional repositories developed by the University of Belgrade Computer Centre, the end-user interface is available in English and Serbian (both Cyrillic and Latin). However, the labels and help in the input form are available only in Serbian because it is not possible to align them with the interface language in DSpace.
15. As an English-language institution, I use a catalog to describe content in my repository, in English and other languages.
Content is entered in its native language, but findability might be an issue.
Example: At Berkeley Law, a MARC-based system is used for describing content. Since non-English material is less than 1-3% of the content, there is no expectation that searching with non-English terms will return any results unless the user is looking for something specific. Subject terms aren't used in the repository, but using them seems like an easy way to increase accessibility in other languages. The catalog and repository are linked, and search is available in many languages. The catalogers speak many languages and are capable of cataloging in non-English languages, but most cataloging is still done in English and aimed at single-language speakers.
16. As an institution that supports a lot of translations, I would like to credit translators when depositing translated items in the repository.
Translators could be credited using taxonomies, e.g. the CRediT taxonomy, which is currently only available in English; it would be good to have official translations into other languages. Two 'unofficial' French translations exist: see https://coop-ist.cirad.fr/etre-auteur/reconnaitre-tous-les-contributeurs/3-la-taxonomie-credit-pour-identifier-toutes-les-contributions and https://www.redactionmedicale.fr/2018/03/la-taxonomie-credit-devrait-etre-utilisee-par-les-revues-francaises-pour-decrire-la-contribution-des. Translators are acknowledged in the institutional repository (e.g. as contributors with names and roles), but this is not the case for some other archives, e.g. preprint archives.
Example: The ULiège repository has a translator metadata field; see e.g. https://orbi.uliege.be/handle/2268/290642.
17. As a translator, I would like to know whether a translation exists.
As a translator, I need to know whether a translation exists in three situations. (1) For a quotation embedded in a source document: the quotation appears in the same language as the source document, but I need to check whether a target-language version (original or translated) of the quoted text exists (with a reference in the notes or bibliography of the source document) before deciding whether to translate the quotation myself or reuse the existing translated quotation in my translation. (2) To use text about the same topic as the translation I'm assigned: I may need to build a corpus of similar documents in the source and target languages of my assignment and use them in concordancing software, which allows searching for text strings (words, terms, phrases) in one language and retrieving them in two languages. Through desktop research, I may look for a collection of documents with their translations in the target language and then process them in aligning software to obtain aligned files for words/phrases. (3) To build alignments, either a) to feed into CAT (computer-aided translation) systems, or b) to feed into the learning modules of MT (machine translation) systems.
Example: In all those cases, having documents recorded with proper metadata that designates the original/translation status and points to the matching counterpart(s) might help the desktop searches described above, provided the metadata is interoperable with search engines, library catalogs, repositories and CRIS systems. This will also be relevant for journal editors, terminologists, text miners and language technologists. To facilitate their work we need interoperability and interconnections between different systems. Translate Science is building such a tool, and that is why we need good language metadata from repositories.
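To make the idea of metadata that designates original/translation status and points to counterparts more concrete, here is a small data-structure sketch. It is not the Translate Science tool or any existing schema: the `Record` class, the `translation_of` field and the identifiers are hypothetical, and a real system would map these onto whatever relation elements its metadata schema provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    identifier: str
    language: str
    # Identifier of the original work; None means this record is the original.
    translation_of: Optional[str] = None


def find_versions(records, identifier, target_language):
    """Find versions of the same work in `target_language`.

    Works whether `identifier` names the original or one of its translations.
    """
    by_id = {record.identifier: record for record in records}
    start = by_id[identifier]
    original_id = start.translation_of or start.identifier
    return [
        record
        for record in records
        if record.language == target_language
        and (record.identifier == original_id
             or record.translation_of == original_id)
    ]


records = [
    Record("doc-1", "fr"),
    Record("doc-2", "en", translation_of="doc-1"),
    Record("doc-3", "de", translation_of="doc-1"),
]
print(find_versions(records, "doc-3", "en"))
```

With links of this kind, a translator starting from the German version can discover that an English version already exists without knowing which record is the original.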
Recent Comments in this Document
June 30, 2023 at 5:22 am
Sometimes names of Indigenous Australians will include their traditional country and that needs to be captured
June 30, 2023 at 5:20 am
The issue of controlled vocabularies not being comprehensive is especially relevant in relation to First Nations people in Australia.
June 30, 2023 at 5:18 am
Presumably this is an area where AI could actually usefully be employed, though it is not my area of expertise
June 30, 2023 at 5:13 am
We are particularly keen to note the importance of inclusion of Indigenous languages in these discussions. Alongside the use of text in a specific language is also the need to identify the Indigenous country that the work is associated with. There is work ongoing for this in Australia.
June 30, 2023 at 5:02 am
We are concerned that this will never happen at any useful scale – below the very high level. For example, it is already hard enough to get people to change US English to Australian English in a familiar word processing tool such as Word.
June 30, 2023 at 1:26 am
(This is actually a comment on Appendix 3.) There is a correction regarding the URL as follows.
JPCOAR Metadata Schema 2.0 Draft
https://schema.irdb.nii.ac.jp/ja/schema/2.0-draft/14
https://schema.irdb.nii.ac.jp/ja/schema/2.0-draft/1
->
JPCOAR Metadata Schema 2.0
https://schema.irdb.nii.ac.jp/en/schema/2.0/14
https://schema.irdb.nii.ac.jp/en/schema/2.0/1
June 30, 2023 at 1:16 am
I corrected the descriptions about WEKO3.
before: WEKO3 is a cloud-based repository system supported by JPCOAR (Japan Consortium for Open Access Repositories). It is developed based on INVENIO by CERN.
after: WEKO3 is a repository software developed by NII (National Institute of Informatics, Japan) based on INVENIO by CERN. This software operates JAIRO Cloud, a cloud-based repository system, which is supported by JPCOAR (Japan Consortium for Open Access Repositories) and NII.
before: Specifically, ISO-639-3 is acceptable as the language of the text and for a language attribute of other metadata elements, ISO-639-1 is acceptable.
after: Specifically, ISO-639-3 is acceptable as the language of the text, and for a language attribute of other metadata elements, ISO-639-1 is acceptable.
before: With each field, you can add a language tag in the form of a two-character ISO using the dropdown menu.
after: With WEKO3, you can add the language tag in the form of a two-character ISO using the dropdown menu, checkbox, and radio button.
June 30, 2023 at 1:08 am
We recommend that the introduction and recommendations be translated into multiple languages and shared from the COAR website to promote and encourage multilingualism.
June 28, 2023 at 6:04 pm
Besides recognizing the translator, we encourage recognizing all the professionals involved in the editorial process (I added) who are mentioned in the resource; it takes a lot of time, but it is worth it. This is an example:
dc.contributor.assistanttotheeditorinchief: Cruz Salas, Minerva
dc.contributor.businessmanager: Zempoalteca Quintana, Mario
dc.contributor.copyeditorandtranslator: Dashner Monk, Heather
dc.contributor.designer: Pérez Ramírez, Patricia
dc.contributor.editorinchief: Jiménez, Teresa Andreu
dc.contributor.layout: Álvarez Sotelo, María Elena
dc.contributor.salesandcirculationmanager: Creamer Tejeda, Cynthia
dc.contributor.translator: Fernández Hall, María Cristina
In fact, the RECOLECTA evaluation for repositories considers this point: “4.5.- Existe un campo específico para indicar la descripción de la colaboración. En este campo se registra la entidad o persona responsable de coordinar, corregir, comentar o, en general, contribuir de alguna otra manera al desarrollo del recurso” (4.5.- There is a specific field to indicate the description of the collaboration. This field records the entity or person responsible for coordinating, correcting, commenting or, in general, contributing in some other way to the development of the resource).
https://calidadrevistas.fecyt.es/sites/default/files/informes/2021guiaevaluacionrecolecta_vf.pdf
June 28, 2023 at 5:29 pm
ORCID and ISNI help to identify authors or creators nowadays, but the use of authority catalogues should also be included.