Including keywords in several languages increases the discoverability of repository content. In this context, it is important to distinguish between free-text keywords (or “tags”) and controlled terms drawn from a controlled multilingual vocabulary or thesaurus. In the former case, keywords in several languages are provided in the dc:subject field, with the language of each keyword properly encoded. This approach does not ensure consistency, nor does it reveal hierarchical relations among terms. The problem can be mitigated by manually selecting the keywords from controlled vocabularies. However, an optimal solution involves integrating multilingual controlled vocabularies into the repository.
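As a minimal sketch of the free-text case, the following Python snippet builds one dc:subject element per keyword and records the language in an xml:lang attribute. The function name subject_elements and the sample keywords are illustrative assumptions, not part of any repository platform.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for Dublin Core elements and for the built-in xml: prefix.
DC_NS = "http://purl.org/dc/elements/1.1/"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("dc", DC_NS)


def subject_elements(keywords):
    """Return one dc:subject element per (language, keyword) pair.

    Hypothetical helper: `keywords` is a list of (language code, text) tuples.
    """
    elements = []
    for lang, keyword in keywords:
        element = ET.Element(f"{{{DC_NS}}}subject")
        element.set(f"{{{XML_NS}}}lang", lang)  # serialised as xml:lang
        element.text = keyword
        elements.append(element)
    return elements


# Illustrative free-text keywords in English and Polish.
free_text_keywords = [("en", "photography"), ("pl", "fotografia")]
for element in subject_elements(free_text_keywords):
    print(ET.tostring(element, encoding="unicode"))
```

Nothing in this encoding links "photography" to "fotografia": the two values are independent strings, which is why free-text keywords alone cannot guarantee consistency across languages.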
Multilingual vocabularies and thesauri
The use of controlled vocabularies or thesauri1 for bibliographic metadata ensures that the same concept is described consistently. Along with using controlled terms to indicate resource type, version, or usage rights, controlled vocabularies can be used to describe the subject content of the resource. In multilingual controlled vocabularies, each term ideally has only one equivalent in every language and the relations among terms are the same. In a digital environment, the vocabulary terms are assigned persistent identifiers that can easily be resolved.
However, the use of controlled vocabularies or thesauri involves some challenges.
For example, DSpace offers three ways to integrate controlled vocabularies (see https://wiki.lyrasis.org/display/DSDOC7x/Authority+Control+of+Metadata+Values):
- Value pairs in a controlled list form
- XML file containing the terms (e.g. to support the integration of Dewey Decimal Classification or the Thesaurus of Greek terms in repositories)2
- Solr authority (used for the ORCID integration before DSpace 7: https://wiki.lyrasis.org/display/DSDOC7x/ORCID+Authority)
There have been a number of attempts to overcome the limitations of existing controlled vocabularies. The TRIPLE project developed a new multilingual controlled vocabulary (in nine languages) for the Social Sciences and Humanities by building upon existing vocabularies.
The vocabulary RVM Web (https://rvmweb.bibl.ulaval.ca/), maintained by Université Laval and used by libraries across Canada, is an example of a controlled vocabulary seeking to eliminate cultural, historical, and colonial biases.
Integration of Wikidata into repositories, already implemented in Europeana, may be a widely applicable solution for providing multilingual keywords. Wikidata relies on both crowdsourcing and existing authority files, and it already contains a large number of data items in various languages. Terms from various vocabularies can be imported via the Mix’n’match tool.
Wikidata as keywords
Wikidata is a free knowledge base with more than 100 million data items. It acts as central storage for general structured data about concepts, including concept labels and their translations in many languages. As a result, using Wikidata concepts as a controlled vocabulary of keywords is particularly promising, as it can provide more multilingual interoperability with a lower time investment.
For example, Depositar, a research data repository based on CKAN, reuses Wikidata as the source of keywords.
Wikidata concepts and other controlled vocabulary terms can be encoded using the JATS3 <kwd-group> and <kwd> tags, with the addition of the vocab, vocab-identifier and vocab-term-identifier attributes defined in the NISO Standards Tag Suite (STS): https://www.niso-sts.org/
There is more than one way to do this; the JATS standard groups keywords by language using the <kwd-group> tag. The following example tags the Wikidata concepts photography (Q11633) and journalism (Q11030) with their labels in English (photography, journalism) and Polish (fotografia, dziennikarstwo) using JATS XML:
<kwd-group xml:lang="en" vocab="wikidata" vocab-identifier="https://www.wikidata.org/">
  <kwd vocab-term-identifier="Q11633">photography</kwd>
  <kwd vocab-term-identifier="Q11030">journalism</kwd>
</kwd-group>
<kwd-group xml:lang="pl" vocab="wikidata" vocab-identifier="https://www.wikidata.org/">
  <kwd vocab-term-identifier="Q11633">fotografia</kwd>
  <kwd vocab-term-identifier="Q11030">dziennikarstwo</kwd>
</kwd-group>
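To illustrate why the shared vocab-term-identifier matters, the sketch below reads JATS keyword groups like those in the example and merges the labels of each concept across languages. The wrapping <article-meta> element, the constant names and the function labels_by_concept are assumptions made for this example; a harvester or aggregator would work on complete JATS records.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The keyword groups from the example, wrapped in a single parent element
# so that the fragment parses as one XML document (an assumption for this sketch).
JATS_FRAGMENT = """
<article-meta>
  <kwd-group xml:lang="en" vocab="wikidata" vocab-identifier="https://www.wikidata.org/">
    <kwd vocab-term-identifier="Q11633">photography</kwd>
    <kwd vocab-term-identifier="Q11030">journalism</kwd>
  </kwd-group>
  <kwd-group xml:lang="pl" vocab="wikidata" vocab-identifier="https://www.wikidata.org/">
    <kwd vocab-term-identifier="Q11633">fotografia</kwd>
    <kwd vocab-term-identifier="Q11030">dziennikarstwo</kwd>
  </kwd-group>
</article-meta>
"""


def labels_by_concept(xml_text):
    """Map each vocab-term-identifier to its labels, keyed by language."""
    root = ET.fromstring(xml_text)
    concepts = defaultdict(dict)
    for group in root.iter("kwd-group"):
        lang = group.get(XML_LANG)
        for kwd in group.findall("kwd"):
            term_id = kwd.get("vocab-term-identifier")
            if term_id and lang:
                concepts[term_id][lang] = (kwd.text or "").strip()
    return dict(concepts)


for term_id, labels in labels_by_concept(JATS_FRAGMENT).items():
    print(term_id, labels)
```

Because both language groups carry the same identifiers, a consumer can tell that "photography" and "fotografia" denote one concept without comparing the strings themselves.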
Current repository technologies may not fully support this encoding.
Recommendation: add all the attributes used in the example (vocab, vocab-identifier and vocab-term-identifier).
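One possible way to follow this recommendation when exporting metadata is sketched below: starting from concept identifiers and their labels, it emits one <kwd-group> per language and always sets vocab, vocab-identifier and vocab-term-identifier. The function build_kwd_groups and the shape of its input dictionary are hypothetical and not taken from any repository software.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_kwd_groups(concepts, vocab="wikidata",
                     vocab_identifier="https://www.wikidata.org/"):
    """Build one kwd-group element per language.

    `concepts` maps a term identifier to a {language: label} dictionary
    (an input shape assumed for this sketch).
    """
    languages = sorted({lang for labels in concepts.values() for lang in labels})
    groups = []
    for lang in languages:
        group = ET.Element("kwd-group", {
            XML_LANG: lang,
            "vocab": vocab,
            "vocab-identifier": vocab_identifier,
        })
        for term_id, labels in concepts.items():
            if lang in labels:  # skip concepts that lack a label in this language
                kwd = ET.SubElement(group, "kwd",
                                    {"vocab-term-identifier": term_id})
                kwd.text = labels[lang]
        groups.append(group)
    return groups


concepts = {
    "Q11633": {"en": "photography", "pl": "fotografia"},
    "Q11030": {"en": "journalism", "pl": "dziennikarstwo"},
}
for group in build_kwd_groups(concepts):
    print(ET.tostring(group, encoding="unicode"))
```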
Recommendations for repository software/platforms developers
- Enable real-time integration of Wikidata, e.g. when a user starts typing in the appropriate metadata field, relevant Wikidata terms appear in a drop-down list for the user to select.
- Enable automatic assignment of controlled terms based on the existing metadata.
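The first recommendation could be prototyped against the public Wikidata API, whose wbsearchentities action returns entities matching a typed prefix. The sketch below is one possible shape for such a lookup, not an existing repository feature: the function names, the User-Agent string and the structure of the returned suggestions are assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(prefix, language="en", limit=5):
    """Build a wbsearchentities request URL for the text typed so far."""
    params = {
        "action": "wbsearchentities",
        "search": prefix,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def to_suggestions(response):
    """Reduce an API response to the fields a drop-down list would need."""
    return [
        {
            "id": hit["id"],
            "label": hit.get("label", ""),
            "description": hit.get("description", ""),
        }
        for hit in response.get("search", [])
    ]


def suggest(prefix, language="en", limit=5):
    """Query Wikidata over the network and return suggestions."""
    request = Request(
        build_search_url(prefix, language, limit),
        # Hypothetical client name; replace with one identifying your service.
        headers={"User-Agent": "repository-keyword-sketch/0.1"},
    )
    with urlopen(request, timeout=10) as reply:
        return to_suggestions(json.load(reply))
```

A deposit form could call suggest("foto", language="pl") as the user types and store the selected identifier together with the label, so that labels in other languages can be retrieved later from the same concept.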
Automatic indexing of content could make the process of assigning controlled terms more efficient. This approach, which has been tested in individual institutional repositories, is already used by aggregators. For example, Europeana performs automatic metadata enrichment relying on external vocabularies and datasets such as GeoNames and DBpedia, and uses the semantic relations and translations offered by these vocabularies. BASE assigns computed Dewey Decimal Classification terms based on available metadata. The same approach is used in the multilingual discovery platform GoTriple, where content harvested from various sources is automatically annotated with controlled terms, which makes it possible to search GoTriple in multiple languages.
Additional steps forward could include the assignment of controlled terms based on the full text of deposited documents and enabling an automated import of the controlled terms assigned by aggregators.
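The idea of assigning controlled terms from existing metadata can be illustrated with a deliberately naive sketch: it looks for vocabulary labels, in any language, as whole words or phrases in a title or abstract. This is plain string matching chosen for brevity, not the method used by Europeana, BASE or GoTriple; the function assign_terms and the sample vocabulary are assumptions.

```python
import re


def assign_terms(text, vocabulary):
    """Return the sorted ids of terms whose label occurs in `text`.

    `vocabulary` maps a term identifier to a {language: label} dictionary.
    Matching is case-insensitive and restricted to whole words or phrases.
    """
    normalized = text.casefold()
    matches = set()
    for term_id, labels in vocabulary.items():
        for label in labels.values():
            pattern = r"(?<!\w)" + re.escape(label.casefold()) + r"(?!\w)"
            if re.search(pattern, normalized):
                matches.add(term_id)
                break  # one matching label is enough for this term
    return sorted(matches)


# Illustrative vocabulary built from the two Wikidata concepts used earlier.
vocabulary = {
    "Q11633": {"en": "photography", "pl": "fotografia"},
    "Q11030": {"en": "journalism", "pl": "dziennikarstwo"},
}
print(assign_terms("Fotografia prasowa w Polsce", vocabulary))
```

Exact matching misses inflected forms (for instance other grammatical cases of a Polish noun) and cannot disambiguate homonyms, so a production workflow would need lemmatisation or a trained classifier and, ideally, human review of the suggested terms.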
1 A registry of controlled vocabularies: https://bartoc.org/
2 The first integration of the COAR Resources Type Vocabulary used either value pairs or XML files: http://repositorium.sdum.uminho.pt/handle/1822/46066?mode=full
3 The Journal Article Tag Suite (JATS) is an XML format used to describe scientific literature published online. It is a technical standard developed by the National Information Standards Organization (NISO) and approved by the American National Standards Institute with the code Z39.96-2012. The NISO project continued the work done by NLM/NCBI, which was popularized by the NLM’s PubMed Central as a de facto standard for archiving and interchanging scientific open-access journals and their contents in XML. With the NISO standardization, the NLM initiative has gained a wider reach, and several other repositories, such as SciELO and Redalyc, have adopted the XML format for scientific articles: https://en.wikipedia.org/wiki/Journal_Article_Tag_Suite
In JATS (Journal Article Tag Suite), any metadata field can be tagged with a language. In the DTD version of the JATS schema, the xml:lang attribute can be applied to almost any element; see https://jats.nlm.nih.gov/articleauthoring/tag-library/1.2/attribute/xml-lang.html. For an example, see the PubMed Central guidelines on translated titles: https://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/dobs.html#dob-at-transtitle. The language of keywords is recorded in the xml:lang attribute of the <kwd-group> tag (see https://jats.nlm.nih.gov/articleauthoring/tag-library/1.2/element/kwd-group.html). JATS groups keywords by language, with a series of <kwd> tags immediately under each language’s <kwd-group> tag.