Recommendations on multilingual keywords

The inclusion of keywords in several languages increases the discoverability of repository content. In this context, it is important to distinguish between free-text keywords (or “tags”) and controlled terms derived from a controlled multilingual vocabulary or thesaurus. In the former case, keywords in several languages are provided in the dc:subject field, with the language of each keyword properly encoded. This approach does not ensure consistency, nor does it reveal hierarchical relations among terms. The problem can be mitigated by manually selecting the keywords from controlled vocabularies. However, an optimal solution involves the integration of multilingual controlled vocabularies into the repository.
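As an illustration of the free-text case, the following minimal sketch builds one dc:subject element per keyword and records the language in xml:lang. The function name `subject_elements` and the sample keywords are hypothetical; how a given repository platform stores and exposes these elements is not covered here.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("dc", DC_NS)


def subject_elements(keywords_by_lang):
    """Return one dc:subject element per keyword, each tagged with xml:lang."""
    elements = []
    for lang, keywords in keywords_by_lang.items():
        for keyword in keywords:
            element = ET.Element(f"{{{DC_NS}}}subject")
            element.set(f"{{{XML_NS}}}lang", lang)
            element.text = keyword
            elements.append(element)
    return elements


# Hypothetical free-text keywords for a single record
sample = {"en": ["photography"], "pl": ["fotografia"]}
for element in subject_elements(sample):
    print(ET.tostring(element, encoding="unicode"))
```

Nothing in this sketch links "photography" to "fotografia"; that missing link is the consistency gap that controlled vocabularies address.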


Multilingual vocabularies and thesauri

The use of controlled vocabularies or thesauri (see note 1) for bibliographic metadata ensures that the same concept is described consistently. Along with using controlled terms to indicate resource type, version, or usage rights, controlled vocabularies can be used to describe the subject content of the resource. In multilingual controlled vocabularies, each term ideally has only one equivalent in every language and the relations among terms are the same. In a digital environment, the vocabulary terms are assigned persistent identifiers that can easily be resolved.

However, the use of controlled vocabularies or thesauri involves some challenges:
  • In order to be integrated with repositories, controlled vocabularies must be expressed as machine-readable data.
  • Forced equivalency: it is not always possible to find true equivalents in all languages, so the meaning of terms and the relations between them in one language may not be accurately reflected in their counterparts in other languages.
  • The process of assigning controlled terms may be time-consuming. 
  • Researchers are usually not familiar with the concept of controlled vocabularies. If librarians do not have the required expert knowledge, the terms may be too general and inaccurate.
  • There are many discipline-specific controlled vocabularies and it is not possible to apply all of them in multidisciplinary repositories. On the other hand, general vocabularies may not be able to describe the content accurately.
  • Widely used controlled vocabularies (e.g. Library of Congress Subject Headings, or Getty vocabularies) are not equally inclusive of various cultural contexts and social groups.

    For example, DSpace offers three ways to integrate controlled vocabularies: https://wiki.lyrasis.org/display/DSDOC7x/Authority+Control+of+Metadata+Values

    The DSpace 7 Configurable entities, though not initially designed for this usage, could be another way to implement controlled vocabularies.

    There have been a number of attempts to overcome the limitations of the existing controlled vocabularies. The project TRIPLE developed a new multilingual (nine languages) controlled vocabulary for Social Sciences and Humanities by building upon existing vocabularies.

    The vocabulary RVM Web (https://rvmweb.bibl.ulaval.ca/), maintained by Université Laval and used by libraries across Canada, is an example of a controlled vocabulary seeking to eliminate cultural, historical, and colonial biases:
  • It’s bilingual – in English and French, but not for all terms;
  • Initially (around 1970) it was built by translating Library of Congress Subject Headings (LCSH) and is now an independent product;
  • The English version uses MeSH, AAT (Getty Thesaurus), HOMOsaurus (newly used) and LCSH;
  • It is not automated;
  • The open version, RVM FAST, does not contain AAT, MeSH and HOMOsaurus, only LCSH: https://rvmweb.bibl.ulaval.ca/rvmweb/recherche/init.do?repertoire=rvmfast (there is a plan to make it compliant with Linked Open Data in the short term, in order to include it in DBpedia); example: https://rvm.bibl.ulaval.ca/rvmweb/lod/notice.do?noControle=RVMFAST-000315572&repertoire=RVMFAST
  • Included in WebDewey;
  • Unique identifier for each term (not yet public);
  • Synchronization between the different products (LCSH, RAMEAU, AAT, etc.). This will hopefully be improved with the use of IDs;
  • How to push updates of the terms used in systems?
    Integration of Wikidata into repositories, already implemented in Europeana, may be a widely applicable solution for providing multilingual keywords. Wikidata relies on both crowdsourcing and the existing authority files and it already contains a large number of data items in various languages. The import of terms from various vocabularies is enabled via the tool Mix’n’match.


    Wikidata as keywords

    Wikidata is a free knowledge base with more than 100 million data items. It acts as central storage for general structured data about concepts, including concept labels (translations) in many languages. As a result, the use of Wikidata concepts as a controlled vocabulary of keywords is particularly promising, as it can provide more multilingual interoperability with a lower time investment.

    For example, Depositar – a research data repository based on CKAN – reuses Wikidata as the source of keywords, see more details here.
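To make the idea of language-independent concepts with per-language labels concrete, here is a minimal sketch built around the Wikidata `wbgetentities` API action. It only constructs the request URL and reads labels out of a response; the helper names and the sample response are illustrative assumptions (the sample is hand-written in the shape of a real response, not fetched live), and this is not a description of how Depositar or any other repository implements the feature.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def labels_url(qids, languages):
    """Build a wbgetentities request URL for the labels of the given concepts."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "props": "labels",
        "languages": "|".join(languages),
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def extract_labels(response):
    """Map each concept ID to {language: label} from a wbgetentities response."""
    return {
        qid: {lang: item["value"] for lang, item in entity.get("labels", {}).items()}
        for qid, entity in response.get("entities", {}).items()
    }


# Hand-written sample in the shape of a wbgetentities response (not fetched live)
sample_response = {
    "entities": {
        "Q11030": {
            "id": "Q11030",
            "labels": {
                "en": {"language": "en", "value": "journalism"},
                "pl": {"language": "pl", "value": "dziennikarstwo"},
            },
        }
    }
}

print(labels_url(["Q11633", "Q11030"], ["en", "pl"]))
print(extract_labels(sample_response))
```

A repository would store the concept identifier once and look up labels only for display, so adding a new interface language would not require re-indexing records.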

    Wikidata concepts and other controlled vocabulary terms can be encoded using the JATS (see note 3) <kwd-group> and <kwd> tags, with the addition of the vocab, vocab-identifier and vocab-term-identifier attributes defined in the NISO Standards Tag Suite (STS), https://www.niso-sts.org/ :
  • the name of the controlled vocabulary (“wikidata”) in the vocab attribute (https://www.niso-sts.org/TagLibrary/niso-sts-TL-1-2-html/attribute/vocab.html)
  • the vocabulary identifier (“https://www.wikidata.org/”) in the vocab-identifier attribute (https://www.niso-sts.org/TagLibrary/niso-sts-TL-1-2-html/attribute/vocab-identifier.html)
  • the identifier/URL of each keyword (e.g. “Q11030”) in the vocab-term-identifier attribute (https://www.niso-sts.org/TagLibrary/niso-sts-TL-1-2-html/attribute/vocab-term-identifier.html). For Wikidata, this is the identifier of the concept, not the language-specific label of the concept.

    There is more than one way to do this; the JATS standard groups keywords by language using the <kwd-group> tag. The following example tags the Wikidata concepts photography (Q11633) and journalism (Q11030) with their labels in English (photography, journalism) and Polish (fotografia, dziennikarstwo) using JATS XML:
    <kwd-group xml:lang="en" vocab="wikidata" vocab-identifier="https://www.wikidata.org/">
       <kwd vocab-term-identifier="Q11633">photography</kwd>
       <kwd vocab-term-identifier="Q11030">journalism</kwd>
    </kwd-group>
    <kwd-group xml:lang="pl" vocab="wikidata" vocab-identifier="https://www.wikidata.org/">
       <kwd vocab-term-identifier="Q11633">fotografia</kwd>
       <kwd vocab-term-identifier="Q11030">dziennikarstwo</kwd>
    </kwd-group>

    Current repository technologies may not fully support this kind of tagging.

    Recommendation: add all the attributes described in the example: vocab, vocab-identifier and vocab-term-identifier.
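The tagging described above could be generated rather than typed by hand. The following is a minimal sketch, assuming the concept identifiers and their labels are already available as a dictionary; the function name `kwd_groups` and the data layout are hypothetical, and the output covers only the keyword groups, not a complete JATS document.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def kwd_groups(concepts, languages):
    """Build one <kwd-group> per language from {concept_id: {lang: label}}."""
    groups = []
    for lang in languages:
        group = ET.Element("kwd-group", {
            XML_LANG: lang,
            "vocab": "wikidata",
            "vocab-identifier": "https://www.wikidata.org/",
        })
        for concept_id, labels in concepts.items():
            if lang in labels:  # skip concepts with no label in this language
                kwd = ET.SubElement(group, "kwd", {"vocab-term-identifier": concept_id})
                kwd.text = labels[lang]
        groups.append(group)
    return groups


# The two concepts used in the example above
concepts = {
    "Q11633": {"en": "photography", "pl": "fotografia"},
    "Q11030": {"en": "journalism", "pl": "dziennikarstwo"},
}
for group in kwd_groups(concepts, ["en", "pl"]):
    print(ET.tostring(group, encoding="unicode"))
```

Because the same vocab-term-identifier appears in every language group, a consumer can recognise "photography" and "fotografia" as the same concept without comparing the labels.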

    Recommendations for repository software/platform developers

    • Enable a real-time integration of Wikidata – e.g. when a user starts typing in the appropriate metadata field, relevant Wikidata terms appear in a drop-down list for the user to select.
    • Enable automatic assignment of controlled terms based on the existing metadata.
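The first recommendation, a drop-down list of matching terms, could be backed by the Wikidata `wbsearchentities` API action. The sketch below only builds the request URL and turns a response into (identifier, label, description) tuples; the helper names are hypothetical, the sample response is hand-written with a placeholder description, and a production form would also need debouncing, error handling and caching, none of which is shown.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def search_url(typed_text, language, limit=5):
    """Build a wbsearchentities request URL for what the user has typed so far."""
    params = {
        "action": "wbsearchentities",
        "search": typed_text,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def suggestions(response):
    """Turn a wbsearchentities response into (id, label, description) tuples."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in response.get("search", [])
    ]


# Hand-written sample in the shape of a wbsearchentities response (not fetched live)
sample_response = {
    "search": [
        {"id": "Q11030", "label": "journalism", "description": "placeholder description"},
    ]
}

print(search_url("journ", "en"))
print(suggestions(sample_response))
```

Showing the description next to each label helps depositors choose between concepts that share a label, while the repository stores only the identifier.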

    Automatic indexing of content could make the process of assigning controlled terms more efficient. This approach, which has been tested in individual institutional repositories, is already used by aggregators. For example, Europeana performs automatic metadata enrichment relying on external vocabularies and datasets such as GeoNames and DBpedia and uses the semantic relations and translations offered by these vocabularies. BASE assigns computed Dewey Decimal Classification terms based on available metadata. The same approach is used in the multilingual discovery platform GoTriple, where content harvested from various sources is automatically annotated using controlled terms, which makes it possible to search GoTriple in multiple languages.
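As a toy illustration of assigning controlled terms from existing metadata, the sketch below matches vocabulary labels in any language against a title. This is deliberately simplistic and is not the method used by Europeana, BASE or GoTriple, whose pipelines are not described here; the function name `assign_terms` and the two-term vocabulary are assumptions for the example.

```python
import re


def assign_terms(text, vocabulary):
    """Return IDs of terms with a label in any language that occurs as a whole word in text."""
    normalized = text.casefold()
    matches = []
    for term_id, labels in vocabulary.items():
        for label in labels.values():
            pattern = r"\b" + re.escape(label.casefold()) + r"\b"
            if re.search(pattern, normalized):
                matches.append(term_id)
                break  # one matching label is enough for this term
    return matches


# Hypothetical two-term vocabulary with English and Polish labels
vocabulary = {
    "Q11633": {"en": "photography", "pl": "fotografia"},
    "Q11030": {"en": "journalism", "pl": "dziennikarstwo"},
}

print(assign_terms("Fotografia prasowa w Polsce", vocabulary))  # ['Q11633']
```

Exact matching of this kind misses inflected forms and synonyms, so it should be read only as a sketch of the input and output of such a step: existing metadata goes in, language-independent term identifiers come out.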

    Additional steps forward could include the assignment of controlled terms based on the full text of deposited documents and enabling an automated import of the controlled terms assigned by aggregators. 

    1 A registry of controlled vocabularies: https://bartoc.org/

    2 The first integration of the COAR Resources Type Vocabulary used either value pairs or XML files: http://repositorium.sdum.uminho.pt/handle/1822/46066?mode=full

    3 The Journal Article Tag Suite (JATS) is an XML format used to describe scientific literature published online. It is a technical standard developed by the National Information Standards Organization (NISO) and approved by the American National Standards Institute with the code Z39.96-2012. The NISO project was a continuation of the work done by NLM/NCBI, and popularized by the NLM’s PubMed Central as a de facto standard for archiving and interchange of scientific open-access journals and their contents with XML. With the NISO standardization, the NLM initiative has gained a wider reach, and several other repositories, such as SciELO and Redalyc, have adopted the XML formatting for scientific articles: https://en.wikipedia.org/wiki/Journal_Article_Tag_Suite

    In JATS (Journal Article Tag Suite), any metadata field can be tagged with a language. In the DTD format of the JATS schema, the xml:lang attribute can be applied to almost any element, see: https://jats.nlm.nih.gov/articleauthoring/tag-library/1.2/attribute/xml-lang.html. Examples: PubMed Central translated titles https://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/dobs.html#dob-at-transtitle. Using the JATS schema, the language of keywords is recorded using the xml:lang attribute of the <kwd-group> tag (see: https://jats.nlm.nih.gov/articleauthoring/tag-library/1.2/element/kwd-group.html). JATS groups the keywords by language, with a series of <kwd> tags immediately under each language’s <kwd-group> tag.

    Source: https://comments.coar-repositories.org/recommendations-on-multilingual-keywords/