
Appendix 6: Fixing missing document language in EPrints repository records

REAL is an EPrints repository, commissioned in 2008, which currently holds more than 220,000 items in eight collections. The content is diverse: partly current research articles uploaded by researchers, partly material digitised by the parent institution, the Library and Information Centre of the Hungarian Academy of Sciences. The current software version is 3.3.15.

The language field for documents, though present, was until now not visible in the web document upload forms or in any view of an item, so depositors and librarians could neither set it nor check its content. In an item's XML representation the field appears as the <language> element of each document:

<documents>
  <document id="http://real.mtak.hu/id/document/xxxxx">
    <files>
      <file id="http://real.mtak.hu/id/file/yyyyy">
        <filename>zzzzz.pdf</filename>
      </file>
    </files>
    <eprintid>wwwwwwww</eprintid>
    <format>text</format>
    <language>hu</language>
    <security>public</security>
  </document>
</documents>
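To audit the stored values, a fragment like the one above can be read with a few lines of Python. The following is only a minimal sketch: the function names and the `SAMPLE` string are illustrative, and it assumes the element layout shown above with no XML namespace (a real export may declare one, in which case the tag names would need to be qualified).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative sample mirroring the fragment above (placeholder IDs kept as-is).
SAMPLE = """<documents>
  <document id="http://real.mtak.hu/id/document/xxxxx">
    <files>
      <file id="http://real.mtak.hu/id/file/yyyyy">
        <filename>zzzzz.pdf</filename>
      </file>
    </files>
    <eprintid>wwwwwwww</eprintid>
    <format>text</format>
    <language>hu</language>
    <security>public</security>
  </document>
</documents>"""


def document_languages(xml_text):
    """Return a list of (document id, language code or None) pairs."""
    root = ET.fromstring(xml_text)
    return [
        (doc.get("id"), doc.findtext("language"))
        for doc in root.iter("document")
    ]


def language_counts(xml_text):
    """Count how often each language code occurs among the documents."""
    return Counter(lang for _, lang in document_languages(xml_text))


if __name__ == "__main__":
    for doc_id, lang in document_languages(SAMPLE):
        print(doc_id, lang)
```

Running `language_counts` over a full export would give a quick picture of how the stored codes are distributed before any correction is attempted.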

We have recently exposed the field and found that EPrints had set its content from the language setting of the browser used at deposit; that is, the stored values are more or less random with respect to the document itself. To find out (and set) the correct values for hundreds of thousands of items, we produced a list of item IDs to check, downloaded the metadata in DC format, extracted the title, and tried to guess the language of the document from the language of the title.

Our script started with a hypothesis (the first being that the document is in Hungarian): the title words were fed to a spellchecker, and if more than half of the words were recognised, we accepted the hypothesis as true. In the next run the remaining items were checked against the “language is English” hypothesis, and then further languages were tested.
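The cascade of hypotheses can be sketched in Python as follows. This is not the script we ran, only an illustration of the logic: the function names, the toy word lists and the way a title is split into words are all assumptions, and the spellchecker is replaced by a plain predicate so that the example runs without any dictionaries installed. The original procedure tested one language per run over the remaining items; testing the languages in the same order for each title gives the same result.

```python
import re


def recognised_fraction(title, is_known_word):
    """Fraction of the words in `title` accepted by `is_known_word`."""
    # Assumption: a "word" is a run of letters; digits and punctuation are ignored.
    words = re.findall(r"[^\W\d_]+", title)
    if not words:
        return 0.0
    known = sum(1 for word in words if is_known_word(word))
    return known / len(words)


def guess_language(title, checkers, threshold=0.5):
    """Return the first language whose checker recognises more than
    `threshold` of the title words, or None if no hypothesis holds.

    `checkers` is an ordered list of (language code, predicate) pairs.
    """
    for code, is_known_word in checkers:
        if recognised_fraction(title, is_known_word) > threshold:
            return code
    return None


# Toy word lists standing in for real spellchecker dictionaries.
HU_WORDS = {"a", "magyar", "nyelv", "története"}
EN_WORDS = {"the", "history", "of", "language"}

CHECKERS = [
    ("hu", lambda word: word.lower() in HU_WORDS),
    ("en", lambda word: word.lower() in EN_WORDS),
]

if __name__ == "__main__":
    print(guess_language("A magyar nyelv története", CHECKERS))
    print(guess_language("The history of the language", CHECKERS))
```

Titles for which no hypothesis passes the threshold return `None` and would be left for manual checking.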

The C-shell script excerpt below tests the title against the “language is Italian” hypothesis, using the spell checker hunspell.

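# den: number of title words hunspell does NOT recognise (-l lists the misspelled words)
@ den = `grep ^title: $3-eprint-$item.txt | tr -d '{}[]' | awk -F':' '{print $2,$3}' | awk -F'=' '{print $1}' | hunspell -d it_IT -l | wc -l`

# enu: total number of words in the title
@ enu = `grep ^title: $3-eprint-$item.txt | tr -d '{}[]' | awk -F':' '{print $2,$3}' | awk -F'=' '{print $1}' | wc -w`

# discr: ratio of unrecognised to total words, rounded to 0 or 1 by dm
@ discr = `echo $den $enu | ~/unixstat/stat/bin/dm "floor (x1/x2+0.49)"`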
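The arithmetic of the last line may be easier to follow in a short Python sketch. Since `hunspell -l` prints the words it does not recognise, `den` counts the unrecognised words and `enu` all the words, so `discr` is 0 when at most about half of the words are unrecognised and 1 otherwise. The excerpt does not show how `discr` is used afterwards; reading 0 as "hypothesis accepted" is an assumption, as are the function names and the handling of empty titles below.

```python
import math
import subprocess


def count_unrecognised(title, dictionary="it_IT"):
    """Count the words of `title` that hunspell lists as misspelled.

    Requires hunspell and the named dictionary to be installed.
    """
    result = subprocess.run(
        ["hunspell", "-d", dictionary, "-l"],
        input=title,
        capture_output=True,
        text=True,
        check=True,
    )
    return sum(1 for line in result.stdout.splitlines() if line.strip())


def discriminant(unrecognised, total):
    """Mirror floor(x1/x2 + 0.49): 0 if the hypothesis passes, else 1."""
    if total == 0:
        # Assumption: a title without words cannot support any hypothesis.
        return 1
    return math.floor(unrecognised / total + 0.49)


if __name__ == "__main__":
    # One unrecognised word out of three: 1/3 + 0.49 is below 1.
    print(discriminant(1, 3))
    # Two unrecognised words out of three: 2/3 + 0.49 is above 1.
    print(discriminant(2, 3))
```

Note that a title with exactly half of its words unrecognised still yields 0 here (0.5 + 0.49 is below 1), so the cut-off of the excerpt is marginally more lenient than "more than half recognised".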

Experience with this method shows that, with some filtering, the error rate can be reduced to 1-2%, which is much better than the present error rate of 40-50%. Complicated, multilingual or highly technical (e.g. mathematics) documents remain a challenge, and we do not yet know how to label bilingual or multilingual documents.



Source: https://comments.coar-repositories.org/appendix-6-fixing-missing-document-language-in-eprints-repository-records/