dc.contributor.author: Enstad, Tita Ranveig
dc.contributor.author: Trosterud, Trond
dc.contributor.author: Røsok, Marie Iversdatter
dc.contributor.author: Beyer, Yngvil Nesheim
dc.contributor.author: Roald, Marie
dc.date.accessioned: 2025-05-07T14:06:10Z
dc.date.available: 2025-05-07T14:06:10Z
dc.date.issued: 2025-03
dc.description.abstract: Optical Character Recognition (OCR) is crucial to the National Library of Norway’s (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN’s collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN’s collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data. [en_US]
dc.description: Source at https://hdl.handle.net/10062/107275. [en_US]
dc.identifier.citation: Enstad, Trosterud, Røsok, Beyer, Roald: Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway. In: Johansson R, Stymne. Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025. University of Tartu Library [en_US]
dc.identifier.cristinID: FRIDAID 2377582
dc.identifier.isbn: 978-9908-53-109-0
dc.identifier.issn: 1736-8197
dc.identifier.issn: 1736-6305
dc.identifier.uri: https://hdl.handle.net/10037/37032
dc.language.iso: eng [en_US]
dc.publisher: University of Tartu [en_US]
dc.rights.accessRights: openAccess [en_US]
dc.rights.holder: Copyright 2025 University of Tartu Library [en_US]
dc.title: Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway [en_US]
dc.type: Chapter [en_US]
dc.type: Book chapter [en_US]

