
Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway

Permanent link
https://hdl.handle.net/10037/37032
Date
2025-03
Type
Book chapter

Author
Enstad, Tita Ranveig; Trosterud, Trond; Røsok, Marie Iversdatter; Beyer, Yngvil Nesheim; Roald, Marie
Abstract
Optical Character Recognition (OCR) is crucial to the National Library of Norway’s (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN’s collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN’s collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
Description
Source at https://hdl.handle.net/10062/107275.
Publisher
University of Tartu
Citation
Enstad, Trosterud, Røsok, Beyer, Roald: Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway. In: Johansson R, Stymne. Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025. University of Tartu Library
Collections
  • Artikler, rapporter og annet (språk og kultur) [1472]
Copyright 2025 University of Tartu Library
