Linguistics vs. digital editions: The Tromsø Old Russian and OCS Treebank
Permanent link
https://hdl.handle.net/10037/22366
Date
2015
Type
Journal article
Peer reviewed
Abstract
The Tromsø Old Russian and OCS Treebank (TOROT, nestor.uit.no) is, along with its parent treebank, the PROIEL corpus (foni.uio.no), the only existing treebank of Old Church Slavonic (OCS), Old East Slavic and Middle Russian texts. There are other tagged resources, such as the Old Russian subcorpus of the Russian National Corpus and the Manuskript corpus, but none of them, to our knowledge, currently provide syntactic annotation.
The TOROT presently contains approximately 160,000 word tokens of fully annotated OCS (Codex Marianus and Codex Suprasliensis), 85,000 word tokens of fully annotated Kiev-era Old East Slavic, and 60,000 word tokens of fully annotated 15th–17th-century Middle Russian. In addition, it contains the Codex Zographensis with automatic and partially hand-corrected morphological annotation and lemmatisation (sections of the Gospels missing in the Codex Marianus also have full syntactic annotation), and the PROIEL version of the Greek Gospels, with which the Codex Marianus and the Codex Zographensis are both aligned at token level (automatically, then hand-corrected).