Automatic Identification of Shared Arguments in Verbal Coordinations
We describe the automatic conversion of the SynTagRus dependency treebank of Russian to the PROIEL format (with the ultimate purpose of obtaining a single-format diachronic treebank spanning more than a thousand years), focusing on the analysis of shared arguments in verbal coordinations. Whether arguments are shared or private is not marked in the native SynTagRus format, whereas the PROIEL format indicates sharing by means of secondary dependencies. To recover the missing information and insert secondary dependencies into the converted SynTagRus, we create a simple guessing algorithm based on four probabilistic features: how likely a given argument type is to be shared; how likely an argument in a given position is to be shared; how likely a given verb is to have a given argument; and how likely a given verb is to have a given argument frame. Boosted with a few deterministic rules and trained on a small manually annotated sample (346 sentences), the guesser inserts shared subjects very reliably (F-score 0.97), which results in excellent overall performance (F-score 0.92). Non-subject arguments are shared much more rarely, and for them the results are poorer (0.31 for objects; 0.22 for obliques). We show, however, that there are strong reasons to believe that performance can be improved if a larger training sample is used and the guesser sees enough positive examples. Apart from describing a useful practical solution, the paper also provides quantitative data on Russian verbal coordination and offers non-trivial insights into it.
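The four probabilistic features named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the features are estimated or combined, so everything here is an assumption made for illustration: the class name `SharingGuesser`, the smoothed relative-frequency estimates, the combination by simple product, and the decision threshold are all hypothetical, and the deterministic rules mentioned in the abstract are left out.

```python
from collections import Counter


def ratio(hits, total, smoothing=1.0):
    """Smoothed relative frequency of a binary outcome (assumed estimator)."""
    return (hits + smoothing) / (total + 2 * smoothing)


class SharingGuesser:
    """Hypothetical guesser built on the four features listed in the abstract."""

    def __init__(self):
        self.type_total, self.type_shared = Counter(), Counter()
        self.pos_total, self.pos_shared = Counter(), Counter()
        self.verb_total = Counter()
        self.verb_arg = Counter()
        self.verb_frame = Counter()

    def observe_argument(self, arg_type, position, shared):
        """Record one annotated argument of a coordination."""
        self.type_total[arg_type] += 1
        self.pos_total[position] += 1
        if shared:
            self.type_shared[arg_type] += 1
            self.pos_shared[position] += 1

    def observe_verb(self, verb, frame):
        """Record one verb occurrence together with its argument frame."""
        frame = frozenset(frame)
        self.verb_total[verb] += 1
        self.verb_frame[(verb, frame)] += 1
        for arg_type in frame:
            self.verb_arg[(verb, arg_type)] += 1

    def score(self, arg_type, position, verb, frame_if_shared):
        """Multiply the four feature estimates (the product is an assumption)."""
        n_verb = self.verb_total[verb]
        frame = frozenset(frame_if_shared)
        # 1. How likely this argument type is to be shared.
        p_type = ratio(self.type_shared[arg_type], self.type_total[arg_type])
        # 2. How likely an argument in this position is to be shared.
        p_pos = ratio(self.pos_shared[position], self.pos_total[position])
        # 3. How likely this verb is to have this argument.
        p_arg = ratio(self.verb_arg[(verb, arg_type)], n_verb)
        # 4. How likely this verb is to have the resulting argument frame.
        p_frame = ratio(self.verb_frame[(verb, frame)], n_verb)
        return p_type * p_pos * p_arg * p_frame

    def is_shared(self, arg_type, position, verb, frame_if_shared, threshold=0.1):
        """Decide sharing against a hypothetical, untuned threshold."""
        return self.score(arg_type, position, verb, frame_if_shared) >= threshold
```

In such a setup, a candidate argument would receive a secondary dependency to a second conjunct verb only when `is_shared` returns true for that verb. With very few positive examples of shared objects or obliques, the first two estimates stay close to their smoothed floor, which is consistent with the abstract's remark that non-subject performance should improve with more positive training examples.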
Citation: Kompiuternaia lingvistika i intellektual'nye tekhnologii (2015), no. 14 (21), pp. 33-43