Big data in Russian linguistics? Another look at paucal constructions

Nesset, Tore

dc.contributor.author	Nesset, Tore
dc.date.accessioned	2019-08-08T12:11:53Z
dc.date.available	2019-08-08T12:11:53Z
dc.date.issued	2019-05-28
dc.description.abstract	With the advent of large web-based corpora, Russian linguistics steps into the era of “big data”. But how useful are large datasets in our field? What are the advantages? Which problems arise? The present study seeks to shed light on these questions based on an investigation of the Russian paucal construction in the RuTenTen corpus, a web-based corpus with more than ten billion words. The focus is on the choice between adjectives in the nominative (dve/tri/četyre starye knigi) and genitive (dve/tri/četyre staryx knigi) in paucal constructions with the numerals dve, tri or četyre and a feminine noun. Three generalizations emerge. First, the large RuTenTen dataset enables us to identify predictors that could not be explored in smaller corpora. In particular, it is shown that predicates, modifiers, prepositions and word order affect the case of the adjective. Second, we identify situations where the RuTenTen data cannot be straightforwardly reconciled with findings from earlier studies or there appear to be discrepancies between different statistical models. In such cases, further research is called for. The effect of the numeral (dve, tri vs. četyre) and verbal government are relevant examples. Third, it is shown that adjectives in the nominative have more easily learnable predictors that cover larger classes of examples and show clearer preferences for the relevant case. It is therefore suggested that nominative adjectives have the potential to outcompete adjectives in the genitive over time. Although these three generalizations are valuable additions to our knowledge of Russian paucal constructions, three problems arise. Large internet-based corpora like the RuTenTen corpus (a) are not balanced, (b) involve a certain amount of “noise”, and (c) do not provide metadata. As a consequence of this, it is argued, it may be wise to exercise some caution with regard to conclusions based on “big data”.	en_US
dc.description	Source at https://doi.org/10.1515/slaw-2019-0012.	en_US
dc.identifier.citation	Nesset, T. (2019). Big data in Russian linguistics? Another look at paucal constructions. Zeitschrift für Slawistik, 64(2), 157-174. https://doi.org/10.1515/slaw-2019-0012	en_US
dc.identifier.cristinID	FRIDAID 1701141
dc.identifier.doi	https://doi.org/10.1515/slaw-2019-0012
dc.identifier.issn	0044-3506
dc.identifier.issn	2196-7016
dc.identifier.uri	https://hdl.handle.net/10037/15873
dc.language.iso	eng	en_US
dc.publisher	De Gruyter	en_US
dc.relation.journal	Zeitschrift für Slawistik
dc.rights.accessRights	openAccess	en_US
dc.subject	VDP::Humanities: 000::Linguistics: 010::Russian language: 028	en_US
dc.subject	VDP::Humaniora: 000::Språkvitenskapelige fag: 010::Russisk språk: 028	en_US
dc.subject	Big data	en_US
dc.subject	corpus linguistics	en_US
dc.subject	Russian	en_US
dc.subject	numeral	en_US
dc.subject	paucal	en_US
dc.title	Big data in Russian linguistics? Another look at paucal constructions	en_US
dc.type	Journal article	en_US
dc.type	Tidsskriftartikkel	en_US
dc.type	Peer reviewed	en_US

File(s) in this item

Name: article.pdf
Size: 2.358Mb
Format: PDF
Description: Publisher's version

This item appears in the following collection(s)

Artikler, rapporter og annet (språk og kultur) [1477]