Each text in corpus is represented in plain text and with morphological and syntactic annotation (UDPipe, homonymy resolved automatically) + has metainformation - date, theme, authorship, text difficulcy…etc (depending on source)

By now, about 5 billions of words are 77% literary texts (33 literary magazines), 19% of naive poetry, 2% of news (4 popular sites) and 2% of other (popular science, culture mags, social networks, amateur poems and prose), with documentation available.

See also: Omnia Russica corpus - a bigger version of Taiga available! 33 billion words from Taiga, Common Crawl, Wikipedia and Aranea corpus.

Segment information

Taiga corpus is an ambitious project to become the largest fully available webcorpus constructed from open text sources. Taiga corpus is:

With these principles, we believe that a corpus product that meets modern requirements of corpus linguistics can be created - it will not be a black box, it will be reflecting modern language and its features, not biased and capable of encouraging more cooperation between developers and linguists.

This project is a project in the HSE Compling framework

Project creators

Under inspiring supervision of Olga Lyashevskaya

References:

Shavrina T., Shapovalova O. (2017) TO THE METHODOLOGY OF CORPUS CONSTRUCTION FOR MACHINE LEARNING: «TAIGA» SYNTAX TREE CORPUS AND PARSER. in proc. of “CORPORA2017”, international conference , Saint-Petersbourg, 2017.

Support or Contact

Check out our documentation or contact us and we’ll help you sort it out.

We welcome users to ask question on Google Groups!