Each text in the corpus is provided as plain text and with morphological and syntactic annotation (UDPipe, homonymy resolved automatically). Each text also carries metainformation: date, theme, authorship, text difficulty, etc. (depending on the source).
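UDPipe writes its annotation in the CoNLL-U format by default, so an annotated text can be read with a few lines of code. The sketch below is only an illustration under that assumption: whether the corpus files are actually distributed as CoNLL-U should be checked against the documentation, and the sample sentence, the `read_conllu` helper and the field names are hypothetical and not part of the corpus or its tooling.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns of a CoNLL-U token line.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]

# Hypothetical sample for illustration; not taken from the corpus.
SAMPLE = (
    "# sent_id = 1\n"
    "# text = Кошка спит.\n"
    "1\tКошка\tкошка\tNOUN\t_\tCase=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tспит\tспать\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts from CoNLL-U text."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines (sent_id, text, ...) are skipped in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            # Skip lines that do not have exactly ten columns.
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        # Handle input that does not end with a blank line.
        yield sentence


sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
```

Each token dict exposes the lemma, part of speech, morphological features and dependency relation, which is the kind of morphological and syntactic annotation described above.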

At present, the corpus contains about 500 million words: 50% literary texts (33 literary magazines), 25% news (4 popular sites) and 25% other sources (popular science, culture magazines, social networks, amateur poems and prose). Documentation is available.

Segment information

Taiga corpus is an ambitious project that aims to become the largest fully available web corpus constructed from open text sources. Taiga corpus is:

With these principles, we believe we can create a corpus product that meets the modern requirements of corpus linguistics: it will not be a black box, it will reflect modern language and its features, and it will be unbiased and capable of encouraging more cooperation between developers and linguists.

This project is part of the HSE Compling framework.

Project creators

Under the inspiring supervision of Olga Lyashevskaya


Shavrina T., Shapovalova O. (2017) To the Methodology of Corpus Construction for Machine Learning: «Taiga» Syntax Tree Corpus and Parser. In Proc. of “CORPORA2017”, international conference, Saint Petersburg, 2017.

Support or Contact

Check out our documentation or contact us and we’ll help you sort it out.

Users are welcome to ask questions on Google Groups!