Taiga is a corpus in which text sources and their meta-information are collected with popular machine learning tasks in mind.

Each text in the corpus is available as plain text and with morphological and syntactic annotation (produced with UDPipe, homonymy resolved automatically). Each text also carries meta-information such as date, theme, authorship and text difficulty, depending on the source.
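As a minimal sketch of how such annotation can be read, the snippet below parses CoNLL-U, the tab-separated format that UDPipe writes. It assumes the annotated Taiga files follow the standard ten-column CoNLL-U layout; the helper name parse_conllu and the sample sentence are hypothetical and are not taken from the corpus.

```python
from typing import Dict, List

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample sentence, written by hand for illustration only.
SAMPLE = (
    "# text = Кот спит.\n"
    "1\tКот\tкот\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tспит\tспать\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

Each token comes back as a plain dictionary, so lemmas, part-of-speech tags and dependency relations can be inspected without any third-party library.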

The corpus currently contains about 5 billion words: 77% literary texts (33 literary magazines), 19% naive poetry, 2% news (4 popular sites) and 2% other sources (popular science, culture magazines, social networks, amateur poems and prose). Documentation is available.
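For a rough sense of scale, the shares above can be turned into approximate word counts. This is only a back-of-the-envelope sketch: the total is quoted as "about 5 billion", so the results are estimates, and the names TOTAL_WORDS, SHARES_PERCENT and estimate_word_counts are hypothetical.

```python
from typing import Dict

# "About 5 billion" words, so every derived figure is approximate.
TOTAL_WORDS = 5_000_000_000

# Shares of the corpus as quoted in the description, in percent.
SHARES_PERCENT = {
    "literary texts": 77,
    "naive poetry": 19,
    "news": 2,
    "other": 2,
}


def estimate_word_counts(total: int, shares: Dict[str, int]) -> Dict[str, int]:
    """Convert percentage shares into approximate absolute word counts."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must add up to 100 percent")
    return {name: total * pct // 100 for name, pct in shares.items()}


estimates = estimate_word_counts(TOTAL_WORDS, SHARES_PERCENT)
for name, count in estimates.items():
    print(f"{name}: ~{count / 1e9:.2f} billion words")
```

Under these figures the literary segment comes to roughly 3.85 billion words and naive poetry to roughly 0.95 billion, with news and the remaining sources at about 0.1 billion each.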

See also: the Omnia Russica corpus, a bigger version of Taiga, is available. It contains 33 billion words from Taiga, Common Crawl, Wikipedia and the Aranea corpus.

Segment information

We gathered the resources with popular NLP problems in mind:

thematic modelling - news with theme tags, and all sites that provide rubrics (news, poems, prose)
readability of texts - the popular science magazine NPlus1 has a readability metric for each text, provided by the editor
NER and fact extraction - news with references to the mentioned person’s page or wiki information, news with personalia tags
keyword extraction - news with keyword tags, hashtags on social media
authorship attribution - all texts with author information: magazines, news and, more importantly, social media with gender, age, city, time and education mark-up
chat-bot training - open-source film subtitles
text generation - any resource, depending on genre
rare word studies, frequency dictionaries - literary magazines, social media
morphological and syntactic parsers - any resource, depending on genre
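
The task-to-resource pairing in the list above can be written down as a small lookup table, which is handy when deciding which segments to download for an experiment. This is only a sketch: the segment labels paraphrase the list, and the names TASK_SEGMENTS and segments_for are hypothetical, not part of any Taiga tooling.

```python
from typing import Dict, List

# Task -> resource types, paraphrased from the list above.
TASK_SEGMENTS: Dict[str, List[str]] = {
    "thematic modelling": ["news with theme tags", "sites with rubrics"],
    "readability": ["NPlus1 popular science magazine"],
    "NER and fact extraction": ["news with person references",
                                "news with personalia tags"],
    "keyword extraction": ["news with keyword tags", "social media hashtags"],
    "authorship attribution": ["magazines", "news", "social media"],
    "chat-bot training": ["film subtitles"],
    "rare words and frequency dictionaries": ["literary magazines",
                                              "social media"],
}

# Tasks for which the list says any resource fits, depending on genre.
ANY_RESOURCE_TASKS = {"text generation", "morphological and syntactic parsers"}


def segments_for(task: str) -> List[str]:
    """Return the resource types suggested for a task."""
    if task in ANY_RESOURCE_TASKS:
        return ["any resource, depending on genre"]
    if task not in TASK_SEGMENTS:
        raise KeyError(f"unknown task: {task}")
    return TASK_SEGMENTS[task]


print(segments_for("chat-bot training"))
print(segments_for("text generation"))
```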

The Taiga corpus is an ambitious project that aims to become the largest fully available web corpus constructed from open text sources. The Taiga corpus is:

open source, CC BY-SA 3.0
big - about 5 billion words by now
sorted into datasets applicable to different machine learning tasks
made by linguists experienced in text crawling, parsing and filtering
rich with metainformation
POS-tagged and syntactically tagged in Universal Dependencies

With these principles, we believe it is possible to create a corpus that meets the modern requirements of corpus linguistics: it will not be a black box, it will reflect modern language and its features, it will be unbiased, and it will encourage more cooperation between developers and linguists.

This project is part of the HSE Compling framework.

Project creators

Tatiana Shavrina (rybolos@gmail.com)
Yana Kurmachova (yana.kurmacheva@gmail.com)

Under the inspiring supervision of Olga Lyashevskaya

References:

Shavrina T., Shapovalova O. (2017). To the methodology of corpus construction for machine learning: «Taiga» syntax tree corpus and parser. In Proc. of the international conference “CORPORA2017”, Saint Petersburg, 2017.

Support or Contact

Check out our documentation or contact us and we’ll help you sort it out.

Users are welcome to ask questions on Google Groups!

Taiga Corpus

An open-source corpus for machine learning.