Taiga is a corpus in which text sources and their meta-information are collected with popular ML tasks in mind.
Each text in the corpus is available as plain text and with morphological and syntactic annotation (UDPipe, homonymy resolved automatically), and has metainformation: date, theme, authorship, text difficulty, etc. (depending on the source).
At present, the corpus holds about 5 billion words: 77% literary texts (33 literary magazines), 19% naive poetry, 2% news (4 popular sites) and 2% other sources (popular science, culture magazines, social networks, amateur poems and prose). Documentation is available.
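To give a feel for the absolute sizes behind these percentages, here is a rough back-of-the-envelope sketch. It assumes a total of exactly 5 billion words, while the figure above is only approximate, so the results are estimates and not official per-segment counts. The names `TOTAL_WORDS`, `SHARES` and `approx_words` are illustrative.

```python
# Assumption: the total is taken as exactly 5 billion words ("about 5 billion" in the text).
TOTAL_WORDS = 5_000_000_000

# Segment shares in percent, as stated in the description above.
SHARES = {
    "literary texts": 77,
    "naive poetry": 19,
    "news": 2,
    "other": 2,
}

# Sanity check: the stated shares should cover the whole corpus.
assert sum(SHARES.values()) == 100

# Integer arithmetic keeps the estimates free of floating-point noise.
approx_words = {name: TOTAL_WORDS * pct // 100 for name, pct in SHARES.items()}

for name, words in approx_words.items():
    print(f"{name}: ~{words / 1e9:.2f} billion words")
```

Under this assumption, even the smallest segments (news and other sources) would each hold roughly 100 million words.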
See also: the Omnia Russica corpus, a bigger version of Taiga, is available! It contains 33 billion words from Taiga, Common Crawl, Wikipedia and the Aranea corpus.
Segment information
We have gathered the resources with respect to popular NLP problems:
- topic modelling - news with theme tags, and all sites that provide rubrics (news, poems, prose)
- readability of texts - the popular science magazine NPlus1 has a readability metric for each text, provided by the editor
- NER and fact extraction - news with links to the mentioned person’s page or wiki information, news with personalia tags
- keyword extraction - news with keyword tags, hashtags on social media
- authorship attribution - all texts with author information: magazines, news and, more importantly, social media, with gender, age, city, time and education mark-up
- chat-bot training - open-source film subtitles
- text generation - any resource, depending on the genre
- study of rare words, frequency dictionaries - literary magazines, social media
- morphological and syntactic parsers - any resource, depending on the genre
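As an illustration of the frequency-dictionary use case in the list above, here is a minimal sketch that counts lemmas in annotated text. It assumes the annotation is stored in CoNLL-U, the usual UDPipe output format with 10 tab-separated columns per token; the actual file layout of Taiga should be checked in the documentation. The sample sentence and the function name `lemma_frequencies` are hypothetical and not taken from the corpus.

```python
from collections import Counter

# Hypothetical sample, hand-written in CoNLL-U format (not taken from Taiga).
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
SAMPLE = (
    "# text = Мама мыла раму.\n"
    "1\tМама\tмама\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tмыла\tмыть\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tраму\tрама\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def lemma_frequencies(conllu_text):
    """Count lemmas in CoNLL-U text, skipping comments and punctuation."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Blank lines separate sentences; '#' lines carry sentence metadata.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id, lemma, upos = cols[0], cols[2], cols[3]
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if not token_id.isdigit():
            continue
        if upos == "PUNCT":
            continue
        counts[lemma.lower()] += 1
    return counts


freqs = lemma_frequencies(SAMPLE)
print(freqs.most_common())
```

Counting lemmas instead of surface forms is a design choice that suits a morphologically rich language such as Russian; counting the FORM column (`cols[1]`) would give a word-form dictionary instead.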
The Taiga corpus is an ambitious project that aims to become the largest fully available web corpus constructed from open text sources. The Taiga corpus is:
- open source, CC BY-SA 3.0
- big - about 5 billion words so far
- sorted into datasets applicable to different machine learning tasks
- made by linguists experienced in text crawling, parsing and filtering
- rich with metainformation
- POS-tagged and syntactically annotated in Universal Dependencies
With these principles, we believe that a corpus product that meets the modern requirements of corpus linguistics can be created: it will not be a black box, it will reflect modern language and its features, and it will be unbiased and capable of encouraging more cooperation between developers and linguists.
This project is carried out within the HSE Compling framework.
Project creators
- Tatiana Shavrina (rybolos@gmail.com)
- Yana Kurmacheva (yana.kurmacheva@gmail.com)
Under the inspiring supervision of Olga Lyashevskaya
References:
Shavrina T., Shapovalova O. (2017). To the methodology of corpus construction for machine learning: «Taiga» syntax tree corpus and parser. In Proc. of the international conference “CORPORA2017”, Saint Petersburg, 2017.
Support or Contact
Check out our documentation or contact us and we’ll help you sort it out.
Users are welcome to ask questions on Google Groups!