An exhaustive list of open-source corpora for Russian
Tags: NLP, machine learning, data, open source, Russian
During my work as an NLP engineer I have come across many corpus projects that are not widely known or often mentioned, yet are a good source of text data for various kinds of research. Here I share this list with you, including the more popular projects as well, of course, so that the list is complete.
Send a pull request for this post or comment below if you know of a dataset or corpus of the Russian language that is not mentioned here!
Big and Open
- Russian Twitter Corpus
A corpus of short Russian texts based on Twitter posts. Suitable for training a language model for social media or short texts, and for training sentiment analysis and text toxicity classifiers.
17.6 million tweets available!
- Russian Common Crawl Data
541 TB of raw text data from the web. Contains duplicates; the sources and dates of the web pages are not obvious.
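Common Crawl distributes its plain-text extracts as WET files in the WARC container format. Below is a minimal sketch of streaming text out of one such file with the warcio library; the file name is only an example, real paths come from the Common Crawl index.

```python
# Stream plain-text records out of a Common Crawl WET file (pip install warcio).
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:  # example file name
    for record in ArchiveIterator(stream):
        if record.rec_type == "conversion":  # WET text records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            # language filtering and deduplication would go here
            print(url, len(text))
```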
- Taiga Corpus
(a minute of shameless self-promotion)
Taiga is a corpus project aiming to become the largest fully available web corpus built from open text sources. The data are available on request and contain datasets for text classification, language modelling, fake news detection, topic modelling, authorship attribution, social media research, etc.
6.5 billion tokens available
Special Corpora
Morphology and Syntax
- OpenCorpora
The first open-source corpus for Russian: about 2 million manually annotated words available, plus a dictionary.
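The OpenCorpora dictionary also underlies the pymorphy2 morphological analyzer, so the annotation scheme is easy to try out in code. A minimal sketch (pip install pymorphy2):

```python
import pymorphy2

# MorphAnalyzer loads the OpenCorpora-based dictionary
morph = pymorphy2.MorphAnalyzer()
for parse in morph.parse("стали"):
    # each hypothesis carries an OpenCorpora-style tag, a lemma and a probability score
    print(parse.word, parse.tag, parse.normal_form, parse.score)
```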
- Russian National Corpus
A subcorpus of the Russian National Corpus is distributed openly on request. Morphological annotation with manual verification.
1 million words available
- General Internet-Corpus of Russian
A LiveJournal and VKontakte corpus with automatically resolved ambiguity
2 million wordforms available
Annotation: ABBYY Compreno + rule-based verification. Available on request.
- MorphoRuEval Data
Data from the MorphoRuEval track, a shared task on automatic POS tagging for Russian
Consists of:
- plain texts:
  1) LiveJournal (from GICR): 30 million words
  2) Facebook, Twitter, VKontakte: 30 million words
  3) Librusec: 300 million words
- annotated data:
  1) RNC Open, a manually disambiguated subcorpus of the Russian National Corpus: 1.2 million words (fiction, news, nonfiction, spoken, blogs)
  2) GICR corpus with resolved homonymy: 1 million words
  3) OpenCorpora.org data: 400 thousand tokens
  4) UD SynTagRus: 900 thousand tokens (fiction, news)
- SynTagRus
A subcorpus of the Russian National Corpus with fiction and news texts and manual syntactic annotation. Now converted to Universal Dependencies (UD).
1 million tokens available
- Russian Universal Dependencies Data
Russian texts with morphological and syntactic annotation in UD, checked manually
The Parallel Universal Dependencies (PUD) treebank for Russian is also available.
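UD treebanks are distributed as CoNLL-U files, which are easy to read programmatically. A minimal sketch with the conllu library (pip install conllu); the file name is just an example:

```python
from conllu import parse_incr

with open("ru_syntagrus-ud-train.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):  # streams one TokenList at a time
        for token in sentence:
            print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
        break  # print only the first sentence
```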
Named Entity Recognition
- FactRuEval Data
Data from the FactRuEval track for Russian: ORG, LOC, PER and LOCORG annotation with manual verification, based on the Lentapedia project and Wikinews
- Gareev Corpus
A corpus compiled from different sources, used as a training set for the deepmipt NER system
Available on request; see the paper:
Rinat Gareev, Maksim Tkachenko, Valery Solovyev, Andrey Simanovsky, Vladimir Ivanov: Introducing Baselines for Russian Named Entity Recognition. Computational Linguistics and Intelligent Text Processing, 329–342 (2013).
- Persons-1000
A collection of NER-tagged sentences annotated by experts from the Russian Academy of Sciences
- Wikipedia Dumps
Russian Wikipedia: full articles available in different formats
300 million tokens available
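The raw dumps are large XML files, so the articles usually need to be extracted into plain text first. A minimal sketch with gensim's WikiCorpus (pip install gensim); the dump file name is an example, take the latest ruwiki pages-articles dump:

```python
from gensim.corpora import WikiCorpus

# dictionary={} skips building a gensim dictionary, we only want the texts
wiki = WikiCorpus("ruwiki-latest-pages-articles.xml.bz2", dictionary={})
for i, tokens in enumerate(wiki.get_texts()):  # yields a token list per article
    print(" ".join(tokens)[:200])
    if i >= 2:
        break
```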
- DBpedia
Structured information from Wikipedia in many languages, including Russian
Downloads available in RDF format.
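The RDF dumps can be loaded with any triple store or RDF library. A minimal sketch with rdflib (pip install rdflib); the file name is illustrative, and for full-size dumps a streaming parser or a proper triple store is a better fit than an in-memory graph:

```python
from rdflib import Graph

g = Graph()
g.parse("labels_ru.ttl", format="turtle")  # e.g. a small Russian labels file
for subj, pred, obj in g:
    print(subj, pred, obj)
    break
```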
Spellcheckers
- SpellRuEval trainset
Dataset from the SpellRuEval task for Russian, a track on spell checking for social media
10,000 sentences with errors, taken from the General Internet-Corpus of Russian
Sentiment Analysis
- Russian Twitter Corpus (again)
The same Twitter corpus as above: short Russian texts based on Twitter posts, well suited for training sentiment analysis and text toxicity classifiers.
17.6 million tweets available for download!
- SentiRuEval trainset
Dataset from the SentiRuEval 2016 task
About 20,000 tagged tweets with manual verification
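A tagged tweet collection like this is enough for a simple sentiment baseline. Below is a minimal sketch with scikit-learn; the TSV file name and its text/sentiment columns are assumptions for illustration, not the actual SentiRuEval release format:

```python
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts, labels = [], []
with open("sentirueval_tweets.tsv", encoding="utf-8") as f:  # hypothetical file
    for row in csv.DictReader(f, delimiter="\t"):
        texts.append(row["text"])        # assumed column name
        labels.append(row["sentiment"])  # assumed column name

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams cope well with noisy tweets
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(model, texts, labels, cv=5).mean())
```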
Open Corpus Derivatives
Vector Models
- RusVectores Vector Models
On the RusVectores project you can download pre-trained fastText, word2vec and doc2vec models trained on the main Russian web corpora (all available under a CC licence):
- Russian National Corpus
- Taiga
- General Internet-Corpus
- Aranea
- News Corpus
- Wikipedia
- Word2vec
A Russian skip-gram model resulting from the RUSSE evaluation track, available for download
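Any of these word2vec-format models can be loaded with gensim. A minimal sketch (pip install gensim); the file name is an example, and note that RusVectores models typically use lemma_POS keys such as "дом_NOUN":

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("model.bin", binary=True)  # example file name
print(wv.most_similar("дом_NOUN", topn=5))  # nearest neighbours of "дом" (house)
```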
Ngrams
- Google Ngrams
N-grams computed on Google Books data, available for multiple languages, including Russian
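The Google Books Ngram exports are tab-separated files where each line holds an n-gram, a year, a match count and a volume count. A minimal sketch for aggregating counts over years; the file name follows the published naming pattern but is only an example:

```python
import gzip
from collections import Counter

totals = Counter()
with gzip.open("googlebooks-rus-all-1gram-20120701-a.gz", "rt", encoding="utf-8") as f:
    for line in f:
        ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
        totals[ngram] += int(match_count)  # sum counts over all years

print(totals.most_common(10))
```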
- Russian National Corpus Ngrams
N-grams computed on the Russian National Corpus, easy to download; a top-100 list is also available.
- Alexa Ngrams
Data from the top 10 million web domains. More than 5 billion n-grams available on request for multiple languages.
- Common Crawl Ngrams
N-grams computed on the Common Crawl corpus, built from very noisy yet very large web data.
P.S. There is a great resource called NLPub, where some of these resources are also listed.
It would also be good to add the resources that are not listed there to its wiki articles.