An exhaustive list of open-source corpora for Russian
Tags: NLP, machine learning, data, open source, Russian
During my work as an NLP engineer I have come across many corpus projects that are not widely known or often mentioned, yet are a good source of text data for various kinds of research. Here I share this list with you, including the more popular projects as well, of course, so that the list is complete.
Send a pull request for this post or comment below if you know of a dataset or corpus of the Russian language that is not mentioned here!
Big and Open
- Russian Twitter Corpus
A corpus of short Russian texts based on Twitter posts. Suitable for training a language model for social media or short texts, and for training sentiment analysis and text toxicity classifiers.
17.6 million tweets available!
- Russian Common Crawl Data
541 TB of raw text data from the web. Contains duplicates; the sources and dates of the web pages are not obvious.
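Common Crawl distributes its plain-text extracts as WET files in the WARC container format. Below is a minimal sketch of streaming text out of one such file with the warcio library; the file name is only an example, real paths come from the Common Crawl index.

```python
# Stream plain-text records out of a Common Crawl WET file (pip install warcio).
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:  # example file name
    for record in ArchiveIterator(stream):
        if record.rec_type == "conversion":  # WET text records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            # language filtering and deduplication would go here
            print(url, len(text))
```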
- Taiga Corpus
(a minute of shameless self-promotion)
Taiga is a corpus project aiming to become the largest fully available web corpus built from open text sources. The data are available on request and contain datasets for text classification, language modelling, fake news detection, topic modelling, authorship attribution, social media research, etc.
6.5 billion tokens available
Special Corpora
Morphology and Syntax
- OpenCorpora
The first open-source corpus for Russian: about 2 million manually annotated words available, plus a dictionary.
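The OpenCorpora dictionary also underlies the pymorphy2 morphological analyzer, so the annotation scheme is easy to try out in code. A minimal sketch (pip install pymorphy2):

```python
import pymorphy2

# MorphAnalyzer loads the OpenCorpora-based dictionary
morph = pymorphy2.MorphAnalyzer()
for parse in morph.parse("стали"):
    # each hypothesis carries an OpenCorpora-style tag, a lemma and a probability score
    print(parse.word, parse.tag, parse.normal_form, parse.score)
```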
- Russian National Corpus
A subcorpus of the Russian National Corpus is distributed openly on request. Morphological annotation with manual verification.
1 million words available
- General Internet-Corpus of Russian
A LiveJournal and VKontakte corpus with automatically resolved ambiguity
2 million wordforms available
Annotation: ABBYY Compreno + rule-based verification. Available on request.
- MorphoRuEval Data
Data from the MorphoRuEval track, a shared task on automatic POS tagging for Russian
Consists of:
- plain texts:
  1) LiveJournal (from GICR): 30 million words
  2) Facebook, Twitter, VKontakte: 30 million words
  3) Librusec: 300 million words
- annotated data:
  1) RNC Open, a manually disambiguated subcorpus of the Russian National Corpus: 1.2 million words (fiction, news, nonfiction, spoken, blogs)
  2) GICR corpus with resolved homonymy: 1 million words
  3) OpenCorpora.org data: 400 thousand tokens
  4) UD SynTagRus: 900 thousand tokens (fiction, news)
- SynTagRus
A subcorpus of the Russian National Corpus with fiction and news texts and manual syntactic annotation. Now converted to Universal Dependencies (UD).
1 million tokens available
- Russian Universal Dependencies Data
Russian texts with morphological and syntactic annotation in UD, checked manually
The Parallel Universal Dependencies (PUD) treebank for Russian is also available.
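UD treebanks are distributed as CoNLL-U files, which are easy to read programmatically. A minimal sketch with the conllu library (pip install conllu); the file name is just an example:

```python
from conllu import parse_incr

with open("ru_syntagrus-ud-train.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):  # streams one TokenList at a time
        for token in sentence:
            print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
        break  # print only the first sentence
```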
Named Entity Recognition
- FactRuEval Data
Data from the FactRuEval track for Russian: ORG, LOC, PER and LOCORG annotation with manual verification, based on the Lentapedia project and Wikinews
- Gareev Corpus
A corpus compiled from different sources, used as a training set for the deepmipt NER system
Available on request; see the paper:
Rinat Gareev, Maksim Tkachenko, Valery Solovyev, Andrey Simanovsky, Vladimir Ivanov: Introducing Baselines for Russian Named Entity Recognition. Computational Linguistics and Intelligent Text Processing, 329–342 (2013).
- Persons-1000
A collection of NER-tagged sentences annotated by experts from the Russian Academy of Sciences
- Wikipedia Dumps
Russian Wikipedia: full articles available in different formats
300 million tokens available
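The raw dumps are large XML files, so the articles usually need to be extracted into plain text first. A minimal sketch with gensim's WikiCorpus (pip install gensim); the dump file name is an example, take the latest ruwiki pages-articles dump:

```python
from gensim.corpora import WikiCorpus

# dictionary={} skips building a gensim dictionary, we only want the texts
wiki = WikiCorpus("ruwiki-latest-pages-articles.xml.bz2", dictionary={})
for i, tokens in enumerate(wiki.get_texts()):  # yields a token list per article
    print(" ".join(tokens)[:200])
    if i >= 2:
        break
```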
- DBpedia
Structured information from Wikipedia in many languages, including Russian
Downloads available in RDF format.
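The RDF dumps can be loaded with any triple store or RDF library. A minimal sketch with rdflib (pip install rdflib); the file name is illustrative, and for full-size dumps a streaming parser or a proper triple store is a better fit than an in-memory graph:

```python
from rdflib import Graph

g = Graph()
g.parse("labels_ru.ttl", format="turtle")  # e.g. a small Russian labels file
for subj, pred, obj in g:
    print(subj, pred, obj)
    break
```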
Spellcheckers
- SpellRuEval trainset
Dataset from the SpellRuEval task for Russian, a track on spell checking for social media
10,000 sentences with errors, taken from the General Internet-Corpus of Russian
Sentiment Analysis
- Russian Twitter Corpus (again)
The same Twitter corpus as above: short Russian texts based on Twitter posts, well suited for training sentiment analysis and text toxicity classifiers.
17.6 million tweets available for download!
- SentiRuEval trainset
Dataset from the SentiRuEval 2016 task
About 20,000 tagged tweets with manual verification
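A tagged tweet collection like this is enough for a simple sentiment baseline. Below is a minimal sketch with scikit-learn; the TSV file name and its text/sentiment columns are assumptions for illustration, not the actual SentiRuEval release format:

```python
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts, labels = [], []
with open("sentirueval_tweets.tsv", encoding="utf-8") as f:  # hypothetical file
    for row in csv.DictReader(f, delimiter="\t"):
        texts.append(row["text"])        # assumed column name
        labels.append(row["sentiment"])  # assumed column name

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams cope well with noisy tweets
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(model, texts, labels, cv=5).mean())
```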
Open Corpus Derivatives
Vector Models
- RusVectores Vector Models
On the RusVectores project you can download pre-trained fastText, word2vec and doc2vec models trained on the main Russian web corpora (all available under a CC licence):
- Russian National Corpus
- Taiga
- General Internet-Corpus
- Aranea
- News Corpus
- Wikipedia
- Word2vec
A Russian skip-gram model resulting from the RUSSE evaluation track, available for download
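Any of these word2vec-format models can be loaded with gensim. A minimal sketch (pip install gensim); the file name is an example, and note that RusVectores models typically use lemma_POS keys such as "дом_NOUN":

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("model.bin", binary=True)  # example file name
print(wv.most_similar("дом_NOUN", topn=5))  # nearest neighbours of "дом" (house)
```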
Ngrams
- Google Ngrams
N-grams computed on Google Books data, available for multiple languages, including Russian
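The Google Books Ngram exports are tab-separated files where each line holds an n-gram, a year, a match count and a volume count. A minimal sketch for aggregating counts over years; the file name follows the published naming pattern but is only an example:

```python
import gzip
from collections import Counter

totals = Counter()
with gzip.open("googlebooks-rus-all-1gram-20120701-a.gz", "rt", encoding="utf-8") as f:
    for line in f:
        ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
        totals[ngram] += int(match_count)  # sum counts over all years

print(totals.most_common(10))
```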
- Russian National Corpus Ngrams
N-grams computed on the Russian National Corpus, easy to download; a top-100 list is also available.
- Alexa Ngrams
Data from the top 10 million web domains. More than 5 billion n-grams available on request for multiple languages.
- Common Crawl Ngrams
N-grams computed on the Common Crawl corpus, built from very noisy yet very large web data.
P.S. There is a great resource called NLPub, where some of these resources are also listed.
It would also be good to add the resources that are not listed there to its wiki articles.