An exhaustive list of open-source corpora for Russian

All projects for Russian with openly available texts

During my work as an NLP engineer, I have come across many corpus projects that are not widely known or cited, yet are a good source of text data for different kinds of research. Here I share the list with you, including the more popular projects too, of course, so that the list is complete.

Send your pull requests to this post or comment below if you know of any dataset or corpus of the Russian language that is not mentioned here!

Table of contents

  1. Big and Open
  2. Special Corpora
    1. Morphology and Syntax
    2. NER Parsing
    3. Spellcheckers
    4. Sentiment Analysis
  3. Open Corpus Derivatives
    1. Vector Models
    2. N-grams

Big and Open

  • Russian Twitter Corpus

A corpus of short Russian texts based on Twitter posts. Suitable for training a language model for social media or short texts, and for training classifiers for sentiment analysis and text toxicity.

17.6 million tweets available!


  • Russian Common Crawl Data

541 TB of raw text data from the web. It contains duplicates, and the sources and dates of the web pages are not obvious.
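Because the crawl contains duplicates, some filtering is usually needed before training on it. A minimal sketch of exact-duplicate removal, assuming the pages are already extracted as plain-text strings; the function names and the toy `pages` list are hypothetical, and near-duplicates would need a fuzzier method than hashing:

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents):
    """Yield each document whose normalized content has not been seen before."""
    seen = set()
    for doc in documents:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


# Toy input: the first two pages differ only in case and spacing.
pages = [
    "Привет,  мир!",
    "привет, мир!",
    "Другой текст.",
]
unique_pages = list(deduplicate(pages))
```

Only the hashes are kept in memory, not the texts, which matters at this scale; for the full crawl the `seen` set itself would have to live on disk or be sharded.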


  • Taiga Corpus

minute of shameless self-promotion

Taiga is a corpus project that aims to become the largest fully available web corpus constructed from open text sources. The data are available on request and include datasets for text classification, language modelling, fake news detection, topic modelling, authorship attribution, social media research, etc.

6.5 billion tokens available


Special Corpora

Morphology and Syntax

  • OpenCorpora

The first open-source corpus for Russian: about 2 million words with manual annotation are available, plus a dictionary


  • Russian National Corpus

A subcorpus of the Russian National Corpus is distributed openly on request. Morphological annotation with manual verification.

1 million words available


  • General Internet-Corpus of Russian

A LiveJournal and VKontakte corpus with automatically resolved ambiguity

2 million word forms available

Annotation: ABBYY Compreno + rule-based verification. Available on request.


  • MorphoRuEval Data

Data for the MorphoRuEval track, a competition in automatic POS tagging for Russian

Consists of:

  • plain texts:
    1. LiveJournal (from GICR): 30 million words
    2. Facebook, Twitter, VKontakte: 30 million words
    3. Librusec: 300 million words
  • annotated data:
    1. RNC Open, a manually disambiguated subcorpus of the Russian National Corpus: 1.2 million words (fiction, news, nonfiction, spoken, blog)
    2. GICR corpus with resolved homonymy: 1 million words
    3. data: 400 thousand tokens
    4. UD SynTagRus: 900 thousand tokens (fiction, news)


  • SynTagRus

A subcorpus of the Russian National Corpus containing fiction and news, with manual syntactic annotation. Now converted to UD.

1 million tokens available


  • Russian Universal Dependencies Data

Russian texts with morphological and syntactic annotation in UD, checked manually

SynTagRus, one more time


Parallel Universal Dependencies (PUD)
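UD treebanks such as these are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. A minimal reader sketch, not a full implementation of the format; the function name and the sample sentence are made up for illustration, and multiword ranges and empty nodes are simply skipped:

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines):
    """Yield sentences as lists of token dicts from CoNLL-U formatted lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments and metadata
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            continue  # skip malformed lines
        if not columns[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        yield sentence


# Hypothetical sample, not taken from any of the treebanks above.
sample = (
    "# text = Мама мыла раму.\n"
    "1\tМама\tмама\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tмыла\tмыть\tVERB\t_\tTense=Past|VerbForm=Fin\t0\troot\t_\t_\n"
    "3\tраму\tрама\tNOUN\t_\tCase=Acc|Number=Sing\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)
sentences = list(read_conllu(sample.splitlines()))
tagged = [(tok["form"], tok["upos"]) for tok in sentences[0]]
```

For real work a dedicated CoNLL-U library is probably the safer choice; the sketch only shows how little is needed to get forms, lemmas and tags out of the files.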


NER Parsing

  • FactRuEval Data

Data from the FactRuEval track for Russian: ORG, LOC, PER and LOCORG annotation with manual verification, drawn from the Lentapedia project and Wikinews


  • Gareev Corpus

A corpus compiled from different sources; the training set for the Deepmipt NER system


Available by request, see paper:

Rinat Gareev, Maksim Tkachenko, Valery Solovyev, Andrey Simanovsky, Vladimir Ivanov: Introducing Baselines for Russian Named Entity Recognition. Computational Linguistics and Intelligent Text Processing, 329 – 342 (2013).

  • Persons-1000

A collection of NER-tagged sentences annotated by experts of the Russian Academy of Sciences


  • Wikipedia Dumps

Russian Wikipedia: full articles in different formats

300 million tokens available


  • DBpedia

Structured information from Wikipedia in many languages, including Russian


Downloads are available in RDF format.



Spellcheckers

  • SpellRuEval trainset

Dataset for the SpellRuEval task, a track for spell-checkers applied to Russian social media texts

10,000 sentences with errors from the General Internet-Corpus of Russian


Sentiment Analysis

  • Russian Twitter Corpus (again)

A corpus of short Russian texts based on Twitter posts. Suitable for training a language model for social media or short texts, and for training classifiers for sentiment analysis and text toxicity.

17.6 million tweets available for download!


  • SentiRuEval trainset

Dataset from the SentiRuEval 2016 task

About 20,000 tagged tweets with manual verification


Open Corpus Derivatives

Vector Models

  • RusVectores Vector Models

From the RusVectores project you can download fastText, word2vec and doc2vec models pre-trained on the main Russian web corpora (all available under a CC licence):

  • Russian National Corpus
  • Taiga
  • General Internet-Corpus
  • Aranea
  • News Corpus
  • Wikipedia


  • Word2vec

A Russian skip-gram model resulting from the RUSSE evaluation track, available for download
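Pre-trained models like these are often shipped in the plain-text word2vec format: a header line with the vocabulary size and the dimensionality, then one word and its vector per line. A rough sketch of reading such a file and comparing two words by cosine similarity, using only the standard library; the toy vectors and the `word_POS` keys are invented for illustration, so check the vocabulary format of the model you actually download:

```python
import io
import math


def load_word2vec_text(handle):
    """Read vectors from the plain-text word2vec format into a dict."""
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-in for a downloaded model file (hypothetical values).
toy_model = io.StringIO(
    "3 2\n"
    "кот_NOUN 1.0 0.0\n"
    "кошка_NOUN 0.8 0.6\n"
    "стол_NOUN 0.0 1.0\n"
)
vectors = load_word2vec_text(toy_model)
similarity = cosine(vectors["кот_NOUN"], vectors["кошка_NOUN"])
```

A real model has hundreds of thousands of entries, so in practice a library with an array-based loader is a better fit than a dict of Python lists.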



N-grams

  • Google Ngrams

N-grams calculated on Google Books data, available for multiple languages, including Russian


  • Russian National Corpus Ngrams

N-grams from the Russian National Corpus, easy to download; a top-100 list is also available.


  • Alexa Ngrams

Data from the top 10M domains of the web. More than 5 billion n-grams are available on request for multiple languages.


  • Common Crawl Ngrams

N-grams from the Common Crawl corpus, computed on very noisy yet big web data.
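If none of the ready-made n-gram lists matches your domain, counts are easy to compute from any of the corpora above. A minimal sketch, assuming whitespace tokenization is good enough for a first look; the function names and the two-sentence toy corpus are hypothetical:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams over sentences, never crossing sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


# Toy corpus for illustration only.
corpus = [
    "мама мыла раму",
    "мама мыла окно",
]
bigram_counts = count_ngrams(corpus, 2)
top_bigram = bigram_counts.most_common(1)[0]
```

For Russian, counting over lemmas instead of surface forms usually gives less sparse statistics, which is one reason to start from a morphologically annotated corpus.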


P.S. There is a great resource called NLPub, where some of these resources are also listed.
It is also a good idea to add any resources not yet listed there to its wiki articles.

© 2017-2018. All rights reserved.
