An exhaustive list of open-source corpora for Russian

All Russian-language projects with openly available texts

In my work as an NLP engineer, I have come across many corpus projects that are not widely known or cited, yet are a good source of text data for different kinds of research. Here I share that list with you, including the more popular projects as well, so that the list is complete.

Send a pull request to this post or comment below if you know of any Russian dataset or corpus that is not mentioned here!

Table of contents

  1. Big and Open
  2. Special Corpora
    1. Morphology and Syntax
    2. NER Parsing
    3. Spellcheckers
    4. Sentiment Analysis
  3. Open Corpus Derivatives
    1. Vector Models
    2. N-grams

Big and Open

  • Russian Twitter Corpus

A corpus of short Russian texts based on Twitter posts. Suitable for training language models for social media or short texts, and for training classifiers for sentiment analysis and text toxicity.

17.6 million tweets available!

link

  • Russian Common Crawl Data

541 TB of raw text data from the web. It contains duplicates, and the sources and dates of the web pages are not obvious.

link
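Since duplicates are the first thing you run into with crawled data, here is a minimal sketch of exact-duplicate removal. It is not part of any Common Crawl tooling: the function names `normalize` and `deduplicate` are my own, and the normalization (lowercasing plus whitespace collapsing) is just one assumed choice. Near-duplicates would need something stronger, such as shingling or MinHash.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents):
    """Yield each document once, keeping the first occurrence of every text.

    Only a hash of the normalized text is stored, so memory grows with the
    number of distinct documents, not with their total size.
    """
    seen = set()
    for doc in documents:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


if __name__ == "__main__":
    # Hypothetical toy pages; the first two differ only in case and spacing.
    pages = ["Привет,  мир!", "привет, мир!", "Другая страница"]
    print(list(deduplicate(pages)))
```

The generator form lets you stream pages from disk instead of loading the whole crawl into memory.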

  • Taiga Corpus

A minute of shameless self-promotion:

Taiga is a corpus project that aims to become the largest fully available web corpus built from open text sources. The data is available on request and includes datasets for text classification, language modelling, fake news detection, topic modelling, authorship attribution, social media research, etc.

6.5 billion tokens available

link


Special Corpora

Morphology and Syntax

  • OpenCorpora

The first open-source corpus for Russian: about 2 million manually annotated words available, plus a dictionary.

link

  • Russian National Corpus

A subcorpus of the Russian National Corpus is distributed openly on request. Morphological annotation with manual verification.

1 million words available

link

  • General Internet-Corpus of Russian

A LiveJournal and VKontakte corpus with automatically resolved ambiguity

2 million wordforms available

Annotation: ABBYY Compreno + rule-based verification. Available on request.

link

  • MorphoRuEval Data

Data for the MorphoRuEval track, a competition in automatic POS tagging for Russian

Consists of:

  • plain texts:
    1. LiveJournal (from GICR): 30 million words
    2. Facebook, Twitter, VKontakte: 30 million words
    3. Librusec: 300 million words
  • annotated data:
    1. RNC Open, a manually disambiguated subcorpus of the Russian National Corpus: 1.2 million words (fiction, news, nonfiction, spoken, blog)
    2. GICR corpus with the resolved homonymy: 1 million words
    3. OpenCorpora.org data: 400 thousand tokens
    4. UD SynTagRus: 900 thousand tokens (fiction, news)

link

  • SynTagRus

A subcorpus of the Russian National Corpus containing fiction and news, with manual syntactic annotation. Now converted to UD (Universal Dependencies).

About 1 million tokens available

link

  • Russian Universal Dependencies Data

Russian texts with manually checked morphological and syntactic annotation in the UD format. The available treebanks:

SynTagRus, one more time

GSD

Parallel Universal Dependencies (PUD)

Taiga
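UD treebanks are distributed in the CoNLL-U format: one token per line, ten tab-separated columns, comment lines starting with `#`, and a blank line between sentences. Below is a minimal reader sketch, assuming you only need the basic tokens; the function name `parse_conllu` and the sample sentence are my own illustration and are not taken from any of the treebanks above. It skips multiword-token ranges (IDs like `1-2`) and empty nodes (IDs like `1.1`).

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment / metadata
        columns = line.split("\t")
        # Keep only regular tokens: ten columns and a plain integer ID.
        if len(columns) != len(FIELDS) or not columns[0].isdigit():
            continue
        token = dict(zip(FIELDS, columns))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"]) if token["head"].isdigit() else None
        current.append(token)
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


# Hand-written sample for illustration only.
SAMPLE = (
    "# text = Мама мыла раму.\n"
    "1\tМама\tмама\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tмыла\tмыть\tVERB\t_\tTense=Past\t0\troot\t_\t_\n"
    "3\tраму\tрама\tNOUN\t_\tCase=Acc|Number=Sing\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token["form"], token["upos"], token["head"], token["deprel"])
```

For real work a dedicated CoNLL-U library is probably the better choice; the sketch is only meant to show what the data looks like.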

NER Parsing

  • FactRuEval Data

Data from the FactRuEval track for Russian: manually verified ORG, LOC, PER and LOCORG annotation of texts from the Lentapedia project and Wikinews

link

  • Gareev Corpus

A corpus compiled from different sources, used as the training set for the Deepmipt NER system

link

Available on request; see the paper:

Rinat Gareev, Maksim Tkachenko, Valery Solovyev, Andrey Simanovsky, Vladimir Ivanov: Introducing Baselines for Russian Named Entity Recognition. Computational Linguistics and Intelligent Text Processing, pp. 329–342 (2013).

  • Persons-1000

A collection of NER-tagged sentences annotated by experts from the Russian Academy of Sciences

link

  • Wikipedia Dumps

Russian Wikipedia: full articles in different formats

300 million tokens available

link

  • DBpedia

Structured information from Wikipedia in many languages, including Russian

link

RDF format downloads:

link

Spellcheckers

  • SpellRuEval trainset

Dataset for the SpellRuEval task for Russian, a track on spell checking for social media texts

10,000 sentences with errors from the General Internet-Corpus of Russian

link

Sentiment Analysis

  • Russian Twitter Corpus (again)

A corpus of short Russian texts based on Twitter posts. Suitable for training language models for social media or short texts, and for training classifiers for sentiment analysis and text toxicity.

17.6 million tweets available for download!

link

  • SentiRuEval trainset

Dataset from the SentiRuEval task, 2016

About 20,000 tagged tweets with manual verification

link


Open Corpus Derivatives

Vector Models

  • RusVectores Vector Models

From the RusVectores project you can download fastText, word2vec and doc2vec models pre-trained on the main Russian web corpora (all available under a CC licence):

  • Russian National Corpus
  • Taiga
  • General Internet-Corpus
  • Aranea
  • News Corpus
  • Wikipedia

link

  • Word2vec

A Russian skip-gram model from the RUSSE evaluation track, available for download

link
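To give an idea of what you can do with such models once downloaded, here is a small sketch that reads vectors in the plain-text word2vec format (a header line with vocabulary size and dimensionality, then one word and its values per line) and compares words by cosine similarity. I am assuming a text-format file here; many released models are binary or compressed, and in practice you would likely load them with a library such as gensim. The function names and the three toy vectors are made up for illustration.

```python
import io
import math


def load_word2vec_text(stream):
    """Read word vectors from a text-format word2vec stream into a dict."""
    _vocab_size, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Toy 2-dimensional vectors, invented for this example.
    toy_file = io.StringIO("3 2\nкошка 1.0 0.0\nкот 0.9 0.1\nстол 0.0 1.0\n")
    vectors = load_word2vec_text(toy_file)
    print(cosine(vectors["кошка"], vectors["кот"]))
    print(cosine(vectors["кошка"], vectors["стол"]))
```

With a real model you would pass an opened file instead of the `io.StringIO` object.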

N-grams

  • Google Ngrams

N-grams computed on Google Books data, available for multiple languages, including Russian

link

  • Russian National Corpus Ngrams

N-grams computed on the Russian National Corpus, easy to download; a top-100 list is also available.

link

  • Alexa Ngrams

Data from the top 10M domains of the web. More than 5 billion n-grams available on request for multiple languages:

link

  • Common Crawl Ngrams

N-grams computed on the Common Crawl corpus, a very noisy but large collection of web data:

link
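If you would rather build n-gram lists from one of the corpora above than download ready-made ones, the counting itself is simple. Here is a minimal sketch with whitespace tokenization on a toy sentence; the function name `count_ngrams` is my own, and for Russian you would normally want a proper tokenizer and probably lemmatization first.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count all contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    tokens = "мама мыла раму и мама мыла окно".split()
    bigrams = count_ngrams(tokens, 2)
    # The most frequent bigram with its count.
    print(bigrams.most_common(1))
```

For corpus-sized inputs you would feed the counter sentence by sentence rather than holding all tokens in memory.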


P.S. There is a great resource called NLPub, where some of these resources are also listed.
It is also worth adding any resources that are missing there to its wiki articles.


© 2017-2018. All rights reserved.