Corpus segments

Here you can explore information about the corpus sources and download them.

Segment information

Genre               Tokens, millions    %
News                92                  1.5
Literary Texts      4605                76
Special datasets    2.5                 0.5
Social media        80                  1.5
Subtitles           101                 1.5
Poems               1130                19
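
The percentage column can be recomputed from the token counts. The following is a minimal sketch, assuming the counts in the segment table above are the only inputs; the names `tokens_millions` and `genre_shares` are illustrative and not part of any corpus tooling. The published percentages appear to be rounded, so the recomputed shares may differ slightly from the table.

```python
# Token counts in millions, copied from the segment table above.
tokens_millions = {
    "News": 92,
    "Literary Texts": 4605,
    "Special datasets": 2.5,
    "Social media": 80,
    "Subtitles": 101,
    "Poems": 1130,
}


def genre_shares(counts):
    """Return each genre's share of the total token count, in percent."""
    total = sum(counts.values())
    return {genre: 100 * n / total for genre, n in counts.items()}


total = sum(tokens_millions.values())
shares = genre_shares(tokens_millions)

# Print genres from largest to smallest share (hypothetical report layout).
for genre, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{genre:<18}{tokens_millions[genre]:>8}{share:>8.2f}")
```

Sorting by share makes the dominance of the Literary Texts and Poems segments easy to see at a glance.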

Token distribution per segment

Stihi.ru

Meta-attributes:

Distribution of Poems by Genre


Click here for more info.

Proza.ru

Meta-attributes:

Distribution of Texts by Genre


Click here for more info.

Lenta.ru

Meta-attributes:

Distribution of Articles by Category


Click here for more info.

Interfax

Meta-attributes:

Distribution of Articles by Tag


Click here for more info.

NPlus1

Meta-attributes:

Distribution of Texts by Difficulty


Click here for more info.

Komsomolskaya Pravda

Meta-attributes:

Distribution of Articles by Region


Click here for more info.

Russian Magazines Hall

Meta-attributes:

Click here for more info.

Fontanka.ru

Meta-attributes:

Distribution of Articles by Year


Click here for more info.

Arzamas

Meta-attributes:

Distribution of Articles by Category


Click here for more info.

TV Subtitles

Meta-attributes:

Distribution of Texts by Language


Click here for more info.