Corpus segments
Here you can explore information about our corpus sources and download them.
Segment information
Genre | Tokens (millions) | Share (%) |
---|---|---|
News | 92 | 1.5 |
Literary Texts | 4605 | 76 |
Special datasets | 2.5 | 0.5 |
Social media | 80 | 1.5 |
Subtitles | 101 | 1.5 |
Poems | 1130 | 19 |
Token distribution per segment
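As a rough illustration of how the share column relates to the token counts, the sketch below recomputes each segment's percentage from the figures in the table. The function name `segment_shares` is hypothetical, and the published percentages appear to be rounded, so the computed values need not match them exactly.

```python
# Token counts in millions, copied from the table above.
TOKENS_MILLIONS = {
    "News": 92,
    "Literary Texts": 4605,
    "Special datasets": 2.5,
    "Social media": 80,
    "Subtitles": 101,
    "Poems": 1130,
}


def segment_shares(tokens):
    """Return each segment's share of the total token count, in percent."""
    total = sum(tokens.values())
    return {name: 100 * count / total for name, count in tokens.items()}


total_millions = sum(TOKENS_MILLIONS.values())
shares = segment_shares(TOKENS_MILLIONS)

# Print unrounded shares so they can be compared with the table.
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```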
Stihi.ru
Meta-attributes:
- ‘textrubric’ – genre of the poem
- ‘textid’ – unique ID
- ‘textname’ – poem title
- ‘author’ – author(s)
- ‘authortexts’ – number of poems written by the author
- ‘authorreaders’ – number of visitors who read the poem
- ‘date’ – date of publication
- ‘time’ – time of publication
- ‘source’ – reference to the original source (sometimes unavailable)
Distribution of Poems by Genre
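A distribution like this could be tallied from the meta-attributes with a few lines of Python. This is a minimal sketch that assumes the metadata is available as tab-separated text with one row per poem. That file layout, the sample rows, and the helper `count_by_attribute` are all invented for illustration and are not taken from the corpus.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: the tab-separated layout and every row are invented.
SAMPLE_TSV = (
    "textid\ttextrubric\ttextname\tauthor\n"
    "1\tlyrics\tExample A\tAuthor One\n"
    "2\thumour\tExample B\tAuthor Two\n"
    "3\tlyrics\tExample C\tAuthor One\n"
)


def count_by_attribute(tsv_text, attribute):
    """Count how many rows carry each value of the given meta-attribute."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[attribute] for row in reader)


genre_counts = count_by_attribute(SAMPLE_TSV, "textrubric")

# List genres from most to least frequent.
for genre, count in genre_counts.most_common():
    print(genre, count)
```

The same helper would work for any other attribute, such as 'author', by passing a different column name.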
Proza.ru
Meta-attributes:
- ‘textrubric’ – text genre
- ‘textid’ – unique ID
- ‘textname’ – title
- ‘author’ – author(s)
- ‘authortexts’ – number of texts written by the author
- ‘authorreaders’ – number of visitors who read the text
- ‘date’ – date of publication
- ‘time’ – time of publication
- ‘source’ – reference to the original source (sometimes unavailable)
Distribution of Texts by Genre
Lenta.ru
Meta-attributes:
- ‘textid’ – unique ID
- ‘textname’ – article title
- ‘textrubric’ – article category
- ‘date’ – date of publication
- ‘time’ – time of publication
- ‘tags’ – article tags
- ‘source’ – reference to the original source (sometimes unavailable)
Distribution of Articles by Category
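Because 'source' is sometimes unavailable, code that consumes these records may want to treat it as optional. The sketch below models a Lenta.ru record as a dataclass. The class name `ArticleMeta` and the helper `from_row` are hypothetical, and 'date', 'time' and 'tags' are kept as raw strings because their formats are not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArticleMeta:
    """One article's meta-attributes (hypothetical container type)."""
    textid: str
    textname: str
    textrubric: str
    date: str   # kept as a raw string: the date format is not documented here
    time: str   # kept as a raw string: the time format is not documented here
    tags: str   # kept as a raw string: the tag separator is not documented here
    source: Optional[str] = None  # sometimes unavailable


def from_row(row):
    """Build an ArticleMeta from a dict, mapping an empty or missing 'source' to None."""
    source = (row.get("source") or "").strip() or None
    return ArticleMeta(
        textid=row["textid"],
        textname=row["textname"],
        textrubric=row["textrubric"],
        date=row["date"],
        time=row["time"],
        tags=row["tags"],
        source=source,
    )
```

Representing a missing source as `None` rather than an empty string lets callers check availability with a plain `if record.source is not None`.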
Interfax
Meta-attributes:
- ‘textid’ – unique ID
- ‘textname’ – title
- ‘textrubric’ – article category
- ‘date’ – date of publication
- ‘time’ – time of publication
- ‘tags’ – article tags
- ‘source’ – reference to the original source (sometimes unavailable)
Distribution of Articles by Tag
NPlus1
Meta-attributes:
- ‘textid’ – unique ID
- ‘textname’ – title
- ‘textdiff’ – text difficulty
- ‘author’ – author(s)
- ‘textrubric’ – article category
- ‘date’ – date of publication
- ‘time’ – time of publication
- ‘tags’ – article tags
- ‘source’ – reference to the original source (sometimes unavailable)
Distribution of Texts by Difficulty
Komsomolskaya Pravda
Meta-attributes:
- ‘textid’ – unique ID
- ‘textname’ – title
- ‘textregion’ – news by region
- ‘textrubric’ – article category
- ‘date’ – date of publication
- ‘time’ – time of publication
- ‘tags’ – article tags
- ‘source’ – reference to the original source (sometimes unavailable)
Distribution of Articles by Region
Russian Magazines Hall
Meta-attributes:
- ‘textid’ – unique ID
- ‘textname’ – title
- ‘magazine’ – magazine title
- ‘author’ – author(s)
- ‘date’ – date of publication
- ‘time’ – time of publication
- ‘tags’ – tags
- ‘source’ – reference to the original source (sometimes unavailable)
Fontanka.ru
Meta-attributes:
- ‘textid’ – unique ID
- ‘textname’ – title
- ‘textregion’ – news by region
- ‘textrubric’ – article category
- ‘date’ – date of publication
- ‘time’ – time of publication
- ‘tags’ – article tags
- ‘source’ – reference to the original source (sometimes unavailable)
Distribution of Articles by Year
Arzamas
Meta-attributes:
- ‘textid’ – unique ID
- ‘textname’ – title
- ‘authors’ – author(s)
- ‘authorprofession’ – author’s profession
- ‘about_author’ – short author bio
- ‘textrubric’ – article category
- ‘date’ – date of publication
- ‘time’ – time of publication
- ‘tags’ – article tags
- ‘source’ – reference to the original source (sometimes unavailable)
Distribution of Articles by Category
TV Subtitles
Meta-attributes:
- ‘textid’ – unique ID
- ‘title’ – film title
- ‘language’ – language
- ‘filepath’ – file path
Distribution of Texts by Language