Corpus of Rusyn Wikipedia

Query the Corpus

The corpus contains texts from the Rusyn Wikipedia as of 2024-03-15. The texts are tokenized and segmented into sentences, but they are not lemmatized. The orthography follows that of the original Wikipedia articles (one of the official variants, or an unofficial orthography). The corpus thus reflects the preferences of Wikipedia editors more than the language itself.

Attributes

word	word form (Cyrillic, case-sensitive)
lc	word form (Cyrillic, lowercase)
trans	word form (ASCII, lowercase)
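
As an illustration of how the three attributes can be used in CQL, the following Python sketch builds query strings of the usual `[attribute="value"]` form. The helper names `token` and `phrase` are hypothetical and not part of any corpus tool. CQL treats the value as a regular expression, and the sketch passes it through unchanged.

```python
# Positional attributes listed in the table above.
ATTRIBUTES = ("word", "lc", "trans")


def token(attr: str, value: str) -> str:
    """Return a CQL expression matching one token (hypothetical helper)."""
    if attr not in ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attr!r}")
    if '"' in value:
        # Keep the sketch simple: no escaping of embedded quotes.
        raise ValueError("double quotes are not supported in this sketch")
    return f'[{attr}="{value}"]'


def phrase(attr: str, *values: str) -> str:
    """Return a CQL expression matching consecutive tokens."""
    return " ".join(token(attr, v) for v in values)


print(token("trans", "rusyn"))             # [trans="rusyn"]
print(phrase("lc", "русиньскый", "язык"))  # [lc="русиньскый"] [lc="язык"]
```

The resulting strings are meant to be pasted into the CQL query field; `word` distinguishes case, while `lc` and `trans` match lowercase forms only.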

Structures

doc	document (Wikipedia page)
doc.id	unique ID of the document
doc.url	URL of the page
doc.title	page name
doc.timestamp	time of the last edit
p	paragraph
s	sentence
g	glue (no whitespace between the adjacent tokens)

Transliteration

To make the corpus queryable from keyboard layouts that lack Cyrillic letters or diacritics, an ad hoc transliteration (roughly based on the BGN/PCGN 2016 romanization) is used in the trans attribute (in CQL queries) and also in the Simple Query.

а	a
б	b
в	v
г	h
ґ	g
д	d
е	e
ё	jo
є	je
ж	zh
з	z
и	y
і	i
ї	ji
й	j
к	k
л	l
м	m
н	n
о	o
п	p
р	r
с	s
т	t
у	u
ф	f
х	x
ц	c
ч	ch
ш	sh
щ	sc
ъ	'
ы	y
ь	'
ѣ	ji
э	e
ю	ju
я	ja
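
The table above can be applied mechanically. The following Python sketch shows one way to do so. The mapping is copied from the table, while the function name `transliterate`, the lowercasing step (trans is described as lowercase), and the pass-through of characters missing from the table are assumptions of this sketch rather than a description of how the corpus was actually built.

```python
# Mapping copied from the transliteration table above.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "h", "ґ": "g", "д": "d",
    "е": "e", "ё": "jo", "є": "je", "ж": "zh", "з": "z", "и": "y",
    "і": "i", "ї": "ji", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ф": "f", "х": "x", "ц": "c", "ч": "ch", "ш": "sh",
    "щ": "sc", "ъ": "'", "ы": "y", "ь": "'", "ѣ": "ji", "э": "e",
    "ю": "ju", "я": "ja",
}


def transliterate(text: str) -> str:
    """Lowercase the text and replace each Cyrillic letter per the table.

    Characters not in the table (digits, punctuation, Latin letters)
    are passed through unchanged; this is an assumption of the sketch.
    """
    return "".join(TRANSLIT.get(ch, ch) for ch in text.lower())


print(transliterate("Русин"))  # rusyn
print(transliterate("язык"))   # jazyk
print(transliterate("день"))   # den'
```

The mapping is not reversible: и and ы both become y, ї and ѣ both become ji, е and э both become e, and ъ and ь both become an apostrophe, so a query on trans can match several distinct Cyrillic spellings.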

Please cite

Garabík, Radovan: Korpus textov rusínskej wikipédie. In Kultúra slova, vol. 58, no. 1, p. 55–59. ISSN 0023-5202.

Changelog

2024-03-15 – second version
2023-09-21 – first released version

Ľ. Štúr Institute of Linguistics

Slovak Academy of Sciences