Corpus of Rusyn Wikipedia

The corpus contains texts from the Rusyn Wikipedia as of 2023-09-12. The texts are tokenized and segmented into sentences, but they are not lemmatized. The orthography is that of the original Wikipedia articles (one of the official variants, or an unofficial orthography). The corpus thus reflects the preferences of Wikipedia editors more than the language itself.

Attributes

  • word – word form (Cyrillic, case-sensitive)
  • lc – word form (Cyrillic, lowercase)
  • trans – word form (ASCII, lowercase)

Structures

  • doc – document (Wikipedia page)
  • doc.id – unique ID of the document
  • doc.url – URL of the page
  • doc.title – page name
  • doc.timestamp – time of the last edit
  • p – paragraph
  • s – sentence
  • g – glue (no whitespace between the adjacent tokens)
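
The attributes and structures listed above can be combined in CQL queries. The following is a minimal sketch that only assembles query strings. It assumes the corpus is queried through a CQL dialect that accepts token conditions of the form [attr="regex"] and the "within <s/>" restriction, which this page does not state. The helper names token and within are hypothetical and are not part of any corpus tool.

```python
# Positional attributes and structures as listed on this page.
ATTRIBUTES = {"word", "lc", "trans"}
STRUCTURES = {"doc", "p", "s"}


def token(attr: str, pattern: str) -> str:
    """Build one CQL token condition, e.g. [lc="русин.*"]."""
    if attr not in ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attr}")
    if '"' in pattern:
        # Kept simple: this sketch does not try to escape quotes.
        raise ValueError("pattern must not contain a double quote")
    return f'[{attr}="{pattern}"]'


def within(query: str, structure: str) -> str:
    """Restrict a query to matches inside one structure (assumed CQL syntax)."""
    if structure not in STRUCTURES:
        raise ValueError(f"unknown structure: {structure}")
    return f"{query} within <{structure}/>"


# Case-insensitive prefix search on the Cyrillic form.
prefix_query = token("lc", "русин.*")

# Two adjacent words, typed in ASCII, inside a single sentence.
phrase_query = within(token("trans", "rusyn") + " " + token("trans", "jazyk"), "s")
```

Using lc rather than word makes the search case-insensitive, and trans allows the same search to be typed on a keyboard without Cyrillic letters.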

Transliteration

To make the corpus easier to query from keyboard layouts without Cyrillic letters or diacritics, the trans attribute (in CQL queries) and the Simple Query use an ad hoc transliteration, roughly based on the BGN/PCGN 2016 romanization. The mapping is not reversible: for example, both и and ы become y, and both ъ and ь become an apostrophe.

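  • а – a
  • б – b
  • в – v
  • г – h
  • ґ – g
  • д – d
  • е – e
  • ё – jo
  • є – je
  • ж – zh
  • з – z
  • и – y
  • і – i
  • ї – ji
  • й – j
  • к – k
  • л – l
  • м – m
  • н – n
  • о – o
  • п – p
  • р – r
  • с – s
  • т – t
  • у – u
  • ф – f
  • х – x
  • ц – c
  • ч – ch
  • ш – sh
  • щ – sc
  • ъ – '
  • ы – y
  • ь – '
  • ѣ – ji
  • э – e
  • ю – ju
  • я – ja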
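
The table above can be applied letter by letter to predict what to type in the trans attribute. The following is a minimal sketch, not the corpus's actual conversion code. The function name transliterate is hypothetical, and the lowercasing, the Unicode normalization and the pass-through of characters missing from the table are assumptions.

```python
import unicodedata

# Mapping copied from the transliteration table on this page.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "h", "ґ": "g", "д": "d", "е": "e",
    "ё": "jo", "є": "je", "ж": "zh", "з": "z", "и": "y", "і": "i", "ї": "ji",
    "й": "j", "к": "k", "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f", "х": "x", "ц": "c",
    "ч": "ch", "ш": "sh", "щ": "sc", "ъ": "'", "ы": "y", "ь": "'", "ѣ": "ji",
    "э": "e", "ю": "ju", "я": "ja",
}


def transliterate(word: str) -> str:
    """Return a lowercase ASCII form of a Cyrillic word, per the table."""
    # NFC joins a base letter and a combining mark (e.g. и + breve) into one
    # code point so that it matches the precomposed keys of TRANSLIT.
    normalized = unicodedata.normalize("NFC", word).lower()
    # Assumption: characters not in the table are passed through unchanged.
    return "".join(TRANSLIT.get(ch, ch) for ch in normalized)


print(transliterate("Русин"))  # rusyn
print(transliterate("язык"))   # jazyk
```

Because several Cyrillic letters share one Latin form, a trans query can match more than one Cyrillic spelling.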

Changelog

  • 2023-09-21 – first released version
