Corpus of Rusyn Wikipedia
The corpus contains texts from the Rusyn Wikipedia as of 2024-03-15. The texts are tokenized and segmented into sentences, but they are not lemmatized. The orthography follows that of the original Wikipedia articles (one of the official variants, or an unofficial orthography). The corpus thus reflects the preferences of Wikipedia editors more than the language itself.
Attributes
word | word form (Cyrillic, case-sensitive) |
lc | word form (Cyrillic, lowercase) |
trans | word form (ASCII, lowercase) |
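As a rough illustration of how the three attributes might be addressed in CQL, the following Python sketch builds single-token expressions of the form [attribute="value"]. The helper name cql_token is hypothetical and not part of any corpus tool; the sketch only assembles query strings and does not contact a corpus. It assumes that attribute values are interpreted as regular expressions, so only backslashes and double quotes are escaped here.

```python
def cql_token(attr: str, value: str) -> str:
    """Build a single-token CQL expression such as [lc="русин"].

    `attr` must be one of the attributes listed above (word, lc, trans).
    Backslashes and double quotes in `value` are escaped so that the quoted
    string stays well-formed; other regex metacharacters are left untouched.
    """
    if attr not in {"word", "lc", "trans"}:
        raise ValueError(f"unknown attribute: {attr}")
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'[{attr}="{escaped}"]'


# The same word addressed through each attribute.
print(cql_token("word", "Русин"))   # case-sensitive Cyrillic form
print(cql_token("lc", "русин"))     # lowercase Cyrillic form
print(cql_token("trans", "rusyn"))  # lowercase ASCII transliteration
```

Querying via lc or trans trades precision for convenience: lc ignores capitalization, and trans additionally ignores distinctions that the transliteration does not preserve.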
Structures
doc | document (Wikipedia page) |
doc.id | unique ID of the document |
doc.url | URL of the page |
doc.title | page name |
doc.timestamp | time of the last edit |
p | paragraph |
s | sentence |
g | no whitespace here |
Transliteration
To make the corpus easier to query from keyboard layouts that lack Cyrillic letters or diacritics, an ad hoc transliteration (roughly based on the BGN/PCGN 2016 romanization) is used in the trans attribute (in CQL queries) and in the Simple Query.
а | a |
б | b |
в | v |
г | h |
ґ | g |
д | d |
е | e |
ё | jo |
є | je |
ж | zh |
з | z |
и | y |
і | i |
ї | ji |
й | j |
к | k |
л | l |
м | m |
н | n |
о | o |
п | p |
р | r |
с | s |
т | t |
у | u |
ф | f |
х | x |
ц | c |
ч | ch |
ш | sh |
щ | sc |
ъ | ' |
ы | y |
ь | ' |
ѣ | ji |
э | e |
ю | ju |
я | ja |
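The table above can be applied mechanically. The following Python sketch shows one way to derive a trans-style form from a Cyrillic word; it is an illustration under stated assumptions, not the code used to build the corpus. The names TRANSLIT and transliterate are hypothetical, and the handling of characters missing from the table (passed through unchanged) and of Unicode normalization (NFC) is an assumption.

```python
import unicodedata

# Mapping copied from the transliteration table above (lowercase letters only).
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "h", "ґ": "g", "д": "d", "е": "e",
    "ё": "jo", "є": "je", "ж": "zh", "з": "z", "и": "y", "і": "i", "ї": "ji",
    "й": "j", "к": "k", "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f", "х": "x", "ц": "c",
    "ч": "ch", "ш": "sh", "щ": "sc", "ъ": "'", "ы": "y", "ь": "'", "ѣ": "ji",
    "э": "e", "ю": "ju", "я": "ja",
}


def transliterate(word: str) -> str:
    """Return a lowercase ASCII form of `word` using the TRANSLIT table.

    The input is normalized to NFC first so that letters such as ё, ї and й
    are single code points. Characters not in the table (digits, Latin
    letters, punctuation) are passed through unchanged; this is an assumption.
    """
    normalized = unicodedata.normalize("NFC", word).lower()
    return "".join(TRANSLIT.get(ch, ch) for ch in normalized)


print(transliterate("Русин"))  # rusyn
print(transliterate("язык"))   # jazyk
```

The mapping is lossy: и and ы both become y, ї and ѣ both become ji, and ъ and ь both become an apostrophe, so a query on trans may match several distinct Cyrillic spellings.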
Please cite
Garabík, Radovan: Korpus textov rusínskej wikipédie. In Kultúra slova, vol. 58, no. 1, p. 55–59. ISSN 0023-5202.
Changelog
- 2024-03-15 – second version
- 2023-09-21 – first released version