Error Corpus of Slovak CHIBY

The CHIBY corpus is built from the revision history of Slovak Wikipedia. It contains paragraph revisions with automatically annotated revision types. Because the goal is to maximize inclusiveness, the corpus considers every pair of subsequent revisions that passes a filter for acceptable edits; some vandal changes nevertheless get through.
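As a rough illustration of what "each pair of subsequent revisions" means in practice, the sketch below pairs up successive versions of a paragraph and extracts the token spans that differ. It is not the corpus pipeline (which is based on Wiki Edits 2.0); the function names `revision_pairs` and `changed_tokens`, the whitespace tokenization, and the example sentences are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def revision_pairs(revisions):
    """Return (older, newer) tuples for each pair of subsequent revisions."""
    return list(zip(revisions, revisions[1:]))


def changed_tokens(old, new):
    """Return the (old_tokens, new_tokens) spans that differ between two texts.

    Tokens are split on whitespace, which is a simplification.
    """
    a, b = old.split(), new.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical revision history of a single paragraph, oldest first.
history = [
    "Bratislava je hlavne mesto Slovenska.",
    "Bratislava je hlavné mesto Slovenska.",
    "Bratislava je hlavné mesto Slovenska",
]

edits = [changed_tokens(old, new) for old, new in revision_pairs(history)]
for edit in edits:
    print(edit)
```

Three revisions yield two pairs, and each pair here contains exactly one changed token. A real pipeline would also need the filtering step mentioned above, which this sketch leaves out.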

The following revision types are distinguished (roughly following the annotation of SketchEngine error corpora):

spelling
punct – revision in punctuation
typographical
diacritics – revision only in diacritics
capitalisation
lexicosemantic
unclassified
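
Some of these types can be recognized by simple string comparison. The following is a minimal sketch under stated assumptions, not the annotation procedure actually used for the corpus: the function `classify`, its helpers, and the order of the checks are hypothetical, and only the labels `punct`, `capitalisation`, `diacritics` and `unclassified` from the list above are covered.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (e.g. é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_punctuation(text):
    """Remove every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def classify(old, new):
    """Assign a revision type to a pair of strings (assumed heuristic)."""
    if old == new:
        return None  # no revision at all
    if strip_punctuation(old) == strip_punctuation(new):
        return "punct"
    if old.casefold() == new.casefold():
        return "capitalisation"
    if strip_diacritics(old) == strip_diacritics(new):
        return "diacritics"
    # spelling, typographical and lexicosemantic revisions need more than
    # string comparison, so this sketch does not attempt them.
    return "unclassified"


for old, new in [
    ("hlavne", "hlavné"),
    ("slovensko", "Slovensko"),
    ("Slovenska.", "Slovenska"),
    ("mesto", "obec"),
]:
    print(old, new, classify(old, new))
```

A pair that changes both case and diacritics at once falls through to `unclassified` here; whether the corpus treats such pairs the same way is not stated in this description.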

Whitespace changes are not annotated; they are visible in parallel viewing mode.

This is very much a work in progress. The corpus is based on Wiki Edits 2.0.

Please cite

Radovan Garabík: Slovenská Wikipédia ako zdroj dát pre korpus chýb. In: Kultúra slova, 2026, vol. 60, no. 2, pp. 74-79. ISSN 0023-5202.

Changelog

2024-12-18 – current version v0.4 is based on Wikipedia as of 2024-12-01; corpus contains 718 610 sentences (15 028 790 tokens, 11 087 501 words)
2019-08-05 – version v0.3 adds <capitalisation> and <typographical> structures; revisions in digits moved to <unclassified>; data files are available
2019-07-25 – first released version v0.2 is based on Wikipedia as of 2019-07-01; corpus contains 573 689 sentences (11 830 144 tokens, 8 713 055 words)

Literature

Roman Grundkiewicz, Marcin Junczys-Dowmunt: The WikEd error corpus: A corpus of corrective Wikipedia edits and its application to grammatical error correction. In: International Conference on Natural Language Processing. Springer, Cham, 2014, pp. 478-490.
Jiří Kletečka: Wikipedia Learner’s Corpus. Bachelor’s thesis. Masaryk University, Faculty of Informatics, Brno, 2017 [cit. 2018-03-07]. Thesis supervisor Vít Baisa.
Karel Pala, Pavel Rychlý, Pavel Smrž: Text Corpus with Errors. In: Matoušek V., Mautner P. (eds) Text, Speech and Dialogue. TSD 2003. Lecture Notes in Computer Science, vol. 2807. Springer, Berlin, Heidelberg.
https://github.com/ufal/wiki-error-corpus

Ľ. Štúr Institute of Linguistics

Slovak Academy of Sciences
