Error Corpus of Slovak CHIBY

The CHIBY corpus is based on the revision history of Slovak Wikipedia. It contains paragraph revisions with automatically annotated revision types. Because the goal is to maximize inclusiveness, the corpus considers each pair of subsequent revisions after filtering for acceptable edits; some vandalism nevertheless gets through the filter.
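The pairing of subsequent revisions can be illustrated with a minimal Python sketch. It is not the code used to build the corpus: the function name `revision_pairs` and the sample paragraph are hypothetical, and the only filter applied here (dropping pairs with no change) is an assumption standing in for the real acceptability filter, which is not described on this page.

```python
def revision_pairs(revisions):
    """Yield (older, newer) for each pair of subsequent revisions that differ.

    `revisions` is a list of paragraph texts ordered from oldest to newest.
    Dropping identical pairs is an assumed stand-in for the corpus filter.
    """
    for older, newer in zip(revisions, revisions[1:]):
        if older != newer:
            yield older, newer


# Hypothetical revision history of one paragraph (oldest first).
history = [
    "Bratislava je hlavne mesto.",
    "Bratislava je hlavné mesto.",
    "Bratislava je hlavné mesto.",
    "Bratislava je hlavné mesto Slovenska.",
]

pairs = list(revision_pairs(history))
for older, newer in pairs:
    print(older, "->", newer)
```

Each revision except the first and the last appears in two pairs, once as the newer text and once as the older one, so no intermediate edit is skipped.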

The following revision types are distinguished (roughly following the annotation of SketchEngine error corpora):

Whitespace changes are not annotated; they are visible in parallel viewing mode.

This is very much a work in progress. The corpus is based on Wiki Edits 2.0.


  • 2019-08-05 – current version v0.3 adds <capitalisation> and <typographical> structures and moves revisions in digits to <unclassified>; data files are available

  • 2019-07-25 – first released version v0.2 is based on Wikipedia as of 2019-07-01; the corpus contains 573 689 sentences (11 830 144 tokens, 8 713 055 words)


