Error Corpus of Slovak CHIBY

The CHIBY corpus is built from the revision history of Slovak Wikipedia. It contains revisions of paragraphs, each automatically annotated with a revision type. Because the goal is to maximize inclusiveness, the corpus considers every pair of subsequent revisions that passes a filter for acceptable edits; some vandal changes nevertheless get through.
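The pairing of subsequent revisions can be illustrated with a minimal Python sketch. It is not the corpus's actual pipeline: the function names `revision_pairs` and `token_edits`, the whitespace tokenization, and the sample paragraph history are all assumptions made for illustration, and the filter for acceptable edits is left out.

```python
from difflib import SequenceMatcher


def revision_pairs(revisions):
    """Yield (older, newer) pairs of subsequent revisions that actually differ."""
    for older, newer in zip(revisions, revisions[1:]):
        if older != newer:
            yield older, newer


def token_edits(older, newer):
    """Return the (original tokens, revised tokens) spans that changed."""
    # Assumption: naive whitespace tokenization, good enough for a sketch.
    a, b = older.split(), newer.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical paragraph history, oldest first (not taken from the corpus).
history = [
    "Bratislava je hlavne mesto Slovenska.",
    "Bratislava je hlavné mesto Slovenska.",
    "Bratislava je hlavné mesto Slovenska.",
    "Bratislava je hlavné a najväčšie mesto Slovenska.",
]

edits = [token_edits(older, newer) for older, newer in revision_pairs(history)]
```

Here the identical third revision is skipped, so `edits` holds one entry per real change: a diacritic correction and an insertion of two tokens.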

The following revision types are distinguished (roughly following the annotation of SketchEngine error corpora):

Whitespace changes are not annotated; they are visible in the parallel viewing mode.
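A revision that differs only in whitespace could be detected as in the sketch below. The helper `is_whitespace_only_change` is hypothetical, and treating any added, removed, or altered whitespace as a whitespace change is an assumption; the corpus may define it differently.

```python
def is_whitespace_only_change(older, newer):
    """Return True if the two texts differ, but only in their whitespace."""
    # Assumption: compare the texts with all whitespace stripped out.
    return older != newer and "".join(older.split()) == "".join(newer.split())
```

Under these assumptions, "rok  2019" versus "rok 2019" counts as a whitespace change and would be left without annotation, while "rok 2019" versus "rok 2018" would not.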

This is very much a work in progress. The corpus is based on Wiki Edits 2.0.

Changelog

  • 2019-08-05 – current version v0.3 adds the <capitalisation> and <typographical> structures and moves revisions in digits to <unclassified>; data files are available
  • 2019-07-25 – first released version v0.2, based on Wikipedia as of 2019-07-01; the corpus contains 573 689 sentences (11 830 144 tokens, 8 713 055 words)

Literature

  • Roman Grundkiewicz, Marcin Junczys-Dowmunt: The WikEd error corpus: A corpus of corrective Wikipedia edits and its application to grammatical error correction. In: International Conference on Natural Language Processing. Springer, Cham, 2014. pp. 478-490.
  • Jiří Kletečka: Wikipedia Learner’s Corpus. Brno, 2017 [cit. 2018-03-07]. Bachelor’s thesis. Masaryk University, Faculty of Informatics. Thesis supervisor Vít Baisa.
  • Karel Pala, Pavel Rychlý, Pavel Smrž: Text Corpus with Errors. In: Matoušek V., Mautner P. (eds) Text, Speech and Dialogue. TSD 2003. Lecture Notes in Computer Science, vol 2807. Springer, Berlin, Heidelberg.
  • https://github.com/ufal/wiki-error-corpus