Typical Corpus Examples

Click here for the interface.

Typical Corpus Examples is a method for automatically extracting short, meaningful examples from a text corpus. It uses Monte Carlo-like sampling and ranks candidate examples by their frequency and by how close they are to a desired length. You can query by lemma, by word form, or with an arbitrary CQL expression supported by the target corpus.
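The sampling and ranking idea can be illustrated with a minimal sketch. It is not the tool's actual implementation: the input format (`hits` as `(left, keyword, right)` token lists), the helper names, the `max_context` limit and the scoring formula are all assumptions made for illustration.

```python
import random
from collections import Counter

def candidate_examples(left, keyword, right, max_context=3):
    """Yield every trimmed variant of one concordance line as a token tuple."""
    for i in range(min(len(left), max_context) + 1):
        for j in range(min(len(right), max_context) + 1):
            # Keep the last i tokens on the left and the first j tokens on the right.
            yield tuple(left[len(left) - i:]) + (keyword,) + tuple(right[:j])

def rank_examples(hits, sample_size, ideal_length, seed=0):
    """Rank candidate examples found in a random sample of concordance lines."""
    rng = random.Random(seed)
    sample = rng.sample(hits, min(sample_size, len(hits)))

    # Count how often each candidate example occurs in the sample.
    counts = Counter()
    for left, keyword, right in sample:
        counts.update(candidate_examples(left, keyword, right))

    def score(example):
        # Hypothetical score: frequency, damped by the distance from the ideal length.
        return counts[example] / (1 + abs(len(example) - ideal_length))

    return sorted(counts, key=lambda ex: (-score(ex), ex))
```

With three toy concordance lines for "dog", two of which share the context "big dog barks", the shared three-token example ranks first for an ideal length of 3, because it is both frequent and of the desired length.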

Tuning Options

  • Corpus:
    • select the corpus
  • Query:
    • the query: a word form, a lemma or a CQL expression
  • Query type:
    • auto: use the query as CQL if it looks like valid CQL syntax, otherwise substitute it with [lemma="query" | word="query"]
    • lemma
    • word form
    • CQL
  • Filter 1st: keep only the first occurrence from each document. Recommended for all but rare words.
  • Sample size:
    • size of the sample, in concordance lines. Increase it if no reasonable examples are returned or if you are chasing rare phraseology; larger samples slow down the response.
  • Ideal length:
    • examples of this length (in tokens) get a better score.
  • Weight long context by:
    • prefer long contexts (longer than the ideal length above) by this factor. The range is 0 to 10; reasonable values are roughly 0.5 to 3.
  • Left context weight, Right context weight:
    • multiply the length of the left or right context by this number. The weighted length counts towards the ideal length and the long-context preference. Use these weights to (un)balance the left and right context according to the direction of syntactic structures (e.g. for head-initial structures, increase the right weight). The most useful values are between 0 and 1.
  • Min count in sample:
    • the left context, the right context and the whole example each have to occur at least this many times in the sample. 2 is a good value; 1 returns everything and is useful only for testing and for inspecting the scoring; you can increase the value for common phrases and large samples.
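Two of the options above, the "auto" query type and the minimum count in the sample, can be sketched as follows. This is an illustration under stated assumptions, not the interface's actual code: the function names, the heuristic for recognising CQL and the `(left, right)` representation of a hit are hypothetical.

```python
from collections import Counter

def looks_like_cql(query):
    # Assumed heuristic: CQL starts with a token bracket, a quoted string or a structure tag.
    return query.startswith(("[", '"', "<"))

def build_cql(query, query_type="auto"):
    """Turn user input into a CQL expression (values are not regex-escaped here)."""
    query = query.strip()
    if query_type == "cql" or (query_type == "auto" and looks_like_cql(query)):
        return query
    if query_type == "auto":
        return f'[lemma="{query}" | word="{query}"]'
    if query_type in ("lemma", "word"):
        return f'[{query_type}="{query}"]'
    raise ValueError(f"unknown query type: {query_type}")

def filter_by_min_count(hits, min_count=2):
    """Keep distinct (left, right) pairs whose left context, right context
    and whole example each occur at least min_count times in the sample."""
    left_counts = Counter(left for left, _ in hits)
    right_counts = Counter(right for _, right in hits)
    whole_counts = Counter(hits)
    kept = []
    for hit in whole_counts:  # a Counter iterates in first-seen order
        left, right = hit
        if min(left_counts[left], right_counts[right], whole_counts[hit]) >= min_count:
            kept.append(hit)
    return kept
```

For instance, `build_cql("house")` yields `[lemma="house" | word="house"]`, while an input that already starts with `[` is passed through unchanged. With `min_count=1` the filter returns every distinct hit, which matches the remark that this setting is useful only for testing.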


Citation

Radovan Garabík, Agáta Karčová: Analyzing grammatical anomalies in lexical data for fun and profit. In: Júlia Ballagó, Veronika Lipp (eds.) 1st International Conference on Lexicology and Lexicography. Book of abstracts. Budapest: ELTE Research Centre for Linguistics, 2025. pp. 23-24.


This work received support from the CA21167 COST action UniDive, funded by COST (European Cooperation in Science and Technology).