Frequencies and ARF (Araneum Slovacum VII Maximum)

Datasets containing raw frequencies and Average Reduced Frequencies from the Araneum Slovacum VII Maximum web corpus.

MSD tags follow the Slovak National Corpus tagset; POS is simply the first character of the MSD tag.

Format

The datasets are delivered as tab-separated files with the following columns:

t \t f(t) \t ARF(t) \t h(t)

t is the “token”: either a word form, a word form with an MSD tag, or a lemma with a POS tag, depending on the dataset. f(t) is the raw frequency of t in the corpus, i.e. its number of occurrences.

We define the homogeneity h(t) of the token t as:

h(t) = (ARF(t)-1) / (f(t)-1)

Note that if f(t)=1, it follows from the definition of ARF that ARF(t)=1 and h(t) is undefined. In our data, we use h(t)=-1 to mark an undefined value (so that you can use the same numeric parsing code, without special conditions).
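To make the formula and the f(t)=1 case concrete, here is a minimal Python sketch. It assumes the commonly used definition of ARF, ARF = (1/v) * sum(min(d_i, v)) with v = N/f, where N is the corpus size and d_i are the cyclic distances between consecutive occurrences. The function names and the toy positions are illustrative only and are not part of the datasets.

```python
from typing import Sequence


def arf(positions: Sequence[int], corpus_size: int) -> float:
    """Average Reduced Frequency from token positions (assumed standard definition)."""
    f = len(positions)
    if f == 0:
        return 0.0
    pos = sorted(positions)
    v = corpus_size / f  # average distance between occurrences
    # Distances between consecutive occurrences, treating the corpus as a cycle:
    # the last distance wraps around from the final occurrence to the first.
    distances = [pos[i] - pos[i - 1] for i in range(1, f)]
    distances.append(pos[0] + corpus_size - pos[-1])
    return sum(min(d, v) for d in distances) / v


def homogeneity(f: int, arf_value: float) -> float:
    """h = (ARF - 1) / (f - 1); -1 marks the undefined case f == 1, as in the data."""
    if f == 1:
        return -1.0
    return (arf_value - 1) / (f - 1)


# Toy corpus of 100 tokens (made-up positions, not taken from the datasets).
single = arf([5], 100)              # one occurrence: ARF is 1, h is undefined
even = arf([0, 25, 50, 75], 100)    # evenly spread: ARF equals f (4), h is 1
burst = arf([0, 1, 2, 3], 100)      # clustered: ARF is about 1.12, h is about 0.04
```

With a single occurrence the only cyclic distance is N itself and v = N, so the sum reduces to 1, which is why h(t) has a zero denominator in that case.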

Description of the files:

file                                               first column (t)
dataset-arf-araneum-slovacum-vii-word.tsv.xz       word form (case sensitive)
dataset-arf-araneum-slovacum-vii-lemma-pos.tsv.xz  lemma+POS (concatenated, i.e. POS==t[-1])
dataset-arf-araneum-slovacum-vii-word-tag.tsv.xz   word+MSD tag (tab separated, i.e. two columns)
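The three layouts can be read with one small parser. The following is a sketch, not a reference implementation: it assumes the files are UTF-8, have no header row, and store f(t) as an integer. The names Row, parse_row and read_dataset, the kind labels, and the sample lines in the comments are hypothetical.

```python
import lzma
from typing import Iterator, NamedTuple, Optional


class Row(NamedTuple):
    token: str            # word form or lemma
    tag: Optional[str]    # MSD tag, POS character, or None for the word dataset
    freq: int             # f(t)
    arf: float            # ARF(t)
    h: Optional[float]    # h(t), None where the file marks it undefined with -1


def parse_row(line: str, kind: str) -> Row:
    """Parse one line; kind is 'word', 'lemma-pos' or 'word-tag' (labels assumed here)."""
    fields = line.rstrip("\n").split("\t")
    if kind == "word":
        token, tag, numbers = fields[0], None, fields[1:4]
    elif kind == "lemma-pos":
        # POS is the last character of the first column (POS == t[-1]).
        token, tag, numbers = fields[0][:-1], fields[0][-1], fields[1:4]
    elif kind == "word-tag":
        # Word and MSD tag occupy two columns, so the numbers start one column later.
        token, tag, numbers = fields[0], fields[1], fields[2:5]
    else:
        raise ValueError(f"unknown dataset kind: {kind!r}")
    h = float(numbers[2])
    return Row(token, tag, int(numbers[0]), float(numbers[1]), None if h < 0 else h)


def read_dataset(path: str, kind: str) -> Iterator[Row]:
    """Stream rows from an .xz-compressed file without unpacking it to disk."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_row(line, kind)


# Made-up sample lines, for illustration only:
#   parse_row("mestoS\t10\t5.5\t0.5", "lemma-pos")  gives token "mesto", tag "S"
#   parse_row("meste\tSSns6\t1\t1.0\t-1", "word-tag")  gives h None
```

Mapping the -1 marker to None is a design choice for this sketch; code that prefers purely numeric columns can keep the -1 and filter on h < 0 instead.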

Download

The files can be downloaded here.