Explore Word Embedding for Semantic Similarity of Words

Throughout this document, we use the term “word” to mean either lemma or word form, depending on the selected model, or a bigram (constituents joined by an underscore).

Basic Usage

  1. select language
  2. enter word
    • e.g. beer shows words semantically close to beer; in the German model we can see that Bier is much more clearly delineated and semantically separated from other beverages and foods
    • or a list of words, or an arithmetic expression: king - he + she should point to queen as the semantically nearest word
  3. fixed (two-word) collocations can be queried by joining the words with the _ underscore character, e.g. Attorney_General
  4. select model
  5. select visualization
  6. click Go!

»The query interface is here«

Single Word Query

The expected input is a single word. The result is the usual table of similar words, sorted by their (semantic) similarity, and the usual visualization.
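For readers who prefer to query a downloaded model directly (see the Download section below), roughly the same single word query can be reproduced in Python with Gensim. This is only a sketch; the variable model stands for an already loaded KeyedVectors model (a loading example is given in the Download section):

    # Top 10 words most similar to "beer" by cosine similarity.
    # model is assumed to be a gensim KeyedVectors instance loaded beforehand.
    for word, similarity in model.most_similar("beer", topn=10):
        print(f"{word}\t{similarity:.3f}")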

Multiple Word Query

The expected input is a list of words, separated by spaces or plus signs. The output is a list of the words nearest to the normalized sum of the vectors of the input words, i.e. the words semantically similar to all the words in the input. Words preceded by a minus sign are subtracted; the result then lies in a region of meaning away from these words.
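In Gensim terms this corresponds to the positive and negative arguments of most_similar. A minimal sketch, again assuming an already loaded KeyedVectors model in the variable model:

    # "king - he + she": words added with "+" go into positive,
    # words preceded by "-" go into negative; Gensim combines the
    # normalized vectors and returns the words nearest to the result.
    result = model.most_similar(positive=["king", "she"], negative=["he"], topn=10)
    print(result[0][0])  # ideally "queen" (depends on the model)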

Two Word Query

As a special case, when exactly two words are entered, their semantic distance (a number between 0 and 1) is displayed, in addition to the usual table of similar words and visualization.
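The pairwise value can also be computed offline. Note that Gensim's similarity() returns plain cosine similarity; how the interface maps this to the 0 to 1 range may differ slightly (an assumption, not something stated here):

    # Cosine similarity between two words (a sketch; assumes a loaded
    # KeyedVectors model in the variable model).
    sim = model.similarity("beer", "wine")
    print(f"similarity(beer, wine) = {sim:.3f}")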

Citation

Please cite:

Garabík, Radovan. Word Embedding Based on Large-Scale Web Corpora as a Powerful Lexicographic Tool. Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje 46, no. 2 (2020): 603-618. https://doi.org/10.31724/rihjj.46.2.8

Chinese Models

Special note on Chinese models:

There are three models of simplified Chinese, all based on the same source. These models do not follow the same pattern as those for the other languages, owing to specific features of the Chinese language and writing system. Instead, there are these models:

  • Model trained on the level of individual words (词). This is the closest to the usual word embedding usage. Latin script elements are uppercased and tokenized as separate words.
  • Model trained on the level of individual characters (字). Latin script elements are still uppercased and tokenized as separate words.
  • Model trained on the pinyin representation of words. This is not a simple transcription of the characters in the interface; rather, pinyin is the underlying representation the model is built on, which mainly changes the semantic relations between homophones. To facilitate queries, tones are written using the digits 1 to 5 (the neutral tone has the number 5); see the sketch after this list.
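As an illustration of the tone-digit convention, a hedged Python sketch against the pinyin model; the token pi2jiu3 (啤酒, "beer") and the way the syllables of a word are joined are assumptions on my part, so the vocabulary check may fail with a differently tokenized model:

    # Query the pinyin model with a tone-digit token (hypothetical token;
    # adjust to the model's actual pinyin tokenization of words).
    query = "pi2jiu3"
    if query in pinyin_model.key_to_index:  # pinyin_model: loaded KeyedVectors
        print(pinyin_model.most_similar(query, topn=10))
    else:
        print(f"{query!r} not in vocabulary; check syllable joining and tone digits")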

Download

Several models are available for download (in Gensim format) here: https://www.juls.savba.sk/data/semä/
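A sketch of loading one of the downloaded models in Python with Gensim; the file name below is purely hypothetical, and whether a given download is a saved KeyedVectors file, a full model, or a word2vec-format text/binary file is not specified here, so pick the matching loading call:

    from gensim.models import KeyedVectors

    # Hypothetical file name; replace with the actual downloaded file.
    model = KeyedVectors.load("slovak_lemma_model.kv")
    # For word2vec text/binary format use instead:
    #   model = KeyedVectors.load_word2vec_format("model.vec", binary=False)

    print(model.most_similar("pivo", topn=5))  # "pivo" = Slovak for "beer"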

Changelog

(a more detailed changelog is in the Slovak version of this page)

  • 2022-09-02 Ukrainian models
  • 2022-08-15 models of older Slovak (National Corpus) corpora
  • 2021-09-01 new lemmatized Hungarian model
  • 2021-04-14 new English models (lemma, word, fastText)
  • 2020-06-29 new Czech fastText model
  • 2020-06-20 new French models
  • 2020-05-11 new Czech models (lemma & word)
  • 2020-03-23 new Chinese models
  • 2020-02-04 new Slovak models (5.3·10⁹ tokens)
  • 2019-10-24 direct vector arithmetic (king-he+she)
  • 2019-10-12 Russian Omnia Russica model added
  • 2019-05-24 added fastText model (some languages only)
  • 2019-05-10 added: sk ll model
  • 2019-05-03 Gnuplot visualization
  • 2019-02-28 added: Croatian, Slovene
  • 2018-09 added: Spanish, Estonian
  • 2018-06 added: English, Latvian, French
  • 2018-05-02 Aranea languages added
  • 2018-04-04 filter lemma
  • 2017-06-05 first version