Explore Word Embedding for Semantic Similarity of Words

Throughout this document, we use the term “word” to mean either lemma or word form, depending on the selected model, or a bigram (constituents joined by an underscore).

Basic Usage

select language
enter word
- e.g. beer shows words semantically close to beer; in the German model, Bier turns out to be much more clearly delineated and semantically separated from other beverages and foods
- or a list of words, or an arithmetic expression: king - he + she should point to queen as the semantically nearest word
fixed (two-word) collocations can be queried by joining the words with the underscore character (_), e.g. Attorney_General
select model
select visualization
click Go!

»The query interface is here«

Single Word Query

The expected input is a single word. The result is the usual table of similar words, sorted by their (semantic) similarity, and the usual visualization.

Multiple Word Query

The expected input is a list of words, separated by spaces or plus signs. The output is a list of the words nearest to the normalized sum of the vectors of the input words, i.e. the words semantically similar to all the words in the input. Words preceded by a minus sign are subtracted; the result then moves away in meaning from these words.
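The arithmetic described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code behind the query interface: the three-dimensional vectors are invented, the helper names (parse_query, query_vector, nearest) are hypothetical, the sum is normalized once after adding and subtracting the raw vectors, and the simple tokenizer does not handle words that themselves contain a hyphen.

```python
import math
import re

# Toy 3-dimensional vectors, invented for illustration only.
VECTORS = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "he":    [0.0,  1.0, 0.0],
    "she":   [0.0, -1.0, 0.0],
    "beer":  [0.0,  0.0, 1.0],
}


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]


def parse_query(query):
    """Split 'king - he + she' into signed terms: [(1, 'king'), (-1, 'he'), (1, 'she')].

    Words separated only by spaces count as added. Underscores stay inside
    the word, so bigrams such as Attorney_General remain a single term.
    """
    tokens = re.findall(r"[+-]|[^\s+-]+", query)
    terms, sign = [], 1
    for tok in tokens:
        if tok == "+":
            sign = 1
        elif tok == "-":
            sign = -1
        else:
            terms.append((sign, tok))
            sign = 1
    return terms


def query_vector(query, vectors):
    """Add (or subtract) the vectors of the query words, then normalize the sum."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for sign, word in parse_query(query):
        for i, x in enumerate(vectors[word]):
            total[i] += sign * x
    return normalize(total)


def nearest(query, vectors, topn=3):
    """Rank all words not used in the query by cosine similarity to the query vector."""
    target = query_vector(query, vectors)
    used = {word for _, word in parse_query(query)}
    scored = [
        (word, sum(a * b for a, b in zip(target, normalize(vec))))
        for word, vec in vectors.items()
        if word not in used
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


print(nearest("king - he + she", VECTORS))
```

With these toy vectors, queen ranks first for king - he + she. Real models have hundreds of dimensions, so the nearest word is rarely an exact match.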

Two Word Query

As a special case, when only two words are entered, their semantic distance (a number between 0 and 1) is displayed, in addition to the usual table of similar words and the visualization.
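This page does not say how the displayed distance is computed. A common choice for word vectors is the cosine distance, i.e. one minus the cosine similarity, which stays between 0 and 1 as long as the similarity is not negative. The sketch below assumes that definition; the function names and the three-dimensional vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed distance measure: 0 for identical directions, larger for unrelated words."""
    return 1.0 - cosine_similarity(a, b)


# Invented toy vectors; real models have hundreds of dimensions.
beer = [0.9, 0.1, 0.3]
wine = [0.8, 0.2, 0.4]
tractor = [0.1, 0.9, 0.2]

print(cosine_distance(beer, wine))     # small: the vectors point in similar directions
print(cosine_distance(beer, tractor))  # larger: the vectors point in different directions
```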

Citation

Please cite:

Garabík, Radovan. Word Embedding Based on Large-Scale Web Corpora as a Powerful Lexicographic Tool. Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje 46, no. 2 (2020): 603-618. https://doi.org/10.31724/rihjj.46.2.8

Chinese Models

Special note on Chinese models:

There are three models of simplified Chinese, based on the same source. Because of features specific to the Chinese language and writing system, they do not follow the same pattern as the models for the other languages. Instead, the following models are available:

Model trained on the level of individual words (词). This is the closest to the usual word embedding usage. Latin script elements are uppercased and tokenized as separate words.
Model trained on the level of individual characters (字). Latin script elements are still uppercased and tokenized as separate words.
Model trained on the pinyin representation of words. The pinyin is not merely a transcription of characters applied in the interface; the model itself is built on it. It therefore differs from the other models in the semantic relations of homophones. To facilitate queries, tones are written using the digits 1 to 5 (the neutral tone has the number 5).

Download

Several models are available for download (in Gensim format) here: https://www.juls.savba.sk/data/semä/
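A downloaded model can be queried locally with Gensim. The following is a minimal sketch, assuming the file was saved as Gensim KeyedVectors; if a file is a full Word2Vec or fastText model instead, the corresponding Gensim loader would be needed. The file name is hypothetical, and the function name query_downloaded_model is introduced only for this example.

```python
def query_downloaded_model(path, positive, negative=(), topn=10):
    """Load vectors from `path` and return the words nearest to the query.

    `positive` words are added and `negative` words are subtracted,
    mirroring queries such as king - he + she.
    """
    # Imported here so that the sketch can be read and imported without Gensim installed.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load(path)  # assumes the KeyedVectors save format
    return vectors.most_similar(
        positive=list(positive), negative=list(negative), topn=topn
    )


# Example call (hypothetical file name, not run here):
# query_downloaded_model("model.kv", positive=["king", "she"], negative=["he"])
```

The call returns a list of (word, similarity) pairs, comparable to the table of similar words shown by the query interface.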

Changelog

(a more detailed changelog is in the Slovak version of this page)

2022-09-02 Ukrainian models
2022-08-15 models of older Slovak (National Corpus) corpora
2021-09-01 new lemmatized Hungarian model
2021-04-14 new English models (lemma, word, fastText)
2020-06-29 new Czech fastText model
2020-06-20 new French models
2020-05-11 new Czech models (lemma & word)
2020-03-23 new Chinese models
2020-02-04 new Slovak models (5.3·10⁹ tokens)
2019-10-24 direct vector arithmetic (king-he+she)
2019-10-12 Russian Omnia Russica model added
2019-05-24 added FastText model (some languages only)
2019-05-10 added: sk ll model
2019-05-03 Gnuplot visualization
2019-02-28 added: Croatian, Slovene
2018-09 added: Spanish, Estonian
2018-06 added: English, Latvian, French
2018-05-02 Aranea languages added
2018-04-04 filter lemma
2017-06-05 first version

Ľ. Štúr Institute of Linguistics

Slovak Academy of Sciences