Explore Word Embedding for Semantic Similarity of Words
Throughout this document, we use the term “word” to mean either lemma or word form, depending on the selected model, or a bigram (constituents joined by an underscore).
- select language
- enter word
Fixed two-word collocations can be queried by joining the words with an underscore (_), e.g. ginger_ale.
- select model
- select visualization
Single Word Query
The expected input is a single word. The result is the standard table of similar words, sorted by their (semantic) similarity, together with the standard visualization.
Multiple Word Query
The expected input is a list of words, separated by spaces or plus signs. The output is a list of the words nearest to the normalized sum of the vectors of the input words, i.e. the words semantically similar to all the words in the input. Words preceded by a minus sign are subtracted; the result then lies in the region away in meaning from these words.
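The arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional vectors are invented toy data, the names `VECTORS`, `parse_query` and `nearest` are hypothetical, and excluding the input words from the result is a choice made here, not something this page documents.

```python
import re
import numpy as np

# Toy 3-dimensional vectors, invented for illustration only.
VECTORS = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "he":    np.array([0.1, 0.9, 0.0]),
    "she":   np.array([0.1, 0.0, 0.9]),
    "apple": np.array([-0.8, 0.3, 0.3]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def parse_query(query):
    """Split a query on spaces and plus signs; a leading minus means subtraction.

    Simplification: a hyphen inside a word would also be read as a minus.
    """
    terms = []
    for sign, word in re.findall(r"([+\-]?)\s*([^\s+\-]+)", query):
        terms.append((-1.0 if sign == "-" else 1.0, word))
    return terms

def nearest(query, k=3):
    """Rank vocabulary words by cosine similarity to the normalized signed sum."""
    terms = parse_query(query)
    target = unit(sum(sign * VECTORS[word] for sign, word in terms))
    used = {word for _, word in terms}  # assumption: input words are not listed
    scored = [(w, float(unit(v) @ target)) for w, v in VECTORS.items() if w not in used]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

print(nearest("king -he +she"))
```

With this toy data, "queen" ranks first for `king -he +she`, and the compact form `king-he+she` parses to the same terms.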
Two Word Query
As a special case, when only two words are entered, their semantic distance (a number between 0 and 1) is displayed, in addition to the usual table of similar words and the visualization.
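This page does not state how the distance is computed. One common convention, shown here purely as an assumption, is cosine distance (1 minus cosine similarity), clipped to the range 0 to 1. The function name `cosine_distance` and the sample vectors are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cos(a, b), clipped to [0, 1].

    Assumption: the tool may use a different formula; this is one common choice.
    """
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return min(1.0, max(0.0, 1.0 - cos))

# Invented example vectors.
same = np.array([1.0, 2.0, 3.0])
orthogonal_a = np.array([1.0, 0.0])
orthogonal_b = np.array([0.0, 1.0])

print(cosine_distance(same, same))                  # identical direction
print(cosine_distance(orthogonal_a, orthogonal_b))  # unrelated direction
```

Under this convention, vectors pointing the same way have a distance near 0 and orthogonal vectors a distance of 1.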
Radovan Garabík: Word Embedding Based on Large-Scale Web Corpora as a Powerful Lexicographic Tool. In: Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje, 2020, 46(2). In print.
Special note on Chinese models:
There are three models of simplified Chinese, all based on the same source. These models do not follow the same pattern as the other languages, because of features specific to the Chinese language and writing system. Instead, the following models are available:
- Model trained on the level of individual words (词). This is the closest to the usual word embedding usage. Latin script elements are uppercased and tokenized as separate words.
- Model trained on the level of individual characters (字). Latin script elements are still uppercased and tokenized as separate words.
- Model trained on the pinyin representation of words. The pinyin is not merely a transcription of characters shown in the interface; the model itself is built on pinyin. As a result, it differs from the other models in the semantic relations of homophones. To facilitate queries, tones are written using the digits 1 to 5 (the neutral tone has the number 5).
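For readers who have pinyin with tone marks, the digit notation described for the pinyin model can be produced with a small helper. This is a sketch under assumptions: the function name `syllable_to_digits` is hypothetical, it handles one syllable at a time, and it keeps ü as ü, whereas the exact spelling and spacing the query field expects are not specified on this page.

```python
import unicodedata

# Combining diacritics used for the four marked tones.
TONE_MARKS = {
    "\u0304": "1",  # macron
    "\u0301": "2",  # acute
    "\u030c": "3",  # caron
    "\u0300": "4",  # grave
}

def syllable_to_digits(syllable):
    """Convert one tone-marked pinyin syllable to letters plus a tone digit.

    A syllable without a tone mark is treated as neutral tone (5).
    """
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = "5"
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # keeps the diaeresis of ü
    return unicodedata.normalize("NFC", "".join(kept)) + tone

for s in ["m\u0101", "m\u00e1", "m\u01ce", "m\u00e0", "ma"]:
    print(syllable_to_digits(s))
```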
(more detailed changelog is in the Slovak version of this page)
- 2020-05-11 new Czech models (lemma & word)
- 2020-03-23 new Chinese models
- 2020-02-04 new Slovak models (5.3·10⁹ tokens)
- 2019-10-24 direct vector arithmetic (king-he+she)
- 2019-10-12 Russian Omnia Russica model added
- 2019-05-24 added FastText model (some languages only)
- 2019-05-10 added: sk ll model
- 2019-05-03 Gnuplot visualization
- 2019-02-28 added: Croatian, Slovene
- 2018-09 added: Spanish, Estonian
- 2018-06 added: English, Latvian, French
- 2018-05-02 Aranea languages added
- 2018-04-04 filter lemma
- 2017-06-05 first version