King – man + woman = queen:
what analogies can and cannot do in word embeddings.

A summary of a paper to appear in the Proceedings of NAACL-HLT 2016 (SRW): Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t.

Download the paper, extra stats and data:
Download BATS dataset

Analogies have become a popular benchmark for word embeddings since the striking demonstration of “linguistic regularities” (Mikolov, Yih, & Zweig, 2013). This term refers to any linguistic relation that holds between several pairs of words. It can be semantic, as in the male/female relation (brother:sister, husband:wife, poet:poetess), or morphological (e.g. infinitive and gerund of verbs: sing:singing, write:writing, go:going).

The original paper showed that such linguistic patterns can be detected in word embeddings using the offset between word vectors. Consider an analogy problem with two pairs of words a:b :: c:?d. Mikolov et al. showed that the answer d can be found by computing the hypothetical vector b – a + c; the word vector closest to this hypothetical vector should be the correct answer d. A frequently cited example is king – man + woman = queen.
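As a rough sketch of how the vector offset method works in practice (the embeddings and names here are toy illustrations, not the paper's code — the regularity is planted by hand so the example is self-contained):

```python
import numpy as np

# Toy embeddings: random vectors, with the "linguistic regularity"
# planted by hand so that the example is self-contained.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["king", "man", "woman", "apple"]}
emb["queen"] = emb["king"] - emb["man"] + emb["woman"]

def solve_analogy(a, b, c, emb):
    """For a:b :: c:?, return the word closest (by cosine similarity)
    to the hypothetical vector b - a + c, excluding the question words."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -2.0
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = float(np.dot(target, v / np.linalg.norm(v)))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(solve_analogy("man", "king", "woman", emb))  # prints: queen
```

In a real model the candidate search runs over the whole vocabulary, usually as a single matrix-vector product rather than a Python loop.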

In (Mikolov, Chen, Corrado, & Dean, 2013) the authors developed an analogy test set comprising 14 linguistic categories: 9 morphological and 5 semantic, with 20–70 unique word pairs per category, which are combined in all possible ways to yield 8,869 semantic and 10,675 syntactic questions. This set became a popular benchmark for word embeddings. The current state of the art on this test is summed up below; the best result is currently achieved by the GloVe model (Pennington, Socher, & Manning, 2014).


[Table: accuracy on semantic vs. syntactic categories, as reported by (Mikolov, Chen, et al., 2013), (Pennington et al., 2014), (Cui, Gao, Bian, Qiu, & Liu, 2014), (Garten, Sagae, Ustun, & Dehghani, 2015), (Zhou, Sun, Liu, & Lau, 2015), and (Lebret & Collobert, 2015); the scores themselves are not reproduced here.]

However, what these stats do not show is that there is in fact a lot of variation in how successful a model is on different relations. Consider the performance of GloVe and an SVD-based model, both trained on a 5B-word web corpus:

[Figure: per-category accuracy of GloVe and the SVD-based model.]

Also, while this test shows that analogical reasoning with word embeddings is possible, it does not provide a good estimate of the extent to which it is possible. It contains only 14 categories (or even 13, since “common countries and capitals” and “countries and capitals of the world” differ only in word frequencies). It is not balanced: there are different numbers of categories for different relation types, the categories contain different numbers of source pairs, and the two categories exploring the same capital:country relation together constitute 56.72% of all semantic questions.

Besides, all of the categories in this test set are “binary” in the sense that there is only one correct answer (e.g. each country has only one capital). But most semantic relations are not binary. For example, the young of a lion is a cub, but the young of a dog can be a puppy, a pup, or a whelp. It has already been shown that lexicographic semantic relations such as synonymy and hypernymy are not reliably discovered in the same way (Köper, Scheible, & im Walde, 2015).

To address these issues we developed BATS – the Bigger Analogy Test Set. It includes 4 types of linguistic relations: inflectional and derivational morphology, and lexicographic and encyclopedic semantics. Each type is represented by 10 categories, and each category contains 50 unique word pairs. In the vector offset paradigm each category yields 2,480 questions, or 99,200 questions in the whole set. Our sources and word selection procedures are described in the paper (Gladkova, Drozd, & Matsuoka, 2016).

The Bigger Analogy Test Set – structure and examples

Inflectional morphology

Nouns:
 I01: regular plurals (student:students)
 I02: plurals – orthographic changes (wife:wives)
Adjectives:
 I03: comparative degree (strong:stronger)
 I04: superlative degree (strong:strongest)
Verbs:
 I05: infinitive: 3Ps.Sg (follow:follows)
 I06: infinitive: participle (follow:following)
 I07: infinitive: past (follow:followed)
 I08: participle: 3Ps.Sg (following:follows)
 I09: participle: past (following:followed)
 I10: 3Ps.Sg: past (follows:followed)

Derivational morphology

No stem change:
 D01: noun+less (life:lifeless)
 D02: un+adj. (able:unable)
 D03: adj.+ly (usual:usually)
 D04: over+adj./Ved (used:overused)
 D05: adj.+ness (same:sameness)
 D06: re+verb (create:recreate)
 D07: verb+able (allow:allowable)
Stem change:
 D08: verb+er (provide:provider)
 D09: verb+ation (continue:continuation)
 D10: verb+ment (argue:argument)

Lexicographic semantics

Hypernyms:
 L01: animals (cat:feline)
 L02: miscellaneous (plum:fruit, shirt:clothes)
Hyponyms:
 L03: miscellaneous (bag:pouch, color:white)
Meronyms:
 L04: substance (sea:water)
 L05: member (player:team)
 L06: part-whole (car:engine)
Synonyms:
 L07: intensity (cry:scream)
 L08: exact (sofa:couch)
Antonyms:
 L09: gradable (clean:dirty)
 L10: binary (up:down)

Encyclopedic semantics

Geography:
 E01: capitals (Athens:Greece)
 E02: country:language (Bolivia:Spanish)
 E03: UK city:county (York:Yorkshire)
People:
 E04: nationalities (Lincoln:American)
 E05: occupation (Lincoln:president)
Animals:
 E06: the young (cat:kitten)
 E07: sounds (dog:bark)
 E08: shelter (fox:den)
Other:
 E09: thing:color (blood:red)
 E10: male:female (actor:actress)
BATS has two important features not present in previous tests. First, its morphological categories are sampled to reduce homonymy. For example, for verb present tense the Google set includes pairs like walk:walks, which could be both verbs and nouns. It is impossible to eliminate homonymy completely, as a big corpus will have some creative uses for almost any word, but we reduce it by excluding words attributed to more than one part of speech in WordNet.
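The filtering step could be sketched as follows; here a tiny hand-made POS lexicon stands in for WordNet (with NLTK, the same mapping could be built from the POS tags of wordnet.synsets(word)):

```python
# Toy stand-in for WordNet: word -> set of parts of speech.
POS_LEXICON = {
    "walk":  {"noun", "verb"},
    "walks": {"noun", "verb"},
    "sing":  {"verb"},
    "sings": {"verb"},
}

def unambiguous_pairs(pairs, lexicon):
    """Drop any pair in which either word is attributed to more than
    one part of speech (or is missing from the lexicon)."""
    return [(a, b) for a, b in pairs
            if len(lexicon.get(a, set())) == 1
            and len(lexicon.get(b, set())) == 1]

pairs = [("walk", "walks"), ("sing", "sings")]
print(unambiguous_pairs(pairs, POS_LEXICON))  # [('sing', 'sings')]
```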

The semantic part of BATS does include homonyms, since semantic categories are overall smaller than morphological ones, and it is the more frequently used words that tend to have multiple functions. For example, both dog and cat are also listed in WordNet as verbs, while aardvark is not; a homonym-free list of animals would mostly contain low-frequency words, which in itself decreases performance. However, we did our best to avoid clearly ambiguous words; e.g. the prophet Muhammad was not included in category E05, because many other people have the same name.

Secondly, where applicable, BATS contains several acceptable answers, sourced from WordNet. For example, both mammal and canine are hypernyms of dog. In some cases alternative spellings are also listed (e.g. organize: reorganize/reorganise).

We release BATS together with a Python testing script that implements two methods of solving analogies: the original vector offset method proposed by Mikolov et al., and the alternative method proposed by (Levy, Goldberg, & Ramat-Gan, 2014).
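Assuming the script's "PairDistance" refers to the pairwise direction measure of Levy and Goldberg — scoring a candidate d by the cosine between the offsets b − a and d − c — a minimal sketch might look like this (toy vectors and names, not the actual script):

```python
import numpy as np

# Toy embeddings with the regularity planted by hand (illustrative only).
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["king", "man", "woman", "apple"]}
emb["queen"] = emb["king"] - emb["man"] + emb["woman"]

def pair_distance(a, b, c, emb):
    """Pick the d whose offset d - c points in the same direction
    as b - a (cosine similarity of the two offsets)."""
    offset = emb[b] - emb[a]
    offset = offset / np.linalg.norm(offset)
    best, best_sim = None, -2.0
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        cand = v - emb[c]
        norm = np.linalg.norm(cand)
        if norm == 0:
            continue
        sim = float(np.dot(offset, cand / norm))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(pair_distance("man", "king", "woman", emb))  # prints: queen
```

Unlike the offset method, this measure compares directions of pair differences rather than absolute positions, which makes it insensitive to where the pairs sit in the space.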

The script can also be used with any other word categories. It accepts text files with two words separated by tab or space, as in:

descartes         french

dickens            english/british

dostoyevsky     russian

The script lowercases all words in the source list. Alternative answers for the second word can be separated with /. The script generates all possible combinations of the source pairs (if there are several acceptable answers, the first one is used for generating the questions).
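A minimal reader for this format might look like the following (a sketch under the conventions described above, not the actual script; the function names are made up):

```python
from itertools import permutations

def load_pairs(lines):
    """Two whitespace-separated words per line; alternatives for the
    second word joined by '/'. Everything is lowercased."""
    pairs = []
    for line in lines:
        line = line.strip().lower()
        if not line:
            continue
        word, answers = line.split(None, 1)
        pairs.append((word, answers.strip().split("/")))
    return pairs

def make_questions(pairs):
    """All ordered combinations of two distinct source pairs; the first
    acceptable answer stands in as the second word of the question."""
    return [(a, bs[0], c, ds) for (a, bs), (c, ds) in permutations(pairs, 2)]

sample = ["descartes\tfrench", "dickens\tenglish/british", "dostoyevsky russian"]
pairs = load_pairs(sample)
print(pairs[1])                    # ('dickens', ['english', 'british'])
print(len(make_questions(pairs)))  # 3 pairs -> 6 questions
```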

The script relies on the VSMlib library developed by the authors; please refer to the project page for installation details. Inside the script, the params["name_method"] variable can be set to "3CosAdd" or "PairDistance" to select the method, and the "dirs" list contains absolute paths to the embeddings. The SVD-based format of vsmlib, GloVe, and w2v models are supported. The params["name_dataset"] and params["dir_root_dataset"] variables specify the location of the test set.

The script outputs the overall accuracy, as well as a separate file for each category with the scores for individual questions. These files have the following tab-separated fields:

word_a   word_b  word_c   word_d   model_answer   YES/NO

In case there are several acceptable options for the words b and/or d, they are listed as Python lists. For example:

cloud   ['white', 'gray', 'grey'] ant       ['black', 'brown', 'red']            black            YES

cloud   ['white', 'gray', 'grey'] apple   ['red', 'orange', 'yellow', 'golden']            black   NO
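Assuming the fields are tab-separated as described, one way to read such a line back into Python values (a sketch, not part of the released script):

```python
import ast

def parse_result_line(line):
    """Parse one tab-separated result line; fields that hold several
    acceptable words appear as Python list literals."""
    a, b, c, d, answer, verdict = [f.strip() for f in line.split("\t")]
    as_words = lambda f: ast.literal_eval(f) if f.startswith("[") else [f]
    return a, as_words(b), c, as_words(d), answer, verdict == "YES"

line = "cloud\t['white', 'gray', 'grey']\tant\t['black', 'brown', 'red']\tblack\tYES"
print(parse_result_line(line))
# ('cloud', ['white', 'gray', 'grey'], 'ant', ['black', 'brown', 'red'], 'black', True)
```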

BATS is considerably more difficult than the test set developed by Mikolov et al.: our best-performing SVD-based model and GloVe achieved only 22.1% and 28.5% average accuracy, respectively. Lexicographic and, unexpectedly, derivational categories appear to be particularly challenging.

We also investigated how varying the dimensionality and window size of the SVD-based model affects performance on individual categories. Although the popular belief is that semantic categories prefer larger windows, we found that, at least for our model, there is no such correlation: performance on all categories was best at window sizes 2–4. The scores for individual categories at window sizes 2–8 and 1,000 dimensions can be found in the paper.

As for dimensionality, for approximately half of the categories our SVD-based model performed best at around 1,200 dimensions, but many other categories reached their peak and started declining between 200 and 1,100 dimensions. Here, too, there was no correlation between relation type and preferred vector size. This suggests that model parameters should be optimized for the particular relations targeted in a given task, rather than for relation types. Full data on dimensionality can be found in BATS_dimensionality_effect_per_category.pdf.

Some of the low scores could be attributed to low word frequencies in our corpus, but there are also categories with similar frequency distributions that do not yield equal accuracy (see BATS_frequency_distributions_per_category.pdf).

We hope that this study will draw the attention of the NLP community to the need to further improve word embeddings and analogical reasoning methods for derivational and lexicographic relations.



Cui, Q., Gao, B., Bian, J., Qiu, S., & Liu, T.-Y. (2014). Learning effective word embedding using morphological word similarity. arXiv preprint arXiv:1407.1687.

Garten, J., Sagae, K., Ustun, V., & Dehghani, M. (2015). Combining distributed vector representations for words. In Proceedings of NAACL-HLT (pp. 95–101).

Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. Proceedings of NAACL-HLT 2016 (SRW).

Köper, M., Scheible, C., & im Walde, S. S. (2015). Multilingual reliability and ‘semantic’ structure of continuous word spaces. IWCS 2015, 40.

Lebret, R., & Collobert, R. (2015). Rehabilitation of count-based models for word vector representations. In Computational Linguistics and Intelligent Text Processing (pp. 417–429). Springer.

Levy, O., Goldberg, Y., & Ramat-Gan, I. (2014). Linguistic regularities in sparse and explicit word representations. CoNLL-2014, 171–180.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In HLT-NAACL (pp. 746–751).

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2014) (pp. 1532–1543).

Zhou, C., Sun, C., Liu, Z., & Lau, F. (2015). Category enhanced word embedding. arXiv preprint arXiv:1511.08629.


