Tools

Co-occurrence extraction kernel for explicit word embeddings

One of the big open questions in NLP concerns the relative (de)merits of word embeddings based on co-occurrence counts versus those based on neural nets. While the latter are currently more fashionable, it is not clear to what extent the two are conceptually different. O. Levy and Y. Goldberg have in fact shown that the famous skip-gram model of word2vec is implicitly performing matrix factorization of a shifted PMI matrix, much like the classic SVD technique.
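For reference, here is a minimal sketch of the count-based side of this equivalence: build a (shifted, positive) PMI matrix from co-occurrence counts and factorize it with truncated SVD. This is illustrative numpy code under my own naming, not part of any of the tools below.

import numpy as np

def ppmi_svd_embeddings(counts, dim=100, shift=1.0):
    # counts: dense co-occurrence matrix, counts[i, j] = #(word i, context j)
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total  # word marginals
    p_c = counts.sum(axis=0, keepdims=True) / total  # context marginals
    with np.errstate(divide="ignore"):
        pmi = np.log((counts / total) / (p_w * p_c)) - np.log(shift)
    pmi[np.isneginf(pmi)] = 0.0        # zero counts contribute nothing
    ppmi = np.maximum(pmi, 0.0)        # keep only positive associations
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])  # split singular values symmetrically

With shift equal to the number of negative samples k, the factorized matrix is the shifted PPMI matrix from Levy and Goldberg's analysis; with shift = 1 it reduces to plain PPMI.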

However, word2vec remains the preferred tool for constructing word embeddings, and one of the reasons is that building SVD-based embeddings is more computationally demanding. To help with this, I developed an efficient co-occurrence extraction kernel that can process corpora on the scale of billions of words on a single compute node (a distributed mode is also supported) within a timeframe of several hours.

The kernel can be found at https://github.com/undertherain/nlp_cooc. It is licensed under the Apache License, Version 2.0. GCC >= 5.2 is required for compilation.
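The computation itself is simple to state, even if making it fast at that scale is not: slide a window over the corpus and count word/context pairs. A naive Python sketch of what the kernel computes (the kernel itself is optimized C++):

from collections import Counter

def cooccurrence_counts(tokens, window=2):
    # tokens: the corpus as a list of words
    # returns: Counter mapping (word, context) pairs to counts
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts("the cat sat on the mat".split())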

VSMlib

VSMlib is a work-in-progress Python library for working with word embeddings that I developed and use in my own work. It can be found at https://github.com/undertherain/vsmlib. The tutorials section of this site shows some examples of its use. VSMlib supports dense vectors in binary and plain-text formats, as well as sparse vectors stored in HDF5 format. Currently VSMlib provides methods for retrieving the nearest neighbors of word vectors, along with several visualization modules.
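A typical session might look roughly like this; note that this is a hypothetical sketch, and the module and function names are my illustrative assumptions rather than guaranteed API — see the repository and tutorials for the actual interface.

import vsmlib

# load embeddings from a directory; the storage format is detected automatically
model = vsmlib.model.load_from_dir("/path/to/embeddings")

# retrieve the ten nearest neighbors of a word vector
print(model.get_most_similar_words("cat", cnt=10))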

Word analogy toolkit

One of the most famous success stories of word embeddings is solving word analogies. Mikolov et al. (2013) showed that a simple linear offset between pairs of word vectors can sometimes capture a linguistic relation. The famous example is king - man + woman = queen: subtracting the man vector from the king vector and adding the woman vector yields a hypothetical point in the space whose nearest neighbor is the queen vector. However, this works well for only a few linguistic relations.
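In vector terms this is the 3CosAdd method: normalize the vectors, form the offset vector b - a + a*, and return its nearest neighbor by cosine similarity, excluding the three query words. A minimal numpy sketch (the embedding storage and names here are illustrative):

import numpy as np

def analogy_3cosadd(emb, a, a_star, b):
    # emb: dict mapping word -> vector; solves a : a_star = b : ?
    unit = {w: v / np.linalg.norm(v) for w, v in emb.items()}
    target = unit[b] - unit[a] + unit[a_star]
    candidates = (w for w in unit if w not in (a, a_star, b))
    return max(candidates, key=lambda w: unit[w] @ target)

# analogy_3cosadd(emb, "man", "woman", "king")  # ideally returns "queen"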

As described in this paper, the toolkit provides three methods for detecting analogies that are based on sets of word pairs rather than on individual pairs. Additionally, it implements Mikolov's and Levy and Goldberg's methods. The script is included in the vsmlib repository and requires vsmlib to run. It takes one argument – the path to a configuration file in YAML format. A sample configuration file is provided. A big dataset for word analogies can be found at http://vsm.blackbird.pw/bats.
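The configuration might look roughly like the hypothetical sketch below; the key names are my illustrative assumptions, so consult the provided sample file for the real ones.

# analogy_config.yaml (hypothetical example)
path_vectors: /data/embeddings/wiki.word2vec.bin
path_dataset: /data/BATS_3.0
path_results: ./out
method: 3CosAdd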

Supported embeddings include sparse or SVD-transformed co-occurrence vectors obtained with the kernel described above, word2vec in binary format, and GloVe. The embedding type can be specified in the config file; otherwise vsmlib will try to deduce it automatically. The script outputs results in the following format:

Q:	boy 
Exp A:	girl
	A:sc_ttl: 1.01	girl	YES
	A:sc_ttl: 0.89	woman	NO
"Q" indicates "source" word or question, "Expt A" indicates expected answer and following N lines starting with tabulations are the "guesses" produced by selected method. Results are stored in specified folder ("./out" by default) with subfolders created for each method and embeddings type.
