Warm-up example

Let's find the words which are most related to a given one.

All the needed functionality is already implemented in vsmlib,

but for now we'll do everything from scratch to get a better grasp of what is going on.

Let's import some necessary modules:

In [1]:
import os
import numpy as np
import scipy.spatial.distance
import heapq

This is the directory where our files are:

In [22]:
#dir_root="/mnt/work/nlp_scratch/SVD/Eng/explicit_BNC_w2_m10_svd_500_C0.6/"
dir_root="/storage/scratch/SVD/BNC/explicit_BNC_w2_m10_svd_500_C0.6/"

Let's look at what's there:

In [7]:
os.listdir(dir_root)
Out[7]:
['provenance.txt', 'vectors.npy', 'ids']

The file 'ids' contains word ids: each id corresponds to a row index in the co-occurrence matrix, or in the corresponding matrix after dimensionality reduction.

It looks like this:

In [8]:
with open(os.path.join(dir_root,"ids")) as f:
    for i in range(5):
        print(next(f).strip())
one	0
said	1
time	2
two	3
like	4

Now let's load them into a couple of dictionaries so we can translate words to their ids and back:

In [9]:
dic_word2id={}
dic_id2word={}
with open(os.path.join(dir_root,"ids")) as f:
    for line in f:
        tokens = line.split()                    # each line: word <tab> id
        dic_word2id[tokens[0]] = int(tokens[1])
        dic_id2word[int(tokens[1])] = tokens[0]
In [23]:
print(dic_word2id["apple"])
print(dic_id2word[3263])
3263
apple

Now let's load the vectors.

Normally we use the HDF5 format to be able to process the data out-of-core or in a distributed fashion,

but for smaller corpora like the BNC there is also a version in NumPy's native format, to keep the example simple.
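For illustration only, reading such an HDF5 file with h5py might look roughly like the sketch below; the file name "vectors.h5" and the dataset name "vectors" are assumptions made up for this sketch, not files that ship with this example:

import h5py

# hypothetical: open an HDF5 file and read the whole "vectors" dataset into memory
with h5py.File(os.path.join(dir_root, "vectors.h5"), "r") as f:
    vectors = f["vectors"][:]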

In [11]:
vectors = np.load(os.path.join(dir_root,"vectors.npy"))
vectors
Out[11]:
array([[  8.46300840e-01,   1.84261882e+00,   6.48741782e-01, ...,
          3.02480268e+00,   1.39322674e+00,  -3.09761643e+00],
       [  9.31759179e-01,  -3.74915302e-01,   3.43395686e+00, ...,
          4.82231236e+00,   5.65342665e+00,  -2.43924093e+00],
       [  6.93321452e-02,   4.03676361e-01,  -7.73058891e-01, ...,
          2.63729239e+00,   1.09847951e+00,  -2.46159387e+00],
       ..., 
       [ -1.25323851e-02,   4.50774394e-02,   1.40409451e-02, ...,
         -7.87336975e-02,  -4.89492789e-02,  -5.31415008e-02],
       [ -4.02985606e-03,  -3.51128392e-02,   3.81313115e-02, ...,
         -3.73153761e-02,   7.75657175e-03,  -3.17390561e-02],
       [  3.60635035e-02,  -7.11380551e-03,  -4.37753722e-02, ...,
         -4.29305099e-02,   9.67629161e-03,  -5.13765104e-02]], dtype=float32)

Now let's compare two word vectors with the cosine metric:

In [12]:
row1=vectors[dic_word2id["apple"]]
row2=vectors[dic_word2id["banana"]]
1-scipy.spatial.distance.cosine(row1,row2)
Out[12]:
0.58657183882050712

scipy.spatial.distance.cosine returns the cosine distance 1 - cos(a), so we subtract it from 1 to turn it into a similarity score: the higher, the better.
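For reference, here is a minimal NumPy sketch of the same quantity (the dot product divided by the product of the norms); it should match 1 - scipy.spatial.distance.cosine up to floating point error:

def cos_sim(a, b):
    # cosine of the angle between the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cos_sim(row1, row2)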

For the sake of comparison, let's check some unrelated word:

In [13]:
row1=vectors[dic_word2id["apple"]]
row2=vectors[dic_word2id["hydrogen"]]
1-scipy.spatial.distance.cosine(row1,row2)
Out[13]:
0.01345982147805358

The score is pretty low.

Now the final piece: let's score every word in the vocabulary and take the top 10 with the highest similarity.

In [21]:
row1 = vectors[dic_word2id["apple"]]

def iter_vectors():
    # yield a (similarity score, word id) pair for every row of the matrix
    for i in range(vectors.shape[0]):
        row2 = vectors[i]
        score = 1 - scipy.spatial.distance.cosine(row1, row2)
        yield (score, i)

[(score, dic_id2word[word]) for score, word in heapq.nlargest(10, iter_vectors())]
Out[21]:
[(1.0000000999898755, 'apple'),
 (0.61400752577032369, 'fruit'),
 (0.58657183882050712, 'banana'),
 (0.5850951585421692, 'plum'),
 (0.58464719369713347, 'apples'),
 (0.58429190787471474, 'cherry'),
 (0.57961781805129187, 'pear'),
 (0.56222494060762729, 'raspberry'),
 (0.55915105347501826, 'cherries'),
 (0.55673065270094269, 'apricot')]
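
Looping over the vocabulary row by row is easy to follow but slow. As a rough sketch of a faster alternative (not necessarily what the example above or vsmlib does internally), the same top-10 list can be obtained by length-normalizing the matrix once and computing all the similarities with a single matrix-vector product:

# normalize every row to unit length, so cosine similarity reduces to a dot product
norms = np.linalg.norm(vectors, axis=1)
norms[norms == 0] = 1                        # guard against all-zero rows
normalized = vectors / norms[:, np.newaxis]

query = normalized[dic_word2id["apple"]]
scores = normalized @ query                  # similarity against every word at once
top = np.argsort(scores)[::-1][:10]          # indices of the 10 highest scores
[(float(scores[i]), dic_id2word[i]) for i in top]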

Now let's do the same with vsmlib

In [17]:
import vsmlib
m=vsmlib.model.load_from_dir(dir_root)
this is dense 
In [18]:
#m.clip_negatives()
#m.normalize()
print(m.name)
print(m.provenance)
explicit_BNC_w2_m10_svd_500_C0.6
source corpus : /work/alex/data/corpora/raw_texts/BNC/
words in corpus : 49608567
unique words : 342203
minimal frequency: 10
unique words : 89857
windows size : 2
frequency weightening : PMI
applied scipy.linal.svd, 500 singular vectors, sigma in the power of 0.6
In [19]:
m.get_most_similar_words("apple")
Out[19]:
[['apple', 1.0000000999898755],
 ['fruit', 0.61400752577032369],
 ['banana', 0.58657183882050712],
 ['plum', 0.5850951585421692],
 ['apples', 0.58464719369713347],
 ['cherry', 0.58429190787471474],
 ['pear', 0.57961781805129187],
 ['raspberry', 0.56222494060762729],
 ['cherries', 0.55915105347501826],
 ['apricot', 0.55673065270094269]]