Classifying words

Let's discover classes of words.

For now we'll do everything from scratch, to get a better grasp of the underlying steps.

Let's import some necessary modules:

In [1]:
import os
import numpy as np

This is the directory where our vectors are stored:

In [2]:
dir_root="/mnt/work/nlp_scratch/SVD/Eng/explicit_BNC_w2_m10_svd_500_C0.6/"
#dir_root="/storage/scratch/SVD/BNC/explicit_BNC_w2_m10_svd_500_C0.6/"
In [94]:
# dictionaries to translate words to ids and back
dic_word2id={}
dic_id2word={}
with open(os.path.join(dir_root,"ids")) as f:
    for line in f:
        tokens=line.split()
        dic_word2id[tokens[0]] = int(tokens[1])
        dic_id2word[int(tokens[1])] = tokens[0]
In [95]:
print(dic_word2id["apple"])
print(dic_id2word[3263])
3263
apple
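
Looking up a word that isn't in the vocabulary raises a KeyError; a minimal sketch of a safer lookup (the helper name is my own, not part of the pipeline):

In [ ]:
def word_to_id(word, default=None):
    # return the word's id, or `default` for out-of-vocabulary words
    return dic_word2id.get(word, default)

print(word_to_id("apple"))       # 3263
print(word_to_id("qwertyuiop"))  # None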

Let's load the vectors.

Normally we use the HDF5 format so that we can process the data out-of-core or in a distributed fashion.

But for smaller corpora like the BNC there's also a version in NumPy's native format, used here to simplify the example.

In [7]:
vectors = np.load(os.path.join(dir_root,"vectors.npy"))
print ("{} word vectors loaded".format(vectors.shape[0]))
89857 word vectors loaded
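
For reference, reading the HDF5 version mentioned above might look roughly like this; the file name and dataset name are assumptions, not the actual layout:

In [ ]:
import h5py
# hypothetical file and dataset names -- adjust to the actual HDF5 layout
with h5py.File(os.path.join(dir_root, "vectors.h5"), "r") as f:
    vectors_h5 = f["vectors"][:]  # a slice like f["vectors"][:1000] reads only part of the matrix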

This is our training set:

We converted all words to lowercase.

In [56]:
countries = ["australia","ireland","poland","ukraine","algeria","india","japan","turkey","germany","brazil"]
fruits = ["apple","banana","orange","pineapple","pear","peach","kiwi","avocado","apricot","papaya","pumpkin","mango"]

Unsupervised clustering

In [57]:
feature_vectors=[]
labels=[]
classes=[]
for i in countries:
    feature_vectors.append(vectors[dic_word2id[i]])
    labels.append(i)
    classes.append("country")
for i in fruits:
    feature_vectors.append(vectors[dic_word2id[i]])
    labels.append(i)
    classes.append("fruit")
feature_vectors=np.asarray(feature_vectors)

Let's try further dimensionality reduction with PCA so we can plot our data in 2D.

In [96]:
import sklearn.decomposition
pca = sklearn.decomposition.PCA(n_components=2)
p = pca.fit_transform(feature_vectors)
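Two components are a drastic reduction from 500 dimensions, so it's worth checking how much variance they retain; explained_variance_ratio_ is a standard attribute of a fitted PCA:

In [ ]:
# fraction of the variance captured by each of the two components
print(pca.explained_variance_ratio_)
print("total: {:.2f}".format(pca.explained_variance_ratio_.sum()))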
In [97]:
import matplotlib as mpl
from matplotlib import pyplot as plt
mpl.style.use('bmh')
%matplotlib inline
mpl.rcParams['figure.figsize'] = (16.0, 10.0)
ax = plt.subplot(111, facecolor='white')  # 'axisbg' was renamed to 'facecolor' in newer matplotlib
dic_colors={"country":"red","fruit":"blue"}
for i in range(len(labels)):
    plt.plot(p[i,0],p[i,1],marker="o",color=dic_colors[classes[i]])
    plt.text(p[i,0]-0.2,p[i,1]-0.25,labels[i])
#plt.legend()

The clusters of fruits and countries look clearly separable.
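
The plot is only a visual check; to actually cluster without labels, here is a minimal k-means sketch (k=2 on the full 500-dimensional vectors; the labels are used only to display the result):

In [ ]:
import sklearn.cluster
km = sklearn.cluster.KMeans(n_clusters=2, random_state=0)
cluster_ids = km.fit_predict(feature_vectors)
for cluster in (0, 1):
    print(cluster, [w for w, c in zip(labels, cluster_ids) if c == cluster])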

Supervised classification

Let's do supervised classification with logistic regression.

In [71]:
import sklearn.linear_model
import sklearn.cross_validation  # note: moved to sklearn.model_selection in newer scikit-learn
model_lr = sklearn.linear_model.LogisticRegression()
In [79]:
scores = sklearn.cross_validation.cross_val_score(model_lr, feature_vectors, classes, scoring='accuracy', cv=10)
print ("average accuracy = {}".format(scores.mean()))
average accuracy = 1.0
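
With only 22 samples, a perfect score is plausible but deserves a sanity check: shuffling the labels should drop the accuracy to roughly chance level.

In [ ]:
# sanity check: random labels should give roughly chance-level accuracy
shuffled = np.random.permutation(classes)
scores_shuffled = sklearn.cross_validation.cross_val_score(
    model_lr, feature_vectors, shuffled, scoring='accuracy', cv=10)
print("accuracy with shuffled labels = {}".format(scores_shuffled.mean()))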

Now let's use logistic regression to classify new words.

For example, I'm not quite sure what a "lemon" is; it sounds like some country in the Middle East.

Let's now train our regression model on all available samples:

In [81]:
model_lr.fit(feature_vectors,classes)
Out[81]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)
In [93]:
word = "lemon"
print (model_lr.predict(vectors[dic_word2id[word]])[0])
print ("predicted probabilities: ",model_lr.predict_proba(vectors[dic_word2id[word]]))
fruit
predicted probabilities:  [[ 0.00438447  0.99561553]]
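
The same model can score any word in the vocabulary; a quick sketch over a few more words (these particular words are my picks and are assumed to occur in the BNC ids file):

In [ ]:
# hypothetical test words -- each must be present in dic_word2id
for word in ["cherry", "grape", "canada", "egypt"]:
    v = vectors[dic_word2id[word]].reshape(1, -1)
    print(word, model_lr.predict(v)[0], model_lr.predict_proba(v)[0])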