Let's discover classes in words

For now we'll try to do everything from scratch, to get a better grasp of what's going on.

Let's import some necessary modules:

In [1]:

```
import os
import numpy as np
```

This is the directory where our vectors are stored:

In [2]:

```
dir_root="/mnt/work/nlp_scratch/SVD/Eng/explicit_BNC_w2_m10_svd_500_C0.6/"
#dir_root="/storage/scratch/SVD/BNC/explicit_BNC_w2_m10_svd_500_C0.6/"
```

In [94]:

```
# dictionaries to translate words to ids and back
dic_word2id = {}
dic_id2word = {}
with open(os.path.join(dir_root, "ids")) as f:
    for line in f:
        tokens = line.split()
        dic_word2id[tokens[0]] = int(tokens[1])
        dic_id2word[int(tokens[1])] = tokens[0]
```

In [95]:

```
print(dic_word2id["apple"])
print(dic_id2word[3263])
```
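The `ids` file is assumed to hold one `word id` pair per line; a quick sketch of the round-trip invariant the two dictionaries should satisfy (toy sample lines, not the real file):

```python
# Hypothetical sample of the "ids" file format: one "word id" pair per line.
sample = "apple 3263\nbanana 1042\n"

dic_word2id = {}
dic_id2word = {}
for line in sample.splitlines():
    tokens = line.split()
    dic_word2id[tokens[0]] = int(tokens[1])
    dic_id2word[int(tokens[1])] = tokens[0]

# Round-trip invariant: word -> id -> word
for word, idx in dic_word2id.items():
    assert dic_id2word[idx] == word

print(dic_word2id["apple"])  # 3263
```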

Let's load the vectors.

Normally we use the HDF5 format to be able to process the data out-of-core or in a distributed fashion,

but for smaller corpora like the BNC there's also a version in NumPy's native format, which simplifies the example.

In [7]:

```
vectors = np.load(os.path.join(dir_root,"vectors.npy"))
print ("{} word vectors loaded".format(vectors.shape[0]))
```
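When the matrix is too large to load whole, the HDF5 route mentioned above is the usual answer; for the `.npy` version a similar effect is available through memory mapping. A minimal sketch on toy data (the path and array here are made up):

```python
import os
import tempfile
import numpy as np

# Toy stand-in for the real vectors file: 4 "words" with 3-dimensional vectors.
path = os.path.join(tempfile.gettempdir(), "vectors_demo.npy")
np.save(path, np.arange(12, dtype=np.float64).reshape(4, 3))

# mmap_mode="r" maps the file instead of reading it whole,
# so rows are fetched from disk on demand (out-of-core friendly).
vectors_mm = np.load(path, mmap_mode="r")
row = np.asarray(vectors_mm[2])  # only this row is materialized in memory
print(vectors_mm.shape, row)
```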

This is our training set.

Note that all words were converted to lowercase:

In [56]:

```
countries = ["australia","ireland","poland","ukraine","algeria","india","japan","turkey","germany","brazil"]
fruits = ["apple","banana","orange","pineapple","pear","peach","kiwi","avocado","apricot","papaya","pumpkin","mango"]
```

In [57]:

```
feature_vectors = []
labels = []
classes = []
for i in countries:
    feature_vectors.append(vectors[dic_word2id[i]])
    labels.append(i)
    classes.append("country")
for i in fruits:
    feature_vectors.append(vectors[dic_word2id[i]])
    labels.append(i)
    classes.append("fruit")
feature_vectors = np.asarray(feature_vectors)
```

Let's try further dimensionality reduction with PCA, so that we can plot our data in 2D.

In [96]:

```
import sklearn.decomposition
pca = sklearn.decomposition.PCA(n_components=2)
p = pca.fit_transform(feature_vectors)
```
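It can be worth checking how much variance the 2D projection actually retains before trusting the plot; `explained_variance_ratio_` reports this. A self-contained sketch on random stand-in data (the real 500-dimensional vectors would be used in practice):

```python
import numpy as np
import sklearn.decomposition

# Toy stand-in for the feature matrix: 22 samples, 500 dimensions.
rng = np.random.RandomState(0)
X = rng.randn(22, 500)

pca = sklearn.decomposition.PCA(n_components=2)
p = pca.fit_transform(X)

print(p.shape)  # (22, 2)
# Fraction of the total variance kept by the first two components:
print(pca.explained_variance_ratio_.sum())
```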

In [97]:

```
import matplotlib as mpl
from matplotlib import pyplot as plt
mpl.style.use('bmh')
%matplotlib inline
mpl.rcParams['figure.figsize'] = (16.0, 10.0)
ax = plt.subplot(111, facecolor='white')  # 'axisbg' was renamed to 'facecolor' in newer matplotlib
dic_colors = {"country": "red", "fruit": "blue"}
for i in range(len(labels)):
    plt.plot(p[i, 0], p[i, 1], marker="o", color=dic_colors[classes[i]])
    plt.text(p[i, 0] - 0.2, p[i, 1] - 0.25, labels[i])
#plt.legend()
```

The clusters of fruits and countries look clearly separable.

Let's do supervised classification with logistic regression.

In [71]:

```
import sklearn.linear_model
model_lr = sklearn.linear_model.LogisticRegression()
```

In [79]:

```
import sklearn.cross_validation  # this module is used below but was not yet imported
scores = sklearn.cross_validation.cross_val_score(model_lr, feature_vectors, classes, scoring='accuracy', cv=10)
print ("average accuracy = {}".format(scores.mean()))
```
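Note that the `sklearn.cross_validation` module was removed in later scikit-learn releases in favour of `sklearn.model_selection`. The same cross-validation with the newer import, run on toy stand-in data (the two well-separated classes below imitate our countries vs. fruits):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Two well-separated toy classes standing in for countries vs. fruits.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 5) + 3.0, rng.randn(20, 5) - 3.0])
y = ["country"] * 20 + ["fruit"] * 20

scores = cross_val_score(LogisticRegression(), X, y, scoring="accuracy", cv=10)
print("average accuracy = {}".format(scores.mean()))
```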

Now let's use logistic regression to classify new words.

For example, I'm not quite sure what a "lemon" is; it sounds like some country in the Middle East.

Let's now train our regression model on all available samples:

In [81]:

```
model_lr.fit(feature_vectors,classes)
```

In [93]:

```
word = "lemon"
vec = vectors[dic_word2id[word]].reshape(1, -1)  # sklearn expects a 2D (n_samples, n_features) array
print(model_lr.predict(vec)[0])
print("predicted probabilities: ", model_lr.predict_proba(vec))
```
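Newer scikit-learn versions require a 2D array of shape `(n_samples, n_features)` even for a single sample, which is why the single word vector is reshaped with `reshape(1, -1)` above. A self-contained sketch of the same pattern on made-up data (the vectors and class names here are toy stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two separable toy classes standing in for countries vs. fruits.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(10, 4) + 2.0, rng.randn(10, 4) - 2.0])
y = ["country"] * 10 + ["fruit"] * 10

model = LogisticRegression().fit(X, y)

# A single vector must be reshaped to (1, n_features) before predict.
new_vec = np.full(4, -2.0).reshape(1, -1)
print(model.predict(new_vec)[0])     # this point sits at the "fruit" centroid
print(model.predict_proba(new_vec))  # columns follow model.classes_ order
```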