Data

BATS – the Bigger Analogy Test Set.

The Google test set became the de facto standard for evaluating word embeddings, but it is unbalanced and covers only 15 linguistic relations, with 19,544 questions in total. We developed a balanced test set that contains 40 different relations with 50 unique pairs per relation, yielding 99,200 analogy questions.
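For readers new to analogy questions, the sketch below shows one common way such a question ("a is to b as c is to ?") can be answered with dense word vectors: the vector offset method, which picks the word closest by cosine similarity to b - a + c. This is a minimal illustration, not necessarily the evaluation procedure used with this test set. The function names and the toy 2-dimensional vectors are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_analogy(vectors, a, b, c):
    """Return the word whose vector is closest to b - a + c, excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors, invented for illustration only (not taken from any real model).
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}

predicted = answer_analogy(toy_vectors, "man", "woman", "king")
print(predicted)
```

A question counts as correct when the returned word matches the expected answer; accuracy per relation is then the share of correctly answered questions.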

Downloadable dense vectors:

Each archive contains:
provenance.txt - information on how the vectors were obtained (model parameters, etc.)
ids - the vocabulary
vectors.h5p - the vectors
freq_per_id - word frequencies in the original corpus, if applicable

Source corpus       | Lang | Dims | W | Method         | Size   | Link
BNC                 | Eng  | 500  | 2 | PMI; SVD C=0.6 | 318 MB | mirror1
GIGA + Wiki + proza | Rus  | 500  | 2 | PMI; SVD C=0.6 | 2.3 GB | mirror1
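As a rough illustration of the PMI step named in the Method column, the sketch below computes pointwise mutual information from co-occurrence counts in a symmetric context window. It assumes that the W column is the window size, which the page does not state. The exact weighting used for the downloadable vectors (logarithm base, smoothing, any positive-PMI clipping) is not known from this page, and the SVD step and the C parameter are not reproduced. The function name and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def pmi_from_tokens(tokens, window=2):
    """Return {(word, context): PMI} from co-occurrences within +/- `window` tokens."""
    pair_counts = Counter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pair_counts[(word, tokens[j])] += 1

    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n

    # PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ), natural logarithm (an assumption).
    return {
        (word, context): math.log(
            n * total / (word_totals[word] * context_totals[context])
        )
        for (word, context), n in pair_counts.items()
    }


# Toy corpus, invented for illustration only.
toy_pmi = pmi_from_tokens("the cat sat on the mat".split(), window=2)
```

In a full pipeline the PMI values would fill a sparse word-by-context matrix, which is then factorized to obtain dense vectors of the chosen dimensionality.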
