BATS – the Bigger Analogy Test Set.
The dataset known as the Google test set became the de facto standard for evaluating word embeddings, but it is unbalanced and covers only 15 linguistic relations, with 19,544 questions in total. We developed a balanced test set that contains 40 different relations with 50 unique pairs per relation, yielding 99,200 analogy questions.
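To illustrate how word pairs in a relation turn into analogy questions, here is a minimal sketch. It assumes that each question combines two distinct pairs from the same relation in the form "a : b :: c : ?". The function name `build_questions` and the toy pairs are hypothetical and are not taken from BATS, and the exact construction used for the test set is not specified here, so counts from this sketch need not match the figure above.

```python
from itertools import permutations


def build_questions(pairs):
    """Return (a, b, c, expected_d) tuples for every ordered choice of two distinct pairs.

    Assumption: a question is formed from a source pair (a, b) and a
    different target pair (c, d) of the same relation.
    """
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]


# Toy relation with three made-up pairs (illustrative only, not BATS data).
toy_pairs = [("cat", "cats"), ("dog", "dogs"), ("tree", "trees")]
questions = build_questions(toy_pairs)

for a, b, c, d in questions:
    print(f"{a} : {b} :: {c} : ?  (expected: {d})")
```

Under this assumption, a relation with n pairs gives n * (n - 1) questions, so keeping the number of pairs equal across relations keeps the number of questions per relation equal as well.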
Downloadable dense vectors:
Each archive contains:
provenance.txt - information on how the vectors were obtained (model parameters, etc.)
ids - vocabulary
vectors.h5p - vectors
freq_per_id - frequencies of words in the original corpus, if applicable
| BNC | Eng | 500 | 2 | PMI; SVD C=0.6 | 318 MB | mirror1 |
| GIGA + Wiki + proza | Rus | 500 | 2 | PMI; SVD C=0.6 | 2.3 GB | mirror1 |