Gensim: How to output similarity matrix using word2vec in gensim?

Created on 17 Nov 2013  ·  6 Comments  ·  Source: RaRe-Technologies/gensim

I have tested the nice API of word2vec. It works perfectly on Windows (without Cython, 8 hours) and Ubuntu (with Cython, 15 minutes).

I looked into the output model and the binary word2vec_format model. However, I can't get a similarity matrix out of word2vec.

This similarity matrix is needed so that the words can be clustered or visualized using network analysis.


All 6 comments

You can get individual vectors with model['word']. Then it depends on what "similarity" you want to use (cosine is popular).

word2vec itself offers model.similarity('word1', 'word2') for cosine similarity between two words.
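The cosine similarity that model.similarity computes can be reproduced by hand with NumPy. A minimal sketch with toy vectors (not the gensim internals):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product of the two vectors,
    # divided by the product of their lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])   # parallel to v1
v3 = np.array([-2.0, 1.0, 0.0])  # orthogonal to v1

print(cosine_similarity(v1, v2))  # -> 1.0 (same direction)
print(cosine_similarity(v1, v3))  # -> 0.0 (orthogonal)
```

Cosine ignores vector length and only compares direction, which is why word2vec vectors are usually L2-normalized before comparison.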

To compute a matrix between all vectors at once (faster), you can use numpy or gensim.similarities.

NumPy is simpler, but everything is kept in RAM (a |words| * |words| matrix of floats = ~10 GB for just 50k words!): model.init_sims(); matrix = numpy.dot(model.syn0norm, model.syn0norm.T)

Gensim can work with larger-than-RAM similarity matrices: index = gensim.similarities.MatrixSimilarity(gensim.matutils.Dense2Corpus(model.syn0)); for sims in index: print(sims). Check out the tutorial at http://radimrehurek.com/gensim/tut3.html .
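The streaming idea behind MatrixSimilarity can be illustrated with plain NumPy: instead of materializing the full |words| x |words| matrix, yield one similarity row at a time. A sketch with random stand-in vectors (the sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 20)).astype(np.float32)

# Normalize rows so dot products become cosine similarities.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def similarity_rows(unit_vectors):
    # Yield one row of the similarity matrix per word; only
    # O(|words|) memory is needed at any time, not O(|words|^2).
    for row in unit_vectors:
        yield unit_vectors @ row

first = next(similarity_rows(unit))
print(first.shape)  # (1000,) -- similarities of word 0 to all words
```

Each yielded row can be consumed, written to disk, or aggregated before the next one is computed, which is what makes the approach work for vocabularies whose full similarity matrix would not fit in RAM.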

Closing this issue. Feel free to re-open if there's a specific problem/question.

Hello @piskvorky,

could you please explain to me what model.init_sims(); matrix = numpy.dot(model.syn0norm, model.syn0norm.T) represents? What is model.syn0norm.T?

Thank you so much 👍

@AsmaZbt

  • model.init_sims() initializes model.syn0norm
  • model.syn0norm is the matrix of L2-normalized word vectors (i.e. each row has length 1)
  • model.syn0norm.T is the transposed version of model.syn0norm
  • matrix = numpy.dot(model.syn0norm, model.syn0norm.T) gives you the cosine similarity matrix between all vectors (i.e. matrix[i][j] = similarity between word i and word j)

@menshikh-iv thank you so much, it's so clear (y) thank you so much

It seems weird to me that there is no consistent solution for getting a (cosine) similarity matrix from word2vec.
There are such solutions for tf-idf and many other models, even for SoftCosineSimilarity:

from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from gensim.similarities import (
    SoftCosineSimilarity,
    SparseTermSimilarityMatrix,
    WordEmbeddingSimilarityIndex,
)

def softcosinesim(texts):
    model = Word2Vec(texts, size=20, min_count=1)
    termsim_index = WordEmbeddingSimilarityIndex(model.wv)
    dictionary = Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(document) for document in texts]
    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
    docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix)
    sims = docsim_index[bow_corpus]  # similarity of each doc to every doc in bow_corpus
    return sims
