Gensim: How to output similarity matrix using word2vec in gensim?

Created on 17 Nov 2013  ·  6 Comments  ·  Source: RaRe-Technologies/gensim

I have tested the nice API of word2vec. It works perfectly on Windows (without Cython, 8 hours) and Ubuntu (with Cython, 15 minutes).

I looked into the output model and the binary word2vec_format model. However, I can't get a similarity matrix out of word2vec.

This similarity matrix is needed so that the words can be clustered or visualized using network analysis.


All 6 comments

You can get individual vectors with model['word']. Then it depends on what "similarity" you want to use (cosine is popular).

word2vec itself offers model.similarity('word1', 'word2') for cosine similarity between two words.
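The cosine similarity that model.similarity computes can be reproduced by hand with NumPy. A minimal sketch with toy vectors (not the gensim internals):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product of the two vectors,
    # divided by the product of their lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])   # parallel to v1
v3 = np.array([-2.0, 1.0, 0.0])  # orthogonal to v1

print(cosine_similarity(v1, v2))  # -> 1.0 (same direction)
print(cosine_similarity(v1, v3))  # -> 0.0 (orthogonal)
```

Cosine ignores vector length and only compares direction, which is why word2vec vectors are usually L2-normalized before comparison.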

To compute a matrix between all vectors at once (faster), you can use numpy or gensim.similarities.

NumPy is simpler, but everything is kept in RAM (a |words| * |words| matrix of floats = ~10 GB for just 50k words!): model.init_sims(); matrix = numpy.dot(model.syn0norm, model.syn0norm.T)

Gensim can work with larger-than-RAM similarity matrices: index = gensim.similarities.MatrixSimilarity(gensim.matutils.Dense2Corpus(model.syn0)); for sims in index: print(sims). Check out the tutorial at http://radimrehurek.com/gensim/tut3.html .
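The streaming idea behind MatrixSimilarity can be illustrated with plain NumPy: instead of materializing the full |words| x |words| matrix, yield one similarity row at a time. A sketch with random stand-in vectors (the sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 20)).astype(np.float32)

# Normalize rows so dot products become cosine similarities.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def similarity_rows(unit_vectors):
    # Yield one row of the similarity matrix per word; only
    # O(|words|) memory is needed at any time, not O(|words|^2).
    for row in unit_vectors:
        yield unit_vectors @ row

first = next(similarity_rows(unit))
print(first.shape)  # (1000,) -- similarities of word 0 to all words
```

Each yielded row can be consumed, written to disk, or aggregated before the next one is computed, which is what makes the approach work for vocabularies whose full similarity matrix would not fit in RAM.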

Closing this issue. Feel free to re-open if there's a specific problem/question.

Hello @piskvorky,

could you please explain to me what model.init_sims(); matrix = numpy.dot(model.syn0norm, model.syn0norm.T) represents? What is model.syn0norm.T?

Thank you so much 👍

@AsmaZbt

  • model.init_sims() initializes model.syn0norm
  • model.syn0norm is the matrix of L2-normalized word vectors (i.e. each row has length 1)
  • model.syn0norm.T is the transposed version of model.syn0norm
  • matrix = numpy.dot(model.syn0norm, model.syn0norm.T) gives you the cosine similarity matrix between all vectors (i.e. matrix[i][j] = similarity between word i and word j)

@menshikh-iv thank you so much, it's so clear (y) thank you so much

It seems weird to me that there is no consistent solution for getting a (cosine) similarity matrix from word2vec.
There are such solutions for tf-idf and many other models, even for SoftCosineSimilarity:

from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from gensim.similarities import (
    SoftCosineSimilarity,
    SparseTermSimilarityMatrix,
    WordEmbeddingSimilarityIndex,
)

def softcosinesim(texts):
    model = Word2Vec(texts, size=20, min_count=1)
    termsim_index = WordEmbeddingSimilarityIndex(model.wv)
    dictionary = Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(document) for document in texts]
    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
    docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix)
    sims = docsim_index[bow_corpus]  # similarity of each doc to every doc in bow_corpus
    return sims
