I have tested the nice API of word2vec. It works perfectly on Windows (without Cython: 8 hours) and Ubuntu (with Cython: 15 minutes).
I looked into the output model and the binary word2vec_format model. However, I can't get a similarity matrix out of word2vec.
This similarity matrix is necessary: based on it, the words could be clustered or visualized with network analysis.
You can get individual vectors with model['word']. Then it depends on what "similarity" you want to use (cosine is popular).
word2vec itself offers model.similarity('word1', 'word2') for cosine similarity between two words.
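For reference, the cosine similarity that `model.similarity('word1', 'word2')` returns can be reproduced by hand with NumPy. A minimal sketch (the two toy vectors below are made up and simply stand in for `model['word1']` and `model['word2']`):

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Toy vectors standing in for model['word1'] and model['word2'].
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(v1, v2))  # parallel vectors -> 1.0
```

Cosine ranges from -1 (opposite directions) to 1 (same direction), which is why parallel vectors score exactly 1.0 regardless of their lengths.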
To compute a matrix between all vectors at once (faster), you can use numpy or gensim.similarities.
NumPy is simpler, but everything is in RAM (a |words| × |words| matrix of floats = 10 GB for just 50k words!):

```python
model.init_sims()
matrix = numpy.dot(model.syn0norm, model.syn0norm.T)
```
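The same all-pairs trick works on any embedding matrix: normalize each row to unit length, then a single dot product yields the whole cosine-similarity matrix. A self-contained sketch, with a small random matrix standing in for `model.syn0`:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 20))           # stand-in for model.syn0 (5 "words", 20 dims)
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / norms                       # stand-in for model.syn0norm
matrix = np.dot(unit, unit.T)                # matrix[i][j] = cosine similarity of words i and j

assert np.allclose(np.diag(matrix), 1.0)     # every word is maximally similar to itself
assert np.allclose(matrix, matrix.T)         # cosine similarity is symmetric
```

This is exactly why `init_sims()` is called first: it precomputes the unit-length rows, so the dot product of two rows is already the cosine.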
Gensim can work with larger-than-RAM similarities:

```python
index = gensim.similarities.MatrixSimilarity(gensim.matutils.Dense2Corpus(model.syn0))
for sims in index:
    print(sims)
```

Check out the tutorial at http://radimrehurek.com/gensim/tut3.html .
Closing this issue. Feel free to re-open if there's a specific problem/question.
Hello @piskvorky,
could you please explain to me what `model.init_sims(); matrix = numpy.dot(model.syn0norm, model.syn0norm.T)` represents? What is `model.syn0norm.T`?
Thank you so much 👍
@AsmaZbt
- `model.init_sims()` initializes `model.syn0norm`
- `model.syn0norm` is a matrix that contains the normalized word vectors (i.e. with length = 1)
- `model.syn0norm.T` is the transposed version of `model.syn0norm`
- `matrix = numpy.dot(model.syn0norm, model.syn0norm.T)` gives you the similarity matrix between all vectors (i.e. `matrix[i][j]` = similarity between word `i` and word `j`)

@menshikh-iv thank you so much, it's so clear (y) thank you so much
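A quick numeric check of the claim above, showing that `matrix[i][j]` built from the normalized rows really equals the cosine similarity of word `i` and word `j` computed directly (toy 2-D vectors, no gensim needed):

```python
import numpy as np

words = np.array([[3.0, 4.0],
                  [1.0, 0.0],
                  [0.0, 2.0]])  # toy stand-in for model.syn0

# Normalize rows to unit length -- the equivalent of model.syn0norm.
unit = words / np.linalg.norm(words, axis=1, keepdims=True)
matrix = np.dot(unit, unit.T)

# matrix[0][1] must match the cosine similarity computed from the raw vectors:
direct = np.dot(words[0], words[1]) / (np.linalg.norm(words[0]) * np.linalg.norm(words[1]))
assert np.isclose(matrix[0][1], direct)
print(matrix[0][1])  # 3/5 = 0.6
```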
It seems weird to me that there is no consistent solution for getting a (cosine) similarity matrix from word2vec.
There is such a solution for tf-idf and many others, even for SoftCosineSimilarity:
```python
from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from gensim.similarities import (
    SoftCosineSimilarity,
    SparseTermSimilarityMatrix,
    WordEmbeddingSimilarityIndex,
)

def softcosinesim(texts):
    model = Word2Vec(texts, size=20, min_count=1)
    termsim_index = WordEmbeddingSimilarityIndex(model.wv)
    dictionary = Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(document) for document in texts]
    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
    docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix)
    sims = docsim_index[bow_corpus]  # calculate similarity of query to each doc from bow_corpus
    return sims
```
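For intuition, the soft cosine measure used above generalizes cosine similarity with a term-similarity matrix S: softcos(a, b) = aᵀSb / √((aᵀSa)(bᵀSb)). A dependency-free sketch of that formula with a hand-made S (when S is the identity, it reduces to ordinary cosine):

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine: (a.T S b) / sqrt((a.T S a) * (b.T S b))."""
    num = a @ S @ b
    den = np.sqrt((a @ S @ a) * (b @ S @ b))
    return num / den

# Two bag-of-words vectors over a 3-term vocabulary with no shared terms.
a = np.array([1.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0])

# With the identity matrix, terms are unrelated: plain cosine, here 0.
assert np.isclose(soft_cosine(a, b, np.eye(3)), 0.0)

# Declare term 1 and term 2 somewhat similar (as related word vectors would).
S = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.5, 1.0]])
print(soft_cosine(a, b, S))  # > 0: overlap flows through the similar terms
```

This is the key property the snippet above exploits: documents with no words in common can still score nonzero if their words have similar embeddings.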