Gensim: Doc2Vec - checking similarity between words and docs

Created on 29 Oct 2015 · 2 Comments · Source: RaRe-Technologies/gensim

I read the blog post with Word2Vec and Doc2Vec examples using gensim and tried the functions given in its section 'Summarizing sentences & documents':

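import numpy as np

def get_vector(word):
    # Look up the unit-normalized vector for a vocabulary key
    return model.syn0norm[model.vocab[word].index]

def calculate_similarity(sentence, word):
    vec_a = get_vector(sentence)
    vec_b = get_vector(word)
    # Dot product of unit-length vectors equals their cosine similarity
    sim = np.dot(vec_a, vec_b)
    return sim

calculate_similarity('SENT_47973', 'casual')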

I used the IMDB dataset and the models trained by running the Doc2Vec IPython notebook example.

I changed syn0norm to syn0 in the return statement of get_vector(), but the function does not work when passed a doc_id obtained as follows:

import numpy as np
from gensim.models import Doc2Vec

doc2vec_model = Doc2Vec.load('imdb-d2v.doc2vec')
doc_id = np.random.randint(doc2vec_model.docvecs.count)
print(calculate_similarity(doc_id, 'movies'))

newterminator

👍1

Most helpful comment

Since gensim 0.12, document vectors are stored in a separate structure: the docvecs property of the main model. So you won't get a document's vector from the main model's syn0/syn0norm.

You can still compare word and document vectors with a few extra steps: the similarity methods of both the main model and the docvecs model can take an external vector (instead of a lookup key). There's an example on the mailing list:

https://groups.google.com/d/msg/gensim/Fujja7aOH6E/C3WArofWbNIJ

gojomo on 29 Oct 2015

👍4
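
For readers who only need a single word-to-document score, the comparison described above can be sketched without relying on any particular gensim method. This is a minimal sketch and not the mailing-list example: the helper name cosine_similarity and the toy vectors are hypothetical, and the lookups mentioned in the comments (model.docvecs[doc_id] and model[word]) are an assumption about the gensim 0.12-era API that is not executed here. Because raw syn0 vectors are not unit length, the sketch divides by both norms instead of using a bare dot product.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Dividing by both norms makes the result independent of vector length,
    # which a plain dot product of unnormalized vectors is not.
    return dot / (norm_a * norm_b)

# Toy stand-ins for the two vectors (hypothetical values).
# Assumption, not run here: with a gensim 0.12-era model these would come from
#   doc_vec = model.docvecs[doc_id]   # document vectors live in docvecs
#   word_vec = model[word]            # word vectors live in the main model
doc_vec = [0.2, -0.1, 0.4]
word_vec = [0.1, 0.0, 0.3]

similarity = cosine_similarity(doc_vec, word_vec)
print(similarity)
```

The function accepts any two sequences of numbers, so NumPy arrays returned by a model lookup can be passed in directly.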

All 2 comments

Thank you, Gordon (@gojomo), for your lightning-quick response and for clarifying the difference between how the word vectors and the document vectors are stored.

I will check out your example as linked and take it from there.

newterminator on 29 Oct 2015
