Gensim: Doc2VecKeyedVectors doesn't effectively support __setitem__()/add()

Created on 21 Nov 2019 · 2 Comments · Source: RaRe-Technologies/gensim

Per a user report on SO, neither bracketed assignment (as would be implemented by __setitem__()) nor the add() method successfully mutates a Doc2VecKeyedVectors object.

Looking closer, it seems the superclass __setitem__() passes through to the superclass add(), which was only ever implemented for word-centric sets of vectors – it consults and updates properties like .vocab, which exist only as empty values in Doc2VecKeyedVectors because of the currently confused inheritance created by #1777.
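To make the failure mode concrete, here is a toy sketch of the inheritance pattern described above. The classes WordVectors and DocVectors and all of their attributes are hypothetical stand-ins written for illustration; this is not gensim's code, and the real classes may fail in a different way (for example by raising inside add()).

```python
class WordVectors:
    """Toy word-centric container (hypothetical, not gensim's class)."""

    def __init__(self):
        self.vocab = {}      # key -> row index; the only table add() maintains
        self.vectors = []

    def add(self, key, vector):
        self.vocab[key] = len(self.vectors)
        self.vectors.append(list(vector))

    def __setitem__(self, key, vector):
        self.add(key, vector)  # bracket assignment just delegates to add()


class DocVectors(WordVectors):
    """Toy doc-centric subclass that reads from its own tables."""

    def __init__(self):
        super().__init__()
        self.doctags = {}       # key -> row index; the inherited add() never touches it
        self.vectors_docs = []

    def __getitem__(self, key):
        return self.vectors_docs[self.doctags[key]]


docs = DocVectors()
docs["doc_0"] = [0.1, 0.2]  # runs without error...

lookup_failed = False
try:
    docs["doc_0"]
except KeyError:
    lookup_failed = True    # ...but the doc-side tables were never updated
```

In this toy version the assignment lands in the word-side tables (vocab, vectors) while lookups read the doc-side tables (doctags, vectors_docs), so the write appears to succeed and the read still fails.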

bug feature

gojomo

Most helpful comment

@ThijsKranenburg - If it works for your purposes, it's good enough! Note, though, that you've not yet done enough to look up the new vectors by identifier – that'd also require adding entries to the model.docvecs.doctags dict. And the possible effects of such a workaround on any further training are unclear.

gojomo on 14 Jan 2020

👍2

All 2 comments

As an addition to the SO post, I want to add new documents to the model.

It seems this should be done with the add() method, but since that is not working, I worked out the following workaround:

import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load(PATH_to_model)

# Append the new vector as an extra row and record its identifier at the matching position
model.docvecs.vectors_docs = np.vstack([model.docvecs.vectors_docs, new_vec])
model.docvecs.index2entity.append(new_identifier)

# Check that the new document shows up among the nearest neighbours
model.docvecs.most_similar(positive=[new_vec])

Calling the most_similar() method returns results that include this new document, even after saving and reloading the model. So it seems to work.
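A minimal sketch of the same workaround, extended with the doctags registration that the most helpful comment says is needed for lookup by identifier. It runs against a toy stand-in object rather than a real model. The helper append_docvec is hypothetical, and the Doctag record with fields (offset, word_count, doc_count) is an assumption about the gensim 3.x layout that should be checked against the installed version. The effect on further training is not addressed here.

```python
from collections import namedtuple
from types import SimpleNamespace

import numpy as np

# Stand-in for gensim 3.x's Doctag record; the field names are an assumption.
Doctag = namedtuple("Doctag", "offset word_count doc_count")


def append_docvec(docvecs, tag, vector):
    """Append one vector and register its tag on a docvecs-like object."""
    vector = np.asarray(vector, dtype=docvecs.vectors_docs.dtype)
    if vector.shape != (docvecs.vectors_docs.shape[1],):
        raise ValueError("vector has the wrong dimensionality")
    if tag in docvecs.doctags:
        raise KeyError("tag already present: %r" % (tag,))
    offset = len(docvecs.index2entity)  # position the new tag will occupy
    docvecs.vectors_docs = np.vstack([docvecs.vectors_docs, vector])
    docvecs.index2entity.append(tag)
    # word_count is unknown for a hand-added vector; 0 is only a placeholder
    docvecs.doctags[tag] = Doctag(offset, 0, 1)
    return offset


# Toy object standing in for model.docvecs (string tags only)
toy = SimpleNamespace(
    vectors_docs=np.zeros((2, 3), dtype=np.float32),
    index2entity=["doc_a", "doc_b"],
    doctags={"doc_a": Doctag(0, 0, 1), "doc_b": Doctag(1, 0, 1)},
)
new_offset = append_docvec(toy, "doc_c", [1.0, 2.0, 3.0])
```

Keeping the three updates in one function means the vector array, the identifier list, and the tag dict cannot drift out of step with each other, which is the main risk when patching these attributes by hand.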

My question is whether this is a 'correct' way of working around this bug, or if I am missing something.

ThijsKranenburg on 13 Jan 2020

👍1

