Gensim: Cannot call most_similar on Doc2VecKeyedVectors

Created on 2 Apr 2019 · 8 Comments · Source: RaRe-Technologies/gensim

Problem description

I am trying to build my own Doc2VecKeyedVectors from a subset of vectors of my model, and then perform most_similar() on the result.

When calling most_similar(), I get the following error (see the full trace in the "Steps/code/corpus to reproduce" section):

AttributeError: 'list' object has no attribute 'shape'

I noticed that the new Doc2VecKeyedVectors object I created has an empty list as the value of its vectors_docs attribute, which I believe should be a (non-empty) np.ndarray instead.

Steps/code/corpus to reproduce

Minimal reproducible sample:

import gensim
import numpy as np

# model is a pretrained model of type gensim.models.doc2vec.Doc2Vec
full_model_keyedvecs = model.docvecs
relevant_ids = [...]  # insert list of indices used to build TaggedDocuments
relevant_vectors = [full_model_keyedvecs.vectors_docs[i, :] for i in relevant_ids]
relevant_vectors = np.array(relevant_vectors)
keyed_vecs = gensim.models.keyedvectors.Doc2VecKeyedVectors(vector_size=300, mapfile_path=None)
keyed_vecs.add(entities=relevant_ids, weights=relevant_vectors, replace=False)

# At this point the subset of vectors has been added successfully,
# as confirmed by checking keyed_vecs.vectors.shape
assert keyed_vecs.vectors.shape[0] == len(relevant_ids)

in_training_doc_id = 222  # an id I know is in relevant_ids

# The following line causes the error
sims = keyed_vecs.most_similar(positive=[in_training_doc_id], topn=500)

Full stack trace:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-229-1c2b6021aae3> in <module>
---> 11 keyed_vecs.most_similar(positive=[630608])

~\AppData\Local\Continuum\miniconda3\envs\amlenv36\lib\site-packages\gensim\models\keyedvectors.py in most_similar(self, positive, negative, topn, clip_start, clip_end, indexer)
   1623             negative = []
   1624 
-> 1625         self.init_sims()
   1626         clip_end = clip_end or len(self.vectors_docs_norm)
   1627 

~\AppData\Local\Continuum\miniconda3\envs\amlenv36\lib\site-packages\gensim\models\keyedvectors.py in init_sims(self, replace)
   1584                         mode='w+', shape=self.vectors_docs.shape)
   1585                 else:
-> 1586                     self.vectors_docs_norm = empty(self.vectors_docs.shape, dtype=REAL)
   1587                 np_divide(
   1588                     self.vectors_docs, sqrt((self.vectors_docs ** 2).sum(-1))[..., newaxis], self.vectors_docs_norm)

AttributeError: 'list' object has no attribute 'shape'

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

Python 3.6.7 |Anaconda custom (64-bit)| (default, Dec 10 2018, 20:35:02) [MSC v.1915 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
Windows-10-10.0.17763-SP0
>>> import sys; print("Python", sys.version)
Python 3.6.7 |Anaconda custom (64-bit)| (default, Dec 10 2018, 20:35:02) [MSC v.1915 64 bit (AMD64)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.15.4
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.1.0
>>> import gensim; print("gensim", gensim.__version__)
[…]\AppData\Local\Continuum\miniconda3\envs\amlenv36\lib\site-packages\gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
gensim 3.5.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1

Labels: bug, difficulty hard, impact HIGH, reach MEDIUM

Source

chlaterz

All 8 comments

Thank you for providing example code. Could you please provide a reproducible example? Ideally, it would be code that we can run directly through the Python interpreter.

mpenkov on 2 Apr 2019

Sorry about that, leaving a real reproducible example below. Thanks.

import gensim
import numpy as np

relevant_ids = [1, 2]
relevant_vectors = [np.array([1,1,1]), np.array([2,2,2])]

keyed_vecs = gensim.models.keyedvectors.Doc2VecKeyedVectors(vector_size=3, mapfile_path=None)
keyed_vecs.add(entities=relevant_ids, weights=relevant_vectors, replace=False)
assert keyed_vecs.vectors.shape[0] == len(relevant_ids)

# the following line will cause the error
sims = keyed_vecs.most_similar(positive=[1], topn=1)

chlaterz on 2 Apr 2019

👍2

Using the current develop branch, I get an error, but it is different to what you originally reported:

TypeError                                 Traceback (most recent call last)
<ipython-input-1-fb71a5f502fa> in <module>
     10
     11 # the following line will the cause error
---> 12 sims = keyed_vecs.most_similar(positive=[1], topn=1)

~/git/gensim/gensim/models/keyedvectors.py in most_similar(self, positive, negative, topn, clip_start, clip_end, indexer)
   1665             negative = []
   1666
-> 1667         self.init_sims()
   1668         clip_end = clip_end or len(self.vectors_docs_norm)
   1669

~/git/gensim/gensim/models/keyedvectors.py in init_sims(self, replace)
   1628                     mode='w+', shape=self.vectors_docs.shape)
   1629             else:
-> 1630                 self.vectors_docs_norm = _l2_norm(self.vectors_docs, replace=replace)
   1631
   1632     def most_similar(self, positive=None, negative=None, topn=10, clip_start=0, clip_end=None, indexer=None):

~/git/gensim/gensim/models/keyedvectors.py in _l2_norm(m, replace)
   2349
   2350     """
-> 2351     dist = sqrt((m ** 2).sum(-1))[..., newaxis]
   2352     if replace:
   2353         m /= dist

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

Can you please double-check?

mpenkov on 3 Apr 2019

@mpenkov note the user reported gensim 3.5.0 (not latest development branch).

piskvorky on 3 Apr 2019

👍1

I double checked, and @piskvorky is correct. FYI, I did a fresh install a day or two ago following the conda install command found here: https://radimrehurek.com/gensim/install.html

chlaterz on 3 Apr 2019

@chlaterz I'm not sure if the Doc2VecKeyedVectors is designed to be used like that... is there an example or documentation that you're basing your code on? You may need to use the classes in doc2vec.py instead.

Unfortunately, the documentation for the Doc2VecKeyedVectors class is rather lacking, so we need to reverse-engineer a little bit.

mpenkov on 20 Apr 2019
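
One possible reading of the suggestion above is to query the full model rather than a hand-built Doc2VecKeyedVectors, and then keep only the results that belong to the subset. The sketch below shows just the filtering step in plain Python. The helper name restrict_to_subset and the scores are made up for illustration, and it assumes the full model's most_similar() returns (doc_id, score) pairs sorted by descending similarity.

```python
def restrict_to_subset(sims, relevant_ids, topn=10):
    """Keep only (doc_id, score) pairs whose id is in relevant_ids, preserving order."""
    wanted = set(relevant_ids)
    return [(doc_id, score) for doc_id, score in sims if doc_id in wanted][:topn]


# Made-up scores standing in for the output of a call such as
# model.docvecs.most_similar(positive=[222], topn=<a large number>) on the full model.
full_sims = [(17, 0.91), (630, 0.88), (5, 0.80), (42, 0.75)]

subset_sims = restrict_to_subset(full_sims, relevant_ids=[5, 42, 222], topn=1)
print(subset_sims)
```

The trade-off is that topn on the full model has to be large enough for the filtered list to still contain the number of results wanted.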

Chiming in about the documentation for this class: the use case for this method is not clear to me:

most_similar(positive=None, negative=None, topn=10, clip_start=0, clip_end=None, indexer=None)

That is, what does this method actually do, and does this class directly provide any method for getting the top-n closest documents (document vectors)?

matanster on 13 Jul 2019

👍1
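
For readers with the same question, here is a minimal NumPy sketch of a top-n lookup over a subset of document vectors. It is not gensim's implementation. It assumes that ranking by cosine similarity against a single query vector is what is wanted, and the function name top_n_similar and the sample vectors are hypothetical.

```python
import numpy as np


def top_n_similar(vectors, ids, query_id, topn=10):
    """Return up to topn (id, cosine similarity) pairs closest to query_id."""
    # Accept a list of arrays as well as a 2-D array.
    vectors = np.asarray(vectors, dtype=np.float32)
    # L2-normalise each row so that a dot product equals cosine similarity.
    # Note: an all-zero row would cause a division by zero here.
    norms = np.sqrt((vectors ** 2).sum(-1))[..., np.newaxis]
    unit = vectors / norms
    query = unit[ids.index(query_id)]
    scores = unit @ query
    order = np.argsort(-scores)  # indices sorted by descending similarity
    results = [(ids[i], float(scores[i])) for i in order if ids[i] != query_id]
    return results[:topn]


# Hypothetical sample data: three 3-dimensional document vectors.
ids = [1, 2, 3]
vectors = [np.array([1, 0, 0]), np.array([1, 1, 0]), np.array([0, 0, 1])]

sims = top_n_similar(vectors, ids, query_id=1, topn=2)
print(sims)
```

The explicit np.asarray() call means a plain list of vectors is converted before any arithmetic is attempted.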

I think the issue stems from the fact that _l2_norm() expects a NumPy array, but in init_sims() self.vectors_docs is a Python list. The operation dist = sqrt((m ** 2).sum(-1))[..., newaxis] therefore fails, because m is a Python list rather than a NumPy array.

I haven't gone through the code enough to decide where self.vectors_docs should be converted to a NumPy array. Also, in keyedvectors.py, self.vectors_docs is never assigned a value other than in __init__, where it is set to []. @piskvorky @mpenkov
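
Both error messages in this thread can be reproduced without gensim, which supports the diagnosis that a plain list is reaching code written for NumPy arrays. The sketch below uses an empty list as a stand-in for self.vectors_docs, following the observation above that __init__ sets it to [].

```python
# Stand-in for self.vectors_docs as set in __init__, per the comment above.
vectors_docs = []

# gensim 3.5.0 trace: init_sims() reads self.vectors_docs.shape
try:
    vectors_docs.shape
except AttributeError as err:
    shape_error = str(err)

# develop-branch trace: _l2_norm() evaluates m ** 2
try:
    vectors_docs ** 2
except TypeError as err:
    pow_error = str(err)

print(shape_error)
print(pow_error)
```

The two traces differ only in which array operation touches the list first, so they appear to share a single root cause.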