I am trying to build my own Doc2VecKeyedVectors from a subset of vectors of my model, and then perform most_similar() on the result.
When calling most_similar(), I get the following error (see the full stack trace below):
AttributeError: 'list' object has no attribute 'shape'
I noticed that the new Doc2VecKeyedVectors object I created has an empty list as its vectors_docs attribute, which I believe should be a (non-empty) np.ndarray instead.
Minimal reproducible sample:
full_model_keyedvecs = model.docvecs # a pretrained model of type gensim.models.doc2vec.Doc2Vec
relevant_ids = [...] # insert list of indices used to build TaggedDocuments
relevant_vectors = [full_model_keyedvecs.vectors_docs[i, :] for i in relevant_ids]
relevant_vectors = np.array(relevant_vectors)
keyed_vecs = gensim.models.keyedvectors.Doc2VecKeyedVectors(vector_size=300, mapfile_path=None)
keyed_vecs.add(entities=relevant_ids, weights=relevant_vectors, replace=False)
# at this point, the subset of keyed vectors has been added successfully, as keyed_vecs.vectors.shape confirms
assert keyed_vecs.vectors.shape[0] == len(relevant_ids)
in_training_doc_id = 222 # an id I know is in relevant_ids
# the following line causes the error
sims = keyed_vecs.most_similar(positive=[in_training_doc_id], topn=500)
Full stack trace:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-229-1c2b6021aae3> in <module>
---> 11 keyed_vecs.most_similar(positive=[630608])
~\AppData\Local\Continuum\miniconda3\envs\amlenv36\lib\site-packages\gensim\models\keyedvectors.py in most_similar(self, positive, negative, topn, clip_start, clip_end, indexer)
1623 negative = []
1624
-> 1625 self.init_sims()
1626 clip_end = clip_end or len(self.vectors_docs_norm)
1627
~\AppData\Local\Continuum\miniconda3\envs\amlenv36\lib\site-packages\gensim\models\keyedvectors.py in init_sims(self, replace)
1584 mode='w+', shape=self.vectors_docs.shape)
1585 else:
-> 1586 self.vectors_docs_norm = empty(self.vectors_docs.shape, dtype=REAL)
1587 np_divide(
1588 self.vectors_docs, sqrt((self.vectors_docs ** 2).sum(-1))[..., newaxis], self.vectors_docs_norm)
AttributeError: 'list' object has no attribute 'shape'
Please provide the output of:
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
Python 3.6.7 |Anaconda custom (64-bit)| (default, Dec 10 2018, 20:35:02) [MSC v.1915 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
import platform; print(platform.platform())
Windows-10-10.0.17763-SP0
import sys; print("Python", sys.version)
Python 3.6.7 |Anaconda custom (64-bit)| (default, Dec 10 2018, 20:35:02) [MSC v.1915 64 bit (AMD64)]
import numpy; print("NumPy", numpy.__version__)
NumPy 1.15.4
import scipy; print("SciPy", scipy.__version__)
SciPy 1.1.0
import gensim; print("gensim", gensim.__version__)
[…]\AppData\Local\Continuum\miniconda3\envs\amlenv36\lib\site-packages\gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
gensim 3.5.0
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1
Thank you for providing example code. Could you please provide a reproducible example? Ideally, it would be code that we can run directly through the Python interpreter.
Sorry about that, leaving a real reproducible example below. Thanks.
import gensim
import numpy as np
relevant_ids = [1, 2]
relevant_vectors = [np.array([1,1,1]), np.array([2,2,2])]
keyed_vecs = gensim.models.keyedvectors.Doc2VecKeyedVectors(vector_size=3, mapfile_path=None)
keyed_vecs.add(entities=relevant_ids, weights=relevant_vectors, replace=False)
assert keyed_vecs.vectors.shape[0] == len(relevant_ids)
# the following line will cause the error
sims = keyed_vecs.most_similar(positive=[1], topn=1)
Using the current develop branch, I get an error, but it is different to what you originally reported:
TypeError Traceback (most recent call last)
<ipython-input-1-fb71a5f502fa> in <module>
10
11 # the following line will the cause error
---> 12 sims = keyed_vecs.most_similar(positive=[1], topn=1)
~/git/gensim/gensim/models/keyedvectors.py in most_similar(self, positive, negative, topn, clip_start, clip_end, indexer)
1665 negative = []
1666
-> 1667 self.init_sims()
1668 clip_end = clip_end or len(self.vectors_docs_norm)
1669
~/git/gensim/gensim/models/keyedvectors.py in init_sims(self, replace)
1628 mode='w+', shape=self.vectors_docs.shape)
1629 else:
-> 1630 self.vectors_docs_norm = _l2_norm(self.vectors_docs, replace=replace)
1631
1632 def most_similar(self, positive=None, negative=None, topn=10, clip_start=0, clip_end=None, indexer=None):
~/git/gensim/gensim/models/keyedvectors.py in _l2_norm(m, replace)
2349
2350 """
-> 2351 dist = sqrt((m ** 2).sum(-1))[..., newaxis]
2352 if replace:
2353 m /= dist
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
Can you please double-check?
@mpenkov note the user reported gensim 3.5.0 (not latest development branch).
I double checked, and @piskvorky is correct. FYI, I did a fresh install a day or two ago following the conda install command found here: https://radimrehurek.com/gensim/install.html
@chlaterz I'm not sure if the Doc2VecKeyedVectors is designed to be used like that... is there an example or documentation that you're basing your code on? You may need to use the classes in doc2vec.py instead.
Unfortunately, the documentation for the Doc2VecKeyedVectors class is rather lacking, so we need to reverse-engineer a little bit.
Chiming in about the documentation for this class: the use case for this method is a little unclear:
most_similar(positive=None, negative=None, topn=10, clip_start=0, clip_end=None, indexer=None)
That is, what does this method actually do, and does this class directly provide any method for getting the top-n closest documents (document vectors)?
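Until that is documented, one way to get the top-n closest document vectors for a subset is to skip Doc2VecKeyedVectors and compute cosine similarities directly with NumPy. The following is only a minimal sketch: top_n_similar is a hypothetical helper (not part of gensim), the ids and vectors are made-up toy data, and it assumes no vector is all zeros.

```python
import numpy as np

def top_n_similar(vectors, ids, query_id, topn=10):
    """Return up to `topn` (id, cosine similarity) pairs closest to `query_id`.

    Hypothetical helper for illustration only; not a gensim API.
    """
    # Convert to a 2-D float array so that array arithmetic works.
    vectors = np.asarray(vectors, dtype=np.float32)
    # L2-normalise each row so that a dot product equals cosine similarity.
    norms = np.sqrt((vectors ** 2).sum(-1))[..., np.newaxis]
    unit = vectors / norms
    query = unit[ids.index(query_id)]
    sims = unit @ query
    # Sort by descending similarity and drop the query document itself.
    order = np.argsort(-sims)
    return [(ids[i], float(sims[i])) for i in order if ids[i] != query_id][:topn]

# Toy data standing in for a subset of trained document vectors.
relevant_ids = [1, 2, 3]
relevant_vectors = [np.array([1, 0, 0]), np.array([1, 1, 0]), np.array([0, 0, 1])]
result = top_n_similar(relevant_vectors, relevant_ids, query_id=1, topn=2)
print(result)
```

With real data, relevant_vectors would be the rows selected from model.docvecs.vectors_docs, as in the original report.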
I think the issue stems from the fact that _l2_norm() expects a numpy array, whereas in init_sims() self.vectors_docs is a Python list. The operation dist = sqrt((m ** 2).sum(-1))[..., newaxis] therefore fails, because m is a list and not a numpy array.
I haven't gone through the code enough to decide where self.vectors_docs should be converted to a numpy array. Also, in keyedvectors.py, self.vectors_docs is never assigned anywhere other than in __init__, where it is set to []. @piskvorky
@mpenkov
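To illustrate the diagnosis above, here is a small standalone sketch that needs only NumPy. The l2_norm function below is my own stand-in that reuses the expression from the traceback; it is not gensim's actual _l2_norm, and the input values are arbitrary.

```python
import numpy as np

def l2_norm(m):
    """Row-wise L2 normalisation using the expression from the traceback.

    Stand-in for illustration; not the gensim implementation.
    """
    dist = np.sqrt((m ** 2).sum(-1))[..., np.newaxis]
    return m / dist

as_list = [[3.0, 4.0], [6.0, 8.0]]

# A plain Python list does not support `** 2`, so this raises TypeError.
try:
    l2_norm(as_list)
    list_error = None
except TypeError as err:
    list_error = str(err)
print(list_error)

# Converting to an ndarray first makes the same expression work.
normed = l2_norm(np.asarray(as_list, dtype=np.float32))
print(normed)
```

This suggests that wherever the fix lands, vectors_docs needs to hold an ndarray before init_sims() runs.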