Hi all,
I successfully trained a Doc2Vec model on the data from the Kaggle tutorial "Bag of Words Meets Bags of Popcorn" (https://www.kaggle.com/c/word2vec-nlp-tutorial/data). The methods most_similar and doesnt_match work as expected.
However, when I use the infer_vector method, I get AttributeError: 'Doc2Vec' object has no attribute 'syn1'. When I inspect the model, only a model.syn0 is available.
Systeminfo
MacOSX 10.10.5,
Python 2.7.10
Packages (I don't use Cython at the moment...)
boto (2.38.0)
bz2file (0.98)
gensim (0.12.2)
httpretty (0.8.6)
numpy (1.10.1)
pip (7.1.2)
requests (2.8.1)
scipy (0.16.0)
setuptools (18.2)
six (1.10.0)
smart-open (1.3.0)
wheel (0.24.0)
Example in IPython
In [5]: from gensim.models import Doc2Vec
In [6]: model = Doc2Vec.load('./Doc2Vec300features_40minwords_10context')
In [7]: model.infer_vector("hallo ich bin ein text".split())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-7-c4a827fd56d1> in <module>()
----> 1 model.infer_vector("hallo ich bin ein text".split())
/Users/{myuser}/Documents/Dev/virt_env2/lib/python2.7/site-packages/gensim/models/doc2vec.pyc in infer_vector(self, doc_words, alpha, min_alpha, steps)
694 train_document_dm(self, doc_words, doctag_indexes, alpha, work, neu1,
695 learn_words=False, learn_hidden=False,
--> 696 doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
697 alpha = ((alpha - min_alpha) / (steps - i)) + min_alpha
698
/Users/{myuser}/Documents/Dev/virt_env2/lib/python2.7/site-packages/gensim/models/doc2vec_inner.pyx in gensim.models.doc2vec_inner.train_document_dm (./gensim/models/doc2vec_inner.c:4736)()
419
420 if hs:
--> 421 syn1 = <REAL_t *>(np.PyArray_DATA(model.syn1))
422
423 if negative:
AttributeError: 'Doc2Vec' object has no attribute 'syn1'
Thanks for your help! :)
I could handle the error in two ways:
- setting hs=0 when initializing the model, or
- calling model.init_sims()
I'm not a deep expert in this topic, but hs (hierarchical softmax for training) seems important to me, doesn't it? Also, init_sims, which "freezes" the model so that it's faster and smaller, is a good thing.
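Since the question of how important hs is comes up here: with hierarchical softmax, the hidden weights in syn1 are consulted for every prediction, so without them neither training nor inference can score a word. A toy numpy sketch of the idea (made-up sizes and Huffman path, not gensim's API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy illustration: with hierarchical softmax, a word's probability is a
# product of binary decisions along its Huffman-tree path, and every
# decision consults one row of the hidden weight matrix syn1.
rng = np.random.default_rng(0)
vector_size, n_inner_nodes = 10, 7        # inner nodes of the Huffman tree
syn1 = rng.normal(scale=0.1, size=(n_inner_nodes, vector_size))
context = rng.normal(scale=0.1, size=vector_size)  # averaged input vectors

path_nodes = [0, 2, 5]    # hypothetical path for one word
path_codes = [1, 0, 1]    # left/right decision at each inner node

prob = 1.0
for node, code in zip(path_nodes, path_codes):
    p = sigmoid(context @ syn1[node])
    prob *= p if code == 1 else (1.0 - p)

print(prob)  # a probability in (0, 1); without syn1 it cannot be computed at all
```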
I just figured out that it also works with init_sims(replace=False).
In [9]: from gensim.models import Doc2Vec
In [10]: model = Doc2Vec.load('./Doc2VecMini300features_40minwords_10context')
In [11]: type(model.syn1)
Out[11]: numpy.ndarray
In [12]: model.init_sims(replace=True)
In [13]: type(model.syn1)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-13-5f166d4ec376> in <module>()
----> 1 type(model.syn1)
AttributeError: 'Doc2Vec' object has no attribute 'syn1'
init_sims(replace=True) seems to delete the attribute syn1 from the model, which is used by infer_vector.
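The replace=True semantics can be mimicked with a toy stand-in (not gensim code, just an illustration of the behaviour observed above: normalize in place to save memory, discard the training state for good):

```python
import numpy as np

# Toy stand-in mimicking the observed behaviour of init_sims(replace=True):
# the input vectors are unit-normalized in place and the training-state
# arrays are deleted, after which inference that needs them must fail.
class ToyModel:
    def __init__(self):
        self.syn0 = np.random.rand(4, 3)
        self.syn1 = np.random.rand(4, 3)   # hidden weights, needed for inference

    def init_sims(self, replace=False):
        norms = np.linalg.norm(self.syn0, axis=1, keepdims=True)
        if replace:
            self.syn0 /= norms             # overwrite in place to save memory
            if hasattr(self, "syn1"):
                del self.syn1              # training state is gone for good
        else:
            self.syn0norm = self.syn0 / norms  # keep originals alongside

m = ToyModel()
m.init_sims(replace=True)
print(hasattr(m, "syn1"))  # accessing m.syn1 now raises AttributeError
```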
I think it's clear now. model.infer_vector trains the new documents against the neural weights of the current model (https://github.com/piskvorky/gensim/blob/develop/gensim/models/doc2vec.py#L684).
As model.init_sims(replace=True) deletes those weights to save memory, model.infer_vector cannot work. It's the same reason model.train no longer works after model.init_sims(replace=True).
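To illustrate that point, here is a rough numpy sketch of the mechanism (made-up sizes, a simplified DBOW-style update, not gensim's actual implementation): the hidden weights stay frozen and only the new document's vector is adjusted, which is why discarding them breaks inference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Rough sketch: inference runs the same kind of gradient updates as training,
# but only the new document's vector is learned; the hidden weights
# (syn1neg here) stay frozen. Sizes and word ids are made up.
rng = np.random.default_rng(1)
vector_size, vocab_size = 8, 20
syn1neg = rng.normal(scale=0.1, size=(vocab_size, vector_size))  # frozen

doc_vec = rng.normal(scale=0.1, size=vector_size)  # the only learned parameter
word_ids = [3, 7, 11]            # hypothetical word indices of the new document
alpha = 0.1

before = np.mean([sigmoid(doc_vec @ syn1neg[w]) for w in word_ids])
for _ in range(50):              # the "steps" of infer_vector
    for w in word_ids:
        err = 1.0 - sigmoid(doc_vec @ syn1neg[w])   # push score toward 1
        doc_vec += alpha * err * syn1neg[w]         # update doc vector only
after = np.mean([sigmoid(doc_vec @ syn1neg[w]) for w in word_ids])

print(before, after)  # the inferred vector fits its own words better
```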
If I'm right, it might be good to raise an appropriate error/warning, and/or add a note in the docs.
Thanks for your report. Yes, inference works almost exactly like training, so a model with training-state discarded won't be able to reasonably infer either. The comment for init_sims(replace=True) could be a bit clearer.
This area might benefit from a bit more renaming/commenting/refactoring for full clarity, for a few reasons:
- That init_sims(replace=True) will clear syn1 (which is only used when hs=1) but not syn1neg (which serves the same role when negative>0) is a bit inconsistent.
- A model kept around only for inference wouldn't need a normalized syn0 at all, but would still need the syn1 or syn1neg. So if for some reason you did bother to call init_sims(replace=True) on it, its ability to infer would survive _if_ it were based on negative sampling (since syn1neg isn't discarded)... but would break if using hierarchical-softmax (since syn1 is discarded). So it's unclear if init_sims(replace=True) should _imply_ 'minimize my model in all ways', or if that should become a different explicit step (as suggested in #446).
- Combining the syn0 ('context') and syn1/syn1neg ('prediction') vectors can outperform the plain syn0 context vectors. Supporting experiments with that would further change the situations when the syn1/syn1neg should be consulted/discarded.

@gojomo Should this be marked as easy?
@tmylk these are ease-of-use/least-surprise/ease-of-understanding that overlap a bit with the expressed interest of #446... none of the edits are hard but deciding what makes the most sense to the average user would take some familiarity with the code/uses.
import codecs
import gensim.models as g

# saved_path, test_docs_path, output_file, start_alpha and infer_epoch
# are assumed to be defined elsewhere in the script
m = g.Doc2Vec.load(saved_path)  # load model
test_docs = [x.strip().split() for x in codecs.open(test_docs_path, "r", "utf-8")]
with open(output_file, "w") as output:
    for d in test_docs:
        vec = m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)
        output.write(" ".join(str(x) for x in vec) + "\n")
When I run this code, I get the error: AttributeError: 'Doc2Vec' object has no attribute 'neg_labels'
So, what should I do to set the parameters? I am a beginner, thank you!
@StevenChen1993 - Are you receiving a "slow version" warning in logs when you use Doc2Vec? neg_labels is a part of the model only needed/created when the optimized code is unavailable. So you could see this message if the model were created in an environment where gensim was fully installed (training had access to the optimized code), but then re-loaded to an environment where installation of the optimized variants failed. The best fix would be to make sure your deployment installation has the optimized paths (isn't getting the "slow version" message), perhaps by uninstalling and reinstalling gensim and watching for any errors. Otherwise, training/inference could be 100x slower for that environment. (Alternatively, you could patch a neg_labels into your loaded model like down here in the slow path and use the slow inference.)
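For anyone who does need the patch mentioned above, here is a hedged sketch of what the slow path builds for itself (attribute names as in older gensim versions; patch_neg_labels and FakeModel are illustrative names, not library API -- check your gensim version before relying on this):

```python
import numpy as np

# Hedged sketch of the workaround: the pure-Python ("slow") code path expects
# a neg_labels array of length negative+1 -- a single 1.0 label for the
# positive example, followed by 0.0 labels for each sampled negative word.
def patch_neg_labels(model):
    if getattr(model, "negative", 0) and not hasattr(model, "neg_labels"):
        model.neg_labels = np.zeros(model.negative + 1)
        model.neg_labels[0] = 1.0
    return model

class FakeModel:      # stand-in for a loaded Doc2Vec model, so this runs alone
    negative = 5      # the model's negative-sampling setting

m = patch_neg_labels(FakeModel())
print(m.neg_labels)
```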
thanks for your answer!
Hi
I had the same exception. The problem was that the model was trained using the fast version but when I installed gensim (3.8.0) on Windows I did not get a warning that the slow version was used.
I followed the instruction on https://radimrehurek.com/gensim/install.html which then successfully installed the fast version of Gensim (3.8.0) on Windows:
conda install -c conda-forge gensim
PS:
The following did NOT install the fast version on Windows, and neither did it print a warning that the slow version was used:
conda install gensim
Thanks @felixsmueller ; cross-linking to #2600 .
@mpenkov do we instruct people to use the conda-forge instead? I always forget what does what, I'm not familiar / a fan of that ecosystem.
@piskvorky I'm totally unfamiliar with conda myself. Do any of the gensim developers actually use it? If yes, it'd be good for that person to handle it. If no, then I suppose "one of us" could dedicate some time towards learning more about it, and then come back to solving this problem, although I must admit it isn't a particularly tempting endeavor.
I'm also struggling to understand whether we're dealing with a problem in gensim proper, or if it's a problem with the feedstock (https://github.com/conda-forge/gensim-feedstock/).
I believe @gojomo has used it.
I guess having binary wheels for Windows fixes most such issues – we can now just tell people to pip install, and forget about debugging and updating the proprietary conda ecosystem.
+1
@gojomo @menshikh-iv Any thoughts?
@mpenkov up to you; conda is widely used by the data-science community, so for that reason I'm -1 on dropping it.
I wonder if we can find a conda zealot who is willing to maintain the feedstock officially. Essentially, a new maintainer for https://github.com/conda-forge/gensim-feedstock/ and a go-to person for conda issues...
I think there are 3 ways to install Gensim in the conda ecosystem:
1. conda install gensim (the default channel)
2. conda install -c conda-forge gensim
3. pip install gensim inside a conda environment
Though I may have messed that up completely! Someone correct me.
I usually like to use (mini)conda to manage my dev environment. (The 'mini' version because I don't want the installation-overhead/complexity/etc of the full 'anaconda' package set.) I tend to install jupyter, numpy, scipy via the native conda installation – to be sure to get their well-maintained/well-optimized versions of those central packages from their repo – but then just pip install things like gensim. That's worked well enough for me, on MacOS & Linux OSes, and handles the same whether using Python 2 or Python 3 (without needing different virtual-environment helpers).
So from my perspective: We don't have to do any extra conda-work, or worry about other 'conda-forge' repos, or whatever – just encourage people to use pip install, no matter their environment.
(But also: this all seems a digression from what I see as the real reason for this bug-report: tiny behavioral differences between the 'optimized' and 'pure-python' paths, plus other recurring issues where the optimized code isn't available. Dropping the pure-python paths entirely will simplify maintenance immensely, though the code would then be less useful as a teaching tool.)