Gensim: Doc2Vec.infer_vector: AttributeError: 'Doc2Vec' object has no attribute 'syn1'

Created on 16 Oct 2015 · 18 Comments · Source: RaRe-Technologies/gensim

Hi all,
I trained a Doc2Vec model successfully with the data of the Kaggle Tutorial "Bag of Words Meets Bags of Popcorn" https://www.kaggle.com/c/word2vec-nlp-tutorial/data. The methods most_similar and doesnt_match are working as expected.

However, when I use the infer_vector method, I get the error AttributeError: 'Doc2Vec' object has no attribute 'syn1'. When I check the model, only model.syn0 is available.

Systeminfo

MacOSX 10.10.5,
Python 2.7.10

Packages (I don't use Cyclone at the moment...)

boto (2.38.0)
bz2file (0.98)
gensim (0.12.2)
httpretty (0.8.6)
numpy (1.10.1)
pip (7.1.2)
requests (2.8.1)
scipy (0.16.0)
setuptools (18.2)
six (1.10.0)
smart-open (1.3.0)
wheel (0.24.0)

Example in IPython

In [5]: from gensim.models import Doc2Vec
In [6]: model = Doc2Vec.load('./Doc2Vec300features_40minwords_10context')
In [7]: model.infer_vector("hallo ich bin ein text".split())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-c4a827fd56d1> in <module>()
----> 1 model.infer_vector("hallo ich bin ein text".split())

/Users/{myuser}/Documents/Dev/virt_env2/lib/python2.7/site-packages/gensim/models/doc2vec.pyc in infer_vector(self, doc_words, alpha, min_alpha, steps)
    694                 train_document_dm(self, doc_words, doctag_indexes, alpha, work, neu1,
    695                                   learn_words=False, learn_hidden=False,
--> 696                                   doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
    697             alpha = ((alpha - min_alpha) / (steps - i)) + min_alpha
    698

/Users/{myuser}/Documents/Dev/virt_env2/lib/python2.7/site-packages/gensim/models/doc2vec_inner.pyx in gensim.models.doc2vec_inner.train_document_dm (./gensim/models/doc2vec_inner.c:4736)()
    419
    420     if hs:
--> 421         syn1 = <REAL_t *>(np.PyArray_DATA(model.syn1))
    422
    423     if negative:

AttributeError: 'Doc2Vec' object has no attribute 'syn1'

THX for your Help! :)

bug conda difficulty easy documentation

All 18 comments

I could work around the error in two ways:

  1. setting the parameter hs=0 when initializing the model, or
  2. not calling model.init_sims()

I'm not a deep expert in this topic, but hs (hierarchical softmax for training) seems important to me, doesn't it? Also init_sims, which "freezes" the model so that it's faster and smaller, is a good thing.

I just figured out that it also works with init_sims(replace=False).

In [9]: from gensim.models import Doc2Vec
In [10]: model = Doc2Vec.load('./Doc2VecMini300features_40minwords_10context')
In [11]: type(model.syn1)
Out[11]: numpy.ndarray
In [12]: model.init_sims(replace=True)
In [13]: type(model.syn1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-5f166d4ec376> in <module>()
----> 1 type(model.syn1)

AttributeError: 'Doc2Vec' object has no attribute 'syn1'

init_sims(replace=True) seems to delete the attribute syn1 from the model, which is used by infer_vector.

I think it's clear now. model.infer_vector trains the new document against the neural weights of the existing model (https://github.com/piskvorky/gensim/blob/develop/gensim/models/doc2vec.py#L684).
Since model.init_sims(replace=True) deletes those weights to save memory, model.infer_vector cannot work afterwards. It's the same reason why model.train does not work after model.init_sims(replace=True).

If I'm right, it might be good to raise an appropriate error/warning and/or add a comment in the docs.
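
Until such an error exists in the library, a small guard in user code can give a clearer message. The following is only a sketch: missing_training_state and safe_infer_vector are hypothetical helper names, not gensim functions, and the attribute names (hs, negative, syn1, syn1neg) are assumed to match the gensim 0.12.x model discussed in this thread. The SimpleNamespace objects are stand-ins for loaded models.

```python
from types import SimpleNamespace


def missing_training_state(model):
    """List hidden-layer weight arrays that inference needs but the model lacks."""
    missing = []
    if getattr(model, "hs", 0) and not hasattr(model, "syn1"):
        missing.append("syn1")      # hierarchical-softmax weights
    if getattr(model, "negative", 0) and not hasattr(model, "syn1neg"):
        missing.append("syn1neg")   # negative-sampling weights
    return missing


def safe_infer_vector(model, doc_words, **kwargs):
    """Call model.infer_vector, failing early with a readable message."""
    missing = missing_training_state(model)
    if missing:
        raise RuntimeError(
            "cannot infer: model lacks %s (was init_sims(replace=True) called?)"
            % ", ".join(missing)
        )
    return model.infer_vector(doc_words, **kwargs)


# Stand-ins for a freshly trained model and one stripped by init_sims(replace=True).
full = SimpleNamespace(hs=1, negative=0, syn1=[[0.0]])
stripped = SimpleNamespace(hs=1, negative=0)

print(missing_training_state(full))      # []
print(missing_training_state(stripped))  # ['syn1']
```

With a real model, safe_infer_vector(model, tokens) would be called in place of model.infer_vector(tokens).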

Thanks for your report. Yes, inference works almost exactly like training, so a model with training-state discarded won't be able to reasonably infer either. The comment for init_sims(replace=True) could be a bit clearer.

This area might benefit from a bit more renaming/commenting/refactoring for full clarity, for a few reasons related to what...

  • The fact that init_sims(replace=True) will clear syn1 (which is only used when hs=1) but not syn1neg (which serves the same role when negative>0) is a bit inconsistent.
  • There is one variant of Doc2Vec training – pure DBOW – that doesn't initialize or need the words syn0 at all, but would still need the syn1 or syn1neg. So if for some reason you did bother to call init_sims(replace=True) on it, its ability to infer would survive _if_ it were based on negative sampling (since syn1neg isn't discarded)... but would break if using hierarchical-softmax (since syn1 is discarded). So it's unclear if init_sims(replace=True) should _imply_ 'minimize my model in all ways', or if that should become a different explicit step (as suggested in #446).
  • in a few projects/papers it has been mentioned that the word vectors created out of a concatenation of the syn0 ('context') and syn1/syn1neg ('prediction') can outperform the plain syn0 context vectors. Supporting experiments with that would further change the situations when the syn1/syn1neg should be consulted/discarded.
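
The first two points can be summarized as a small truth table. This is a sketch of the behavior as described in this thread (syn1 is discarded by init_sims(replace=True), syn1neg is not), not a statement derived from gensim's source. inference_survives_replace is a hypothetical helper, and it only considers the hidden-layer weights, ignoring that replace=True also overwrites syn0 with normalized vectors.

```python
# Arrays that init_sims(replace=True) discards, per the discussion above.
DISCARDED_BY_REPLACE = {"syn1"}


def inference_survives_replace(hs, negative):
    """Return True if the hidden weights needed for inference remain after replace=True."""
    needed = set()
    if hs:
        needed.add("syn1")      # used with hierarchical softmax
    if negative > 0:
        needed.add("syn1neg")   # used with negative sampling
    return not (needed & DISCARDED_BY_REPLACE)


# (hs, negative) combinations: hs only, negative sampling only, both.
for hs, negative in [(1, 0), (0, 5), (1, 5)]:
    print(hs, negative, inference_survives_replace(hs, negative))
```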

@gojomo Should this be marked as easy?

@tmylk these are ease-of-use/least-surprise/ease-of-understanding issues that overlap a bit with the expressed interest of #446... none of the edits are hard, but deciding what makes the most sense to the average user would take some familiarity with the code/uses.

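import codecs
import gensim.models as g

m = g.Doc2Vec.load(saved_path)  # load the trained model
# read one whitespace-tokenized document per line of the test file
with codecs.open(test_docs, "r", "utf-8") as f:
    docs = [line.strip().split() for line in f]
with open(output_file, "w") as output:
    for d in docs:  # write each inferred vector as one space-separated line
        output.write(" ".join(str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)) + "\n")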

When I run this code, I get the error: AttributeError: 'Doc2Vec' object has no attribute 'neg_labels'
So what should I do to set the parameters? I am a beginner, thank you!

@StevenChen1993 - Are you receiving a "slow version" warning in logs when you use Doc2Vec? neg_labels is a part of the model only needed/created when the optimized code is unavailable. So you could see this message if the model were created in an environment where gensim was fully installed (training had access to the optimized code), but then re-loaded to an environment where installation of the optimized variants failed. The best fix would be to make sure your deployment installation has the optimized paths (isn't getting the "slow version" message), perhaps by uninstalling and reinstalling gensim and watching for any errors. Otherwise, training/inference could be 100x slower for that environment. (Alternatively, you could patch a neg_labels into your loaded model like down here in the slow path and use the slow inference.)
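
The patch mentioned at the end could look roughly like the following. It is a sketch that assumes the pure-Python path expects neg_labels to be a NumPy array of length negative + 1, with a 1 in the first position (the true word) and zeros for the sampled noise words; check this against the gensim version you have installed. patch_neg_labels is a hypothetical helper, and the SimpleNamespace object stands in for a loaded model.

```python
from types import SimpleNamespace

import numpy as np


def patch_neg_labels(model):
    """Add neg_labels to a model that was trained with the optimized code."""
    if getattr(model, "negative", 0) > 0 and not hasattr(model, "neg_labels"):
        labels = np.zeros(model.negative + 1)
        labels[0] = 1.0  # first slot is the true (positive) word
        model.neg_labels = labels
    return model


# Stand-in for the result of Doc2Vec.load(...) on a model trained with negative=5.
model = patch_neg_labels(SimpleNamespace(negative=5))
print(model.neg_labels)
```

Even with such a patch, inference runs on the slow path, so reinstalling gensim with the optimized code remains the better fix.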

thanks for your answer!

Hi

I had the same exception. The problem was that the model was trained using the fast version but when I installed gensim (3.8.0) on Windows I did not get a warning that the slow version was used.

I followed the instruction on https://radimrehurek.com/gensim/install.html which then successfully installed the fast version of Gensim (3.8.0) on Windows:
conda install -c conda-forge gensim

PS:
The following did NOT install the fast version on Windows, nor did it print a warning that the slow version was used:
conda install gensim
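
Since the missing warning is what made this hard to spot, it may help to check explicitly which code path an environment provides. A minimal sketch, assuming gensim.models.word2vec exposes a FAST_VERSION constant that is -1 when only the pure-Python fallback is available (as in the 0.12.x to 3.8.x releases discussed here); optimized_code_status is a hypothetical helper.

```python
def optimized_code_status():
    """Report whether gensim's compiled training routines are usable here."""
    try:
        from gensim.models.word2vec import FAST_VERSION
    except Exception:
        # gensim is missing or its import failed in this environment
        return "gensim not importable"
    if FAST_VERSION > -1:
        return "optimized"
    return "slow pure-Python fallback"


print(optimized_code_status())
```

Running this in both the training and the deployment environment shows whether they differ before a saved model is moved between them.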

Thanks @felixsmueller ; cross-linking to #2600 .

@mpenkov do we instruct people to use the conda-forge instead? I always forget what does what, I'm not familiar / a fan of that ecosystem.

@piskvorky I'm totally unfamiliar with conda myself. Do any of the gensim developers actually use it? If yes, it'd be good for that person to handle it. If no, then I suppose "one of us" could dedicate some time towards learning more about it, and then come back to solving this problem, although I must admit it isn't a particularly tempting endeavor.

I'm also struggling to understand whether we're dealing with a problem in gensim proper, or if it's a problem with the feedstock (https://github.com/conda-forge/gensim-feedstock/).

I believe @gojomo has used it.

I guess having binary wheels for Windows fixes most of such issues – we can now just tell people to do pip install. And forget about debugging and updating the proprietary conda ecosystem.

+1

@gojomo @menshikh-iv Any thoughts?

@mpenkov up to you. conda is widely used by the data-science community; for this reason, I'm -1 on dropping it.

I wonder if we can find a conda zealot who is willing to maintain the feedstock officially. Essentially, a new maintainer for https://github.com/conda-forge/gensim-feedstock/ and a go-to person for conda issues...

I think there are 3 ways to install Gensim in the conda ecosystem:

  1. Using Anaconda (some sort of pre-packaged platform, they charge money for some versions)

    • packages inside, incl. Gensim, are updated by the Continuum Analytics team. I don't think we can do upgrades ourselves.

  2. Using an "external" conda-forge channel, which is open source.

    • We could upgrade ourselves, if there's someone to do the maintenance and support. I have zero interest myself.

  3. Using normal pip.

    • Easiest option, no extra work for us.

Though I may have messed that up completely! Someone correct me.

I usually like to use (mini)conda to manage my dev environment. (The 'mini' version because I don't want the installation-overhead/complexity/etc of the full 'anaconda' package set.) I tend to install jupyter, numpy, scipy via the native conda installation – to be sure to get their well-maintained/well-optimized versions of those central packages from their repo – but then just pip install things like gensim. That's worked well enough for me, on MacOS & Linux OSes, and handles the same whether using Python 2 or Python 3 (without needing different virtual-environment helpers).

So from my perspective: We don't have to do any extra conda-work, or worry about other 'conda-forge' repos, or whatever – just encourage people to use pip install, no matter their environment.

(But also: this all seems a digression from what I see as the real reason for this bug-report: tiny behavioral differences between the 'optimized' and 'pure-python' paths, plus other recurring issues where the optimized code isn't available. Dropping the pure-python paths entirely will simplify maintenance immensely, though the code would then be less useful as a teaching tool.)
