Gensim: Using pre-trained word2vec models in doc2vec

Created on 19 May 2017 · 3 comments · Source: RaRe-Technologies/gensim

Is there a practical way of using pre-trained word2vec models in doc2vec?

There is a forked version of Gensim that does it but it is pretty old.
Referenced here: https://github.com/jhlau/doc2vec
Forked Gensim here: https://github.com/jhlau/gensim

Otherwise I would like to add this feature as jhlau did and merge it back.

All 3 comments

You can manually patch up a model to insert word-vectors from elsewhere before training. The existing intersect_word2vec_format() may be useful, directly or as an example: it assumes you've already created a model with its own vocabulary (including the frequency info needed for negative sampling or frequent-word downsampling), but then want to use some external source to replace some or all of the word-vector values.

I personally don't think the case for such re-use is yet strong. Indeed, in some frequently top-performing Doc2Vec training modes (like pure PV-DBOW), input word-vectors aren't trained or used at all, so loading them would be completely superfluous. You can see some discussion of related issues, including links to messages elsewhere, in the GitHub issue thread: https://github.com/RaRe-Technologies/gensim/issues/1270#issuecomment-293437366

This fork supports gensim 3.8 and can train a doc2vec model with pre-trained word2vec vectors:

https://github.com/maohbao/gensim

As per above, I think the evidence for the benefit of such a technique is muddled.

Also: it should be possible simply by poking/prodding a standard model at the right points between instantiation and training, without any major changes or new parameters to the relevant models, and without using a forked version of gensim (which will drift further away from other changes/fixes over time).
