Gensim: Using pre-trained word2vec models in doc2vec

Created on 19 May 2017 · 3 comments · Source: RaRe-Technologies/gensim

Is there a practical way of using pre-trained word2vec models in doc2vec?

There is a forked version of Gensim that does it but it is pretty old.
Referenced here: https://github.com/jhlau/doc2vec
Forked Gensim here: https://github.com/jhlau/gensim

Otherwise I would like to add this feature as jhlau did and merge it back.

All 3 comments

You can manually patch up a model to insert word-vectors from elsewhere before training. The existing intersect_word2vec_format() may be useful, directly or as an example: it assumes you've already created a model with its own vocabulary (including the frequency info needed for negative sampling or frequent-word downsampling), but then want to use some external source to replace some or all of the word-vector values.

I personally don't think the case for such re-use is yet strong. Indeed, in some frequently top-performing Doc2Vec training modes (like pure PV-DBOW), input word-vectors aren't trained or used at all, so loading them would be completely superfluous. You can see some discussion of related issues, including links to messages elsewhere, in the GitHub issue thread: https://github.com/RaRe-Technologies/gensim/issues/1270#issuecomment-293437366

This fork supports gensim 3.8 and can train a doc2vec model with pre-trained word2vec vectors:

https://github.com/maohbao/gensim

As per above, I think the evidence for the benefit of such a technique is muddled.

Also: it should be possible simply by poking/prodding a standard model at the right points between instantiation and training, without any major changes or new parameters to the relevant models, and without using a forked version of gensim (which will drift further away from other changes/fixes over time).
