Gensim: Word2Vec model to dict; Adding to the word2vec to production pipeline

Created on 10 Apr 2017 · 4 comments · Source: RaRe-Technologies/gensim

A lot of users use their trained word2vec model in production environments to get the most_similar words to (for example) words in a user's query, or words in complete documents, on the fly. In such cases, querying the word2vec model becomes very cumbersome, often the slowest step in the pipeline. [1]

What I propose is a model_to_dict method, to be used right at the end of the word2vec pipeline. It would find and store, before production, the most similar words for every word in the trained vocabulary.

The most similar words could come from a custom user list, as in #1229, and we could allow the user to define a custom preprocessing function that every most-similar word is passed through before being stored. Being a dict, query time will also be minimal, which is great for this purpose! Since the dict stores only words, its size should be comparable to a multiple of the vocabulary size. [2]

At the end of this, we would have a dictionary whose keys are the word2vec vocabulary and whose values are the most_similar words for each key. Words in the vocabulary whose most_similar list ends up empty would not be stored in the dict; this will happen often when a custom results list and a filter function are applied on top of the similarity cutoff.

[1] This is because we always compute the cosine distance from the query word to every word in the vocabulary before returning the topn most similar words. I can't think of a better way to do that, yet.
[2] Albeit a large multiple if users do not provide a proper preprocessing/filter function or use a small similarity cutoff. Maybe we should warn them about this.
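A minimal sketch of what such a precomputation could look like. model_to_dict, cutoff, and keep are hypothetical names from this proposal, and a small stub class stands in for a trained gensim model so the snippet runs on its own:

```python
import math

class StubKeyedVectors:
    """Minimal stand-in for a trained model's .wv, for illustration only."""
    def __init__(self, vectors):
        self.vectors = vectors            # word -> embedding (list of floats)
        self.index2word = list(vectors)   # vocabulary, as in gensim

    def most_similar(self, word, topn=10):
        # Cosine similarity of `word` against every other word in the vocab,
        # mirroring what gensim's most_similar does internally.
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        v = self.vectors[word]
        sims = [(w, cos(v, u)) for w, u in self.vectors.items() if w != word]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        return sims[:topn]

def model_to_dict(wv, topn=10, cutoff=0.0, keep=lambda w: True):
    """Precompute most-similar lists for the whole vocabulary.

    Words whose filtered result list is empty are not stored, as the
    proposal describes. `cutoff` is the similarity cutoff and `keep`
    the user-supplied pass/filter function.
    """
    out = {}
    for word in wv.index2word:
        sims = [(w, s) for w, s in wv.most_similar(word, topn=topn)
                if s >= cutoff and keep(w)]
        if sims:
            out[word] = sims
    return out

wv = StubKeyedVectors({
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.0, 1.0],
})
precalc = model_to_dict(wv, topn=2, cutoff=0.5)
# "cat" and "dog" are near-parallel vectors, so each keeps the other;
# "car" has no neighbour above the cutoff and is dropped from the dict.
```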

difficulty easy feature good first issue

Most helpful comment

I can see this being useful. However, it could take a lot of time/memory to compute. And, it seems like a 1-liner:

most_similars_precalc = {k: model.wv.most_similar(k) for k in model.wv.index2word}

(The variants would be slightly different if working with some subset of the vocabulary.)

So, this might be more appropriate as some examples in one of the documentation notebooks (with proper caveats about the time/memory cost of the every-word calculations).
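The subset variant mentioned above could look like the following. A tiny stub stands in for `model.wv` (a trained gensim model) so the snippet is self-contained; the precomputed similarity values are made up for illustration:

```python
class StubWV:
    """Stand-in for a trained model's .wv with canned similarity results."""
    index2word = ["cat", "dog", "car"]
    _sims = {
        "cat": [("dog", 0.9)],
        "dog": [("cat", 0.9)],
        "car": [("dog", 0.2)],
    }
    def most_similar(self, word, topn=10):
        return self._sims[word][:topn]

wv = StubWV()
query_words = {"cat", "car"}  # only precompute for these words

# Same one-liner shape, restricted to the subset of interest.
most_similars_precalc = {w: wv.most_similar(w)
                         for w in wv.index2word if w in query_words}
```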

All 4 comments


Can you add this to notebook @shubhvachher?

@menshikh-iv @tmylk @gojomo
"albeit a large multiple if users do not provide proper preprocessing pass function or have a small similarity cutoff. Maybe we can give them a warning about the same."
Is that required on the Jupyter notebook?
Does the method have to give some additional parameter as well?

