Word2Vec online training not consistent.
Initially I wanted to verify whether this is due to a version upgrade, but the behavior is similar across the two versions I checked [0.13.3 and 3.4.0].
The setup is: train a Word2Vec model on a corpus, make a copy of it, and call the train method again, this time providing some new sentences (with no out-of-vocabulary words) to train on. The difference between the two models is expected to be non-zero, since some vectors should be updated by the additional training, but the result is not consistent across corpora even when the initial Word2Vec models are built the same way.
Would this be an issue with the code, or is it expected behavior for Word2Vec?
When the created Word2Vec model is trained on some new sentences (in my case without any OOV words), some of the vectors should change in value.
While working on the reproduction script for different gensim versions, I observed that online training is not consistent across corpora. Here are the outputs for two different corpora, based on the code in the gist linked below:
Output for gensim 3.4.0 and the Text8 corpus:
('Non zero vectors present:', True)
('Total unique words updated:', 26)
Output for gensim 3.4.0 and the Newsgroup corpus:
('Non zero vectors present:', False)
('Total unique words updated:', 0)
https://gist.github.com/sairampillai/d0448bdc57999eb38016f0d6cd32defd
Windows-10-10.0.14393
'Python', '2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)]'
'NumPy', '1.14.1'
'SciPy', '1.0.0'
'gensim', '3.4.0'
'FAST_VERSION', 0
Hi @sairampillai, can you share your 20newsgroup-corpus.txt please? I want to reproduce this behavior (as I understand from your report, you expected some changes on the 20-news dataset too, but nothing changed, am I right?)
@menshikh-iv Yes, that is correct. I am not seeing any change in the 20-news dataset. There is a piece of commented-out code in the gist that I used to create the word2vec.
Here is the corpus text that I have been using:
20newsgroup-corpus.txt
@sairampillai thanks, I found the reason for this behavior: a mistake in your gist code
sentences = [item for item in model.wv.vocab.keys()]  # a flat list of all words in the vocab (i.e. one "sentence", not sentences)
from random import shuffle
shuffle(sentences)
no_of_examples = 500
model2.train(sentences[0:no_of_examples], ... )  # incorrect: you still pass a flat list, but "train" accepts sentences, not one sentence; check the documentation
# model2.train([sentences[0:no_of_examples]], ... )  # <- this is the correct variant