Word2Vec online training not consistent.
Initially I wanted to verify whether this is due to a version upgrade, but the behavior is similar across the two versions I checked [0.13.3 and 3.4.0].
The setup is: train a Word2Vec model on a corpus, make a copy of it, and call the train method again, this time providing some new sentences (with no out-of-vocabulary words) to train on. The difference between the two models is expected to be non-zero, since some vectors should be updated by the additional training, but the result is not consistent across corpora even when the initial Word2Vec models are built the same way.
Would this be an issue with the code, or is it expected behavior for Word2Vec?
When the created Word2Vec model is trained on some new sentences (in my case without any OOV words), some of the vectors should change in value.
While working on the reproduction script for different gensim versions, I observed that online training is not consistent across corpora. Here are the outputs for two different corpora, based on the code in the gist linked below:
Output for gensim 3.4.0 and the Text8 corpus:
('Non zero vectors present:', True)
('Total unique words updated:', 26)
Output for gensim 3.4.0 and the Newsgroup corpus:
('Non zero vectors present:', False)
('Total unique words updated:', 0)
https://gist.github.com/sairampillai/d0448bdc57999eb38016f0d6cd32defd
Windows-10-10.0.14393
'Python', '2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)]'
'NumPy', '1.14.1'
'SciPy', '1.0.0'
'gensim', '3.4.0'
'FAST_VERSION', 0
Hi @sairampillai, can you share your 20newsgroup-corpus.txt please? I want to reproduce this behavior (as I understand from your report, you expected some changes on the 20-news dataset too, but nothing changed, am I right?)
@menshikh-iv Yes, that is correct. I am not seeing any change in the 20-news dataset. There is a piece of commented-out code in the gist that I used to create the word2vec.
Here is the corpus text that I have been using:
20newsgroup-corpus.txt
@sairampillai thanks, I found the reason for this behavior: a mistake in your gist code
sentences = [item for item in model.wv.vocab.keys()]  # a flat list of all words in the vocab (i.e. one "sentence", not sentences)
from random import shuffle
shuffle(sentences)
no_of_examples = 500
model2.train(sentences[0:no_of_examples], ... )  # incorrect: you still pass a flat list, but "train" accepts sentences, not one sentence; check the documentation
# model2.train([sentences[0:no_of_examples]], ... )  # <- this is the correct variant