Hi, I am looking to train a word2vec model on a vast corpus of documents (approximately 250 GB). The machine that will host the experiments has 128 GB of RAM, so it is impossible to train the model with all the data loaded at once.
I had the same issue in the past with less data, and I found this blog post (http://rutumulkar.com/blog/2015/word2vec), which suggested a solution, but it was not part of the main distribution in version 0.12.4.
I observed in the gensim code that, in the current version (0.13.3), the `build_vocab()` function supports an `update` parameter.
So I wrote a piece of code like this:
```python
# LOAD FIRST BATCH/FOLDER
loader = DataLoader(folder=folder)
sentences = loader.load_corpus()

# TRAIN INITIAL MODEL
model = Word2Vec(min_count=20, workers=16, size=emb_size, sg=1,
                 negative=5, window=window)
model.build_vocab(sentences)
model.train(sentences)
print('vocabulary size:', len(model.index2word))
sentences = None

# LOAD SECOND BATCH/FOLDER
loader = DataLoader(parent_folder=folder2)
sentences = loader.load_corpus()

# UPDATE VOCABULARY AND TRAIN MODEL
model.build_vocab(sentences, update=True)
model.train(sentences)
print('vocabulary size:', len(model.index2word))

model.save_word2vec_format(filename, binary=True)
```
Running with 2 folders (16 documents and 9 documents), I got the following output:

```
vocabulary size: 473
vocabulary size: 482
```
I have the following questions based on the above:

1. Beginning the second round of training: I tried to save the model and load it again between the two rounds, but I got the following error:

   ```
   AttributeError: 'Word2Vec' object has no attribute 'syn1neg'
   ```

   Is it possible to do this, i.e. save, reload, and then extend/improve my model?

2. Could I have an explanation of the other available parameters of the `build_vocab()` function?

   ```python
   def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=False):
   ```
Hi @KiddoThe2B,
Such an open-ended question is better suited to the mailing list.
There's no need to use the vocabulary-expansion (kinda-but-not-really 'online') `update` feature here. Gensim doesn't require all documents to be in RAM; it just needs a corpus that is iterable and can therefore present all its examples each time Word2Vec asks for them (once for `build_vocab()`, then `iter` more times for training). You should be able to change or replace your `DataLoader` class so that it only streams examples from disk. That is both the most memory-efficient approach and gives the best vectors (by not confining some examples/words to training only early on, then being 'diluted' by the later update).
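The streaming corpus described above could be sketched roughly like this. The class name, the one-sentence-per-line file layout, and the whitespace tokenization are all assumptions for illustration, not details from this thread:

```python
import os

class FolderCorpus:
    """Stream tokenized sentences from every .txt file in a folder.

    Because __iter__ reopens the files every time it is called, the
    corpus supports repeated passes (one for build_vocab, then `iter`
    more for training) without ever being fully loaded into RAM.
    """

    def __init__(self, folder):
        self.folder = folder

    def __iter__(self):
        for name in sorted(os.listdir(self.folder)):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(self.folder, name), encoding="utf-8") as fh:
                for line in fh:
                    tokens = line.split()
                    if tokens:          # skip blank lines
                        yield tokens
```

An object like this would then be passed in place of the in-memory `sentences` list, e.g. `model.build_vocab(FolderCorpus(folder))` followed by `model.train(FolderCorpus(folder))`, with no `update=True` batching needed.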
So @gojomo, you suggest building a dictionary object from the whole corpus on my own, then calling `build_vocab()` once, and finally calling `train()` multiple times with different data from disk?
@KiddoThe2B see https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/ and https://rare-technologies.com/word2vec-tutorial/.
Like Lev said, not a bug and discussion more suited for the mailing list.
@KiddoThe2B - No, make your corpus stream item-by-item from disk as an 'Iterable' object. Ask on the list if you need more clarification.
I read all the suggested links, then simply renamed `load_corpus()` to `__iter__()` and replaced `sentences.append(sentence)` with `yield sentence` in my `DataLoader` class, and it works like a charm!
So I figured out the problem; it seems everything is fine. I should have asked for support on the mailing list instead, and I'll keep that in mind for the future.
Last question: _does lazy loading from iterable objects affect Word2Vec's efficiency? I checked the vocabulary and it is the same, but what about training? I suppose not, if I understand correctly..._
Thank you all! I found the solution, but I also understood better generators and iterators in python :)
No problem. Re. your question: use the mailing list.