Gensim: Train Word2Vec in multiple batches

Created on 1 Dec 2016 · 8 comments · Source: RaRe-Technologies/gensim

Hi, I am looking to train a word2vec model on a vast corpus of documents (approximately 250GB). The machine that will host the experiments has 128GB of RAM, so it is impossible to hold the whole corpus in memory at once.

I had the same issue in the past with less data and I found this blog post (http://rutumulkar.com/blog/2015/word2vec), which suggested a solution, but it was not part of the main distribution in version 0.12.4.

I noticed in the gensim source that, in the current version 0.13.3, the _build_vocab()_ function supports an _update_ parameter.

So I wrote a piece of code like this:

```python
# LOAD FIRST BATCH/FOLDER
loader = DataLoader(folder=folder)
sentences = loader.load_corpus()
# TRAIN INITIAL MODEL
model = Word2Vec(min_count=20, workers=16, size=emb_size, sg=1, negative=5, window=window)
model.build_vocab(sentences)
model.train(sentences)
print('vocabulary size:', len(model.index2word))
sentences = None
# LOAD SECOND BATCH/FOLDER
loader = DataLoader(parent_folder=folder2)
sentences = loader.load_corpus()
# UPDATE VOCABULARY AND TRAIN MODEL
model.build_vocab(sentences, update=True)
model.train(sentences)
print('vocabulary size:', len(model.index2word))
model.save_word2vec_format(filename, binary=True)
```

Running with 2 folders (16 documents, 9 documents), I had the following output:

```
vocabulary size: 473
vocabulary size: 482
```

I have the following questions based on the above:

Beginning the second round of training:

  • Are the first 473 embeddings initialized as they came out from the first round of training?
  • Will they improve further using the sentences of the second folder or are they "frozen" during the second round?
  • I tried to save the model and load it again between the two rounds, but I got the following error:
    `AttributeError: 'Word2Vec' object has no attribute 'syn1neg'`
    Is it possible to do so, i.e. save, load, and then extend/improve my model?

  • Can I have an explanation of the other available parameters of the _build_vocab()_ function?

```python
def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=False):
```

  • Is there any simple rule to estimate the number of batches I will need to load separately in order to avoid memory errors?

All 8 comments

Hi @KiddoThe2B,

such an open-ended question is better suited for the mailing list.

There's no need for using the vocabulary-expansion (kinda-but-not-really 'online') 'update' feature here. Gensim doesn't require all documents to be in RAM - just a corpus that is 'Iterable', and thus can present all its examples each time Word2Vec needs them. (Once for build_vocab, then again iter times for training.) You should be able to change or replace your DataLoader class to only stream examples from disk, and that will both be most-memory-efficient, and also give the best vectors (by not confining some examples/words to only training early, then being 'diluted' by the later 'update').

So @gojomo, you suggest building a dictionary object from the whole corpus on my own, then calling _build_vocab()_ once and finally calling _train()_ multiple times with different data from disk?

@KiddoThe2B see https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/ and https://rare-technologies.com/word2vec-tutorial/.

Like Lev said, not a bug and discussion more suited for the mailing list.

@KiddoThe2B - No, make your corpus stream item-by-item from disk as an 'Iterable' object. Ask on list if you need more clarification.

I read all the suggested links, then I just renamed `load_corpus()` to `__iter__()` and replaced `sentences.append(sentence)` with `yield sentence` in my DataLoader class, and it works like a charm!
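Why that small change works is worth spelling out: a one-shot generator is exhausted after a single pass, whereas an object that defines `__iter__` hands out a fresh generator on every pass, and gensim scans the corpus multiple times (once for `build_vocab`, then `iter` more times for training, as noted earlier in the thread). A minimal illustration with toy data (the names are illustrative only):

```python
def one_shot():
    """A bare generator: usable for exactly one pass."""
    yield ['hello']
    yield ['world']

g = one_shot()
first = list(g)    # first pass sees both sentences
second = list(g)   # second pass sees nothing: the generator is spent

class RestartableCorpus:
    """Defining __iter__ returns a fresh generator per pass."""
    def __iter__(self):
        yield ['hello']
        yield ['world']

c = RestartableCorpus()
assert len(first) == 2 and len(second) == 0
assert list(c) == list(c) == [['hello'], ['world']]
```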

So I figured out the problem; everything seems fine. I should have asked for support on the mailing list, but I'll keep that in mind for the future.

Last question: _Does lazy loading from iterable objects affect Word2Vec's efficiency? I checked the vocabulary and it is the same, but what about training? I suppose not, if I understand it right..._

Thank you all! I found the solution, and along the way I came to understand generators and iterators in Python better :)

No problem. Re. your question: use the mailing list.

