Hi, I am looking to train a word2vec model on a vast corpus of documents (approximately 250 GB). The machine that will host the experiments has 128 GB of RAM, so it is impossible to train the model with all the data loaded at once.
I had the same issue in the past with less data, and I found this blog post (http://rutumulkar.com/blog/2015/word2vec), which suggested a solution, but it was not part of the main distribution in version 0.12.4.
I observed in the gensim code that, in the current version (0.13.3), the `build_vocab()` function supports an `update` parameter.
So I wrote a piece of code like this:
```python
# LOAD FIRST BATCH/FOLDER
loader = DataLoader(folder=folder)
sentences = loader.load_corpus()

# TRAIN INITIAL MODEL
model = Word2Vec(min_count=20, workers=16, size=emb_size, sg=1,
                 negative=5, window=window)
model.build_vocab(sentences)
model.train(sentences)
print('vocabulary size:', len(model.index2word))
sentences = None

# LOAD SECOND BATCH/FOLDER
loader = DataLoader(parent_folder=folder2)
sentences = loader.load_corpus()

# UPDATE VOCABULARY AND TRAIN MODEL
model.build_vocab(sentences, update=True)
model.train(sentences)
print('vocabulary size:', len(model.index2word))

model.save_word2vec_format(filename, binary=True)
```
Running with 2 folders (16 documents and 9 documents), I got the following output:

```
vocabulary size: 473
vocabulary size: 482
```
I have the following questions based on the above:

1. Beginning the second round of training: I tried to save the model and load it again between the two rounds, but I got the following error:

   ```
   AttributeError: 'Word2Vec' object has no attribute 'syn1neg'
   ```

   Is it possible to do this, i.e. save, reload, and then extend/improve my model?

2. Could I have an explanation of the other available parameters of the `build_vocab()` function?

   ```python
   def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=False):
   ```
Hi @KiddoThe2B,
Such an open-ended question is better suited to the mailing list.
There's no need to use the vocabulary-expansion (kinda-but-not-really 'online') `update` feature here. Gensim doesn't require all documents to be in RAM; it just needs a corpus that is iterable and can therefore present all its examples each time Word2Vec asks for them (once for `build_vocab()`, then `iter` more times for training). You should be able to change or replace your `DataLoader` class so that it only streams examples from disk. That is both the most memory-efficient approach and gives the best vectors (by not confining some examples/words to training only early on, then being 'diluted' by the later update).
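The streaming corpus described above could be sketched roughly like this. The class name, the one-sentence-per-line file layout, and the whitespace tokenization are all assumptions for illustration, not details from this thread:

```python
import os

class FolderCorpus:
    """Stream tokenized sentences from every .txt file in a folder.

    Because __iter__ reopens the files every time it is called, the
    corpus supports repeated passes (one for build_vocab, then `iter`
    more for training) without ever being fully loaded into RAM.
    """

    def __init__(self, folder):
        self.folder = folder

    def __iter__(self):
        for name in sorted(os.listdir(self.folder)):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(self.folder, name), encoding="utf-8") as fh:
                for line in fh:
                    tokens = line.split()
                    if tokens:          # skip blank lines
                        yield tokens
```

An object like this would then be passed in place of the in-memory `sentences` list, e.g. `model.build_vocab(FolderCorpus(folder))` followed by `model.train(FolderCorpus(folder))`, with no `update=True` batching needed.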
So @gojomo, you suggest building a dictionary object from the whole corpus on my own, then calling `build_vocab()` once, and finally calling `train()` multiple times with different data from disk?
@KiddoThe2B see https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/ and https://rare-technologies.com/word2vec-tutorial/.
Like Lev said, not a bug and discussion more suited for the mailing list.
@KiddoThe2B - No, make your corpus stream item-by-item from disk as an 'Iterable' object. Ask on the list if you need more clarification.
I read all the suggested links, then simply renamed `load_corpus()` to `__iter__()` and replaced `sentences.append(sentence)` with `yield sentence` in my `DataLoader` class, and it works like a charm!
So I figured out the problem; it seems everything is fine. I should have asked for support on the mailing list instead, and I'll keep that in mind for the future.
Last question: _does lazy loading from iterable objects affect Word2Vec's efficiency? I checked the vocabulary and it is the same, but what about training? I suppose not, if I understand correctly..._
Thank you all! I found the solution, but I also understood better generators and iterators in python :)
No problem. Re. your question: use the mailing list.