I know there are a few related issues, open and closed, but I couldn't find specific information for the case where the whole corpus is loaded into memory; sorry if I missed something.
When I run Doc2Vec with an all-in-memory corpus on a 64-CPU machine with workers=40, I see only partial CPU usage.
The main process uses ~400% CPU, and there are 40 other processes using ~10% CPU each; in total this corresponds to about 8 fully used cores.
This limitation is a blocker for me: I want to train document embeddings on a corpus of ~25M documents, and at the current speed a single epoch would take about a day.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
model = Doc2Vec(vector_size=300, window=15, max_vocab_size=None, min_count=5, hs=0, negative=5, ns_exponent=0.75, sample=10e-5)
# tagged_docs is a list of TaggedDocument instances fully loaded in memory, not an iterator or generator
model.build_vocab(tagged_docs, progress_per=10000)
# 40 worker threads on a 64-CPU machine
model.workers = 40
model.train(tagged_docs, total_examples=len(tagged_docs), epochs=1)
I expect near-ideal CPU utilization, but it seems to be using only about 8 cores in full.
2018-10-08 17:42:26,205 : INFO : training on a 30200106 raw words (16766153 effective words) took 62.0s, 270407 effective words/s
Linux-3.10.0-862.11.6.el7.x86_64-x86_64-with-redhat-7.5-Maipo
Python 3.6.6 | packaged by conda-forge | (default, Jul 26 2018, 09:53:17)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
NumPy 1.15.2
SciPy 1.1.0
gensim 3.6.0
FAST_VERSION 1
Hello @manuyavuz, the reason is the GIL (frequent context switching). The solution is to use our new corpus_file feature: see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb for more examples (it covers w2v, fasttext & doc2vec). This fully avoids the GIL and utilizes the CPUs optimally.
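For reference, a minimal sketch of what corpus_file-based training looks like (assumptions: gensim >= 3.6, and corpus.txt is a hypothetical plain-text file with one pre-tokenized document per line, tokens separated by whitespace; with corpus_file, each document's tag is assigned from its line number, as with TaggedLineDocument):

from gensim.models.doc2vec import Doc2Vec

# corpus.txt: one pre-tokenized document per line, tokens separated by spaces;
# each document's tag is simply its line number.
model = Doc2Vec(
    corpus_file="corpus.txt",  # hypothetical file name
    vector_size=300, window=15, min_count=5,
    hs=0, negative=5, ns_exponent=0.75, sample=10e-5,
    workers=40, epochs=1,  # training runs immediately inside the constructor
)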
Hi @menshikh-iv, I've actually looked into that new feature, but it requires corpus_file to be a file path string or a file object.
Is it possible to use it with a preprocessed list of documents existing in memory?
The thing is, I have a machine with abundant RAM, so I want to fully utilize it by holding the whole dataset in memory to reduce file IO.
Does the new corpus_file-based technique also utilize existing RAM as much as possible?
Thanks
Is it possible to use it with a preprocessed list of documents existing in memory?
No, because the __iter__ method requires the GIL anyway.
The thing is, I have a machine with abundant RAM, so I want to fully utilize it by holding the whole dataset in memory to reduce file IO.
In this case (if you use a Linux machine), you should (a rough sketch of the first two steps is given below):
Store your corpus in the needed format as a text file
Do something like cat my_large_file.txt > /dev/null (the magic of Linux caching)
Start training with corpus_file="my_large_file.txt"
Does the new corpus_file-based technique also utilize existing RAM as much as possible?
No, it uses a fixed amount of RAM (a w2v model is not RAM-expensive); the needed RAM size mostly depends on the vocabulary size.
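To make the first two steps concrete, a rough sketch, assuming the in-memory tagged_docs list from the snippet above (documents already tokenized) and the hypothetical file name my_large_file.txt:

import os

# Write each document's tokens on its own line, space-separated (LineSentence format).
# Note: with corpus_file, a document's tag becomes its line number, so keep a separate
# mapping if you need to look up vectors by your original tags afterwards.
with open("my_large_file.txt", "w", encoding="utf-8") as fout:
    for doc in tagged_docs:
        fout.write(" ".join(doc.words) + "\n")

# Pull the file into the OS page cache so training reads it from RAM rather than disk.
os.system("cat my_large_file.txt > /dev/null")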
I liked this trick, thanks!
But still, I will have to read my whole corpus once, preprocess it, and write it to a file before starting this process, I guess. But that's OK.
Thanks so much again! I saw that training throughput jumped from ~250,000 words/s to ~800,000 words/s using corpus_file.