Gensim: Doc2Vec not optimal CPU utilization on high number of cores with all-in-memory corpus

Created on 9 Oct 2018 · 4 comments · Source: RaRe-Technologies/gensim

Description

I know that there are a few related issues currently open or closed, but I couldn't find specific information for the case where the whole corpus is loaded into memory, sorry if I missed something.

When I try to run Doc2Vec with an all-in-memory corpus on a 64-CPU machine with workers=40, I see only partial CPU usage.

The main process uses ~400% CPU, and there are 40 other processes each using ~10% CPU. In total this corresponds to roughly 8 fully used cores.

This limitation is a blocker for me: I want to train document embeddings on a corpus of ~25M documents, and at the current speed training just 1 epoch would take about a day.

Steps/Code/Corpus to Reproduce

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

model = Doc2Vec(vector_size=300, window=15, max_vocab_size=None, min_count=5, hs=0, negative=5, ns_exponent=0.75, sample=10e-5)

# tagged_docs is a list of TaggedDocument instances loaded in memory, not an iterator or generator
model.build_vocab(tagged_docs, progress_per=10000)
model.workers = 40
model.train(tagged_docs, total_examples=len(tagged_docs), epochs=1)

Expected Results


I expect near-ideal CPU utilization.

Actual Results

It seems to be using only 8 of the cores in full.

2018-10-08 17:42:26,205 : INFO : training on a 30200106 raw words (16766153 effective words) took 62.0s, 270407 effective words/s

Versions

Linux-3.10.0-862.11.6.el7.x86_64-x86_64-with-redhat-7.5-Maipo
Python 3.6.6 | packaged by conda-forge | (default, Jul 26 2018, 09:53:17)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
NumPy 1.15.2
SciPy 1.1.0
gensim 3.6.0
FAST_VERSION 1

All 4 comments

Hello @manuyavuz, the reason is an issue with the GIL (frequent context switching). The solution is to use our new corpus_file feature: see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb for examples (it covers w2v, fasttext & doc2vec). This fully gets rid of the GIL bottleneck and utilizes the CPUs optimally.
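
For reference, a minimal sketch of what corpus_file-based training looks like for Doc2Vec, assuming the file already exists (the file name and parameter values are placeholders; the expected format is one document per line with whitespace-separated tokens, and documents are tagged by their line number):

from gensim.models.doc2vec import Doc2Vec

# With corpus_file, each worker thread reads its own slice of the text file
# outside the GIL, so adding workers scales much better than with an iterable.
model = Doc2Vec(
    corpus_file="my_large_file.txt",  # placeholder path
    vector_size=300,
    window=15,
    min_count=5,
    negative=5,
    sample=1e-4,
    workers=40,
    epochs=1,
)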

Hi @menshikh-iv, I've actually looked into that new feature, but it requires corpus_file to be a file path string or a file object.

Is it possible to use it with a preprocessed list of documents existing in memory?

The thing is, I have a machine with abundant RAM, so I want to make full use of it by holding the whole dataset in memory to reduce file I/O.

Does the new corpus_file-based technique also utilize the available RAM as much as possible?

Thanks

Is it possible to use it with a preprocessed list of documents existing in memory?

No, because the __iter__ method requires the GIL anyway.

The thing is, I have a machine with abundant RAM, so I want to make full use of it by holding the whole dataset in memory to reduce file I/O.

In this case (if you're on a Linux machine), you should (a sketch follows the list below):

  1. Store your corpus as text in the required format
  2. Do something like cat my_large_file.txt > /dev/null (the _magic_ of Linux caching)
  3. Start training with corpus_file="my_large_file.txt"
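
To illustrate the three steps, here is a rough sketch, assuming tagged_docs is the in-memory list of TaggedDocument objects from above and the file name is a placeholder:

# 1. Write the preprocessed corpus as plain text: one document per line,
#    tokens separated by spaces (with corpus_file, documents are tagged by
#    line number, so custom string tags are not preserved).
with open("my_large_file.txt", "w", encoding="utf-8") as fout:
    for doc in tagged_docs:
        fout.write(" ".join(doc.words) + "\n")

# 2. Read the file once so Linux keeps it in the page cache
#    (equivalent to: cat my_large_file.txt > /dev/null).
with open("my_large_file.txt", "rb") as fin:
    while fin.read(1 << 20):
        pass

# 3. Then train with corpus_file="my_large_file.txt", as in the sketch above.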

Does the new corpus_file-based technique also utilize the available RAM as much as possible?

No, this uses a fixed amount of RAM (a w2v model is not RAM-expensive); the needed RAM size mostly depends on (a rough estimate follows below):

  • vocabulary size
  • embedding dimensions
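
As a rough back-of-envelope illustration (all numbers here are hypothetical; note that for Doc2Vec the per-document vectors add a third term on top of the two factors above):

# Rough estimate of the model's weight-matrix RAM (float32, 4 bytes each).
vocab_size  = 2_000_000     # hypothetical vocabulary size
vector_size = 300           # embedding dimensions
num_docs    = 25_000_000    # Doc2Vec also stores one vector per document

word_vectors   = vocab_size * vector_size * 4   # input embeddings
output_weights = vocab_size * vector_size * 4   # negative-sampling output layer
doc_vectors    = num_docs   * vector_size * 4   # per-document vectors

print((word_vectors + output_weights + doc_vectors) / 1024**3)  # ~32.4 GiB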

In this case (if you're on a Linux machine), you should

Store your corpus as text in the required format
Do something like cat my_large_file.txt > /dev/null (the magic of Linux caching)
Start training with corpus_file="my_large_file.txt"

I liked this trick, thanks!

But still, I will have to read my whole corpus once, preprocess it, and write it to a file before starting this process, I guess. But that's OK.

Thanks so much again! I saw that the training speed jumped from ~250,000 words/s to ~800,000 words/s using corpus_file.
