@piskvorky before stating my issue, I wish to express my deep gratitude for creating this awesome port of word2vec. Humbled by the effort and depth of thinking (this extends to Tomas et al.).
Coming to the issue: from the code and from other forum posts, I gather that model.build_vocab is single-core but model.train should run on multiple cores. However, on my 32-core EC2 instance it uses only 1 core when training on a 2B+ word dataset. I have attached a screenshot of top and the training output. I am unsure where else to look. Can you please help?
Look at PID 31696, which is the process undergoing training.

Training output from the word2vec logger:

My code -

def train_model(path, do_stem):
    sentences = MySentences(path, stem_line=do_stem)  # streaming sentence iterator
    model = word2vec.Word2Vec(workers=32, size=300, min_count=20, window=10, sample=1e-3)
    logger.info('Created dummy word2vec model instance')

    # building vocabulary, saving after vocabulary is built - this runs on 1 CORE
    model.build_vocab(sentences)
    vocab_name = '%s_word2vec.vocab' % u.rand_eng_word(words=1)
    try:
        model.save(vocab_name)
    except Exception:
        logger.info('ERROR: Unable to save ' + vocab_name)
    logger.info('Vocabulary built')

    # training model - why is this running on 1 CORE ???
    # model.train(sentences, queue_factor=32)
    model.train(sentences)
    logger.info('Training neural network done')

    model_name = '%s_word2vec.bin' % u.rand_eng_word(words=1)
    model.save(model_name)
    logger.info('Model trained & saved as ' + model_name)
    return
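(For context: the real MySentences isn't shown in this issue, but a typical streaming iterator of this kind looks roughly like the sketch below, assuming one sentence per line in files under a directory; the stem_line handling is an assumption.)

import os

class MySentences(object):
    # Rough sketch of a streaming sentence iterator; not the poster's actual code.
    def __init__(self, dirname, stem_line=False):
        self.dirname = dirname
        self.stem_line = stem_line  # assumed flag for optional stemming

    def __iter__(self):
        # Re-opens the files on every pass, so word2vec can iterate the corpus
        # multiple times (once for build_vocab, then again for training).
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.lower().split()  # per-line preprocessing goes here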
System specifications - my configurations:
gensim version = '0.12.4'
scipy version = '0.17.0'
numpy version = '1.11.0'
gensim.models.word2vec.FAST_VERSION = 1
scipy.show_config()
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_threads_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('NO_ATLAS_INFO', -1)]
    language = f77
    include_dirs = ['/usr/include']
blas_opt_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('HAVE_CBLAS', None), ('NO_ATLAS_INFO', -1)]
    language = c
    include_dirs = ['/usr/include']
openblas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('HAVE_CBLAS', None), ('NO_ATLAS_INFO', -1)]
    language = c
    include_dirs = ['/usr/include']
lapack_opt_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('NO_ATLAS_INFO', -1)]
    language = f77
    include_dirs = ['/usr/include']
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
My guess is you're running the non-optimized Python word2vec (no C).
What does print gensim.models.word2vec.FAST_VERSION say?
@piskvorky I have put FAST_VERSION above - please see the "my configurations" section above for more. I also have Cython installed.
gensim.models.word2vec.FAST_VERSION = 1
>>> import cython
>>> cython.__version__
'0.24'
>>> import gensim
>>> print gensim.models.word2vec.FAST_VERSION
1
IMPORTANT NOTE: another behavior - when I load all the sentences into RAM and init word2vec like so, all 32 cores are at work:
model = word2vec.Word2Vec(cleaned_data, workers=u.ncores(), size=300, min_count=20, window=10, sample=1e-3)
But when I stream the data and feed it as shown above, it uses only 1 core. For a corpus of 2B+ words it is taking forever...
You don't need cython for gensim. FAST_VERSION==1 should be fine.
Maybe your sentence iterator is too slow? What is the streaming speed, just going over the sentences, without any word2vec?
@piskvorky Thanks again for taking time to reply.
What I don't get is: having specified workers=32, shouldn't gensim spawn 32 processes (via multiprocessing)? Those 32 would then start reading off the sentence iterator, which could potentially cause an I/O bottleneck.
In my case, only 1 process is reading off the sentence iterator. Why only 1? Should I/O speed even matter here? Why doesn't gensim word2vec spawn 32 workers?
Btw, this EC2 instance has a 500GB SSD, and I have put these huge files on local disk to get high streaming speed. My guess is that I/O should not be a problem.
Word2vec uses multiple threads, not multiple processes. One thread is dedicated to reading your input iterator; see parallelizing word2vec in gensim.
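Roughly, the training loop follows a single-producer, multi-consumer pattern like the sketch below (a simplified illustration for intuition, not gensim's actual code):

import threading
try:
    import Queue as queue  # Python 2, as used in this thread
except ImportError:
    import queue  # Python 3

def train_sketch(sentences, workers=4):
    job_queue = queue.Queue(maxsize=2 * workers)

    def producer():
        # ONE thread walks the (possibly slow) corpus iterator...
        for sentence in sentences:
            job_queue.put(sentence)
        for _ in range(workers):
            job_queue.put(None)  # sentinel: tell each consumer to stop

    def consumer():
        # ...while many threads do the actual training math.
        while True:
            job = job_queue.get()
            if job is None:
                break
            # the real code would run the optimized training routine on `job` here

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

If the single producer cannot fill the queue as fast as the consumers drain it, the worker threads sit idle and you see one busy core.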
@srikar2097, can you measure how long the empty iterator pass takes, just to be sure? for sent in corpus: pass
@gojomo , can you think of any reason why the same corpus should parallelize differently with corpus_iterator vs list(corpus_iterator)?
@piskvorky I do a little bit of pre-processing on each sentence: lowercasing, removing special chars, etc. Here are the benchmarks for a sample of 625K sentences -
Benchmark code -

from timeit import default_timer as timer  # assumed import; not shown in the original snippet

start = timer()
for line in open("625k_sample.txt"):
    # clean_data(line)  # preprocessing step; uncommented for the second run
    pass
end = timer()
print(end - start)
No preprocessing = 0m1.260s
With preprocessing (clean_data uncommented) = 28m4.144s
With no preprocessing it takes just 1.26 seconds to run through 625K sentences; with preprocessing it takes 28 minutes. Either way, this proves there is no I/O bottleneck. If anything, the 28 minutes with preprocessing reflects it running on only 1 thread - with 32 threads it would probably take about a minute (28m/32).
It looks like your input sentence iterator is the bottleneck then. It's too slow to feed the word2vec training threads, so they're starved for input.
I'd suggest speeding up your corpus preprocessing code. Or perhaps storing an already-preprocessed corpus and using that, if possible.
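For example, something along these lines pays the preprocessing cost once up front (clean_data stands in for your own cleaning function, assumed here to return a list of tokens; file names are made up):

# One-time pass: write the cleaned corpus to disk.
with open("625k_sample_clean.txt", "w") as out:
    for line in open("625k_sample.txt"):
        out.write(" ".join(clean_data(line)) + "\n")

# Training-time iterator over the already-clean file: iteration is now
# close to raw I/O speed, so the training threads stay busy.
class CleanedSentences(object):
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        for line in open(self.path):
            yield line.split()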
I think this explains the discrepancy between list(sentences) and sentences, so I'm closing this. Let me know if you have other questions!
I am having the exact same issue. To make things worse, in my case the build_vocab phase takes 3 days to complete.
Why is it not possible to create a multi-core version of the corpus iterator? At the end of the day, reading a list of files should be highly parallelizable.
@ahmedahmedov Of course it's possible! But it's not dirt-simple, and would generally put more constraints on the source data format than the simple single-iterator approach. And the overhead involved with combining counts from multiple threads can be a bit tricky to deal with. Still, it's a wishlist item – see #400 – that someone might do eventually.
If your data source is especially appropriate to handle in parallel – such as a single uncompressed file, or many equal-sized shards, or similar – you could always reimplement the scan_vocab() step within build_vocab(), simply ensuring your end-state is the same as the next steps expect. And if your approach seems generalizable for others, you could contribute it back to gensim.
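As a rough illustration of that sharded-counting idea (not gensim code; it assumes your corpus is already split into many shard files, and the merged totals would still have to be installed wherever your gensim version's scan_vocab() leaves its raw counts):

import glob
from collections import Counter
from multiprocessing import Pool

def count_shard(path):
    # Each worker process counts one shard independently.
    counts = Counter()
    for line in open(path):
        counts.update(line.split())
    return counts

def parallel_raw_vocab(shard_glob, processes=4):
    # Merging the per-shard counts is the tricky part mentioned above.
    totals = Counter()
    pool = Pool(processes)
    try:
        for shard_counts in pool.imap_unordered(count_shard, glob.glob(shard_glob)):
            totals.update(shard_counts)
    finally:
        pool.close()
        pool.join()
    return totals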
(If your 3-day elapsed time isn't strictly due to a giant dataset, but some other bottleneck, moving to an SSD or ensuring that no swapping occurs at any point during the process might help, or preprocessing to eliminate excess tokens of little interest. Also note you can save() a model after the scan_vocab() finishes, to retain the raw counts, or after build_vocab(), to retain the min_count-trimmed vocabulary, and then re-load() it multiple times and mutate the metaparameter properties to try different training options after paying the vocabulary-discovery cost only once. )
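In code, that reuse pattern looks roughly like this (a sketch against the 0.12.x API; the file names are made up, and only metaparameters that don't change the already-allocated weights, such as alpha, can be safely mutated this way - size cannot):

# Pay the vocabulary-discovery cost once...
model = word2vec.Word2Vec(workers=32, size=300, min_count=20, window=10, sample=1e-3)
model.build_vocab(sentences)
model.save('base_with_vocab.model')

# ...then reload it for each training experiment.
for alpha in (0.025, 0.05):
    m = word2vec.Word2Vec.load('base_with_vocab.model')
    m.alpha = alpha  # mutate metaparameters before training
    m.train(sentences)
    m.save('trained_alpha_%s.model' % alpha)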