Word2Vec does not run faster with more workers because of short sentence length:
When I use the raw text8 data, multiple cores work fine. But my corpus consists of short texts, where a single line contains only a few words; when I randomly split the text8 data into multiple lines (e.g. only 3-8 words per line), I found that adding more workers became useless.
Linux-3.10.0-229.7.2.el7.x86_64-x86_64-with-centos-7.1.1503-Core
('Python', '2.7.5 (default, Nov 20 2015, 02:00:19) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')
('gensim', '2.3.0')
('FAST_VERSION', 1)
When you gave it giant lines, you may have seen deceptive speed indicators: due to implementation limits in the optimized code, texts over 10,000 tokens are truncated to 10,000 tokens – with the rest ignored. (Throwing away lots of data can make things look very fast!)
When you give it small lines, it is internally batching them together for efficiency (though not in such a way that any context windows overlap the breaks you've supplied). But the rates you're seeing are probably a more accurate estimate of what it takes to train on all supplied words.
The inability of the code to fully utilize all cores, or even increase throughput, with more than 3-16 workers (no matter how many cores are available) is a known limitation, mostly due to the 'global interpreter lock' single-threading imposed on Python code, and perhaps somewhat due to the current architecture of a single corpus-reading thread handing work out to multiple worker threads. (Though, that 2nd factor can be minimized if your corpus iterable is relatively efficient, such as by working with data only in RAM or from fast volumes.)
See related discussion in issues like #1486, #1291, #532, & #336.
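If the silent truncation of over-long texts is a concern, one workaround (a minimal sketch, not part of gensim itself; the 10000 here just mirrors gensim's internal per-text limit) is to pre-split such texts before passing them in:

```python
MAX_TOKENS = 10000  # mirrors gensim's internal per-text word limit

def split_long_texts(texts, max_tokens=MAX_TOKENS):
    # Yield consecutive chunks of at most max_tokens tokens, so no
    # words are silently dropped by the internal truncation.
    for tokens in texts:
        for start in range(0, len(tokens), max_tokens):
            yield tokens[start:start + max_tokens]
```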
Thank you very much for your reply.
I have been looking into this problem for several days, and read the discussions in the issues above and in the Google groups, but I think my problem may be different from those previous issues, including:
https://github.com/RaRe-Technologies/gensim/issues/157
From the code in word2vec.py I see that the words are internally batched when I provide small lines, but the training speed and core usage are still different.
When I ran the raw text8 with 10 workers, the debug log snippet below shows a batch size of 10000 words and 1 sentence per job, at a speed of 806140 words/s.
Meanwhile I could see via htop that multiple cores were active.
2017-07-27 11:43:03,572 : DEBUG : queueing job #3354 (10000 words, 1 sentences) at alpha 0.01518
2017-07-27 11:43:03,572 : DEBUG : queueing job #3355 (10000 words, 1 sentences) at alpha 0.01518
2017-07-27 11:43:03,572 : INFO : PROGRESS: at 38.38% examples, 806140 words/s, in_qsize 60, out_qsize 1
2017-07-27 11:43:03,586 : DEBUG : queueing job #3356 (10000 words, 1 sentences) at alpha 0.01517
2017-07-27 11:43:03,588 : DEBUG : queueing job #3357 (10000 words, 1 sentences) at alpha 0.01517
2017-07-27 11:43:03,589 : DEBUG : queueing job #3358 (10000 words, 1 sentences) at alpha 0.01517
2017-07-27 11:43:03,617 : DEBUG : queueing job #3359 (10000 words, 1 sentences) at alpha 0.01517
But when I split text8 into multiple lines, as the log below shows, the speed is only about 169806 words/s.
2017-07-27 14:09:02,062 : DEBUG : queueing job #758 (999 words, 385 sentences) at alpha 0.02478
2017-07-27 14:09:02,065 : DEBUG : queueing job #759 (994 words, 356 sentences) at alpha 0.02478
2017-07-27 14:09:02,068 : DEBUG : queueing job #760 (1000 words, 358 sentences) at alpha 0.02478
2017-07-27 14:09:02,068 : INFO : PROGRESS: at 0.82% examples, 169806 words/s, in_qsize 40, out_qsize 0
2017-07-27 14:09:02,071 : DEBUG : queueing job #761 (998 words, 351 sentences) at alpha 0.02478
2017-07-27 14:09:02,077 : DEBUG : queueing job #762 (997 words, 371 sentences) at alpha 0.02478
2017-07-27 14:09:02,081 : DEBUG : queueing job #763 (997 words, 352 sentences) at alpha 0.02478
Though it won't account for the full difference, note that rate timings from deeper into training, or for the full training run, can be more stable than rates at the very beginning, before all threads are active and CPU caches are warm.
How many cores does your system have?
Which batch-size(s) are you tuning?
Are you splitting the lines in-memory on-the-fly, or once to a separate line-broken file on disk?
Are you sure you didn't do something else, in the 2nd case, to force smaller (1000-word) training batches? (The default of 10000 would mean those words-totals per job should be a lot closer to 10000.)
(It's tough to be sure of everything that varies between your tests without the full code.)
Thank you for your patience. I didn't report these details clearly, and I'm sorry for that.
I cleaned up my test code and put it in the gist below, with no parameter tuning, so you can see the differences:
https://gist.github.com/knighter/57a3ce26114b071c2287c84c355dfec5
comparison in short:
text8, 1 worker:
2017-07-28 11:45:42,161 : INFO : PROGRESS: at 49.36% examples, 131211 words/s, in_qsize 2, out_qsize 0
text8, 20 workers:
2017-07-28 11:50:29,296 : INFO : PROGRESS: at 49.84% examples, 875161 words/s, in_qsize 39, out_qsize 0
text8_split, 1 worker:
2017-07-28 11:54:01,222 : INFO : PROGRESS: at 49.74% examples, 224895 words/s, in_qsize 1, out_qsize 0
text8_split, 20 workers:
2017-07-28 11:59:06,013 : INFO : PROGRESS: at 49.75% examples, 239028 words/s, in_qsize 0, out_qsize 0
Thanks for the detailed report! That helps a lot.
The in_qsize 0, out_qsize 0 really does indicate the workers are starved for input, in the "only ~3 words per document" case of text8_split.
In other words, even the almost trivial loops here and here seem to become the bottleneck with super short documents.
The fact that the 1-worker text8_split is faster than non-split text8 is probably due to the fact it's doing much less training -- you set window=5, but with documents of only ~3 words, the longer contexts never materialize.
I really don't know what we could do about this -- we're hitting the limits of Python itself here. Perhaps the most general solution would be to change the main API of gensim from "user supplies a stream of input data" to "user supplies multiple streams of data". It's fully backward compatible (one stream is just a special case of many streams), but would allow us to parallelize more easily without as much batching, data shuffling etc. Basically advise users to split their large data into multiple parts, Spark/Hadoop-style. Applies to all algos (LDA, LSI, word2vec...). @menshikh-iv @gojomo thoughts?
Yes, the reason the 1-thread split run is faster is almost certainly that with skip-gram and window=5, having many short sentences means much less training happens, because context windows are truncated at sentence ends.
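As a rough back-of-the-envelope illustration (ignoring gensim's random window-size reduction), the number of skip-gram (center, context) pairs per word drops sharply for tiny sentences:

```python
# Count (center, context) skip-gram pairs for one sentence, with the
# context window clipped at the sentence boundaries.
def pair_count(sentence_len, window=5):
    return sum(min(i, window) + min(sentence_len - 1 - i, window)
               for i in range(sentence_len))

print(pair_count(3))            # 6 pairs, i.e. ~2 pairs per word
print(pair_count(1000) / 1000)  # ~10 pairs per word on a long line
```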
IO may still be a factor, depending on your volume type and the behavior of LineSentence. Running the test from a list-of-list-of-tokens in memory will better isolate that factor from the remaining Python multithreading, and our one-batching-master-thread, issues.
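For example, a minimal sketch of such an in-memory test (assuming text8_split is a plain text file with one whitespace-tokenized sentence per line):

```python
from gensim.models import Word2Vec

# Load the whole split corpus into RAM as a list-of-lists-of-tokens, so
# disk IO and LineSentence overhead can't starve the worker threads.
with open('text8_split') as fin:
    sentences = [line.split() for line in fin]

model = Word2Vec(sentences, window=5, workers=20)
```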
Also, maximum throughput for the small-sentences case might be reached with a worker-count between 1 and 20 - the contention of a larger number of threads for the single Python GIL may be a factor in starving the master thread.
@piskvorky Yes I think a shared abstraction for helping models open multiple non-contending (and ideally non-overlapping) streams into the corpus would be a good way to increase throughput. (The word2vec.c and fasttext.cc implementations just let every thread open their own handle into a different starting-offset of the file, and continue cycling through until the desired total number of examples is read. Because of thread overtaking issues there's no guarantee some parts of the file aren't read more times than others... but it probably doesn't matter in practice that their training samples aren't exactly N passes over each example.)
@gojomo Yes, working off RAM (lists) will help isolate the issue, but I don't think the number of threads nor the IO are a factor here. @knighter is comparing two identical setups, on identical data (the same number of workers, same IO...). The only difference here is the sentence length. One setup is starved, one isn't.
Regarding multiple streams: gensim would be agnostic as to where the streams come from. Seeking to different places in a single file is one option; reading from multiple separate files (possibly separate filesystems) another. In practice, I suspect most people simply use a local FS, so that's our target use-case to optimize.
@piskvorky If LineSentence is less efficient at reading the many-lined input, that might contribute to the starvation seen going from 20 workers (unsplit) to 20 workers (split). The concatenation of small examples into larger batches may be relevant – but that was added because it greatly increased multithreaded parallelism in tests, by creating long no-GIL blocks, compared to small-examples-without-batching – at least with 3-8 threads.
Perhaps either of these processes – LineSentence IO or batching – gets starved for cycles when more threads are all waiting for the GIL. (That is: the trivial loops mean many more time-slicing events/overhead and context-switching.) Is there a 'withGIL' to force a block of related statements to not be open to normal interpreter GIL-sharing?
@gojomo I ran into the same question. Will a sentence longer than 10000 words be cut to 10000, with the rest of the data discarded during training? I don't see any mention of this behavior in the API documentation.
Yes, there's still a hard limit on sentence length (= max effective number of words in a document).
Btw the performance issues around GIL were solved in #2127, closing this ticket. See the tutorial at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb
CC @mpenkov @gojomo – let's link to that tutorial straight from the API docs. I had trouble locating it myself, it's pretty well hidden. The API docs only mention "You may use corpus_file instead of sentences to get performance boost", which is unnecessarily vague IMO. We can easily provide more docs and insight than that.
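For reference, the file-based path added in #2127 looks roughly like this (a sketch; corpus_file expects a LineSentence-format file, i.e. one whitespace-tokenized sentence per line):

```python
from gensim.models import Word2Vec

# With corpus_file, each worker thread reads its own region of the file
# directly, bypassing the single corpus-reading/batching master thread.
model = Word2Vec(corpus_file='text8_split', window=5, workers=20)
```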
OK, I'll deal with that as part of the documentation refactoring
Hello @mpenkov, I still have a question. Does the batch size affect the max length of a sentence? If I set batch size = 128, will the max sentence length be 10000 or 128?
@shuxiaobo batch_size has no effect on the 10000-token limit for individual texts.
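In other words (a small sketch, assuming the current parameter name batch_words for the job-batching size):

```python
from gensim.models import Word2Vec

sentences = [["a", "short", "sentence"], ["another", "one"]]

# batch_words only controls how many words are grouped into one job for
# the worker threads; individual sentences are still capped at 10,000
# tokens regardless of this setting.
model = Word2Vec(sentences, batch_words=128, min_count=1, workers=4)
```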