There is documentation on how to use nlp.pipe() with a single process and the default batch size:
https://spacy.io/usage/processing-pipelines
And there is brief documentation on setting n_process and batch_size:
https://spacy.io/api/language#pipe
But I am finding it hard to get a clear answer on the relationship between batch_size and n_process for a simple use case like entity extraction. So far, vanilla nlp.pipe() is significantly faster than nlp(), as expected. However, nlp.pipe(data, n_process=cpu_count() - 1) is much slower than plain nlp.pipe(), even after scanning through batch_size options (50 to 1000).
On a small dataset of 2000 sentences:
data = ["I 'm very happy .", "I want to say that I 'm very happy ."]*1000
nlp.pipe() takes ~2 seconds, whereas nlp.pipe(data, n_process=cpu_count() - 1) takes up to 30 seconds and plain nlp() takes ~14 seconds.
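For reference, a minimal sketch of this benchmark (en_core_web_sm is an assumption; exact timings will vary by machine):

import time
from multiprocessing import cpu_count

import spacy

if __name__ == "__main__":  # required on Windows, where workers are spawned
    nlp = spacy.load("en_core_web_sm")
    data = ["I 'm very happy .", "I want to say that I 'm very happy ."] * 1000

    # nlp.pipe() with the defaults: single process, default batch size
    start = time.time()
    list(nlp.pipe(data))
    print(f"nlp.pipe(): {time.time() - start:.2f}s")

    # nlp.pipe() with multiprocessing
    start = time.time()
    list(nlp.pipe(data, n_process=cpu_count() - 1, batch_size=200))
    print(f"nlp.pipe(n_process=...): {time.time() - start:.2f}s")

    # plain nlp(), one doc at a time
    start = time.time()
    [nlp(text) for text in data]
    print(f"nlp(): {time.time() - start:.2f}s")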
It would be good to know how to set n_process and batch_size given a maximum of cpu_count() from the multiprocessing library.
Additional info: I'm on Windows with a 12-core CPU.
The main issue is that multiprocessing has a lot of overhead when starting child processes, and this overhead is especially high on Windows, which uses spawn instead of fork. You might see improvements with multiprocessing for tasks that take much longer than a few seconds with one process, but it's not going to be helpful for short tasks.
You might also see some improvements by using a smaller number of processes for shorter tasks, but you'd have to experiment with this. It depends on the model and pipeline, how long your texts are, etc., and there's no single set of guidelines that will be optimal for every case.
See: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
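On Windows specifically, spawn means each worker process re-imports the main module, so any nlp.pipe(..., n_process=...) call needs to run under an if __name__ == "__main__": guard. A minimal sketch of a multiprocessing-friendly script (the model name is an assumption):

import spacy

def main():
    nlp = spacy.load("en_core_web_sm")
    texts = ["Some reasonably long text about companies and people ..."] * 10000
    # Without the __main__ guard below, each spawned worker would
    # re-execute this module at import time and error out.
    for doc in nlp.pipe(texts, n_process=4, batch_size=200):
        pass  # e.g. collect doc.ents for entity extraction

if __name__ == "__main__":
    main()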
You might see improvements with multiprocessing for tasks that take much longer than a few seconds with one process, but it's not going to be helpful for short tasks.
In this case, are longer tasks equivalent to a much larger batch size? Is spaCy creating a child process per batch?
I've tried everything I could, but couldn't find a single example where n_process > 1 resulted in better performance. In fact, performance is dramatically worse even with n_process = 2.
I think the developers should provide a minimal speed-up example; otherwise many people out there will lose precious time testing and benchmarking, just to find out that there is no way to parallelize this kind of job with spaCy/scispaCy.
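For concreteness, here is a sketch of the kind of setup that should, in theory, favor multiprocessing: many longer texts, so per-document work can amortize the worker start-up cost (en_core_web_sm is just an example model, and no particular timing is guaranteed):

import time

import spacy

if __name__ == "__main__":
    nlp = spacy.load("en_core_web_sm")
    # Longer documents and a larger corpus give the workers enough to do.
    text = "Apple is looking at buying U.K. startup for $1 billion. " * 50
    texts = [text] * 20000

    for n_process in (1, 2, 4):
        start = time.time()
        for doc in nlp.pipe(texts, n_process=n_process, batch_size=100):
            _ = doc.ents  # entity extraction is the work we care about
        print(f"n_process={n_process}: {time.time() - start:.1f}s")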
I'm going to close this on my end. If there are further issues regarding the documentation, please open a new issue.
I've run some tests with n_process as well. My results differ depending on whether or not I disable pipeline components.
Leaving all pipeline components enabled, I was able to see decent speed boosts as I increased the number of cores.
However, when disabling pipeline components (disable=["parser", "tagger", "ner"]), the results actually worsen slightly as I increase the number of cores.
I ran these tests on Linux with 64 GB RAM and an i7-8700K 6-core CPU.
Honestly, I would have expected a boost even with components disabled, since tokenization itself could be done in parallel across several CPUs, but it seems that only the pipeline components themselves are parallelized.
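For anyone who wants to reproduce this, here is a sketch of the two configurations I compared (the model name is an assumption):

import time

import spacy

if __name__ == "__main__":
    texts = ["This is a sample sentence for benchmarking."] * 20000

    # Full pipeline: the statistical components dominate runtime, so
    # spreading them across workers can pay off.
    nlp_full = spacy.load("en_core_web_sm")

    # Components disabled: each doc is so cheap to tokenize that the
    # overhead of passing docs between processes outweighs any gain.
    nlp_tok = spacy.load("en_core_web_sm", disable=["parser", "tagger", "ner"])

    for label, nlp in (("full pipeline", nlp_full), ("tokenizer only", nlp_tok)):
        for n_process in (1, 4):
            start = time.time()
            list(nlp.pipe(texts, n_process=n_process))
            print(f"{label}, n_process={n_process}: {time.time() - start:.1f}s")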