There is documentation on how to use nlp.pipe() with a single process and the default batch size:
https://spacy.io/usage/processing-pipelines
And there is brief documentation on setting n_process and batch_size:
https://spacy.io/api/language#pipe
But I am finding it hard to get a clear answer on the relationship between batch_size and n_process for a simple use case like entity extraction. So far, vanilla nlp.pipe() is significantly faster than nlp(), as expected. However, nlp.pipe(data, n_process=cpu_count() - 1) is much slower than plain nlp.pipe(), even after scanning through batch_size options (50 to 1000).
On a small dataset of 2000 sentences:
data = ["I 'm very happy .", "I want to say that I 'm very happy ."]*1000
nlp.pipe() takes ~2 seconds, whereas nlp.pipe(data, n_process=cpu_count() - 1) takes up to 30 seconds and plain nlp() takes ~14 seconds.
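For reference, a minimal sketch of this benchmark (en_core_web_sm is an assumption; exact timings will vary by machine):

import time
from multiprocessing import cpu_count

import spacy

if __name__ == "__main__":  # required on Windows, where workers are spawned
    nlp = spacy.load("en_core_web_sm")
    data = ["I 'm very happy .", "I want to say that I 'm very happy ."] * 1000

    # nlp.pipe() with the defaults: single process, default batch size
    start = time.time()
    list(nlp.pipe(data))
    print(f"nlp.pipe(): {time.time() - start:.2f}s")

    # nlp.pipe() with multiprocessing
    start = time.time()
    list(nlp.pipe(data, n_process=cpu_count() - 1, batch_size=200))
    print(f"nlp.pipe(n_process=...): {time.time() - start:.2f}s")

    # plain nlp(), one doc at a time
    start = time.time()
    [nlp(text) for text in data]
    print(f"nlp(): {time.time() - start:.2f}s")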
It would be good to know how to set n_process and batch_size given a maximum of cpu_count() from the multiprocessing library.
Additional info: I'm on Windows with a 12-core CPU.
The main issue is that multiprocessing has a lot of overhead when starting child processes, and this overhead is especially high on Windows, which uses spawn instead of fork. You might see improvements with multiprocessing for tasks that take much longer than a few seconds with one process, but it's not going to be helpful for short tasks.
You might also see some improvements by using a smaller number of processes for shorter tasks, but you'd have to experiment with this. It depends on the model and pipeline, how long your texts are, etc., and there's no single set of guidelines that will be optimal for every case.
See: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
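On Windows specifically, spawn means each worker process re-imports the main module, so any nlp.pipe(..., n_process=...) call needs to run under an if __name__ == "__main__": guard. A minimal sketch of a multiprocessing-friendly script (the model name is an assumption):

import spacy

def main():
    nlp = spacy.load("en_core_web_sm")
    texts = ["Some reasonably long text about companies and people ..."] * 10000
    # Without the __main__ guard below, each spawned worker would
    # re-execute this module at import time and error out.
    for doc in nlp.pipe(texts, n_process=4, batch_size=200):
        pass  # e.g. collect doc.ents for entity extraction

if __name__ == "__main__":
    main()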
You might see improvements with multiprocessing for tasks that take much longer than a few seconds with one process, but it's not going to be helpful for short tasks.
In this case, are longer tasks equivalent to a much larger batch size? Is spaCy creating a child process per batch?
I've tried everything I could, but couldn't find a single example where n_process > 1 resulted in better performance. In fact, performance is dramatically worse even with n_process = 2.
I think the developers should provide a minimal speed-up example; otherwise many people out there will lose precious time testing and benchmarking, just to find out that there is no way to parallelize this kind of job with spaCy/scispaCy.
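For concreteness, here is a sketch of the kind of setup that should, in theory, favor multiprocessing: many longer texts, so per-document work can amortize the worker start-up cost (en_core_web_sm is just an example model, and no particular timing is guaranteed):

import time

import spacy

if __name__ == "__main__":
    nlp = spacy.load("en_core_web_sm")
    # Longer documents and a larger corpus give the workers enough to do.
    text = "Apple is looking at buying U.K. startup for $1 billion. " * 50
    texts = [text] * 20000

    for n_process in (1, 2, 4):
        start = time.time()
        for doc in nlp.pipe(texts, n_process=n_process, batch_size=100):
            _ = doc.ents  # entity extraction is the work we care about
        print(f"n_process={n_process}: {time.time() - start:.1f}s")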
I'm going to close this on my end. If there are further issues regarding the documentation, please open a new issue.
I've run some tests with n_process as well. My results differ depending on whether or not I disable pipeline components.
Leaving all pipeline components enabled, I was able to see decent speed boosts as I increased the number of cores.
However, when disabling pipeline components (disable=["parser", "tagger", "ner"]), the results actually worsen slightly as I increase the number of cores.
I ran these tests on Linux with 64 GB RAM and an i7-8700K 6-core CPU.
Honestly, I would have expected a boost even with components disabled, since tokenization itself could be done in parallel across several CPUs, but it seems that only the pipeline components themselves are parallelized.
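For anyone who wants to reproduce this, here is a sketch of the two configurations I compared (the model name is an assumption):

import time

import spacy

if __name__ == "__main__":
    texts = ["This is a sample sentence for benchmarking."] * 20000

    # Full pipeline: the statistical components dominate runtime, so
    # spreading them across workers can pay off.
    nlp_full = spacy.load("en_core_web_sm")

    # Components disabled: each doc is so cheap to tokenize that the
    # overhead of passing docs between processes outweighs any gain.
    nlp_tok = spacy.load("en_core_web_sm", disable=["parser", "tagger", "ner"])

    for label, nlp in (("full pipeline", nlp_full), ("tokenizer only", nlp_tok)):
        for n_process in (1, 4):
            start = time.time()
            list(nlp.pipe(texts, n_process=n_process))
            print(f"{label}, n_process={n_process}: {time.time() - start:.1f}s")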