Hi,
I used nlp.pipe on my texts to extract entities, but didn't see a significant improvement in extraction speed.
My input is a variable called textList: a list containing around 2000 documents.
I split it into chunks of 400 documents each, since beyond that it errors out with a message about the 1,000,000-character memory limit.
Earlier, when I just used nlp(text), it took around 15 sec per chunk (400 docs), so around 45 sec to 1 min in total.
After switching to nlp.pipe, I expected that with multithreading it would take about a quarter of the time.
Is there a recommendation for the number of threads and the batch size to pass to nlp.pipe?
My code is as below:
nlp = spacy.load(path, disable=['parser', 'tagger', 'textcat'])
textList = np.array(textList)
if len(textList) < 400:
    n = 2
else:
    n = int(len(textList) / 400)
textChunks = np.array_split(textList, n)
entitiesfinal = []
textStr = [' '.join(textChunks[i].tolist()) for i in range(n)]
docs = nlp.pipe(textStr, n_threads=4, batch_size=1)
for doc in docs:
    entities = [{'text': ent.text, 'type': ent.label_} for ent in doc.ents]
    entitiesfinal.extend(entities)
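As an aside, the chunking step above can be sketched in pure Python without joining each chunk's 400 documents into one big string (the helper name chunk_docs is hypothetical):

```python
def chunk_docs(texts, max_size=400):
    """Split a list of documents into chunks of at most max_size items."""
    return [texts[i:i + max_size] for i in range(0, len(texts), max_size)]

docs = ["doc %d" % i for i in range(2000)]
chunks = chunk_docs(docs)
print(len(chunks), len(chunks[0]))  # 5 400
```

Keeping the documents separate also means each nlp.pipe result corresponds to one input document, rather than one merged Doc per chunk.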
Hi Jacob,
Hi Sandeep,
I have 2000 documents in total, and have already batched them into chunks of 400 documents each. So a batch size of 1 means each batch holds 400 documents. I also tried batch sizes of 64 and 100 on the full 2000 documents, but didn't see any improvement.
I read that multithreading doesn't improve performance as much as multiprocessing, since the threads share the same memory space and CPU.
So that could be one reason.
Besides, I am not sure which BLAS my NumPy is linked against; I installed Anaconda3, so NumPy came with it automatically.
I am also constrained in using multiprocessing, since my processing runs as a Python script function that is called from an outside program. Multiprocessing requires the work to be launched from under an if __name__ == '__main__': guard.
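For completeness, that guard looks like the sketch below, using only the standard library (the square worker is a made-up placeholder). On Windows, child processes re-import the module, so the worker must be defined at module top level and the Pool must be created inside the guard:

```python
import multiprocessing as mp

def square(x):
    # Worker function; defined at module top level so that child
    # processes (which re-import this module on Windows) can find it.
    return x * x

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```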
I tried enabling the tagger, but didn't see a difference in accuracy.
One thing I observed when using spaCy is that sending really long documents takes more time than sending a batch of small documents. So I would not combine multiple documents into one large document, but rather send batches of small documents, probably split by sentences or newlines as appropriate for my problem area, without confusing the model with overly short sentences.
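A minimal sketch of that splitting step, assuming newline-separated records and a made-up minimum length to drop fragments that are too short (the helper name split_into_small_docs is hypothetical):

```python
def split_into_small_docs(raw_texts, min_len=5):
    """Split each raw text on newlines and drop fragments shorter than
    min_len characters, so nlp.pipe receives many small documents."""
    small = []
    for text in raw_texts:
        small.extend(line.strip() for line in text.split('\n')
                     if len(line.strip()) >= min_len)
    return small

print(split_into_small_docs(["First record.\nSecond record.\n\nx"]))
# ['First record.', 'Second record.']
```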
Multiprocessing is quite easy in Python using joblib. However, developers have found issues running it with spaCy, and there is an open issue on the joblib approach. I haven't tried it myself yet; instead, I use a different multiprocess invocation in my product for business reasons.
Note that the GIL serialization that usually limits Python multithreading does not apply to spaCy's nlp.pipe, since Matt implemented the parallelism in the Cython layer rather than the Python layer. So multiprocessing might give you only a small additional advantage at the scale you are working with.
On Linux, you can use the command below to get the BLAS linkage information:
ldd $(python -c "import numpy.core; print(numpy.core.multiarray.__file__)")
On Windows, NumPy should be linked against MKL already. This might be helpful for you.
It is also good to know the hardware limits of the machine running the code:
cat /proc/cpuinfo
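A portable alternative, if ldd is not available, is to ask NumPy directly which BLAS/LAPACK it was built against:

```python
import numpy as np

# Prints the BLAS/LAPACK build configuration NumPy was linked against
# (e.g. MKL on Anaconda builds).
np.show_config()
```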
Yes, multiprocessing with joblib gives the error about unpicklable objects.
My OS and hardware config are as follows:
Windows Server 2012 R2, 64-bit, 32 GB RAM, 8 virtual CPUs.
As I said, I am processing a total of 2000 docs.
Each batch (containing 400 docs) takes around 15 sec.
Total execution time if done sequentially is around 75 sec (15 * 5).
So I was trying to use multithreading or multiprocessing to run the processing in parallel or concurrently, to reduce the execution time.
But multithreading is not giving that improvement, and multiprocessing is giving the pickle error.
@honnibal @ines is anything missing in my code that would explain why the pipe function is not outperforming sequential processing?
I've been able to get multiprocessing working using the pathos framework. It has a fork of the multiprocessing module that amongst other things uses dill, a pickler that can pickle more kinds of objects than the pickle module in the standard library.
You can simply import pathos.multiprocessing as mp, then use mp.Pool as you would multiprocessing.Pool.
Thanks, I tried pathos, but the program hangs at the mp.Pool statement with no error message either. I am using Python 3.6 on Windows Server 2012, 64-bit, with 8 virtual CPUs.
import pathos.multiprocessing as mp
poolObj = mp.Pool(5)
docs = poolObj.map(nlp,textStr)
The current spacy-nightly now works with the multiprocessing module out of the box!
Hi, which version of spaCy should I use in order to take advantage of multiprocessing?
This is my code:
start = time.time()
for doc in nlp.pipe(array_texts, batch_size=args.batchSize, n_threads=args.nThreads):
    result.append(doc.cats)
print('Time Elapsed {} ms'.format((time.time() - start) * 1000))
array_texts is a NumPy array containing multiple texts.
@ines @honnibal any update on multiprocessing? I dug around the issues in this codebase and couldn't find anything pointing to multiprocessing actually working.
Multiprocessing should now be working well on spacy-nightly. Closing this now to merge discussion with https://github.com/explosion/spaCy/issues/2075
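For anyone landing here later: in spaCy v2.2.2 and newer, nlp.pipe accepts an n_process argument that handles multiprocessing internally. A minimal sketch using a blank pipeline (a real use case would load a trained model with an NER component):

```python
import spacy

nlp = spacy.blank("en")  # blank English pipeline, no model download needed
texts = ["First document.", "Second document."] * 100

# n_process > 1 spawns worker processes (spaCy >= 2.2.2); batch_size
# controls how many texts are sent to each worker at a time.
docs = list(nlp.pipe(texts, n_process=1, batch_size=50))
print(len(docs))  # 200
```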
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.