spaCy: using nlp.pipe didn't improve performance of nlp

Created on 16 Jun 2018 · 11 Comments · Source: explosion/spaCy

Hi,

I used nlp.pipe on my texts to extract entities, but didn't see a significant improvement in extraction speed.
My input is a variable called textList: a list containing around 2000 documents.
I split it into chunks of 400 documents each, since beyond that it errors out with a message about the 1,000,000-character memory limit.
Earlier, when I just used nlp(text), it took around 15 seconds per chunk (400 docs), so around 45 sec to 1 min in total.
After switching to nlp.pipe, I expected that with multithreading it would take about a quarter of the time.

Is there a recommendation on number of threads and batch size in the call to nlp.pipe?

My code is as below:

import numpy as np
import spacy

nlp = spacy.load(path, disable=['parser', 'tagger', 'textcat'])
textList = np.array(textList)
if len(textList) < 400:
    n = 2
else:
    n = int(len(textList) / 400)
textChunks = np.array_split(textList, n)
entitiesfinal = []
# join each chunk of ~400 docs into a single long string
textStr = [' '.join(textChunks[i].tolist()) for i in range(n)]
docs = nlp.pipe(textStr, n_threads=4, batch_size=1)
for doc in docs:
    entities = [{'text': ent.text, 'type': ent.label_} for ent in doc.ents]
    entitiesfinal.extend(entities)

  • Operating System: Windows 64 bit
  • Python Version Used: 3.6
  • spaCy Version Used: 2.0.11
  • Environment Information: Anaconda 3
Labels: enhancement, perf / speed, usage

Most helpful comment

The current spacy-nightly now works with the multiprocessing module out of the box!

All 11 comments

Hi Jacob,

  1. First of all, as per Amdahl's law, you will not get a 4x speedup unless the whole code path is completely independent inside the Cython layer.
  2. I think using a batch size of 1 will reduce the speed of the multithreading that nlp.pipe provides. For each document, a thread engages the entire code flow and returns the result, which eats up a good amount of the thread's quantum just to process a single document and come back. So larger batches should give better results. I use batches of 32 or 64 documents, which seems to be a good number given the memory footprint growth from parallelism, with my document lengths varying from about 100 characters to 1k or 10k characters (see the sketch after this list).
  3. IMO, enabling the tagger (not disabling it) should improve overall accuracy; I believe the POS tags are used as inputs for NER.
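
To illustrate point 2, here's a minimal sketch of feeding individual documents to nlp.pipe with a moderate batch size instead of joining them into big strings (the model path, thread count, and texts are placeholders):

import spacy

# placeholders: substitute your own model path and list of documents
nlp = spacy.load(path, disable=['parser', 'textcat'])  # tagger left enabled
texts = textList  # one entry per document, not pre-joined chunks

entities_final = []
for doc in nlp.pipe(texts, n_threads=4, batch_size=64):
    entities_final.append([{'text': ent.text, 'type': ent.label_} for ent in doc.ents])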

Hi Sandeep,

I have 2000 documents in total, and have already batched them into chunks of 400 documents each. So a batch size of 1 means each batch has 400 documents. I also tried batch sizes of 64 and 100 on the full 2000 documents, but didn't see any improvement.
I read that multithreading doesn't improve performance as much as multiprocessing, since the threads share the same memory space and CPU.
So that could be one reason.
Besides, I am not sure which BLAS my NumPy is linked against, though I installed Anaconda3, so NumPy came with it automatically.

I am also challenged to use multiprocessing, since my processing runs as a Python script function that is called from an outside program. Multiprocessing requires that you launch the processing from under a __main__ guard in Python.
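
That is, something like this minimal sketch (run_extraction and text_list are placeholders for my actual code):

if __name__ == '__main__':
    # multiprocessing on Windows spawns fresh interpreters that re-import
    # this module, so the entry point must be guarded
    run_extraction(text_list)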

I tried enabling the tagger, but didn't see a difference in accuracy.

One thing I observed with spaCy is that really long documents take more time than a batch of small documents. So rather than combining multiple documents into one large document, I will send batches of small documents, probably split by sentences or newlines as appropriate for my problem area, without confusing the model with too-short sentences.
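
For example, a rough sketch of the splitting I have in mind (raw_docs stands in for my long documents):

# split each long document on newlines, dropping empty lines
small_texts = [line for doc in raw_docs
               for line in doc.split('\n') if line.strip()]
for doc in nlp.pipe(small_texts, n_threads=4, batch_size=64):
    entitiesfinal.extend({'text': ent.text, 'type': ent.label_} for ent in doc.ents)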

Multiprocessing is quite easy in Python using joblib. However, developers have reported issues running it with spaCy, and I haven't tried it myself yet; I use a different multiprocess invocation in my product for business reasons. There is an open issue on the joblib approach to multiprocessing.

However, the GIL serialization issue that usually limits Python multithreading doesn't apply to spaCy's nlp.pipe, since Matt implemented the parallelism in the Cython layer rather than the Python layer. So multiprocessing might give you only a small advantage overall at the scale you are working with.

If you're on Linux, you can get the BLAS linkage information with the command below. On Windows, NumPy should already be linked against MKL. This might be helpful for you.

ldd $(python -c "import numpy.core; print(numpy.core.multiarray.__file__)")

It is also good to know the actual hardware limits of the machine where you are running the code:
cat /proc/cpuinfo

Yes, multiprocessing with joblib does give the error about picklable objects.
My OS and hardware config are as follows:
Windows Server 2012 R2, 64-bit, 32 GB RAM, 8 virtual CPUs.

As I said, I am processing a total of 2000 docs.
Each of my batch (containing 400 docs) takes around 15 secs.
Total execution time if done sequentially is around 75 sec. (15 * 5)
So I was trying to use multithreading or multiprocessing to run the processing in parallel or concurrently and reduce the execution time.
But multithreading is not giving that improvement, and multiprocessing is giving the pickle error.

@honnibal @ines is anything missing in my code that would explain why the pipe function isn't giving better performance than sequential processing?

I've been able to get multiprocessing working using the pathos framework. It has a fork of the multiprocessing module that, among other things, uses dill, a pickler that can handle more kinds of objects than the pickle module in the standard library.

You can simply import pathos.multiprocessing as mp, then use mp.Pool as you would multiprocessing.Pool.

Thanks, I tried pathos, but the program hangs at the mp.Pool statement with no error message. I am using Python 3.6 on Windows Server 2012, 64-bit, with 8 virtual CPUs.
import pathos.multiprocessing as mp
poolObj = mp.Pool(5)
docs = poolObj.map(nlp,textStr)

The current spacy-nightly now works with the multiprocessing module out of the box!
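
For example, a minimal sketch of what that enables, assuming a picklable pipeline as in spacy-nightly (the model name, chunk size, and worker count are illustrative; the __main__ guard is required on Windows):

import multiprocessing as mp
import spacy

def extract_entities(texts):
    # each worker loads its own copy of the model; a Pool initializer could
    # avoid reloading it for every chunk
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'textcat'])
    return [[{'text': ent.text, 'type': ent.label_} for ent in doc.ents]
            for doc in nlp.pipe(texts, batch_size=64)]

if __name__ == '__main__':
    text_list = load_texts()  # placeholder for the ~2000 documents
    chunks = [text_list[i:i + 400] for i in range(0, len(text_list), 400)]
    with mp.Pool(4) as pool:
        per_chunk = pool.map(extract_entities, chunks)
    entities_final = [ents for chunk in per_chunk for ents in chunk]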

Hi, which version of spaCy should I use in order to take advantage of multiprocessing?

This is my code:

start = time.time()
result = []  # collect the textcat scores for each doc
for doc in nlp.pipe(array_texts, batch_size=args.batchSize, n_threads=args.nThreads):
    result.append(doc.cats)
print('Time Elapsed {} ms'.format((time.time() - start) * 1000))

array_texts is a NumPy array with multiple texts.

  • Operating System: CentOS Linux release 7.4.1708 (Core)
  • Python Version Used: 3.6.5
  • spaCy Version Used: 2.0.12 (installed using: pip install spacy)
  • Environment Information: Anaconda 3

@ines @honnibal any update on multiprocessing? I dug around the issues in this codebase and couldn't find anything showing that multiprocessing actually works.

Multiprocessing should now be working well on spacy-nightly. Closing this now to merge discussion with https://github.com/explosion/spaCy/issues/2075
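
For readers landing here later: in subsequent releases (spaCy 2.2.2+, if I remember the version correctly), nlp.pipe gained a built-in n_process argument, so the fan-out no longer needs a hand-rolled Pool:

# n_process replaces the deprecated n_threads argument in later releases
for doc in nlp.pipe(texts, n_process=4, batch_size=64):
    print([(ent.text, ent.label_) for ent in doc.ents])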

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

