spaCy nlp method running in multi-thread mode by default

Created on 23 Jul 2018 · 7 comments · Source: explosion/spaCy

How to reproduce the behaviour

import spacy

nlp = spacy.load('en_core_web_sm')
array_text = ['Hi there, this is a test' for i in range(0, 10000)]
processed = nlp(', '.join(array_text))

This causes 16 threads to spin up for spaCy on my MacBook Pro. So is spaCy running in multi-thread mode by default? How do I disable this behaviour? I was under the impression that it would only run in multi-thread mode if I used nlp.pipe.

Your Environment

  • Operating System: Mac OS High Sierra
  • Python Version Used: 3.6
  • spaCy Version Used: 2.0.3
  • Environment Information:

All 7 comments

It seems the problem is with BLAS. Once I set the OPENBLAS_NUM_THREADS environment variable to 1, I no longer see this behaviour.

Hit the same issue; setting OPENBLAS_NUM_THREADS=1 fixes it.
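For anyone else hitting this, here is a minimal sketch of the workaround. OpenBLAS reads the variable once when the library is first loaded, so it has to be set before numpy/spaCy are imported (setting it from the shell before launching Python works too):

```python
import os

# Pin OpenBLAS to a single thread. This must happen before the first
# import of numpy or spaCy, because OpenBLAS reads the variable only
# once, at library load time.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

# Safe to import spaCy after the variable is set:
# import spacy
# nlp = spacy.load('en_core_web_sm')
```

If spaCy (or numpy) has already been imported anywhere in the process, setting the variable afterwards has no effect.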

Might be worth flagging this in the docs? I'm using Spacy inside a Spark job, and was seeing a load factor of ~1000 on the workers, since all the cores were trying to fan out to all the other cores :)

Nice. Is that the case when using SpacyMagic, or only with plain spaCy? It would also be great to be able to serialize the document; otherwise we have to send back arrays of dictionaries representing each token. Out of interest, what sort of throughput do you get with Spark?
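On serialization: spaCy's Doc objects do support a binary round-trip via Doc.to_bytes / Doc.from_bytes (available since v2.0), which avoids shipping arrays of token dictionaries between Spark workers. A minimal sketch, using a blank pipeline so no model download is needed:

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline: tokenization only, no model required.
nlp = spacy.blank("en")
doc = nlp("Hi there, this is a test")

# Serialize the Doc to bytes (e.g. to send from a Spark worker).
data = doc.to_bytes()

# Deserialize on the other side; the receiving process needs a
# compatible Vocab to reconstruct the Doc.
doc2 = Doc(nlp.vocab).from_bytes(data)
assert doc2.text == doc.text
```

Note the caveat in the comment: deserialization needs a compatible Vocab on the receiving side, so each worker typically loads the same pipeline once and reuses its vocab.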

@davidmcclure @eamonnmag Upcoming releases of spaCy are switching to single-threaded execution by default, as the multi-core utilisation is pretty poor (numpy is just parallelising the matrix multiplications, which is too small a unit of work).

I agree that this should be in the docs in the meantime.

Great, thanks @honnibal

Thank you!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
