The dataset has 164758 rows of text data (ordinary news articles). I tried spaCy lemmatization first; it ran for 3 hours at full usage of all 24 cores without finishing. A test on a small set of 100 rows took 10 s, so 164758 rows should take about (164758 * 0.1) / (60*60) ≈ 4.5 hours.
The same dataset, lemmatized with NLTK under Dask multiprocessing, finished in 5 minutes.
Why is spaCy so slow? Or am I misusing some function?
spaCy code:

def token_filter(token):
    # 'or' instead of '|': with '|', the trailing comparison '<= 4' binds last,
    # so the original expression filtered on the wrong condition
    return not (token.is_punct or token.is_space or token.is_stop
                or len(token.text) <= 4)

def spacy_tokenizer(text):
    for doc in nlp.pipe([text]):
        tokens = [token.lemma_ for token in doc if token_filter(token)]
    return tokens

%time data['token'] = data['text'].map(spacy_tokenizer)
NLTK code:

import re
from string import punctuation
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()

def tokenizer(text):
    tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
    tokens = list(filter(lambda t: t not in punctuation, tokens))
    tokens = list(filter(lambda t: t.lower() not in stop_words, tokens))
    # keep only tokens containing at least one letter
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    filtered_tokens = list(
        map(lambda token: wordnet_lemmatizer.lemmatize(token.lower()), filtered_tokens))
    filtered_tokens = list(filter(lambda t: t not in punctuation, filtered_tokens))
    return filtered_tokens
def dask_tokenizer(df):
    df['token'] = df['text'].map(tokenizer)
    return df

import dask.dataframe as dd
from dask.multiprocessing import get

ddata = dd.from_pandas(data, npartitions=50)
%time final = ddata.map_partitions(dask_tokenizer).compute(get=get)
Info about spaCy
spaCy version: 2.0.5
Location: /opt/conda/lib/python3.6/site-packages/spacy
Platform: Linux-4.4.0-103-generic-x86_64-with-debian-stretch-sid
Python version: 3.6.3
Models: en, en_default
Two possible speedups come to mind:
1. When you call nlp(), it runs everything in its pipeline by default, including part-of-speech tagging, dependency parsing, and NER. If you just need lemmatizing, you can turn some of those off: https://spacy.io/usage/processing-pipelines#disabling
2. Your spacy_tokenizer function takes in a single document, puts it in a length-1 list, then runs nlp.pipe on it. nlp.pipe is meant to run on a large list, so you should run nlp.pipe directly on your long list of documents rather than using map.

Also see the discussion about tokenizing speed in #1508.
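For instance, here is a minimal sketch combining both suggestions, assuming spaCy 2.0 as reported above and a pandas Series data['text'] as in the question; the batch_size and n_threads values are illustrative, not tuned:

import spacy

# Disable pipeline components you don't need; the tagger is kept here
# because spaCy 2.x uses POS tags to choose the correct lemma.
nlp = spacy.load('en', disable=['parser', 'ner'])

def token_filter(token):
    return not (token.is_punct or token.is_space or token.is_stop
                or len(token.text) <= 4)

# One nlp.pipe call over the whole corpus, instead of a length-1 list
# per document inside map().
docs = nlp.pipe(data['text'], batch_size=1000, n_threads=4)
data['token'] = [[token.lemma_ for token in doc if token_filter(token)]
                 for doc in docs]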
Thanks @ahalterman! That's all correct, so I'll close this.