The dataset has 164758 rows of text data (ordinary news articles). I tried spaCy lemmatization first; it ran for 3 hours at full usage of all 24 cores without finishing. A test on a small set of 100 rows took 10 s, so 164758 rows should take about (164758 * 0.1) / (60*60) ≈ 4.5 hours.
The same dataset, lemmatized with NLTK under Dask multiprocessing, finished in 5 minutes.
Why is spaCy so slow? Or am I misusing some function?
spaCy code:

def token_filter(token):
    # 'or' instead of '|': with '|', the trailing comparison '<= 4' binds last,
    # so the original expression filtered on the wrong condition
    return not (token.is_punct or token.is_space or token.is_stop
                or len(token.text) <= 4)

def spacy_tokenizer(text):
    for doc in nlp.pipe([text]):
        tokens = [token.lemma_ for token in doc if token_filter(token)]
    return tokens

%time data['token'] = data['text'].map(spacy_tokenizer)
NLTK code:

import re
from string import punctuation
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()

def tokenizer(text):
    tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
    tokens = list(filter(lambda t: t not in punctuation, tokens))
    tokens = list(filter(lambda t: t.lower() not in stop_words, tokens))
    # keep only tokens containing at least one letter
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    filtered_tokens = list(
        map(lambda token: wordnet_lemmatizer.lemmatize(token.lower()), filtered_tokens))
    filtered_tokens = list(filter(lambda t: t not in punctuation, filtered_tokens))
    return filtered_tokens
def dask_tokenizer(df):
    df['token'] = df['text'].map(tokenizer)
    return df

import dask.dataframe as dd
from dask.multiprocessing import get

ddata = dd.from_pandas(data, npartitions=50)
%time final = ddata.map_partitions(dask_tokenizer).compute(get=get)
Info about spaCy
spaCy version: 2.0.5
Location: /opt/conda/lib/python3.6/site-packages/spacy
Platform: Linux-4.4.0-103-generic-x86_64-with-debian-stretch-sid
Python version: 3.6.3
Models: en, en_default
Two possible speedups come to mind:
1. When you call nlp(), it runs everything in its pipeline by default, including part-of-speech tagging, dependency parsing, and NER. If you just need lemmatizing, you can turn some of those off: https://spacy.io/usage/processing-pipelines#disabling
2. Your spacy_tokenizer function takes in a single document, puts it in a length-1 list, then runs nlp.pipe on it. nlp.pipe is meant to run on a large list, so you should run nlp.pipe directly on your long list of documents rather than using map.

Also see the discussion about tokenizing speed in #1508.
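For instance, here is a minimal sketch combining both suggestions, assuming spaCy 2.0 as reported above and a pandas Series data['text'] as in the question; the batch_size and n_threads values are illustrative, not tuned:

import spacy

# Disable pipeline components you don't need; the tagger is kept here
# because spaCy 2.x uses POS tags to choose the correct lemma.
nlp = spacy.load('en', disable=['parser', 'ner'])

def token_filter(token):
    return not (token.is_punct or token.is_space or token.is_stop
                or len(token.text) <= 4)

# One nlp.pipe call over the whole corpus, instead of a length-1 list
# per document inside map().
docs = nlp.pipe(data['text'], batch_size=1000, n_threads=4)
data['token'] = [[token.lemma_ for token in doc if token_filter(token)]
                 for doc in docs]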
Thanks @ahalterman! That's all correct, so I'll close this.