I created a custom component to filter out stop words and punctuation and added it to my pipeline like so:
```python
import string
import spacy

nlp = spacy.load('en')
punctuations = string.punctuation
stopwords = spacy.lang.en.STOP_WORDS

def clean_component(doc):
    """Clean up text. Tokenize, lowercase, and remove punctuation and stopwords."""
    print("Running cleaner")
    # Remove punctuation, symbols (#) and stopwords
    doc = [tok.text for tok in doc
           if (tok.text not in stopwords and tok.pos_ != "PUNCT" and tok.pos_ != "SYM")]
    # Make all tokens lowercase
    doc = [tok.lower() for tok in doc]
    doc = ' '.join(doc)
    return nlp.make_doc(doc)

nlp.add_pipe(clean_component, name='cleaner', after='tagger')
print(nlp.pipe_names)  # ['tagger', 'cleaner', 'parser', 'ner']
```
But when I run nlp.pipe on some text, "Running cleaner" is printed but the text isn't filtered:
```python
for doc in nlp.pipe(data['text'][:2]):
    print(doc)
```
The output is the same as the input. Am I using pipe wrong? Thanks.
Thanks, this is a bug. When component functions don't have a .pipe() method, we call a helper function to pipe them, here: https://github.com/explosion/spaCy/blob/master/spacy/language.py#L721
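For context, that helper behaves roughly like this (a simplified sketch; the name _pipe and the exact signature are illustrative, not the actual source):

```python
def _pipe(proc, docs):
    # Sketch of the buggy helper: the component's return value is
    # discarded, and the original doc is yielded instead.
    for doc in docs:
        proc(doc)
        yield doc
```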
This function should be yielding the result, but is instead yielding the original doc. Here's a minimal hack that should make your code work for now:
```python
def clean_component(doc):
    """Clean up text. Tokenize, lowercase, and remove punctuation and stopwords."""
    print("Running cleaner")
    # Remove punctuation, symbols (#) and stopwords
    doc = [tok.text for tok in doc
           if (tok.text not in stopwords and tok.pos_ != "PUNCT" and tok.pos_ != "SYM")]
    # Make all tokens lowercase
    doc = [tok.lower() for tok in doc]
    doc = ' '.join(doc)
    return nlp.make_doc(doc)

def pipe_clean(docs, **kwargs):
    for doc in docs:
        yield clean_component(doc)

# Yes, adding attributes to functions works... It's just a bit dirty-looking.
# Arguably less confusing to make it a class. Shrug.
clean_component.pipe = pipe_clean
```
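With the .pipe attribute attached, nlp.pipe will call pipe_clean instead of the buggy helper, so the filtered text comes through. For example (assuming a fresh pipeline; the sample texts are just illustrative):

```python
nlp.add_pipe(clean_component, name='cleaner', after='tagger')

for doc in nlp.pipe(["The #1 rule: don't panic!", "Another example sentence."]):
    print(doc)  # now prints the filtered, lowercased text
```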
Btw, the stop words in spaCy are currently case-sensitive, so you might want to write your clean-up logic slightly differently. You should also take care that the processing you're doing will likely have a huge impact on the accuracy of the parser and NER.
If you just want to get a bag of words that's lower-cased and doesn't have stop words, you might be better off keeping the original Doc object and using the token.is_stop, token.lower_, token.is_punct etc. attributes, for example:
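A minimal sketch of that approach (the helper name bag_of_words is just for illustration):

```python
def bag_of_words(doc):
    # Filter with token attributes instead of rebuilding the Doc,
    # so the tagger, parser and NER still see the original text.
    return [tok.lower_ for tok in doc
            if not tok.is_stop and not tok.is_punct and tok.pos_ != "SYM"]

# Usage: run the unmodified pipeline and post-process each Doc.
for doc in nlp.pipe(data['text'][:2]):
    print(bag_of_words(doc))
```

If the case-sensitivity mentioned above is a concern, checking tok.lower_ against the stop-word set (e.g. tok.lower_ in stopwords) is one way around it.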