I created a custom component to filter out stop words and punctuation and added it to my pipeline like so:
```python
import string
import spacy

nlp = spacy.load('en')
punctuations = string.punctuation
stopwords = spacy.lang.en.STOP_WORDS

def clean_component(doc):
    """Clean up text. Tokenize, lowercase, and remove punctuation and stopwords."""
    print("Running cleaner")
    # Remove punctuation, symbols (#) and stopwords
    doc = [tok.text for tok in doc
           if (tok.text not in stopwords and tok.pos_ != "PUNCT" and tok.pos_ != "SYM")]
    # Make all tokens lowercase
    doc = [tok.lower() for tok in doc]
    doc = ' '.join(doc)
    return nlp.make_doc(doc)

nlp.add_pipe(clean_component, name='cleaner', after='tagger')
print(nlp.pipe_names)  # ['tagger', 'cleaner', 'parser', 'ner']
```
But when I run nlp.pipe on some text, "Running cleaner" is printed but the text isn't filtered:
```python
for doc in nlp.pipe(data['text'][:2]):
    print(doc)
```
The output is the same as the input. Am I using pipe wrong? Thanks.
Thanks, this is a bug. When component functions don't have a .pipe() method, we call a helper function to pipe them, here: https://github.com/explosion/spaCy/blob/master/spacy/language.py#L721
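For context, that helper behaves roughly like this (a simplified sketch; the name _pipe and the exact signature are illustrative, not the actual source):

```python
def _pipe(proc, docs):
    # Sketch of the buggy helper: the component's return value is
    # discarded, and the original doc is yielded instead.
    for doc in docs:
        proc(doc)
        yield doc
```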
This function should be yielding the result, but is instead yielding the original doc. Here's a minimal hack that should make your code work for now:
```python
def clean_component(doc):
    """Clean up text. Tokenize, lowercase, and remove punctuation and stopwords."""
    print("Running cleaner")
    # Remove punctuation, symbols (#) and stopwords
    doc = [tok.text for tok in doc
           if (tok.text not in stopwords and tok.pos_ != "PUNCT" and tok.pos_ != "SYM")]
    # Make all tokens lowercase
    doc = [tok.lower() for tok in doc]
    doc = ' '.join(doc)
    return nlp.make_doc(doc)

def pipe_clean(docs, **kwargs):
    for doc in docs:
        yield clean_component(doc)

# Yes, adding attributes to functions works... It's just a bit dirty-looking.
# Arguably less confusing to make it a class. Shrug.
clean_component.pipe = pipe_clean
```
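With the .pipe attribute attached, nlp.pipe will call pipe_clean instead of the buggy helper, so the filtered text comes through. For example (assuming a fresh pipeline; the sample texts are just illustrative):

```python
nlp.add_pipe(clean_component, name='cleaner', after='tagger')

for doc in nlp.pipe(["The #1 rule: don't panic!", "Another example sentence."]):
    print(doc)  # now prints the filtered, lowercased text
```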
Btw, the stop words in spaCy are currently case-sensitive, so you might want to write your clean-up logic slightly differently. You should also take care that the processing you're doing will likely have a huge impact on the accuracy of the parser and NER.
If you just want to get a bag of words that's lower-cased and doesn't have stop words, you might be better off keeping the original Doc object and using the token.is_stop, token.lower_, token.is_punct etc. attributes, for example:
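A minimal sketch of that approach (the helper name bag_of_words is just for illustration):

```python
def bag_of_words(doc):
    # Filter with token attributes instead of rebuilding the Doc,
    # so the tagger, parser and NER still see the original text.
    return [tok.lower_ for tok in doc
            if not tok.is_stop and not tok.is_punct and tok.pos_ != "SYM"]

# Usage: run the unmodified pipeline and post-process each Doc.
for doc in nlp.pipe(data['text'][:2]):
    print(bag_of_words(doc))
```

If the case-sensitivity mentioned above is a concern, checking tok.lower_ against the stop-word set (e.g. tok.lower_ in stopwords) is one way around it.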