Running merge_noun_chunks after the lemmatizer does not propagate the LEMMA annotations to the merged chunks:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_noun_chunks")
text = "Hello world"
doc = nlp(text)
print([t.lemma_ for t in doc])
print(doc.has_annotation("LEMMA"))
Prints out:
['']
False
This can become an issue, for example, when a Matcher relies on such annotations and the documents are small, i.e. when merging removes all of the annotations:
ValueError: [E155] The pipeline needs to include a lemmatizer in order to use Matcher or PhraseMatcher with the attribute LEMMA. Try using `nlp()` instead of `nlp.make_doc()` or `list(nlp.pipe())` instead of `list(nlp.tokenizer.pipe())`.
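For illustration, here is a minimal sketch of how that error can be triggered, continuing from the first snippet above (the match ID and pattern are made up; any LEMMA-based pattern hits the same check):

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# hypothetical pattern that relies on the LEMMA attribute
matcher.add("HELLO_WORLD", [[{"LEMMA": "world"}]])
matcher(doc)  # raises ValueError E155, since no token in doc has a lemma set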
Making sure that merge_noun_chunks is placed before the lemmatizer solves the issue:
nlp.add_pipe("merge_noun_chunks", before="lemmatizer")
I think it might be worth either:
Thanks for the detailed report! We'll have a look at how best to address this.
The retokenizer currently resets LEMMA and NORM. An unset NORM on the token defaults back to the NORM on the lexeme, which should still be okay in v3 and work with the Matcher. An unset LEMMA in v2 used to default back to the lookup lemma (in the Python Token API, which still wouldn't have worked for the Matcher), but this is no longer the default in v3, so this should be updated.
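A small sketch of that difference, assuming the original pipeline from the report (merge_noun_chunks appended after the lemmatizer):

doc = nlp("I like New York")  # "New York" is a noun chunk and gets merged
for t in doc:
    # the merged token keeps a non-empty norm_ (it falls back to the lexeme NORM),
    # but its lemma_ is empty in v3
    print(t.text, repr(t.norm_), repr(t.lemma_))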
I guess reasonable defaults would be:
SPACY-preserved ORTH values if lemmas were previously set, otherwise leave unset
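Until then, one possible workaround is to merge the chunks manually after the lemmatizer has run and carry the lemmas through the retokenizer's attrs explicitly (a sketch, not the proposed built-in behavior):

import spacy
nlp = spacy.load("en_core_web_sm")  # default pipeline, lemmatizer included
doc = nlp("I like New York")
with doc.retokenize() as retokenizer:
    for chunk in list(doc.noun_chunks):
        # set the merged token's lemma from the span's already computed lemma
        retokenizer.merge(chunk, attrs={"LEMMA": chunk.lemma_})
print([t.lemma_ for t in doc])  # lemmas survive the merge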