spaCy: merge_noun_chunks removes the lemmatizer's annotations

Created on 1 Jan 2021 · 2 comments · Source: explosion/spaCy

How to reproduce the behaviour

Running merge_noun_chunks after the lemmatizer does not propagate the lemmatizer's LEMMA annotations to the merged chunks:

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_noun_chunks")

text = "Hello world"
doc = nlp(text)
print([t.lemma_ for t in doc])
print(doc.has_annotation("LEMMA"))

Prints out:

['']
False

This can be a problem, for example, when a Matcher relies on these annotations and matches against small documents (i.e. documents where every LEMMA annotation ends up removed):

ValueError: [E155] The pipeline needs to include a lemmatizer in order to use Matcher or PhraseMatcher with the attribute LEMMA. Try using `nlp()` instead of `nlp.make_doc()` or `list(nlp.pipe())` instead of `list(nlp.tokenizer.pipe())`.
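For illustration, here is a minimal sketch of how a LEMMA-based pattern can trigger that error with the pipeline above (the pattern name and text are made up):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_noun_chunks")  # appended after the lemmatizer

matcher = Matcher(nlp.vocab)
matcher.add("WORLD", [[{"LEMMA": "world"}]])  # pattern requires LEMMA annotations

doc = nlp("Hello world")  # the whole text is merged into one chunk, losing its lemma
matches = matcher(doc)    # raises E155: doc.has_annotation("LEMMA") is False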

Solution

Making sure that merge_noun_chunks is placed before the lemmatizer solves the issue:

nlp.add_pipe("merge_noun_chunks", before="lemmatizer")

I think it might be worth either:

  • mentioning this in the docs
  • or having merge_noun_chunks also merge the lemmatizer's annotations

Info about spaCy

  • spaCy version: 3.0.0rc2
  • Platform: Linux-5.10.3-arch1-1-x86_64-with-glibc2.2.5
  • Python version: 3.8.7
  • Pipelines: en_core_web_sm (3.0.0a0), en_core_web_trf (3.0.0a0)
Labels: feat / lemmatizer · feat / pipeline · 🌙 nightly


All 2 comments

Thanks for the detailed report! We'll have a look at how best to address this.

The retokenizer currently resets LEMMA and NORM. Unset NORM on the token defaults back to NORM on the lexeme, which should still be okay in v3 and work with the Matcher. In v2, unset LEMMA used to default back to the lookup lemma (in the Python Token API; that still wouldn't have worked for the Matcher), but this is no longer the default in v3, so this should be updated.

I guess reasonable defaults would be (see the sketch after this list):

  • merge: concatenate any existing lemmas with SPACY preserved
  • split: use the new ORTH values if lemmas were previously set, otherwise leave unset
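
As an interim workaround along those lines, lemmas can be preserved by merging noun chunks manually and setting LEMMA explicitly through the retokenizer. This is only a sketch: the concatenation scheme (joining each token's lemma with its trailing whitespace) is my assumption, not spaCy's built-in behaviour:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown foxes jump over the lazy dog")

with doc.retokenize() as retokenizer:
    for chunk in list(doc.noun_chunks):
        # Assumed scheme: concatenate the tokens' lemmas, preserving each
        # token's trailing whitespace (SPACY), then strip the final space.
        lemma = "".join(t.lemma_ + t.whitespace_ for t in chunk).strip()
        retokenizer.merge(chunk, attrs={"LEMMA": lemma})

print([t.lemma_ for t in doc])      # merged chunks keep lemmas, e.g. "the quick brown fox"
print(doc.has_annotation("LEMMA"))  # stays True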