spaCy: merge_noun_chunks removes the lemmatizer's annotations

Created on 1 Jan 2021 · 2 comments · Source: explosion/spaCy

How to reproduce the behaviour

Running merge_noun_chunks after the lemmatizer does not propagate the lemmatizer's LEMMA annotations to the merged chunks:

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_noun_chunks")

text = "Hello world"
doc = nlp(text)
print([t.lemma_ for t in doc])
print(doc.has_annotation("LEMMA"))

Prints out:

['']
False

This can be a problem, for example, when a Matcher relies on these annotations and matches against small documents (i.e. documents where every LEMMA annotation ends up removed):

ValueError: [E155] The pipeline needs to include a lemmatizer in order to use Matcher or PhraseMatcher with the attribute LEMMA. Try using `nlp()` instead of `nlp.make_doc()` or `list(nlp.pipe())` instead of `list(nlp.tokenizer.pipe())`.
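For illustration, here is a minimal sketch of how a LEMMA-based pattern can trigger that error with the pipeline above (the pattern name and text are made up):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_noun_chunks")  # appended after the lemmatizer

matcher = Matcher(nlp.vocab)
matcher.add("WORLD", [[{"LEMMA": "world"}]])  # pattern requires LEMMA annotations

doc = nlp("Hello world")  # the whole text is merged into one chunk, losing its lemma
matches = matcher(doc)    # raises E155: doc.has_annotation("LEMMA") is False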

Solution

Making sure that merge_noun_chunks is placed before the lemmatizer solves the issue:

nlp.add_pipe("merge_noun_chunks", before="lemmatizer")

I think it might be worth either:

  • mentioning this in the docs
  • or having merge_noun_chunks also merge the lemmatizer's annotations

Info about spaCy

  • spaCy version: 3.0.0rc2
  • Platform: Linux-5.10.3-arch1-1-x86_64-with-glibc2.2.5
  • Python version: 3.8.7
  • Pipelines: en_core_web_sm (3.0.0a0), en_core_web_trf (3.0.0a0)
Labels: feat / lemmatizer · feat / pipeline · 🌙 nightly


All 2 comments

Thanks for the detailed report! We'll have a look at how best to address this.

The retokenizer currently resets LEMMA and NORM. Unset NORM on the token defaults back to NORM on the lexeme, which should still be okay in v3 and work with the Matcher. In v2, unset LEMMA used to default back to the lookup lemma (in the Python Token API; that still wouldn't have worked for the Matcher), but this is no longer the default in v3, so this should be updated.

I guess reasonable defaults would be (see the sketch after this list):

  • merge: concatenate any existing lemmas with SPACY preserved
  • split: use the new ORTH values if lemmas were previously set, otherwise leave unset
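
As an interim workaround along those lines, lemmas can be preserved by merging noun chunks manually and setting LEMMA explicitly through the retokenizer. This is only a sketch: the concatenation scheme (joining each token's lemma with its trailing whitespace) is my assumption, not spaCy's built-in behaviour:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown foxes jump over the lazy dog")

with doc.retokenize() as retokenizer:
    for chunk in list(doc.noun_chunks):
        # Assumed scheme: concatenate the tokens' lemmas, preserving each
        # token's trailing whitespace (SPACY), then strip the final space.
        lemma = "".join(t.lemma_ + t.whitespace_ for t in chunk).strip()
        retokenizer.merge(chunk, attrs={"LEMMA": lemma})

print([t.lemma_ for t in doc])      # merged chunks keep lemmas, e.g. "the quick brown fox"
print(doc.has_annotation("LEMMA"))  # stays True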