spaCy: Norwegian Bokmål model produces incorrect lemmas for NOUNs

Created on 28 Jun 2020  ·  3 Comments  ·  Source: explosion/spaCy

The Norwegian Bokmål model 2.3.0 produces incorrect lemmas for NOUNs.

For example, in the sentence "Formuesskatten er en skatt som utlignes på grunnlag av nettoformuen din." the lemma of Formuesskatten is returned as formuesskatten. The correct lemma in this case is formuesskatt.

With the previous release of the Norwegian Bokmål model (2.2.5), the lemma of Formuesskatten is determined correctly.

This error affects the subsequent decomposition of compound NOUNs.
With the correct lemma:
NOUN formuesskatten --> lemma --> formuesskatt --> samset-leks +skatt

With the incorrect lemma:
NOUN formuesskatten --> lemma --> formuesskatten --> samset-leks +skatten
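
To make the downstream effect concrete, here is a minimal sketch of a suffix-based compound split. The helper `compound_tail` and the tiny word list are hypothetical and only for illustration; they are not part of spaCy or of the tool that produces the samset-leks analysis.

```python
# Hypothetical helper for illustration; not part of spaCy or any tagger.
def compound_tail(lemma, known_words):
    """Return the longest known word that ends `lemma`, or None if there is none."""
    lemma = lemma.lower()
    # Try progressively shorter tails, so the longest match wins.
    for start in range(1, len(lemma)):
        tail = lemma[start:]
        if tail in known_words:
            return tail
    return None


# Assumed toy lexicon containing both the base and the definite form.
known = {"skatt", "skatten", "formue"}

print(compound_tail("formuesskatt", known))    # skatt
print(compound_tail("formuesskatten", known))  # skatten
```

Because the split works on whatever lemma it receives, an inflected lemma yields an inflected final element.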

For now I am using the older model (v2.2.5) for this kind of task.

How to reproduce the behaviour

import spacy

# Load the small Norwegian Bokmål model (version 2.3.0 shows the problem).
nlp = spacy.load("nb_core_news_sm")
doc = nlp("Formuesskatten er en skatt som utlignes på grunnlag av nettoformuen din.")

# Print the lemma and other attributes for every token.
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Result:

Formuesskatten formuesskatten NOUN NOUN__Definite=Def|Gender=Masc|Number=Sing nsubj Xxxxx True False
er er AUX AUX__Mood=Ind|Tense=Pres|VerbForm=Fin cop xx True True
en en DET DET__Gender=Masc|Number=Sing|PronType=Art det xx True True
skatt skatt NOUN NOUN__Definite=Ind|Gender=Masc|Number=Sing ROOT xxxx True False
som som PRON PRON__PronType=Rel nsubj:pass xxx True True
utlignes utlignes VERB VERB__Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass acl:relcl xxxx True False
på på ADP ADP case xx True True
grunnlag grunnlag NOUN NOUN__Definite=Ind|Gender=Neut|Number=Sing obl xxxx True False
av av ADP ADP case xx True True
nettoformuen nettoformuen NOUN NOUN__Definite=Def|Gender=Masc|Number=Sing nmod xxxx True False
din din PRON PRON__Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs nmod xxx True False
. . PUNCT PUNCT punct . False False
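
The affected tokens can be picked out of this output mechanically. The sketch below is a heuristic, not an official check: it assumes that a definite Bokmål noun normally carries a suffix, so a lemma identical to the lowercased surface form is suspicious. The function name `unlemmatized_definite_nouns` is hypothetical, and the rows are copied from the result above.

```python
def unlemmatized_definite_nouns(rows):
    """Flag definite NOUNs whose lemma equals the lowercased surface form.

    rows: iterable of (text, lemma, pos, tag) tuples, as printed above.
    """
    flagged = []
    for text, lemma, pos, tag in rows:
        is_definite_noun = pos == "NOUN" and "Definite=Def" in tag
        if is_definite_noun and lemma == text.lower():
            flagged.append(text)
    return flagged


# NOUN rows taken from the output above.
rows = [
    ("Formuesskatten", "formuesskatten", "NOUN",
     "NOUN__Definite=Def|Gender=Masc|Number=Sing"),
    ("skatt", "skatt", "NOUN",
     "NOUN__Definite=Ind|Gender=Masc|Number=Sing"),
    ("grunnlag", "grunnlag", "NOUN",
     "NOUN__Definite=Ind|Gender=Neut|Number=Sing"),
    ("nettoformuen", "nettoformuen", "NOUN",
     "NOUN__Definite=Def|Gender=Masc|Number=Sing"),
]

print(unlemmatized_definite_nouns(rows))  # ['Formuesskatten', 'nettoformuen']
```

With a live pipeline, the same tuples could be built from `token.text`, `token.lemma_`, `token.pos_` and `token.tag_`.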

Your Environment

  • Operating System: macOS 10.11.6
  • spaCy version: 2.3.0
  • Platform: macOS-10.11.6-x86_64-i386-64bit
  • Python version: 3.8.3
python -m spacy info

============================== Info about spaCy ==============================

spaCy version    2.3.0                         
Location         /Users/MalahovKS/Documents/Velychko/2020/nor-projects/nb-terms/venv/lib/python3.8/site-packages/spacy
Platform         macOS-10.11.6-x86_64-i386-64bit
Python version   3.8.3             
python -m spacy validate
✔ Loaded compatibility table

====================== Installed models (spaCy v2.3.0) ======================
ℹ spaCy installation:
/Users/MalahovKS/Documents/Velychko/2020/nor-projects/nb-terms/venv/lib/python3.8/site-packages/spacy

TYPE      NAME              MODEL             VERSION                            
package   nb-core-news-sm   nb_core_news_sm   2.3.0   ✔
Labels: bug, feat / lemmatizer, lang / nb, perf / accuracy

All 3 comments

Thanks for the report! I can replicate this and it looks like a bug in the lemmatizer.

The 2.3.0 models include more consistent tag maps with morphological features from the UD corpora, but it looks like the presence of the morphological features triggered some older English-specific code that skips lemmatization for singular nouns, which is clearly a bug here. We'll look into a fix!
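
A toy illustration of the mechanism described above, for readers who want to see how such a shortcut can suppress a rule. This is not spaCy's source code: the function names, the single suffix rule and the `use_shortcut` flag are all assumptions made for the sketch.

```python
# Toy illustration only; this is NOT spaCy's actual lemmatizer.
SUFFIX_RULES = [("en", "")]  # assumed rule: strip the definite singular suffix


def is_base_form(pos, morph):
    # English-style shortcut: a singular noun is assumed to already be its lemma.
    return pos == "NOUN" and morph.get("Number") == "Sing"


def lemmatize(word, pos, morph, use_shortcut):
    word = word.lower()
    # With the shortcut on, singular nouns are returned before any rule runs.
    if use_shortcut and is_base_form(pos, morph):
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


# Morphological features as they appear in the 2.3.0 tag for this token.
morph = {"Definite": "Def", "Gender": "Masc", "Number": "Sing"}

print(lemmatize("Formuesskatten", "NOUN", morph, use_shortcut=True))   # formuesskatten
print(lemmatize("Formuesskatten", "NOUN", morph, use_shortcut=False))  # formuesskatt
```

In English a singular noun is usually its own lemma, so the shortcut is harmless there. In Norwegian the definite singular is marked with a suffix, so the same shortcut returns an inflected form.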

Great 👍

Okay, this should be fixed in the upcoming v2.3.1 by #5663.
