Spacy: Lemma_ for "I" returns weird value: -PRON-

Created on 7 Apr 2017  ·  11 Comments  ·  Source: explosion/spaCy

Hey,

I noticed something weird when finding the lemma_ of tokens.
When I find the lemma_ for the token for 'cakes': nlp("cakes")[0].lemma_, I get what is expected: 'cake'.
The same thing applies for nlp("i")[0].lemma_ which gives 'i'. However, I get some weird behavior when I use an uppercase "I", as in "I am hungry".

>>> import spacy
>>> nlp = spacy.load('en')
>>> print(nlp("I")[0].lemma_)
-PRON-

I'm not sure if this is intended behavior, or a bug. If it's a bug, is this something that's been encountered before?

I'm running spacy 1.7.3 on osx.

  • spaCy version: 1.7.3
  • Platform: Darwin-16.4.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Installed models: en
enhancement

All 11 comments

This is expected behavior. See https://spacy.io/docs/api/annotation#lemmatization, https://github.com/explosion/spaCy/issues/906 and https://github.com/explosion/spaCy/issues/898#issuecomment-288164755.

@honnibal I think the amount of confusion/problems caused by this (https://github.com/explosion/spaCy/pull/952, https://github.com/explosion/spaCy/issues/898, https://github.com/explosion/spaCy/issues/906) warrants reconsidering this decision for the 2.0 release. The Universal Dependencies project seems to go with "I" as the lemma (taken from https://raw.githubusercontent.com/UniversalDependencies/UD_English/master/en-ud-dev.conllu):
~
2 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 4 nsubj _ _
~
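To make the Universal Dependencies line above easier to read, here is a minimal sketch that splits it into the ten columns defined by the CoNLL-U format. The helper name `parse_conllu_line` is made up for illustration, and splitting on any whitespace is an assumption that suits the line as quoted here (real files are tab-separated).

```python
# Column names defined by the CoNLL-U format, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name.

    Hypothetical helper: it splits on any whitespace, which works for the
    line quoted in this thread but is only an assumption for other data.
    """
    values = line.split()
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(CONLLU_FIELDS), len(values)))
    return dict(zip(CONLLU_FIELDS, values))


row = parse_conllu_line(
    "2 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 4 nsubj _ _"
)
print(row["form"], row["lemma"], row["upos"])  # I I PRON
```

The third column is the lemma, so in this treebank the lemma of "I" is simply "I".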

@f11r You're probably right.

The behaviour here is inconsistent though --- so there's a mistake either way.

I'll repost my argument against '-PRON-' lemmas here to make it visible to other interested participants: _lemmas should arguably be part of the language_.
I'm not a lexicographer or a linguist, but looking at the definitions, I'm almost certain that this is the case. There is a practical reason too: lemmatisation may be used directly for looking up items in external lexical resources. Using an artificial lemma guarantees that nothing will be found.
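A small sketch of the look-up problem, assuming a toy dictionary stands in for an external lexical resource. `LEXICON` and `look_up` are hypothetical names, not part of spaCy or any real resource.

```python
# Hypothetical stand-in for an external lexical resource keyed by lemma.
LEXICON = {
    "i": "first-person singular pronoun",
    "cake": "a sweet baked food",
}


def look_up(lemma):
    """Return the entry for a lemma, or None when the resource has no such key."""
    return LEXICON.get(lemma.lower())


print(look_up("cake"))    # found: 'cake' is a real word
print(look_up("I"))       # found: a real-word lemma matches an entry
print(look_up("-PRON-"))  # None: the placeholder is not a word in any lexicon
```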

The look-up argument is decisive: the -PRON- lemma will be reversed in spaCy 2.

It sucks to change this, but it's better to be correct going forward.

Thanks @adam-ra for your input on this

@honnibal I guess it's never easy, any decision will make some users happy and upset others. But you are the benevolent dictator here ;)
Thanks for the discussions and making it all transparent!

@ines So, is this still considered a bug to be fixed in a 2.x release?

@crystosis en_core_web_sm-2.0.0a7 still produces -PRON- lemmas

@crystosis @adam-ra

We've really gone back and forth on this (as you can see from the issue being moved around on our board...)

The thing is, all the alternatives really are worse, especially when you get to contractions and fused tokens. One alternative would be to have each pronoun be its own lemma. But then in the Universal Dependencies data, we get fused tokens where there's only one character for the pronoun. It's really not nice to have no lemma for these, but often getting the correct lemma would require a very difficult decision about the case, gender or other features of the word.

The other consideration is that we're really trying to have as few distinct types of changes in v2 as we can. The models are different, and so is loading and training, and so are the pipelines. Going from 0 changes to the annotation scheme to "just one" change seems quite undesirable. It's another type of thing for people to think about when they're upgrading.

So: lacking a better alternative, we prefer to keep the much unloved -PRON- lemma in v2. I'm sorry we haven't communicated clearly on this.

I agree with @adam-ra, but as a workaround you can use [w.lemma_ if w.lemma_ != '-PRON-' else w.lower_ for w in d], where d is the Doc object returned by nlp().
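Spelled out, the workaround might look like the sketch below. To keep it runnable without a model, `FakeToken` is a hypothetical stand-in that only mimics the `lemma_` and `lower_` attributes of a spaCy token, and the token values are illustrative rather than real model output. With spaCy installed you would iterate over `nlp("I am hungry")` instead.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the two attributes used here.
FakeToken = namedtuple("FakeToken", ["lemma_", "lower_"])


def lemma_or_lower(token, placeholder="-PRON-"):
    """Return the lemma, falling back to the lowercased text for the placeholder."""
    return token.lower_ if token.lemma_ == placeholder else token.lemma_


# Illustrative values for "I am hungry"; not produced by an actual model.
doc = [
    FakeToken(lemma_="-PRON-", lower_="i"),
    FakeToken(lemma_="be", lower_="am"),
    FakeToken(lemma_="hungry", lower_="hungry"),
]

print([lemma_or_lower(t) for t in doc])  # ['i', 'be', 'hungry']
```

Note that the fallback gives the lowercased surface form, not a true lemma, so "me" and "I" would still come out as different strings.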

@honnibal I understand the rationale and respect your decision. It's not a significant practical fuss for me either.

Perhaps it's worth mentioning that if you decide to support more languages, issues like this will crop up, and some of them will have a much broader scope than just pronouns. For instance, in most (I guess all) Slavic languages adjectives inflect for gender, and if you want them to have lemmas, you need to arbitrarily select one form (this patriarchal world traditionally prefers masculine forms).

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
