Spacy: Lemma_ for "I" returns weird value: -PRON-

Created on 7 Apr 2017 · 11 Comments · Source: explosion/spaCy

Hey,

I noticed something weird when finding the lemma_ of tokens.
For the token 'cakes', nlp("cakes")[0].lemma_ gives what I expect: 'cake'.
Likewise, nlp("i")[0].lemma_ gives 'i'. However, I get some weird behavior when I use an uppercase "I", as in "I am hungry":

>>> import spacy
>>> nlp = spacy.load('en')
>>> print(nlp("I")[0].lemma_)
-PRON-

I'm not sure if this is intended behavior, or a bug. If it's a bug, is this something that's been encountered before?

I'm running spacy 1.7.3 on osx.

spaCy version: 1.7.3
Platform: Darwin-16.4.0-x86_64-i386-64bit
Python version: 3.6.0
Installed models: en

enhancement


ericzhao28

Most helpful comment

I'll repost my argument against '-PRON-' lemmas here to make it visible to other interested participants: _lemmas should arguably be part of the language_.
I'm not a lexicographer or a linguist, but looking at the definitions, I'm almost certain that this is the case. There is a practical reason too: lemmatisation may be used directly for looking up items in external lexical resources. Using an artificial lemma is a guarantee that nothing will be found.
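
To make the look-up point concrete, here is a minimal sketch. The LEXICON dictionary and the look_up helper are hypothetical stand-ins for an external lexical resource, not part of spaCy or of any real lexicon API.

```python
# Toy stand-in for an external lexical resource (hypothetical data).
LEXICON = {
    "cake": "a sweet baked food",
    "i": "first-person singular pronoun",
}


def look_up(lemma):
    """Return the lexicon entry for a lemma, or None if it is missing."""
    return LEXICON.get(lemma.lower())


print(look_up("cake"))    # a real lemma is found
print(look_up("I"))       # so is a pronoun that is part of the language
print(look_up("-PRON-"))  # the artificial lemma has no entry: None
```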

adam-ra on 12 Apr 2017

👍5

All 11 comments

This is expected behavior. See https://spacy.io/docs/api/annotation#lemmatization, https://github.com/explosion/spaCy/issues/906 and https://github.com/explosion/spaCy/issues/898#issuecomment-288164755.

@honnibal I think the amount of confusion/problems caused by this (https://github.com/explosion/spaCy/pull/952, https://github.com/explosion/spaCy/issues/898, https://github.com/explosion/spaCy/issues/906) warrants reconsidering this decision for the 2.0 release. The Universal Dependencies project seems to go with "I" as the lemma (taken from https://raw.githubusercontent.com/UniversalDependencies/UD_English/master/en-ud-dev.conllu):
~
2 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 4 nsubj _ _
~

f11r on 7 Apr 2017

👍1

@f11r You're probably right.

The behaviour here is inconsistent though --- so there's a mistake either way.

honnibal on 7 Apr 2017

adam-ra on 12 Apr 2017

👍5

The look-up argument is decisive: the -PRON- lemma will be reversed in spaCy 2.

It sucks to change this, but it's better to be correct going forward.

Thanks @adam-ra for your input on this

honnibal on 13 Apr 2017

👍4

@honnibal I guess it's never easy, any decision will make some users happy and upset others. But you are the benevolent dictator here ;)
Thanks for the discussions and making it all transparent!

adam-ra on 13 Apr 2017

@ines So, is this still considered a bug to be fixed in a 2.x release?

crystosis on 31 Oct 2017

@crystosis en_core_web_sm-2.0.0a7 still produces -PRON- lemmas

adam-ra on 3 Nov 2017

@crystosis @adam-ra

We've really gone back and forth on this (as you can see from the issue being moved around on our board...)

The thing is, all the alternatives really are worse, especially when you get to contractions and fused tokens. One alternative would be to have each pronoun be its own lemma. But then in the Universal Dependencies data, we get fused tokens where there's only one character for the pronoun. It's really not nice to have no lemma for these, but often getting the correct lemma would require a very difficult decision about the case, gender or other features of the word.

The other consideration is that we're really trying to have as few distinct types of changes in v2 as we can. The models are different, and so is loading and training, and so are the pipelines. Going from 0 changes to the annotation scheme to "just one" change seems quite undesirable. It's another type of thing for people to think about when they're upgrading.

So: lacking a better alternative, we prefer to keep the much unloved -PRON- lemma in v2. I'm sorry we haven't communicated clearly on this.

honnibal on 4 Nov 2017

I agree with @adam-ra, but I guess as a workaround you can use [w.lemma_ if w.lemma_ != '-PRON-' else w.lower_ for w in d], where d is the Doc object returned by nlp().
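
A minimal sketch of that workaround as a reusable function follows. The name lemmas_without_pron and the FakeToken stand-in are assumptions made for illustration; the function relies only on the lemma_ and lower_ attributes mentioned above, so a real Doc from nlp() can be passed in place of the stand-in list.

```python
from collections import namedtuple

PRON_LEMMA = "-PRON-"  # placeholder lemma discussed in this thread


def lemmas_without_pron(doc):
    """Return one lemma per token, falling back to the lowercased
    token text wherever the lemma is the -PRON- placeholder."""
    return [tok.lower_ if tok.lemma_ == PRON_LEMMA else tok.lemma_ for tok in doc]


# Stand-in for spaCy tokens so the sketch runs without a model installed
# (hypothetical; with spaCy available, use doc = nlp("I am hungry") instead).
FakeToken = namedtuple("FakeToken", ["lemma_", "lower_"])
fake_doc = [
    FakeToken("-PRON-", "i"),
    FakeToken("be", "am"),
    FakeToken("hungry", "hungry"),
]

print(lemmas_without_pron(fake_doc))  # ['i', 'be', 'hungry']
```

Note that the fallback gives the lowercased surface form, so "me" and "my" stay distinct from "i" rather than being merged under one lemma.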

nateGeorge on 6 Nov 2017

👍2

@honnibal I understand the rationale and respect your decision. It's not a significant practical fuss for me either.

Perhaps it's worth mentioning that if you decide to support more languages, issues like this will crop up, and some of them will have a much broader scope than just pronouns. For instance, in most (I guess all) Slavic languages, adjectives inflect for gender, and if you want them to have lemmas, you need to arbitrarily select one form (this patriarchal world traditionally prefers masculine forms).

adam-ra on 6 Nov 2017

👍1

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.