spaCy: Invalid lemma for `had` contraction

Created on 28 Apr 2020 · 1 Comment · Source: explosion/spaCy

I'm not sure whether this issue is in scope for this project: as far as I know, the only way to tell whether the 'd contraction stands for had or would is from the context of the sentence. Still, spaCy handles contractions as expected most of the time, and it would be nice to be able to rely on it.

How to reproduce the behaviour

import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp("I'd a dream")
print(doc[1].lemma_)  # lemma of the 'd token
# Output: would

I'd expect this to print `have` instead of `would`.

Your Environment

spaCy version: 2.2.4
Platform: Linux-5.6.7-arch1-1-x86_64-with-glibc2.2.5
Python version: 3.8.2

Labels: feat / lemmatizer · lang / en · perf / accuracy

Source

piotr-szpetkowski

Most helpful comment

Thanks for the report! This is coming from a rule (in the tokenizer exceptions) that assigns the lemma/tag would/MD to the contraction 'd. I think it would make sense to remove would/MD and let the tagger handle it instead. The tagger is still probably going to get this wrong a fair amount of the time (and the tagger will probably do better on 3rd person pronouns than 1st/2nd), but it doesn't make sense for a rule to say it's always would.
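To illustrate the difference between a fixed rule and a tag-dependent lemma, here is a minimal sketch in plain Python. It does not use spaCy's API: the function names, the lookup table, and the hand-assigned tags are all hypothetical, standing in for what a tokenizer exception and a tagger would provide.

```python
# Hypothetical sketch; none of these names are part of spaCy.

def lemma_by_fixed_rule(text):
    """Mimic a fixed exception rule: 'd is always lemmatized as 'would'.

    Other tokens are simply lowercased as a placeholder for real lemmatization.
    """
    return "would" if text == "'d" else text.lower()


# Assumed mapping from Penn Treebank tag to lemma for the 'd contraction.
D_LEMMA_BY_TAG = {"MD": "would", "VBD": "have"}


def lemma_by_tag(text, tag):
    """Pick the lemma of 'd from its tag; leave 'd unchanged for unknown tags.

    Other tokens are simply lowercased as a placeholder for real lemmatization.
    """
    if text == "'d":
        return D_LEMMA_BY_TAG.get(tag, text)
    return text.lower()


# Hand-assigned tags stand in for tagger output (an assumption, not real output).
sentences = [
    [("I", "PRP"), ("'d", "VBD"), ("a", "DT"), ("dream", "NN")],
    [("I", "PRP"), ("'d", "MD"), ("go", "VB")],
]

for sentence in sentences:
    fixed = [lemma_by_fixed_rule(text) for text, _ in sentence]
    tagged = [lemma_by_tag(text, tag) for text, tag in sentence]
    print("fixed rule:", fixed, "| tag-dependent:", tagged)
```

With the fixed rule, 'd comes out as would in both sentences. With the tag-dependent lookup, the lemma is only as good as the tag it is given, which matches the caveat above that the tagger will still get some cases wrong.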