spaCy: Tokenizer: -ing contraction parsed incorrectly

Created on 9 Aug 2017 · 7 Comments · Source: explosion/spaCy

spaCy doesn't properly tokenize words with a contracted '-ing' ending:

import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("I'm lovin' it")
print(doc[1])
# 'm – CORRECT!
print(doc[1].lemma_)
# be – CORRECT!
print(doc[2])
# lovin – INCORRECT!
print(doc[2].lemma_)
# lovin – INCORRECT!
print(doc[2].pos_)
# ADJ – INCORRECT!

Info about spaCy

  • spaCy version: 1.9.0
  • Platform: Darwin-16.4.0-x86_64-i386-64bit
  • Python version: 3.6.2
  • Installed models: en_core_web_sm

Is there a workaround?
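One pragmatic workaround, pending a fix, is to normalize known contracted forms in the raw text before passing it to nlp. This is a plain-Python sketch using no spaCy API; the word list is illustrative and would need extending for your own corpus:

```python
# Known colloquial -in' forms mapped to their full -ing spellings.
# Illustrative list only — extend as needed.
ING_CONTRACTIONS = {
    "lovin'": "loving",
    "goin'": "going",
    "doin'": "doing",
    "havin'": "having",
}

def normalize_ing(text):
    """Replace known contracted forms token by token."""
    return " ".join(ING_CONTRACTIONS.get(tok, tok) for tok in text.split())

print(normalize_ing("I'm lovin' it"))  # I'm loving it
```

Because the lookup only fires on exact, listed forms, it can't mangle other apostrophe constructions (possessives, other contractions).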

Labels: help wanted · help wanted (easy) · lang / en


All 7 comments

Thanks!

Sorry, looks like we forgot to fix this in the tokenizer exceptions for v2.0 – just added it, and it will be included in the next model release.

Edit: Going forward, we should probably also solve this for similar words. The variety of -ing words that commonly have this contraction is limited, so including them all in the tokenizer exceptions would be a better solution than defining suffix rules (that may end up causing undesired or unpredictable results).

I think there are actually not many words that commonly take this contraction. I believe "lovin'" was introduced by McDonald's a few years ago and has stuck around since. This contraction isn't registered as an official one.

Therefore, one solution might be to assume either that this is the only case where the contraction is correct, or that the contraction can be formed from any verb ending in '-ing' – rather than maintaining a list of exceptions. That would mean changing the tokenizer, but a general rule seems doable.
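A general suffix rule along these lines is easy to sketch, though it also illustrates why the exception-list approach may be safer: a naive rule fires on anything ending in -in followed by an apostrophe, including possessives of names (the Martin's example below is hypothetical, not from the thread):

```python
import re

def undrop_g(text):
    """Naive general rule: treat any word ending in -in followed by an
    apostrophe as a g-dropped -ing form and restore the g."""
    return re.sub(r"\b([A-Za-z]+in)'", r"\1g", text)

print(undrop_g("I'm lovin' it"))    # I'm loving it    (intended)
print(undrop_g("Martin's guitar"))  # Martings guitar  (mangled possessive)
```

This is the kind of "undesired or unpredictable result" a suffix rule can produce, which is an argument for enumerating the handful of common forms instead.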

@MathiasDesch Yeah, I agree – the only other common ones I can think of are maybe goin', doin' and havin'... although, if you're working with texts that use these colloquially (and not as part of slogans etc.), the apostrophe is probably also less common. So maybe those forms could be handled in the new norm exceptions, to make sure they receive the same representations as the full forms.

My PR does not solve the POS-related problem, though: "Lovin'" is still predicted as an ADJ.

@MathiasDesch Thanks so much for taking this on! And yeah, I assume the model likely hasn't seen any of those words during training, so it makes sense that spaCy still gets it wrong.

Just as a note, in case people come across this issue later: The best solution – if spaCy absolutely needs to get this right (e.g. if you're analysing marketing slogans or work for McDonald's 😜) – is to post-train the tagger with a few examples. In v2.0, this should be very easy and quick.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
