spaCy: Tokenizer: -ing contraction parsed incorrectly

Created on 9 Aug 2017 · 7 Comments · Source: explosion/spaCy

spaCy doesn't properly tokenize words with a contracted '-ing' ending:

import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("I'm lovin' it")
print(doc[1])
# 'm – CORRECT!
print(doc[1].lemma_)
# be – CORRECT!
print(doc[2])
# lovin – INCORRECT!
print(doc[2].lemma_)
# lovin – INCORRECT!
print(doc[2].pos_)
# ADJ – INCORRECT!

Info about spaCy

  • spaCy version: 1.9.0
  • Platform: Darwin-16.4.0-x86_64-i386-64bit
  • Python version: 3.6.2
  • Installed models: en_core_web_sm

Is there a workaround?
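One pragmatic workaround, pending a fix, is to normalize known contracted forms in the raw text before passing it to nlp. This is a plain-Python sketch using no spaCy API; the word list is illustrative and would need extending for your own corpus:

```python
# Known colloquial -in' forms mapped to their full -ing spellings.
# Illustrative list only — extend as needed.
ING_CONTRACTIONS = {
    "lovin'": "loving",
    "goin'": "going",
    "doin'": "doing",
    "havin'": "having",
}

def normalize_ing(text):
    """Replace known contracted forms token by token."""
    return " ".join(ING_CONTRACTIONS.get(tok, tok) for tok in text.split())

print(normalize_ing("I'm lovin' it"))  # I'm loving it
```

Because the lookup only fires on exact, listed forms, it can't mangle other apostrophe constructions (possessives, other contractions).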

Labels: help wanted · help wanted (easy) · lang / en


All 7 comments

Thanks!

Sorry, looks like we forgot to fix this in the tokenizer exceptions for v2.0 – just added it, and it will be included in the next model release.

Edit: Going forward, we should probably also solve this for similar words. The variety of -ing words that commonly have this contraction is limited, so including them all in the tokenizer exceptions would be a better solution than defining suffix rules (that may end up causing undesired or unpredictable results).

I think there are actually not many words that commonly take this contraction. I believe "lovin'" was introduced by McDonald's a few years ago and has stuck around since. This contraction isn't registered as an official one.

Therefore, one solution might be to assume either that this is the only case where the contraction is correct, or that the contraction can be formed from any verb ending in '-ing' – rather than maintaining a list of exceptions. That would mean changing the tokenizer, but a general rule seems doable.
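A general suffix rule along these lines is easy to sketch, though it also illustrates why the exception-list approach may be safer: a naive rule fires on anything ending in -in followed by an apostrophe, including possessives of names (the Martin's example below is hypothetical, not from the thread):

```python
import re

def undrop_g(text):
    """Naive general rule: treat any word ending in -in followed by an
    apostrophe as a g-dropped -ing form and restore the g."""
    return re.sub(r"\b([A-Za-z]+in)'", r"\1g", text)

print(undrop_g("I'm lovin' it"))    # I'm loving it    (intended)
print(undrop_g("Martin's guitar"))  # Martings guitar  (mangled possessive)
```

This is the kind of "undesired or unpredictable result" a suffix rule can produce, which is an argument for enumerating the handful of common forms instead.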

@MathiasDesch Yeah, I agree – the only other common ones I can think of are maybe goin', doin' and havin'... although, if you're working with texts that use these colloquially (and not as part of slogans etc.), the apostrophe is probably also less common. So maybe those forms could be handled in the new norm exceptions, to make sure they receive the same representations as the full forms.

My PR does not solve the POS-related problem, though: "Lovin'" is still predicted as an ADJ.

@MathiasDesch Thanks so much for taking this on! And yeah, I assume the model likely hasn't seen any of those words during training, so it makes sense that spaCy still gets it wrong.

Just as a note, in case people come across this issue later: The best solution – if spaCy absolutely needs to get this right (e.g. if you're analysing marketing slogans or work for McDonald's 😜) – is to post-train the tagger with a few examples. In v2.0, this should be very easy and quick.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
