The IPython output below illustrates the issue. In particular, notice that Out[5] and Out[6] differ, though I would expect them to be the same. Repeating the expression further keeps giving the incorrect result. If I re-add the special case, the first parse is correct again, but subsequent parses fail.
In [1]: from spacy.en import English
In [2]: en = English()
In [3]: from spacy.symbols import ORTH, LEMMA, POS
In [4]: en.tokenizer.add_special_case(u'reimbur', [{ORTH: u'reimbur', LEMMA: u'reimburse', POS: u'VERB'}])
In [5]: [w.lemma_ for w in en(u'reimbur, reimbur...')]
Out[5]: [u'reimburse', u',', u'reimburse', u'...']
In [6]: [w.lemma_ for w in en(u'reimbur, reimbur...')]
Out[6]: [u'reimbur', u',', u'reimbur', u'...']
I tried some other variations and found odd behavior that might help track down the root cause. The following commands were run after the ones above, with no modifications to the en object or its attributes. The issue does not seem to occur when the token has no prefix or suffix; for tokens that do have a prefix or suffix, it occurs after the first time the token has been seen with that particular prefix/suffix. This makes me wonder if it's caching-related.
In [11]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur')]
Out[11]: [u'reimbur', u',', u'reimbur', u'...', u'reimburse']
In [12]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur')]
Out[12]: [u'reimbur', u',', u'reimbur', u'...', u'reimburse']
In [13]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur.')]
Out[13]: [u'reimbur', u',', u'reimbur', u'...', u'reimburse', u'.']
In [14]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur)')]
Out[14]: [u'reimbur', u',', u'reimbur', u'...', u'reimburse', u')']
In [15]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur)')]
Out[15]: [u'reimbur', u',', u'reimbur', u'...', u'reimbur', u')']
In [16]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur.')]
Out[16]: [u'reimbur', u',', u'reimbur', u'...', u'reimbur', u'.']
In [17]: [w.lemma_ for w in en(u'(reimbur)')]
Out[17]: [u'(', u'reimburse', u')']
In [18]: [w.lemma_ for w in en(u'(reimbur)')]
Out[18]: [u'(', u'reimbur', u')']
In [19]: [w.lemma_ for w in en(u'(reimbur')]
Out[19]: [u'(', u'reimburse']
In [20]: [w.lemma_ for w in en(u'(reimbur')]
Out[20]: [u'(', u'reimbur']
@honnibal @ines I know you're probably both quite busy working on the upcoming 2.0 release, but could one of you please confirm if this is indeed an issue so I can decide if it's worth allocating time to fix? Thanks!
@macks22 Apologies for not getting to this sooner --- quite an interesting bug! Thanks for the report.
I'm really surprised this hasn't come up before, since it seems like a serious error that's been in the library since before the first release.
Here's the problem. The tokenizer works on whitespace-delimited chunks, like you would get from e.g. string.split(' '). As the tokenizer runs, we can cache any chunk that consists entirely of lexemes that are in the vocab.
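A rough sketch of that idea in plain Python (a toy model for illustration only, not spaCy's actual Cython implementation; the regex-based affix handling here is made up):

import re

chunk_cache = {}

def toy_tokenize(text):
    tokens = []
    for chunk in text.split(' '):                      # whitespace-delimited chunks
        if chunk not in chunk_cache:
            # crude stand-in for spaCy's prefix/suffix splitting
            chunk_cache[chunk] = [p for p in re.split(r'([.,()]+)', chunk) if p]
        tokens.extend(chunk_cache[chunk])              # repeated chunks are served from the cache
    return tokens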
The problem occurs in the interaction between this tokenization cache and the special cases. For the special cases we save a TokenC* array, since we can attach lemmas, tags etc. to the special-case rule. This mixing of levels has always been problematic, but I've seen no real way around it; it's just too useful to be able to set all of this in one place.
A chunk like "reimbur," in your example consists of a special-case rule plus other lexemes, so the wider chunk isn't found in the cache on the first pass and the special-case rule is applied. When we then write the result back to the cache, we only write out the lexemes, which don't carry the lemma. On subsequent passes the chunk is resolved via the cache, bypassing the special-case rule: the special-case tokenization is preserved, but the attributes are lost.
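Extending the toy model above, the buggy interaction looks roughly like this (again an illustrative sketch, not the real code path; in spaCy the cached values are lexeme arrays rather than dicts):

import re

special_cases = {'reimbur': [{'orth': 'reimbur', 'lemma': 'reimburse'}]}
buggy_cache = {}   # chunk text -> cached tokens (lexeme-level attributes only)

def buggy_tokenize(text):
    tokens = []
    for chunk in text.split(' '):
        if chunk in buggy_cache:
            # cache hit: the special-case rule is never consulted again
            tokens.extend(buggy_cache[chunk])
            continue
        result = []
        for part in (p for p in re.split(r'([.,()]+)', chunk) if p):
            if part in special_cases:
                # on this first pass, the special-case rule attaches the lemma
                result.extend(dict(tok) for tok in special_cases[part])
            else:
                result.append({'orth': part, 'lemma': part})
        tokens.extend(result)
        # BUG: only lexeme-level data is written back, so the lemma is dropped
        buggy_cache[chunk] = [{'orth': t['orth'], 'lemma': t['orth']} for t in result]
    return tokens

print([t['lemma'] for t in buggy_tokenize('reimbur, reimbur...')])  # ['reimburse', ',', 'reimburse', '...']
print([t['lemma'] for t in buggy_tokenize('reimbur, reimbur...')])  # ['reimbur', ',', 'reimbur', '...']

The two print calls reproduce the Out[5]/Out[6] behavior from the report: the first pass goes through the special-case rule, the second is served from the cache with the lemma gone.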
Most of the English special-case rules are for contractions, which rarely take attached punctuation. That might be why the bug went undetected all this time.
In other news --- this was the last open bug on the whole repository.
:tada: :tada: :tada: