The IPython output below illustrates the issue. In particular, notice that Out[5] and Out[6] differ, though I would expect them to be the same. Repeating the expression further keeps giving the incorrect result. If I re-add the special case, the first parse is correct again, but subsequent parses fail.
In [1]: from spacy.en import English
In [2]: en = English()
In [3]: from spacy.symbols import ORTH, LEMMA, POS
In [4]: en.tokenizer.add_special_case(u'reimbur', [{ORTH: u'reimbur', LEMMA: u'reimburse', POS: u'VERB'}])
In [5]: [w.lemma_ for w in en(u'reimbur, reimbur...')]
Out[5]: [u'reimburse', u',', u'reimburse', u'...']
In [6]: [w.lemma_ for w in en(u'reimbur, reimbur...')]
Out[6]: [u'reimbur', u',', u'reimbur', u'...']
I tried some other variations and found odd behavior that might help track down the root cause. The following commands were run after the ones above, with no modifications to the en object or its attributes. The issue does not seem to occur when the token has no prefix or suffix; for tokens that do have a prefix or suffix, it occurs after the first time the token has been seen with that particular prefix/suffix. This makes me wonder if it's caching-related.
In [11]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur')]
Out[11]: [u'reimbur', u',', u'reimbur', u'...', u'reimburse']
In [12]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur')]
Out[12]: [u'reimbur', u',', u'reimbur', u'...', u'reimburse']
In [13]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur.')]
Out[13]: [u'reimbur', u',', u'reimbur', u'...', u'reimburse', u'.']
In [14]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur)')]
Out[14]: [u'reimbur', u',', u'reimbur', u'...', u'reimburse', u')']
In [15]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur)')]
Out[15]: [u'reimbur', u',', u'reimbur', u'...', u'reimbur', u')']
In [16]: [w.lemma_ for w in en(u'reimbur, reimbur... reimbur.')]
Out[16]: [u'reimbur', u',', u'reimbur', u'...', u'reimbur', u'.']
In [17]: [w.lemma_ for w in en(u'(reimbur)')]
Out[17]: [u'(', u'reimburse', u')']
In [18]: [w.lemma_ for w in en(u'(reimbur)')]
Out[18]: [u'(', u'reimbur', u')']
In [19]: [w.lemma_ for w in en(u'(reimbur')]
Out[19]: [u'(', u'reimburse']
In [20]: [w.lemma_ for w in en(u'(reimbur')]
Out[20]: [u'(', u'reimbur']
@honnibal @ines I know you're probably both quite busy working on the upcoming 2.0 release, but could one of you please confirm if this is indeed an issue so I can decide if it's worth allocating time to fix? Thanks!
@macks22 Apologies for not getting to this sooner --- quite an interesting bug! Thanks for the report.
I'm really surprised this hasn't come up before, since it seems like a serious error that's been in the library since before the first release.
Here's the problem. The tokenizer works on whitespace-delimited chunks, like you would get from e.g. string.split(' '). As the tokenizer runs, we can cache any chunk that consists entirely of lexemes that are in the vocab.
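A rough sketch of that idea in plain Python (a toy model for illustration only, not spaCy's actual Cython implementation; the regex-based affix handling here is made up):

import re

chunk_cache = {}

def toy_tokenize(text):
    tokens = []
    for chunk in text.split(' '):                      # whitespace-delimited chunks
        if chunk not in chunk_cache:
            # crude stand-in for spaCy's prefix/suffix splitting
            chunk_cache[chunk] = [p for p in re.split(r'([.,()]+)', chunk) if p]
        tokens.extend(chunk_cache[chunk])              # repeated chunks are served from the cache
    return tokens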
The problem occurs in the interaction between this tokenization cache and the special cases. For the special cases we save a TokenC* array, since we can attach lemmas, tags etc. to the special-case rule. This mixing of levels has always been problematic, but I've seen no real way around it; it's just too useful to be able to set all of this in one place.
A chunk like "reimbur," in your example consists of a special-case rule plus other lexemes, so the wider chunk isn't found in the cache on the first pass and the special-case rule is applied. When we then write the result back to the cache, we only write out the lexemes, which don't carry the lemma. On subsequent passes the chunk is resolved via the cache, bypassing the special-case rule: the special-case tokenization is preserved, but the attributes are lost.
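Extending the toy model above, the buggy interaction looks roughly like this (again an illustrative sketch, not the real code path; in spaCy the cached values are lexeme arrays rather than dicts):

import re

special_cases = {'reimbur': [{'orth': 'reimbur', 'lemma': 'reimburse'}]}
buggy_cache = {}   # chunk text -> cached tokens (lexeme-level attributes only)

def buggy_tokenize(text):
    tokens = []
    for chunk in text.split(' '):
        if chunk in buggy_cache:
            # cache hit: the special-case rule is never consulted again
            tokens.extend(buggy_cache[chunk])
            continue
        result = []
        for part in (p for p in re.split(r'([.,()]+)', chunk) if p):
            if part in special_cases:
                # on this first pass, the special-case rule attaches the lemma
                result.extend(dict(tok) for tok in special_cases[part])
            else:
                result.append({'orth': part, 'lemma': part})
        tokens.extend(result)
        # BUG: only lexeme-level data is written back, so the lemma is dropped
        buggy_cache[chunk] = [{'orth': t['orth'], 'lemma': t['orth']} for t in result]
    return tokens

print([t['lemma'] for t in buggy_tokenize('reimbur, reimbur...')])  # ['reimburse', ',', 'reimburse', '...']
print([t['lemma'] for t in buggy_tokenize('reimbur, reimbur...')])  # ['reimbur', ',', 'reimbur', '...']

The two print calls reproduce the Out[5]/Out[6] behavior from the report: the first pass goes through the special-case rule, the second is served from the cache with the lemma gone.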
Most of the English special-case rules are for contractions, which rarely take attached punctuation. That might be why the bug went undetected all this time.
In other news --- this was the last open bug on the whole repository.
:tada: :tada: :tada: