When a special-case token is followed by a non-whitespace character, the special token isn't recognized. I wrote a quick test to demonstrate:
```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_depent_web_md')
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
# print([w.text for w in nlp(text)])
assert [w.text for w in nlp(text)] == ['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...gimme...?', '!']
```
The last '...gimme...?' is broken up into '...', 'gimme', '...', '?' by the presence of the '!'.
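The failure mode can be mimicked with a toy chunk-keyed tokenizer (a sketch of the general mechanism, not spaCy's actual implementation; the regex fallback is a crude stand-in for the real prefix/suffix/infix rules):

```python
import re

# Special cases are looked up per whitespace-delimited chunk, so a chunk
# with trailing punctuation attached ('...gimme...?!') never matches the
# rule registered for '...gimme...?'.
special_cases = {'...gimme...?': ['...gimme...?']}

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in special_cases:
            tokens.extend(special_cases[chunk])
        else:
            # crude stand-in for affix splitting
            tokens.extend(re.findall(r'\.\.\.|\w+|[^\w\s]', chunk))
    return tokens

print(toy_tokenize('...gimme...? that ...gimme...?!'))
# ['...gimme...?', 'that', '...', 'gimme', '...', '?', '!']
```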
This bug only reproduces after the pipeline has already been run:
```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_depent_web_md')
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
print([w.text for w in nlp(text)])
# ['...', 'gimme', '...', '?', 'that', '...', 'gimme', '...', '?', 'or', 'else', '...', 'gimme', '...', '?', '!']
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
print([w.text for w in nlp(text)])
# ['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...', 'gimme', '...', '?', '!']
```
But:
```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_depent_web_md')
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
print([w.text for w in nlp(text)])
# ['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...gimme...?', '!']
```
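The order dependence is exactly what a per-chunk output cache that is never invalidated would produce. A minimal sketch (toy code, not spaCy's implementation):

```python
import re

class ToyTokenizer:
    """Caches the tokenization of each whitespace chunk, as spaCy does
    for common chunks -- but (buggily) never invalidates the cache."""

    def __init__(self):
        self.special_cases = {}
        self.cache = {}

    def add_special_case(self, chunk, tokens):
        self.special_cases[chunk] = tokens
        # Bug: self.cache is not cleared, so chunks tokenized before
        # this call keep returning their old, pre-rule splits.

    def _split(self, chunk):
        if chunk in self.special_cases:
            return list(self.special_cases[chunk])
        return re.findall(r'\.\.\.|\w+|[^\w\s]', chunk)

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self.cache:
                self.cache[chunk] = self._split(chunk)
            tokens.extend(self.cache[chunk])
        return tokens

nlp = ToyTokenizer()
nlp('...gimme...? that')                  # first run seeds the cache
nlp.add_special_case('...gimme...?', ['...gimme...?'])
print(nlp('...gimme...? that'))           # stale: ['...', 'gimme', '...', '?', 'that']

fresh = ToyTokenizer()                    # rule added before any run
fresh.add_special_case('...gimme...?', ['...gimme...?'])
print(fresh('...gimme...? that'))         # ['...gimme...?', 'that']
```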
Both really helpful, thanks!
It seems to be even a little more nuanced. Comments inline with test case:
```python
import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en_depent_web_md', parser=False, entity=False)
text = 'I like _MATH_ even _MATH_ when _MATH_, except when _MATH_ is _MATH_! but not _MATH_.'
print([w.text for w in nlp(text)])
nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
nlp = spacy.load('en_depent_web_md', parser=False, entity=False)
nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
```
Kind thanks for the great report --- this was a very long-standing cache-invalidation bug. Comment from the patch:
Add flush_cache method to tokenizer, to fix #1061
The tokenizer caches output for common chunks, for efficiency. This cache must be invalidated when the tokenizer rules change, e.g. when a new special-case rule is introduced. That's what was causing #1061. When the cache is flushed, we free the intermediate token chunks.
I think this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
Migrating to v2, I reran this test... it works except when the special case is followed by a single period. Of note: before adding the special case, the suffix `_.` remains attached to the token.
```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm', parser=False, entity=False)
text = 'I like _MATH_ even _MATH_ when _MATH_, except when _MATH_ is _MATH_! or _MATH_? or _MATH_: or _MATH_; or even _MATH_.. but not _MATH_. or _MATH_.'
print([w.text for w in nlp(text)])
# As expected, it treats prefix and suffix symbols as tokens, but it fails when the suffix is _.
# ['I', 'like', '_', 'MATH', '_', 'even', '_', 'MATH', '_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_', 'MATH', '_', 'is', '_', 'MATH', '_', '!', 'or', '_', 'MATH', '_', '?', 'or', '_', 'MATH', '_', ':', 'or', '_', 'MATH', '_', ';', 'or', 'even', '_', 'MATH', '_', '..', 'but', 'not', '_', 'MATH_.', 'or', '_', 'MATH_.']
nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# The special case gives the desired tokenization, except when the token is followed by a period
# ['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_MATH_', ',', 'except', 'when', '_MATH_', 'is', '_MATH_', '!', 'or', '_MATH_', '?', 'or', '_MATH_', ':', 'or', '_MATH_', ';', 'or', 'even', '_MATH_', '..', 'but', 'not', '_', 'MATH_.', 'or', '_', 'MATH_.']
```
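Since special cases match whole whitespace-delimited chunks, one possible workaround (an untested assumption against real spaCy, where it would be `nlp.tokenizer.add_special_case('_MATH_.', [{ORTH: '_MATH_'}, {ORTH: '.'}])`) is to register the punctuated chunk itself. Sketched with a toy chunk tokenizer:

```python
# Toy chunk tokenizer: special cases are keyed on whole whitespace chunks,
# so registering the punctuated chunk '_MATH_.' restores the split.
special = {
    '_MATH_': ['_MATH_'],
    '_MATH_.': ['_MATH_', '.'],   # hypothetical workaround for the trailing period
}

def toy_tokenize(text):
    out = []
    for chunk in text.split():
        out.extend(special.get(chunk, [chunk]))
    return out

print(toy_tokenize('but not _MATH_. or _MATH_'))
# ['but', 'not', '_MATH_', '.', 'or', '_MATH_']
```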