spaCy: tokenizer.add_special_case not working when a special token is not followed by whitespace

Created on 16 May 2017 · 6 comments · Source: explosion/spaCy

When a special-case token is followed by a non-whitespace character, the special token isn't recognized. I wrote a quick test to demonstrate:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_depent_web_md')  # model from the environment info below
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
# print([w.text for w in nlp(text)])
assert [w.text for w in nlp(text)] == ['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...gimme...?', '!']
```

The last '...gimme...?' is broken up into '...', 'gimme', '...', '?' by the presence of the trailing '!'.

Your Environment

  • spaCy version: 1.8.2
  • Platform: Darwin-16.5.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Installed models: en_depent_web_md
Label: bug


All 6 comments

This bug replicates only when you have already used the pipeline:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_depent_web_md')
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
print([w.text for w in nlp(text)])
# ['...', 'gimme', '...', '?', 'that', '...', 'gimme', '...', '?', 'or', 'else', '...', 'gimme', '...', '?', '!']
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
print([w.text for w in nlp(text)])
# ['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...', 'gimme', '...', '?', '!']
```

But:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_depent_web_md')
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
print([w.text for w in nlp(text)])
# ['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...gimme...?', '!']
```

Info about spaCy

  • spaCy version: 1.8.2
  • Platform: Linux-4.10.13-1-ARCH-x86_64-with-arch
  • Python version: 3.6.1
  • Installed models: en, en_depent_web_md

Both really helpful, thanks!

It seems to be even a little more nuanced; comments are inline with the test case:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_depent_web_md', parser=False, entity=False)
text = 'I like _MATH_ even _MATH_ when _MATH_, except when _MATH_ is _MATH_! but not _MATH_.'
print([w.text for w in nlp(text)])
# As expected, it treats the prefix and suffix symbols as tokens:
# ['I', 'like', '_', 'MATH', '_', 'even', '_', 'MATH', '_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_', 'MATH', '_', 'is', '_', 'MATH', '_', '!', 'but', 'not', '_', 'MATH_.']

nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# The special case gives the desired tokenization, except when the token
# isn't followed by whitespace:
# ['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_MATH_', 'is', '_', 'MATH', '_', '!', 'but', 'not', '_', 'MATH_.']

# Reset the pipeline
nlp = spacy.load('en_depent_web_md', parser=False, entity=False)
nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# As SlavaGanzin points out, adding the special case before using the pipeline
# gives the expected behavior, except when the token is followed (or led) by a period:
# ['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_MATH_', ',', 'except', 'when', '_MATH_', 'is', '_MATH_', '!', 'but', 'not', '_', 'MATH_.']
```

Kind thanks for the great report --- this was a very long-standing cache-invalidation bug. Comment from the patch:

Add flush_cache method to tokenizer, to fix #1061

The tokenizer caches output for common chunks, for efficiency. This
cache must be invalidated when the tokenizer rules change, e.g. when a new
special-case rule is introduced. That's what was causing #1061.

When the cache is flushed, we free the intermediate token chunks.
I think this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
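
To make the failure mode concrete, here is a minimal, hypothetical sketch of the caching pattern involved --- a toy model, not spaCy's actual Cython implementation. The tokenizer memoizes the split for each whitespace-delimited chunk, so a chunk like '...gimme...?!' that was first seen *before* `add_special_case` keeps its stale split until the whole cache is flushed:

```python
import re

class ToyTokenizer:
    """Illustrative only: mimics a chunk-level cache like the one in
    spaCy's Tokenizer, not its real implementation."""

    def __init__(self):
        self._specials = {}
        self._cache = {}  # whitespace-delimited chunk -> token list

    def add_special_case(self, chunk, tokens):
        self._specials[chunk] = tokens
        self._cache[chunk] = list(tokens)
        # BUG: chunks like '...gimme...?!' cached earlier still hold the
        # old split. The patch effectively adds a full flush here:
        # self._cache.clear()

    def _split(self, chunk):
        if chunk in self._specials:
            return list(self._specials[chunk])
        # Crude stand-in for the suffix rules: peel one trailing mark,
        # then re-check the special cases on what remains.
        if chunk and chunk[-1] in '!?.,;:' and not chunk.endswith('...'):
            return self._split(chunk[:-1]) + [chunk[-1]]
        return re.findall(r'\.\.\.|\w+|.', chunk)

    def __call__(self, text):
        out = []
        for chunk in text.split():
            if chunk not in self._cache:
                self._cache[chunk] = self._split(chunk)
            out.extend(self._cache[chunk])  # a hit may be stale
        return out

toy = ToyTokenizer()
text = '...gimme...? or else ...gimme...?!'
toy(text)  # first call fills the cache with the rule-based splits
toy.add_special_case('...gimme...?', ['...gimme...?'])
print(toy(text))
# ['...gimme...?', 'or', 'else', '...', 'gimme', '...', '?', '!']  <- stale entry
toy._cache.clear()  # what the flush_cache patch does when rules change
print(toy(text))
# ['...gimme...?', 'or', 'else', '...gimme...?', '!']
```

In this model the invalidation itself is the whole fix; the patch note above is only hedging about whether the flushed chunks can safely be freed immediately.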

Migrated to v2 and reran this test... it works except when the special case is followed by a single period. Of note: before adding the special case, the suffix '_.' remains attached to the token.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm', parser=False, entity=False)
text = 'I like _MATH_ even _MATH_ when _MATH_, except when _MATH_ is _MATH_! or _MATH_? or _MATH_: or _MATH_; or even _MATH_.. but not _MATH_. or _MATH_.'

print([w.text for w in nlp(text)])
# As expected, it treats the prefix and suffix symbols as tokens, but fails to when the suffix is '_.':
# ['I', 'like', '_', 'MATH', '_', 'even', '_', 'MATH', '_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_', 'MATH', '_', 'is', '_', 'MATH', '_', '!', 'or', '_', 'MATH', '_', '?', 'or', '_', 'MATH', '_', ':', 'or', '_', 'MATH', '_', ';', 'or', 'even', '_', 'MATH', '_', '..', 'but', 'not', '_', 'MATH_.', 'or', '_', 'MATH_.']

nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# The special case gives the desired tokenization, except when the token is followed by a period:
# ['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_MATH_', ',', 'except', 'when', '_MATH_', 'is', '_MATH_', '!', 'or', '_MATH_', '?', 'or', '_MATH_', ':', 'or', '_MATH_', ';', 'or', 'even', '_MATH_', '..', 'but', 'not', '_', 'MATH_.', 'or', '_', 'MATH_.']
```
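
A possible workaround for the remaining case --- untested here, and assuming the suffix rules are simply never splitting the '_.' sequence --- is to register the period-attached form as its own special case, since the ORTH values only need to concatenate back to the original string:

```python
# Hypothetical workaround: spell out the period-attached chunk explicitly,
# so the tokenizer never has to split the '_.' suffix itself.
nlp.tokenizer.add_special_case('_MATH_.', [{ORTH: '_MATH_'}, {ORTH: '.'}])
print([w.text for w in nlp(text)])
```

This only covers the literal '_MATH_.' chunk; handling arbitrary tokens ending in '_' followed by a period would need a custom suffix rule instead.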
