spaCy fails to tokenize the closing parenthesis as a suffix when preceded by 8: "8)"

Created on 9 Jul 2020 · 5 comments · Source: explosion/spaCy

How to reproduce the behaviour

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("(8)")
[tok.text for tok in doc]

returns:
['(', '8)']

whereas the expected output would be ['(', '8', ')'].

Your Environment

  • Operating System: Ubuntu 18.04.4 LTS (Linux)
  • Python Version Used: 3.6.9
  • spaCy Version Used: 2.2.3 (but also happens with 2.3.0)
  • Environment Information:

Labels: feat / tokenizer, usage

All 5 comments

Hm, I think I had this as an exception from a list of text emoticons. If it's still in the exceptions data, it's definitely better to remove it.
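A quick way to check whether it's still in the exceptions data (just a sketch; I'm assuming "8)" is stored as a plain key there):

import spacy

nlp = spacy.load('en_core_web_sm')
# "8)" should show up as a key in the default exceptions data if it hasn't been removed
print("8)" in nlp.Defaults.tokenizer_exceptions)
# presumably True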

The emoticon list could definitely be shortened.

Two tips:

  1. To see why the tokenizer is tokenizing a particular way, use nlp.tokenizer.explain:

print(nlp.tokenizer.explain("(8)"))
# [('PREFIX', '('), ('SPECIAL-1', '8)')]

  2. To remove a special case (special cases are called rules internally):

nlp.tokenizer.rules = {k: v for k, v in nlp.tokenizer.rules.items() if k != "8)"}
print(nlp.tokenizer.explain("(8)"))
# [('PREFIX', '('), ('TOKEN', '8'), ('SUFFIX', ')')]

(I'm recommending doing it this way with a dict comprehension since reassigning the rules property also clears the internal tokenizer cache, which del nlp.tokenizer.rules["8)"] wouldn't. This is not the best state of affairs, but it's how it works for now.)

Hi, thanks for the prompt responses. These tips are super useful indeed; I forgot to look at the tokenizer exceptions.
In my case I am using a custom tokenizer as follows:

tokenizer = Tokenizer(nlp.vocab, rules=special_cases,
                      prefix_search=prefix_re.search,
                      suffix_search=suffix_re.search,
                      infix_finditer=infix_re.finditer)

with special_cases = nlp.Defaults.tokenizer_exceptions, so I guess I can simply do del special_cases["8)"] before creating the tokenizer, right? Something like the sketch below.
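(Sketch of the full setup; prefix_re, suffix_re and infix_re here just reuse the default patterns as stand-ins for my actual regexes.)

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load('en_core_web_sm')

# copy the default special cases and drop the emoticon rule before the tokenizer is created
special_cases = dict(nlp.Defaults.tokenizer_exceptions)
del special_cases["8)"]

# stand-ins for the actual custom regexes
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)

nlp.tokenizer = Tokenizer(nlp.vocab, rules=special_cases,
                          prefix_search=prefix_re.search,
                          suffix_search=suffix_re.search,
                          infix_finditer=infix_re.finditer)

print([tok.text for tok in nlp("(8)")])
# expected: ['(', '8', ')']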

Along the same lines, I also noticed that Wed and wed are in the special rules, which could be problematic for certain applications (the verb wed gets split as if it were the contraction we'd):

print(nlp.tokenizer.explain("wed"))
# [('SPECIAL-1', 'we'), ('SPECIAL-2', 'd')]
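Removing those the same way seems to do the trick (sketch, using the same dict-comprehension approach as above):

# drop the "wed"/"Wed" contraction rules
nlp.tokenizer.rules = {k: v for k, v in nlp.tokenizer.rules.items()
                       if k not in ("wed", "Wed")}
print(nlp.tokenizer.explain("wed"))
# expected: [('TOKEN', 'wed')]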

Yes, you can use del as normal if you do it before initializing the tokenizer. The caveat mainly applies when you're editing the settings of a tokenizer that has already been initialized. Even then, if you haven't tokenized anything yet it's fine, but if you've already used the model to tokenize a few examples and are modifying it on the fly, you can run into some really confusing cache-related behavior.
