import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("(8)")
[tok.text for tok in doc]
returns:
['(', '8)']
Ubuntu 18.04.4 LTS
Hm, I think I had this as an exception from a list of text emoticons. If it's still in the exceptions data, it's definitely better to remove it.
The emoticon list could definitely be shortened.
Two tips:
1. nlp.tokenizer.explain shows which prefix/suffix/special-case rule produced each token:

print(nlp.tokenizer.explain("(8)"))
# [('PREFIX', '('), ('SPECIAL-1', '8)')]

2. You can drop the special case by reassigning nlp.tokenizer.rules:

nlp.tokenizer.rules = {k: v for k, v in nlp.tokenizer.rules.items() if k != "8)"}
print(nlp.tokenizer.explain("(8)"))
# [('PREFIX', '('), ('TOKEN', '8'), ('SUFFIX', ')')]
(I'm recommending doing it this way with a dict comprehension since reassigning the rules property also clears the internal tokenizer cache, which del nlp.tokenizer.rules["8)"] wouldn't. This is not the best state of affairs, but it's how it works for now.)
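To make the cache pitfall concrete, here's a hypothetical illustration (assuming a freshly loaded en_core_web_sm pipeline; the exact cached result may vary):

import spacy

nlp = spacy.load("en_core_web_sm")

nlp.tokenizer("(8)")           # tokenize once so the result lands in the internal cache
del nlp.tokenizer.rules["8)"]  # drops the rule but leaves the cached split behind
print([t.text for t in nlp.tokenizer("(8)")])
# may still be ['(', '8)'], served from the stale cache

# Reassigning the property replaces the rules AND clears the cache:
nlp.tokenizer.rules = {k: v for k, v in nlp.tokenizer.rules.items() if k != "8)"}
print([t.text for t in nlp.tokenizer("(8)")])
# ['(', '8', ')']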
Hi, thanks for the prompt responses. These tips are super useful indeed; I forgot to look at the tokenizer exceptions.
In my case I am using a custom tokenizer, as follows:

from spacy.tokenizer import Tokenizer

tokenizer = Tokenizer(nlp.vocab,
                      rules=special_cases,
                      prefix_search=prefix_re.search,
                      suffix_search=suffix_re.search,
                      infix_finditer=infix_re.finditer)
with special_cases = nlp.Defaults.tokenizer_exceptions, so I guess I can simply do del special_cases["8)"] before creating the tokenizer, right?
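Concretely, something like this (just a sketch; prefix_re, suffix_re, and infix_re are the compiled regexes from my setup above, and nlp is the loaded pipeline):

from spacy.tokenizer import Tokenizer

# copy the shared defaults so the class attribute isn't mutated,
# then drop the emoticon entry before the tokenizer is created
special_cases = dict(nlp.Defaults.tokenizer_exceptions)
del special_cases["8)"]

tokenizer = Tokenizer(nlp.vocab,
                      rules=special_cases,
                      prefix_search=prefix_re.search,
                      suffix_search=suffix_re.search,
                      infix_finditer=infix_re.finditer)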
Along the same lines, I also noticed that Wed and wed are in the special rules, which could be problematic for certain applications (the verb wed gets tokenized as if it were the contraction we'd):
print(nlp.tokenizer.explain("wed"))
# [('SPECIAL-1', 'we'), ('SPECIAL-2', 'd')]
Yes, you can use del like normal if you do it before initializing the tokenizer. It's mainly an issue if you're editing the settings of a tokenizer that's already been initialized. Even then, if you haven't tokenized anything yet it's fine, but if you've already used the model to tokenize a few examples and are modifying it on the fly, you can run into some really confusing cache-related behavior.
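For example, removing the wed rules from an already-initialized tokenizer is safest with the same reassignment pattern as above (a sketch, assuming the stock English pipeline):

import spacy

nlp = spacy.load("en_core_web_sm")

# reassign rather than del so the internal cache is rebuilt along with the rules
nlp.tokenizer.rules = {k: v for k, v in nlp.tokenizer.rules.items()
                       if k not in ("wed", "Wed")}

print(nlp.tokenizer.explain("wed"))
# expected: [('TOKEN', 'wed')]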