From my understanding, the code below should produce a Doc object of length 15; instead it produces an object of length 16, because the term "e2g" is split into two tokens.
This doesn't appear to be the appropriate behavior.
For reference, the length of the test word list doesn't appear to matter. It seems that many strings that end in "g" with a number immediately before it (2g, V2G) also produce this behavior.
regex => (.*)\d+g
import spacy
nlp = spacy.load('en_core_web_md')
testwords=u'utilizing utm utmost utms utterly uwb ux v1 v2 e2g v2i v2v v3 vacant vacating'
doc=nlp(testwords);print len(doc), len(testwords.split(" "))
16 15
zip(doc, testwords.split(" "))
[(utilizing, u'utilizing'),
(utm, u'utm'),
(utmost, u'utmost'),
(utms, u'utms'),
(utterly, u'utterly'),
(uwb, u'uwb'),
(ux, u'ux'),
(v1, u'v1'),
(v2, u'v2'),
(e2, u'e2g'),
(g, u'v2i'),
(v2i, u'v2v'),
(v2v, u'v3'),
(v3, u'vacant'),
(vacant, u'vacating')]
Digging a little further... it's probably the case that these terms are being treated as a number with a unit of grams.
spacy/language_data/punctuation.py
Accordingly, they are being parsed like the term 100GB and split into 100 and GB.
This is just a hypothesis... but I hope that I'm wrong!
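One way to check this hypothesis on a newer spaCy release than the one used here (tokenizer.explain was added later) is to ask the tokenizer which rule produced each piece; a small sketch, assuming a blank English pipeline:

import spacy

nlp = spacy.blank("en")
# Each entry is (rule type, substring); a SUFFIX entry for "g" would confirm
# that a unit suffix rule is responsible for the split.
for rule, substring in nlp.tokenizer.explain("e2g 100g"):
    print(rule, repr(substring))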
Thanks for the report and analysis! This is interesting – the "g" should definitely only be split off if it's preceded by _only numbers_ (in which case it's fair to assume that it most commonly stands for grams or gigabytes).
So this might indeed be an issue with the suffix rules and should be easy to fix with some fiddling and a simple regression test. Will look into this! (Also tagging this issue with help wanted (easy) in case someone else wants to investigate this).
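For reference, a minimal sketch of what such a regression test could look like. The fixture and parametrization are illustrative pytest boilerplate using a blank English tokenizer, not necessarily spaCy's own test setup; it simply encodes the intended behaviour (split the trailing unit only when the rest of the token is digits):

import pytest
import spacy


@pytest.fixture
def en_tokenizer():
    # Assumption: a blank English pipeline stands in for the test suite's fixture.
    return spacy.blank("en").tokenizer


@pytest.mark.parametrize("text", ["e2g", "v2G", "V2G"])
def test_mixed_alnum_plus_g_stays_whole(en_tokenizer, text):
    # Letters mixed in before the trailing "g" mean this is not a number + unit.
    assert len(en_tokenizer(text)) == 1


@pytest.mark.parametrize("text,expected", [("2g", ["2", "g"]), ("100g", ["100", "g"])])
def test_number_plus_g_is_split(en_tokenizer, text, expected):
    # Digits only before the trailing "g": split off the unit (grams/gigabytes).
    assert [t.text for t in en_tokenizer(text)] == expected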
Suggesting r'(?<=\s[0-9]+(\.[0-9][0-9]?)?)(?:{u})'.format(u=UNITS) for
spacy/language_data/punctuation.py
i.e. whitespace and a 'real number' before the unit. (Though this doesn't address 10,999G... but I'm not sure what the protocol should truly be, as spaCy is truly global... or should that be intergalactic?)
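On a per-token basis, the intent could be sketched roughly like this (the UNITS alternation below is a made-up subset for illustration, not spaCy's actual list):

import re

UNITS = "g|kg|mg|gb|mb"  # assumption: illustrative subset only
# Split a trailing unit off only when everything before it is a (possibly decimal) number.
token_re = re.compile(r"^([0-9]+(?:\.[0-9]+)?)({u})$".format(u=UNITS), re.IGNORECASE)

for token in ["100g", "2.5kg", "e2g", "v2G", "10,999G"]:
    m = token_re.match(token)
    print(token, "->", [m.group(1), m.group(2)] if m else [token])
# 100g and 2.5kg are split; e2g, v2G and 10,999G are left whole.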
Also... It seems that something similar should be included for currency...
And finally... shouldn't seconds (s) be included in the list of units? Or does that create other issues with plurals?
The suggested change helps in the original case, but it seems that there may be other rules that are being applied that aren't quite obvious.
In the case below "id", "ima", and "im" are split.
I'm assuming terms like these are so rare that they are never used as features for many users; however, for my use case I can't know a priori that they will not impact the pipeline. (That said... I'm also interested in helping ensure that the code functions as one would expect!)
import spacy
nlp = spacy.load('en_core_web_md')
testwords=u'v2G 1m id ID did dim im ima filler words to see'
doc=nlp(testwords);print len(doc), len(testwords.split(" "))
zip(testwords.split(" "),doc)
[(u'v2G', v2G),
(u'1m', 1m),
(u'id', i),
(u'ID', d),
(u'did', ID),
(u'dim', did),
(u'im', dim),
(u'ima', i),
(u'filler', m),
(u'words', i),
(u'to', m),
(u'see', a)]
Those special rules would be defined here
https://github.com/explosion/spaCy/blob/master/spacy/en/tokenizer_exceptions.py
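If those exceptions get in the way for a particular pipeline, they can be overridden per string with add_special_case; a minimal sketch, assuming a blank English pipeline:

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
# Force "id", "im" and "ima" to stay single tokens in this pipeline,
# overriding any built-in exception for those strings.
for word in ("id", "im", "ima"):
    nlp.tokenizer.add_special_case(word, [{ORTH: word}])

print([t.text for t in nlp("im not sure my id works")])
# ['im', 'not', 'sure', 'my', 'id', 'works']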
OK, based on these exceptions, I see why id, im, and ima are treated in that manner... though it seems this is a bit of a challenge, as ID (identification) and IM (instant messaging) clash with vernacular use of the same characters.
@slremy Thanks for the suggestion and examples! The pattern can't include whitespace, though – the special case rules are applied _after_ spaCy splits the text on whitespace characters. (But this also means you don't have to worry about whitespace.) If you have a minute and want to submit a quick pull request with a test for the number + "g" case, that would be nice!
And yes, spaCy assumes that "ima" = "i" + "m" (am) + "a" (going to) and "id" = "i" + "d" (would/had). Cases like this are always tricky, so we usually go with what's likely most common in English text. I think the reason why "s" for seconds is not included in that list is because of cases like "90s" (which can mean both the 1990s or 90 seconds).
If you need very custom rules for your data, you can always add your own tokenization rules, or create your own tokenizer with custom regular expressions – see the docs on custom tokenization for examples. Alternatively, if you need to match more complex patterns consisting of several tokens and whitespace, you can use the rule-based matcher and write patterns like:
[{ORTH: 'g'}, {IS_DIGIT: True}, {IS_ALPHA: True}] # each dict represents one token
Btw, the Matcher got a little overhaul in v2.0, so if you want to try the alpha, you can also check out the more detailed examples in the alpha docs, including patterns for matching phone numbers, emoji and hashtags.
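For example, a pattern in the same spirit as the one above, using the list-of-dicts Matcher API (assuming the newer add() signature that takes a list of patterns):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# One dict per token: a digit-only token followed by a literal "g"/"G".
matcher.add("NUM_G", [[{"IS_DIGIT": True}, {"LOWER": "g"}]])

doc = nlp("the sample weighed 100 g , then 2 g more")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
# 100 g
# 2 g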
Since a built-in solution for this will ultimately come down to improving spaCy's regex handling, I'm closing this and merging it with the master regex issue in #1642.