When tokenizing Hebrew, a full stop at the end of a sentence is not split off as a separate token, while a sentence-final question mark, exclamation mark, or ellipsis is.
Example:
from spacy.he import Hebrew
tokenizer = Hebrew().tokenizer
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה.')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה.']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה?')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '?']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה!')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה..')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה...')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...']
Thanks for the report. I think this is caused by the global regex rules for punctuation, some of which currently only cover Latin characters. We originally chose the approach of spelling out the individual characters because it made it easier to create uppercase/lowercase sets, and it kept things a bit more readable while we were tidying up the language data and inviting more people to contribute.
But now that we're adding more and more languages, this keeps coming up, so we should fix it. (If I remember correctly, this was already causing problems for people working with Bengali and on the Russian integration.)
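To illustrate the difference with a minimal, hypothetical sketch (this is not spaCy's actual punctuation rule): a suffix pattern built from an explicit Latin character class won't split a trailing full stop off a Hebrew word, while a Unicode-aware letter class from the regex library will.

import re
import regex  # third-party `regex` package, supports Unicode properties like \p{L}

# Hypothetical, simplified suffix rule: split a trailing "." only if it
# follows an explicitly listed Latin letter.
latin_only = re.compile(r'(?<=[a-zA-Z])\.$')
print(bool(latin_only.search('state.')))      # True  -> "." would become its own token
print(bool(latin_only.search('המדינה.')))     # False -> "." stays attached to the word

# The same rule written with a Unicode letter class matches any script:
unicode_aware = regex.compile(r'(?<=\p{L})\.$')
print(bool(unicode_aware.search('המדינה.')))  # True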
~~I'll open a separate issue about this for spaCy v2.0, but in short:~~ Never mind, just making this the master issue. Steps are:
- Use the regex library to handle compiling the correct character classes.

I'll take a shot at fixing it.
Thanks a lot! I also added your examples to the tests for Hebrew btw (see commit above) and xfailed the one that ends with a full stop.
I think our overall test coverage for the tokenizer and prefixes/suffixes/infixes is pretty good by now, so this should hopefully help with testing the fix.
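For reference, a rough sketch of what such tests might look like; the he_tokenizer fixture name and the parametrization are assumptions here, not the exact code from the commit:

import pytest

# Sentence-final "?" should become its own token, giving six tokens in total.
@pytest.mark.parametrize('text,length', [('עקבת אחריו בכל רחבי המדינה?', 6)])
def test_he_tokenizer_handles_punct(he_tokenizer, text, length):
    tokens = he_tokenizer(text)
    assert len(tokens) == length

# The full-stop case is expected to fail until the punctuation rules are
# fixed, so it is marked as xfail.
@pytest.mark.xfail
@pytest.mark.parametrize('text,length', [('עקבת אחריו בכל רחבי המדינה.', 6)])
def test_he_tokenizer_splits_full_stop(he_tokenizer, text, length):
    tokens = he_tokenizer(text)
    assert len(tokens) == length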
It seems I can't run the tests. Both on Windows and in a fresh Lubuntu VM, pytest complains that there's no module named spacy.gold. I'm using Python 3.6.1 and have run pip install -r requirements.txt.
While I'm asking: when you said to remove the explicit character list, did you mean everything from _ALPHA_LOWER to _HYPHENS, or something else?
Thanks :)
Ah, have you tried installing the current directory in development mode and then rebuilding spaCy from source?
pip install -e .
If it still complains, you might be running the wrong version of pytest by accident (i.e. the system one or something – this is always super frustrating, because it produces incredibly confusing errors).
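One way to make sure the tests run against the in-place build and through the right interpreter (the test path here is an assumption based on the usual repo layout):

pip install -r requirements.txt
pip install -e .
python -m pytest spacy/tests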
About the characters: the main focus should be _ALPHA_LOWER and _ALPHA_UPPER. As for hyphens and other characters, it might be best to keep those a little more explicit. There aren't that many of them, and there might always be a case where we want to exclude certain characters on purpose.
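A hedged sketch of that split, reusing the variable names from above but with purely illustrative contents (this is not the actual language data):

import regex

# Letter sets become Unicode property classes instead of spelled-out lists:
_ALPHA_LOWER = r'\p{Ll}'
_ALPHA_UPPER = r'\p{Lu}'

# Hyphen-like characters stay explicit, so individual ones can still be
# excluded on purpose:
_HYPHENS = ['-', '–', '--']

# Example: an infix pattern that splits on a hyphen between lowercase
# letters, regardless of script:
hyphen_alternation = '|'.join(regex.escape(h) for h in _HYPHENS)
infix_re = regex.compile(
    '(?<=' + _ALPHA_LOWER + ')(?:' + hyphen_alternation + ')(?=' + _ALPHA_LOWER + ')')
print(bool(infix_re.search('well-known')))  # True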
Hey, how did you manage to import the Hebrew class? I'm trying spacy.he but can't find it.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.