When tokenizing Hebrew, a full stop at the end of a sentence is not split off as a separate token, while a sentence-final question mark, exclamation mark, or ellipsis is.
Example:
from spacy.he import Hebrew
tokenizer = Hebrew().tokenizer
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה.')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה.']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה?')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '?']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה!')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה..')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה...')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...']
Thanks for the report. I think this is caused by the global regex rules for punctuation, some of which currently only cover Latin characters. We originally chose the approach of spelling out the individual characters because it made it easier to create uppercase/lowercase sets, and it kept things a bit more readable while we were tidying up the language data and inviting more people to contribute.
But now that we're adding more and more languages, this keeps coming up, so we should fix it. (If I remember correctly, this was already causing problems for people working with Bengali and on the Russian integration.)
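To illustrate the difference with a minimal, hypothetical sketch (this is not spaCy's actual punctuation rule): a suffix pattern built from an explicit Latin character class won't split a trailing full stop off a Hebrew word, while a Unicode-aware letter class from the regex library will.

import re
import regex  # third-party `regex` package, supports Unicode properties like \p{L}

# Hypothetical, simplified suffix rule: split a trailing "." only if it
# follows an explicitly listed Latin letter.
latin_only = re.compile(r'(?<=[a-zA-Z])\.$')
print(bool(latin_only.search('state.')))      # True  -> "." would become its own token
print(bool(latin_only.search('המדינה.')))     # False -> "." stays attached to the word

# The same rule written with a Unicode letter class matches any script:
unicode_aware = regex.compile(r'(?<=\p{L})\.$')
print(bool(unicode_aware.search('המדינה.')))  # True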
~~I'll open a separate issue about this for spaCy v2.0, but in short:~~ Never mind, just making this the master issue. Steps are:
- Use the regex library to handle compiling the correct character classes.

I'll take a shot at fixing it.
Thanks a lot! I also added your examples to the tests for Hebrew btw (see commit above) and xfailed the one that ends with a full stop.
I think our overall test coverage for the tokenizer and prefixes/suffixes/infixes is pretty good by now, so this should hopefully help with testing the fix.
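For reference, a rough sketch of what such tests might look like; the he_tokenizer fixture name and the parametrization are assumptions here, not the exact code from the commit:

import pytest

# Sentence-final "?" should become its own token, giving six tokens in total.
@pytest.mark.parametrize('text,length', [('עקבת אחריו בכל רחבי המדינה?', 6)])
def test_he_tokenizer_handles_punct(he_tokenizer, text, length):
    tokens = he_tokenizer(text)
    assert len(tokens) == length

# The full-stop case is expected to fail until the punctuation rules are
# fixed, so it is marked as xfail.
@pytest.mark.xfail
@pytest.mark.parametrize('text,length', [('עקבת אחריו בכל רחבי המדינה.', 6)])
def test_he_tokenizer_splits_full_stop(he_tokenizer, text, length):
    tokens = he_tokenizer(text)
    assert len(tokens) == length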
It seems I can't run the tests. Both on Windows and in a fresh Lubuntu VM, pytest complains that there's no module named spacy.gold. I'm using Python 3.6.1 and have run pip install -r requirements.txt.
While I'm asking: when you said to remove the explicit character list, did you mean everything from _ALPHA_LOWER to _HYPHENS, or something else?
Thanks :)
Ah, have you tried installing the current directory in development mode and then rebuilding spaCy from source?
pip install -e .
If it still complains, you might be running the wrong version of pytest by accident (i.e. the system one or something – this is always super frustrating, because it produces incredibly confusing errors).
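One way to make sure the tests run against the in-place build and through the right interpreter (the test path here is an assumption based on the usual repo layout):

pip install -r requirements.txt
pip install -e .
python -m pytest spacy/tests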
About the characters: the main focus should be _ALPHA_LOWER and _ALPHA_UPPER. As for hyphens and other characters, it might be best to keep those a little more explicit. There aren't that many of them, and there might always be a case where we want to exclude certain characters on purpose.
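A hedged sketch of that split, reusing the variable names from above but with purely illustrative contents (this is not the actual language data):

import regex

# Letter sets become Unicode property classes instead of spelled-out lists:
_ALPHA_LOWER = r'\p{Ll}'
_ALPHA_UPPER = r'\p{Lu}'

# Hyphen-like characters stay explicit, so individual ones can still be
# excluded on purpose:
_HYPHENS = ['-', '–', '--']

# Example: an infix pattern that splits on a hyphen between lowercase
# letters, regardless of script:
hyphen_alternation = '|'.join(regex.escape(h) for h in _HYPHENS)
infix_re = regex.compile(
    '(?<=' + _ALPHA_LOWER + ')(?:' + hyphen_alternation + ')(?=' + _ALPHA_LOWER + ')')
print(bool(infix_re.search('well-known')))  # True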
Hey, how did you manage to import the Hebrew class? I'm trying spacy.he but can't find it.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.