The evaluation on the UD corpora has shown that French tokenization is significantly slower (in words per second) than for other languages. I'm opening this issue to experiment with a few ideas and discuss potential solutions.
Benchmarking the tokenizer shows some fluctuation across runs, but this seems to be the general performance for the French blank model:

I tried simplifying the ALPHA character class, which adds a lot of complexity because it includes Latin letters that are irrelevant for French, e.g. https://en.wikipedia.org/wiki/Latin_Extended-D.
My assumption was that the large list of tokenization exceptions built on this complicated ALPHA class would run faster when limited to the much smaller set of Latin characters relevant to French.
Unfortunately, this didn't seem to have much effect on the tokenization speed.
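Roughly the kind of change I mean (a simplified sketch, not the actual diff; `FRENCH_ALPHA` is just an illustrative hand-picked range, not something defined in spaCy):

```python
# Rough sketch of the experiment: compile the hyphen-splitting rule with a
# restricted, French-only character range instead of the full ALPHA class.
# FRENCH_ALPHA below is a hand-picked assumption, not part of spaCy.
import spacy
from spacy.lang.char_classes import HYPHENS
from spacy.util import compile_infix_regex

# Basic Latin plus the Latin-1 accented letters and œ that actually occur in French
FRENCH_ALPHA = "A-Za-zÀ-ÖØ-öø-ÿŒœ"

hyphen_infix = r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=FRENCH_ALPHA, h=HYPHENS)
infix_re = compile_infix_regex([hyphen_infix])

nlp = spacy.blank("fr")
nlp.tokenizer.infix_finditer = infix_re.finditer  # replaces the default infixes
```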
When I remove the 16K-entry exceptions list entirely, some unit tests fail of course, but the UD evaluation accuracy on fr_gsd-ud-train.txt remains almost the same:
However, WPS rises to 73-74K!
Of course, the UD corpora don't test every case, and some of the exceptions in the original list could be genuinely useful in certain domains. But is it worth slowing down the tokenizer this much?
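For anyone who wants to reproduce this, something along these lines (a simplified sketch, not my exact benchmark script; it assumes a recent spaCy where `Tokenizer.rules` is writable and feeds the UD training text in as raw lines):

```python
# Rough benchmark sketch: drop the exception rules from a blank French
# pipeline and time raw tokenization on the UD text file mentioned above.
import time
import spacy

nlp = spacy.blank("fr")
nlp.tokenizer.rules = {}  # remove the ~16K French tokenizer exceptions

with open("fr_gsd-ud-train.txt", encoding="utf8") as f:
    lines = f.read().splitlines()

start = time.time()
n_words = sum(len(doc) for doc in nlp.tokenizer.pipe(lines))
print(f"{n_words / (time.time() - start):.0f} WPS")
```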
If we do remove the exceptions, one potential direction is to also remove the hyphen as an infix for French. This makes a lot of the current French exceptions redundant (most of the list, plus some additional regular expressions), but will of course result in some undersegmentation instead of oversegmentation.
This brings WPS to around 90K.
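Concretely, that variant would look something like the "modifying existing rule sets" recipe from the spaCy usage docs, with the hyphen rule left out (sketch only; the real change would presumably live in the French punctuation rules rather than in user code):

```python
# Rebuild the infix patterns without the rule that splits on hyphens
# between letters, then plug them into a blank French tokenizer.
import spacy
from spacy.lang.char_classes import (
    ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS,
)
from spacy.util import compile_infix_regex

infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # the default hyphen rule is deliberately left out here:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

nlp = spacy.blank("fr")
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([t.text for t in nlp("Sophie-Anne arrive à Aix-en-Provence.")])
# Sophie-Anne and Aix-en-Provence stay single tokens
```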
I know the general trend is to move toward oversegmentation (because the models can learn to merge), but I'd be interested in hearing your opinion (@ines @honnibal) on how to move forward for French! I can prepare the PR accordingly.
We actually use a custom French tokenizer (we just drop the exceptions) for all our in-house models because of this.
Makes sense @aborsu. So do you then remove the hyphen as an infix, too? Or do you have another solution for names like Sophie-Anne? Or is that not an issue in the data you're working with?
@svlandeg we keep it as an infix, but in our context it doesn't really matter. We mostly train models for entity detection, so the entity detector recombines the tokens into a single entity; whether it's one token or several makes no difference to us.
Of course this is specific to our use case; I wouldn't presume to make it standard behavior.
@svlandeg Very interesting analysis!
I would lean towards removing the exceptions and keeping the hyphen as an infix. Then we can hope that the model learns to rejoin the names etc. in the parser. I'm happy to give this a try if you want to make the PR?
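(For what it's worth, one way this "rejoin in the parser" idea could be wired up downstream, as a hedged sketch only: if a parser is trained to label the split pieces with the `subtok` dependency, spaCy's built-in `merge_subtokens` component can glue them back together. `fr_model_with_subtok` is a hypothetical pipeline name, and this uses the v3-style `add_pipe` API.)

```python
# Hypothetical: a French pipeline whose parser predicts "subtok" deps for
# oversegmented words; merge_subtokens then rejoins those spans.
import spacy

nlp = spacy.load("fr_model_with_subtok")  # hypothetical pipeline name
nlp.add_pipe("merge_subtokens", after="parser")

doc = nlp("Sophie-Anne arrive demain.")
print([t.text for t in doc])  # ideally ["Sophie-Anne", "arrive", "demain", "."]
```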
Ok, will prepare the PR!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.