The evaluation on the UD corpora has shown that French tokenization is significantly slower (in words per second) than for other languages. I'm opening this issue to experiment with a few ideas and discuss potential solutions.
Benchmarking the tokenizer shows some fluctuation across runs, but this seems to be the general performance for the French blank model:

I tried simplifying the ALPHA character class, which adds a lot of complexity because it includes Latin letters that are irrelevant for French, e.g. https://en.wikipedia.org/wiki/Latin_Extended-D.
My assumption was that the large list of tokenization exceptions built on this complicated ALPHA class would run faster when limited to the much smaller set of Latin characters relevant to French.
Unfortunately, this didn't seem to have much effect on the tokenization speed.
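Roughly the kind of change I mean (a simplified sketch, not the actual diff; `FRENCH_ALPHA` is just an illustrative hand-picked range, not something defined in spaCy):

```python
# Rough sketch of the experiment: compile the hyphen-splitting rule with a
# restricted, French-only character range instead of the full ALPHA class.
# FRENCH_ALPHA below is a hand-picked assumption, not part of spaCy.
import spacy
from spacy.lang.char_classes import HYPHENS
from spacy.util import compile_infix_regex

# Basic Latin plus the Latin-1 accented letters and œ that actually occur in French
FRENCH_ALPHA = "A-Za-zÀ-ÖØ-öø-ÿŒœ"

hyphen_infix = r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=FRENCH_ALPHA, h=HYPHENS)
infix_re = compile_infix_regex([hyphen_infix])

nlp = spacy.blank("fr")
nlp.tokenizer.infix_finditer = infix_re.finditer  # replaces the default infixes
```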
When I remove the 16K-entry exceptions list entirely, some unit tests fail of course, but the UD evaluation accuracy on fr_gsd-ud-train.txt remains almost the same:
However, WPS rises to 73-74K!
Of course, the UD corpora don't test every case, and some of the exceptions in the original list could be genuinely useful in certain domains. But is it worth slowing down the tokenizer this much?
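For anyone who wants to reproduce this, something along these lines (a simplified sketch, not my exact benchmark script; it assumes a recent spaCy where `Tokenizer.rules` is writable and feeds the UD training text in as raw lines):

```python
# Rough benchmark sketch: drop the exception rules from a blank French
# pipeline and time raw tokenization on the UD text file mentioned above.
import time
import spacy

nlp = spacy.blank("fr")
nlp.tokenizer.rules = {}  # remove the ~16K French tokenizer exceptions

with open("fr_gsd-ud-train.txt", encoding="utf8") as f:
    lines = f.read().splitlines()

start = time.time()
n_words = sum(len(doc) for doc in nlp.tokenizer.pipe(lines))
print(f"{n_words / (time.time() - start):.0f} WPS")
```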
If we do remove the exceptions, one potential direction is to also remove the hyphen as an infix for French. This makes a lot of the current French exceptions redundant (most of the list, plus some additional regular expressions), but will of course result in some undersegmentation instead of oversegmentation.
This brings WPS to around 90K.
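Concretely, that variant would look something like the "modifying existing rule sets" recipe from the spaCy usage docs, with the hyphen rule left out (sketch only; the real change would presumably live in the French punctuation rules rather than in user code):

```python
# Rebuild the infix patterns without the rule that splits on hyphens
# between letters, then plug them into a blank French tokenizer.
import spacy
from spacy.lang.char_classes import (
    ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS,
)
from spacy.util import compile_infix_regex

infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # the default hyphen rule is deliberately left out here:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

nlp = spacy.blank("fr")
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([t.text for t in nlp("Sophie-Anne arrive à Aix-en-Provence.")])
# Sophie-Anne and Aix-en-Provence stay single tokens
```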
I know the general trend is to move toward oversegmentation (because the models can learn to merge), but I'd be interested in hearing your opinion (@ines @honnibal) on how to move forward for French! I can prepare the PR accordingly.
We actually use a custom French tokenizer (we just drop the exceptions) for all our in-house models because of this.
Makes sense @aborsu. So do you then remove the hyphen as an infix, too? Or do you have another solution for names like Sophie-Anne? Or is that not an issue in the data you're working with?
@svlandeg we keep it as an infix, but in our context it doesn't really matter. We mostly train models for entity detection, so the entity detector recombines the tokens into a single entity; whether it's one token or several makes no difference to us.
Of course this is specific to our use case; I wouldn't presume to make it standard behavior.
@svlandeg Very interesting analysis!
I would lean towards removing the exceptions and keeping the hyphen as an infix. Then we can hope that the model learns to rejoin the names etc. in the parser. I'm happy to give this a try if you want to make the PR?
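(For what it's worth, one way this "rejoin in the parser" idea could be wired up downstream, as a hedged sketch only: if a parser is trained to label the split pieces with the `subtok` dependency, spaCy's built-in `merge_subtokens` component can glue them back together. `fr_model_with_subtok` is a hypothetical pipeline name, and this uses the v3-style `add_pipe` API.)

```python
# Hypothetical: a French pipeline whose parser predicts "subtok" deps for
# oversegmented words; merge_subtokens then rejoins those spans.
import spacy

nlp = spacy.load("fr_model_with_subtok")  # hypothetical pipeline name
nlp.add_pipe("merge_subtokens", after="parser")

doc = nlp("Sophie-Anne arrive demain.")
print([t.text for t in doc])  # ideally ["Sophie-Anne", "arrive", "demain", "."]
```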
Ok, will prepare the PR!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.