Spacy: Non-english (french) models a bit long to load

Created on 16 Aug 2018 · 12Comments · Source: explosion/spaCy

Hello !

I'm using spacy since few months now. Mainly using downloadable pretrained language models. The problem is that until now I was using the default english model. But then now I want to use the french model (the default one is really not accurate so I currently use the fr_core_news_md). But my problem with french models is that it takes me almost half a minute just load those models (spacy.load('fr')) while it requires just a second for the english model.

I guess everybody can imagine how painfull it is to wait half a min everytime I run a test.

Does anyone knows why is it so long for these models to load compared to the english ones ?

Does anyone knows any tricks to not reload the models during development ?

Thanks for your help

My Environment

Operating System: OSX/10.13.6
Python Version Used: 3.6.6
spaCy Version Used: 2.0.9
Environment Information:
- spaCy version: 2.0.9
- Platform: Darwin-17.7.0-x86_64-i386-64bit
- Python version: 3.6.6
- Models: fr_core_news_md, fr, en

help wanted lang / fr perf / speed

Source

Kodoyosa

👍4

Most helpful comment

Here is what I tried and how I finally manage to avoid the loading time of the models.

As you said @ines a microservice that keeps running was the solution. So I started making a simple API server (with Flask) that loads the models we required. But the problem is that our application using spacy was quite too big to rewrite and adapt every methods using features from Spacy.
Actually all I need was the Doc object we used to obtain with doc = nlp('some text').

1st attempt (would have save some time):

tried to convert Doc object to bytes to be able to send it through JSON then convert it back to object on our application.

It didn't goes very well. I required Vocab() with the Language which was a bit heavy and when I finally tried to convert (from_bytes()) all I obtain was errors (something wasn't valid (I didn't take notes of those errors, sorry)). I think there was an issue with to_bytes() or/and from_bytes() somewhere (I'll try to give more info about that when I'll have some time for it).

So I went for the 2nd attempt (which is the final one):

'Simply' send every informations about the Doc object (+ Tokens) through JSON then rebuild Doc, Token and Span from those info as similar as possible to how those classes work for Spacy

Since most attributes, properties and methods are the same as before, we can still work with the documentation of Spacy (for Doc, Token and Span) as we did until now on our application.

And that's it ! Everything works.

The Q°emotion team thank you for your help !

Kodoyosa on 30 Aug 2018

👍2 🎉1

All 12 comments

I think a big problem with French at the moment is actually the language data, especially the very complex tokenizer exceptions that are compiled and loaded (also see _tokenizer_exceptions_list.py). We've actually noticed this in our tests as well, and we'd love to simplify those exceptions in favour of performance.

One option could be to cut down the exception list to only the most common words and compile all variations (with different unicode apostrophes etc.) once, instead of doing it at runtime (see here). Another option would be to try to express more of this logic in the punctuation rules, and/or at least remove the words from the exceptions list that are already covered by the prefix, suffix and infix rules.

I'll tag this issue with help wanted in case someone from the French spaCy community has some ideas and would like to contribute. We always appreciate help with this kind of stuff!

Does anyone knows any tricks to not reload the models during development ?

In general, you definitely want to make sure to only load a model once per process and keep it in memory. For example, if you're running tests with pytest, provide the nlp object via a fixture thats session-scoped. But of course, this still means that you have to load it once.

Everything else really depends on what you need to do. If your application only needs to parse some text and then compute something else based on that, you could run a simple development REST API and query against that. Basically, a microservice that you keep running and which provides all models you need and loads them once. Instead of calling into spaCy directly, you then make a request to that microservice and receive the data you need.

ines on 16 Aug 2018

👍2

Hi, @ines thanks for your answer. I shared the issue on Linkedin to try to have some help -> https://www.linkedin.com/feed/update/urn:li:activity:6435852651438567425

FlorianBruniaux on 16 Aug 2018

❤1

Thanks for you answer !

I kind of understand the struggle for the french language models.

Concerning the loading of those models. I actually manage to load the model just once when I have multiple tests to run but still have to load it at the begining (which means that I still have to load it even when I need just one test to run).

(I'll try to run that microservice thing then give some feedback here depending of how it goes.)

Kodoyosa on 16 Aug 2018

@FlorianBruniaux Thanks a lot, this is really cool! Btw, #2668 is another relevant French enhancement proposal. We'd love to transition more languages over to a fully rule-based lemmatizer, and French, Spanish and German are currently the top priority 😃

@Kodoyosa

(I'll try to run that microservice thing then give some feedback here depending of how it goes.)

Great, let me know how you go! Here's some inspiration:

Simple REST microservices for spaCy: https://github.com/explosion/spacy-services/
Experimental GraphQL API I built the other day: https://github.com/ines/spacy-graphql

So the idea would be that during development, you wouldn't load the model and call doc.ents, but instead query your API, which serves you the model and returns the entities in whichever format you need.

ines on 16 Aug 2018

👍1

Here is what I tried and how I finally manage to avoid the loading time of the models.

1st attempt (would have save some time):

tried to convert Doc object to bytes to be able to send it through JSON then convert it back to object on our application.

So I went for the 2nd attempt (which is the final one):

'Simply' send every informations about the Doc object (+ Tokens) through JSON then rebuild Doc, Token and Span from those info as similar as possible to how those classes work for Spacy

Since most attributes, properties and methods are the same as before, we can still work with the documentation of Spacy (for Doc, Token and Span) as we did until now on our application.

And that's it ! Everything works.

The Q°emotion team thank you for your help !

Kodoyosa on 30 Aug 2018

👍2 🎉1

I think a big problem with French at the moment is actually the language data, especially the very complex tokenizer exceptions that are compiled and loaded (also see _tokenizer_exceptions_list.py). We've actually noticed this in our tests as well, and we'd love to simplify those exceptions in favour of performance.

One option could be to cut down the exception list to only the most common words and compile all variations (with different unicode apostrophes etc.) once, instead of doing it at runtime (see here). Another option would be to try to express more of this logic in the punctuation rules, and/or at least remove the words from the exceptions list that are already covered by the prefix, suffix and infix rules.

Hi @ines ,

I've tried to profile a little bit where most time gets lost. On my machine, en_core_web_sm takes 0.7s to load on average, and fr_core_news_md 35s.

Language.init() ca 7s
util.py, from_disk()
- 12s after starting to load vocab
  - 4s additionally to load key2row
  - 6s additionally in _ml.py, link_vectors_to_models() going through each word in vocab
- 4s loading tokenizer

When I cut down the FR_BASE_EXCEPTIONS in _tokenizer_exceptions_list.py to an empty list:

Language.init() goes to virtually 0s
loading tokenizer goes to 2s
the rest remains the same, still acounting for ca. 26s in total.

The vectors.key2row has length 579.447 and the vocab 1.292.817 (or 1.192.663 when cutting down the FR_BASE_EXCEPTIONS), which seems to cause the biggest issues. This is linked to the directory spacy/spacy/data/fr_core_news_md/fr_core_news_md-2.0.0/vocab, right? The corresponding directory for en_core_web_sm is much smaller. I'm trying to understand how these files were compiled in the first place, and how the French parsing strategy could be adapted to reduce the loading time...

svlandeg on 19 Nov 2018

@svlandeg Thanks for the detailed analysis!

The corresponding directory for en_core_web_sm is much smalle

The models ending on _sm do not ship with any word vectors, which is why they're so small (and good base models to add your own vectors to). _md and _lg models come with vectors, whichg makes them larger and increases the loading time – it'd be nice to make this faster, but we do have to accept some of the loading time here.

When comparing the English and French models, it'd make more sense to actually look at both _sm models: en_core_web_sm vs. fr_core_news_sm. So assuming the French init and tokenizer always take the same time to set up, this would be 7s + 4s (11s, way too long) vs. 2s total without the tokenizer exceptions.

ines on 26 Nov 2018

You're right @ines. There's actually another 2 seconds lost between the start & end of defining the TOKENIZER_EXCEPTIONS. That brings fr_core_news_sm to about 13s, compared to 1s for en_core_web_smon my system...

svlandeg on 26 Nov 2018

@svlandeg Nice, thanks – so I think we've definitely confirmed the initial suspicion then!

This part here is probably quite problematic, just from looking at it:

https://github.com/explosion/spaCy/blob/58757c5684426d8ea1453a9a041915ddbad60359/spacy/lang/fr/tokenizer_exceptions.py#L97-L103

Two nested loops over two lists of two items each, two replace calls per loop, another list comprehension over the same list, appended to previous list, plus a conversion to a set, then to a list. This plus compiling all the regular expressions is probably where most of the time is spent.

ines on 26 Nov 2018

Hi @ines : I'm looking into it - the snippet you cited but also the overlap between words in the exception list and the regular expressions. Hope to have a PR for you next week ;-)

svlandeg on 29 Nov 2018

❤1 👍1

I rewrote the nested for loops which were producing a lot of unnecessary redundant variations (270K variants brought back to 100K after the set() operation). Cutting all those even made the set() operation redundant in this version of the code, cf https://github.com/explosion/spaCy/pull/3023

I added more regexps and ran them on the exception lists, and removed every entry that hit. This reduced the 34K lines in the exceptions list to 16K and further reduced the infixes_exc from 100K to 50K.

This shaved off 3-4s, from 12-13s to 8-9s. Still not ideal as the corresponding English model only takes 1 second...

svlandeg on 7 Dec 2018

❤1

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.