Hello!
I've been using spaCy for a few months now, mainly with the downloadable pretrained language models. Until now I was using the default English model, but now I want to use a French model (the default one is really not accurate, so I currently use fr_core_news_md). My problem with the French models is that it takes me almost half a minute just to load them (spacy.load('fr')), while the English model loads in about a second.
I guess everybody can imagine how painful it is to wait half a minute every time I run a test.
Does anyone know why these models take so long to load compared to the English ones?
Does anyone know any tricks to avoid reloading the models during development?
Thanks for your help!
I think a big problem with French at the moment is actually the language data, especially the very complex tokenizer exceptions that are compiled and loaded (also see _tokenizer_exceptions_list.py). We've actually noticed this in our tests as well, and we'd love to simplify those exceptions in favour of performance.
One option could be to cut down the exception list to only the most common words and compile all variations (with different unicode apostrophes etc.) once, instead of doing it at runtime (see here). Another option would be to try to express more of this logic in the punctuation rules, and/or at least remove the words from the exceptions list that are already covered by the prefix, suffix and infix rules.
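To illustrate the first option, the variant expansion could happen once in a build step, with the runtime just reading the result from disk. A minimal sketch, where the function and file names are made up for illustration:

```python
# build step, run once: expand apostrophe variants ahead of time
# (compile_exceptions and the output file name are hypothetical)
import json

ELISION = ["'", "’"]  # straight and curly apostrophes

def compile_exceptions(words, out_path="fr_exceptions_compiled.json"):
    compiled = set()
    for word in words:
        for elision_char in ELISION:
            compiled.add(word.replace("'", elision_char))
    with open(out_path, "w", encoding="utf8") as f:
        json.dump(sorted(compiled), f, ensure_ascii=False)

# at import time, just read the precompiled list instead of generating it
def load_exceptions(path="fr_exceptions_compiled.json"):
    with open(path, encoding="utf8") as f:
        return json.load(f)
```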
I'll tag this issue with help wanted in case someone from the French spaCy community has some ideas and would like to contribute. We always appreciate help with this kind of stuff!
Does anyone know any tricks to avoid reloading the models during development?
In general, you definitely want to make sure to only load a model once per process and keep it in memory. For example, if you're running tests with pytest, provide the nlp object via a fixture that's session-scoped. But of course, this still means that you have to load it once.
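For example, a session-scoped fixture in conftest.py might look like this (using fr_core_news_md from the question above):

```python
# conftest.py – load the model once for the whole test session
import pytest
import spacy

@pytest.fixture(scope="session")
def nlp():
    return spacy.load("fr_core_news_md")

# any test can then take `nlp` as an argument without triggering a reload:
# def test_tokenization(nlp):
#     doc = nlp("C'est un test.")
#     assert len(doc) > 0
```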
Everything else really depends on what you need to do. If your application only needs to parse some text and then compute something else based on that, you could run a simple development REST API and query against that. Basically, a microservice that you keep running and which provides all models you need and loads them once. Instead of calling into spaCy directly, you then make a request to that microservice and receive the data you need.
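A bare-bones sketch of such a service with Flask (the endpoint, port and response shape here are just illustrative):

```python
# dev_server.py – minimal model-serving sketch (endpoint and port are arbitrary)
from flask import Flask, jsonify, request
import spacy

app = Flask(__name__)
# load every model exactly once, at startup
MODELS = {name: spacy.load(name) for name in ["fr_core_news_md"]}

@app.route("/parse", methods=["POST"])
def parse():
    data = request.get_json()
    nlp = MODELS[data.get("model", "fr_core_news_md")]
    doc = nlp(data["text"])
    return jsonify({"tokens": [{"text": t.text, "pos": t.pos_, "dep": t.dep_}
                               for t in doc]})

if __name__ == "__main__":
    app.run(port=8080)
```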
Hi @ines, thanks for your answer. I shared the issue on LinkedIn to try to get some help -> https://www.linkedin.com/feed/update/urn:li:activity:6435852651438567425
Thanks for your answer!
I kind of understand the struggle with the French language models.
Concerning the loading of those models: I actually managed to load the model just once when I have multiple tests to run, but I still have to load it at the beginning (which means I have to load it even when I only need to run one test).
(I'll try that microservice approach and then give some feedback here depending on how it goes.)
@FlorianBruniaux Thanks a lot, this is really cool! Btw, #2668 is another relevant French enhancement proposal. We'd love to transition more languages over to a fully rule-based lemmatizer, and French, Spanish and German are currently the top priority 😃
@Kodoyosa
(I'll try that microservice approach and then give some feedback here depending on how it goes.)
Great, let me know how you go! Here's some inspiration: the idea would be that during development, you wouldn't load the model and call doc.ents, but instead query your API, which serves you the model and returns the entities in whichever format you need.
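On the client side, that boils down to something like this (the /ents endpoint, URL and response shape are hypothetical):

```python
# client-side sketch – query the running service instead of loading the model
# (the /ents endpoint and response shape are made up for illustration)
import requests

def get_entities(text, url="http://localhost:8080/ents"):
    resp = requests.post(url, json={"text": text})
    resp.raise_for_status()
    return resp.json()["ents"]

print(get_entities("Emmanuel Macron est né à Amiens."))
```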
Here is what I tried and how I finally managed to avoid the loading time of the models.
As you said, @ines, a microservice that keeps running was the solution. So I started making a simple API server (with Flask) that loads the models we require. But the problem is that our application using spaCy was too big to rewrite and adapt every method that uses spaCy features.
Actually, all I needed was the Doc object we used to obtain with doc = nlp('some text').
1st attempt (would have saved some time):
Serialize the Doc object to bytes so it can be sent through JSON, then convert it back to an object in our application. It didn't go very well: it required the Vocab() along with the Language, which was a bit heavy, and when I finally tried to convert it back (from_bytes()), all I got was errors (something wasn't valid; I didn't take notes of those errors, sorry). I think there was an issue with to_bytes() and/or from_bytes() somewhere (I'll try to give more info about that when I have some time for it).
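For reference, the round trip being described looks roughly like this, and the Vocab dependency is exactly where it gets heavy (a sketch, not the exact code that failed):

```python
# sketch of the bytes-over-JSON round trip (not the exact failing code)
import base64
import spacy
from spacy.tokens import Doc

nlp = spacy.load("fr_core_news_md")
doc = nlp("Bonjour tout le monde")

# sender: Doc -> bytes -> base64 string, safe to embed in JSON
payload = base64.b64encode(doc.to_bytes()).decode("ascii")

# receiver: needs a compatible Vocab to rebuild the Doc – this is the
# heavy dependency mentioned above, and a likely source of the errors
restored = Doc(nlp.vocab).from_bytes(base64.b64decode(payload))
```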
So I went for the 2nd attempt (which is the final one):
Send the Doc object (+ Tokens) through JSON, then rebuild Doc, Token and Span from that info, as close as possible to how those classes work in spaCy. Since most attributes, properties and methods are the same as before, we can still work with the spaCy documentation (for Doc, Token and Span) as we did until now in our application.
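A stripped-down sketch of that second approach (the attribute selection and class names are illustrative, not the team's actual code):

```python
# server side: flatten the Doc into plain JSON-serializable data
def doc_to_json(doc):
    return {
        "text": doc.text,
        "tokens": [{"text": t.text, "lemma": t.lemma_, "pos": t.pos_,
                    "dep": t.dep_, "i": t.i} for t in doc],
        "ents": [{"text": e.text, "label": e.label_,
                  "start": e.start, "end": e.end} for e in doc.ents],
    }

# client side: a lightweight stand-in mimicking spaCy's Token attributes,
# so existing code written against the spaCy API keeps working
class FakeToken:
    def __init__(self, data):
        self.text = data["text"]
        self.lemma_ = data["lemma"]
        self.pos_ = data["pos"]
        self.dep_ = data["dep"]
        self.i = data["i"]
```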
And that's it! Everything works.
The Q°emotion team thanks you for your help!
I think a big problem with French at the moment is actually the language data, especially the very complex tokenizer exceptions that are compiled and loaded (also see _tokenizer_exceptions_list.py). We've actually noticed this in our tests as well, and we'd love to simplify those exceptions in favour of performance.
One option could be to cut down the exception list to only the most common words and compile all variations (with different unicode apostrophes etc.) once, instead of doing it at runtime (see here). Another option would be to try to express more of this logic in the punctuation rules, and/or at least remove the words from the exceptions list that are already covered by the prefix, suffix and infix rules.
Hi @ines,
I've tried to profile a little bit where most time gets lost. On my machine, en_core_web_sm takes 0.7s to load on average, and fr_core_news_md 35s.
When I cut down the FR_BASE_EXCEPTIONS in _tokenizer_exceptions_list.py to an empty list, loading gets noticeably faster, but still nowhere near the English model.
The vectors.key2row has length 579,447 and the vocab 1,292,817 entries (or 1,192,663 when cutting down the FR_BASE_EXCEPTIONS), which seems to cause the biggest issues. This is linked to the directory spacy/spacy/data/fr_core_news_md/fr_core_news_md-2.0.0/vocab, right? The corresponding directory for en_core_web_sm is much smaller. I'm trying to understand how these files were compiled in the first place, and how the French parsing strategy could be adapted to reduce the loading time...
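For anyone who wants to reproduce this kind of measurement, something like the following will do (numbers will of course vary per machine):

```python
# quick-and-dirty load-time comparison
import time
import spacy

for name in ("en_core_web_sm", "fr_core_news_md"):
    start = time.time()
    nlp = spacy.load(name)
    print(f"{name}: {time.time() - start:.1f}s, "
          f"vocab size: {len(nlp.vocab)}, vector keys: {nlp.vocab.vectors.n_keys}")
```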
@svlandeg Thanks for the detailed analysis!
The corresponding directory for en_core_web_sm is much smaller.
The models ending in _sm do not ship with any word vectors, which is why they're so small (and good base models to add your own vectors to). _md and _lg models come with vectors, which makes them larger and increases the loading time – it'd be nice to make this faster, but we do have to accept some of the loading time here.
When comparing the English and French models, it'd make more sense to actually look at both _sm models: en_core_web_sm vs. fr_core_news_sm. So assuming the French init and tokenizer always take the same time to set up, this would be 7s + 4s (11s, way too long) vs. 2s total without the tokenizer exceptions.
You're right, @ines. There's actually another 2 seconds lost between the start and end of defining the TOKENIZER_EXCEPTIONS. That brings fr_core_news_sm to about 13s, compared to 1s for en_core_web_sm on my system...
@svlandeg Nice, thanks – so I think we've definitely confirmed the initial suspicion then!
This part here is probably quite problematic, just from looking at it:
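(The snippet itself didn't survive into this transcript; what follows is a rough reconstruction based on the description below and the French tokenizer exceptions code of that era, with approximate names.)

```python
# rough reconstruction – names are approximate, not the exact spaCy source
ELISION = ["'", "’"]   # straight and curly apostrophes
HYPHENS = ["-", "‐"]   # hyphen-minus and unicode hyphen
FR_BASE_EXCEPTIONS = ["grand'chose", "aujourd'hui"]  # stand-in for the real 34K-line list

infixes_exc = []
for elision_char in ELISION:
    for hyphen_char in HYPHENS:
        # two replace calls per word, for every character combination –
        # words without a hyphen produce the same string twice
        infixes_exc += [infix.replace("'", elision_char).replace("-", hyphen_char)
                        for infix in FR_BASE_EXCEPTIONS]
# another list comprehension over everything generated so far
infixes_exc += [word[0].upper() + word[1:] for word in infixes_exc]
# finally: convert to a set to deduplicate, then back to a list
infixes_exc = list(set(infixes_exc))
```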
Two nested loops over two lists of two items each, two replace calls per loop, another list comprehension over the same list, appended to previous list, plus a conversion to a set, then to a list. This plus compiling all the regular expressions is probably where most of the time is spent.
Hi @ines: I'm looking into it – the snippet you cited, but also the overlap between the words in the exception list and the regular expressions. Hope to have a PR for you next week ;-)
I rewrote the nested for loops, which were producing a lot of unnecessary redundant variations (270K variants brought back to 100K after the set() operation). Cutting all those out even made the set() operation redundant in this version of the code, cf. https://github.com/explosion/spaCy/pull/3023
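The gist of the rewrite, sketched (this is the idea, not the actual PR code): only branch on characters a word actually contains, so duplicates are never generated in the first place.

```python
ELISION = ["'", "’"]
HYPHENS = ["-", "‐"]

def variants(word):
    # only substitute characters that actually occur in the word,
    # so every generated variant is distinct by construction
    results = [word]
    if "'" in word:
        results = [w.replace("'", e) for w in results for e in ELISION]
    if "-" in word:
        results = [w.replace("-", h) for w in results for h in HYPHENS]
    # capitalised forms, skipping words that are already capitalised
    results += [w[0].upper() + w[1:] for w in results if w[:1].islower()]
    return results
```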
I also added more regexps, ran them over the exception lists, and removed every entry they matched. This reduced the 34K lines in the exceptions list to 16K, and further reduced the infixes_exc from 100K to 50K.
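That pruning step can be sketched like this (prune_exceptions is a made-up name; the idea is simply to drop entries the punctuation rules already handle):

```python
import re

def prune_exceptions(exceptions, patterns):
    """Drop every exception that one of the regexes matches, since the
    prefix/suffix/infix rules will already split those tokens correctly."""
    compiled = [re.compile(p) for p in patterns]
    return [word for word in exceptions
            if not any(rx.search(word) for rx in compiled)]
```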
This shaved off 3-4s, from 12-13s down to 8-9s. Still not ideal, as the corresponding English model only takes 1 second...
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.