spaCy: Model differences

Created on 23 Apr 2017 · 8 Comments · Source: explosion/spaCy

For English alone, there are three available models.

However, there is not much detail on how they differ.

| Model | Size | Includes |
| --- | --- | --- |
| en_core_web_sm | 50 MB | Vocab, syntax, entities, word vectors |
| en_core_web_md | 1 GB | Vocab, syntax, entities, word vectors |
| en_depent_web_md | 328 MB | Vocab, syntax, entities |

Can you provide some descriptions of their accuracy, the entity types they recognize, and their intended use cases? That would be helpful.

docs models

All 8 comments

Thanks for opening this issue – since this question has come up before, I agree that this should definitely be more clear in the docs. I'll just post all notes here so we can discuss them and add them to the docs.

Differences and accuracy

Most differences are obviously statistical. In general, we do expect larger models to be "better" and more accurate overall. Ultimately, it depends on your use case and requirements. People have reported pretty good results with the smaller model, so we usually recommend trying that first, writing a few tests specific to your use case and then comparing the results to a larger model, if necessary.
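
The "try the small model first, then compare" advice above can be sketched as a small harness. This is only an illustration under stated assumptions: the `Extractor` type and the `score` and `compare` helpers are hypothetical names, not part of spaCy, and the score is plain entity recall on your own examples, not an official accuracy metric. The only spaCy calls assumed are calling a loaded pipeline on a string and reading `doc.ents` with `ent.text` and `ent.label_`.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# An entity is a (text, label) pair, e.g. ("London", "GPE").
Entity = Tuple[str, str]
# Hypothetical adapter type: any callable mapping raw text to a set of entities.
Extractor = Callable[[str], Set[Entity]]


def spacy_extractor(nlp) -> Extractor:
    """Wrap a loaded spaCy pipeline so it returns (text, label) pairs."""
    def extract(text: str) -> Set[Entity]:
        doc = nlp(text)
        return {(ent.text, ent.label_) for ent in doc.ents}
    return extract


def score(extract: Extractor, cases: Iterable[Tuple[str, Set[Entity]]]) -> float:
    """Fraction of expected entities that the extractor finds (recall)."""
    expected_total = 0
    found_total = 0
    for text, expected in cases:
        predicted = extract(text)
        expected_total += len(expected)
        found_total += len(expected & predicted)
    return found_total / expected_total if expected_total else 0.0


def compare(extractors: Dict[str, Extractor], cases) -> Dict[str, float]:
    """Score every named extractor on the same test cases."""
    cases = list(cases)  # allow generators to be reused across extractors
    return {name: score(fn, cases) for name, fn in extractors.items()}


# Usage with real pipelines (assumes spaCy and both model packages are installed):
#   import spacy
#   extractors = {
#       "en_core_web_sm": spacy_extractor(spacy.load("en_core_web_sm")),
#       "en_core_web_md": spacy_extractor(spacy.load("en_core_web_md")),
#   }
#   print(compare(extractors, my_cases))
```

Because the harness only depends on callables, the same test cases can be reused for any model you want to compare, including ones you train yourself.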

We're also going to compile a better list of accuracy numbers and distribute them with each model, for example in its meta.json.

| Model | Parser accuracy | POS tagging accuracy | NER accuracy |
| --- | :---: | :---: | :---: |
| en_core_web_sm | ~89% | coming | coming |
| en_core_web_md, en_depent_web_md | ~90.6% | coming | coming |
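
If accuracy figures do end up in each model's meta.json, reading them could look roughly like the sketch below. The format had not been decided at this point in the thread, so the `"accuracy"` key and the `parser`, `tagger` and `ner` field names are assumptions made up for illustration; only the ~89% parser figure comes from the table above.

```python
import json
from typing import Dict, Optional

# Hypothetical meta.json content. The "accuracy" key and its field names are
# assumptions for illustration, not a format confirmed in this thread.
EXAMPLE_META = """
{
  "lang": "en",
  "name": "core_web_sm",
  "accuracy": {"parser": 89.0}
}
"""


def read_accuracy(meta_text: str) -> Dict[str, Optional[float]]:
    """Return the accuracy figures found in a meta.json string.

    Missing figures (the "coming" cells in the table) are reported as None.
    """
    meta = json.loads(meta_text)
    accuracy = meta.get("accuracy", {})
    return {key: accuracy.get(key) for key in ("parser", "tagger", "ner")}


if __name__ == "__main__":
    print(read_accuracy(EXAMPLE_META))
```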

Model releases and release notes

All models are published as GitHub releases and their release notes contain more detailed info. Going forward, we'll also add a "Changes" section to new model releases that'll list all updates since the last release, to give you a better idea of how that model is different. You can see an example of that in the pre-release of an alpha model we're currently testing.

Model naming conventions

In general, spaCy expects all model packages to follow the naming convention of [lang]_[name]. For our models, we also chose to divide the name into three components:

| Name | Description |
| --- | --- |
| type | Model capabilities (e.g. core for a general-purpose model with vocabulary, syntax, entities and word vectors, or depent for only vocab, syntax and entities) |
| genre | Type of text the model is trained on (e.g. web for web text, news for news text) |
| size | Model size indicator (sm, md or lg) |

For example, en_depent_web_md is a medium-sized English model trained on written web text (blogs, news, comments), that includes vocabulary, syntax and entities.

I hope those naming conventions aren't too confusing – but we felt like it was necessary to decide on a scheme like this upfront to make sure we don't end up with confusing or indistinguishable model names. Especially since there will be many more models in the future – either published by us, or by the community. (For example, if you were to train a Spanish NER model on dialog text, you'd call it es_ent_dialog_md and it'd be clear what it is.)
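
To make the scheme concrete, here is a minimal sketch that splits a package name into the four parts described above (language, type, genre, size). `ModelName` and `parse_model_name` are hypothetical helpers written for this illustration, not spaCy functions, and the sketch assumes that no component contains an underscore.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str   # language code, e.g. "en"
    type: str   # capabilities, e.g. "core" or "depent"
    genre: str  # training text genre, e.g. "web"
    size: str   # "sm", "md" or "lg"


KNOWN_SIZES = {"sm", "md", "lg"}


def parse_model_name(package: str) -> ModelName:
    """Split a name like 'en_depent_web_md' into its four components."""
    parts = package.split("_")
    if len(parts) != 4:
        raise ValueError(
            f"expected [lang]_[type]_[genre]_[size], got {package!r}"
        )
    lang, type_, genre, size = parts
    if size not in KNOWN_SIZES:
        raise ValueError(f"unknown size indicator {size!r} in {package!r}")
    return ModelName(lang, type_, genre, size)


if __name__ == "__main__":
    for name in ("en_depent_web_md", "es_ent_dialog_md"):
        print(name, "->", parse_model_name(name))
```

Packages that only follow the looser [lang]_[name] convention would be rejected by this sketch, since it checks for the three-part name used for the official models.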

✅ TODO

  • come up with generalised format to distribute the accuracies with the models
  • add figures to the model meta.json files, docs and releases
  • update docs with more information on model differences, how to pick the right model etc.

That's a minor difference in accuracy compared to the difference in file size.
So I have to ask: since a model should be applied in the context it was trained for, can you give a brief idea of the kind of training data you picked? You mentioned blogs, news and comments. Specifically, what type?
This would make it clear to users what to expect on their own data, and where and how much to train on new data.

Yes, that's a good idea! I think the model releases would be a good place for this info as well, and it could be combined with the accuracy numbers.

Just edited the release notes of the new French model as an example: https://github.com/explosion/spacy-models/releases/tag/fr_depvec_web_lg-1.0.0

Will start updating the other models as well.

Now those specs _are_ industrial strength. ;)

A difference between en_core_web_sm and en_core_web_md that I noticed:
the latter has more vocab words and more word vectors.
I believe that this should be in the documentation.

Yeah. The file size is a good indicator.
Did you notice this on any of your test data? That might give a clearer idea of what to expect when someone uses it.

Is there a difference between the default model 'en' and 'en_core_web_sm'? @ines

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
