spaCy: 💫 Finalise vector support and add vector specs to model meta

Created on 24 Oct 2017 · 3 Comments · Source: explosion/spaCy

Related issues: #1092, #1341, #1204

Finalise vector support

The vector support of the en_core_web_sm model in v2.0 is still being finalised. However, the stable version will definitely include some vectors, and will let you get context-sensitive token vectors from the tensorizer. This needs to be wired up properly again.

Documentation of model vector specs

The way the included word vectors are documented in the current models documentation and new v2.0 model directory still isn't ideal. Vector details are only present in the "description" – instead, they should be added to their own "vectors" key in the meta.json. The details could be read off the model automatically after training, e.g. by spacy train. This would also mean that users training their own model would have this information added automatically.

Example

{
    "lang": "en",
    "name": "core_web_sm",
    "version": "2.0.0",
    "pipeline": ["tagger", "parser", "ner"],
    "vectors": {
        "width": 300,
        "entries": 5000
    }
}
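As a rough illustration of how the "vectors" entry could be filled in automatically after training, here is a minimal Python sketch. It assumes the vector table's shape is available as a `(entries, width)` tuple; the helper name `add_vectors_to_meta` and the hard-coded shape are hypothetical and not part of spaCy.

```python
import json


def add_vectors_to_meta(meta, vectors_shape):
    """Return a copy of `meta` with a "vectors" entry.

    `vectors_shape` is assumed to be (number of entries, vector width).
    This helper is hypothetical and only illustrates the proposal.
    """
    n_entries, width = vectors_shape
    updated = dict(meta)  # shallow copy, so the original meta is left untouched
    updated["vectors"] = {"width": int(width), "entries": int(n_entries)}
    return updated


meta = {
    "lang": "en",
    "name": "core_web_sm",
    "version": "2.0.0",
    "pipeline": ["tagger", "parser", "ner"],
}

# Assumed shape for illustration only: 5000 entries of width 300,
# matching the numbers in the example meta.json.
new_meta = add_vectors_to_meta(meta, (5000, 300))
print(json.dumps(new_meta, indent=4))
```

In a real training run the shape would presumably be read from the trained model's vector table rather than passed in by hand.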

The v2.0 model directory requests each model's meta.json and uses this info to populate the model details. This ensures that the website is always up to date with the latest release. On the front-end, all that has to be done is add a row for the vectors info, and populate it via the ModelLoader script if a "vectors" object is present in the meta. We'll also need to update our internal model build process to make sure the vectors info is added to each individual model release.
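The "only if present" logic for the extra row might look roughly like the sketch below. The real ModelLoader is a front-end script; this is written in Python purely to illustrate the check, and the function name `vectors_row` and the row label and wording are assumptions.

```python
def vectors_row(meta):
    """Build a (label, text) row for the model details, or None.

    Returns None when the meta has no "vectors" object, so models
    without vector specs simply get no extra row. Hypothetical helper.
    """
    vectors = meta.get("vectors")
    if not vectors:
        return None
    text = "{entries} entries, {width} dimensions".format(
        entries=vectors["entries"], width=vectors["width"]
    )
    return ("Vectors", text)


with_vectors = {"name": "core_web_sm", "vectors": {"width": 300, "entries": 5000}}
without_vectors = {"name": "core_web_sm"}

print(vectors_row(with_vectors))
print(vectors_row(without_vectors))
```

Returning `None` for a missing key keeps older model releases, whose meta.json has no "vectors" object, rendering as they do today.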

Other documentation

While fixing this, we also need to revisit the word vectors & similarity guide to make sure it doesn't contain any misleading information about the vectors included in the models.

Labels: docs, enhancement, models, 🌙 nightly

All 3 comments

Fixed and documented on develop, and will be included in the next version.

Hi! All tokens still seem to return True for is_oov. Checked with spacy 2.0.5 and en_core_web_sm 2.0.0.

>>> import spacy
>>> en = spacy.load('en_core_web_sm')
>>> x = en('This is a tessssst')
>>> [w.is_oov for w in x]
[True, True, True, True]

However, en_core_web_md 2.0.0 works just fine!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

