Related issues: #1092, #1341, #1204
The vector support of the en_core_web_sm model in v2.0 is still being finalised. However, the stable version will definitely include some vectors, and will let you get context-sensitive token vectors from the tensorizer. This needs to be wired up properly again.
The way the included word vectors are documented in the current models documentation and new v2.0 model directory still isn't ideal. Vector details are only present in the "description" – instead, they should be added to their own "vectors" key in the meta.json. The details could be read off the model automatically after training, e.g. by spacy train. This would also mean that users training their own model would have this information added automatically.
{
    "lang": "en",
    "name": "core_web_sm",
    "version": "2.0.0",
    "pipeline": ["tagger", "parser", "ner"],
    "vectors": {
        "width": 300,
        "entries": 5000
    }
}
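As a rough illustration of how the "vectors" entry could be filled in automatically after training, here is a minimal sketch. The helper name `add_vectors_meta` is hypothetical and not part of spaCy. The sketch assumes the number of entries and the vector width are already known; with a loaded model they could probably be read from `nlp.vocab.vectors.shape`, but that wiring is an assumption and is left out so the example runs without spaCy installed.

```python
import json


def add_vectors_meta(meta, n_entries, width):
    """Return a copy of `meta` with a "vectors" key describing the vector table.

    Hypothetical helper for illustration only, not a spaCy API.
    """
    updated = dict(meta)  # shallow copy so the caller's dict is left untouched
    updated["vectors"] = {"width": int(width), "entries": int(n_entries)}
    return updated


meta = {
    "lang": "en",
    "name": "core_web_sm",
    "version": "2.0.0",
    "pipeline": ["tagger", "parser", "ner"],
}

# Assumption: in a real build step, (n_entries, width) would come from the
# trained model rather than being hard-coded as they are here.
new_meta = add_vectors_meta(meta, n_entries=5000, width=300)
print(json.dumps(new_meta, indent=4))
```

Returning a copy rather than mutating the input keeps the original meta available in case the build step needs to compare the two.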
The v2.0 model directory requests each model's meta.json and uses this info to populate the model details. This ensures that the website is always up to date with the latest release. On the front-end, all that has to be done is add a row for the vectors info, and populate it via the ModelLoader script if a "vectors" object is present in the meta. We'll also need to update our internal model build process to make sure the vectors info is added to each individual model release.
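The front-end logic described above (show a vectors row only when the meta has a "vectors" object) can be sketched as follows. This is written in Python purely for illustration; the real ModelLoader script runs on the website and is not reproduced here. The function name `vectors_row` and the label format are hypothetical.

```python
def vectors_row(meta):
    """Return a (label, value) pair for the model details table, or None.

    Hypothetical sketch: a row is only produced if `meta` contains a
    "vectors" object with both "width" and "entries".
    """
    vectors = meta.get("vectors")
    if not isinstance(vectors, dict):
        return None  # older meta.json without vectors info: skip the row
    width = vectors.get("width")
    entries = vectors.get("entries")
    if width is None or entries is None:
        return None  # incomplete info: safer to show nothing than to guess
    return ("Vectors", "{} entries, {} dimensions".format(entries, width))


with_vectors = {"name": "core_web_sm", "vectors": {"width": 300, "entries": 5000}}
without_vectors = {"name": "core_web_sm"}

print(vectors_row(with_vectors))
print(vectors_row(without_vectors))
```

Treating a missing or incomplete "vectors" object as "no row" means model releases built before the meta change would still render without errors.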
While fixing this, we also need to revisit the word vectors & similarity guide to make sure it doesn't contain any misleading information about the vectors included in the models.
Fixed and documented on develop, and will be included in the next version.
Hi! All tokens still seem to return True for is_oov. Checked with spacy 2.0.5 and en_core_web_sm 2.0.0.
>>> import spacy
>>> en = spacy.load('en_core_web_sm')
>>> x = en('This is a tessssst')
>>> [w.is_oov for w in x]
[True, True, True, True]
However, en_core_web_md 2.0.0 works just fine!