Are there plans for medium (and large) sized models for the German language?
I mean models with GloVe vectors like the English ones, i.e. German equivalents of en_core_web_md and en_core_web_lg.
If so, what could I do to help?
If you just need access to word vectors, you can find some good data packs here: https://fasttext.cc/docs/en/crawl-vectors.html
They can be easily converted for use in spaCy using a command like:
python -m spacy init-model /tmp/de_vectors_web_lg -v /path/to/vectors.zip
This will create a model directory for you to load, /tmp/de_vectors_web_lg. You can use the vectors with the sm model by passing the nlp.vocab object as an argument to spacy.load():
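import spacy

# Load the small German model, which was not trained with pre-trained vectors
nlp = spacy.load('de')
# Load the converted vectors into the same vocab
spacy.load('/tmp/de_vectors_web_lg', vocab=nlp.vocab)
# Save the sm pipeline together with the new vectors
nlp.to_disk('/tmp/de_model_with_vectors')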
Because the sm models aren't trained with vectors, you're free to load your own like this. You can't do that with md or lg models --- those have to be used with the vectors the model was trained with. That's a big disadvantage --- so we're likely to move away from providing models based on pre-trained vectors, and prefer instead to have the vectors in separate packages. We might still have some models that use pre-trained vectors, depending on accuracy advantages --- but there's a big advantage to letting you choose the vectors at runtime.
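If you want to sanity-check a vectors file before converting it, something like the following might help. It is only a rough sketch: it assumes the plain-text word2vec-style layout (a header line with the word count and dimension, then one word per line followed by its values), the helper names `read_vec` and `cosine` are made up for illustration, and the three toy vectors are invented so the example runs without downloading anything.

```python
import io
import math


def read_vec(stream, limit=None):
    """Parse text-format word vectors from an open text stream.

    Assumed layout: first line is "<n_words> <dim>", every following
    line is a word and then <dim> space-separated floats.
    """
    header = stream.readline().split()
    dim = int(header[1])
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return dim, vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy data, invented for illustration only (not real fastText values).
TOY_VEC = "3 2\nHund 1.0 0.0\nKatze 0.8 0.6\nAuto 0.0 1.0\n"

dim, vectors = read_vec(io.StringIO(TOY_VEC))
print(dim, sorted(vectors))
print(cosine(vectors["Hund"], vectors["Katze"]))
```

For a real file you would pass an open text-mode handle instead of the `StringIO`, for example `gzip.open(path, "rt", encoding="utf-8")` if the download is gzip-compressed, and use `limit` to read only the first few thousand rows.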
spacy-nightly now has a de_core_news_md model with word vectors.