spaCy: Creating my own model

Created on 3 Jul 2017  ·  5 Comments  ·  Source: explosion/spaCy

Hey guys,

I'm trying to create a new model and I'm having some trouble.

I'm basically trying to follow this link, and I'm using this repo:

python -m spacy model pt myModel portuguese/word_freq '' portuguese/word_vector.bz2

However, I'm having trouble generating vector.bin:

File "/Users/urbano/Documents/fontes/spacy_18/spaCy/.env/lib/python2.7/site-packages/spacy-1.8.2-py2.7-macosx-10.11-x86_64.egg/spacy/cli/model.py", line 47, in create_model
write_binary_vectors(vectors_path.as_posix(), vectors_dest.as_posix())
File "spacy/vocab.pyx", line 672, in spacy.vocab.write_binary_vectors (spacy/vocab.cpp:14452)
File "spacy/vocab.pyx", line 681, in spacy.vocab.write_binary_vectors (spacy/vocab.cpp:14319)
ValueError: could not convert string to float: train_countqKU

Here are my files: a mini sample of the data, the word frequency file that I generated, and a word vector file that I created using this code.

Files used to reproduce this error
sample.zip



All 5 comments

Trying to create a model here too.

I think this is a quirk of the Gensim format: it adds a header line to the word vectors file, and that line has to be removed. You could do this in a text editor, or I think someone has a script?
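For anyone who wants to check their file first, here is a minimal sketch. It assumes the vectors were written in the word2vec text format, where the first line holds just two integers (vocabulary size and vector dimensionality). The function names are made up for illustration and are not part of spaCy or gensim.

```python
import io


def looks_like_word2vec_header(first_line):
    """Return True if the line is exactly two unsigned integers, e.g. '3 4'."""
    parts = first_line.split()
    return len(parts) == 2 and all(p.isdigit() for p in parts)


def has_header(path):
    """Read only the first line, so this stays cheap on very large files."""
    with io.open(path, "r", encoding="utf8") as f:
        return looks_like_word2vec_header(f.readline())
```

If has_header returns False, the first line is already a word followed by its vector values (or the file is in some other format), so there is no header of this kind to strip.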

This command will remove the header that gensim adds to these files:

tail -n +2 gensim_vector_file.txt > gensim_vector_file.new && mv -f gensim_vector_file.new gensim_vector_file.txt

I would advise against using a text editor for this: when the files are large, editing them can take considerable time and may even hang the system.
If you are trying to generate vectors.bin from gensim_vector_file.txt.bz2, apply these steps to the header-free gensim_vector_file.txt from above:

  • In the shell: bzip2 gensim_vector_file.txt
  • In Python: spacy.vocab.write_binary_vectors('gensim_vector_file.txt.bz2', 'vectors.bin')

That should do it.
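If you would rather stay in Python (for example on a system without tail and bzip2), the header removal and the compression can be combined in one streaming pass. This is only a sketch using the standard library; strip_header_to_bz2 is a hypothetical helper name, and it assumes the first line really is the header.

```python
import bz2
import shutil


def strip_header_to_bz2(src_path, dest_path):
    """Copy src_path to a bz2 file at dest_path, skipping the first line."""
    with open(src_path, "rb") as src, bz2.BZ2File(dest_path, "wb") as dest:
        src.readline()  # discard the header line
        shutil.copyfileobj(src, dest)  # stream the rest in chunks


# Example call, with the file names used in this thread:
# strip_header_to_bz2("gensim_vector_file.txt", "gensim_vector_file.txt.bz2")
```

Unlike the shell commands, this leaves the original text file untouched and never loads the whole file into memory.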

Update: spaCy v2.0 comes with a lot of improvements around storing, managing and customising word vectors, including a new Vectors class. See this page for more details: https://spacy.io/usage/vectors-similarity#custom

There's also a new vocab command to help you compile a vocabulary from a JSON-formatted lexicon file.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
