Hey guys,
I'm trying to create a new model and I'm having some trouble.
I'm basically trying to follow this link and I'm using this repo:
python -m spacy model pt myModel portuguese/word_freq '' portuguese/word_vector.bz2
However, I'm having trouble generating vector.bin:
File "/Users/urbano/Documents/fontes/spacy_18/spaCy/.env/lib/python2.7/site-packages/spacy-1.8.2-py2.7-macosx-10.11-x86_64.egg/spacy/cli/model.py", line 47, in create_model
write_binary_vectors(vectors_path.as_posix(), vectors_dest.as_posix())
File "spacy/vocab.pyx", line 672, in spacy.vocab.write_binary_vectors (spacy/vocab.cpp:14452)
File "spacy/vocab.pyx", line 681, in spacy.vocab.write_binary_vectors (spacy/vocab.cpp:14319)
ValueError: could not convert string to float: train_countqKU
Here are my files: a mini sample dataset, the word frequency file that I generated, and a word vector file that I created using this code.
Files used to reproduce this error
sample.zip
Trying to create a model here too.
I think this is a quirk in the Gensim format --- it adds a header to the word vectors file that has to be removed. You could do this in a text editor, or I think someone has a script?
tail -n +2 gensim_vector_file.txt > gensim_vector_file.new && mv -f gensim_vector_file.new gensim_vector_file.txt will remove the header added to these files by gensim.
I would advise against using text editors to edit these files because when files are large they can take considerable time and may even hang the system.
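For anyone who prefers a script over tail, here is a minimal Python 3 sketch of the same header removal. It assumes the vectors were exported in the plain-text word2vec format, where the first line holds two integers (vocabulary size and vector dimension). The function name strip_word2vec_header and the helper _open are made up for this example and are not part of spaCy or gensim.

```python
import bz2
import shutil


def _open(path, mode):
    """Open a plain or .bz2 file in binary mode (hypothetical helper)."""
    if path.endswith(".bz2"):
        return bz2.open(path, mode)
    return open(path, mode)


def strip_word2vec_header(src, dst):
    """Copy src to dst, dropping the first line if it looks like a header.

    A header is assumed to be exactly two integers: "<n_words> <n_dims>".
    src and dst must be different paths. Returns True if a header was dropped.
    """
    with _open(src, "rb") as fin, _open(dst, "wb") as fout:
        first = fin.readline()
        parts = first.split()
        is_header = len(parts) == 2 and all(p.isdigit() for p in parts)
        if not is_header:
            # The first line is a real vector row, so keep it.
            fout.write(first)
        # Stream the rest in chunks so large files are never fully loaded.
        shutil.copyfileobj(fin, fout)
    return is_header
```

Calling strip_word2vec_header('gensim_vector_file.txt', 'gensim_vector_file.new') mirrors the tail command. Because the helper switches on the file extension, a destination ending in .bz2 would be written compressed in the same pass.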
If you are trying to generate vectors.bin from gensim_vector_file.txt.bz2, apply these steps to the gensim_vector_file.txt mentioned above:
bzip2 gensim_vector_file.txt
spacy.vocab.write_binary_vectors('gensim_vector_file.txt.bz2','vectors.bin')
That should do it.
Update: spaCy v2.0 comes with a lot of improvements around storing, managing and customising word vectors, including a new Vectors class. See this page for more details: https://spacy.io/usage/vectors-similarity#custom
There's also a new vocab command to help you compile a vocabulary from a JSON-formatted lexicon file.
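For the spaCy v2.0 route mentioned in the update, the vectors first have to be read from the text file. Below is a rough Python 3 sketch of that parsing step only, assuming a space-separated file with one word and its values per line and an optional gensim header. The name read_text_vectors is hypothetical, and the spaCy calls in the trailing comments are my reading of the linked page, so please check them there before relying on them.

```python
def read_text_vectors(path, encoding="utf8"):
    """Yield (word, [float, ...]) pairs from a plain-text vectors file.

    Assumes one entry per line: the word, then its values, space-separated.
    A leading "<n_words> <n_dims>" header line is skipped if present.
    """
    with open(path, encoding=encoding) as f:
        for line_no, line in enumerate(f, start=1):
            pieces = line.rstrip().split(" ")
            if line_no == 1 and len(pieces) == 2 and all(p.isdigit() for p in pieces):
                continue  # gensim/word2vec header, not a vector row
            if len(pieces) < 2:
                continue  # blank or malformed line
            yield pieces[0], [float(x) for x in pieces[1:]]


# Possible use with spaCy v2 (assumption, not run here):
#   import numpy, spacy
#   nlp = spacy.blank("pt")
#   for word, values in read_text_vectors("gensim_vector_file.txt"):
#       nlp.vocab.set_vector(word, numpy.asarray(values, dtype="float32"))
```

Keeping the parser free of spaCy imports makes it easy to check against a small sample file before loading a full vocabulary.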