Hi,
After loading wiki.hi.vec, the Hindi vectors from Facebook's fastText (https://github.com/facebookresearch/fastText), and running the sample code on it, I get the following error:
File "trial.py", line 27, in main nlp.vocab.set_vector(word, vector) # add the vectors to the vocab
File "vocab.pyx", line 337, in spacy.vocab.Vocab.set_vector
File "vectors.pyx", line 244, in spacy.vectors.Vectors.add
ValueError: could not broadcast input array from shape (299) into shape (300)
The code is as follows:
#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using fastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals
import plac
import numpy

from spacy.language import Language


@plac.annotations(
    vectors_loc=("Path to vectors", "positional", None, str))
def main(vectors_loc):
    nlp = Language()  # start off with a blank Language class
    with open(vectors_loc, 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        nlp.vocab.reset_vectors(width=int(nr_dim))
        for line in file_:
            line = line.decode('utf8')
            pieces = line.split()
            word = pieces[0]
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
    # test the vectors and similarity
    text = 'class colspan'
    doc = nlp(text)
    print(text, doc[0].similarity(doc[1]))


if __name__ == '__main__':
    plac.call(main)
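For reference, I run it as python trial.py wiki.hi.vec (the script name comes from the traceback; the argument is wherever the downloaded vectors file lives).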
This happened to me too with the English vectors. It happens because some of the words are actually whitespace. I just skip lines where len(vector) != nr_dim.
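A minimal sketch of that guard, reusing the names from the example above (exactly where the check goes is up to you):

    for line in file_:
        line = line.decode('utf8')
        pieces = line.split()
        word = pieces[0]
        vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
        if len(vector) != int(nr_dim):
            continue  # the "word" was whitespace, so the row came up short
        nlp.vocab.set_vector(word, vector)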
@danielhers Thanks! Seems like we can fix the reader:
>>> string = ' 0.1 0.2 0.3'
>>> string.rsplit(' ', 3)
[' ', '0.1', '0.2', '0.3']
This splits the string into at most nr_dim+1 pieces, counting from the right, so the first token (the word) is allowed to contain whitespace. I've updated the example accordingly.
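Concretely, the reading loop becomes something along these lines (a sketch; the updated example in the repo is the authoritative version):

    for line in file_:
        line = line.rstrip().decode('utf8')
        # split off the trailing nr_dim values from the right; whatever is
        # left over is the word, even if it contains spaces
        pieces = line.rsplit(' ', int(nr_dim))
        word = pieces[0]
        vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
        nlp.vocab.set_vector(word, vector)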
Nice, although the whitespace vectors are probably of limited use, as I think the tokenizer will always skip them in actual text.