spaCy: Unable to load Facebook's fastText Hindi vectors into spaCy

Created on 9 Nov 2017 · 4 comments · Source: explosion/spaCy

Hi,
After downloading wiki.hi.vec (Hindi) from Facebook's fastText (https://github.com/facebookresearch/fastText) and running the sample vector-loading code on it, I get this error:

File "trial.py", line 27, in main nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
File "vocab.pyx", line 337, in spacy.vocab.Vocab.set_vector
File "vectors.pyx", line 244, in spacy.vectors.Vectors.add
ValueError: could not broadcast input array from shape (299) into shape (300)

The code is as follows:

#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using fastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals
import plac
import numpy

from spacy.language import Language


@plac.annotations(
    vectors_loc=("Path to vectors", "positional", None, str))
def main(vectors_loc):
    nlp = Language()  # start off with a blank Language class
    with open(vectors_loc, 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        nlp.vocab.reset_vectors(width=int(nr_dim))
        for line in file_:
            line = line.decode('utf8')
            pieces = line.split()
            word = pieces[0]
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
    # test the vectors and similarity
    text = 'class colspan'
    doc = nlp(text)
    print(text, doc[0].similarity(doc[1]))


if __name__ == '__main__':
    plac.call(main)

  • Operating System: macOS High Sierra 10.13
  • Python Version Used: 3.6.2
  • spaCy Version Used: 2.0.2
  • Environment Information:

Labels: examples


All 4 comments

This happened to me too with the English vectors. It happens because some of the "words" are actually whitespace characters. I just skip lines where len(vector) != nr_dim.
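
A minimal sketch of that skip-based workaround, adapted from the reader loop in the script above (the helper name and signature are my own, not from the thread):

import numpy


def load_vectors_skipping_bad_rows(nlp, vectors_loc):
    with open(vectors_loc, 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        nlp.vocab.reset_vectors(width=int(nr_dim))
        for line in file_:
            pieces = line.decode('utf8').split()
            # If the "word" is itself whitespace, split() drops it, so the
            # row yields only nr_dim - 1 floats after the first piece,
            # which is exactly the (299) vs (300) broadcast error above.
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            if len(vector) != int(nr_dim):
                continue  # skip malformed rows instead of crashing set_vector
            nlp.vocab.set_vector(pieces[0], vector)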

@danielhers Thanks! Seems like we can fix the reader:

>>> string = '  0.1 0.2 0.3'
>>> string.rsplit(' ', 3)
[' ', '0.1', '0.2', '0.3']

The rsplit(' ', nr_dim) call splits the string into at most nr_dim + 1 pieces, counting from the right, so the first piece (the word) may itself contain whitespace. I've updated the example accordingly.
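
For reference, a sketch of what the fixed reader loop might look like with rsplit, wrapped in a hypothetical helper (the actual updated code lives in the spaCy examples):

import numpy


def load_vectors_rsplit(nlp, vectors_loc):
    with open(vectors_loc, 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        nr_dim = int(nr_dim)
        nlp.vocab.reset_vectors(width=nr_dim)
        for line in file_:
            line = line.decode('utf8').rstrip()
            # Split from the right into exactly nr_dim floats; whatever is
            # left over on the far left (even bare whitespace) is the word.
            pieces = line.rsplit(' ', nr_dim)
            word = pieces[0]
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)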

Nice, although the whitespace vectors are probably of limited use, as I think the tokenizer will always skip them in actual text.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

