Hello,
I downloaded the pre-trained fastText English Wikipedia model and simply executed:

```python
model = FastText.load_fasttext_format('/Users/smidj/Repositories/snippetbot/data/fasttext/wiki.en/wiki.en')
```
I got an error:
```
OSError                                  Traceback (most recent call last)
<ipython-input-3-83dcc56dad43> in <module>()
----> 1 model = FastText.load_fasttext_format('/Users/smidj/Repositories/snippetbot/data/fasttext/wiki.en/wiki.en')

/Users/smidj/.virtualenvs/snippetbot/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_fasttext_format(cls, model_file)
    236         model = cls()
    237         model.wv = cls.load_word2vec_format('%s.vec' % model_file)
--> 238         model.load_binary_data('%s.bin' % model_file)
    239         return model
    240

/Users/smidj/.virtualenvs/snippetbot/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_binary_data(self, model_binary_file)
    254         self.load_model_params(f)
    255         self.load_dict(f)
--> 256         self.load_vectors(f)
    257
    258     def load_model_params(self, file_handle):

/Users/smidj/.virtualenvs/snippetbot/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_vectors(self, file_handle)
    301
    302         self.num_original_vectors = num_vectors
--> 303         self.wv.syn0_all = np.fromstring(file_handle.read(num_vectors * dim * float_size), dtype=dtype)
    304         self.wv.syn0_all = self.wv.syn0_all.reshape((num_vectors, dim))
    305         assert self.wv.syn0_all.shape == (self.bucket + len(self.wv.vocab), self.size), \

OSError: [Errno 22] Invalid argument
```
I use macOS Sierra, Python 3.6 (in a virtualenv), and gensim version 1.0.1.
Thanks,
Jakub
The corpus seems too large to download, and hence this is hard to reproduce. Do you get the same error with a different, reasonably sized corpus as well?
Does a simple `file.read(model_file)` work for you? It might be related to a bug with reading large files on OSX.
@tmylk Indeed, it does produce the same error. Good catch!
Thanks for reporting it. Could you please post any fix or workaround you find here? Do we need to read in chunks of a given size? We need to make it work on OSX.
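Something along these lines, perhaps (a minimal sketch; the helper name and chunk size are placeholders, not gensim code):

```python
ONE_GB = 1024 * 1024 * 1024  # stay well under the ~2 GB single-read limit on OSX

def chunked_read(file_handle, total_bytes, chunk_size=ONE_GB):
    """Read total_bytes from a binary file_handle without one huge read()."""
    chunks = []
    remaining = total_bytes
    while remaining > 0:
        chunk = file_handle.read(min(chunk_size, remaining))
        if not chunk:
            break  # unexpected EOF
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)
```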
It may be enough to just use `np.fromfile(file_handle, dtype=dtype)` in place of the wasteful read-it-all-via-a-string `fromstring`/`read`.
Thanks @gojomo for the suggestion. However, in that case I am getting

```
ValueError: cannot reshape array of size 2111621876 into shape (4519370,300)
```

on the next line of fasttext.py, `self.wv.syn0_all = self.wv.syn0_all.reshape((num_vectors, dim))`.
I see; so the functional equivalent is `np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)`. Let me check whether it works (loading the model takes some time).
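For reference, here is a self-contained sketch of what that change amounts to (the function name is mine; in gensim the real code lives inside `load_vectors`). Without `count`, `np.fromfile` reads to the end of the file and picks up whatever data follows the vector matrix, which explains the reshape failure above:

```python
import numpy as np

def read_vector_matrix(file_handle, num_vectors, dim, dtype=np.float32):
    # count= stops after exactly num_vectors * dim values instead of
    # consuming the rest of the file, and fromfile sidesteps the single
    # giant read() call that fails on OSX.
    vectors = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
    return vectors.reshape((num_vectors, dim))
```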
This works; however, I noticed a strange thing. Do we set the model size correctly? The following lines

```python
model = FastText.load_fasttext_format('/Users/smidj/Repositories/snippetbot/data/fasttext/wiki.en/wiki.en')
print(str(model.vector_size))
print(str(model["hello"]))
```

print 100 (should be 300) and then a vector of size 300 (correct).
Seems `vector_size` may also need to be explicitly set. I would definitely do some sanity checks (or unit tests) based on tiny example fastText exports, to be sure there's no mis-sizing/mis-alignment happening (that risks leaving everything free of raised exceptions but totally confused/corrupted about real values).
The problem seems to be in `load_model_params` in fasttext.py. The `dim` argument is assigned to `self.size`, an attribute that does not exist beforehand. I think this line should be as follows:

```python
self.vector_size = dim
```
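A quick way to confirm the fix (the path is a placeholder): after loading, `vector_size` should agree with the dimensionality of the vectors themselves.

```python
from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format('/path/to/wiki.en/wiki.en')
# both sides should now report 300 for the English Wikipedia model
assert model.vector_size == len(model['hello'])
```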
Regarding @gojomo's suggestion, do we have a reasonably sized fastText model somewhere, or do we need to train one?
I am also thinking that the current functionality does not allow loading just the vector file together with the `vector_size`. We read the `vector_size` in `load_word2vec_format` in keyedvectors.py, but we do not store it anywhere. Maybe we could also create `self.vector_size` in the `KeyedVectors` class and save it for later use, in case you just want the vectors and not the retraining. What do you think?
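For illustration, the dimensionality is already in the first line of the .vec file, so keeping it around is cheap (a sketch of the idea; `load_vec_header` is a made-up helper, not gensim API):

```python
def load_vec_header(path):
    # the first line of a word2vec-format .vec file is "<vocab_size> <dim>"
    with open(path, encoding='utf-8') as fin:
        vocab_size, vector_size = map(int, fin.readline().split())
    return vocab_size, vector_size
```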
I have created the pull request here:
https://github.com/RaRe-Technologies/gensim/pull/1214
I agree that we should add some unit tests to double-check. How about using lee_fasttext.bin, which is already included in test/test_data and should be of reasonable size?
I am looking at the current unit tests, and it seems there is already a test for `load_fasttext_format` in test_fasttext_wrapper.py with some sanity checks. @gojomo, @markus-beuckelmann, are those the sanity checks you had in mind, or do you suggest adding extra checks?
I'm not sure that any of the existing tests verify that the values loaded into gensim for particular words are the same as they would be in the original package. (For example, if the vectors for both 'and' and 'the' were similarly corrupted by misaligned reading, the existing tests of their relationships might still pass.) The verification I'm thinking of could first be done manually, on some known (or toy-sized) fastText-created vector sets. Then maybe it could become a unit test, if a tiny set with known target values is bundled.
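For example, something like this (the paths and the probe word are placeholders), comparing a gensim-loaded vector against the same word's row in the .vec file written by the original fastText tool:

```python
import numpy as np
from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format('/path/to/toy_model')

expected = None
with open('/path/to/toy_model.vec', encoding='utf-8') as fin:
    next(fin)  # skip the "<vocab_size> <dim>" header line
    for line in fin:
        tokens = line.rstrip().split(' ')
        if tokens[0] == 'night':
            expected = np.array(tokens[1:], dtype=np.float32)
            break

assert expected is not None and np.allclose(model['night'], expected, atol=1e-4)
```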
So I updated the unit test with expected values taken either from the .vec file or obtained via the original package. @gojomo, please review the updated `testLoadFastTextFormat` and let me know whether it is what you had in mind.
Thanks,
Jakub
Potential fix in https://github.com/RaRe-Technologies/gensim/pull/1214
Yes, verifying that the same vector is observed for a test word, whether loaded by the original native code or by gensim, addresses my concern.