Hello,
I downloaded the pre-trained fastText English Wikipedia model and simply executed:

```python
model = FastText.load_fasttext_format('/Users/smidj/Repositories/snippetbot/data/fasttext/wiki.en/wiki.en')
```
I got an error:
```
OSError                                  Traceback (most recent call last)
<ipython-input-3-83dcc56dad43> in <module>()
----> 1 model = FastText.load_fasttext_format('/Users/smidj/Repositories/snippetbot/data/fasttext/wiki.en/wiki.en')

/Users/smidj/.virtualenvs/snippetbot/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_fasttext_format(cls, model_file)
    236         model = cls()
    237         model.wv = cls.load_word2vec_format('%s.vec' % model_file)
--> 238         model.load_binary_data('%s.bin' % model_file)
    239         return model
    240

/Users/smidj/.virtualenvs/snippetbot/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_binary_data(self, model_binary_file)
    254         self.load_model_params(f)
    255         self.load_dict(f)
--> 256         self.load_vectors(f)
    257
    258     def load_model_params(self, file_handle):

/Users/smidj/.virtualenvs/snippetbot/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_vectors(self, file_handle)
    301
    302         self.num_original_vectors = num_vectors
--> 303         self.wv.syn0_all = np.fromstring(file_handle.read(num_vectors * dim * float_size), dtype=dtype)
    304         self.wv.syn0_all = self.wv.syn0_all.reshape((num_vectors, dim))
    305         assert self.wv.syn0_all.shape == (self.bucket + len(self.wv.vocab), self.size), \

OSError: [Errno 22] Invalid argument
```
I use macOS Sierra, Python 3.6 (in a virtualenv), and gensim version 1.0.1.
Thanks,
Jakub
The corpus seems too large to download, and hence this is hard to reproduce. Do you get the same error with a different, reasonably sized corpus as well?
Does a simple `file.read(model_file)` work for you? It might be related to a bug with reading large files on OSX.
@tmylk Indeed, it does produce the same error. Good catch!
Thanks for reporting it. Could you please post any fix or workaround you find here? Do we need to read in chunks of a given size? We need to make it work on OSX.
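Something along these lines, perhaps (a minimal sketch; the helper name and chunk size are placeholders, not gensim code):

```python
ONE_GB = 1024 * 1024 * 1024  # stay well under the ~2 GB single-read limit on OSX

def chunked_read(file_handle, total_bytes, chunk_size=ONE_GB):
    """Read total_bytes from a binary file_handle without one huge read()."""
    chunks = []
    remaining = total_bytes
    while remaining > 0:
        chunk = file_handle.read(min(chunk_size, remaining))
        if not chunk:
            break  # unexpected EOF
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)
```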
It may be enough to just use `np.fromfile(file_handle, dtype=dtype)` in place of the wasteful read-it-all-via-a-string `fromstring`/`read`.
Thanks @gojomo for the suggestion. However, in that case I am getting

```
ValueError: cannot reshape array of size 2111621876 into shape (4519370,300)
```

on the next line of fasttext.py, `self.wv.syn0_all = self.wv.syn0_all.reshape((num_vectors, dim))`.
I see; so the functional equivalent is `np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)`. Let me check whether it works (loading the model takes some time).
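For reference, here is a self-contained sketch of what that change amounts to (the function name is mine; in gensim the real code lives inside `load_vectors`). Without `count`, `np.fromfile` reads to the end of the file and picks up whatever data follows the vector matrix, which explains the reshape failure above:

```python
import numpy as np

def read_vector_matrix(file_handle, num_vectors, dim, dtype=np.float32):
    # count= stops after exactly num_vectors * dim values instead of
    # consuming the rest of the file, and fromfile sidesteps the single
    # giant read() call that fails on OSX.
    vectors = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
    return vectors.reshape((num_vectors, dim))
```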
This works; however, I noticed a strange thing. Do we set the model size correctly? The following lines

```python
model = FastText.load_fasttext_format('/Users/smidj/Repositories/snippetbot/data/fasttext/wiki.en/wiki.en')
print(str(model.vector_size))
print(str(model["hello"]))
```

print 100 (should be 300) and then a vector of size 300 (correct).
Seems `vector_size` may also need to be explicitly set. I would definitely do some sanity checks (or unit tests) based on tiny example fastText exports, to be sure there's no mis-sizing/mis-alignment happening (that risks leaving everything free of raised exceptions but totally confused/corrupted about real values).
The problem seems to be in `load_model_params` in fasttext.py. The `dim` argument is assigned to `self.size`, an attribute that does not exist beforehand. I think this line should be as follows:

```python
self.vector_size = dim
```
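A quick way to confirm the fix (the path is a placeholder): after loading, `vector_size` should agree with the dimensionality of the vectors themselves.

```python
from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format('/path/to/wiki.en/wiki.en')
# both sides should now report 300 for the English Wikipedia model
assert model.vector_size == len(model['hello'])
```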
Regarding @gojomo's suggestion, do we have a reasonably sized fastText model somewhere, or do we need to train one?
I am also thinking that the current functionality does not allow loading just the vector file together with the `vector_size`. We read the `vector_size` in `load_word2vec_format` in keyedvectors.py, but we do not store it anywhere. Maybe we could also create `self.vector_size` in the `KeyedVectors` class and save it for later use, in case you just want the vectors and not the retraining. What do you think?
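For illustration, the dimensionality is already in the first line of the .vec file, so keeping it around is cheap (a sketch of the idea; `load_vec_header` is a made-up helper, not gensim API):

```python
def load_vec_header(path):
    # the first line of a word2vec-format .vec file is "<vocab_size> <dim>"
    with open(path, encoding='utf-8') as fin:
        vocab_size, vector_size = map(int, fin.readline().split())
    return vocab_size, vector_size
```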
I have created the pull request here:
https://github.com/RaRe-Technologies/gensim/pull/1214
I agree that we should add some unit tests to double-check. How about using lee_fasttext.bin, which is already included in test/test_data and should be of reasonable size?
I am looking at the current unit tests, and it seems there is already a test for `load_fasttext_format` in test_fasttext_wrapper.py with some sanity checks. @gojomo, @markus-beuckelmann, are those the sanity checks you had in mind, or do you suggest adding extra checks?
I'm not sure that any of the existing tests verify that the values loaded into gensim for particular words are the same as they would be in the original package. (For example, if the vectors for both 'and' and 'the' were similarly corrupted by misaligned reading, the existing tests of their relationships might still pass.) The verification I'm thinking of could first be done manually, on some known (or toy-sized) fastText-created vector sets. Then maybe it could become a unit test, if a tiny set with known target values is bundled.
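For example, something like this (the paths and the probe word are placeholders), comparing a gensim-loaded vector against the same word's row in the .vec file written by the original fastText tool:

```python
import numpy as np
from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format('/path/to/toy_model')

expected = None
with open('/path/to/toy_model.vec', encoding='utf-8') as fin:
    next(fin)  # skip the "<vocab_size> <dim>" header line
    for line in fin:
        tokens = line.rstrip().split(' ')
        if tokens[0] == 'night':
            expected = np.array(tokens[1:], dtype=np.float32)
            break

assert expected is not None and np.allclose(model['night'], expected, atol=1e-4)
```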
So I updated the unit test with expected values taken either from the .vec file or obtained via the original package. @gojomo, please review the updated `testLoadFastTextFormat` and let me know whether it is what you had in mind.
Thanks,
Jakub
Potential fix in https://github.com/RaRe-Technologies/gensim/pull/1214
Yes, verifying that the same vector is observed for a test word, whether loaded by the original native code or by gensim, addresses my concern.