I am trying to load the pre-trained English .bin model from fastText, but I am getting a ValueError.
Versions:
Windows-10-10.0.17134-SP0
Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.2
FAST_VERSION 0
I am running the following line:
model_fast = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')
However, I get this error:
ValueError Traceback (most recent call last)
<ipython-input-7-615bb517a8f5> in <module>
----> 1 model_fast = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')
c:\users\david\desktop\bennet~1\bennet~1\bert_t~1\env\lib\site-packages\gensim\models\fasttext.py in load_facebook_model(path, encoding)
1241
1242 """
-> 1243 return _load_fasttext_format(path, encoding=encoding, full_model=True)
1244
1245
c:\users\david\desktop\bennet~1\bennet~1\bert_t~1\env\lib\site-packages\gensim\models\fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
1321 """
1322 with smart_open(model_file, 'rb') as fin:
-> 1323 m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
1324
1325 model = FastText(
c:\users\david\desktop\bennet~1\bennet~1\bert_t~1\env\lib\site-packages\gensim\models\_fasttext_bin.py in load(fin, encoding, full_model)
272 model.update(raw_vocab=raw_vocab, vocab_size=vocab_size, nwords=nwords)
273
--> 274 vectors_ngrams = _load_matrix(fin, new_format=new_format)
275
276 if not full_model:
c:\users\david\desktop\bennet~1\bennet~1\bert_t~1\env\lib\site-packages\gensim\models\_fasttext_bin.py in _load_matrix(fin, new_format)
235
236 matrix = np.fromfile(fin, dtype=dtype, count=num_vectors * dim)
--> 237 matrix = matrix.reshape((num_vectors, dim))
238 return matrix
239
ValueError: cannot reshape array of size 1116604308 into shape (4000000,300)
Thank you for reporting this. I've reproduced the problem on my side.
I will investigate and report back here when I have some results.
Please try unzipping the file using gunzip or a similar utility before opening it via gensim. In my case, I was able to load the model successfully after doing this.
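For reference, the same decompression can be done from Python using only the standard library. This is a minimal sketch of that workaround; the gunzip helper and the file names are illustrative, not part of gensim's API:

```python
import gzip
import shutil

def gunzip(src_path, dst_path):
    """Stream-decompress a .gz file to disk without loading it all into memory."""
    with gzip.open(src_path, 'rb') as fin, open(dst_path, 'wb') as fout:
        shutil.copyfileobj(fin, fout)

# Decompress first, then load the plain .bin file:
# gunzip('cc.en.300.bin.gz', 'cc.en.300.bin')
# model = gensim.models.fasttext.load_facebook_model('cc.en.300.bin')
```

After decompression, passing the plain .bin path to load_facebook_model avoids the gzip code path entirely.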
I suspect some of the lower-level I/O code (possibly smart_open) that handles decompression is at fault here.
Sorry to bother you, but do you know anything about the time_slices parameter in DtmModel? I would appreciate your time.
model = DtmModel(path_to_dtm_binary, corpus=corpus, time_slices=time_slice, id2word=dic, num_topics=num_topics, alpha=alpha, mode='time')
@giant-armadillo Sorry, this is not the appropriate place for such questions. Please try on the mailing list.
I've investigated this issue for a few hours. Fortunately, I can exclude smart_open from the list of culprits. Unfortunately, the problem appears to lie deeper than that: there is some interplay between the standard library's gzip module and NumPy's np.fromfile function that causes different behavior under some conditions.
I've tried flushing the buffer and seeking around before calling np.fromfile, to no avail. I suspect there may be some sort of bug lurking in either gzip or numpy. At this stage, I doubt the bug is in our code, because we're not doing anything that convoluted (just calling fin.read() a lot of times), and the problem instantly goes away when we stop using gzip.
I don't have the capacity to continue investigating this today, but I'll be coming back to this over the next few days (and weeks, if it takes that long). In the meantime, here's my script to reproduce the problem:
import argparse
import contextlib
import io
import gzip
import sys

import numpy as np

from gensim.models._fasttext_bin import load


@contextlib.contextmanager
def my_open(path):
    """Transparently decompresses .gz file by wrapping in gzip.GzipFile."""
    if path.endswith('.gz'):
        with io.open(path, 'rb') as fin_outer:
            with gzip.GzipFile(fileobj=fin_outer, mode='rb') as fin:
                yield fin
    else:
        with io.open(path, 'rb') as fin:
            yield fin


def test_load(fin):
    """Load the model using our internal FastText I/O.

    This works for .bin but breaks for .gz.
    """
    model = load(fin)
    print(model)


def test_seek(fin):
    """Seek to the offset of the matrix directly, try to load it.

    This works for both .bin and .gz.
    """
    fin.seek(37176278)
    matrix = np.fromfile(fin, dtype=np.dtype('float32'), count=120000000)
    print(matrix.shape)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('path')
    parser.add_argument('--test-function', default='test_load')
    args = parser.parse_args()
    test_function = globals()[args.test_function]
    with my_open(args.path) as fin:
        test_function(fin)


if __name__ == '__main__':
    main()
Pinging @piskvorky because this bug is interesting.
Awesome detective work @mpenkov !
For anyone bitten by this: until we (or numpy, or the Python devs…) resolve this, use a decompressed file instead, and avoid .gz input with FastText (@mpenkov does this affect other models too?).
After discussing with @menshikh-iv, our conclusion is that this may be a bug with numpy: https://github.com/numpy/numpy/issues/13470
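For anyone curious about the mechanics: np.fromfile reads through the underlying OS file descriptor, so when it is handed a wrapper like gzip.GzipFile it can end up consuming raw compressed bytes instead of the decompressed stream. A safer pattern for file-like objects is to read bytes through the object's own read() and hand them to np.frombuffer. This is a sketch of that pattern; the read_matrix helper is hypothetical, not gensim's actual fix:

```python
import gzip
import io

import numpy as np

def read_matrix(fin, dtype, count):
    """Read `count` items of `dtype` from any binary file-like object.

    Unlike np.fromfile, this goes through fin.read(), so it works with
    wrappers such as gzip.GzipFile that decompress on the fly.
    """
    n_bytes = count * np.dtype(dtype).itemsize
    buf = fin.read(n_bytes)
    return np.frombuffer(buf, dtype=dtype, count=count)

# Round-trip demonstration with an in-memory gzip stream:
original = np.arange(12, dtype=np.float32)
compressed = gzip.compress(original.tobytes())
with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode='rb') as fin:
    restored = read_matrix(fin, np.float32, 12)
assert np.array_equal(original, restored)
```

The same approach would apply to the matrix load in _fasttext_bin.py, at the cost of buffering the raw bytes once before constructing the array.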
Wow, thanks for the quick reply and fix. I can confirm that unzipping the model before reading it in solves the problem.