Gensim: Using the load_facebook_model method produces ValueError on array reshaping

Created on 3 May 2019 · 8 comments · Source: RaRe-Technologies/gensim

I am trying to load in the English, pre-trained, bin model from FastText, but am getting a ValueError.

Versions:

Windows-10-10.0.17134-SP0
Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.2
FAST_VERSION 0

I am running the following line:
model_fast = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')

However, I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-7-615bb517a8f5> in <module>
----> 1 model_fast = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')

c:\users\david\desktop\bennet~1\bennet~1\bert_t~1\env\lib\site-packages\gensim\models\fasttext.py in load_facebook_model(path, encoding)
   1241 
   1242     """
-> 1243     return _load_fasttext_format(path, encoding=encoding, full_model=True)
   1244 
   1245 

c:\users\david\desktop\bennet~1\bennet~1\bert_t~1\env\lib\site-packages\gensim\models\fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
   1321     """
   1322     with smart_open(model_file, 'rb') as fin:
-> 1323         m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
   1324 
   1325     model = FastText(

c:\users\david\desktop\bennet~1\bennet~1\bert_t~1\env\lib\site-packages\gensim\models\_fasttext_bin.py in load(fin, encoding, full_model)
    272     model.update(raw_vocab=raw_vocab, vocab_size=vocab_size, nwords=nwords)
    273 
--> 274     vectors_ngrams = _load_matrix(fin, new_format=new_format)
    275 
    276     if not full_model:

c:\users\david\desktop\bennet~1\bennet~1\bert_t~1\env\lib\site-packages\gensim\models\_fasttext_bin.py in _load_matrix(fin, new_format)
    235 
    236     matrix = np.fromfile(fin, dtype=dtype, count=num_vectors * dim)
--> 237     matrix = matrix.reshape((num_vectors, dim))
    238     return matrix
    239 

ValueError: cannot reshape array of size 1116604308 into shape (4000000,300)
Labels: bug

All 8 comments

Thank you for reporting this. I've reproduced the problem on my side.

I will investigate and report back here when I have some results.

Please try unzipping the file using gunzip or a similar utility before opening it via gensim. In my case, I was able to load the model successfully after doing this.
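In case it helps, the decompress-then-load workaround can be sketched like this (the file paths are placeholders for wherever you keep the model):

```python
import gzip
import shutil


def decompress(gz_path, out_path):
    """Stream-decompress a .gz file to disk without holding it all in memory."""
    with gzip.open(gz_path, 'rb') as fin, open(out_path, 'wb') as fout:
        shutil.copyfileobj(fin, fout)


# Placeholder paths; adjust to your setup.
# decompress('cc.en.300.bin.gz', 'cc.en.300.bin')
# model_fast = gensim.models.fasttext.load_facebook_model('cc.en.300.bin')
```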

I suspect some of the lower-level I/O code (possibly smart_open) that handles decompression is at fault here.

Sorry to bother you, but do you know anything about the time_slices parameter in DtmModel? I'd appreciate your time.
model = DtmModel(path_to_dtm_binary, corpus=corpus, time_slices=time_slice, id2word=dic, num_topics=num_topics, alpha=alpha, mode='time')

@giant-armadillo Sorry, this is not the appropriate place for such questions. Please try on the mailing list.

I've investigated this issue for a few hours. Fortunately, I can exclude smart_open from the list of culprits. Unfortunately, the problem appears to lie deeper than that. There is some interplay between the standard library's gzip module and numpy's fromfile function that causes different behavior under some conditions:

  1. When numpy reads directly from an io.open stream, everything works fine.
  2. When numpy reads from an io.open stream via gzip.GzipFile, things fall apart:
    a. If we seek to the required offset directly and call numpy.fromfile to load the matrix, it works.
    b. If we first perform the required reads from the stream (e.g. loading model parameters, vocabulary, etc.) before calling numpy.fromfile, it breaks. More specifically, numpy gives us fewer matrix elements than we need, and those elements appear to be junk.

I've tried flushing the buffer and seeking around before calling fromfile, to no avail. I suspect there may be some sort of bug lurking in either gzip or numpy. At this stage, I doubt the bug is in our code, because we're not doing anything convoluted (just calling fin.read() a lot of times), and the problem instantly goes away when we stop using gzip.
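For what it's worth, one way to sidestep numpy.fromfile entirely is to read the bytes through the GzipFile wrapper ourselves and build the array with numpy.frombuffer; frombuffer operates only on bytes already in memory, so it can never accidentally pick up raw compressed bytes from the underlying file descriptor. This is just a sketch of the idea, not what gensim currently does (the read_array helper here is hypothetical):

```python
import numpy as np


def read_array(fin, dtype, count):
    """Read `count` items of `dtype` via the stream's own read() method.

    Because np.frombuffer works on bytes we have already pulled through
    the GzipFile wrapper, decompression cannot be bypassed.
    """
    dtype = np.dtype(dtype)
    nbytes = count * dtype.itemsize
    data = fin.read(nbytes)
    if len(data) != nbytes:
        raise EOFError('expected %d bytes, got %d' % (nbytes, len(data)))
    return np.frombuffer(data, dtype=dtype)
```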

I don't have the capacity to continue investigating this today, but I'll be coming back to this over the next few days (and weeks, if it takes that long). In the meanwhile, here's my script to reproduce the problem:

import argparse
import contextlib
import io
import gzip
import sys

import numpy as np
from gensim.models._fasttext_bin import load

@contextlib.contextmanager
def my_open(path):
    """Transparently decompress .gz files by wrapping them in gzip.GzipFile."""
    if path.endswith('.gz'):
        with io.open(path, 'rb') as fin_outer:
            with gzip.GzipFile(fileobj=fin_outer, mode='rb') as fin:
                yield fin
    else:
        with io.open(path, 'rb') as fin:
            yield fin


def test_load(fin):
    """Load the model using our internal FastText I/O.

    This works for .bin but breaks for .gz.
    """
    model = load(fin)
    print(model)


def test_seek(fin):
    """Seek to the offset of the matrix directly, try to load it.

    This works for both .bin and .gz.
    """
    fin.seek(37176278)
    matrix = np.fromfile(fin, dtype=np.dtype('float32'), count=120000000)
    print(matrix.shape)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('path')
    parser.add_argument('--test-function', default='test_load')
    args = parser.parse_args()

    test_function = globals()[args.test_function]

    with my_open(args.path) as fin:
        test_function(fin)


if __name__ == '__main__':
    main()

Pinging @piskvorky because this bug is interesting.

Awesome detective work @mpenkov !

For anyone bitten by this: until we (or numpy, or the Python devs…) resolve this, use a decompressed file instead. Avoid .gz input with FastText (@mpenkov, does this affect other models too?).

After discussing with @menshikh-iv, our conclusion is that this may be a bug with numpy: https://github.com/numpy/numpy/issues/13470

Wow, thanks for the quick reply and fix. I can confirm that unzipping the model before reading it in solves the problem.
