Gensim: Filename overwrite in `FastText.load_fasttext_format`

Created on 7 Mar 2019  路  8Comments  路  Source: RaRe-Technologies/gensim

I download FB pretrained common-crawl model using https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz from https://fasttext.cc/docs/en/crawl-vectors.html page.
When I try to load it, I get an error

Code:

from gensim.models import FastText

ft_cc = FastText.load_fasttext_format("models/cc.en.300.bin.gz", full_model=False)

Stacktrace:

FileNotFoundError                         Traceback (most recent call last)
<timed exec> in <module>

~/.local/lib/python3.6/site-packages/gensim/models/fasttext.py in load_fasttext_format(cls, model_file, encoding, full_model)
   1012 
   1013         """
-> 1014         return _load_fasttext_format(model_file, encoding=encoding, full_model=full_model)
   1015 
   1016     def load_binary_data(self, encoding='utf8'):

~/.local/lib/python3.6/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
   1245     if not model_file.endswith('.bin'):
   1246         model_file += '.bin'
-> 1247     with smart_open(model_file, 'rb') as fin:
   1248         m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
   1249 

~/.local/lib/python3.6/site-packages/smart_open/smart_open_lib.py in smart_open(uri, mode, **kw)
    179         raise TypeError('mode should be a string')
    180 
--> 181     fobj = _shortcut_open(uri, mode, **kw)
    182     if fobj is not None:
    183         return fobj

~/.local/lib/python3.6/site-packages/smart_open/smart_open_lib.py in _shortcut_open(uri, mode, **kw)
    299     #
    300     if six.PY3:
--> 301         return open(parsed_uri.uri_path, mode, buffering=buffering, **open_kwargs)
    302     elif not open_kwargs:
    303         return open(parsed_uri.uri_path, mode, buffering=buffering)

FileNotFoundError: [Errno 2] No such file or directory: 'models/cc.en.300.bin.gz.bin'

Problem in filename overwriting that happens
https://github.com/RaRe-Technologies/gensim/blob/cebc9dbeacddf7689c35aadb8854ae850d4561d2/gensim/models/fasttext.py#L1302-L1303

To avoid that, I should gunzip model file first (that's definitely not a thing that user expect).

Versions:
no matters, all python & gensim version affected, including last release.

difficulty easy

Most helpful comment

I'm already not a maintainer @piskvorky, this change was introduced 2 years ago (when I only start, just merged PR by Jayant approve), I really remember nothing about it.

I found an "source" of this changes, this code firstly was introduced in #1341 (and after moves to other classes, like FastText).

@jayantj @prakhar2b can you comment that?

All 8 comments

Not necessarily a bug, but I do think we could support this. Especially since we're already using smart_open (I think), so .gz comes for free.

The other option is to inform users through documentation/tutorials that they need to unpack their models first.

@menshikh-iv can you open a PR?

Not necessarily a bug,

from user PoV - definitely a bug (fasttext try to open a file that user don't specify, instead of passed file and failed on it).

can you open a PR?

No time, sorry

@menshikh-iv what is the if not model_file.endswith('.bin'): condition for? Looks weird to me, to be changing the user input like that.

@piskvorky ask an author of this code, not me (btw @mpenkov isn't an author too, they just refactored it, according to git blame).

I guess that this is for the case when user have 2 files X.bin and X.vec and allow user to specify load_fasttext_format("X") (without extension) and automatically load X.bin (instead of specified X that doesnt exist).

Well asking you, the project maintainer from when these changes happened and were merged, seemed like a good start to understand them :)

If the filename change is not needed (or its reason is undocumented/unknown/outdated), then I propose simply removing it. Silently modifying user input looks like an anti-pattern to me.

Feel free to open a PR when you have the time. We'll have to be careful with updating tutorials too (if they rely on this silent extension promotion).

I'm already not a maintainer @piskvorky, this change was introduced 2 years ago (when I only start, just merged PR by Jayant approve), I really remember nothing about it.

I found an "source" of this changes, this code firstly was introduced in #1341 (and after moves to other classes, like FastText).

@jayantj @prakhar2b can you comment that?

My understanding is this mirrors FB behavior. You specify model_name, and it creates model_name.bin and model_name.vec.

This feature allows the user to specify model_name only and not worry about the file extension. In my opinion, we鈥檙e better off without it.

I made these changes two years ago. There was a bug in facebook's french pretrained vector
mismatch in vec and bin files in french pretrained vector

As a workaround, we made changes in gensim to load fastText models using only bin files. #1341

A lot of changes were made after that in Gensim and fastText by someone else, and this bug has arisen most probably because backward compatibility was not taken care of.

@menshikh-iv @piskvorky

Was this page helpful?
0 / 5 - 0 ratings

Related issues

mikkokotila picture mikkokotila  路  3Comments

ahmedbhabbas picture ahmedbhabbas  路  4Comments

menshikh-iv picture menshikh-iv  路  3Comments

Laubeee picture Laubeee  路  3Comments

vlad17 picture vlad17  路  4Comments