I download FB pretrained common-crawl model using https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz from https://fasttext.cc/docs/en/crawl-vectors.html page.
When I try to load it, I get an error
Code:
from gensim.models import FastText
ft_cc = FastText.load_fasttext_format("models/cc.en.300.bin.gz", full_model=False)
Stacktrace:
FileNotFoundError Traceback (most recent call last)
<timed exec> in <module>
~/.local/lib/python3.6/site-packages/gensim/models/fasttext.py in load_fasttext_format(cls, model_file, encoding, full_model)
1012
1013 """
-> 1014 return _load_fasttext_format(model_file, encoding=encoding, full_model=full_model)
1015
1016 def load_binary_data(self, encoding='utf8'):
~/.local/lib/python3.6/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
1245 if not model_file.endswith('.bin'):
1246 model_file += '.bin'
-> 1247 with smart_open(model_file, 'rb') as fin:
1248 m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
1249
~/.local/lib/python3.6/site-packages/smart_open/smart_open_lib.py in smart_open(uri, mode, **kw)
179 raise TypeError('mode should be a string')
180
--> 181 fobj = _shortcut_open(uri, mode, **kw)
182 if fobj is not None:
183 return fobj
~/.local/lib/python3.6/site-packages/smart_open/smart_open_lib.py in _shortcut_open(uri, mode, **kw)
299 #
300 if six.PY3:
--> 301 return open(parsed_uri.uri_path, mode, buffering=buffering, **open_kwargs)
302 elif not open_kwargs:
303 return open(parsed_uri.uri_path, mode, buffering=buffering)
FileNotFoundError: [Errno 2] No such file or directory: 'models/cc.en.300.bin.gz.bin'
Problem in filename overwriting that happens
https://github.com/RaRe-Technologies/gensim/blob/cebc9dbeacddf7689c35aadb8854ae850d4561d2/gensim/models/fasttext.py#L1302-L1303
To avoid that, I should gunzip model file first (that's definitely not a thing that user expect).
Versions:
no matters, all python & gensim version affected, including last release.
Not necessarily a bug, but I do think we could support this. Especially since we're already using smart_open (I think), so .gz comes for free.
The other option is to inform users through documentation/tutorials that they need to unpack their models first.
@menshikh-iv can you open a PR?
Not necessarily a bug,
from user PoV - definitely a bug (fasttext try to open a file that user don't specify, instead of passed file and failed on it).
can you open a PR?
No time, sorry
@menshikh-iv what is the if not model_file.endswith('.bin'): condition for? Looks weird to me, to be changing the user input like that.
@piskvorky ask an author of this code, not me (btw @mpenkov isn't an author too, they just refactored it, according to git blame).
I guess that this is for the case when user have 2 files X.bin and X.vec and allow user to specify load_fasttext_format("X") (without extension) and automatically load X.bin (instead of specified X that doesnt exist).
Well asking you, the project maintainer from when these changes happened and were merged, seemed like a good start to understand them :)
If the filename change is not needed (or its reason is undocumented/unknown/outdated), then I propose simply removing it. Silently modifying user input looks like an anti-pattern to me.
Feel free to open a PR when you have the time. We'll have to be careful with updating tutorials too (if they rely on this silent extension promotion).
I'm already not a maintainer @piskvorky, this change was introduced 2 years ago (when I only start, just merged PR by Jayant approve), I really remember nothing about it.
I found an "source" of this changes, this code firstly was introduced in #1341 (and after moves to other classes, like FastText).
@jayantj @prakhar2b can you comment that?
My understanding is this mirrors FB behavior. You specify model_name, and it creates model_name.bin and model_name.vec.
This feature allows the user to specify model_name only and not worry about the file extension. In my opinion, we鈥檙e better off without it.
I made these changes two years ago. There was a bug in facebook's french pretrained vector
mismatch in vec and bin files in french pretrained vector
As a workaround, we made changes in gensim to load fastText models using only bin files. #1341
A lot of changes were made after that in Gensim and fastText by someone else, and this bug has arisen most probably because backward compatibility was not taken care of.
@menshikh-iv @piskvorky
Most helpful comment
I'm already not a maintainer @piskvorky, this change was introduced 2 years ago (when I only start, just merged PR by Jayant approve), I really remember nothing about it.
I found an "source" of this changes, this code firstly was introduced in #1341 (and after moves to other classes, like
FastText).@jayantj @prakhar2b can you comment that?