Gensim: save_facebook_model() - AssertionError

Created on 10 Jun 2020 · 12 comments · Source: RaRe-Technologies/gensim

Problem description

I am trying to save a trained fastText model using the new save_facebook_model() function.
I was unable to do so because an AssertionError is raised at this line:
assert vocab_n == len(model.wv.vocab)
My model's vocabulary contains 2,000,264 words:

len(model.wv.vocab)
2000264

I tried with a model whose vocabulary has 4,500 words and it worked, so I guessed there might be a size limitation. But the error message gives no hint about the actual cause.

Steps/code/corpus to reproduce

from gensim.models.fasttext import save_facebook_model

save_facebook_model(model,'own_fasttext_model_pretrained.bin')

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-201-0a3c1c458b74> in <module>
      2 from gensim.models.fasttext import load_facebook_model, load_facebook_vectors,save_facebook_model
      3 
----> 4 save_facebook_model(model,'own_fasttext_model_pretrained.bin')

/opt/conda/lib/python3.7/site-packages/gensim/models/fasttext.py in save_facebook_model(model, path, encoding, lr_update_rate, word_ngrams)
   1334     """
   1335     fb_fasttext_parameters = {"lr_update_rate": lr_update_rate, "word_ngrams": word_ngrams}
-> 1336     gensim.models._fasttext_bin.save(model, path, fb_fasttext_parameters, encoding)

/opt/conda/lib/python3.7/site-packages/gensim/models/_fasttext_bin.py in save(model, fout, fb_fasttext_parameters, encoding)
    666     if isinstance(fout, str):
    667         with open(fout, "wb") as fout_stream:
--> 668             _save_to_stream(model, fout_stream, fb_fasttext_parameters, encoding)
    669     else:
    670         _save_to_stream(model, fout, fb_fasttext_parameters, encoding)

/opt/conda/lib/python3.7/site-packages/gensim/models/_fasttext_bin.py in _save_to_stream(model, fout, fb_fasttext_parameters, encoding)
    629 
    630     # Save words and ngrams vectors
--> 631     _input_save(fout, model)
    632     fout.write(struct.pack('@?', False))  # Save 'quot_', which is False for unsupervised models
    633 

/opt/conda/lib/python3.7/site-packages/gensim/models/_fasttext_bin.py in _input_save(fout, model)
    573 
    574     assert vocab_dim == ngrams_dim
--> 575     assert vocab_n == len(model.wv.vocab)
    576     assert ngrams_n == model.wv.bucket
    577 

AssertionError: 

Versions

Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Linux-4.9.0-12-amd64-x86_64-with-debian-9.12
Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) 
[GCC 7.3.0]
NumPy 1.18.1
SciPy 1.4.1
gensim 3.8.3
FAST_VERSION 0
bug

All 12 comments

Thank you for reporting this.

How did you train the model? We cannot reproduce the problem without it.

Yes - it's probably not just the size, but something about how that model was created/trained/modified. Knowing more about the steps it went through, in your code, would help – but best of all would be a complete standalone code example which can trigger the assertion error.

(In any case, knowing that a mysterious failure of that assertion has happened "in the field", we could add a more descriptive failure message, with some of the relevant values, to narrow what kind of corruption occurred. For example, is the len(vocab) suspiciously small or the vectors_vocab.shape[0] suspiciously big?)
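A more descriptive check could look like the minimal sketch below. It is only an illustration, not gensim code: `describe_mismatch` is a hypothetical helper, and the attribute names `vectors_vocab`, `bucket`, and `vectors_ngrams` are assumed from the traceback above.

```python
import numpy as np

def describe_mismatch(vocab_len, vectors_vocab, bucket, vectors_ngrams):
    """Report which save-time invariant is violated, with the offending
    numbers, instead of failing with a bare assert."""
    problems = []
    if vectors_vocab.shape[0] != vocab_len:
        problems.append(
            "vectors_vocab has %d rows but the vocab holds %d words"
            % (vectors_vocab.shape[0], vocab_len)
        )
    if vectors_ngrams.shape[0] != bucket:
        problems.append(
            "vectors_ngrams has %d rows but bucket=%d"
            % (vectors_ngrams.shape[0], bucket)
        )
    return problems

# Toy example of a model whose vocab grew out of sync with its vectors:
print(describe_mismatch(7, np.zeros((5, 4)), 3, np.zeros((3, 4))))
```

With a real model one would call `describe_mismatch(len(model.wv.vocab), model.wv.vectors_vocab, model.wv.bucket, model.wv.vectors_ngrams)` just before saving, and raise an informative error if the returned list is non-empty.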

The model was created by loading the Facebook model cc.en.300.bin and training it further on my own dataset of sentences:

from gensim.models.fasttext import load_facebook_model

model = load_facebook_model('fasttext/cc.en.300.bin')
model.build_vocab(sentences=list(df_train.desc_tokens_spacy), update=True)
model.train(sentences=list(df_train.desc_tokens_spacy), total_examples=len(df_train.desc_tokens_spacy), epochs=10)

Thanks! What if you don't try any additional vocab-updates/training? That is, try the .save_facebook_model() immediately after the load_facebook_model()?

I just tried what you suggested, @gojomo, and it saved the model that was just loaded. I also checked the vocabulary length: it was 2,000,000.

That's very helpful, thanks - it suggests it is the vocab-expansion that causes the inconsistency.

I suspect there might be other side effects as well. The FastText class keeps both raw full-word vectors (as directly trained) and composed vectors (those raw full-word vectors plus their constituent ngrams), and this suggests vocab-expansion might not be updating both in sync. That could mean the word vectors used by other operations are stale, and other operations might similarly hit an expected-size mismatch.

@gojomo is this still the case, post-#2698 and #2891?

Related to #2873, #2879.

I've never personally reproduced this, & don't know the real cause, so unsure of the likelihood of its persistence after other fixes. (I'm not sure it necessarily even requires loading an FB-native model - any vocab-expansion via build_vocab(... update=True) might be enough... but there's not yet a compact standalone triggering case.)

It might have been fixed by other refactoring... @imendibo, if you are comfortable installing & running the develop in-progress branch of Gensim, and you could try reproducing this there, that'd be a help.

In case it's unclear, I'm not really a fan of the build_vocab(..., update=True) functionality, going back to its original arrival in #365 - it's never seemed completed (still segfaults in Doc2Vec), & makes something look sensible/supported that's actually very shaky in practice. I've never used it myself, & wouldn't encourage its use, so my time budget for fixing its problems is negligible. (My hunch would be that something based instead on the TranslationMatrix functionality would be a more-grounded & reliable way to expand vocabularies: train a new model on new data, but then translate the new words back into the old space, perhaps even with an explicit, tunable 'new-vs-old-relative-weight' parameter. Or maybe implement some well-described 'fine-tuning' idea that's written up in some research article. But I've not had a project or time to do that R&D.)
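The translation-matrix idea sketched in the comment above (train a new model on new data, then map the new words back into the old space) can be illustrated with plain NumPy. This is only a sketch of the underlying technique, not gensim's TranslationMatrix API: it assumes both models share enough anchor words to fit a linear map by least squares.

```python
import numpy as np

def learn_mapping(src, dst):
    """Least-squares linear map W such that src @ W ~= dst.

    src, dst: (n_shared_words, dim) arrays holding the vectors of words
    present in both the newly trained and the original model.
    """
    W, *_ = np.linalg.lstsq(src, dst, rcond=None)
    return W

# Toy demonstration: recover a known 90-degree rotation from samples.
rng = np.random.default_rng(0)
true_w = np.array([[0.0, 1.0], [-1.0, 0.0]])
src = rng.normal(size=(50, 2))   # "new model" vectors for shared words
dst = src @ true_w               # corresponding "old model" vectors
W = learn_mapping(src, dst)

# Vectors of new-only words would then be projected into the old space:
projected = rng.normal(size=(3, 2)) @ W
```

The suggested 'new-vs-old-relative-weight' parameter would correspond to blending each projected vector with any existing estimate before inserting it into the old model's vocabulary.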

(I don't think #2891, specifically, would change this behavior either way, but #2698's adjustments to FT and FTKV behaviors might've remedied whatever failure-to-keep-state-in-sync caused this.)

@imendibo can you check with the current develop? We've fixed a bunch of stuff there.

Hello,
I have the same issue here with build_vocab(..., update=True).
I am on gensim 4.0.0.dev.

Here is the snippet, with the same error as before:
```
model = gensim.models.fasttext.load_facebook_model("fasttext.bin")
model.build_vocab(corpus_file="data.txt", update=True)
model.train(
    corpus_file="data.txt", epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
)
gensim.models.fasttext.save_facebook_model(model, "test.bin")
```

```
File "/Users/cviricel/365talents-nodejs/components/data-service/.venv/lib/python3.6/site-packages/gensim-4.0.0.dev0-py3.6-macosx-10.15-x86_64.egg/gensim/models/fasttext.py", line 1145, in save_facebook_model
  gensim.models._fasttext_bin.save(model, path, fb_fasttext_parameters, encoding)
File "/Users/cviricel/365talents-nodejs/components/data-service/.venv/lib/python3.6/site-packages/gensim-4.0.0.dev0-py3.6-macosx-10.15-x86_64.egg/gensim/models/_fasttext_bin.py", line 674, in save
  _save_to_stream(model, fout_stream, fb_fasttext_parameters, encoding)
File "/Users/cviricel/365talents-nodejs/components/data-service/.venv/lib/python3.6/site-packages/gensim-4.0.0.dev0-py3.6-macosx-10.15-x86_64.egg/gensim/models/_fasttext_bin.py", line 637, in _save_to_stream
  _input_save(fout, model)
File "/Users/cviricel/365talents-nodejs/components/data-service/.venv/lib/python3.6/site-packages/gensim-4.0.0.dev0-py3.6-macosx-10.15-x86_64.egg/gensim/models/_fasttext_bin.py", line 581, in _input_save
  assert vocab_n == len(model.wv)
AssertionError
```

To understand this, I printed the assertion conditions before and after build_vocab, and before and after training.
Before build_vocab, len(model.wv) and vocab_n (= model.wv.vectors_vocab.shape[0]) are equal (500,496 and 500,496 in my case).
After build_vocab(..., update=True) they are no longer equal (506,278 vs 503,387).
This behaviour is odd, because it means the vocab size implied by model.wv.vectors_vocab no longer matches the vocab size of model.wv.
Training changes nothing either way, so the assertion fails on save.

A test to trigger this error is now in #2944. Remedying the code mistake noted in #2943 isn't enough to prevent the save_facebook_model problem.

