Currently gensim cannot load a native fastText model and continue training it. According to the docs [1], this is because it only loads the input-hidden matrix. However, fastText also saves the hidden-output matrix [2].
Moreover, even the input-hidden matrix alone could support some form of transfer learning, with the hidden-output matrix initialized randomly, similar to how gensim.models.Word2Vec.intersect_word2vec_format() works.
Please correct me if I'm wrong here, but I don't think there is any technical issue preventing loading a fastText model and continuing training. How about supporting this feature?
@tranhungnghiep thanks for the request. As I remember, FB distributes 2 types of models:
- a .vec file (i.e. no ngrams, only 1 matrix for words) in plain-text format; to load this, you should use KeyedVectors.load_word2vec_format
- a .bin file; FastText.load_fasttext_format should be used for ngrams & continuing the training process

I think this is a bug in the current implementation (this should already work):
from gensim.models import FastText
from gensim.test.utils import common_texts
m = FastText.load_fasttext_format("wiki.ru.bin") # load wiki FB model from https://fasttext.cc/docs/en/pretrained-vectors.html
m.build_vocab(common_texts, update=True) # this doesn't work, but should. See also https://github.com/RaRe-Technologies/gensim/issues/2139
"""
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in build_vocab(self, sentences, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
480 return super(FastText, self).build_vocab(
481 sentences, update=update, progress_per=progress_per,
--> 482 keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
483
484 def _set_train_params(self, **kwargs):
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/base_any2vec.pyc in build_vocab(self, sentences, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
805 trim_rule=trim_rule, **kwargs)
806 report_values['memory'] = self.estimate_memory(vocab_size=report_values['num_retained_words'])
--> 807 self.trainables.prepare_weights(self.hs, self.negative, self.wv, update=update, vocabulary=self.vocabulary)
808
809 def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in prepare_weights(self, hs, negative, wv, update, vocabulary)
932
933 def prepare_weights(self, hs, negative, wv, update=False, vocabulary=None):
--> 934 super(FastTextTrainables, self).prepare_weights(hs, negative, wv, update=update, vocabulary=vocabulary)
935 self.init_ngrams_weights(wv, update=update, vocabulary=vocabulary)
936
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/word2vec.pyc in prepare_weights(self, hs, negative, wv, update, vocabulary)
1744 self.reset_weights(hs, negative, wv)
1745 else:
-> 1746 self.update_weights(hs, negative, wv)
1747
1748 def seeded_vector(self, seed_string, vector_size):
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/word2vec.pyc in update_weights(self, hs, negative, wv)
1791 self.syn1 = vstack([self.syn1, zeros((gained_vocab, self.layer1_size), dtype=REAL)])
1792 if negative:
-> 1793 self.syn1neg = vstack([self.syn1neg, zeros((gained_vocab, self.layer1_size), dtype=REAL)])
1794 wv.vectors_norm = None
1795
AttributeError: 'FastTextTrainables' object has no attribute 'syn1neg'
"""
m.train(common_texts, epochs=1, total_examples=len(common_texts))
"""
Exception in thread Thread-17:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 164, in _worker_loop
tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
File "/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.py", line 555, in _do_train_job
tally += train_batch_sg(self, sentences, alpha, work, neu1)
File "gensim/models/fasttext_inner.pyx", line 276, in gensim.models.fasttext_inner.train_batch_sg
cdef REAL_t *word_locks_vocab = <REAL_t *>(np.PyArray_DATA(model.trainables.vectors_vocab_lockf))
AttributeError: 'FastTextTrainables' object has no attribute 'vectors_vocab_lockf'
"""
Of course, I'm +1 for fixing this issue -> training will work as @tranhungnghiep suggests.
Related issue - #2139
@menshikh-iv Thanks for looking into it.
This issue is a more low-level problem: FastText.load_fasttext_format() currently does not load the hidden-output matrix. After loading, we may need to do some checks and initializations related to #2139.
Hi @menshikh-iv, it seems that the hidden vectors are still bad. I'm using the gensim.models.fasttext.load_facebook_model function to load the .bin file, but syn1 fails to load. Also, trainables.syn1neg is full of zeros.
Hi @aviclu, please post more information.
@aviclu Please open a new ticket and be sure to fill in the template.