Gensim: [Feature request] Load full native fastText model to continue training on new data

Created on 24 Aug 2018 · 5 comments · Source: RaRe-Technologies/gensim

Currently gensim cannot load a native fastText model and continue training it. According to the docs [1], this is because gensim only loads the input-hidden matrix. However, fastText also saves the hidden-output matrix [2].

Moreover, even the input-hidden matrix alone could support some sort of transfer learning, with the hidden-output matrix initialized randomly, similar to how gensim.models.Word2Vec.intersect_word2vec_format() works.

Please correct me if I'm wrong, but I don't see any technical issue preventing us from loading a fastText model and continuing training. How about supporting this feature?
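The intersect-style fallback described above can be sketched generically, without gensim: copy pretrained input vectors for words present in both vocabularies, and initialize the output matrix from scratch (the request suggests random values; the exact scheme here is illustrative, and all names are hypothetical rather than gensim's API):

```python
import random

def intersect_pretrained(vocab, vectors, pretrained):
    """Overwrite input vectors for words that also appear in the
    pretrained vocabulary; leave other words untouched.
    Returns the number of replaced vectors."""
    replaced = 0
    for word in vocab:
        if word in pretrained:
            vectors[word] = list(pretrained[word])
            replaced += 1
    return replaced

def random_output_matrix(vocab_size, dim, seed=0):
    """Hidden->output weights drawn uniformly from [-0.5/dim, 0.5/dim].
    (For a fresh model gensim actually zero-initializes this layer;
    either choice works for the sketch.)"""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5 / dim, 0.5 / dim) for _ in range(dim)]
            for _ in range(vocab_size)]

# Toy demo: two of the three vocabulary words get pretrained vectors.
vocab = ["cat", "dog", "fish"]
vectors = {w: [0.0, 0.0] for w in vocab}
pretrained = {"cat": [1.0, 2.0], "dog": [3.0, 4.0], "bird": [5.0, 6.0]}
print(intersect_pretrained(vocab, vectors, pretrained))  # 2
syn1neg = random_output_matrix(len(vocab), dim=2)
```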

Labels: bug, difficulty medium, fasttext

All 5 comments

@tranhungnghiep thanks for the request. As I remember, FB distributes 2 types of models:

  • vectors only: a plain-text .vec file (i.e. no ngrams, just one matrix for words). To load this, use KeyedVectors.load_word2vec_format
  • full binary model: a .bin file. FastText.load_fasttext_format should be used to get ngrams and to continue the training process

I think this is a bug in the current implementation (this should already work):

from gensim.models import FastText
from gensim.test.utils import common_texts


m = FastText.load_fasttext_format("wiki.ru.bin")  # load wiki FB model from https://fasttext.cc/docs/en/pretrained-vectors.html
m.build_vocab(common_texts, update=True)  # this doesn't work, but should. See also https://github.com/RaRe-Technologies/gensim/issues/2139 
"""
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in build_vocab(self, sentences, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    480         return super(FastText, self).build_vocab(
    481             sentences, update=update, progress_per=progress_per,
--> 482             keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
    483 
    484     def _set_train_params(self, **kwargs):

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/base_any2vec.pyc in build_vocab(self, sentences, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    805             trim_rule=trim_rule, **kwargs)
    806         report_values['memory'] = self.estimate_memory(vocab_size=report_values['num_retained_words'])
--> 807         self.trainables.prepare_weights(self.hs, self.negative, self.wv, update=update, vocabulary=self.vocabulary)
    808 
    809     def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in prepare_weights(self, hs, negative, wv, update, vocabulary)
    932 
    933     def prepare_weights(self, hs, negative, wv, update=False, vocabulary=None):
--> 934         super(FastTextTrainables, self).prepare_weights(hs, negative, wv, update=update, vocabulary=vocabulary)
    935         self.init_ngrams_weights(wv, update=update, vocabulary=vocabulary)
    936 

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/word2vec.pyc in prepare_weights(self, hs, negative, wv, update, vocabulary)
   1744             self.reset_weights(hs, negative, wv)
   1745         else:
-> 1746             self.update_weights(hs, negative, wv)
   1747 
   1748     def seeded_vector(self, seed_string, vector_size):

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/word2vec.pyc in update_weights(self, hs, negative, wv)
   1791             self.syn1 = vstack([self.syn1, zeros((gained_vocab, self.layer1_size), dtype=REAL)])
   1792         if negative:
-> 1793             self.syn1neg = vstack([self.syn1neg, zeros((gained_vocab, self.layer1_size), dtype=REAL)])
   1794         wv.vectors_norm = None
   1795 

AttributeError: 'FastTextTrainables' object has no attribute 'syn1neg'
"""

m.train(common_texts, epochs=1, total_examples=len(common_texts))
"""
Exception in thread Thread-17:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 164, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
  File "/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.py", line 555, in _do_train_job
    tally += train_batch_sg(self, sentences, alpha, work, neu1)
  File "gensim/models/fasttext_inner.pyx", line 276, in gensim.models.fasttext_inner.train_batch_sg
    cdef REAL_t *word_locks_vocab = <REAL_t *>(np.PyArray_DATA(model.trainables.vectors_vocab_lockf))
AttributeError: 'FastTextTrainables' object has no attribute 'vectors_vocab_lockf'
"""

Of course, I'm +1 for fixing this issue -> training will work as @tranhungnghiep suggests.

Related issue - #2139
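Both AttributeErrors above come down to state that reset_weights()/prepare_weights() would create for a fresh model but that load_fasttext_format() never creates for a loaded one. A minimal numpy sketch of the missing arrays (attribute names mirror gensim 3.x internals; shapes and sizes are illustrative, not a patch):

```python
import numpy as np

REAL = np.float32
layer1_size = 100      # vector dimensionality (illustrative)
loaded_vocab = 2000    # words in the loaded native model
gained_vocab = 12      # new words added by build_vocab(update=True)
ngram_buckets = 2000000

# 1) The negative-sampling output matrix that update_weights() tries
#    to grow with vstack; loading a .bin should create it (or, better,
#    read the real hidden-output matrix from the file):
syn1neg = np.zeros((loaded_vocab, layer1_size), dtype=REAL)
syn1neg = np.vstack([syn1neg, np.zeros((gained_vocab, layer1_size), dtype=REAL)])

# 2) Per-vector learning-rate locks that the Cython training kernel
#    dereferences; 1.0 means "fully trainable":
vectors_vocab_lockf = np.ones(loaded_vocab + gained_vocab, dtype=REAL)
vectors_ngrams_lockf = np.ones(ngram_buckets, dtype=REAL)

print(syn1neg.shape)  # (2012, 100)
```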

@menshikh-iv Thanks for looking into it.

This issue is a lower-level problem: FastText.load_fasttext_format() currently does not load the hidden-output matrix. After loading, we may also need some checks and initializations related to #2139.

Hi @menshikh-iv, it seems that the hidden vectors are still broken. I'm using the gensim.models.fasttext.load_facebook_model function to load the .bin file, but syn1 fails to load. Also, trainables.syn1neg is full of zeros.
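A quick way to confirm this symptom (hypothetical snippet; the array below stands in for a loaded model's trainables.syn1neg): an output matrix that is entirely zero was never read from the .bin.

```python
import numpy as np

# Stand-in for trainables.syn1neg after loading a native model;
# in the report above the real array is all zeros.
syn1neg = np.zeros((5, 4), dtype=np.float32)
print("loaded" if syn1neg.any() else "all zeros -> not loaded")
```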

Hi @aviclu, please post more information:

  • reproducible code example
  • model file
  • stacktrace

@aviclu Please open a new ticket and be sure to fill in the template.
