Gensim: FastText incremental training fails

Created on 24 Jul 2018 · 13 comments · Source: RaRe-Technologies/gensim

Description

Having successfully trained a model (with 20 epochs), saved it, and loaded it back without any problems, I'm trying to continue training it for another 10 epochs, on the same data and with the same parameters, but it fails with an error: TypeError: 'NoneType' object is not subscriptable (for the full traceback, see below).

Steps/Code/Corpus to Reproduce

from gensim.models.fasttext import FastText

# `train_data` is a list of lists of strings (tokenized sentences), e.g.
# [['w1', 'w2', 'w3', ...], ['w1', 'w4', 'w5', ...], ...]
model = FastText(
    train_data,
    sg=1,
    size=200,
    window=5,
    min_count=1,
    workers=16,
    negative=20,
    iter=20,
    min_n=3,
    max_n=5,
    word_ngrams=1,
    bucket=int(2e6)
)

# `model_file` is a string with the path to the file where the model is saved
model.save(model_file)

model = FastText.load(model_file)

# `train_data' here is exactly the same as before
model.train(train_data, epochs=10, total_examples=model.corpus_count)

Expected Results

Successfully trained model.

Actual Results

[WARNING 2018-07-23 14:42:00,222] Effective 'alpha' higher than previous training cycles
[INFO 2018-07-23 14:42:00,222] training model with 16 workers on 15145 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=20 window=5
Exception in thread Thread-50:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/env/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 99, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
  File "/home/ubuntu/env/lib/python3.5/site-packages/gensim/models/fasttext.py", line 454, in _do_train_job
    tally += train_batch_sg(self, sentences, alpha, work, neu1)
  File "gensim/models/fasttext_inner.pyx", line 319, in gensim.models.fasttext_inner.train_batch_sg
TypeError: 'NoneType' object is not subscriptable

Versions

Linux-4.15.0-1014-gcp-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.4.0
FAST_VERSION 1

Labels: bug, difficulty medium, fasttext


All 13 comments

@manneshiva @gojomo @menshikh-iv any idea? That use-case sounds like something we definitely want to (should) support.

Yes, that seems like something that should work (even if it might be tricky to get working well), and I'd guess the issue is that something which isn't being serialized isn't being rebuilt after re-load.

I'm having trouble compiling this code. Does anyone know what the problem might be?

@ntonyproduction Which code, what problem, what have you tried so far, what does it have to do with this issue?

@xor-xor thanks for the report, problem reproduced with gensim==3.5.0

from gensim.models import FastText
from gensim.test.utils import common_texts, get_tmpfile

model = FastText(
    common_texts,
    sg=1,
    size=200,
    window=5,
    min_count=1,
    workers=16,
    negative=20,
    iter=20,
    min_n=3,
    max_n=5,
    word_ngrams=1,
    bucket=int(2e6)
)


path = get_tmpfile("fasttext.model")
model.save(path)

loaded_model = FastText.load(path)
loaded_model.train(common_texts, epochs=10, total_examples=model.corpus_count)

but if we call build_vocab(update=True) after loading (before the second training), everything works correctly:

loaded_model.build_vocab(common_texts, update=True) 

Anyway, we need to investigate this behavior, because continued training should work without an additional build_vocab call.

I can confirm this bug; it's even worse:

model_gensim = FT_gensim(size=100) 
model_gensim.build_vocab(lee_data) 
model_gensim.train(lee_data, total_examples=model_gensim.corpus_count, epochs=model_gensim.iter)

# --> same error
model_gensim.train(lee_data, total_examples=model_gensim.corpus_count, epochs=model_gensim.iter)

It seems you have to call build_vocab before EVERY train call, whether the model was loaded from disk or not.

I wonder whether I should call build_vocab with the actual corpus or just with something like ['foo']; the latter would probably be better, so as not to alter any vocabulary frequencies.

dataset1 = models.Doc2Vec.load("dataset1.model")  # trained on 1000 sentences
print(len(dataset1.docvecs.vectors_docs))  # 1000 document vectors
dataset1.build_vocab(new_sentences, update=True)  # 500 new sentences
print(len(dataset1.docvecs.vectors_docs))  # still 1000 document vectors
dataset1.train(new_sentences, total_examples=dataset1.corpus_count, epochs=100)
print(len(dataset1.docvecs.vectors_docs))  # still 1000 document vectors

Question: all three print calls give the same result.
Please help me, thank you.

I'm stuck on this problem. The issue seems to be solved; do we need to recompile gensim?

Hi,

I've been stuck on this issue for a few days now. Can someone please tell me in which gensim FastText version this fix will be available?

> I'm stuck on this problem. The issue seems to be solved; do we need to recompile gensim?

Hi,
Have you got this resolved? Please share if you found anything on this.

Thanks!


I didn't solve it, but according to #2215 it seems to have been fixed already. My question was about whether a gensim update is needed for that.

Fixed by #2313

> My question was about the need of a gensim update for that.

@aakash086 @MorenoLaQuatra This will be available in gensim==3.7.0 (end of January).
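Since the fix ships in a specific release, one way to decide at runtime whether the build_vocab(update=True) workaround is still needed is a simple version comparison. A minimal sketch; the helper names are hypothetical, the parser handles plain X.Y.Z strings only, and in real code you would pass gensim.__version__:

```python
def version_tuple(v):
    """Parse a plain dotted version string like '3.6.0' into a
    comparable tuple of ints (no pre-release suffix handling)."""
    return tuple(int(part) for part in v.split(".")[:3])

def needs_vocab_workaround(gensim_version):
    """True if this gensim version predates the 3.7.0 release that
    fixed FastText incremental training (PR #2313)."""
    return version_tuple(gensim_version) < (3, 7, 0)

print(needs_vocab_workaround("3.5.0"))  # True: call build_vocab(update=True)
print(needs_vocab_workaround("3.7.0"))  # False: train() works after load()
```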
