Having successfully trained a model (20 epochs), saved it, and loaded it back without any problems, I'm trying to continue training it for another 10 epochs on the same data with the same parameters, but it fails with an error: TypeError: 'NoneType' object is not subscriptable (full traceback below).
from gensim.models.fasttext import FastText
# `train_data' is just a list of lists of strings (words), e.g.
# `[['w1', 'w2', 'w3', ...], ['w1', 'w4', 'w5', ...], ...]'.
model = FastText(
    train_data,
    sg=1,
    size=200,
    window=5,
    min_count=1,
    workers=16,
    negative=20,
    iter=20,
    min_n=3,
    max_n=5,
    word_ngrams=1,
    bucket=int(2e6)
)
# `model_file' is a string with the path to the file where model is being saved
model.save(model_file)
model = FastText.load(model_file)
# `train_data' here is exactly the same as before
model.train(train_data, epochs=10, total_examples=model.corpus_count)
The initial training completes successfully; the second train() call produces:
[WARNING 2018-07-23 14:42:00,222] Effective 'alpha' higher than previous training cycles
[INFO 2018-07-23 14:42:00,222] training model with 16 workers on 15145 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=20 window=5
Exception in thread Thread-50:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/env/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 99, in _worker_loop
tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
File "/home/ubuntu/env/lib/python3.5/site-packages/gensim/models/fasttext.py", line 454, in _do_train_job
tally += train_batch_sg(self, sentences, alpha, work, neu1)
File "gensim/models/fasttext_inner.pyx", line 319, in gensim.models.fasttext_inner.train_batch_sg
TypeError: 'NoneType' object is not subscriptable
Linux-4.15.0-1014-gcp-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.4.0
FAST_VERSION 1
@manneshiva @gojomo @menshikh-iv any idea? That use-case sounds like something we definitely want to (should) support.
Yes, that seems like something that should work (even if it might be tricky to get working well) – and I'd guess the issue is that some piece of state that isn't being serialized isn't being rebuilt after re-load.
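As a self-contained illustration of that hypothesis (plain Python, no gensim; the class and method names here are invented for the sketch, not gensim's actual internals): when a derived lookup table is deliberately dropped during pickling to keep the saved file small, it comes back as None after loading, and any code that subscripts it fails exactly like the traceback above until the table is explicitly rebuilt.

```python
import pickle


class ToyModel:
    """Toy model whose state includes a large derived table (standing in
    for FastText's ngram structures) that is dropped when serializing."""

    def __init__(self, words):
        self.words = list(words)
        self._rebuild_table()

    def _rebuild_table(self):
        # Derived lookup table, fully recomputable from `words`.
        self.table = {w: i for i, w in enumerate(self.words)}

    def __getstate__(self):
        # Skip the derived table on save (a size optimization);
        # a correct implementation must rebuild it on load.
        state = self.__dict__.copy()
        state['table'] = None
        return state

    def train_step(self, word):
        # Subscripts the table; raises TypeError if it is still None.
        return self.table[word]


loaded = pickle.loads(pickle.dumps(ToyModel(['w1', 'w2'])))
try:
    loaded.train_step('w1')
except TypeError as e:
    print(e)  # 'NoneType' object is not subscriptable

loaded._rebuild_table()  # analogous to the build_vocab(update=True) workaround
print(loaded.train_step('w1'))  # 0
```

Calling build_vocab(update=True) after loading works around the bug for the same reason `_rebuild_table` does here: it recomputes the derived state that save/load discarded.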
I'm having trouble compiling this code. Does anyone know what the problem could be?
@ntonyproduction Which code, what problem, what have you tried so far, what does it have to do with this issue?
@xor-xor thanks for the report, problem reproduced with gensim==3.5.0
from gensim.models import FastText
from gensim.test.utils import common_texts, get_tmpfile
model = FastText(
    common_texts,
    sg=1,
    size=200,
    window=5,
    min_count=1,
    workers=16,
    negative=20,
    iter=20,
    min_n=3,
    max_n=5,
    word_ngrams=1,
    bucket=int(2e6)
)
path = get_tmpfile("fasttext.model")
model.save(path)
loaded_model = FastText.load(path)
loaded_model.train(common_texts, epochs=10, total_examples=model.corpus_count)
but if we add build_vocab(update=True) after loading (before the second training), everything works correctly:
loaded_model.build_vocab(common_texts, update=True)
Anyway, we need to investigate this behavior, because training should work without the additional build_vocab call.
I can confirm this bug; it's even worse:
model_gensim = FT_gensim(size=100)
model_gensim.build_vocab(lee_data)
model_gensim.train(lee_data, total_examples=model_gensim.corpus_count, epochs=model_gensim.iter)
# --> raises the same error
model_gensim.train(lee_data, total_examples=model_gensim.corpus_count, epochs=model_gensim.iter)
It seems you have to call build_vocab before EVERY train call, no matter whether the model was loaded or not.
I wonder if I should call build_vocab with the actual corpus or just with something like ['foo']; the latter would probably be better so as not to alter any vocabulary frequencies?
dataset1 = models.Doc2Vec.load("dataset1.model")  # trained on 1000 sentences
print(len(dataset1.docvecs.vectors_docs))  # 1000 document vectors
dataset1.build_vocab(new_sentences, update=True)  # 500 new sentences
print(len(dataset1.docvecs.vectors_docs))  # still 1000 document vectors
dataset1.train(new_sentences, total_examples=dataset1.corpus_count, epochs=100)
print(len(dataset1.docvecs.vectors_docs))  # still 1000 document vectors
Question: all three print results are the same.
Please help me, thank you.
I'm stuck on this problem. The issue seems to be solved; do we need to recompile gensim?
Hi,
I have been stuck on this issue for a few days now. Can someone please tell me in which gensim FastText version this fix will be available?
I'm stuck on this problem. The issue seems to be solved; do we need to recompile gensim?
Hi,
Have you got your query resolved? Please share if you found something on this.
Thanks!
I didn't solve it, but referring to #2215, it seems to be solved already. My question was about whether a gensim update is needed for that.
Fixed by #2313
My question was about the need of a gensim update for that.
@aakash086 @MorenoLaQuatra Will be available in gensim==3.7.0 (end of Jan)