Gensim's Doc2Vec model cannot be loaded after saving.
I trained my Doc2Vec model using the code provided below, saved it on the 9th epoch, and then, when I try to load it, I receive an error (listing added).
No problems occurred during training.
Training was executed on the same machine on which I'm now trying to load the model.
Also, please advise how to save this model so I can continue using it after training.
```python
import multiprocessing
import os

from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.doc2vec import Doc2Vec


class EpochSaver(CallbackAny2Vec):
    '''Callback to save the model after each epoch and show training parameters.'''

    def __init__(self, savedir):
        self.savedir = savedir
        self.epoch = 0
        os.makedirs(self.savedir, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = os.path.join(self.savedir, "model_neg{}_epoch.gz".format(self.epoch))
        model.save(savepath)
        print(
            "Epoch saved: {}".format(self.epoch + 1),
            "Start next epoch ... ", sep="\n"
        )
        previous = os.path.join(self.savedir, "model_neg{}_epoch.gz".format(self.epoch - 1))
        if os.path.isfile(previous):
            print("Previous model deleted ")
            os.remove(previous)
        self.epoch += 1


def train():
    workers = multiprocessing.cpu_count() // 2  # workers must be an int
    model = Doc2Vec(
        DocIter(),  # user-defined corpus iterator
        vector_size=700, alpha=0.03, min_alpha=0.00025, epochs=10,  # note: vector_size, not vec_size
        min_count=10, dm=1, hs=0, negative=10, workers=workers,
        window=20, callbacks=[EpochSaver("./checkpoints")]
    )
    return model
```
Loading the trained model:
```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("checkpoints/model_neg9_epoch.gz")
```
Log output from loading the model:
```
INFO:gensim.utils:loading Doc2Vec object from checkpoints/model_neg9_epoch.gz
INFO:gensim.models.doc2vec:Model saved using code from earlier Gensim Version. Re-loading old model in a compatible way.
INFO:gensim.models.deprecated.old_saveload:loading Doc2Vec object from checkpoints/model_neg9_epoch.gz
Traceback (most recent call last):
  File "/Users/user/Python/dynamic_topics/env/lib/python3.5/site-packages/gensim/models/doc2vec.py", line 689, in load
    return super(Doc2Vec, cls).load(*args, **kwargs)
  File "/Users/user/Python/dynamic_topics/env/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 629, in load
    model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
  File "/Users/user/Python/dynamic_topics/env/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 278, in load
    return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
  File "/Users/user/Python/dynamic_topics/env/lib/python3.5/site-packages/gensim/utils.py", line 425, in load
    obj = unpickle(fname)
  File "/Users/user/Python/dynamic_topics/env/lib/python3.5/site-packages/gensim/utils.py", line 1332, in unpickle
    return _pickle.load(f, encoding='latin1')
AttributeError: Can't get attribute 'EpochSaver' on <module '__main__' ...>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cluster_d2v.py", line 71, in <module>
    model = Doc2Vec.load("checkpoints/model_neg9_epoch.gz")
  File "/Users/user/Python/dynamic_topics/env/lib/python3.5/site-packages/gensim/models/doc2vec.py", line 693, in load
    return load_old_doc2vec(*args, **kwargs)
  File "/Users/user/Python/dynamic_topics/env/lib/python3.5/site-packages/gensim/models/deprecated/doc2vec.py", line 84, in load_old_doc2vec
    old_model = Doc2Vec.load(*args, **kwargs)
  File "/Users/user/Python/dynamic_topics/env/lib/python3.5/site-packages/gensim/models/deprecated/word2vec.py", line 1616, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/Users/user/Python/dynamic_topics/env/lib/python3.5/site-packages/gensim/models/deprecated/old_saveload.py", line 87, in load
    obj = unpickle(fname)
  File "/Users/user/Python/dynamic_topics/env/lib/python3.5/site-packages/gensim/models/deprecated/old_saveload.py", line 380, in unpickle
    return _pickle.loads(file_bytes, encoding='latin1')
AttributeError: Can't get attribute 'EpochSaver' on <module '__main__' ...>
```
```
Darwin-17.6.0-x86_64-i386-64bit
Python 3.5.4 (v3.5.4:3f56838976, Aug 7 2017, 12:56:33)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.4.0
FAST_VERSION 0
```
Thanks for reporting. That does look like a bug.
@gojomo @menshikh-iv @manneshiva any idea? To me it looks like the callback got serialized along with the model. Not sure if that's what we want.
@daridar can you try importing your EpochSaver from a module? (not just the main script, __main__).
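(A minimal sketch of this suggestion, assuming a hypothetical module epoch_saver.py that holds the class: when the training script imports EpochSaver from there, the pickle records the importable path 'epoch_saver.EpochSaver' instead of '__main__.EpochSaver', so loading can resolve it.)
```python
# train.py
from gensim.models.doc2vec import Doc2Vec
from epoch_saver import EpochSaver  # hypothetical module holding the class

model = Doc2Vec(DocIter(), vector_size=700, epochs=10,
                callbacks=[EpochSaver("./checkpoints")])

# load.py -- pickle imports epoch_saver automatically if it is on sys.path,
# so no explicit import of EpochSaver is needed here:
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("checkpoints/model_neg9_epoch.gz")
```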
Not sure this is necessarily a functionality bug; that deserialization fails when a class is unavailable is a known limitation of pickle. It may be a documentation or error-handling insufficiency.
Perhaps the docs for the callbacks property or the CallbackAny2Vec superclass could be clearer. Perhaps the risk could be caught/warned/errored on save(), when the attached classes look serialization-fragile. Perhaps the error on load() could be clearer, by catching the representative AttributeErrors and re-throwing a more descriptive error.
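(For that last point, a sketch of what such a clearer error could look like as a user-side wrapper today; load_doc2vec is a hypothetical helper, not a gensim API.)
```python
from gensim.models.doc2vec import Doc2Vec


def load_doc2vec(path):
    # Translate the bare pickle AttributeError into an actionable message.
    try:
        return Doc2Vec.load(path)
    except AttributeError as err:
        raise AttributeError(
            "{}. The model was probably saved with a custom callback class; "
            "make that class importable (from the module it was defined in, "
            "not __main__) before calling load().".format(err)
        ) from err
```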
Hello, @piskvorky and @gojomo! Thank you for your kind advice.
I tried importing EpochSaver from the file where I created it into the file where I'm loading the model, and it helped!
Thank you, now everything works.
@daridar good to hear :) Thanks for confirming.
@gojomo @menshikh-iv Do we need to serialize the callbacks? Are they useful outside of training?
Now that I think about it, we probably do. Even here in @daridar 's use-case, if we don't serialize callbacks, a loaded "checkpoint" model wouldn't support any more check-pointing, which is potentially surprising (not good).
Let me close here as "won't fix" but a general documentation / error message update sounds like a good idea. @daridar what kind of documentation would have helped you? For example, did you read the documentation on models.Callbacks in full, would a clear warning there, somewhere near the top, have saved you the trouble? Or a better exception message?
@piskvorky I think a better exception message would have helped!
From this error, it was not obvious to me that EpochSaver() needs to be imported from the module where it was created :) And of course, it would be better if the documentation included some information about the correct usage of a saved model when callbacks are used.
Thank you!
No problem. Can you open a PR that improves the docs?
Plus, if you're feeling a little more adventurous, a PR for the better exception message too :)
> Do we need to serialize the callbacks?
IMO yes, we do (for the same reason you suggest); updating the documentation with a note about this serialization behavior is a good idea, I think.
@piskvorky Hello! Was it addressed to me?
> No problem. Can you open a PR that improves the docs?
> Plus, if you're feeling a little more adventurous, a PR for the better exception message too :)
If it's for me, sorry, I missed it.
Yes, for you @daridar :)
Docstring / message fixes may seem insignificant, but they actually improve Gensim from an important perspective we (core devs) don't have -- the "newcomer user" perspective.
I would prefer if it were possible to load models without importing the classes/modules that were used to save the model in the first place. I know that would be a major refactoring. Anyhow: 1) I do not like importing modules that are not used; 2) I serve my prediction step on an AWS Lambda (in production and for tests), train my model on an EC2 instance, and therefore save my model in an S3 bucket. It is already problematic to fit the gensim doc2vec implementation into an AWS Lambda function, mainly due to the Lambda function size limitation (scipy is the big problem, not gensim itself), so importing unnecessary modules makes my life more complicated, because the import increases the Lambda function size unless I refactor :-)
This is not a complaint; I really appreciate and enjoy using Gensim. Best, Jens
Hi @JensMadsen, what exactly do you mean by "load models without loading classes/modules used to save the model in the first place"?
If you need to load something, in gensim this typically looks like:
```python
from gensim.models import X

my_model = X.load(...)
```
Here my_model is an instance of class X, so X is used directly.
Please describe your idea in more details, we are open to discussion :)
@menshikh-iv Sorry about being unclear :-)
Essentially my code is:
train_model.py:
```python
import logging
import multiprocessing

import gensim.models.doc2vec

import epoch_saver

logger = logging.getLogger(__name__)


def train_model(documents, args):
    assert gensim.models.doc2vec.FAST_VERSION > -1
    if args.workers is None:
        args.workers = multiprocessing.cpu_count()
    logger.info('using {} workers'.format(args.workers))
    args_as_dict = vars(args)
    saver = epoch_saver.EpochSaver(args)  # renamed to avoid shadowing the module
    model = gensim.models.doc2vec.Doc2Vec(documents=documents,
                                          callbacks=[saver],
                                          **args_as_dict)
    return model
```
epoch_saver.py:
```python
import gensim.models.callbacks


class EpochSaver(gensim.models.callbacks.CallbackAny2Vec):
    def __init__(self, args):
        self.save_every_n_epoch = args.save_every_n_epoch
        self.args = args
        self.epoch = 0

    def on_epoch_end(self, model):
        if self.save_every_n_epoch is None:
            pass
        elif (self.epoch + 1) % self.save_every_n_epoch == 0:
            # save and upload to S3
            save_trained_model(model, self.args, epoch=self.epoch)
        self.epoch += 1
```
In order to load the model, pickle needs access to epoch_saver.py, but epoch_saver.py is otherwise not in use. To infer vectors I therefore have to import epoch_saver in infer_vector.py, which looks something like:
```python
import logging

from gensim.utils import simple_preprocess

import epoch_saver  # unused here, but pickle needs it to resolve EpochSaver


def infer_docs(input_string, model_path=None, inferred_docs=10, steps=10):
    # when saved using a callback object this fails if pickle has no access
    # to the class, i.e. I must import epoch_saver
    model = load_model(model_path)  # local helper that loads the saved model
    logging.info('do simple preprocessing of input string')
    processed_str = simple_preprocess(input_string, min_len=2, max_len=35)
    logging.info('input string\n{}\nprocessed string{}\n'.format(
        input_string, processed_str))
    logging.info('infer vectors of input text')
    inferred_vector = model.infer_vector(processed_str, steps=steps)
    logging.info('calculating most similar docs')
    return model.docvecs.most_similar([inferred_vector], topn=inferred_docs)
```
That was my only point: importing epoch_saver.py in infer_vector.py without really using it there :-) Since doc2vec has something like 4 levels of inheritance, and since pickle is such a nice tool, I know that doing it a different way is very cumbersome :-)
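(A workaround that avoids the unused import altogether, assuming the callbacks attribute that gensim 3.x stores on the model: clear it before saving, so the pickle carries no reference to the custom class.)
```python
# in train_model.py, after training finishes:
model.callbacks = ()           # drop references to custom callback classes
model.save('doc2vec.model')    # the saved file now unpickles without epoch_saver
```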
@JensMadsen aha, I understand what you mean, thanks for the clarification!
In this case, it's probably a good idea to always save the model without callbacks (to avoid any pickle issues). @piskvorky @gojomo wdyt?
@menshikh-iv but then @daridar 's use case won't work — see https://github.com/RaRe-Technologies/gensim/issues/2136#issuecomment-407339864 .
I see two options:
1) Don't serialize callbacks. Pros: No need to have the callback classes named and importable.
2) Serialize callbacks. Pros: can load models and continue training, without specifying callbacks again.
I have no strong preference, either works. Sounds like users prefer 1). But no matter which we choose, we have to make the documentation and error messages clearer around this point.
The variant of (1) I'd prefer is to not even allow callbacks to be specified in the initializer, and thus not store them on a model property. They're plausibly an advanced feature that users can supply each time they call train() (which is already supported); see the sketch below.
(They're also essentially a green/experimental feature that came in with only slight review/testing in the "big any2vec refactor".)
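(A sketch of that pattern with the existing API, where train() already accepts callbacks; corpus stands in for any iterable of TaggedDocument.)
```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=700, min_count=10, dm=1, hs=0,
                negative=10, window=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=10,
            callbacks=[EpochSaver("./checkpoints")])  # supplied per call
```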
Training models directly with one-liners like model = Word2Vec(corpus, …) is still supported and widely used.
Not exposing "advanced features / parameters" is in __init__ is an interesting idea, but currently we have no "basic / advanced parameter" API contract in Gensim. Typically __init__ accepts everything that train does, and just proxies those parameters. So I'm worried such basic/advanced choices will be arbitrary and perhaps confusing. Or how would that contract look like?
But, the instantiate-and-train-in-1-line would still work. There'd just be fancy stuff you couldn't do that way - as is already the case with lots of things the model can do, if you break out the steps.
Or, __init__() could as a general matter always accept the same params as train(), and pass them through to the train() call it (sometimes) makes when it's following the special "instantiate-and-train-in-1-line" code path. But it'd still not store them in the model for future reuse, because caching them for future train()s (which are themselves an ornery option best used only by advanced users) multiplies the stateful complexity and gives rise to problems like this. Even things like starting alpha and min_alpha are sketchy to cache, as it's far from automatically reasonable that each repeated train() should also repeat the same high-to-low learning-rate management.
More generally, though I've seen this combined init-and-do pattern popular in libraries that try to win on "how much you can do in 1 line" examples (like, say, requests), I'm not a fan of rolling the training into the initialization. To me it violates the Pythonic "There should be one-- and preferably only one --obvious way to do it" principle. For the amount of work that's being done (sometimes hours per step) and the number of options to support, 3-4 lines of code that make the discrete steps explicit and individually well-documented can be better than 1 line with an encyclopedia-sized list of parameters, which can also affect each other's interpretation in hard-to-document ways.
I agree with all of that. If we go with option 1), we don't store callbacks inside the model.
Question is, is this a pattern / policy we want to follow more generally? Do we start splitting out some training parameters into train-only vs train-and-init? Which ones? Or all of them? Or decide ad-hoc?
To be clear, the fix for this PR will likely be fairly trivial; there's no major issue here. But it sounds like we've hit a wider consideration, and this is as good a place as any to hash out the principles.
What about going the keras route and manually specify the callbacks during load?:
https://github.com/keras-team/keras/issues/5916#issuecomment-300038263
> I solved this problem by adding 'custom_objects':
> ```python
> model = load_model('model/multi_task/try.h5', custom_objects={'loss_max': loss_max})
> ```
> My loss function:
> ```python
> def loss_max(y_true, y_pred):
>     from keras import backend as K
>     return K.max(K.abs(y_pred - y_true), axis=-1)
> ```
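(A gensim analogue would presumably look like the following. Note that custom_objects is NOT an existing gensim parameter; this is only an illustration of the proposal.)
```python
# Hypothetical API, mirroring Keras -- does not exist in gensim today:
model = Doc2Vec.load("checkpoints/model_neg9_epoch.gz",
                     custom_objects={"EpochSaver": EpochSaver})
```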
I'm having the same weird behavior as JensMadsen, but importing the custom callbacks inside my class does not work.
```python
from document_vectors import DocumentVectorizer

vectorizer = DocumentVectorizer('....model')
```
Gives me:
File "...lib/python3.6/site-packages/gensim/models/deprecated/old_saveload.py", line 380, in unpickle
return _pickle.loads(file_bytes, encoding='latin1')
AttributeError: Can't get attribute 'EpochLogger' on
My document_vectors.py:
```python
from typing import List
from gensim.models.doc2vec import Doc2Vec

class DocumentVectorizer:
    from model.doc2vec_train import EpochLogger, EpochSaver

    def __init__(self, model_path):
        self.model = Doc2Vec.load(model_path)

    def vectorize_document(self, text: List[str]):
        """Use the trained Doc2Vec model to vectorize a document"""
        return self.model.infer_vector(text)
```
It's using old_saveload.py even though the model was saved with the same gensim version.
However, this works:
```python
from model.doc2vec_train import EpochLogger, EpochSaver
from document_vectors import DocumentVectorizer

vectorizer = DocumentVectorizer('...')
```
when I import the custom callbacks at the root level of the REPL. The import inside the class should force it to load the callbacks before Doc2Vec.load, right? I've tried putting the import inside __init__ as well, and that does not work.
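(A plausible explanation: pickle resolves the callback class via the module path recorded at save time, which is often __main__ when training ran as a script. An import at class scope binds the names inside DocumentVectorizer, not on __main__, so unpickling still fails. A sketch of a workaround under that assumption:)
```python
# If the saved pickle references __main__.EpochLogger, the names must exist
# as attributes of __main__ at load time:
import __main__
from model.doc2vec_train import EpochLogger, EpochSaver

__main__.EpochLogger = EpochLogger
__main__.EpochSaver = EpochSaver

from document_vectors import DocumentVectorizer
vectorizer = DocumentVectorizer('....model')
```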
Any updates here? I just stumbled upon this problem. I don't think callbacks should be serialized with the model; it makes it difficult to distribute models for research purposes. I think the user should at least be given an option before they are serialized.
That makes sense to me (to not serialize callbacks), but I'm just a bystander, never used those gensim workflows myself. @mpenkov thoughts?
@manrajgrover @kramer425 can you open a PR that implements that?
I also think callbacks shouldn't be serialized. We should primarily be serializing data like models, coefficients, etc.
I don't think an additional option to serialize (or not serialize) callbacks is the way to go. I think we should just stop serializing them, in a backwards compatible way.
Before we commit to that, though, can anyone come up with a use case that requires serializing callbacks? @menshikh-iv
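(One hypothetical shape for "stop serializing them, in a backwards compatible way": drop callbacks when pickling, and tolerate their absence when unpickling. A sketch only, not the actual change.)
```python
from gensim import utils


class BaseAny2VecModel(utils.SaveLoad):
    def __getstate__(self):
        state = self.__dict__.copy()
        state['callbacks'] = ()  # never persist user callback objects
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        if not hasattr(self, 'callbacks'):  # older files may lack the attribute
            self.callbacks = ()
```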
> I don't think an additional option to serialize (or not serialize) callbacks is the way to go. I think we should just stop serializing them, in a backwards compatible way.
I agree
> can anyone come up with a use case that requires serializing callbacks?
For example, storing metric values across a train -> save -> load cycle; but that can be done by extracting these values before the save call. I'm still +1 for dropping it.
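(A sketch of that extraction pattern; EpochLossLogger is a hypothetical callback that accumulates per-epoch losses in a .losses list.)
```python
loss_logger = EpochLossLogger()  # hypothetical metric-collecting callback
model.train(corpus, total_examples=model.corpus_count, epochs=10,
            callbacks=[loss_logger])
losses = list(loss_logger.losses)  # pull the metric values off before saving
model.callbacks = ()               # nothing custom left in the pickle
model.save("doc2vec.model")
```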
Another user bit by this "feature"/bug:
https://groups.google.com/forum/#!topic/gensim/3nktOQStlbU
FYI, #2698 includes the policy discussed above of not-caching callbacks – they're only effective for the __init__() or train() call where they are provided, to avoid serialization complications.