This is the callback in Trainer():

```python
trainer = pl.Trainer(
    callbacks=[ModelCheckpoint(
        monitor='val_loss',
        filepath=os.path.join(hparams.default_root_dir,
                              '{epoch}-{val_loss:.2f}-{test_acc:.2f}'),
        verbose=True)],
    ...
```
But the app crashes on the first epoch with the following error:
Exception has occurred: ValueError
.save_function() not set
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 133, in _save_model
raise ValueError(".save_function() not set")
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 240, in _do_check_save
self._save_model(filepath)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 208, in on_validation_end
self._do_check_save(filepath, current, epoch)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 63, in on_validation_end
callback.on_validation_end(self, self.get_model())
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 792, in call_checkpoint_callback
self.on_validation_end()
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 477, in run_training_epoch
self.call_checkpoint_callback()
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 363, in train
self.run_training_epoch()
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 865, in run_pretrain_routine
self.train()
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 477, in single_gpu_train
self.run_pretrain_routine(model)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 705, in fit
self.single_gpu_train(model)
File "/home/AAA/PycharmProjects/DL2020LiorWolf/train.py", line 110, in main_train
trainer.fit(model)
File "/home/AAA/PycharmProjects/DL2020LiorWolf/train.py", line 40, in main
main_train(model_class_pointer, hyperparams, logger)
File "/home/AAA/PycharmProjects/DL2020LiorWolf/train.py", line 118, in <module>
main()
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
From the docs, the model_checkpoint module seems to be "plug-and-play". Do I need to implement something else?
Actually, going through the source code, it seems that save_function is never set.
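For reference, the guard that raises here boils down to something like the following. This is a stand-in sketch to illustrate the failure mode, not the real Lightning code: the Trainer is supposed to assign `save_function` on the checkpoint callback it knows about, so a callback it never wires up raises as soon as it tries to save.

```python
# Stand-in sketch (NOT the real Lightning implementation) of how
# ModelCheckpoint guards its save_function attribute before saving.
class ModelCheckpointSketch:
    def __init__(self):
        # The Trainer is expected to assign this attribute; if the callback
        # is only in the generic callbacks list, nothing ever sets it.
        self.save_function = None

    def _save_model(self, filepath):
        if self.save_function is not None:
            self.save_function(filepath)
        else:
            raise ValueError(".save_function() not set")


cb = ModelCheckpointSketch()
try:
    cb._save_model("epoch=0.ckpt")
except ValueError as err:
    print(err)  # -> .save_function() not set
```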
Yeah, checkpointing should be plug-and-play (it is already enabled by default even if you don't add a callback). Maybe this is a bug?
Try it without the callback first; it will save checkpoints automatically. Then, if it still breaks, we can look into what happened.
Thanks for the fast reply!
Yeah, it works without the callback.
I'm also using the Comet logger; might that be the issue? You mentioned PL stores checkpoints for me automatically, but under which policy? The default parameters of this callback? (Which metric does it track?)
Thanks!
It auto-tracks val_loss (yes, with the default params).
This error happens if a ModelCheckpoint instance is passed to the `callbacks` argument and not the `checkpoint_callback` argument of Trainer. Maybe an error should be thrown in this case, something like:

```python
for callback in self.callbacks:
    if isinstance(callback, ModelCheckpoint):
        raise MisconfigurationException(
            'You passed a ModelCheckpoint to argument `callbacks`; '
            'it should instead be passed to argument `checkpoint_callback`.')
```

A similar check should probably be implemented for EarlyStopping, which also has its own trainer argument.
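A self-contained sketch of such a guard is below. The class names stand in for the real Lightning ones (`MisconfigurationException` does exist in Lightning, but everything here is illustrative only), and the `early_stop_callback` argument name is my assumption based on the 0.7-era Trainer API.

```python
class ModelCheckpoint:  # stand-in for pytorch_lightning.callbacks.ModelCheckpoint
    pass


class EarlyStopping:  # stand-in for pytorch_lightning.callbacks.EarlyStopping
    pass


class MisconfigurationException(Exception):
    pass


def validate_callbacks(callbacks):
    """Reject callbacks that have their own dedicated Trainer argument."""
    for callback in callbacks:
        if isinstance(callback, ModelCheckpoint):
            raise MisconfigurationException(
                "You passed a ModelCheckpoint to argument `callbacks`; "
                "pass it to argument `checkpoint_callback` instead.")
        if isinstance(callback, EarlyStopping):
            raise MisconfigurationException(
                "You passed an EarlyStopping to argument `callbacks`; "
                "pass it to argument `early_stop_callback` instead.")


validate_callbacks([object()])  # unrelated callbacks pass through silently
try:
    validate_callbacks([ModelCheckpoint()])
except MisconfigurationException as err:
    print(err)
```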
Given the comment of @SkafteNicki, I find the saving-and-loading-weights tutorial very misleading. In particular:

```python
# 3. Init ModelCheckpoint callback, monitoring 'val_loss'
checkpoint_callback = ModelCheckpoint(monitor='val_loss')
# 4. Add your callback to the callbacks list
trainer = Trainer(callbacks=[checkpoint_callback])
```