This is the callback in Trainer():

```python
trainer = pl.Trainer(
    callbacks=[ModelCheckpoint(
        monitor='val_loss',
        filepath=os.path.join(hparams.default_root_dir,
                              '{epoch}-{val_loss:.2f}-{test_acc:.2f}'),
        verbose=True)],
    ...
```
But the app crashes on the first epoch with the following error:
Exception has occurred: ValueError
.save_function() not set
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 133, in _save_model
raise ValueError(".save_function() not set")
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 240, in _do_check_save
self._save_model(filepath)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 208, in on_validation_end
self._do_check_save(filepath, current, epoch)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 63, in on_validation_end
callback.on_validation_end(self, self.get_model())
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 792, in call_checkpoint_callback
self.on_validation_end()
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 477, in run_training_epoch
self.call_checkpoint_callback()
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 363, in train
self.run_training_epoch()
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 865, in run_pretrain_routine
self.train()
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 477, in single_gpu_train
self.run_pretrain_routine(model)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 705, in fit
self.single_gpu_train(model)
File "/home/AAA/PycharmProjects/DL2020LiorWolf/train.py", line 110, in main_train
trainer.fit(model)
File "/home/AAA/PycharmProjects/DL2020LiorWolf/train.py", line 40, in main
main_train(model_class_pointer, hyperparams, logger)
File "/home/AAA/PycharmProjects/DL2020LiorWolf/train.py", line 118, in <module>
main()
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/AAA/anaconda3/envs/BBB/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
From the docs, the model_checkpoint module seems to be "plug-and-play". Do I need to implement something else?
Actually, going through the source code, it seems that save_function is never set.
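For reference, the guard that raises here boils down to something like the following. This is a stand-in sketch to illustrate the failure mode, not the real Lightning code: the Trainer is supposed to assign `save_function` on the checkpoint callback it knows about, so a callback it never wires up raises as soon as it tries to save.

```python
# Stand-in sketch (NOT the real Lightning implementation) of how
# ModelCheckpoint guards its save_function attribute before saving.
class ModelCheckpointSketch:
    def __init__(self):
        # The Trainer is expected to assign this attribute; if the callback
        # is only in the generic callbacks list, nothing ever sets it.
        self.save_function = None

    def _save_model(self, filepath):
        if self.save_function is not None:
            self.save_function(filepath)
        else:
            raise ValueError(".save_function() not set")


cb = ModelCheckpointSketch()
try:
    cb._save_model("epoch=0.ckpt")
except ValueError as err:
    print(err)  # -> .save_function() not set
```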
Yeah, checkpointing should be plug-and-play (it is already enabled by default even if you don't add a callback). Maybe this is a bug?
Try it without the callback first; it will save checkpoints automatically. Then, if it still breaks, we can look into what happened.
Thanks for the fast reply!
Yeah, it works without the callback.
I'm also using the Comet logger; might that be the issue? You mentioned PL stores checkpoints for me automatically, but under which policy? The default parameters of this callback? (Which metric does it track?)
Thanks!
It auto-tracks val_loss (yes, with the default params).
This error happens if a ModelCheckpoint instance is passed to the `callbacks` argument and not the `checkpoint_callback` argument of Trainer. Maybe an error should be thrown in this case, something like:

```python
for callback in self.callbacks:
    if isinstance(callback, ModelCheckpoint):
        raise MisconfigurationException(
            'You passed a ModelCheckpoint to argument `callbacks`; '
            'it should instead be passed to argument `checkpoint_callback`.')
```

A similar check should probably be implemented for EarlyStopping, which also has its own trainer argument.
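A self-contained sketch of such a guard is below. The class names stand in for the real Lightning ones (`MisconfigurationException` does exist in Lightning, but everything here is illustrative only), and the `early_stop_callback` argument name is my assumption based on the 0.7-era Trainer API.

```python
class ModelCheckpoint:  # stand-in for pytorch_lightning.callbacks.ModelCheckpoint
    pass


class EarlyStopping:  # stand-in for pytorch_lightning.callbacks.EarlyStopping
    pass


class MisconfigurationException(Exception):
    pass


def validate_callbacks(callbacks):
    """Reject callbacks that have their own dedicated Trainer argument."""
    for callback in callbacks:
        if isinstance(callback, ModelCheckpoint):
            raise MisconfigurationException(
                "You passed a ModelCheckpoint to argument `callbacks`; "
                "pass it to argument `checkpoint_callback` instead.")
        if isinstance(callback, EarlyStopping):
            raise MisconfigurationException(
                "You passed an EarlyStopping to argument `callbacks`; "
                "pass it to argument `early_stop_callback` instead.")


validate_callbacks([object()])  # unrelated callbacks pass through silently
try:
    validate_callbacks([ModelCheckpoint()])
except MisconfigurationException as err:
    print(err)
```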
Given the comment of @SkafteNicki, I find the saving-and-loading-weights tutorial very misleading. In particular:

```python
# 3. Init ModelCheckpoint callback, monitoring 'val_loss'
checkpoint_callback = ModelCheckpoint(monitor='val_loss')
# 4. Add your callback to the callbacks list
trainer = Trainer(callbacks=[checkpoint_callback])
```