Pytorch-lightning: ModelCheckpoint ignores dirpath, may result in permission denied error

Created on 22 Nov 2020  ·  7 comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I set my model checkpoint callback as follows:

    ckpt_callback = ModelCheckpoint(
        filename='/{epoch}-{val_loss:.2f}',
        monitor='val_loss',
        save_last=True,
        mode='min'
        )

OR

    ckpt_callback = ModelCheckpoint(
        filename='/{epoch}-{val_avg_rmse:.2f}',
        monitor='val_loss',
        save_top_k=k,
        mode='min'
        )

and initialize the trainer with the default dirpath (os.getcwd()).

In both cases, a new directory is created at the correct path (proj_root/exp_name/exp_id/checkpoints), yet I get a permission denied error since Lightning tries to save the checkpoint to /<file_name>.ckpt instead of proj_root/exp_name/exp_id/checkpoints/<file_name>.ckpt, ignoring the dirpath.
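A likely root cause for the "permission denied" at the filesystem root (my own observation, not taken from the Lightning source): `os.path.join` discards everything before an absolute component, so a filename template that starts with `/` silently overrides the dirpath:

```python
import os

# If the second argument is absolute (starts with '/'), os.path.join
# drops the dirpath entirely -- the path ends up at the filesystem root.
dirpath = "proj_root/exp_name/exp_id/checkpoints"
good = os.path.join(dirpath, "epoch=1-val_loss=0.50.ckpt")
bad = os.path.join(dirpath, "/epoch=1-val_loss=0.50.ckpt")

print(good)  # proj_root/exp_name/exp_id/checkpoints/epoch=1-val_loss=0.50.ckpt
print(bad)   # /epoch=1-val_loss=0.50.ckpt  (dirpath is gone)
```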

Debugging reveals inconsistent behavior (as far as I understand from the code/comments) in the following part of model_checkpoint.save_checkpoint():

https://github.com/PyTorchLightning/pytorch-lightning/blob/8601268c70649f49767001098adbf665a93843df/pytorch_lightning/callbacks/model_checkpoint.py#L236-L246

  1. `_get_metric_interpolated_filepath_name()` retrieves only the filename without the dirpath (as the comment above notes).
  2. `self._save_top_k_checkpoints()` doesn't add the dirpath internally, whereas `self._save_last_checkpoint()` does.
  3. `self.save_top_k` is set to 1 if ModelCheckpoint is initialized with `save_top_k=None` and `save_last=True`, which means the first if condition is satisfied unless ModelCheckpoint is explicitly initialized with `save_top_k=0`.
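The overlap in point 3 can be sketched as follows (a simplified, hypothetical model of the two branches, not Lightning's actual code):

```python
def branches_taken(save_top_k, save_last):
    """Hypothetical sketch of which save branches fire in save_checkpoint."""
    taken = []
    if save_top_k != 0:   # mode 1: save the top-k checkpoints
        taken.append("save_top_k")
    if save_last:         # mode 2: save the last checkpoint
        taken.append("save_last")
    return taken

# With save_last=True only, save_top_k gets coerced to 1, so both branches
# fire and the checkpoint is saved twice:
print(branches_taken(save_top_k=1, save_last=True))  # ['save_top_k', 'save_last']
```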

Expected behavior

  1. Save checkpoints to expected path.
  2. I think save_last should be handled in the second if statement (mode 2) instead of the first one (mode 1, save_top_k).

Environment

  • PyTorch Version (e.g., 1.0): 1.6.0
  • PyTorch Lightning Version (e.g., 1.0): 1.0.7
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.6.9
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: 2 x RTX2080Ti
Labels: bug / fix, help wanted

All 7 comments

filename='/{epoch}-{val_loss:.2f}' should be filename='{epoch}-{val_loss:.2f}'??

"/" and "\" are not allowed in filename.
@itsikad You mention an inconsistent behaviour, how can it be reproduced (aside from the illegal character in filename)?
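For reference, once the leading slash is dropped, the template resolves with ordinary `str.format` semantics (a minimal sketch of the templating idea; Lightning's own `format_checkpoint_name` additionally prefixes each value with its metric name, e.g. `epoch=3`):

```python
# Minimal sketch: the corrected template (no leading '/') is just an
# ordinary Python format string over the logged metric values.
template = "{epoch}-{val_loss:.2f}"          # not '/{epoch}-{val_loss:.2f}'
name = template.format(epoch=3, val_loss=0.1234)
print(name + ".ckpt")  # 3-0.12.ckpt
```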

@rohitgr7 @awaelchli Thanks! Can't believe I missed this and spent a couple of hours debugging it.

This solves the first two points.

The third still occurs. To reproduce, just initialize a ModelCheckpoint with save_last=True only (leave save_top_k at its default) and observe self.save_top_k in model_checkpoint.save_checkpoint(). Notice that it is set to 1, so both if statements (mode 1 and mode 2 in the code above) are satisfied and the checkpoint is saved twice.

If save_top_k=None (default), it will be set to 1 only if a monitor is given; otherwise it won't be changed. The point here is that if someone specifies the monitor key and sets save_top_k=None, it won't make sense.
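The defaulting described above can be sketched like this (a simplified, hypothetical model of the init logic, not the library's code):

```python
def resolve_save_top_k(save_top_k, monitor):
    """Hypothetical sketch: save_top_k=None becomes 1 only when a
    monitor key is provided; otherwise it is left unchanged."""
    if save_top_k is None and monitor is not None:
        return 1
    return save_top_k

print(resolve_save_top_k(None, "val_loss"))  # 1
print(resolve_save_top_k(None, None))        # None
print(resolve_save_top_k(3, "val_loss"))     # 3
```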

Also, save_last should have nothing to do with the monitor key IMO.
These lines seem a bit wrong to me.
https://github.com/PyTorchLightning/pytorch-lightning/blob/8601268c70649f49767001098adbf665a93843df/pytorch_lightning/callbacks/model_checkpoint.py#L253-L264

@rohitgr7 regarding save_last, a discussion is already happening here: #4335

thanks @awaelchli

Thanks @awaelchli @rohitgr7 for your help.
Closing this issue as there's an active discussion on the remaining issue.
