I set my model checkpoint callback as follows:
ckpt_callback = ModelCheckpoint(
filename='/{epoch}-{val_loss:.2f}',
monitor='val_loss',
save_last=True,
mode='min'
)
OR
ckpt_callback = ModelCheckpoint(
filename='/{epoch}-{val_avg_rmse:.2f}',
monitor='val_loss',
save_top_k=k,
mode='min'
)
and initialize the trainer dirpath with default os.getcwd().
In both cases, the a new directory is created in the correct path: proj_root/exp_name/exp_id/checkpoints, yet I get permission denied error since lightning tries to save the checkpoint in /<file_name>.ckpt instead of proj_root/exp_name/exp_id/checkpoints/<file_name>.ckpt, ignoring the dirpath.
Debug leads to inconsistent behavior (as far as I understand from the code/comments) in the following part of model_checkpoint.save_checkpoint():
self_save_top_k_checkpoints() doesn't add the dirpath internally whereas self._save_last_checkpoint does.ModelCheckpoint is initialized with save_top_k=None and 'save_las=True' which means the first if condition is satisfied unless explicitly initializing ModelCheckpoint with `save_top_k=0.save_last should be treated in the second if statement (mode 2) instead of the first one (mode 1, save_top_k).conda, pip, source): pipfilename='/{epoch}-{val_loss:.2f}' should be filename='{epoch}-{val_loss:.2f}'??
"/" and"\" are not allowed in filename.
@itsikad You mention an inconsistent behaviour, how can it be reproduced (aside from the illegal character in filename)?
@rohitgr7 @awaelchli Thanks! can't believe I missed this and spent couple of hours debugging it.
This solves the first two points.
the third still occurs. To reproduce just initialize a ModelCheckpoint with save_last=True only (leave save_top_k to default) and observing self.save_top_k in model_checpoint.save_chekcpoint(). Notice that it is set to 1 so both if statements (mode 1 and mode 2 in the code above) are satisfied (checkpoint is saved twice).
If save_top_k=None (default), it will be set to 1 only if a monitor is given else it won't be changed. The point here is if someone specifies the monitor key and set save_top_k=None, it won't make sense.
Also, save_last should have nothing to do with the monitor key IMO.
These lines seem a bit wrong to me.
https://github.com/PyTorchLightning/pytorch-lightning/blob/8601268c70649f49767001098adbf665a93843df/pytorch_lightning/callbacks/model_checkpoint.py#L253-L264
@rohitgr7 regarding save_last, a discussion is already happening here: #4335
thanks @awaelchli
Thanks @awaelchli @rohitgr7 for your help.
Closing this issue as there's an active discussion on the remaining issue.
Most helpful comment
filename='/{epoch}-{val_loss:.2f}'should befilename='{epoch}-{val_loss:.2f}'??