Pytorch-lightning: Checkpoint is saving the model based on the last val_metric_step value and not val_metric_epoch

Created on 18 Oct 2020  ·  10 Comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

The checkpoint callback did not save some models even though they achieved a better result on the monitored metric than the current top-k saved models.

Expected behavior

The checkpoint callback should save the best-scoring models based on the monitored metric.

Environment

I am using pytorch-lightning 1.0.2

Update:

I changed the checkpoint callback to include the monitored value in the name of the saved checkpoint. What I noticed is that the value in the name is not the epoch-level value but the value from the last step of the epoch, so the callback is not using the metric's epoch average, only its last step value.

Labels: checkpoint, documentation, help wanted

All 10 comments

Can you post some code to reproduce this? Or a code snippet of your training_step and validation_step?

This is what I log in my training and validation steps:

# Log both metrics to the logger and the progress bar, both per step
# and aggregated per epoch.
values = {'val_loss': loss, 'val_cer': cer_avg}
self.log_dict(values, logger=True, prog_bar=True, on_step=True, on_epoch=True)

And this is my checkpoint callback:

checkpoint_callback = ModelCheckpoint(
    filepath='checkpoints/model_64_3/word_recog-{epoch:02d}-{val_cer:.2f}',
    save_last=True,
    mode='min',
    monitor='val_cer',
    save_top_k=5,
)

@awaelchli or @justusschock maybe related to other issues?

Update:
I have now set on_step=False, and the checkpoint callback seems to save the best model correctly.
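
For reference, the changed logging call would look like this (a minimal sketch, assuming the same values dict as above):

# With on_step=False, Lightning logs the metric under its plain name
# ('val_cer') rather than a '_step'/'_epoch'-suffixed key, so
# monitor='val_cer' in the ModelCheckpoint matches the epoch-aggregated value.
self.log_dict(values, logger=True, prog_bar=True, on_step=False, on_epoch=True)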

I know what's going on here.
When you log both on step and on epoch, i.e.

self.log_dict(values, on_step=True, on_epoch=True)

Lightning will create the keys

  • val_cer_step
  • val_cer_epoch

This is needed because Lightning cannot log the epoch-level val_cer and the step-level val_cer to the same graph in TensorBoard.
So your ModelCheckpoint should monitor the epoch metric:

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/model_64_3", 
    filename="/word_recog-{epoch:02d}-{val_cer_epoch:.2f}",  # <--- note epoch suffix here
    save_last=True, 
    mode='min', 
    monitor='val_cer_epoch',   # <--- note epoch suffix here
    save_top_k=5
)
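
If you want to double-check which keys are available to monitor, you can inspect trainer.callback_metrics after fitting (a quick sketch; trainer and model stand in for your own objects):

# Sanity check: list the metric keys that callbacks such as ModelCheckpoint
# can monitor. With on_step=True and on_epoch=True you should see both
# 'val_cer_step' and 'val_cer_epoch' here (pytorch-lightning 1.0.x behaviour).
trainer.fit(model)
print(sorted(trainer.callback_metrics.keys()))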

I will send a PR that updates the docs explaining this behaviour.

To be honest, I have since changed my code, so I cannot test this, but I believe I did set monitor='val_cer_epoch' and the checkpoint still did not save the highest CER. But about the name of the checkpoint file, I think that was my mistake, and I should have set it to val_cer_epoch.

Thank you for the help, and I hope this issue was helpful to this great library.

But about the name of the checkpoint file, I think that was my mistake, and I should have set it to val_cer_epoch.

Yes, that would also explain it, because otherwise it would show the val_cer of the last batch in the validation loop in the name of the checkpoint, even if it saves the correct checkpoint.

I may be wrong, but I checked the date of the saved checkpoint against the date of the highest val_cer_epoch in TensorBoard, and they were not the same.
This is why I was sure it wasn't saving the best checkpoint.

OK, just note that if you want the highest value to be treated as the best, then you need to set mode="max", but you have mode="min".

Sorry, my bad. When I said the highest, I meant the best value; CER is character error rate, so the lower the better.
