Pytorch-lightning: Checkpoint is saving the model based on the last val_metric_step value and not val_metric_epoch

Created on 18 Oct 2020  ·  10 Comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

The checkpoint callback did not save some models even though they achieved a better result on the monitored metric than the current top-k saved models.

Expected behavior

The checkpoint callback should save the best-scoring models based on the monitored metric.

Environment

I am using pytorch-lightning 1.0.2

Update:

I changed the checkpoint callback to include the monitored value in the name of the saved checkpoint. What I noticed is that the value in the name is not the epoch-level value but the value from the last step of the epoch, so the callback is not using the metric's epoch average, only its last step value.

Labels: checkpoint, documentation, help wanted

All 10 comments

Can you post some code to reproduce this? Or a code snippet of your training_step and validation_step?

This is what I log in my training and validation steps:

# Log both metrics to the logger and the progress bar, both per step
# and aggregated per epoch.
values = {'val_loss': loss, 'val_cer': cer_avg}
self.log_dict(values, logger=True, prog_bar=True, on_step=True, on_epoch=True)

And this is my checkpoint callback:

checkpoint_callback = ModelCheckpoint(
    filepath='checkpoints/model_64_3/word_recog-{epoch:02d}-{val_cer:.2f}',
    save_last=True,
    mode='min',
    monitor='val_cer',
    save_top_k=5,
)

@awaelchli or @justusschock maybe related to other issues?

Update:
I have now set on_step=False, and the checkpoint callback seems to save the best model correctly.
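
For reference, the changed logging call would look like this (a minimal sketch, assuming the same values dict as above):

# With on_step=False, Lightning logs the metric under its plain name
# ('val_cer') rather than a '_step'/'_epoch'-suffixed key, so
# monitor='val_cer' in the ModelCheckpoint matches the epoch-aggregated value.
self.log_dict(values, logger=True, prog_bar=True, on_step=False, on_epoch=True)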

I know what's going on here.
When you log both on step and on epoch, i.e.

self.log_dict(values, on_step=True, on_epoch=True)

Lightning will create the keys

  • val_cer_step
  • val_cer_epoch

This is needed because Lightning cannot log the epoch-level val_cer and the step-level val_cer to the same graph in TensorBoard.
So your ModelCheckpoint should monitor the epoch metric:

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/model_64_3", 
    filename="/word_recog-{epoch:02d}-{val_cer_epoch:.2f}",  # <--- note epoch suffix here
    save_last=True, 
    mode='min', 
    monitor='val_cer_epoch',   # <--- note epoch suffix here
    save_top_k=5
)
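
If you want to double-check which keys are available to monitor, you can inspect trainer.callback_metrics after fitting (a quick sketch; trainer and model stand in for your own objects):

# Sanity check: list the metric keys that callbacks such as ModelCheckpoint
# can monitor. With on_step=True and on_epoch=True you should see both
# 'val_cer_step' and 'val_cer_epoch' here (pytorch-lightning 1.0.x behaviour).
trainer.fit(model)
print(sorted(trainer.callback_metrics.keys()))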

I will send a PR that updates the docs explaining this behaviour.

To be honest, I have since changed my code, so I cannot test this, but I believe I did set monitor='val_cer_epoch' and the checkpoint still did not save the highest CER. But about the name of the checkpoint file, I think that was my mistake, and I should have set it to val_cer_epoch.

Thank you for the help, and I hope this issue was helpful to this great library.

But about the name of the checkpoint file, I think that was my mistake, and I should have set it to val_cer_epoch.

Yes, that would also explain it, because otherwise it would show the val_cer of the last batch in the validation loop in the name of the checkpoint, even if it saves the correct checkpoint.

I may be wrong, but I checked the date of the saved checkpoint against the date of the highest val_cer_epoch in TensorBoard, and they were not the same.
This is why I was sure it wasn't saving the best checkpoint.

OK, just note that if you want the highest value to be treated as the best, then you need to set mode="max", but you have mode="min".

Sorry, my bad. When I said the highest, I meant the best value; CER is character error rate, so the lower the better.
