
I stopped after training for 50 epochs and then resumed to train the next 50 epochs, but the loss got worse. Is this normal?
@jvnext1 yes, this happens sometimes when resuming saved models. I'm not sure what the cause is, as we follow all of the pytorch best practices for saving model and optimizer dictionaries, and we set the scheduler last_epoch value as well.
Unfortunately, the best solution is to train the model from start to finish without stopping.
We had the same problem and I think I found a solution. When resuming from a checkpoint, the optimizer is passed to the lr_scheduler constructor without a last_epoch argument. The value is only set afterwards, as scheduler.last_epoch = start_epoch in train.py. This leads to a bug because the constructor has already initialized the scheduler in a way that depends on last_epoch (see the source code of lr_scheduler for details: https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#LambdaLR).
The proposed solution is to pass the correct start_epoch in the constructor as such:
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf,last_epoch = start_epoch)
Note that if you resume training and also change the total number of epochs, a different learning rate will be calculated, because the lr_lambda function depends on the total number of epochs.
Hope this will help and thank you for an otherwise excellent implementation!
@Papibarozza ah thank you for your proposed fix. Yes this sounds like it may be a bug! Have you had success in implementing this fix?
Also yes, resuming training with different --epochs will lead to bad things, especially with a cosine LR scheduler.
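As a rough illustration of why changing --epochs on resume shifts the learning rate, the sketch below assumes a plain half-cosine multiplier. The repository's actual lr_lambda may differ, and cosine_factor and the epoch counts are hypothetical names and values chosen for the example.

```python
import math


def cosine_factor(epoch, total_epochs):
    """Hypothetical half-cosine LR multiplier: 1.0 at epoch 0, 0.0 at total_epochs."""
    return 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))


resume_epoch = 50

# Multiplier at the resume point if the run keeps its original 100-epoch schedule.
original = cosine_factor(resume_epoch, 100)

# Multiplier at the same epoch if the run is resumed with --epochs 200 instead.
changed = cosine_factor(resume_epoch, 200)

print(f"epochs=100: factor at epoch {resume_epoch} = {original:.4f}")
print(f"epochs=200: factor at epoch {resume_epoch} = {changed:.4f}")
```

With these assumed values the multiplier jumps from about 0.50 to about 0.85 at the same epoch, so the resumed run would suddenly train with a noticeably larger learning rate.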
TODO: Implement proposed fix in https://github.com/ultralytics/yolov3/issues/902#issuecomment-597696202
@Papibarozza commit 320f9c6601ae1bddae036a5094dfe81ac1441cc3 should fix this now. Passing last_epoch = start_epoch led to an error, so I used start_epoch - 1, which I think is better, since at epoch 0 this will read at -1, the default value.
I think this might create the scheduler warning again though, I will need to check. Ah, no, everything seems fine. Ok great! I have not checked whether this fixes the error on resuming training, but it doesn't seem to be hurting anything at the start of training, so I will leave it as is following the commit.
@glenn-jocher Yes, we resumed training with this fix and completed some epochs. Everything works as expected, and we are seeing similar trends in the loss curve and metrics as we did before stopping the training. I think setting start_epoch - 1 like you propose will give us an off-by-one error when resuming though, so maybe there needs to be some more code to make this right.
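One way to reason about the off-by-one question is a toy, pure-Python stand-in for a lambda-based scheduler. This is not PyTorch's LambdaLR. It only assumes that the constructor performs one initial step that increments last_epoch and applies the lambda, which appears to match recent PyTorch releases (older releases may behave differently). ToyLambdaScheduler, the linear lf, and all numbers are hypothetical.

```python
class ToyLambdaScheduler:
    """Simplified stand-in for a lambda-based LR scheduler (NOT PyTorch's class)."""

    def __init__(self, base_lr, lr_lambda, last_epoch=-1):
        self.base_lr = base_lr
        self.lr_lambda = lr_lambda
        self.last_epoch = last_epoch
        self.lr = base_lr
        self.step()  # assumption: construction performs one initial step

    def step(self):
        # Advance the epoch counter, then recompute the learning rate from it.
        self.last_epoch += 1
        self.lr = self.base_lr * self.lr_lambda(self.last_epoch)


epochs = 100
start_epoch = 50
base_lr = 0.01
lf = lambda x: 1.0 - x / epochs  # hypothetical linear decay

# (a) Default constructor, last_epoch patched afterwards: the lr was already
#     computed for epoch 0 and is not refreshed until the next step().
patched = ToyLambdaScheduler(base_lr, lf)
patched.last_epoch = start_epoch

# (b) last_epoch=start_epoch: the initial step moves the counter to start_epoch + 1.
ahead = ToyLambdaScheduler(base_lr, lf, last_epoch=start_epoch)

# (c) last_epoch=start_epoch - 1: the initial step lands exactly on start_epoch.
aligned = ToyLambdaScheduler(base_lr, lf, last_epoch=start_epoch - 1)

for name, s in [("patched", patched), ("ahead", ahead), ("aligned", aligned)]:
    print(f"{name}: last_epoch={s.last_epoch}, lr={s.lr:.5f}")
```

Under that assumption, start_epoch - 1 lines up with the resumed epoch, while patching last_epoch after construction leaves the learning rate at its epoch-0 value until the next step() call. The behaviour is worth verifying against the PyTorch version actually installed.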
@Papibarozza awesome!! This problem has been with us for over a year, I'm glad this finally solves it. Thank you for your solution!!