Keras-retinanet: network stops training after ~ 6-8 epochs

Created on 14 May 2020 · 5 Comments · Source: fizyr/keras-retinanet

Hello,
I posted an issue earlier that, I later realized, did not accurately describe what I was observing. Here is the correct description:
I am working through this tutorial on training RetinaNet on a custom image set. I am using a Google Cloud VM (52 GB RAM, Tesla T4) running Ubuntu 18.04, TF 1.15.2, and Python 3.7.7 (please let me know if any other information would be helpful). My dataset consists of 441 images, each with 1-5 occurrences of the same object (seeds). The train/val split is 271/170.
Everything appears to be installed correctly. I run the following train command, which uses the default of 50 epochs:

$ keras_retinanet/bin/train.py --tensorboard-dir ~/retinanet/TrainingOutput --snapshot-path ~/retinanet/TrainingOutput/snapshots --random-transform --steps 2000 pascal ~/retinanet/seed_detection_VOC/

Everything seems to work great for the first few epochs. I can see the loss decrease in TensorBoard, etc. But then, around epoch 6 or 8, training stops after saving the snapshot, with no indication of an error. The last line written out is:

Epoch 00008: saving model to /home/robotmessenger810/retinanet/TrainingOutput7/snapshots/resnet50_pascal_08.h5

I have gone ahead and used the model saved at this point, and it actually seems sufficient for my purposes, but I don't understand why training does not proceed for the expected number of epochs. I have tried increasing the VM's RAM (from 30 to 52 GB), setting "--worker 0", and raising and lowering the number of steps (from 100 to 10000). Training still stops after around 6-8 epochs, with no discernible pattern. Can anyone give insight into what might be happening and how I can troubleshoot it? Are there log files I can consult?
Thanks!

All 5 comments

Hi, I am also having the same problem. Have you been able to fix it?

@FarahSaeed
https://github.com/fizyr/keras-retinanet/blob/2b5d84e45329883870f8c1a2646ef3e761d399d6/keras_retinanet/bin/train.py#L204
Have you tried increasing the patience value in the EarlyStopping config? I suppose your mAP did not improve for 5 epochs.
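
In case it is not obvious what reaching the patience threshold does, here is a minimal sketch of the stopping rule in plain Python. It is not the library's code: the function name `stopping_epoch` and the mAP values are made up for illustration, `min_delta=0.01` is an assumption about the configured value, and the rule assumes a "higher is better" metric such as mAP.

```python
import math

def stopping_epoch(map_per_epoch, patience=5, min_delta=0.01):
    """Return the 1-indexed epoch at which training would stop, or None.

    An epoch counts as an improvement only if its mAP beats the best
    value seen so far by more than min_delta.
    """
    best = -math.inf
    wait = 0  # consecutive epochs without a sufficient improvement
    for epoch, current in enumerate(map_per_epoch, start=1):
        if current - min_delta > best:
            best = current
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical mAP curve for an easy task: it plateaus after epoch 3.
history = [0.60, 0.85, 0.95, 0.95, 0.951, 0.949, 0.952, 0.95, 0.95, 0.95]

print(stopping_epoch(history, patience=5))   # 8
print(stopping_epoch(history, patience=20))  # None
```

With a curve like this, training ends quietly after epoch 8 even though 50 epochs were requested, and a larger patience lets it keep going.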

I was unaware of what happened when the patience threshold was reached. That makes total sense (my detection task is quite simple and even after 6 epochs it worked extremely well). I'll close this.

@FarahSaeed
https://github.com/fizyr/keras-retinanet/blob/2b5d84e45329883870f8c1a2646ef3e761d399d6/keras_retinanet/bin/train.py#L204

Have you tried increasing the patience value in the EarlyStopping config? I suppose your mAP did not improve for 5 epochs.

It is running for all epochs after increasing the patience parameter. Thanks.

Mine stops at epoch 1, and increasing patience does not solve the problem. Any help, please?
The trace is as follows:
Epoch 1/50
2020-09-20 16:43:41.588458: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-20 16:43:42.717110: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
97/10000 [..............................] - ETA: 1:08:14 - loss: 3.3150 - regression_loss: 2.4165 - classification_loss: 0.8985
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 500000 batches). You may need to use the repeat() function when building your dataset.
Running network: 100% (15 of 15) |##########################################################################################| Elapsed Time: 0:00:10 Time: 0:00:10
Parsing annotations: 100% (15 of 15) |######################################################################################| Elapsed Time: 0:00:00 Time: 0:00:00
23 instances of class damaged with average precision: 0.1231
69 instances of class undamaged with average precision: 0.6403
mAP: 0.3817

Epoch 00001: saving model to ./snapshots/resnet50_pascal_01.h5
97/10000 [..............................] - 52s 532ms/step - loss: 3.3150 - regression_loss: 2.4165 - classification_loss: 0.8985
(retinaNet_2) demolakstate@demolakstate:/data/RetinaNet_2/keras-retinanet$
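
The warning in this trace looks like a different cause from the patience one: the generator produced only 97 batches, while 10000 steps per epoch were requested. The arithmetic can be sketched as below. This is only a rough check under assumptions: the helper name `max_steps_per_epoch` is made up, and the batch size of 1 is a guess, since the trace does not show it. If the guess holds, passing a `--steps` value no larger than the number of available batches may be worth trying.

```python
import math

def max_steps_per_epoch(num_images, batch_size=1):
    """Number of batches one pass over the training set can supply."""
    return math.ceil(num_images / batch_size)

requested_steps = 10000  # from the "97/10000" progress bar in the trace
epochs = 50              # from "Epoch 1/50"

# Matches the figure quoted in the warning.
print(requested_steps * epochs)  # 500000

# Assumption: 97 training images and batch size 1, so the generator
# is exhausted after 97 steps.
print(max_steps_per_epoch(97, batch_size=1))  # 97
```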
