Darknet: When should I stop training if the validation metrics don't go down?

Created on 14 May 2018 · 6 comments · Source: AlexeyAB/darknet

I train my model on Ubuntu 16.04 with the command below:
darknet detector train <data_file> <cfg_file> darknet19_448.conv.23 | tee log.txt

Here is my learning rate strategy in the cfg_file:

learning_rate=0.0001
max_batches = 90000
policy=steps
steps=200,50000,70000
scales=10,.1,.1

And here is the chart produced during training:
[training chart]

After training, I use a Python script to validate the model weights saved at different training steps. The script runs the command:
darknet detector map <data_file> <cfg_file> <weight_file> 1>log_file

and parses the output to get the metrics, then plots them out. Here is a plot I got:
[metrics plot]

which shows that my metrics don't go down, unlike in this plot:
[reference plot]
So, when should I stop training, or which weights file should I choose?

BTW: my Python script is ugly, but it does work.
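For reference, a minimal sketch of such a validation loop is below. The binary, .data/.cfg paths and the backup naming scheme are placeholders, and the exact wording of the `darknet detector map` output can differ between darknet versions, so the parsing pattern is an assumption:

import re
import subprocess

# Placeholder paths; adjust to your own setup.
DARKNET = "./darknet"
DATA_FILE = "obj.data"
CFG_FILE = "yolo-obj.cfg"

def get_map(weight_file):
    """Run `darknet detector map` for one weights file and return its mAP, or None."""
    out = subprocess.run(
        [DARKNET, "detector", "map", DATA_FILE, CFG_FILE, weight_file],
        capture_output=True, text=True).stdout
    # Assumed output line, e.g. "mean average precision (mAP@0.50) = 0.7231"
    match = re.search(r"mean average precision.*?=\s*([0-9.]+)", out)
    return float(match.group(1)) if match else None

if __name__ == "__main__":
    for step in range(10000, 90001, 10000):
        weights = "backup/yolo-obj_%d.weights" % step   # placeholder naming scheme
        print(step, get_map(weights))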

question

All 6 comments

Did you get mAP on a validation dataset - images that weren't used during training?
If yes, then you can use any weights file from ~54,000 iterations onward.
I.e. pick any weights file with the highest mAP (or the weights file with the highest mAP, Precision and Recall).

which shows that my metrics don't go down, unlike in this plot:

Overfitting is rare for Yolo v3/v2. It can occur only:

  • if you use a very low number of images and train for many iterations,
  • or a high number of similar images with a different distribution than in the validation dataset,
  • and set wrong params for data augmentation or the learning rate
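To make "pick the weights file with the highest mAP" concrete, here is a minimal sketch, assuming the metrics are collected into a pandas DataFrame indexed by iteration (the numbers below are made up):

import pandas as pd

# Made-up metric values, indexed by training iteration.
metrics = pd.DataFrame(
    {"mAP": [0.70, 0.74, 0.73],
     "precision": [0.78, 0.80, 0.79],
     "recall": [0.72, 0.75, 0.76]},
    index=[54000, 62000, 70000])

best_iter = metrics["mAP"].idxmax()   # iteration whose weights file has the highest mAP
print("pick the weights saved at iteration", best_iter)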

Thank you for your timely reply!
Yes, I randomly split my ~8000 image samples into training and validation datasets with a ratio of 8:2, and validate the models on the validation dataset after training.
It looks like I could reduce the number of training iterations to save some training time.
Thank you again, I can proceed without worrying about overfitting now.
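A minimal sketch of such an 8:2 random split, assuming darknet-style image list files (the file names here are assumptions, not from the original post):

import random

random.seed(0)  # fixed seed so the split is reproducible
with open("all_images.txt") as f:                 # assumed list of all image paths
    paths = [line.strip() for line in f if line.strip()]

random.shuffle(paths)
split = int(0.8 * len(paths))                     # 8:2 train/validation split
with open("train.txt", "w") as f:
    f.write("\n".join(paths[:split]) + "\n")
with open("valid.txt", "w") as f:
    f.write("\n".join(paths[split:]) + "\n")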

@yangulei
Looking at your learning strategy...
What's the purpose of scales and steps?
Thx

@EscVM
It's a learning rate (LR) schedule. The "steps" parameter lists the iterations at which the LR is adjusted, and the "scales" parameter lists the corresponding multipliers; see the answer on stackoverflow.
In my opinion, the schedule is meant to balance computation time against convergence accuracy. You can find more detail about this in CS231n and the YOLOv1 paper.
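A small sketch of how the steps policy from the cfg above is commonly understood to behave (ignoring burn_in): each time the iteration count passes one of the steps, the current LR is multiplied by the matching scale.

def lr_at(iteration, base_lr=0.0001,
          steps=(200, 50000, 70000), scales=(10, 0.1, 0.1)):
    # Multiply the base LR by every scale whose step has already been passed.
    lr = base_lr
    for step, scale in zip(steps, scales):
        if iteration >= step:
            lr *= scale
    return lr

for it in (0, 200, 50000, 70000):
    print(it, lr_at(it))   # roughly 0.0001, 0.001, 0.0001, 1e-05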

@yangulei
Thank you. Gotcha!

Instead, I've tried your code, but I got this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
in ()
     75     metrics = pd.read_hdf(h5_name, 'metrics')
     76 else:
---> 77     metrics = get_metrics()
     78
     79 metrics_select = metrics[["F1","IoU","mAP","precision","recall"]]

in get_metrics()
     48     k, v = iterm.split("=")
     49     metrics_dict.update({k:v})
---> 50     mAPs.append(float(metrics_dict["mAP"]))
     51     precisions.append(float(metrics_dict["precision"]))
     52     recalls.append(float(metrics_dict["recall"]))

KeyError: 'mAP'

@EscVM Sorry for replying so late. Did you change the darknet executable, the data path, the config path and the weights path to your own in the get_metrics() function? They're on lines 11 to 17 of the script.
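If the paths are wrong, darknet prints no mAP line and the parsed dict stays empty. A small defensive sketch (the names mirror the traceback above; skipping instead of raising is just one possible fix):

def append_metrics(metrics_dict, mAPs, precisions, recalls, weight_file):
    # Skip a weights file whose darknet output contained no "mAP" entry.
    if "mAP" not in metrics_dict:
        print("no mAP parsed for", weight_file, "- check the paths")
        return
    mAPs.append(float(metrics_dict["mAP"]))
    precisions.append(float(metrics_dict["precision"]))
    recalls.append(float(metrics_dict["recall"]))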

