Darknet: Training of PyTorch YOLOv3 doesn't converge

Created on 20 May 2020 · 10 Comments · Source: AlexeyAB/darknet

Hi,

I am trying to train YOLOv3 using this code:
https://github.com/eriklindernoren/PyTorch-YOLOv3
As we all know, it is the best-known PyTorch YOLOv3 implementation.

However, the training doesn't converge. Please see my issue:
https://github.com/eriklindernoren/PyTorch-YOLOv3/issues/504

None of the parameters were changed (i.e., they are the same as in the code on GitHub).
I trained for more than 100 epochs; the AP is around 28% and the loss is around 5.

Do you have any idea how to trace the problem?
Or do you have PyTorch code for YOLOv3 that can be trained and converges?

Thanks,
Ardeal

Solved

ardeal

Most helpful comment

I eventually confirmed that the code (https://github.com/eriklindernoren/PyTorch-YOLOv3) doesn't converge, but the code (https://github.com/ultralytics/yolov3) converges.

I trained the eriklindernoren/PyTorch-YOLOv3 code for around 180 epochs, but the AP is only around 0.26.

However, I trained the ultralytics/yolov3 code for only 21 epochs, and the AP is already much better. Furthermore, the AP increases steadily, as expected. The following table is the training output of the ultralytics/yolov3 code:

|epoch|mem|lossbox|lossobj|losscls|losstotal|targets|img_size|P|R|mAP|F1|GIoU|obj|cls|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|1/299|6.23G| 5.22| 3.5| 5.61| 14.3| 5| 512|0.067|0.0138|0.0067|0.00813| 5.22| 2.4| 5.3|
|2/299|6.24G| 4.42| 2.66| 4.59| 11.7| 5| 320|0.119|0.0464|0.0363|0.0439| 4.31| 2.04| 4.45|
|3/299|6.29G| 4.13| 2.37| 3.98| 10.5| 5| 576|0.212|0.0929|0.0849|0.0961| 3.75| 1.84| 3.69|
|4/299|8.87G| 3.95| 2.33| 3.59| 9.87| 5| 320|0.268|0.144|0.125|0.144| 3.53| 1.76| 3.28|
|5/299|6.23G| 3.83| 2.26| 3.31| 9.4| 5| 640|0.306|0.185|0.165|0.188| 3.36| 1.71| 2.95|
|6/299|6.23G| 3.75| 2.24| 3.1| 9.09| 5| 384|0.355|0.217| 0.2|0.226| 3.23| 1.61| 2.7|
|7/299|6.29G| 3.68| 2.21| 2.95| 8.83| 5| 448|0.376|0.232| 0.22|0.246| 3.16| 1.56| 2.57|
|8/299| 8.8G| 3.62| 2.16| 2.81| 8.6| 5| 320|0.383|0.247|0.235|0.261| 3.12| 1.54| 2.48|
|9/299| 8.8G| 3.59| 2.18| 2.7| 8.48| 5| 576|0.402|0.257|0.246|0.272| 3.09| 1.53| 2.42|
|10/299|8.81G| 3.55| 2.15| 2.61| 8.3| 5| 384|0.409|0.264|0.255|0.282| 3.07| 1.52| 2.37|
|11/299|8.81G| 3.52| 2.16| 2.53| 8.21| 5| 320|0.422|0.269|0.264|0.289| 3.05| 1.51| 2.33|
|12/299|8.81G| 3.48| 2.13| 2.46| 8.07| 5| 640|0.437|0.275|0.273|0.298| 3.03| 1.5| 2.29|
|13/299| 8.8G| 3.46| 2.11| 2.4| 7.97| 5| 448|0.446| 0.28|0.282|0.306| 3.02| 1.49| 2.25|
|14/299| 8.8G| 3.44| 2.07| 2.34| 7.85| 5| 320|0.447|0.285| 0.29|0.314| 3| 1.49| 2.2|
|15/299| 8.8G| 3.42| 2.1| 2.3| 7.82| 5| 576|0.462|0.289|0.299|0.321| 2.99| 1.48| 2.16|
|16/299| 8.8G| 3.41| 2.1| 2.27| 7.78| 5| 320|0.475|0.294|0.307|0.328| 2.97| 1.47| 2.12|
|17/299| 8.8G| 3.39| 2.08| 2.22| 7.69| 5| 384|0.482|0.301|0.315|0.336| 2.95| 1.46| 2.08|
|18/299| 8.8G| 3.38| 2.07| 2.19| 7.64| 5| 576|0.491|0.306|0.324|0.343| 2.94| 1.45| 2.03|
|19/299| 8.8G| 3.36| 2.05| 2.16| 7.57| 5| 448|0.498|0.315|0.332|0.352| 2.92| 1.45| 1.99|
|20/299| 8.8G| 3.35| 2.06| 2.13| 7.53| 5| 384|0.502|0.325|0.341|0.362| 2.9| 1.44| 1.94|
|21/299| 8.8G| 3.33| 2.02| 2.11| 7.46| 5| 320|0.502|0.333|0.348|0.369| 2.88| 1.43| 1.9|

ardeal on 28 May 2020

All 10 comments

What AP can you get by using other frameworks, models, ...?

AlexeyAB on 20 May 2020

@AlexeyAB,
Many thanks for your answer!

I haven't tried other frameworks or models.
I only used the model trained with the code (https://github.com/eriklindernoren/PyTorch-YOLOv3).

Today, I am trying to train this code:
https://github.com/ultralytics/yolov3
The training has not finished yet. Once it finishes, I will check whether it converges.

My questions are:
1) I haven't found PyTorch code for YOLOv3 that can be trained and converges. Do you have such code?
2) Do you have any idea how to trace the problem?

Thanks,
Ardeal

ardeal on 20 May 2020

https://github.com/ultralytics/yolov3
Maybe your dataset is incorrect or very difficult, or you are using bad params in the cfg-file.

AlexeyAB on 20 May 2020

@AlexeyAB,

Thanks!

The dataset I am using is COCO.
I didn't change any params in the cfg-file or in any py file.

I am now training https://github.com/ultralytics/yolov3.
Let's wait for the training result of ultralytics/yolov3. If that training converges, we can conclude that the eriklindernoren/PyTorch-YOLOv3 code or its params don't work.

Thanks,
Ardeal

ardeal on 20 May 2020

> I trained more than 100 epochs, the AP is around 28%, and the loss is around 5.

For yolov3.cfg with width=416 height=416, a good final AP50...95 is 31% after 300 epochs if you use random-shapes, or lower if you don't.
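
For readers unsure what AP50...95 refers to: in COCO evaluation it is the AP averaged over ten IoU thresholds, from 0.50 to 0.95 in steps of 0.05, so it is normally lower than the AP measured at IoU 0.50 alone. Below is a minimal sketch of why. The two boxes are made-up values and `iou` is a hypothetical helper written for this note, not code from either repository.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

gt = (0, 0, 100, 100)      # made-up ground-truth box
pred = (10, 10, 110, 110)  # made-up prediction, shifted by 10 px

overlap = iou(gt, pred)
# Thresholds at which this prediction would still count as a match.
passed = [t for t in thresholds if overlap >= t]

print(f"IoU = {overlap:.3f}")
print("counts as a match at thresholds:", passed)
```

A prediction like this one is a true positive at the looser thresholds but a miss at the stricter ones, which pulls the averaged figure down. When comparing numbers from different repositories, it is worth checking which of the two metrics each one reports.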

AlexeyAB on 20 May 2020

@AlexeyAB,
Thanks,

In yolov3.cfg, width=416 and height=416; I didn't change them.

I didn't understand what you meant by:
> For yolov3.cfg width=416 height=416 the good final AP50...95=31% after 300 epochs, if you use random-shapes, or lower if you don't.

Thanks,
Ardeal

ardeal on 20 May 2020

So your AP=28% isn't bad for yolov3 after 100 epochs; I don't see any problem.

AlexeyAB on 20 May 2020

@AlexeyAB,

Thanks!

I am very glad to hear that. Over the past few months, I kept investigating what was wrong with the code.

As you understand it, AP = 28% isn't bad for yolov3 after 100 epochs.
If I would like to increase the AP, what should I do? For example:
1) Should I decrease the learning rate (LR)?
2) Should I train for more epochs? How many epochs are needed to reach an AP similar to the YOLOv3 paper?
3) Is there anything else I could do to increase the AP?

Thanks,
Ardeal

ardeal on 20 May 2020

@AlexeyAB
Hi,

Could you please check my question at https://github.com/ultralytics/yolov3/issues/1211?
It is related to this one.

Thanks,
Ardeal

ardeal on 21 May 2020

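I eventually confirmed that the code (https://github.com/eriklindernoren/PyTorch-YOLOv3) doesn't converge, but the code (https://github.com/ultralytics/yolov3) converges.

I trained the eriklindernoren/PyTorch-YOLOv3 code for around 180 epochs, but the AP is only around 0.26.

However, I trained the ultralytics/yolov3 code for only 21 epochs, and the AP is already much better and increases steadily. The table below is the training output of the ultralytics/yolov3 code: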

|epoch|mem|lossbox|lossobj|losscls|losstotal|targets|img_size|P|R|mAP|F1|GIoU|obj|cls|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|1/299|6.23G| 5.22| 3.5| 5.61| 14.3| 5| 512|0.067|0.0138|0.0067|0.00813| 5.22| 2.4| 5.3|
|2/299|6.24G| 4.42| 2.66| 4.59| 11.7| 5| 320|0.119|0.0464|0.0363|0.0439| 4.31| 2.04| 4.45|
|3/299|6.29G| 4.13| 2.37| 3.98| 10.5| 5| 576|0.212|0.0929|0.0849|0.0961| 3.75| 1.84| 3.69|
|4/299|8.87G| 3.95| 2.33| 3.59| 9.87| 5| 320|0.268|0.144|0.125|0.144| 3.53| 1.76| 3.28|
|5/299|6.23G| 3.83| 2.26| 3.31| 9.4| 5| 640|0.306|0.185|0.165|0.188| 3.36| 1.71| 2.95|
|6/299|6.23G| 3.75| 2.24| 3.1| 9.09| 5| 384|0.355|0.217| 0.2|0.226| 3.23| 1.61| 2.7|
|7/299|6.29G| 3.68| 2.21| 2.95| 8.83| 5| 448|0.376|0.232| 0.22|0.246| 3.16| 1.56| 2.57|
|8/299| 8.8G| 3.62| 2.16| 2.81| 8.6| 5| 320|0.383|0.247|0.235|0.261| 3.12| 1.54| 2.48|
|9/299| 8.8G| 3.59| 2.18| 2.7| 8.48| 5| 576|0.402|0.257|0.246|0.272| 3.09| 1.53| 2.42|
|10/299|8.81G| 3.55| 2.15| 2.61| 8.3| 5| 384|0.409|0.264|0.255|0.282| 3.07| 1.52| 2.37|
|11/299|8.81G| 3.52| 2.16| 2.53| 8.21| 5| 320|0.422|0.269|0.264|0.289| 3.05| 1.51| 2.33|
|12/299|8.81G| 3.48| 2.13| 2.46| 8.07| 5| 640|0.437|0.275|0.273|0.298| 3.03| 1.5| 2.29|
|13/299| 8.8G| 3.46| 2.11| 2.4| 7.97| 5| 448|0.446| 0.28|0.282|0.306| 3.02| 1.49| 2.25|
|14/299| 8.8G| 3.44| 2.07| 2.34| 7.85| 5| 320|0.447|0.285| 0.29|0.314| 3| 1.49| 2.2|
|15/299| 8.8G| 3.42| 2.1| 2.3| 7.82| 5| 576|0.462|0.289|0.299|0.321| 2.99| 1.48| 2.16|
|16/299| 8.8G| 3.41| 2.1| 2.27| 7.78| 5| 320|0.475|0.294|0.307|0.328| 2.97| 1.47| 2.12|
|17/299| 8.8G| 3.39| 2.08| 2.22| 7.69| 5| 384|0.482|0.301|0.315|0.336| 2.95| 1.46| 2.08|
|18/299| 8.8G| 3.38| 2.07| 2.19| 7.64| 5| 576|0.491|0.306|0.324|0.343| 2.94| 1.45| 2.03|
|19/299| 8.8G| 3.36| 2.05| 2.16| 7.57| 5| 448|0.498|0.315|0.332|0.352| 2.92| 1.45| 1.99|
|20/299| 8.8G| 3.35| 2.06| 2.13| 7.53| 5| 384|0.502|0.325|0.341|0.362| 2.9| 1.44| 1.94|
|21/299| 8.8G| 3.33| 2.02| 2.11| 7.46| 5| 320|0.502|0.333|0.348|0.369| 2.88| 1.43| 1.9|
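
One simple way to check the claim that this run is converging is to test whether the total loss falls and the mAP rises from epoch to epoch. The sketch below does that with the `losstotal` and `mAP` columns copied from the table above; the helper names are my own and are not part of ultralytics/yolov3.

```python
# losstotal and mAP columns for epochs 1-21, copied from the table above.
loss_total = [14.3, 11.7, 10.5, 9.87, 9.4, 9.09, 8.83, 8.6, 8.48, 8.3, 8.21,
              8.07, 7.97, 7.85, 7.82, 7.78, 7.69, 7.64, 7.57, 7.53, 7.46]
m_ap = [0.0067, 0.0363, 0.0849, 0.125, 0.165, 0.2, 0.22, 0.235, 0.246, 0.255,
        0.264, 0.273, 0.282, 0.29, 0.299, 0.307, 0.315, 0.324, 0.332, 0.341,
        0.348]


def strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(b < a for a, b in zip(values, values[1:]))


def strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(b > a for a, b in zip(values, values[1:]))


def recent_gain(values, window=5):
    """Change in the metric over the last `window` epochs."""
    return values[-1] - values[-1 - window]


print("loss falls every epoch:", strictly_decreasing(loss_total))
print("mAP rises every epoch: ", strictly_increasing(m_ap))
print(f"mAP gain over last 5 epochs: {recent_gain(m_ap):.3f}")
```

For these 21 epochs both checks hold and the mAP is still gaining over the last five epochs. Real training curves are often noisier than this, so a strict per-epoch test is only a rough indicator; a stalled run would more typically show a near-zero gain over a longer window.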