Faster-rcnn.pytorch: Training loss doesn't decrease

Created on 18 Jan 2018 · 2 Comments · Source: jwyang/faster-rcnn.pytorch

Hi @jwyang, first of all, thank you for your great work!

It works well on the PASCAL VOC dataset, and training is fast.

On COCO, however, I trained on our 8-GPU machine with a batch size of 16, and the training log looks like this:

[session 1][epoch  1][iter    0] loss: 6.1804, lr: 1.00e-02
                        fg/bg=(426/1622), time cost: 35.842244
                        rpn_cls: 0.6897, rpn_box: 0.3679, rcnn_cls: 4.6646, rcnn_box 0.4582
[session 1][epoch  1][iter  100] loss: nan, lr: 1.00e-02
                        fg/bg=(296/1752), time cost: 104.216505
                        rpn_cls: 0.5585, rpn_box: 0.1213, rcnn_cls: 1.1079, rcnn_box 0.2641
[session 1][epoch  1][iter  200] loss: 2.0316, lr: 1.00e-02
                        fg/bg=(292/1756), time cost: 97.825786
                        rpn_cls: 0.4695, rpn_box: 0.1333, rcnn_cls: 0.9935, rcnn_box 0.2353
[session 1][epoch  1][iter  300] loss: 1.9466, lr: 1.00e-02
                        fg/bg=(309/1739), time cost: 96.629395
                        rpn_cls: 0.4684, rpn_box: 0.2638, rcnn_cls: 1.0195, rcnn_box 0.2282
[session 1][epoch  1][iter  400] loss: 1.8687, lr: 1.00e-02
                        fg/bg=(351/1697), time cost: 96.798322
                        rpn_cls: 0.3513, rpn_box: 0.3516, rcnn_cls: 1.1189, rcnn_box 0.3672
[session 1][epoch  1][iter  500] loss: 1.8504, lr: 1.00e-02
                        fg/bg=(389/1659), time cost: 97.335257
                        rpn_cls: 0.5053, rpn_box: 0.1943, rcnn_cls: 1.1378, rcnn_box 0.3056

After 6 epochs:

[session 1][epoch  6][iter    0] loss: 2.1971, lr: 1.00e-03
                        fg/bg=(371/1677), time cost: 10.386064
                        rpn_cls: 0.4443, rpn_box: 0.1902, rcnn_cls: 1.2200, rcnn_box 0.3425
[session 1][epoch  6][iter  100] loss: 1.7717, lr: 1.00e-03
                        fg/bg=(359/1689), time cost: 102.440927
                        rpn_cls: 0.5265, rpn_box: 0.4803, rcnn_cls: 1.0865, rcnn_box 0.2994
[session 1][epoch  6][iter  200] loss: 1.8162, lr: 1.00e-03
                        fg/bg=(270/1778), time cost: 101.048738
                        rpn_cls: 0.4605, rpn_box: 0.2271, rcnn_cls: 0.7392, rcnn_box 0.1698
[session 1][epoch  6][iter  300] loss: 1.8152, lr: 1.00e-03
                        fg/bg=(327/1721), time cost: 100.423944
                        rpn_cls: 0.4267, rpn_box: 0.1103, rcnn_cls: 0.9151, rcnn_box 0.2822
[session 1][epoch  6][iter  400] loss: 1.7705, lr: 1.00e-03
                        fg/bg=(255/1793), time cost: 101.397022
                        rpn_cls: 0.3407, rpn_box: 0.0913, rcnn_cls: 0.8353, rcnn_box 0.2553
[session 1][epoch  6][iter  500] loss: 1.7776, lr: 1.00e-03
                        fg/bg=(274/1774), time cost: 107.594405
                        rpn_cls: 0.3600, rpn_box: 0.1331, rcnn_cls: 0.8416, rcnn_box 0.2436
[session 1][epoch  6][iter  600] loss: 1.8178, lr: 1.00e-03
                        fg/bg=(267/1781), time cost: 108.269909
                        rpn_cls: 0.5435, rpn_box: 0.1890, rcnn_cls: 0.8891, rcnn_box 0.1905
[session 1][epoch  6][iter  700] loss: 1.7765, lr: 1.00e-03
                        fg/bg=(319/1729), time cost: 106.280694
                        rpn_cls: 0.4014, rpn_box: 0.1129, rcnn_cls: 0.9987, rcnn_box 0.2862
[session 1][epoch  6][iter  800] loss: 1.7829, lr: 1.00e-03
                        fg/bg=(254/1794), time cost: 107.551383
                        rpn_cls: 0.3835, rpn_box: 0.2285, rcnn_cls: 0.7649, rcnn_box 0.2244
[session 1][epoch  6][iter  900] loss: 1.7912, lr: 1.00e-03
                        fg/bg=(255/1793), time cost: 108.035455
                        rpn_cls: 0.3852, rpn_box: 0.3159, rcnn_cls: 0.7925, rcnn_box 0.1745

The training loss stays at around 1.7 to 1.8 and does not decrease any further, so I wonder whether the training is going well.

It would be helpful if you could share your training log file for comparison.

All 2 comments

I found the cause: the NaN loss at iter 100 was the problem.

After restarting training from scratch, the training loss decreases as expected.
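A single NaN is easy to miss among otherwise normal-looking log lines, so it can help to scan the log automatically. The following is a minimal sketch, assuming the log format shown in the question. The function name `find_nonfinite_losses` and the regular expression are illustrative and not part of faster-rcnn.pytorch.

```python
import math
import re

# Matches header lines of the assumed log format, for example:
# [session 1][epoch  1][iter  100] loss: nan, lr: 1.00e-02
LOG_LINE = re.compile(
    r"\[session\s*(\d+)\]\[epoch\s*(\d+)\]\[iter\s*(\d+)\]\s*loss:\s*([^,\s]+)"
)


def find_nonfinite_losses(lines):
    """Return (session, epoch, iteration) for every logged loss that is NaN or inf."""
    bad = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # timing and per-component loss lines are skipped
        session, epoch, iteration, loss_text = match.groups()
        try:
            loss = float(loss_text)  # float() accepts "nan" and "inf"
        except ValueError:
            continue  # not a number at all; ignore the line
        if not math.isfinite(loss):
            bad.append((int(session), int(epoch), int(iteration)))
    return bad


sample = [
    "[session 1][epoch  1][iter    0] loss: 6.1804, lr: 1.00e-02",
    "[session 1][epoch  1][iter  100] loss: nan, lr: 1.00e-02",
    "[session 1][epoch  1][iter  200] loss: 2.0316, lr: 1.00e-02",
]
print(find_nonfinite_losses(sample))  # [(1, 1, 100)]
```

Running this over a saved log (for example with `open(path)` as the `lines` argument) lists every point where the run should be treated as suspect, even if later iterations print finite values again.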

@jhkim89 how did you fix the NaN loss?
