Hi @jwyang, first of all, thank you for your great work!
It works well on the PASCAL VOC dataset, and training is fast.
On COCO, however, I trained on our 8-GPU machine with a batch size of 16, and the training log looks like this:
[session 1][epoch 1][iter 0] loss: 6.1804, lr: 1.00e-02
fg/bg=(426/1622), time cost: 35.842244
rpn_cls: 0.6897, rpn_box: 0.3679, rcnn_cls: 4.6646, rcnn_box 0.4582
[session 1][epoch 1][iter 100] loss: nan, lr: 1.00e-02
fg/bg=(296/1752), time cost: 104.216505
rpn_cls: 0.5585, rpn_box: 0.1213, rcnn_cls: 1.1079, rcnn_box 0.2641
[session 1][epoch 1][iter 200] loss: 2.0316, lr: 1.00e-02
fg/bg=(292/1756), time cost: 97.825786
rpn_cls: 0.4695, rpn_box: 0.1333, rcnn_cls: 0.9935, rcnn_box 0.2353
[session 1][epoch 1][iter 300] loss: 1.9466, lr: 1.00e-02
fg/bg=(309/1739), time cost: 96.629395
rpn_cls: 0.4684, rpn_box: 0.2638, rcnn_cls: 1.0195, rcnn_box 0.2282
[session 1][epoch 1][iter 400] loss: 1.8687, lr: 1.00e-02
fg/bg=(351/1697), time cost: 96.798322
rpn_cls: 0.3513, rpn_box: 0.3516, rcnn_cls: 1.1189, rcnn_box 0.3672
[session 1][epoch 1][iter 500] loss: 1.8504, lr: 1.00e-02
fg/bg=(389/1659), time cost: 97.335257
rpn_cls: 0.5053, rpn_box: 0.1943, rcnn_cls: 1.1378, rcnn_box 0.3056
After 6 epochs:
[session 1][epoch 6][iter 0] loss: 2.1971, lr: 1.00e-03
fg/bg=(371/1677), time cost: 10.386064
rpn_cls: 0.4443, rpn_box: 0.1902, rcnn_cls: 1.2200, rcnn_box 0.3425
[session 1][epoch 6][iter 100] loss: 1.7717, lr: 1.00e-03
fg/bg=(359/1689), time cost: 102.440927
rpn_cls: 0.5265, rpn_box: 0.4803, rcnn_cls: 1.0865, rcnn_box 0.2994
[session 1][epoch 6][iter 200] loss: 1.8162, lr: 1.00e-03
fg/bg=(270/1778), time cost: 101.048738
rpn_cls: 0.4605, rpn_box: 0.2271, rcnn_cls: 0.7392, rcnn_box 0.1698
[session 1][epoch 6][iter 300] loss: 1.8152, lr: 1.00e-03
fg/bg=(327/1721), time cost: 100.423944
rpn_cls: 0.4267, rpn_box: 0.1103, rcnn_cls: 0.9151, rcnn_box 0.2822
[session 1][epoch 6][iter 400] loss: 1.7705, lr: 1.00e-03
fg/bg=(255/1793), time cost: 101.397022
rpn_cls: 0.3407, rpn_box: 0.0913, rcnn_cls: 0.8353, rcnn_box 0.2553
[session 1][epoch 6][iter 500] loss: 1.7776, lr: 1.00e-03
fg/bg=(274/1774), time cost: 107.594405
rpn_cls: 0.3600, rpn_box: 0.1331, rcnn_cls: 0.8416, rcnn_box 0.2436
[session 1][epoch 6][iter 600] loss: 1.8178, lr: 1.00e-03
fg/bg=(267/1781), time cost: 108.269909
rpn_cls: 0.5435, rpn_box: 0.1890, rcnn_cls: 0.8891, rcnn_box 0.1905
[session 1][epoch 6][iter 700] loss: 1.7765, lr: 1.00e-03
fg/bg=(319/1729), time cost: 106.280694
rpn_cls: 0.4014, rpn_box: 0.1129, rcnn_cls: 0.9987, rcnn_box 0.2862
[session 1][epoch 6][iter 800] loss: 1.7829, lr: 1.00e-03
fg/bg=(254/1794), time cost: 107.551383
rpn_cls: 0.3835, rpn_box: 0.2285, rcnn_cls: 0.7649, rcnn_box 0.2244
[session 1][epoch 6][iter 900] loss: 1.7912, lr: 1.00e-03
fg/bg=(255/1793), time cost: 108.035455
rpn_cls: 0.3852, rpn_box: 0.3159, rcnn_cls: 0.7925, rcnn_box 0.1745
The training loss stays around 1.7~1.8 and does not decrease any further, so I wonder whether the training is going well.
It would be helpful if you could share your training log file for comparison.
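For anyone who wants to compare logs, here is a minimal sketch that pulls the total loss out of log lines in the format pasted above, flags non-finite entries, and averages the finite ones per epoch. The names `parse_losses` and `summarize` are made up for illustration and are not part of the repository.

```python
import math
import re

# Matches header lines such as
# "[session 1][epoch 6][iter 100] loss: 1.7717, lr: 1.00e-03"
HEADER = re.compile(
    r"\[session (\d+)\]\[epoch (\d+)\]\[iter (\d+)\] loss: ([^,\s]+), lr: (\S+)"
)


def parse_losses(lines):
    """Return (epoch, iteration, loss) tuples for every header line found."""
    records = []
    for line in lines:
        match = HEADER.search(line)
        if match:
            _, epoch, iteration, loss, _ = match.groups()
            # float("nan") parses fine, so nan entries are kept as records.
            records.append((int(epoch), int(iteration), float(loss)))
    return records


def summarize(records):
    """Return (non-finite entries, mean of the finite losses per epoch)."""
    bad = [(e, i) for e, i, loss in records if not math.isfinite(loss)]
    per_epoch = {}
    for e, _, loss in records:
        if math.isfinite(loss):
            per_epoch.setdefault(e, []).append(loss)
    means = {e: sum(v) / len(v) for e, v in per_epoch.items()}
    return bad, means
```

Calling `summarize(parse_losses(open("train.log")))` (the file name is a placeholder) gives a quick view of where a nan appeared and whether the per-epoch mean is still moving.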
I found the reason: the nan loss at iter 100 was the problem.
The training loss decreases well after retraining.
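Note that the four component losses printed at iter 100 are all finite even though the total is nan. That would be consistent with the printed loss being an average over the 100-iteration display interval, where a single non-finite iteration poisons the whole average, but that is an assumption and not something confirmed in this thread. Under that assumption, a minimal sketch of an interval meter that skips non-finite values and counts them could look like this (the class `IntervalLoss` is hypothetical, not code from the repository):

```python
import math


class IntervalLoss:
    """Running mean of the loss over a display interval, skipping nan/inf values."""

    def __init__(self):
        self.total = 0.0
        self.count = 0
        self.skipped = 0

    def update(self, value):
        """Add one iteration's loss; return False when the value is nan or inf."""
        if not math.isfinite(value):
            self.skipped += 1
            return False
        self.total += value
        self.count += 1
        return True

    def flush(self):
        """Return (mean, skipped) for the interval and reset the counters."""
        mean = self.total / self.count if self.count else float("nan")
        skipped = self.skipped
        self.total, self.count, self.skipped = 0.0, 0, 0
        return mean, skipped
```

In a training loop, the `False` return value could also be used to skip the optimizer step for that iteration, so that one bad batch does not corrupt the weights. Whether that is the right remedy depends on what actually produced the nan, which this thread does not say.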
@jhkim89 how did you fix the nan loss?