Vision: Training my own dataset with the official torchvision 0.3 object detection finetuning example for multi-class instance segmentation gives "nan loss_box_reg"?

Created on 6 Jun 2019  ·  8 Comments  ·  Source: pytorch/vision

Labels: enhancement, reference scripts, question, object detection

All 8 comments

Here is the training log I got:

Epoch: [0] [ 0/231] eta: 0:10:13 lr: 0.000021 loss: 7.4132 (7.4132) loss_classifier: 2.0382 (2.0382) loss_box_reg: 0.0140 (0.0140) loss_mask: 1.4018 (1.4018) loss_objectness: 3.4656 (3.4656) loss_rpn_box_reg: 0.4934 (0.4934) time: 2.6561 data: 1.6422 max mem: 4248
Epoch: [0] [ 10/231] eta: 0:03:01 lr: 0.000195 loss: 6.0985 (5.9091) loss_classifier: 1.9702 (1.8952) loss_box_reg: 0.0327 (0.0344) loss_mask: 1.2643 (1.2102) loss_objectness: 2.1992 (2.3107) loss_rpn_box_reg: 0.4934 (0.4586) time: 0.8225 data: 0.1776 max mem: 9283
Epoch: [0] [ 20/231] eta: 0:02:34 lr: 0.000369 loss: 4.5639 (4.5539) loss_classifier: 1.5667 (1.5147) loss_box_reg: 0.0505 (0.0533) loss_mask: 0.8897 (0.9595) loss_objectness: 1.5528 (1.5439) loss_rpn_box_reg: 0.5035 (0.4826) time: 0.6384 data: 0.0235 max mem: 9283
Epoch: [0] [ 30/231] eta: 0:02:20 lr: 0.000543 loss: 1.7964 (3.5635) loss_classifier: 0.5822 (1.1436) loss_box_reg: 0.0997 (0.0862) loss_mask: 0.6035 (0.8217) loss_objectness: 0.1093 (1.0732) loss_rpn_box_reg: 0.4127 (0.4388) time: 0.6309 data: 0.0158 max mem: 9283
Epoch: [0] [ 40/231] eta: 0:02:08 lr: 0.000716 loss: 1.2845 (2.9765) loss_classifier: 0.3288 (0.9400) loss_box_reg: 0.1662 (0.1091) loss_mask: 0.4163 (0.7004) loss_objectness: 0.0595 (0.8257) loss_rpn_box_reg: 0.2974 (0.4013) time: 0.6100 data: 0.0186 max mem: 9283
Epoch: [0] [ 50/231] eta: 0:01:59 lr: 0.000890 loss: 0.9676 (2.5608) loss_classifier: 0.2270 (0.7952) loss_box_reg: 0.1662 (0.1189) loss_mask: 0.2690 (0.6097) loss_objectness: 0.0404 (0.6702) loss_rpn_box_reg: 0.2644 (0.3667) time: 0.5986 data: 0.0182 max mem: 9283
Epoch: [0] [ 60/231] eta: 0:01:52 lr: 0.001064 loss: 0.7760 (2.2629) loss_classifier: 0.1824 (0.6906) loss_box_reg: 0.1575 (0.1266) loss_mask: 0.2311 (0.5441) loss_objectness: 0.0175 (0.5637) loss_rpn_box_reg: 0.1914 (0.3380) time: 0.6290 data: 0.0146 max mem: 9703
Epoch: [0] [ 70/231] eta: 0:01:46 lr: 0.001238 loss: 0.6962 (2.0360) loss_classifier: 0.1340 (0.6108) loss_box_reg: 0.1564 (0.1301) loss_mask: 0.1986 (0.4949) loss_objectness: 0.0173 (0.4870) loss_rpn_box_reg: 0.1682 (0.3132) time: 0.6584 data: 0.0149 max mem: 9703
Epoch: [0] [ 80/231] eta: 0:01:39 lr: 0.001411 loss: 0.5906 (1.8475) loss_classifier: 0.1145 (0.5470) loss_box_reg: 0.1201 (0.1261) loss_mask: 0.1653 (0.4521) loss_objectness: 0.0166 (0.4285) loss_rpn_box_reg: 0.1664 (0.2938) time: 0.6511 data: 0.0148 max mem: 9703
Epoch: [0] [ 90/231] eta: 0:01:32 lr: 0.001585 loss: 0.5296 (1.7063) loss_classifier: 0.0908 (0.4971) loss_box_reg: 0.0893 (0.1215) loss_mask: 0.1486 (0.4207) loss_objectness: 0.0157 (0.3855) loss_rpn_box_reg: 0.1691 (0.2816) time: 0.6458 data: 0.0146 max mem: 9703
Epoch: [0] [100/231] eta: 0:01:25 lr: 0.001759 loss: 0.4635 (1.5807) loss_classifier: 0.0799 (0.4556) loss_box_reg: 0.0624 (0.1149) loss_mask: 0.1470 (0.3932) loss_objectness: 0.0152 (0.3491) loss_rpn_box_reg: 0.1555 (0.2679) time: 0.6192 data: 0.0142 max mem: 9703
Epoch: [0] [110/231] eta: 0:01:18 lr: 0.001933 loss: 0.4267 (1.4783) loss_classifier: 0.0766 (0.4218) loss_box_reg: 0.0534 (0.1094) loss_mask: 0.1414 (0.3711) loss_objectness: 0.0130 (0.3191) loss_rpn_box_reg: 0.1425 (0.2569) time: 0.5953 data: 0.0136 max mem: 9703
Epoch: [0] [120/231] eta: 0:01:11 lr: 0.002106 loss: 0.4233 (1.3926) loss_classifier: 0.0746 (0.3941) loss_box_reg: 0.0510 (0.1047) loss_mask: 0.1418 (0.3526) loss_objectness: 0.0112 (0.2945) loss_rpn_box_reg: 0.1255 (0.2466) time: 0.6173 data: 0.0135 max mem: 9703
Epoch: [0] [130/231] eta: 0:01:05 lr: 0.002280 loss: 0.3902 (1.3229) loss_classifier: 0.0770 (0.3704) loss_box_reg: 0.0494 (0.1010) loss_mask: 0.1462 (0.3378) loss_objectness: 0.0102 (0.2748) loss_rpn_box_reg: 0.1329 (0.2390) time: 0.6762 data: 0.0139 max mem: 10262
Epoch: [0] [140/231] eta: 0:00:58 lr: 0.002454 loss: 0.4536 (1.2612) loss_classifier: 0.0702 (0.3496) loss_box_reg: 0.0534 (0.0977) loss_mask: 0.1431 (0.3247) loss_objectness: 0.0147 (0.2569) loss_rpn_box_reg: 0.1434 (0.2323) time: 0.6482 data: 0.0141 max mem: 10262
Epoch: [0] [150/231] eta: 0:00:52 lr: 0.002627 loss: 0.4593 (1.2083) loss_classifier: 0.0672 (0.3309) loss_box_reg: 0.0480 (0.0943) loss_mask: 0.1339 (0.3126) loss_objectness: 0.0147 (0.2409) loss_rpn_box_reg: 0.1503 (0.2296) time: 0.6193 data: 0.0133 max mem: 10266
Epoch: [0] [160/231] eta: 0:00:45 lr: 0.002801 loss: 0.3742 (1.1563) loss_classifier: 0.0513 (0.3139) loss_box_reg: 0.0356 (0.0908) loss_mask: 0.1266 (0.3013) loss_objectness: 0.0136 (0.2268) loss_rpn_box_reg: 0.1373 (0.2235) time: 0.6235 data: 0.0123 max mem: 10266
Epoch: [0] [170/231] eta: 0:00:39 lr: 0.002975 loss: 0.3722 (1.1102) loss_classifier: 0.0496 (0.2983) loss_box_reg: 0.0338 (0.0875) loss_mask: 0.1281 (0.2910) loss_objectness: 0.0129 (0.2144) loss_rpn_box_reg: 0.1373 (0.2190) time: 0.5895 data: 0.0117 max mem: 10266
Epoch: [0] [180/231] eta: 0:00:32 lr: 0.003149 loss: 0.3752 (1.0699) loss_classifier: 0.0449 (0.2845) loss_box_reg: 0.0352 (0.0849) loss_mask: 0.1335 (0.2827) loss_objectness: 0.0113 (0.2033) loss_rpn_box_reg: 0.1377 (0.2145) time: 0.5790 data: 0.0117 max mem: 10266
Epoch: [0] [190/231] eta: 0:00:25 lr: 0.003322 loss: 0.3604 (1.0340) loss_classifier: 0.0464 (0.2722) loss_box_reg: 0.0378 (0.0823) loss_mask: 0.1266 (0.2742) loss_objectness: 0.0113 (0.1934) loss_rpn_box_reg: 0.1561 (0.2119) time: 0.5889 data: 0.0120 max mem: 10266
Epoch: [0] [200/231] eta: 0:00:19 lr: 0.003496 loss: 0.3412 (1.0001) loss_classifier: 0.0477 (0.2615) loss_box_reg: 0.0307 (0.0799) loss_mask: 0.1151 (0.2660) loss_objectness: 0.0119 (0.1843) loss_rpn_box_reg: 0.1496 (0.2083) time: 0.6061 data: 0.0121 max mem: 10266
Epoch: [0] [210/231] eta: 0:00:13 lr: 0.003670 loss: 0.3412 (0.9711) loss_classifier: 0.0493 (0.2516) loss_box_reg: 0.0307 (0.0776) loss_mask: 0.1160 (0.2598) loss_objectness: 0.0106 (0.1761) loss_rpn_box_reg: 0.1411 (0.2060) time: 0.6413 data: 0.0115 max mem: 10266
Epoch: [0] [220/231] eta: 0:00:06 lr: 0.003844 loss: 0.3592 (0.9424) loss_classifier: 0.0427 (0.2423) loss_box_reg: 0.0286 (0.0753) loss_mask: 0.1161 (0.2531) loss_objectness: 0.0104 (0.1686) loss_rpn_box_reg: 0.1437 (0.2032) time: 0.6326 data: 0.0122 max mem: 10266
Loss is nan, stopping training
{'loss_classifier': tensor(0.4403, device='cuda:0', grad_fn=), 'loss_box_reg': tensor(nan, device='cuda:0', grad_fn=), 'loss_mask': tensor(0.1802, device='cuda:0', grad_fn=), 'loss_objectness': tensor(3.2400, device='cuda:0', grad_fn=), 'loss_rpn_box_reg': tensor(3.9531, device='cuda:0', grad_fn=)}

I only changed the official code tv-training-code.py (linked at the bottom of this page: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html) in the main function: I changed num_classes from 2 to 5 for my own dataset and the training batch size from 2 to 3 (with the batch size left unchanged I get the same nan error), as sketched below.
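
For context, those two changes correspond to something like the following in the tutorial's main() (a sketch only; dataset, utils.collate_fn and get_model_instance_segmentation are the tutorial's own names, and counting the background as one of the five classes is my assumption):

    # my dataset now has 5 classes (assuming background counts as one, as in the tutorial)
    num_classes = 5

    # same DataLoader call as in the tutorial, with batch_size bumped from 2 to 3
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=3, shuffle=True, num_workers=4,
        collate_fn=utils.collate_fn)

    # model construction is unchanged
    model = get_model_instance_segmentation(num_classes)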

If I reduce my task to 2 classes (background and the object of interest), training is OK and the results look good. But with 5 classes, the nan loss_box_reg occurs.

So how can I solve this error?

Hi,

The first thing I'd check is that your bounding boxes aren't degenerate. For example, if one bounding box has xmax <= xmin or ymax <= ymin, then you might have problems with the box encoding. Also, make sure that your boxes are within the image (xmax <= image_width and ymax <= image_height).

Here is how I did it for the COCO dataset; without it, training runs into nan in the box_reg loss as well:
https://github.com/pytorch/vision/blob/aa32c9376c46eb284f2b091f3eb98aec4fd64b03/references/detection/coco_utils.py#L65-L67
for the boxes within the image, and
https://github.com/pytorch/vision/blob/aa32c9376c46eb284f2b091f3eb98aec4fd64b03/references/detection/coco_utils.py#L82-L85
for removing degenerate boxes
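
For reference, here is a minimal sketch of what those two referenced spots in coco_utils.py amount to, assuming boxes is an N x 4 float torch tensor in (xmin, ymin, xmax, ymax) format and w, h are the image width and height:

    def clamp_and_filter_boxes(boxes, labels, masks, w, h):
        # clamp x coordinates to [0, w] and y coordinates to [0, h]
        boxes[:, 0::2].clamp_(min=0, max=w)
        boxes[:, 1::2].clamp_(min=0, max=h)
        # keep only boxes with positive width and height
        keep = (boxes[:, 3] > boxes[:, 1]) & (boxes[:, 2] > boxes[:, 0])
        return boxes[keep], labels[keep], masks[keep]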

Let me know if this fixes your issues

@fmassa thank you very much, that works for my case. I just added the following code snippet to the PennFudanDataset generation, and the nan loss disappeared.

    # keep only non-degenerate boxes (xmax > xmin and ymax > ymin)
    keep = (boxes[:, 3] > boxes[:, 1]) & (boxes[:, 2] > boxes[:, 0])
    boxes = boxes[keep]
    labels = labels[keep]
    masks = masks[keep]
    area = area[keep]
    iscrowd = iscrowd[keep]
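
If it helps anyone else, a quick way to confirm a custom dataset no longer yields problematic targets is to scan it once before training. This is only a hypothetical check, assuming the dataset returns (img, target) pairs with boxes in (xmin, ymin, xmax, ymax) format and that ToTensor is applied, so images come back as C x H x W tensors:

    # hypothetical pre-training check: look for degenerate or out-of-bounds boxes
    for idx in range(len(dataset)):
        img, target = dataset[idx]
        boxes = target["boxes"]
        _, height, width = img.shape  # C x H x W tensor image
        degenerate = (boxes[:, 2] <= boxes[:, 0]) | (boxes[:, 3] <= boxes[:, 1])
        out_of_bounds = (boxes[:, 0] < 0) | (boxes[:, 1] < 0) | \
                        (boxes[:, 2] > width) | (boxes[:, 3] > height)
        if degenerate.any() or out_of_bounds.any():
            print(f"sample {idx} has invalid boxes:", boxes[degenerate | out_of_bounds])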

@XiaoLaoDi Do you think it might be useful to add this information somewhere in the tutorial? If yes, could you send a PR?

@fmassa This information may be useful, but I can't find where in the tutorial to add such a note through a PR. I'm hoping you can add a description of it to the PennFudanDataset part of the tutorial at https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#torchvision-0-3-object-detection-finetuning-tutorial, for the case where the instance segmentation goes from 2 classes to multiple classes.

@XiaoLaoDi here is the notebook you can send PRs against. Note that it's not in the master branch: https://github.com/pytorch/vision/tree/temp-tutorial/tutorials
