Training Faster R-CNN with a ResNet-50 backbone, the loss turns to NaN after a few hundred iterations:
2019-09-04 17:07:17,711 maskrcnn_benchmark.trainer INFO: eta: 9:19:05 iter: 660 loss_box_reg: 0.0632 (0.0884) loss: 0.7677 (0.8246) loss_classifier: 0.3055 (0.3680) loss_rpn_box_reg: 0.1366 (0.1266) loss_objectness: 0.2651 (0.2416) data: 0.0139 (0.0183) time: 0.3717 (0.3755) lr: 0.020000 max mem: 3889
2019-09-04 17:07:25,205 maskrcnn_benchmark.trainer INFO: eta: 9:18:55 iter: 680 loss_box_reg: 0.0735 (0.0883) loss: 0.7835 (0.8263) loss_classifier: 0.3279 (0.3691) loss_rpn_box_reg: 0.0588 (0.1269) loss_objectness: 0.1906 (0.2420) data: 0.0167 (0.0183) time: 0.3737 (0.3755) lr: 0.020000 max mem: 3889
2019-09-04 17:07:32,679 maskrcnn_benchmark.trainer INFO: eta: 9:18:43 iter: 700 loss_box_reg: 0.0957 (0.0887) loss: 0.7356 (0.8261) loss_classifier: 0.3199 (0.3689) loss_rpn_box_reg: 0.0783 (0.1261) loss_objectness: 0.2149 (0.2424) data: 0.0165 (0.0183) time: 0.3685 (0.3754) lr: 0.020000 max mem: 3889
2019-09-04 17:07:40,123 maskrcnn_benchmark.trainer INFO: eta: 9:18:28 iter: 720 loss_box_reg: 0.0658 (0.0884) loss: 0.9130 (0.8315) loss_classifier: 0.3697 (0.3699) loss_rpn_box_reg: 0.1159 (0.1289) loss_objectness: 0.2715 (0.2443) data: 0.0176 (0.0183) time: 0.3727 (0.3753) lr: 0.020000 max mem: 3889
2019-09-04 17:07:47,428 maskrcnn_benchmark.trainer INFO: eta: 9:17:56 iter: 740 loss_box_reg: 0.1149 (nan) loss: 2.4033 (nan) loss_classifier: 0.9854 (nan) loss_rpn_box_reg: 0.1203 (4793.8136) loss_objectness: 0.4745 (4714.1043) data: 0.0145 (0.0182) time: 0.3639 (0.3750) lr: 0.020000 max mem: 3889
2019-09-04 17:07:54,418 maskrcnn_benchmark.trainer INFO: eta: 9:16:48 iter: 760 loss_box_reg: nan (nan) loss: nan (nan) loss_classifier: nan (nan) loss_rpn_box_reg: 0.0713 (4667.6634) loss_objectness: 0.4063 (4590.0605) data: 0.0152 (0.0182) time: 0.3483 (0.3744) lr: 0.020000 max mem: 3889
2019-09-04 17:08:01,473 maskrcnn_benchmark.trainer INFO: eta: 9:15:51 iter: 780 loss_box_reg: nan (nan) loss: nan (nan) loss_classifier: nan (nan) loss_rpn_box_reg: 0.0868 (4547.9835) loss_objectness: 0.4457 (4472.3782) data: 0.0177 (0.0181) time: 0.3491 (0.3738) lr: 0.020000 max mem: 3889
2019-09-04 17:08:08,550 maskrcnn_benchmark.trainer INFO: eta: 9:15:00 iter: 800 loss_box_reg: nan (nan) loss: nan (nan) loss_classifier: nan (nan) loss_rpn_box_reg: 0.0950 (4434.2873) lo
python3 tools/train_net.py --config-file configs/e2e_faster_rcnn_R_50_FPN_1x.yaml
I believe the learning rate is too high. The config you are using was written for training on 8 GPUs, and your log shows it running unchanged at lr: 0.020000. You should scale the SOLVER parameters according to the number of GPUs you actually train on.
More on that here
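As a rough illustration of that scaling, here is a minimal sketch that applies the linear scaling rule: the learning rate shrinks in proportion to the batch size, and the schedule is stretched by the same factor. The helper `scale_solver` is hypothetical, not part of maskrcnn_benchmark. The reference values (16 images per batch, 90000 iterations, steps at 60000 and 80000) are assumptions about the 8-GPU config; only the base learning rate of 0.02 appears in the log above, so check the rest against your own yaml.

```python
def scale_solver(base_lr, max_iter, steps, ref_batch, new_batch):
    """Scale solver settings from a reference batch size to a new one.

    Hypothetical helper: the learning rate is multiplied by
    new_batch / ref_batch, and the iteration counts are divided by it.
    """
    factor = new_batch / ref_batch
    return {
        "SOLVER.IMS_PER_BATCH": new_batch,
        "SOLVER.BASE_LR": base_lr * factor,
        "SOLVER.MAX_ITER": round(max_iter / factor),
        "SOLVER.STEPS": tuple(round(s / factor) for s in steps),
    }


# Assumed 8-GPU reference values; 1 GPU with 2 images per batch.
single_gpu = scale_solver(
    base_lr=0.02,
    max_iter=90000,
    steps=(60000, 80000),
    ref_batch=16,
    new_batch=2,
)

for key, value in single_gpu.items():
    print(key, value)
```

The resulting values can be written into a copy of the yaml file, or appended to the training command as KEY VALUE pairs if your version of tools/train_net.py accepts config overrides that way.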