OS: Ubuntu 16.04.4 LTS
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10
cuDNN version: 7.41
GPU models and configuration:
GPU 0: RTX 2080Ti
GPU 1: RTX 2080Ti
PyTorch version: 1.0.0a0+a1b2f17 (built from source)
I train with two GPUs.
With IMS_PER_BATCH=2, the training log shows:
2018-11-10 23:07:10,792 maskrcnn_benchmark.trainer INFO: eta: 1 day, 15:38:28 iter: 600 loss: 0.8067 (0.9634) loss_classifier: 0.4021 (0.4952) loss_box_reg: 0.1938 (0.1645) loss_objectness: 0.0955 (0.1932) loss_rpn_box_reg: 0.0603 (0.1105) time: 0.2030 (0.1984) data: 0.0025 (0.0064) lr: 0.002500 max mem: 2055
2018-11-10 23:07:14,763 maskrcnn_benchmark.trainer INFO: eta: 1 day, 15:38:29 iter: 620 loss: 0.8861 (0.9619) loss_classifier: 0.4842 (0.4967) loss_box_reg: 0.2101 (0.1662) loss_objectness: 0.0698 (0.1904) loss_rpn_box_reg: 0.0339 (0.1087) time: 0.1934 (0.1984) data: 0.0027 (0.0063) lr: 0.002500 max mem: 2055
2018-11-10 23:07:18,677 maskrcnn_benchmark.trainer INFO: eta: 1 day, 15:37:24 iter: 640 loss: 0.8827 (0.9617) loss_classifier: 0.4748 (0.4980) loss_box_reg: 0.2044 (0.1678) loss_objectness: 0.0784 (0.1877) loss_rpn_box_reg: 0.0486 (0.1083) time: 0.1917 (0.1983) data: 0.0027 (0.0062) lr: 0.002500 max mem: 2055
With IMS_PER_BATCH=4, the training log shows:
2018-11-10 22:59:43,577 maskrcnn_benchmark.trainer INFO: eta: 2 days, 22:13:32 iter: 620 loss: 0.6196 (0.7886) loss_classifier: 0.3200 (0.3936) loss_box_reg: 0.1238 (0.1176) loss_objectness: 0.1067 (0.1777) loss_rpn_box_reg: 0.0479 (0.0998) time: 0.4458 (0.3514) data: 0.0052 (0.0093) lr: 0.002500 max mem: 3319
2018-11-10 22:59:52,395 maskrcnn_benchmark.trainer INFO: eta: 2 days, 22:46:53 iter: 640 loss: 0.7599 (0.7923) loss_classifier: 0.4215 (0.3962) loss_box_reg: 0.1794 (0.1201) loss_objectness: 0.0991 (0.1761) loss_rpn_box_reg: 0.0877 (0.0999) time: 0.4523 (0.3542) data: 0.0055 (0.0092) lr: 0.002500 max mem: 3319
With IMS_PER_BATCH=8, the training log shows:
2018-11-10 23:14:26,879 maskrcnn_benchmark.trainer INFO: eta: 5 days, 2:40:50 iter: 620 loss: 0.8377 (0.7058) loss_classifier: 0.4479 (0.3423) loss_box_reg: 0.1804 (0.0910) loss_objectness: 0.1076 (0.1768) loss_rpn_box_reg: 0.0946 (0.0957) time: 0.5939 (0.6139) data: 0.0113 (0.0136) lr: 0.002500 max mem: 6111
2018-11-10 23:14:38,773 maskrcnn_benchmark.trainer INFO: eta: 5 days, 2:33:26 iter: 640 loss: 0.8840 (0.7114) loss_classifier: 0.5055 (0.3468) loss_box_reg: 0.1873 (0.0941) loss_objectness: 0.1070 (0.1750) loss_rpn_box_reg: 0.0651 (0.0954) time: 0.5935 (0.6133) data: 0.0121 (0.0135) lr: 0.002500 max mem: 6111
2018-11-10 23:14:50,503 maskrcnn_benchmark.trainer INFO: eta: 5 days, 2:23:30 iter: 660 loss: 0.8384 (0.7163) loss_classifier: 0.4757 (0.3512) loss_box_reg: 0.1834 (0.0971) loss_objectness: 0.1200 (0.1732) loss_rpn_box_reg: 0.0555 (0.0948) time: 0.5826 (0.6125) data: 0.0112 (0.0135) lr: 0.002500 max mem: 6111
Why does the ETA get longer as IMS_PER_BATCH increases? I would expect a larger batch size to make training faster.
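A quick sanity check of where those ETAs come from: the trainer reports (remaining iterations) × (average seconds per iteration). The MAX_ITER value below (~720000) is not from the config, it is inferred by back-solving the logged ETAs, but it comes out the same for all three runs, so only the per-iteration time differs:

```python
from datetime import timedelta

# MAX_ITER inferred from the logs above (an assumption, not read from the
# config); it appears to be identical in all three runs, so a larger batch
# only raises the seconds-per-iteration and hence the ETA.
MAX_ITER = 720_000

def eta(avg_sec_per_iter, current_iter, max_iter=MAX_ITER):
    """ETA = remaining iterations times average time per iteration."""
    return timedelta(seconds=round((max_iter - current_iter) * avg_sec_per_iter))

print(eta(0.1983, 640))  # IMS_PER_BATCH=2 -> 1 day, 15:37:29 (log: 1 day, 15:37:24)
print(eta(0.3542, 640))  # IMS_PER_BATCH=4 -> 2 days, 22:46:37 (log: 2 days, 22:46:53)
print(eta(0.6133, 640))  # IMS_PER_BATCH=8 -> 5 days, 2:33:03 (log: 5 days, 2:33:26)
```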
@auroua
According to the "linear scaling rule", you can run fewer iterations when using a larger batch size. You can check the examples they used in Detectron.
The solution is precisely what @chengyangfu mentioned: if you increase the batch size, you can decrease the number of iterations (and you should also adapt the learning rate accordingly).
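The adjustment can be sketched as a small helper (a hypothetical function, not part of the maskrcnn_benchmark API; the reference numbers assume Detectron's 1x schedule of BASE_LR=0.02, MAX_ITER=90000, STEPS=(60000, 80000) at 16 images per batch):

```python
# Linear scaling rule: when the total batch size grows by a factor k,
# multiply the learning rate by k and divide the iteration count (and
# LR-decay steps) by k, so the model sees the same number of images.

def scale_schedule(base_lr, max_iter, steps, base_batch, new_batch):
    """Rescale an LR schedule for a new total batch size (sketch only)."""
    k = new_batch / base_batch
    return {
        "lr": base_lr * k,
        "max_iter": int(round(max_iter / k)),
        "steps": tuple(int(round(s / k)) for s in steps),
    }

# Halving the batch from 16 to 8 halves the LR and doubles the schedule:
cfg = scale_schedule(0.02, 90000, (60000, 80000), base_batch=16, new_batch=8)
print(cfg)  # {'lr': 0.01, 'max_iter': 180000, 'steps': (120000, 160000)}
```

So a larger IMS_PER_BATCH does cost more seconds per iteration, but with a proportionally shorter schedule the total wall-clock time stays roughly the same or improves.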
I'm closing the issue, but let us know if you have further questions.
@chengyangfu @fmassa Thanks.