maskrcnn-benchmark: Why does a larger batch size make training slower?

Created on 10 Nov 2018 · 3 Comments · Source: facebookresearch/maskrcnn-benchmark

โ“ Questions and Help

OS: Ubuntu 16.04.4 LTS
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10
cuDNN version: 7.4.1
GPU models and configuration:
GPU 0: RTX 2080Ti
GPU 1: RTX 2080Ti

PyTorch version: 1.0.0a0+a1b2f17 (I built it myself)

I train with two GPUs.
With IMS_PER_BATCH=2, the training log shows:

2018-11-10 23:07:10,792 maskrcnn_benchmark.trainer INFO: eta: 1 day, 15:38:28  iter: 600  loss: 0.8067 (0.9634)  loss_classifier: 0.4021 (0.4952)  loss_box_reg: 0.1938 (0.1645)  loss_objectness: 0.0955 (0.1932)  loss_rpn_box_reg: 0.0603 (0.1105)  time: 0.2030 (0.1984)  data: 0.0025 (0.0064)  lr: 0.002500  max mem: 2055
2018-11-10 23:07:14,763 maskrcnn_benchmark.trainer INFO: eta: 1 day, 15:38:29  iter: 620  loss: 0.8861 (0.9619)  loss_classifier: 0.4842 (0.4967)  loss_box_reg: 0.2101 (0.1662)  loss_objectness: 0.0698 (0.1904)  loss_rpn_box_reg: 0.0339 (0.1087)  time: 0.1934 (0.1984)  data: 0.0027 (0.0063)  lr: 0.002500  max mem: 2055
2018-11-10 23:07:18,677 maskrcnn_benchmark.trainer INFO: eta: 1 day, 15:37:24  iter: 640  loss: 0.8827 (0.9617)  loss_classifier: 0.4748 (0.4980)  loss_box_reg: 0.2044 (0.1678)  loss_objectness: 0.0784 (0.1877)  loss_rpn_box_reg: 0.0486 (0.1083)  time: 0.1917 (0.1983)  data: 0.0027 (0.0062)  lr: 0.002500  max mem: 2055

With IMS_PER_BATCH=4, the training log shows:

2018-11-10 22:59:43,577 maskrcnn_benchmark.trainer INFO: eta: 2 days, 22:13:32  iter: 620  loss: 0.6196 (0.7886)  loss_classifier: 0.3200 (0.3936)  loss_box_reg: 0.1238 (0.1176)  loss_objectness: 0.1067 (0.1777)  loss_rpn_box_reg: 0.0479 (0.0998)  time: 0.4458 (0.3514)  data: 0.0052 (0.0093)  lr: 0.002500  max mem: 3319
2018-11-10 22:59:52,395 maskrcnn_benchmark.trainer INFO: eta: 2 days, 22:46:53  iter: 640  loss: 0.7599 (0.7923)  loss_classifier: 0.4215 (0.3962)  loss_box_reg: 0.1794 (0.1201)  loss_objectness: 0.0991 (0.1761)  loss_rpn_box_reg: 0.0877 (0.0999)  time: 0.4523 (0.3542)  data: 0.0055 (0.0092)  lr: 0.002500  max mem: 3319

With IMS_PER_BATCH=8, the training log shows:

2018-11-10 23:14:26,879 maskrcnn_benchmark.trainer INFO: eta: 5 days, 2:40:50  iter: 620  loss: 0.8377 (0.7058)  loss_classifier: 0.4479 (0.3423)  loss_box_reg: 0.1804 (0.0910)  loss_objectness: 0.1076 (0.1768)  loss_rpn_box_reg: 0.0946 (0.0957)  time: 0.5939 (0.6139)  data: 0.0113 (0.0136)  lr: 0.002500  max mem: 6111
2018-11-10 23:14:38,773 maskrcnn_benchmark.trainer INFO: eta: 5 days, 2:33:26  iter: 640  loss: 0.8840 (0.7114)  loss_classifier: 0.5055 (0.3468)  loss_box_reg: 0.1873 (0.0941)  loss_objectness: 0.1070 (0.1750)  loss_rpn_box_reg: 0.0651 (0.0954)  time: 0.5935 (0.6133)  data: 0.0121 (0.0135)  lr: 0.002500  max mem: 6111
2018-11-10 23:14:50,503 maskrcnn_benchmark.trainer INFO: eta: 5 days, 2:23:30  iter: 660  loss: 0.8384 (0.7163)  loss_classifier: 0.4757 (0.3512)  loss_box_reg: 0.1834 (0.0971)  loss_objectness: 0.1200 (0.1732)  loss_rpn_box_reg: 0.0555 (0.0948)  time: 0.5826 (0.6125)  data: 0.0112 (0.0135)  lr: 0.002500  max mem: 6111

Why does a larger IMS_PER_BATCH give a longer ETA? I thought a larger batch size should make training faster.
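
In fact, a rough sanity check on the averaged iteration times from the logs above suggests that throughput in images per second actually goes up with batch size, which makes the growing ETA even more confusing:

```python
# Throughput implied by the averaged iteration times logged above:
# (IMS_PER_BATCH, average seconds per iteration) pairs from the logs.
runs = [(2, 0.1984), (4, 0.3542), (8, 0.6133)]

for ims_per_batch, sec_per_iter in runs:
    imgs_per_sec = ims_per_batch / sec_per_iter
    print(f"IMS_PER_BATCH={ims_per_batch}: {sec_per_iter:.4f} s/iter "
          f"-> {imgs_per_sec:.1f} img/s")

# IMS_PER_BATCH=2: 0.1984 s/iter -> 10.1 img/s
# IMS_PER_BATCH=4: 0.3542 s/iter -> 11.3 img/s
# IMS_PER_BATCH=8: 0.6133 s/iter -> 13.0 img/s
```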

question

All 3 comments

@auroua
According to the "linear scaling rule", you can run fewer iterations when using a larger batch size. You can check the example they used in Detectron.
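
A minimal sketch of the rule: for a k-times larger batch, multiply the learning rate by k and divide the iteration counts by k. The baseline lr 0.0025 at IMS_PER_BATCH=2 matches the logs above; the 720k-iteration schedule and the `scale_schedule` helper are illustrative, not part of the repo, so substitute your own config's values:

```python
# Linear scaling rule (Goyal et al., "Accurate, Large Minibatch SGD"):
# for a k-times larger batch, multiply the learning rate by k and
# divide the iteration counts by k, so the total number of images
# seen over the whole schedule stays the same.
def scale_schedule(base_batch, base_lr, base_max_iter, base_steps, new_batch):
    k = new_batch / base_batch
    return {
        "SOLVER.IMS_PER_BATCH": new_batch,
        "SOLVER.BASE_LR": base_lr * k,
        "SOLVER.MAX_ITER": int(base_max_iter / k),
        "SOLVER.STEPS": tuple(int(s / k) for s in base_steps),
    }

# Baseline: lr 0.0025 at IMS_PER_BATCH=2, as in the logs above;
# the 720k-iteration schedule is illustrative.
print(scale_schedule(2, 0.0025, 720000, (480000, 640000), new_batch=8))
# {'SOLVER.IMS_PER_BATCH': 8, 'SOLVER.BASE_LR': 0.01,
#  'SOLVER.MAX_ITER': 180000, 'SOLVER.STEPS': (120000, 160000)}
```

Scaled this way, batch 8 lands on BASE_LR 0.01 and MAX_ITER 180000, and at batch 16 you would recover the familiar Detectron 1x schedule (BASE_LR 0.02, MAX_ITER 90000), so total training time should shrink, not grow.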

The solution is precisely what @chengyangfu mentioned: if you increase the batch size, you can decrease the number of iterations (and you should also adapt the learning rate accordingly).

I'm closing the issue, but let us know if you have further questions.

@chengyangfu @fmassa Thanks.
