OS: Ubuntu 16.04.4 LTS
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10
cuDNN version: 7.41
GPU models and configuration:
GPU 0: RTX 2080Ti
GPU 1: RTX 2080Ti
PyTorch version: 1.0.0a0+a1b2f17 (built from source)
I train with two GPUs.
With IMS_PER_BATCH=2, the training log shows:
2018-11-10 23:07:10,792 maskrcnn_benchmark.trainer INFO: eta: 1 day, 15:38:28 iter: 600 loss: 0.8067 (0.9634) loss_classifier: 0.4021 (0.4952) loss_box_reg: 0.1938 (0.1645) loss_objectness: 0.0955 (0.1932) loss_rpn_box_reg: 0.0603 (0.1105) time: 0.2030 (0.1984) data: 0.0025 (0.0064) lr: 0.002500 max mem: 2055
2018-11-10 23:07:14,763 maskrcnn_benchmark.trainer INFO: eta: 1 day, 15:38:29 iter: 620 loss: 0.8861 (0.9619) loss_classifier: 0.4842 (0.4967) loss_box_reg: 0.2101 (0.1662) loss_objectness: 0.0698 (0.1904) loss_rpn_box_reg: 0.0339 (0.1087) time: 0.1934 (0.1984) data: 0.0027 (0.0063) lr: 0.002500 max mem: 2055
2018-11-10 23:07:18,677 maskrcnn_benchmark.trainer INFO: eta: 1 day, 15:37:24 iter: 640 loss: 0.8827 (0.9617) loss_classifier: 0.4748 (0.4980) loss_box_reg: 0.2044 (0.1678) loss_objectness: 0.0784 (0.1877) loss_rpn_box_reg: 0.0486 (0.1083) time: 0.1917 (0.1983) data: 0.0027 (0.0062) lr: 0.002500 max mem: 2055
With IMS_PER_BATCH=4, the training log shows:
2018-11-10 22:59:43,577 maskrcnn_benchmark.trainer INFO: eta: 2 days, 22:13:32 iter: 620 loss: 0.6196 (0.7886) loss_classifier: 0.3200 (0.3936) loss_box_reg: 0.1238 (0.1176) loss_objectness: 0.1067 (0.1777) loss_rpn_box_reg: 0.0479 (0.0998) time: 0.4458 (0.3514) data: 0.0052 (0.0093) lr: 0.002500 max mem: 3319
2018-11-10 22:59:52,395 maskrcnn_benchmark.trainer INFO: eta: 2 days, 22:46:53 iter: 640 loss: 0.7599 (0.7923) loss_classifier: 0.4215 (0.3962) loss_box_reg: 0.1794 (0.1201) loss_objectness: 0.0991 (0.1761) loss_rpn_box_reg: 0.0877 (0.0999) time: 0.4523 (0.3542) data: 0.0055 (0.0092) lr: 0.002500 max mem: 3319
With IMS_PER_BATCH=8, the training log shows:
2018-11-10 23:14:26,879 maskrcnn_benchmark.trainer INFO: eta: 5 days, 2:40:50 iter: 620 loss: 0.8377 (0.7058) loss_classifier: 0.4479 (0.3423) loss_box_reg: 0.1804 (0.0910) loss_objectness: 0.1076 (0.1768) loss_rpn_box_reg: 0.0946 (0.0957) time: 0.5939 (0.6139) data: 0.0113 (0.0136) lr: 0.002500 max mem: 6111
2018-11-10 23:14:38,773 maskrcnn_benchmark.trainer INFO: eta: 5 days, 2:33:26 iter: 640 loss: 0.8840 (0.7114) loss_classifier: 0.5055 (0.3468) loss_box_reg: 0.1873 (0.0941) loss_objectness: 0.1070 (0.1750) loss_rpn_box_reg: 0.0651 (0.0954) time: 0.5935 (0.6133) data: 0.0121 (0.0135) lr: 0.002500 max mem: 6111
2018-11-10 23:14:50,503 maskrcnn_benchmark.trainer INFO: eta: 5 days, 2:23:30 iter: 660 loss: 0.8384 (0.7163) loss_classifier: 0.4757 (0.3512) loss_box_reg: 0.1834 (0.0971) loss_objectness: 0.1200 (0.1732) loss_rpn_box_reg: 0.0555 (0.0948) time: 0.5826 (0.6125) data: 0.0112 (0.0135) lr: 0.002500 max mem: 6111
Why does the ETA get longer as IMS_PER_BATCH increases? I would expect a larger batch size to make training faster.
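A quick sanity check of where those ETAs come from: the trainer reports (remaining iterations) × (average seconds per iteration). The MAX_ITER value below (~720000) is not from the config, it is inferred by back-solving the logged ETAs, but it comes out the same for all three runs, so only the per-iteration time differs:

```python
from datetime import timedelta

# MAX_ITER inferred from the logs above (an assumption, not read from the
# config); it appears to be identical in all three runs, so a larger batch
# only raises the seconds-per-iteration and hence the ETA.
MAX_ITER = 720_000

def eta(avg_sec_per_iter, current_iter, max_iter=MAX_ITER):
    """ETA = remaining iterations times average time per iteration."""
    return timedelta(seconds=round((max_iter - current_iter) * avg_sec_per_iter))

print(eta(0.1983, 640))  # IMS_PER_BATCH=2 -> 1 day, 15:37:29 (log: 1 day, 15:37:24)
print(eta(0.3542, 640))  # IMS_PER_BATCH=4 -> 2 days, 22:46:37 (log: 2 days, 22:46:53)
print(eta(0.6133, 640))  # IMS_PER_BATCH=8 -> 5 days, 2:33:03 (log: 5 days, 2:33:26)
```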
@auroua
According to the "linear scaling rule", you can run fewer iterations when using a larger batch size. You can check the examples they used in Detectron.
The solution is precisely what @chengyangfu mentioned: if you increase the batch size, you can decrease the number of iterations (and you should also adapt the learning rate accordingly).
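The adjustment can be sketched as a small helper (a hypothetical function, not part of the maskrcnn_benchmark API; the reference numbers assume Detectron's 1x schedule of BASE_LR=0.02, MAX_ITER=90000, STEPS=(60000, 80000) at 16 images per batch):

```python
# Linear scaling rule: when the total batch size grows by a factor k,
# multiply the learning rate by k and divide the iteration count (and
# LR-decay steps) by k, so the model sees the same number of images.

def scale_schedule(base_lr, max_iter, steps, base_batch, new_batch):
    """Rescale an LR schedule for a new total batch size (sketch only)."""
    k = new_batch / base_batch
    return {
        "lr": base_lr * k,
        "max_iter": int(round(max_iter / k)),
        "steps": tuple(int(round(s / k)) for s in steps),
    }

# Halving the batch from 16 to 8 halves the LR and doubles the schedule:
cfg = scale_schedule(0.02, 90000, (60000, 80000), base_batch=16, new_batch=8)
print(cfg)  # {'lr': 0.01, 'max_iter': 180000, 'steps': (120000, 160000)}
```

So a larger IMS_PER_BATCH does cost more seconds per iteration, but with a proportionally shorter schedule the total wall-clock time stays roughly the same or improves.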
I'm closing the issue, but let us know if you have further questions.
@chengyangfu @fmassa Thanks.