Pytorch-lightning: How to scale learning rate with batch size for DDP training?

Created on 28 Sep 2020 · 6 comments · Source: PyTorchLightning/pytorch-lightning

When using the LARS optimizer, the learning rate is usually scaled linearly with the batch size.
Suppose I set the base_lr to 0.1 * batch_size / 256.
Then for single-GPU training with batch size 512, the learning rate should be 0.1 * 2 = 0.2.

However, when I use 2 GPUs with the DDP backend and a batch size of 512 on each GPU, should my learning rate be:

  • 0.1 * 2 = 0.2
  • or 0.1 * 2 * 2 (no. of GPUs) = 0.4
question
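
For concreteness, a minimal sketch of the two candidate computations from the question (all names are illustrative, not from any library):

```python
base_lr = 0.1
reference_batch_size = 256
per_gpu_batch_size = 512
num_gpus = 2

# Option 1: scale by the per-GPU batch size only.
lr_per_gpu = base_lr * per_gpu_batch_size / reference_batch_size            # 0.2

# Option 2: scale by the total batch size across all GPUs.
lr_total = base_lr * per_gpu_batch_size * num_gpus / reference_batch_size   # 0.4
```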

All 6 comments

Just to clarify: if you use batch_size=512 with the DDP backend, each GPU will train on a batch size of 512 in Lightning. Do you want 512 on each GPU, or 256 on each?
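
To make the per-GPU semantics concrete, a hedged sketch (train_dataset and model are placeholders; the Trainer arguments reflect the Lightning API around the time of this issue):

```python
from torch.utils.data import DataLoader
import pytorch_lightning as pl

# batch_size is per process: with gpus=2 under DDP, each GPU draws 512
# samples per step, for a total of 1024 samples per optimizer step.
train_loader = DataLoader(train_dataset, batch_size=512, shuffle=True)
trainer = pl.Trainer(gpus=2, distributed_backend="ddp")
trainer.fit(model, train_loader)
```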

Hi, I want each GPU to have a batch size of 512, so the two GPUs will have a total batch size of 1024. I don't know whether I should set the learning rate based on the total batch size or on the per-GPU batch size.

In DDP the gradients are averaged and synced across devices before optimizer_step, so I don't think the LR should be changed here.
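
To illustrate that averaging, a rough sketch of what DDP effectively does after backward() (real DDP buckets these all-reduces and overlaps them with the backward pass; this assumes an initialized process group):

```python
import torch.distributed as dist

def average_gradients(model):
    # Sum each parameter's gradient across all ranks, then divide by the
    # world size, so every replica steps with the same averaged gradient.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```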

As far as I know, the learning rate is scaled with the batch size so that the sample variance of the gradients is kept approximately constant.

Since DDP averages the gradients from all devices, I think the LR should be scaled in proportion to the effective batch size, namely batch_size * num_accumulated_batches * num_gpus * num_nodes.

In this case, assuming batch_size=512, num_accumulated_batches=1, num_gpus=2 and num_nodes=1, the effective batch size is 1024, so the LR should be scaled by sqrt(2) compared to a single GPU with an effective batch size of 512.
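
A minimal sketch of that rule (the function name and defaults are illustrative):

```python
import math

def scale_lr(base_lr, batch_size, num_accumulated_batches=1,
             num_gpus=1, num_nodes=1, reference_batch_size=512):
    # Square-root scaling: keeps the sample variance of the averaged
    # gradient roughly constant as the effective batch size grows.
    effective_batch_size = (batch_size * num_accumulated_batches
                            * num_gpus * num_nodes)
    return base_lr * math.sqrt(effective_batch_size / reference_batch_size)

# 512 per GPU on 2 GPUs -> effective batch size 1024, so the LR is
# scaled by sqrt(2) relative to a single GPU with batch size 512.
print(scale_lr(0.2, batch_size=512, num_gpus=2))  # ≈ 0.283
```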

Thank you all for your answers. I'll scale the LR with the total effective batch size.

This is mentioned only briefly in the DDP documentation; perhaps it should also be mentioned in the TPU section of the docs, since TPU training uses DDP. This was my case: I understood it as needing to scale the batch size to match the effective learning rate, but it was hard to find confirmation of this even across several threads on the subject in various places.
