Pytorch-lightning: How to scale learning rate with batch size for DDP training?

Created on 28 Sep 2020 · 6 comments · Source: PyTorchLightning/pytorch-lightning

When using the LARS optimizer, the learning rate is usually scaled linearly with the batch size.
Suppose I set the base_lr to 0.1 * batch_size / 256.
Then for single-GPU training with batch size 512, the learning rate should be 0.1 * 2 = 0.2.

However, when I use 2 GPUs with the DDP backend and a batch size of 512 on each GPU, should my learning rate be:

  • 0.1 * 2 = 0.2
  • or 0.1 * 2 * 2 (no. of GPUs) = 0.4
question
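
For concreteness, a minimal sketch of the two candidate computations from the question (all names are illustrative, not from any library):

```python
base_lr = 0.1
reference_batch_size = 256
per_gpu_batch_size = 512
num_gpus = 2

# Option 1: scale by the per-GPU batch size only.
lr_per_gpu = base_lr * per_gpu_batch_size / reference_batch_size            # 0.2

# Option 2: scale by the total batch size across all GPUs.
lr_total = base_lr * per_gpu_batch_size * num_gpus / reference_batch_size   # 0.4
```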

All 6 comments

Just to clarify: if you use batch_size=512 with the DDP backend, each GPU will train on a batch size of 512 in Lightning. Do you want 512 on each GPU, or 256 on each?
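
To make the per-GPU semantics concrete, a hedged sketch (train_dataset and model are placeholders; the Trainer arguments reflect the Lightning API around the time of this issue):

```python
from torch.utils.data import DataLoader
import pytorch_lightning as pl

# batch_size is per process: with gpus=2 under DDP, each GPU draws 512
# samples per step, for a total of 1024 samples per optimizer step.
train_loader = DataLoader(train_dataset, batch_size=512, shuffle=True)
trainer = pl.Trainer(gpus=2, distributed_backend="ddp")
trainer.fit(model, train_loader)
```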

Hi, I want each GPU to have a batch size of 512, so the two GPUs will have a total batch size of 1024. I don't know whether I should set the learning rate based on the total batch size or on the per-GPU batch size.

In DDP the gradients are averaged and synced across devices before optimizer_step, so I don't think the LR should be changed here.
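
To illustrate that averaging, a rough sketch of what DDP effectively does after backward() (real DDP buckets these all-reduces and overlaps them with the backward pass; this assumes an initialized process group):

```python
import torch.distributed as dist

def average_gradients(model):
    # Sum each parameter's gradient across all ranks, then divide by the
    # world size, so every replica steps with the same averaged gradient.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```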

As far as I know, the learning rate is scaled with the batch size so that the sample variance of the gradients is kept approximately constant.

Since DDP averages the gradients from all devices, I think the LR should be scaled in proportion to the effective batch size, namely batch_size * num_accumulated_batches * num_gpus * num_nodes.

In this case, assuming batch_size=512, num_accumulated_batches=1, num_gpus=2 and num_nodes=1, the effective batch size is 1024, so the LR should be scaled by sqrt(2) compared to a single GPU with an effective batch size of 512.
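
A minimal sketch of that rule (the function name and defaults are illustrative):

```python
import math

def scale_lr(base_lr, batch_size, num_accumulated_batches=1,
             num_gpus=1, num_nodes=1, reference_batch_size=512):
    # Square-root scaling: keeps the sample variance of the averaged
    # gradient roughly constant as the effective batch size grows.
    effective_batch_size = (batch_size * num_accumulated_batches
                            * num_gpus * num_nodes)
    return base_lr * math.sqrt(effective_batch_size / reference_batch_size)

# 512 per GPU on 2 GPUs -> effective batch size 1024, so the LR is
# scaled by sqrt(2) relative to a single GPU with batch size 512.
print(scale_lr(0.2, batch_size=512, num_gpus=2))  # ≈ 0.283
```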

Thank you all for your answers. I'll scale the LR with the total effective batch size.

This is mentioned only briefly in the DDP documentation; perhaps it should also be mentioned in the TPU section of the docs, since TPU training uses DDP. This was my case: I understood it as needing to scale the batch size to match the effective learning rate, but it was hard to find confirmation of this even across several threads on the subject in various places.
