Pytorch-lightning: Is Sync BatchNorm supported?

Created on 5 Jul 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

Does pytorch-lightning support synchronized batch normalization (SyncBN) when training with DDP? If so, how to use it?

If not, Apex has implemented SyncBN, and one can use it with native PyTorch like this:

from apex import amp
from apex.parallel import convert_syncbn_model

# Replace every BatchNorm layer in the model with Apex's synchronized version
model = convert_syncbn_model(model)
# Wrap the model and optimizer for mixed-precision training
model, optimizer = amp.initialize(model, optimizer)

How can one use these under the pytorch-lightning scheme?

SyncBN makes a big difference when training the model with DDP and it would be great to know how to use it in pytorch-lightning.

Thanks!

question

All 9 comments

Hi! Thanks for your contribution, great first issue!

I am also curious. My guess is that you need to call convert_sync_bn manually, because sync BN belongs to the model-building part rather than the trainer engine. Have you made any progress?

@Yelen719 @ruotianluo we support sync_batchnorm in lightning now.

@ananyahjha93

Hi, is there any tutorial on how to use SyncBatchNorm in Lightning?

@phongnhhn92 A quick search in the docs turns this up: https://pytorch-lightning.readthedocs.io/en/latest/trainer.html#sync-batchnorm

Hi @DKandrew @ananyahjha93, can you provide an example of how to use it? As easy as it sounds, I can of course just add that option to the Trainer. My question is: will it work out of the box for a model using PyTorch SyncBatchNorm, or for the Apex SyncBN above?

Hi @phongnhhn92

Here is an example: https://github.com/PyTorchLightning/pytorch-lightning/blob/114af8ba9fc42fcf7053fa06299fbe4aecab8a06/pl_examples/basic_examples/sync_bn.py
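In short, the relevant part is just a Trainer flag. Here is a minimal sketch, assuming a 2-GPU DDP setup (the exact distributed flags depend on your Lightning version):

import pytorch_lightning as pl

# sync_batchnorm=True asks Lightning to convert all BatchNorm layers
# to torch.nn.SyncBatchNorm before wrapping the model in DDP.
trainer = pl.Trainer(
    gpus=2,                      # assumes a machine with at least 2 GPUs
    distributed_backend="ddp",   # SyncBatchNorm is only meaningful with DDP
    sync_batchnorm=True,
)
trainer.fit(model)  # model is any LightningModule defined with regular BatchNorm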

By the way, I don't think the example given here is completely correct: it does not set the random seed properly. Based on my understanding, the seed should be set after all the processes have been created (or spawned, if you prefer). The mistake here is that the random seed is set only on the main process. I am not 100% sure about my analysis, though; I don't know whether the call at line 24 of the example sets the seed in all the processes (a Python question). Unfortunately, Lightning does not have good documentation for this (I raised issue #3460).
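For what it's worth, one workaround is to seed at module level, since with the "ddp" backend each GPU process re-executes the script from the top. A sketch of my assumption, using Lightning's seed_everything:

import pytorch_lightning as pl

# Sketch: under the "ddp" backend every process runs this script from
# the top, so a module-level call should seed each process, not just
# the main one. (My assumption; ddp_spawn may behave differently.)
pl.seed_everything(1234)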

I believe it is using PyTorch SyncBatchNorm. Check out the source code here

Hi @DKandrew, after reading the example, I think we should define our model with regular BatchNorm, and then, if we set the option sync_batchnorm=True in the Trainer, the framework will convert all those BatchNorm layers into SyncBatchNorm for us (roughly like the plain-PyTorch sketch below). I will test this in my code to see if it works like that.
However, I wonder: is there any difference between Apex SyncBatchNorm and PyTorch SyncBatchNorm? Which one is better to use?
I am also curious about the seed_everything() function discussed in issue #3460. Hopefully we can get an explanation from the PyTorch team on this.
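To make that concrete, here is what the conversion looks like in plain PyTorch; my understanding is that the Trainer flag boils down to something like this (a sketch, not Lightning's actual internals):

import torch.nn as nn

# Define the model with ordinary BatchNorm layers.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Recursively replace every BatchNorm layer with SyncBatchNorm;
# presumably what sync_batchnorm=True does on our behalf.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)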

Hi @phongnhhn92, from my personal experience there is not much difference between Apex and PyTorch SyncBatchNorm, and I vaguely remember that the Apex developers have a close relationship with PyTorch's, so the two implementations may be fundamentally the same (don't quote me on that; take it with a grain of salt). I have used nn.SyncBatchNorm for a while for semantic segmentation tasks and haven't encountered any issues so far; my network output is decent, so I would say the PyTorch one is safe to use.
