Does pytorch-lightning support synchronized batch normalization (SyncBN) when training with DDP? If so, how to use it?
If not, Apex has implemented SyncBN, and with native PyTorch plus Apex one can use it like this:
from apex import amp
from apex.parallel import convert_syncbn_model

# Replace every BatchNorm layer in the model with Apex's synchronized version
model = convert_syncbn_model(model)
# Wrap model and optimizer for mixed-precision training
model, optimizer = amp.initialize(model, optimizer)
How can one use these under the pytorch-lightning scheme?
SyncBN makes a big difference when training a model with DDP, so it would be great to know how to use it in pytorch-lightning.
Thanks!
Hi! Thanks for your contribution, great first issue!
I am also curious. My guess is that you need to do the sync BN conversion manually, because sync BN belongs to the model-building part rather than the trainer engine. Have you made any progress?
@Yelen719 @ruotianluo we support sync_batchnorm in lightning now.
@ananyahjha93
Hi, is there any tutorial on how to use SyncBatchNorm in Lightning?
@phongnhhn92 With some searching in the docs: https://pytorch-lightning.readthedocs.io/en/latest/trainer.html#sync-batchnorm
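If it helps, here is a minimal sketch of what that Trainer option looks like. This assumes a Lightning version around the 0.9/1.0 API, where DDP is selected via distributed_backend, and MyLightningModule is just a placeholder for your own module:

import pytorch_lightning as pl

model = MyLightningModule()  # placeholder for your own LightningModule

trainer = pl.Trainer(
    gpus=2,
    distributed_backend="ddp",  # sync BN only matters for multi-GPU DDP training
    sync_batchnorm=True,        # convert BatchNorm layers to synchronized ones
)
trainer.fit(model)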
Hi @DKandrew @ananyahjha93, can you provide an example of how to use it? Of course, as easy as it sounds, I can just add that option to the Trainer. My question is: will that work out of the box for a model using PyTorch SyncBatchNorm, or the Apex SyncBN above?
Hi @phongnhhn92
Here is an example: https://github.com/PyTorchLightning/pytorch-lightning/blob/114af8ba9fc42fcf7053fa06299fbe4aecab8a06/pl_examples/basic_examples/sync_bn.py
By the way, I don't think the example given here is completely correct: it does not set the random seed properly. Based on my understanding, the seed should be set after all the processes have been "created" (or "spawned" if you may). Here, the mistake is that the random seed is set only on the main process. I am not 100% sure about my analysis though; I am not sure whether a call at line 24 of the example sets the seed for all the processes (a Python question). Unfortunately, Lightning does not have good documentation for this (I raised issue #3460).
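If my reading is right, one workaround (untested, just a sketch of the idea) would be to seed inside a hook that runs on every process, e.g. the LightningModule's setup() hook:

import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def setup(self, stage):
        # setup() is called on every process after DDP has spawned them,
        # so seeding here should (I believe) cover all workers, not just rank 0.
        pl.seed_everything(1234)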
I believe it is using PyTorch SyncBatchNorm. Check out the source code here
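For reference, my understanding is that the conversion boils down to PyTorch's own helper, roughly like this (resnet18 is only used here as an illustrative model):

import torch
import torchvision

# A plain model defined with regular BatchNorm layers
model = torchvision.models.resnet18()

# Replace every nn.BatchNorm*d with nn.SyncBatchNorm; this is presumably
# what Lightning does for you when sync_batchnorm=True is set on the Trainer.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)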
Hi @DKandrew, after reading the example, I think we should define our model with regular BatchNorm and then, if we decide to use the option sync_batchnorm=True in the Trainer, the framework will convert all those BatchNorm layers into SyncBatchNorm for us. I will test this in my code to see whether it works like that.
However, I wonder whether there is any difference between Apex SyncBatchNorm and PyTorch SyncBatchNorm? Which one is better to use?
I am also curious about the seed_everything() function discussed in issue #3460. Hopefully, we can get an explanation from the PyTorch team on this.
Hi @phongnhhn92, from my personal experience there is not much difference between Apex and PyTorch SyncBatchNorm. I vaguely remember that the Apex developers have a close relationship with PyTorch's, so their implementations may be fundamentally the same (don't quote me, please take this with a grain of salt). I have used nn.SyncBatchNorm for a while for semantic segmentation tasks and haven't encountered any issues so far; my network output is decent, so I would say the PyTorch one is safe to use.