Pytorch-lightning: Default behavior of sync_batchnorm with multi-processing backends

Created on 12 Nov 2020 · 3 Comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

Following the discussion in #4597, notify the user of the default behavior of sync_batchnorm when it is not explicitly passed to pl.Trainer in multi-processing backends, e.g. DDP.

Motivation

Make the transition from training on a single GPU or a single-process backend, e.g. DP, to multi-process multi-GPU training seamless.

Pitch

Users may be unfamiliar with or unaware of the default behavior of batchnorm layers when using multi-processing backends and may expect batchnorm layers to be synced and updated during backpropagation.

SyncBatchNorm is mostly beneficial when the batch size is small (only a few samples per GPU); however, the frequent gather operations may slow down training.
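For context, a user can already opt in explicitly via the Trainer flag. A minimal sketch, assuming the DDP accelerator and the Trainer argument names of the 1.0.x API (exact flag names may differ across versions); LitModel is just a toy placeholder module:

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    """Toy module with a BatchNorm layer, just to illustrate the flag."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(32, 64),
            torch.nn.BatchNorm1d(64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 1),
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# Explicit opt-in: batch norm statistics are synchronized across the DDP processes.
trainer = pl.Trainer(gpus=2, accelerator="ddp", sync_batchnorm=True)
# trainer.fit(LitModel(), train_dataloader)  # supply your own DataLoader
```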

Alternatives:

  1. Set to True by default and notify the user: 'Running in a multi-processing backend and sync_batchnorm was not set; we enabled it for you. Please note it is mostly beneficial with a small batch size but may slow down training.' I personally think this is the preferred behavior since it maintains the same behavior when moving from DP to DDP.

  2. Set to False by default and notify the user: 'Running in a multi-processing backend; sync_batchnorm was not set and defaults to False. Consider setting sync_batchnorm=True, which may be beneficial with a small batch size.' (A rough sketch of such a warning is shown below.)
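As an illustration only, alternative 2 could be wired into the Trainer setup roughly like this; use_ddp and sync_batchnorm are hypothetical attribute names, not necessarily the real Trainer internals, and rank_zero_warn is Lightning's utility for warning once from rank 0:

```python
from pytorch_lightning.utilities import rank_zero_warn


def maybe_warn_sync_batchnorm(trainer):
    # Hypothetical hook: `use_ddp` and `sync_batchnorm` are illustrative names.
    if trainer.use_ddp and not trainer.sync_batchnorm:
        rank_zero_warn(
            "Running with a multi-processing backend and `sync_batchnorm` was not set; "
            "it defaults to False. Consider `Trainer(sync_batchnorm=True)` if your "
            "per-GPU batch size is small."
        )
```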

Labels: DDP · discussion · enhancement · help wanted

All 3 comments

@PyTorchLightning/core-contributors any thoughts here?

I prefer the current False default, but I imagine a message reminding you to turn it on when it's not your intention might get annoying...

@Borda I agree that the default syncbn behavior should be False and that a warning to the users could be annoying. Setting syncbn to True is not required in the majority of cases.
