Pytorch-lightning: Default behavior of sync_batchnorm with multi-processing backends

Created on 12 Nov 2020 · 3 Comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

Following the discussion in #4597, notify the user of the default behavior of sync_batchnorm when it is not explicitly passed to pl.Trainer in multi-processing backends, e.g. DDP.

Motivation

Make the transition from training on a single GPU or a single-process backend, e.g. DP, to multi-process multi-GPU training seamless.

Pitch

Users may be unfamiliar with or unaware of the default behavior of batchnorm layers when using multi-processing backends and may expect batchnorm layers to be synced and updated during backpropagation.

SyncBatchNorm is mostly beneficial when the batch size is small (only a few samples per GPU); however, the frequent gather operations may slow down training.
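For context, a user can already opt in explicitly via the Trainer flag. A minimal sketch, assuming the DDP accelerator and the Trainer argument names of the 1.0.x API (exact flag names may differ across versions); LitModel is just a toy placeholder module:

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    """Toy module with a BatchNorm layer, just to illustrate the flag."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(32, 64),
            torch.nn.BatchNorm1d(64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 1),
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# Explicit opt-in: batch norm statistics are synchronized across the DDP processes.
trainer = pl.Trainer(gpus=2, accelerator="ddp", sync_batchnorm=True)
# trainer.fit(LitModel(), train_dataloader)  # supply your own DataLoader
```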

Alternatives:

  1. Set to True by default and notify the user: 'Running in a multi-processing backend and sync_batchnorm was not set; we enabled it for you. Please note it is mostly beneficial with a small batch size but may slow down training.' I personally think this is the preferred behavior since it maintains the same behavior when moving from DP to DDP.

  2. Set to False by default and notify the user: 'Running in a multi-processing backend; sync_batchnorm was not set and defaults to False. Consider setting sync_batchnorm=True, which may be beneficial with a small batch size.' (A rough sketch of such a warning is shown below.)
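As an illustration only, alternative 2 could be wired into the Trainer setup roughly like this; use_ddp and sync_batchnorm are hypothetical attribute names, not necessarily the real Trainer internals, and rank_zero_warn is Lightning's utility for warning once from rank 0:

```python
from pytorch_lightning.utilities import rank_zero_warn


def maybe_warn_sync_batchnorm(trainer):
    # Hypothetical hook: `use_ddp` and `sync_batchnorm` are illustrative names.
    if trainer.use_ddp and not trainer.sync_batchnorm:
        rank_zero_warn(
            "Running with a multi-processing backend and `sync_batchnorm` was not set; "
            "it defaults to False. Consider `Trainer(sync_batchnorm=True)` if your "
            "per-GPU batch size is small."
        )
```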

Labels: DDP · discussion · enhancement · help wanted

All 3 comments

@PyTorchLightning/core-contributors any thoughts here?

I prefer the current False default, but I imagine a message reminding you to turn it on when it's not your intention might get annoying...

@Borda I agree that the default syncbn behavior should be False and that a warning to the users could be annoying. Setting syncbn to True is not required in the majority of cases.
