Does every tower batch-norm its own part of the batch (in multi-GPU mode), or are the Wx+b outputs of all towers concatenated so that batch norm is computed over batch*num_GPU examples?
The latter may be much slower due to the synchronization.
Each tower performs batch_norm on its own part of the batch; there is no synchronization across towers for that.
@sun9700: Please reopen if that doesn't answer your question.
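For illustration, here is a minimal TF1-style sketch of the usual tower setup (the names `build_tower`, `num_gpus`, and the toy conv layer are hypothetical, not from this thread): the gamma/beta and moving-statistics variables are shared across towers via variable reuse, but each tower's `batch_normalization` computes its normalization statistics only over its own shard of the batch.

```python
import tensorflow as tf

num_gpus = 2
images = tf.placeholder(tf.float32, [None, 32, 32, 3])

def build_tower(x, is_training):
    # batch_normalization computes its batch statistics over the examples it
    # actually sees, i.e. only this tower's shard of the global batch.
    net = tf.layers.conv2d(x, 64, 3, padding='same', name='conv1')
    net = tf.layers.batch_normalization(net, training=is_training, name='bn1')
    return tf.nn.relu(net)

shards = tf.split(images, num_gpus)  # one shard of the batch per GPU
tower_outputs = []
with tf.variable_scope(tf.get_variable_scope()):
    for i, shard in enumerate(shards):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
            tower_outputs.append(build_tower(shard, is_training=True))
            # Share gamma/beta/moving_mean/moving_variance (and conv weights)
            # across towers; the per-batch statistics stay per-tower.
            tf.get_variable_scope().reuse_variables()
```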
Does it mean the moving_mean and moving_variance on each tower will potentially be updated to different values even when the variables are shared across towers?
When we save the model, which tower's moving_mean/variance is saved?
Is there a way to handle this correctly?
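Not an official answer from this thread, just a common workaround seen in multi-GPU training code: since the moving_mean/moving_variance variables are shared, there is only one copy of them, and the saver simply stores whatever value the last applied update left behind. To make that update consistent, many setups run the batch-norm update ops from a single tower only. The sketch below assumes the hypothetical tower setup from the earlier snippet (`tower_0` name scope); `train_step` is a placeholder for your optimizer op.

```python
import tensorflow as tf

# Collect the assign ops for moving_mean/moving_variance that the first
# tower added to the UPDATE_OPS collection, and run them alongside the
# train op, so the shared variables get one consistent update per step.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, scope='tower_0')
train_op = tf.group(train_step, *update_ops)  # train_step: your optimizer op (placeholder)
```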