Keras: Momentum vs. decay in normalization.py for batch normalization

Created on 3 Jun 2017 · 14 Comments · Source: keras-team/keras

I was looking at the implementation of batch normalization in normalization.py, specifically its use of momentum. I followed the methods that use momentum all the way down to the underlying TensorFlow implementation. Below are snippets of each method in the chain that passes on the momentum value first supplied to keras.layers.normalization.BatchNormalization.

I noticed that Keras uses momentum, while TensorFlow uses decay, not momentum, when computing the moving average. Is this a bug in the Keras implementation? Why doesn't Keras use decay, since the underlying implementation expects decay? Are they the same thing in this context?

Use of momentum in normalization.py:

self.add_update([K.moving_average_update(self.moving_mean,
                                         mean,
                                         self.momentum),
                 K.moving_average_update(self.moving_variance,
                                         variance,
                                         self.momentum)],
                inputs)

momentum is then passed on to K.moving_average_update() in tensorflow_backend.py:

def moving_average_update(x, value, momentum):
    """Compute the moving average of a variable.
    # Arguments
        x: A Variable.
        value: A tensor with the same shape as `variable`.
        momentum: The moving average momentum.
    # Returns
        An Operation to update the variable."""
    return moving_averages.assign_moving_average(
        x, value, momentum, zero_debias=False)

Finally, momentum is passed to assign_moving_average() in moving_averages.py in TensorFlow's repository, where it fills the decay argument:

def assign_moving_average(variable, value, decay, zero_debias=True, name=None):
  """Compute the moving average of a variable.
  The moving average of 'variable' updated with 'value' is:
    variable * decay + value * (1 - decay)
  The returned Operation sets 'variable' to the newly computed moving average.
  The new value of 'variable' can be set with the 'AssignSub' op as:
     variable -= (1 - decay) * (variable - value)
  Since variables that are initialized to a `0` value will be `0` biased,
  `zero_debias` optionally enables scaling by the mathematically correct
  debiasing factor of
    1 - decay ** num_updates
  See `ADAM: A Method for Stochastic Optimization` Section 3 for more details
  (https://arxiv.org/abs/1412.6980).
  Args:
    variable: A Variable.
    value: A tensor with the same shape as 'variable'.
    decay: A float Tensor or float value.  The moving average decay.
    zero_debias: A python bool. If true, assume the variable is 0-initialized and
      unbias it, as in https://arxiv.org/abs/1412.6980. See docstring in
      `_zero_debias` for more details.
    name: Optional name of the returned operation.
  Returns:
    A reference to the input 'variable' tensor with the newly computed
    moving average.
  """
  with ops.name_scope(name, "AssignMovingAvg",
                      [variable, value, decay]) as scope:
    with ops.colocate_with(variable):
      decay = ops.convert_to_tensor(1.0 - decay, name="decay")
      if decay.dtype != variable.dtype.base_dtype:
        decay = math_ops.cast(decay, variable.dtype.base_dtype)
      if zero_debias:
        update_delta = _zero_debias(variable, value, decay)
      else:
        update_delta = (variable - value) * decay
      return state_ops.assign_sub(variable, update_delta, name=scope)
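
To make the chain concrete, here is a small standalone sketch (plain Python with made-up numbers, not code from either library) of what the update reduces to once Keras passes its momentum into the decay slot:

# Hypothetical values, for illustration only.
momentum = 0.99       # what Keras calls `momentum`
moving_mean = 5.0     # current running statistic
batch_mean = 4.0      # statistic of the current mini-batch

# assign_moving_average(variable, value, decay) with zero_debias=False does:
#   variable -= (1 - decay) * (variable - value)
moving_mean -= (1 - momentum) * (moving_mean - batch_mean)

# which is algebraically the same as:
#   moving_mean = moving_mean * momentum + batch_mean * (1 - momentum)
print(moving_mean)  # 4.99

In other words, the value Keras calls momentum occupies exactly the slot that TF's docstring calls decay.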


Thank you!

  • [x] Check that you are up-to-date with the master branch of Keras. You can update with:
    pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps

  • [x] If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.


All 14 comments

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

I would also be interested in this, especially since the momentum definition seems to be exactly opposite to the Torch one:

Keras
If we follow the arguments (self.moving_mean, mean, self.momentum) of add_update in normalization.py down to the TF function assign_moving_average, we find variable = moving_mean and value = mean, hence the update is moving_mean * decay + mean * (1 - decay).
I.e. with momentum=0.99: variable * 0.99 + value * (1 - 0.99) = variable * 0.99 + value * 0.01 = moving_mean * 0.99 + mean * 0.01.

Torch
Here we find THTensor_(set1d)(running_mean, f, (real) (momentum * mean + (1 - momentum) * THTensor_(get1d)(running_mean, f))); (Source)
-> running_mean * (1 - momentum) + mean * momentum, which is exactly the opposite of the TF/Keras definition.
I.e. with momentum=0.99: running_mean * 0.01 + mean * 0.99

So either I missed something, or the Keras implementation multiplies (1 - momentum) by the batch mean while Torch multiplies it by the running mean.
Unfortunately, I couldn't find any reference to what the correct definition would be.
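
To spell the difference out numerically, here is a minimal sketch (plain Python, hypothetical values, not taken from either code base) of the two conventions side by side:

# Hypothetical values, for illustration only.
running_mean, batch_mean = 5.0, 4.0
m = 0.99

# Keras / TF convention: momentum weights the running statistic.
keras_update = running_mean * m + batch_mean * (1 - m)        # 4.99

# Torch convention: momentum weights the new batch statistic.
torch_update = running_mean * (1 - m) + batch_mean * m        # 4.01

# The two are mirror images: Torch with momentum = 1 - m reproduces the Keras update.
p = 1 - m
torch_flipped = running_mean * (1 - p) + batch_mean * p       # 4.99

This is also why Torch's default momentum is a small value (0.1) while the Keras default is 0.99: both libraries compute the same kind of exponential moving average, they just name the weighting factor from opposite ends.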

Any idea @fchollet?

I noticed something similar: I've been trying to reproduce training from scratch on ImageNet but I get very different results from the tf-slim implementation. One of the things I am investigating is the BN parameters: https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.py#L410
As you can see, the BN decay is set to 0.9997, and when I set BN momentum in Keras to 0.9997 instead of 0.99 I get bad results. On the other hand, when I reduce that momentum the network seems to generalize better.

@hicham-eyeem
I'm trying to reproduce the MobileNet accuracy, but I only got an error rate of about 30.7%. I also noticed the batchnorm parameters in the slim implementation; it's weird to set such a high decay (0.9997), do you have any idea?

@boluoweifenda I got 31.45% top-1 validation accuracy with center crop. I trained for 45 epochs (instead of 90), with SGD (instead of RMSprop), the learning rate reduced by 10x every 15 epochs, weight decay of 1e-5 (instead of 4e-5), and dropout of 1e-2 (instead of 1e-3). I set the batchnorm momentum to 0.70, which seemed to work better than 0.99 (but I don't know how it impacts the results in the long run). Since I am trying to reproduce the MobileNet results for transfer learning, I don't really care about the final accuracy but about how good the weights are for fine-tuning. Unfortunately the performance of the model I trained (over several attempts with different settings) is quite far from the Keras weights available online (2 mAP points difference on my multilabel dataset), so I don't know the exact source of the gap: solver, batchnorm params, weight decay values, training for longer, etc.
May I ask how you tuned the parameters to get 30.7%?

@hicham-eyeem
I now get 30.5% validation accuracy with center crop.
trainset: resize short side to 256, center crop to 256x256, then random crop to 224x224
validset: resize short side to 256, center crop to 256x256, then center crop to 224x224
batch size 256, SGD with momentum=0.9, initial lr = 0.1 linearly decayed to 1e-3 over 60 epochs as in https://github.com/shicai/MobileNet-Caffe/issues/1
batchnorm decay 0.9, L2 weight decay 1e-4
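
For reference, a minimal Keras sketch of roughly that optimizer setup (a hypothetical snippet with assumed details; the model, data pipeline, and multi-GPU handling are omitted):

from keras import optimizers, regularizers
from keras.callbacks import LearningRateScheduler

def linear_decay(epoch):
    # Linearly decay the learning rate from 0.1 to 1e-3 over 60 epochs.
    lr_start, lr_end, total = 0.1, 1e-3, 60
    t = min(float(epoch) / total, 1.0)
    return lr_start + t * (lr_end - lr_start)

opt = optimizers.SGD(lr=0.1, momentum=0.9)           # SGD with momentum 0.9
callbacks = [LearningRateScheduler(linear_decay)]

# L2 weight decay of 1e-4 would be attached per layer, e.g.
# Conv2D(..., kernel_regularizer=regularizers.l2(1e-4)), and the
# "batchnorm decay 0.9" above corresponds to BatchNormalization(momentum=0.9)
# in Keras (same convention, see the discussion earlier in this thread).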

@boluoweifenda
Cool, I will try that, thank you. I used 3 GPUs, each processing a batch of 32 images (total batch size 32 x 3, with an initial learning rate of 0.1 * 32 * 3 / 256), which might affect performance due to batch norm on small batches.
Did you compare the performance of your model in a fine-tuning scenario and see how it compares with the original weights?

@hicham-eyeem
Sorry, I didn't try transfer learning or any fine-tuning scenario; I mainly pay attention to CNN models and architectures. BTW, I found that training with 2 GPUs is slightly better than with 4 GPUs at the same batch size of 256. I just use the mean and std from a single GPU to update the moving-average statistics of batch normalization.

@boluoweifenda cool, thanks!

It's just a name. What Keras (correctly) calls momentum, TF (incorrectly) calls decay: momentum + decay = 1. You can tell that TF is wrong because a larger decay is supposed to cause faster decay. You can see how TF has to correct itself here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/moving_averages.py#L81

TF is a mess:

BatchNormalization.__init__():
https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/python/layers/normalization.py#L224

https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/python/layers/normalization.py#L261
https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/python/layers/normalization.py#L441

Additional references:

TF docs
https://www.tensorflow.org/versions/r0.12/api_docs/python/train/moving_averages

def moving_average_update(x, value, momentum):
    return moving_averages.assign_moving_average(x, value, momentum, zero_debias=False)

https://github.com/fchollet/keras/blob/master/keras/backend/tensorflow_backend.py#L974

def assign_moving_average(variable, value, decay, zero_debias=True, name=None):
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/moving_averages.py#L32

tf.layers.batch_normalization(
    inputs,
    axis=-1,
    momentum=0.99,
    epsilon=0.001, ...)

https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization

@ozabluda so it's fine then; I was worried that since momentum != decay I'd have to set momentum = 1 - decay, which I tried, but I didn't notice any difference. I don't think it's a big deal, it's just really annoying that the naming is inconsistent.

@ozabluda Thank you for the explanation, it's a bit confusing though. If we take the MobileNet tf-slim implementation, they set the batchnorm "decay" to 0.9997: https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.py#L410. Since there's that 1 - decay step in TF's moving-average function, does that mean the equivalent Keras implementation would be a momentum of 1 - 0.9997 or of 0.9997? I tried a batchnorm momentum of 0.9997 in Keras a few days ago but it was not training properly. In any case, what would be the impact of a momentum of 0.9997 instead of 0.99, does it really matter?

@hicham-eyeem
> If we take the MobileNet tf-slim implementation, they set the batchnorm "decay" to 0.9997:
> https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.py#L410

Can you track down how it gets to the actual batch_normalization() or class BatchNormalization? Neither has a decay parameter (both have momentum):
https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/python/layers/normalization.py#L467
https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/python/layers/normalization.py#L98

> it's a bit confusing though.

It looks to me like TF erroneously called momentum "decay" at some point in the past, then corrected itself, and now it's a mess. Maybe you can track it all down in the revision history? This might be a good starting point:
https://github.com/tensorflow/tensorflow/commit/f56c0abfdb1fe0e4812ac490e68cb58a3761586c#diff-94bbcef0ec8a5cdef55f705e99c2b2edR357

> does that mean the equivalent Keras implementation would be a momentum of 1 - 0.9997 or of 0.9997?

It depends on what exactly gets to the actual BatchNormalization, but it's almost certainly 0.9997

> I tried a batchnorm momentum of 0.9997 in Keras a few days ago but it was not training properly. In any case, what would be the impact of a momentum of 0.9997 instead of 0.99, does it really matter?

It must matter, otherwise it wouldn't be there.
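
One back-of-the-envelope way to see why it matters (not from this thread): an exponential moving average with factor m averages over roughly the last 1 / (1 - m) batches, so 0.99 and 0.9997 give very different horizons:

# Rough effective averaging horizon of the moving statistics.
for m in (0.99, 0.9997):
    window = 1.0 / (1.0 - m)        # ~ number of recent batches averaged
    half_life = 0.693 / (1.0 - m)   # ~ batches until an old batch's weight halves
    print(m, round(window), round(half_life))
# 0.99   -> window ~100,  half-life ~69
# 0.9997 -> window ~3333, half-life ~2310

With 0.9997 the moving mean/variance lag thousands of updates behind the current weights, which is one plausible reason a value tuned for tf-slim's long training schedule looked bad in a shorter Keras run.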

@redsphinx > really annoying that the naming is inconsistent.

The naming is actually consistent with TF's batch_normalization() / class BatchNormalization, see above. Of course, @fchollet implemented those in TF, so that consistency by itself doesn't prove anything. But we know that Keras is right and TF is wrong. See
https://github.com/fchollet/keras/issues/6839#issuecomment-347377404
