BatchNorm implemented from scratch with autograd gives a very different gradient from mx.nd.BatchNorm, even though the forward results agree.
macOS 10.13, mxnet 1.2.1 (CPU build) installed from pip.
Package used (Python/R/Scala/Julia):
Python
Minimum reproducible example:
import mxnet as mx


def batch_norm_nd(x, gamma, beta, eps=1e-5):
    # Per-channel batch statistics over the N, H, W axes
    mean = mx.nd.mean(x, axis=(0, 2, 3), keepdims=True)
    var = mx.nd.mean((x - mean) ** 2, axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / mx.nd.sqrt(var + eps)
    return x_hat * gamma + beta


if __name__ == "__main__":
    x = mx.nd.random_uniform(low=1, high=2, shape=(2, 16, 4, 4))
    gamma = mx.nd.ones(shape=(1, 16, 1, 1))
    beta = mx.nd.zeros(shape=(1, 16, 1, 1))
    mmean = mx.nd.zeros(shape=(1, 16, 1, 1))
    mvar = mx.nd.zeros(shape=(1, 16, 1, 1))
    x.attach_grad()
    gamma.attach_grad()
    beta.attach_grad()

    # Native BatchNorm operator
    with mx.autograd.record(train_mode=True):
        y = mx.nd.BatchNorm(x, gamma, beta, mmean, mvar, fix_gamma=False, use_global_stats=False)
    y.backward(mx.nd.ones_like(y))
    y2 = y.copy()
    x2_grad = x.grad.copy()

    # BatchNorm from scratch with autograd
    with mx.autograd.record(train_mode=True):
        y = batch_norm_nd(x, gamma, beta)
    y.backward(mx.nd.ones_like(y))
    y1 = y.copy()
    x1_grad = x.grad.copy()

    print((y2 / y1)[0, 1])
    print((x2_grad / x1_grad)[0, 1])
results:
[[0.99354386 0.9935453 0.993546 0.9935485 ]
[0.99354345 0.9935435 0.993581 0.9935487 ]
[0.9935372 0.99354607 0.9935438 0.9935436 ]
[0.9935449 0.9935456 0.993545 0.9935423 ]]
<NDArray 4x4 @cpu(0)>
[[-3.6692393 -3.6692448 -3.669247 -3.669256 ]
[-3.6692376 -3.6692383 -3.6693766 -3.6692567]
[-3.6692145 -3.6692476 -3.669239 -3.6692383]
[-3.669243 -3.6692457 -3.6692433 -3.6692333]]
<NDArray 4x4 @cpu(0)>
@ZiyueHuang
@mxnet-label-bot [Bug, NDArray]
might be related - https://github.com/apache/incubator-mxnet/issues/14710
I don't reckon this is related to the tagged issue, Anirudh.
In this case, the MRE provided runs imperative NDArray operations in both cases.
I posted this question on the MXNet Discuss Forum as well to get a wider audience. https://discuss.mxnet.io/t/grads-from-batchnorm-implemented-from-scratch-different-from-mx-nd-batchnorm/4167
Gradients in this example are tiny (smaller than float epsilon), so I think the variance in the ratio is to be expected here.
Also, in your example you set eps = 1e-5, but mx.nd.BatchNorm uses a default value of 1e-3: https://github.com/apache/incubator-mxnet/blob/992c3c0dd90c0723de6934e826a49bad6569eeac/src/operator/nn/batch_norm-inl.h#L70
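As a rough back-of-the-envelope check (the numbers below are an approximation of mine, not anything MXNet computes): if the only difference between the two paths is eps, the forward outputs should differ by roughly sqrt((var + 1e-5) / (var + 1e-3)). With x drawn from Uniform(1, 2) the per-channel sample variance should be close to the theoretical 1/12, which gives a ratio in the same ballpark as the ~0.9935 printed above:
import math

# Expected forward ratio when the only difference is eps,
# assuming the per-channel sample variance is close to the
# theoretical variance 1/12 of Uniform(1, 2).
var = 1.0 / 12.0
eps_custom = 1e-5    # eps used in batch_norm_nd
eps_default = 1e-3   # default eps of mx.nd.BatchNorm
print(math.sqrt((var + eps_custom) / (var + eps_default)))  # ~0.994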
Going by the above comments, I don't think this is a bug in the BatchNorm implementation or in autograd's differentiation, so there's nothing to be fixed here.
@mxnet-label-bot Update [Question, NDArray, Autograd]
@RogerChern Can this issue be closed?
Please feel free to re-open the issue if this is closed in error.
Thanks!
Cool, I now get the correct result with the following snippet.
import mxnet as mx


def batch_norm_nd(x, gamma, beta, eps=1e-5):
    mean = mx.nd.mean(x, axis=(0, 2, 3), keepdims=True)
    var = mx.nd.mean((x - mean) ** 2, axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / mx.nd.sqrt(var + eps)
    return x_hat * gamma + beta


if __name__ == "__main__":
    x1 = mx.nd.random_normal(0.3, 2, shape=(2, 16, 32, 32))
    x2 = x1.copy()
    gamma = mx.nd.ones(shape=(1, 16, 1, 1))
    beta = mx.nd.zeros(shape=(1, 16, 1, 1))
    mmean = mx.nd.zeros(shape=(1, 16, 1, 1))
    mvar = mx.nd.ones(shape=(1, 16, 1, 1))
    x1.attach_grad()
    x2.attach_grad()
    gamma.attach_grad()
    beta.attach_grad()

    grad = mx.nd.random_normal(0, 1, shape=(2, 16, 32, 32))

    with mx.autograd.record(train_mode=True):
        y1 = batch_norm_nd(x1, gamma, beta)
    y1.backward(grad)

    with mx.autograd.record(train_mode=True):
        y2 = mx.nd.BatchNorm(x2, gamma, beta, mmean, mvar, fix_gamma=False, use_global_stats=False, eps=1e-5)
    y2.backward(grad)

    print("--------------------autograd grad scale----------------------")
    print(x1.grad[0, 1])
    print("\n\n")
    print("--------------------forward native/autograd----------------------")
    print((y2 / y1)[0, 1])
    print("\n\n")
    print("--------------------backward native/autograd----------------------")
    print((x2.grad / x1.grad)[0, 1])
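A small optional follow-up (the tolerances here are my own illustrative choice): instead of eyeballing the ratios, the two paths can be compared programmatically with mx.test_utils.assert_almost_equal once the snippet above has run, e.g.
from mxnet.test_utils import assert_almost_equal

# Compare the from-scratch and native results numerically
# (y1/y2 and x1/x2 come from the snippet above; tolerances are illustrative).
assert_almost_equal(y1.asnumpy(), y2.asnumpy(), rtol=1e-4, atol=1e-5)
assert_almost_equal(x1.grad.asnumpy(), x2.grad.asnumpy(), rtol=1e-4, atol=1e-5)
print("forward and backward agree within tolerance")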