Incubator-mxnet: batchnorm from scratch with autograd gives very different gradient from mx.nd.BatchNorm

Created on 27 Aug 2018 · 9 comments · Source: apache/incubator-mxnet

Description

batchnorm from scratch with autograd gives very different gradient from mx.nd.BatchNorm. Forward results are OK.

Environment info (Required)

macOS 10.13 with mxnet 1.2.1 cpu version from pip.

Package used (Python/R/Scala/Julia):
Python

Minimum reproducible example

import mxnet as mx


def batch_norm_nd(x, gamma, beta, eps=1e-5):
    mean = mx.nd.mean(x, axis=(0, 2, 3), keepdims=True)
    var = mx.nd.mean((x - mean) ** 2, axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / mx.nd.sqrt(var + eps)

    return x_hat * gamma + beta

if __name__ == "__main__":
    x = mx.nd.random_uniform(low=1, high=2, shape=(2, 16, 4, 4))
    gamma = mx.nd.ones(shape=(1, 16, 1, 1))
    beta = mx.nd.zeros(shape=(1, 16, 1, 1))
    mmean = mx.nd.zeros(shape=(1, 16, 1, 1))
    mvar = mx.nd.zeros(shape=(1, 16, 1, 1))
    x.attach_grad()
    gamma.attach_grad()
    beta.attach_grad()

    with mx.autograd.record(train_mode=True):
        y = mx.nd.BatchNorm(x, gamma, beta, mmean, mvar, fix_gamma=False, use_global_stats=False)
    y.backward(mx.nd.ones_like(y))
    y2 = y.copy()
    x2_grad = x.grad.copy()

    with mx.autograd.record(train_mode=True):
        y = batch_norm_nd(x, gamma, beta)
    y.backward(mx.nd.ones_like(y))
    y1 = y.copy()
    x1_grad = x.grad.copy()

    print((y2 / y1)[0, 1])
    print((x2_grad / x1_grad)[0, 1])

results:

[[0.99354386 0.9935453  0.993546   0.9935485 ]
 [0.99354345 0.9935435  0.993581   0.9935487 ]
 [0.9935372  0.99354607 0.9935438  0.9935436 ]
 [0.9935449  0.9935456  0.993545   0.9935423 ]]
<NDArray 4x4 @cpu(0)>

[[-3.6692393 -3.6692448 -3.669247  -3.669256 ]
 [-3.6692376 -3.6692383 -3.6693766 -3.6692567]
 [-3.6692145 -3.6692476 -3.669239  -3.6692383]
 [-3.669243  -3.6692457 -3.6692433 -3.6692333]]
<NDArray 4x4 @cpu(0)>


Labels: Autograd, NDArray

All 9 comments

@ZiyueHuang

@mxnet-label-bot [Bug, NDArray]

I don't reckon this is related to the tagged issue, Anirudh.
In this case, the MRE provided runs imperative NDArray operations in both cases.

I posted this question on the MXNet Discuss Forum as well to get a wider audience. https://discuss.mxnet.io/t/grads-from-batchnorm-implemented-from-scratch-different-from-mx-nd-batchnorm/4167

Gradients in this example are tiny (smaller than float epsilon), so I think the variance in the ratio is to be expected here.
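
Put differently, a tolerance-based comparison is more informative than elementwise ratios here, since ratios of near-zero float32 values look wildly different even when the absolute gap is negligible. A minimal sketch, assuming it is appended to the MRE above and reuses its y1, y2, x1_grad, x2_grad (tolerance values are illustrative, not from the thread):

import numpy as np

# Compare with relative/absolute tolerances instead of elementwise ratios;
# ratios of near-zero values exaggerate tiny float32 differences.
print(np.allclose(y1.asnumpy(), y2.asnumpy(), rtol=1e-2, atol=1e-5))
print(np.allclose(x1_grad.asnumpy(), x2_grad.asnumpy(), rtol=1e-2, atol=1e-5))
print(mx.nd.abs(x1_grad - x2_grad).max())  # largest absolute gradient gap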

Also, in your example you set eps = 1e-5, but mx.nd.BatchNorm uses a default value of 1e-3: https://github.com/apache/incubator-mxnet/blob/992c3c0dd90c0723de6934e826a49bad6569eeac/src/operator/nn/batch_norm-inl.h#L70
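
A minimal sketch of aligning eps on either side, reusing the names from the MRE above (only one of the two options is needed):

# Option 1: pass eps explicitly so the operator matches the from-scratch code
y = mx.nd.BatchNorm(x, gamma, beta, mmean, mvar, eps=1e-5,
                    fix_gamma=False, use_global_stats=False)

# Option 2: use the operator's default eps in the from-scratch implementation
y = batch_norm_nd(x, gamma, beta, eps=1e-3)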

Going by the above comments, I don't think this is a bug in the BatchNorm implementation or in autograd's differentiation, so there's nothing to fix here.

@mxnet-label-bot Update [Question, NDArray, Autograd]

@RogerChern Can this issue be closed?
Please feel free to re-open it if it is closed in error.

Thanks!

Cool, I now get the correct result with the following snippet.

import mxnet as mx


def batch_norm_nd(x, gamma, beta, eps=1e-5):
    mean = mx.nd.mean(x, axis=(0, 2, 3), keepdims=True)
    var = mx.nd.mean((x - mean) ** 2, axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / mx.nd.sqrt(var + eps)

    return x_hat * gamma + beta

if __name__ == "__main__":
    x1 = mx.nd.random_normal(0.3, 2, shape=(2, 16, 32, 32))
    x2 = x1.copy()
    gamma = mx.nd.ones(shape=(1, 16, 1, 1))
    beta = mx.nd.zeros(shape=(1, 16, 1, 1))
    mmean = mx.nd.zeros(shape=(1, 16, 1, 1))
    mvar = mx.nd.ones(shape=(1, 16, 1, 1))
    x1.attach_grad()
    x2.attach_grad()
    gamma.attach_grad()
    beta.attach_grad()

    grad = mx.nd.random_normal(0, 1, shape=(2, 16, 32, 32))
    with mx.autograd.record(train_mode=True):
        y1 = batch_norm_nd(x1, gamma, beta)
    y1.backward(grad)

    with mx.autograd.record(train_mode=True):
        y2 = mx.nd.BatchNorm(x2, gamma, beta, mmean, mvar, fix_gamma=False, use_global_stats=False, eps=1e-5)
    y2.backward(grad)

    print("--------------------autograd grad scale----------------------")
    print(x1.grad[0, 1])
    print("\n\n")

    print("--------------------forward native/autograd----------------------")
    print((y2 / y1)[0, 1])
    print("\n\n")

    print("--------------------backward native/autograd----------------------")
    print((x2.grad / x1.grad)[0, 1])
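
One possible follow-up, not part of the original snippet: assert the closeness numerically with mx.test_utils.assert_almost_equal instead of eyeballing the printed ratios (tolerance values are illustrative):

from mxnet.test_utils import assert_almost_equal

# Raises an AssertionError if the forward outputs or input gradients
# differ beyond the given tolerances.
assert_almost_equal(y1.asnumpy(), y2.asnumpy(), rtol=1e-3, atol=1e-5)
assert_almost_equal(x1.grad.asnumpy(), x2.grad.asnumpy(), rtol=1e-3, atol=1e-5)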
