The CPU version of mx.npx.leaky_relu(x, act_type='gelu') does not match PyTorch's GELU (or the erf-based reference) within a 1e-4 tolerance.
A minimal reproducible example:
import mxnet as mx
mx.npx.set_np()
a = mx.np.random.normal(0, 1, (10000,))
b = mx.npx.leaky_relu(a, act_type='gelu')
c = a * 0.5 * (1.0 + mx.npx.erf(a / math.sqrt(2.0)))
import torch
a_torch = torch.from_numpy(a.asnumpy()).cuda()
b_torch = torch.nn.functional.gelu(a_torch)
assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)
assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)
The GPU version has no issue:
import mxnet as mx
mx.npx.set_np()
a = mx.np.random.normal(0, 1, (10000,), ctx=mx.gpu())
b = mx.npx.leaky_relu(a, act_type='gelu')
c = a * 0.5 * (1.0 + mx.npx.erf(a / math.sqrt(2.0)))
import torch
a_torch = torch.from_numpy(a.asnumpy()).cuda()
b_torch = torch.nn.functional.gelu(a_torch)
assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)
assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)
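For reference, c above is the exact erf-based GELU, GELU(x) = x * Phi(x) = 0.5 * x * (1 + erf(x / sqrt(2))), which is also what torch.nn.functional.gelu computes, so all three results should agree to within float32 tolerance.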
@pengzhao-intel @ciyongch
Error:
<ipython-input-48-6f3377797f65> in <module>
9 b_torch = torch.nn.functional.gelu(a_torch)
10 assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)
---> 11 assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)
~/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py in assert_allclose(actual, desired, rtol, atol, equal_nan, err_msg, verbose)
1526 header = 'Not equal to tolerance rtol=%g, atol=%g' % (rtol, atol)
1527 assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
-> 1528 verbose=verbose, header=header, equal_nan=equal_nan)
1529
1530
~/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
838 verbose=verbose, header=header,
839 names=('x', 'y'), precision=precision)
--> 840 raise AssertionError(msg)
841 except ValueError:
842 import traceback
AssertionError:
Not equal to tolerance rtol=0.0001, atol=0.0001
Mismatched elements: 2258 / 10000 (22.6%)
Max absolute difference: 0.0004735
Max relative difference: 0.8255573
x: array([ 0.684651, 0.508604, -0.165598, ..., 1.706593, 0.288036,
1.006167], dtype=float32)
y: array([ 0.68455 , 0.508554, -0.165716, ..., 1.706508, 0.288026,
1.005966], dtype=float32)
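The mismatch is small in absolute terms but large relative to float32 rounding. A quick way to see which side carries the error is to compare both float32 outputs against a float64 erf-based reference. This is a diagnostic sketch, not part of the original report: it assumes a, b, and b_torch from the reproducer above and uses scipy.special.erf; the tanh-based approximation is included only as a hypothetical source of a gap of this size.
import math
import numpy as np
from scipy.special import erf

def gelu_erf(x):
    # exact, erf-based GELU: x * Phi(x)
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation of GELU (hypothetical culprit, shown for comparison only)
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

a64 = a.asnumpy().astype(np.float64)
ref = gelu_erf(a64)
print('mxnet cpu   vs float64 ref:', np.abs(b.asnumpy() - ref).max())
print('pytorch     vs float64 ref:', np.abs(b_torch.cpu().numpy() - ref).max())
print('tanh approx vs float64 ref:', np.abs(gelu_tanh(a64) - ref).max())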
@sxjscience Can you confirm whether the operator is dispatching to its MKL-DNN implementation?
Sorry, I do not have the bandwidth to confirm that. I think MKL-DNN should be enabled by default. Are you able to reproduce this?
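(Not from the original thread: one way to check this, assuming the MXNet 1.x runtime feature API, is sketched below.)
import mxnet as mx
from mxnet.runtime import Features

# True if this MXNet build was compiled with MKL-DNN (oneDNN) support.
print(Features().is_enabled('MKLDNN'))

# If it is, setting the environment variable MXNET_MKLDNN_ENABLED=0 before
# importing mxnet should (assumption) fall back to the native CPU kernel,
# which would show whether the mismatch comes from the MKL-DNN path.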
In fact, I cannot run the reproducer as posted. I have tried to fix the precision problem in #18827. Please let me know if it works for you. Thanks.
@TaoLv Sorry, missed some imports.
import mxnet as mx
import math
from numpy.testing import assert_allclose
mx.npx.set_np()
a = mx.np.random.normal(0, 1, (10000,))
b = mx.npx.leaky_relu(a, act_type='gelu')
c = a * 0.5 * (1.0 + mx.npx.erf(a / math.sqrt(2.0)))
import torch
a_torch = torch.from_numpy(a.asnumpy())
b_torch = torch.nn.functional.gelu(a_torch)
assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)
assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)
(Compiling MXNet takes some time for me, so it would be helpful if you could check that...)
Does the issue still exist after Tao's PR?
Yes, it's solved.