Incubator-mxnet: [Activation] GELU precision mismatch between MXNet and PyTorch in the CPU version

Created on 30 Jul 2020 · 6 comments · Source: apache/incubator-mxnet

The CPU version of mx.npx.leaky_relu(x, act_type='gelu') disagrees with PyTorch's torch.nn.functional.gelu by more than float32 round-off: about 22.6% of elements fall outside rtol=atol=1e-4 (see the traceback below).

The minimal reproducible example:

import mxnet as mx
mx.npx.set_np()
a = mx.np.random.normal(0, 1, (10000,)) 
b = mx.npx.leaky_relu(a, act_type='gelu')             # MXNet's GELU (CPU)
c = a * 0.5 * (1.0 + mx.npx.erf(a / math.sqrt(2.0)))  # exact erf-based GELU reference

import torch
a_torch = torch.from_numpy(a.asnumpy()).cuda() 
b_torch = torch.nn.functional.gelu(a_torch)
assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)  # passes
assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)  # fails, see traceback below

The GPU version does not show this issue:

import mxnet as mx
mx.npx.set_np()
a = mx.np.random.normal(0, 1, (10000,), ctx=mx.gpu()) 
b = mx.npx.leaky_relu(a, act_type='gelu')
c = a * 0.5 * (1.0 + mx.npx.erf(a / math.sqrt(2.0)))

import torch
a_torch = torch.from_numpy(a.asnumpy()).cuda() 
b_torch = torch.nn.functional.gelu(a_torch)
assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)  # passes
assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)  # passes

@pengzhao-intel @ciyongch

Error:

<ipython-input-48-6f3377797f65> in <module>
      9 b_torch = torch.nn.functional.gelu(a_torch)
     10 assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)
---> 11 assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)

~/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py in assert_allclose(actual, desired, rtol, atol, equal_nan, err_msg, verbose)
   1526     header = 'Not equal to tolerance rtol=%g, atol=%g' % (rtol, atol)
   1527     assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
-> 1528                          verbose=verbose, header=header, equal_nan=equal_nan)
   1529 
   1530 

~/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
    838                                 verbose=verbose, header=header,
    839                                 names=('x', 'y'), precision=precision)
--> 840             raise AssertionError(msg)
    841     except ValueError:
    842         import traceback

AssertionError: 
Not equal to tolerance rtol=0.0001, atol=0.0001

Mismatched elements: 2258 / 10000 (22.6%)
Max absolute difference: 0.0004735
Max relative difference: 0.8255573
 x: array([ 0.684651,  0.508604, -0.165598, ...,  1.706593,  0.288036,
        1.006167], dtype=float32)
 y: array([ 0.68455 ,  0.508554, -0.165716, ...,  1.706508,  0.288026,
        1.005966], dtype=float32)
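
A max absolute difference near 5e-4 is the scale at which the tanh approximation of GELU differs from the exact erf form, so that is one plausible (unconfirmed at this point in the thread) explanation. The two formulas can be compared directly; a minimal sketch in PyTorch:

import math
import torch

x = torch.randn(10000)
# Exact GELU: x * Phi(x), with Phi the standard normal CDF
# (this is what torch.nn.functional.gelu computes by default).
gelu_erf = x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
# Tanh approximation (Hendrycks & Gimpel); some fused CPU kernels use this form.
gelu_tanh = 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
print((gelu_erf - gelu_tanh).abs().max())  # typically a few 1e-4, the same scale as above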
Labels: Bug, MKLDNN, needs triage, v2.0

Most helpful comment

Yes, it's solved.

All 6 comments

@sxjscience Can you confirm whether the operator dispatches to its MKL-DNN implementation?

Sorry, I do not have the bandwidth to confirm that. I think MKL-DNN should be turned on by default. Are you able to reproduce this?
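
For reference, whether a build has MKL-DNN compiled in can be checked from Python; a minimal sketch, assuming an MXNet version that exposes the feature list via mxnet.runtime and reports the flag under the name 'MKLDNN':

import mxnet as mx
from mxnet.runtime import Features

# Compile-time feature flags of this MXNet build; True means CPU operators
# such as Activation/LeakyReLU may dispatch to MKL-DNN kernels.
print(Features().is_enabled('MKLDNN'))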


In fact, I cannot run the reproducer as posted. I am trying to fix the precision problem in #18827. Please let me know if it works for you. Thanks.

@TaoLv Sorry, I missed some imports. Here is the corrected script:

import mxnet as mx
import math
from numpy.testing import assert_allclose
mx.npx.set_np()
a = mx.np.random.normal(0, 1, (10000,)) 
b = mx.npx.leaky_relu(a, act_type='gelu')
c = a * 0.5 * (1.0 + mx.npx.erf(a / math.sqrt(2.0)))

import torch
a_torch = torch.from_numpy(a.asnumpy())
b_torch = torch.nn.functional.gelu(a_torch)
assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)  # passes
assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)  # fails without the fix from #18827

(Compiling MXNet takes some time for me, so it would be helpful if you could check on your side...)
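
One way to isolate the MKL-DNN kernel without rebuilding is a sketch like the following, assuming this build honors the MXNET_MKLDNN_ENABLED environment variable:

import os
# Must be set before mxnet is imported; assumed to switch CPU operators
# off the MKL-DNN path and onto the plain fallback kernels.
os.environ['MXNET_MKLDNN_ENABLED'] = '0'

import mxnet as mx
mx.npx.set_np()
# ...rerun the reproducer above; if the second assert now passes,
# the mismatch comes from the MKL-DNN GELU kernel.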


Does the issue still exist after Tao's PR?

Yes, it's solved.
