Incubator-mxnet: Flaky test_operator.test_binary_op

Created on 21 Feb 2018 · 16 comments · Source: apache/incubator-mxnet

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/397/pipeline

======================================================================
FAIL: test_operator.test_binary_op
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py3\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\common.py", line 155, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\test_operator.py", line 1377, in test_binary_op
    test_bmod(a, b)
  File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\test_operator.py", line 1353, in test_bmod
    lambda g_out, a, b: (g_out, - g_out * (np.float32(a) // np.float32(b))), gen_binary_data)
  File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\test_operator.py", line 1319, in check_binary_op_backward
    assert_allclose(y_2.asnumpy(), x_2, rtol=rtol, atol=atol)
  File "C:\Anaconda3\envs\py3\lib\site-packages\numpy\testing\utils.py", line 1411, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "C:\Anaconda3\envs\py3\lib\site-packages\numpy\testing\utils.py", line 796, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-05

(mismatch 0.5555555555555571%)
 x: array([[[[ -3.451749e-01,  -0.000000e+00,  -0.000000e+00,  -6.440228e-01],
         [ -0.000000e+00,  -1.070805e+01,  -5.140794e-01,  -6.652636e-01],
         [ -2.817436e-01,  -0.000000e+00,  -0.000000e+00,  -4.327150e+00]],...
 y: array([[[[ -3.451749e-01,  -0.000000e+00,  -0.000000e+00,  -6.440228e-01],
         [ -0.000000e+00,  -1.070805e+01,  -5.140794e-01,  -6.652636e-01],
         [ -2.817437e-01,  -0.000000e+00,  -0.000000e+00,  -4.327150e+00]],...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=94585786 to reproduce.
--------------------- >> end captured logging << ---------------------
Labels: Bug, Flaky, Operator, Test
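To retry this locally, here is a minimal reproduction sketch (my own, not from the issue) that reruns the test with the seed captured in the log above; the nose binary name and test path are assumptions and may differ on your setup (e.g. nosetests-3.4 on the CI slaves):

import os
import subprocess

# Hypothetical helper: rerun test_binary_op with the seed from the captured log.
# Assumes a local MXNet checkout with nose installed; adjust binary name/path as needed.
os.environ["MXNET_TEST_SEED"] = "94585786"
subprocess.run(
    ["nosetests", "-s", "-v",
     "tests/python/unittest/test_operator.py:test_binary_op"],
    check=False,  # expected to fail if the flake reproduces
)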


All 16 comments

This hasn't broken in like a year to my knowledge.

Is that an MKL build?

No, on Windows we only run OpenBLAS.

It seems https://github.com/apache/incubator-mxnet/issues/9853 fails in the same test. It's strange how this fails.

Wasn't elemwise_add changed to use mkl?
btw, was it verified that mkl is faster for all shapes and types? I saw it allocates memory, which seems like it might be slow.

I don't think we have the tools to measure performance on that scale yet. As far as I know, this is in the works. Since there's still some time until 1.2, we can definitely gather these numbers.

Well even if it was changed to use MKL, this would not apply here since we're running on OpenBLAS, right?

@cjolivier01 in both cases (https://github.com/apache/incubator-mxnet/issues/9853 and https://github.com/apache/incubator-mxnet/issues/9844), the tests fail in test_bmod. It shouldn't have invoked elemwise_add

I don't know what is invoked in the process of calling test_bmod(). It could be that elemwise_add() isn't called, or that it is called earlier and corrupts memory, or maybe this has nothing to do with elemwise_add at all.

However, we seem to have a lot of tests that are suddenly failing... any ideas?


I don't have a clue right now. So far we see failures in random generators and binary operators. It's weird that these simple operators fail when they are seemingly unrelated to the MKLDNN operators.

I think the cause of this is that the mod operator does the computation in doubles, while the test forces float32. Also, the floating-point modulo operator seems to give different results on GPU vs CPU. Why is fmod in CUDA giving different results?

According to Table 7 here https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction-cuda-dynamic-parallelism there should be no difference in fmod.

https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L1511
https://github.com/apache/incubator-mxnet/blob/master/src/operator/mshadow_op.h#L402

>>> np.double(1.68) % np.double(1.30123)
0.37876999999999983
>>> np.float32(1.68) % np.float32(1.30123)
0.37877

I tried increasing the tolerance, but I found one failure where the difference is much bigger than expected (0.28679015). I think we should look deeper into this:

[-116.15162]   <- input
[-115.8648288] <- gradient
[0.28679015]   <- diff
[0.8396868]    <- a
[0.0020733]    <- b
FAIL
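A quick sanity check on those numbers (my own sketch, not part of the test): a/b sits almost exactly on an integer, so the floored quotient used by the gradient baseline -g_out * (a // b) can land on 404 or 405 depending on precision, and a one-step change is far outside the rtol=1e-3 / atol=1e-5 band:

a, b = 0.8396868, 0.0020733
print(a / b)                               # ~405.000145: just above an integer
print(116.15162 / 115.8648288, 405 / 404)  # both ~1.0024752, consistent with the two
                                           # gradients using quotients 405 vs 404
# assert_allclose tolerates at most atol + rtol * |desired| per element:
print(1e-5 + 1e-3 * 115.8648288)           # ~0.116, well below the 0.287 diff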

reproducible 100% with export MXNET_TEST_SEED=1688524483

nosetests-3.4 -s -v test_operator_gpu.py:test_binary_op

diff --git a/tests/python/unittest/test_operator.py b/tests/python/unittest/test_operator.py
index 5d38222..04e880c 100644
--- a/tests/python/unittest/test_operator.py
+++ b/tests/python/unittest/test_operator.py
@@ -1429,6 +1429,16 @@ def check_binary_op_backward(symbol, baseline, gen_data, rtol=1e-3, atol=1e-5):
         y.forward(is_train=True)
         y.backward([mx.nd.array(out)])
         assert_allclose(y_1.asnumpy(), x_1, rtol=rtol, atol=atol)
+        z = np.abs(y_2.asnumpy() - x_2)
+        w = np.where(z>atol)
+        if w[0].size > 0:
+            print("d[0].shape: {} d[1].shape: {} baseline_grad2.shape: {}".format(d[0].shape, d[1].shape, baseline_grad2.shape))
+            print(w)
+            print(y_2[w])
+            print(x_2[w])
+            print(z[w])
+            print(d[0][w])
+            print(d[1][w])
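For reference, here is a rough standalone version of that diagnostic (my sketch, not the patch itself) that could be reused in other comparisons; unlike the patch, it checks the full atol + rtol * |desired| band that assert_allclose actually applies:

import numpy as np

def report_mismatches(actual, desired, rtol=1e-3, atol=1e-5):
    # Print index, both values, and the absolute difference for every element
    # that falls outside the assert_allclose tolerance band.
    actual = np.asarray(actual)
    desired = np.asarray(desired)
    bad = np.where(np.abs(actual - desired) > atol + rtol * np.abs(desired))
    for idx in zip(*bad):
        print(idx, actual[idx], desired[idx], abs(actual[idx] - desired[idx]))
    return bad

# e.g. report_mismatches(y_2.asnumpy(), x_2) inside check_binary_op_backward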

I have likely found the root cause of this problem; just mentioning it so we don't duplicate effort on this one.

seed 1060292419
