Incubator-mxnet: Flaky test_operator.test_binary_op

Created on 21 Feb 2018 · 16 comments · Source: apache/incubator-mxnet

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/397/pipeline

======================================================================
FAIL: test_operator.test_binary_op
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py3\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\common.py", line 155, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\test_operator.py", line 1377, in test_binary_op
    test_bmod(a, b)
  File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\test_operator.py", line 1353, in test_bmod
    lambda g_out, a, b: (g_out, - g_out * (np.float32(a) // np.float32(b))), gen_binary_data)
  File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\test_operator.py", line 1319, in check_binary_op_backward
    assert_allclose(y_2.asnumpy(), x_2, rtol=rtol, atol=atol)
  File "C:\Anaconda3\envs\py3\lib\site-packages\numpy\testing\utils.py", line 1411, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "C:\Anaconda3\envs\py3\lib\site-packages\numpy\testing\utils.py", line 796, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-05

(mismatch 0.5555555555555571%)
 x: array([[[[ -3.451749e-01,  -0.000000e+00,  -0.000000e+00,  -6.440228e-01],
         [ -0.000000e+00,  -1.070805e+01,  -5.140794e-01,  -6.652636e-01],
         [ -2.817436e-01,  -0.000000e+00,  -0.000000e+00,  -4.327150e+00]],...
 y: array([[[[ -3.451749e-01,  -0.000000e+00,  -0.000000e+00,  -6.440228e-01],
         [ -0.000000e+00,  -1.070805e+01,  -5.140794e-01,  -6.652636e-01],
         [ -2.817437e-01,  -0.000000e+00,  -0.000000e+00,  -4.327150e+00]],...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=94585786 to reproduce.
--------------------- >> end captured logging << ---------------------
Labels: Bug, Flaky, Operator, Test
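To retry this locally, here is a minimal reproduction sketch (my own, not from the issue) that reruns the test with the seed captured in the log above; the nose binary name and test path are assumptions and may differ on your setup (e.g. nosetests-3.4 on the CI slaves):

import os
import subprocess

# Hypothetical helper: rerun test_binary_op with the seed from the captured log.
# Assumes a local MXNet checkout with nose installed; adjust binary name/path as needed.
os.environ["MXNET_TEST_SEED"] = "94585786"
subprocess.run(
    ["nosetests", "-s", "-v",
     "tests/python/unittest/test_operator.py:test_binary_op"],
    check=False,  # expected to fail if the flake reproduces
)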


All 16 comments

This hasn't broken in like a year to my knowledge.

Is that an MKL build?

No, on Windows we only run OpenBLAS.

It seems https://github.com/apache/incubator-mxnet/issues/9853 fails in the same test. It's strange how this fails.

Wasn't elemwise_add changed to use mkl?
btw, was it verified that mkl is faster for all shapes and types? I saw it allocates memory, which seems like it might be slow.

I don't think we have the tools to measure performance on that scale yet. As far as I know, this is in the works. Since there's still some time until 1.2, we can definitely gather these numbers.

Well even if it was changed to use MKL, this would not apply here since we're running on OpenBLAS, right?

@cjolivier01 in both cases (https://github.com/apache/incubator-mxnet/issues/9853 and https://github.com/apache/incubator-mxnet/issues/9844), the tests fail in test_bmod. It shouldn't have invoked elemwise_add

I don't know what is invoked in the process of calling test_bmod(). It could be that elemwise_add() isn't called, or that it is called earlier and corrupts memory, or maybe this has nothing to do with elemwise_add at all.

However, we seem to have a lot of tests that are suddenly failing... any ideas?


I don't have a clue right now. So far we see failures in random generators and binary operators. It's weird that these simple operators fail when they are seemingly unrelated to the MKLDNN operators.

I think the cause of this is that the mod operator does the computation in doubles, while the test forces float32. Also, the floating-point modulo operator seems to give different results on GPU vs CPU. Why is fmod in CUDA giving different results?

According to Table 7 here https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction-cuda-dynamic-parallelism there should be no difference in fmod.

https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L1511
https://github.com/apache/incubator-mxnet/blob/master/src/operator/mshadow_op.h#L402

>>> np.double(1.68) % np.double(1.30123)
0.37876999999999983
>>> np.float32(1.68) % np.float32(1.30123)
0.37877

I tried increasing the tolerance, but I found one failure where the difference is much bigger than expected (0.28679015). I think we should look deeper into this:

[-116.15162]   <- input
[-115.8648288] <- gradient
[0.28679015]   <- diff
[0.8396868]    <- a
[0.0020733]    <- b
FAIL
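A quick sanity check on those numbers (my own sketch, not part of the test): a/b sits almost exactly on an integer, so the floored quotient used by the gradient baseline -g_out * (a // b) can land on 404 or 405 depending on precision, and a one-step change is far outside the rtol=1e-3 / atol=1e-5 band:

a, b = 0.8396868, 0.0020733
print(a / b)                               # ~405.000145: just above an integer
print(116.15162 / 115.8648288, 405 / 404)  # both ~1.0024752, consistent with the two
                                           # gradients using quotients 405 vs 404
# assert_allclose tolerates at most atol + rtol * |desired| per element:
print(1e-5 + 1e-3 * 115.8648288)           # ~0.116, well below the 0.287 diff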

reproducible 100% with export MXNET_TEST_SEED=1688524483

nosetests-3.4 -s -v test_operator_gpu.py:test_binary_op

diff --git a/tests/python/unittest/test_operator.py b/tests/python/unittest/test_operator.py
index 5d38222..04e880c 100644
--- a/tests/python/unittest/test_operator.py
+++ b/tests/python/unittest/test_operator.py
@@ -1429,6 +1429,16 @@ def check_binary_op_backward(symbol, baseline, gen_data, rtol=1e-3, atol=1e-5):
         y.forward(is_train=True)
         y.backward([mx.nd.array(out)])
         assert_allclose(y_1.asnumpy(), x_1, rtol=rtol, atol=atol)
+        z = np.abs(y_2.asnumpy() - x_2)
+        w = np.where(z>atol)
+        if w[0].size > 0:
+            print("d[0].shape: {} d[1].shape: {} baseline_grad2.shape: {}".format(d[0].shape, d[1].shape, baseline_grad2.shape))
+            print(w)
+            print(y_2[w])
+            print(x_2[w])
+            print(z[w])
+            print(d[0][w])
+            print(d[1][w])
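For reference, here is a rough standalone version of that diagnostic (my sketch, not the patch itself) that could be reused in other comparisons; unlike the patch, it checks the full atol + rtol * |desired| band that assert_allclose actually applies:

import numpy as np

def report_mismatches(actual, desired, rtol=1e-3, atol=1e-5):
    # Print index, both values, and the absolute difference for every element
    # that falls outside the assert_allclose tolerance band.
    actual = np.asarray(actual)
    desired = np.asarray(desired)
    bad = np.where(np.abs(actual - desired) > atol + rtol * np.abs(desired))
    for idx in zip(*bad):
        print(idx, actual[idx], desired[idx], abs(actual[idx] - desired[idx]))
    return bad

# e.g. report_mismatches(y_2.asnumpy(), x_2) inside check_binary_op_backward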

I have likely found the root cause of this problem; just mentioning it so we don't duplicate effort on this one.

seed 1060292419
