======================================================================
FAIL: test_operator.test_binary_op
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Anaconda3\envs\py3\lib\site-packages\nose\case.py", line 197, in runTest
self.test(*self.arg)
File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\common.py", line 155, in test_new
orig_test(*args, **kwargs)
File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\test_operator.py", line 1377, in test_binary_op
test_bmod(a, b)
File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\test_operator.py", line 1353, in test_bmod
lambda g_out, a, b: (g_out, - g_out * (np.float32(a) // np.float32(b))), gen_binary_data)
File "C:\jenkins_slave\workspace\ut-python-gpu@2\tests\python\unittest\test_operator.py", line 1319, in check_binary_op_backward
assert_allclose(y_2.asnumpy(), x_2, rtol=rtol, atol=atol)
File "C:\Anaconda3\envs\py3\lib\site-packages\numpy\testing\utils.py", line 1411, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "C:\Anaconda3\envs\py3\lib\site-packages\numpy\testing\utils.py", line 796, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-05
(mismatch 0.5555555555555571%)
x: array([[[[ -3.451749e-01, -0.000000e+00, -0.000000e+00, -6.440228e-01],
[ -0.000000e+00, -1.070805e+01, -5.140794e-01, -6.652636e-01],
[ -2.817436e-01, -0.000000e+00, -0.000000e+00, -4.327150e+00]],...
y: array([[[[ -3.451749e-01, -0.000000e+00, -0.000000e+00, -6.440228e-01],
[ -0.000000e+00, -1.070805e+01, -5.140794e-01, -6.652636e-01],
[ -2.817437e-01, -0.000000e+00, -0.000000e+00, -4.327150e+00]],...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=94585786 to reproduce.
--------------------- >> end captured logging << ---------------------
This hasn't broken in about a year, to my knowledge.
Is that an MKL build?
No, on Windows we only run OpenBLAS.
It seems https://github.com/apache/incubator-mxnet/issues/9853 fails in the same test. It's strange that this is failing.
Wasn't elemwise_add changed to use MKL?
By the way, was it verified that MKL is faster for all shapes and types? I saw that it allocates memory, which seems like it might be slow.
I don't think we have the tools to measure performance on that scale yet. As far as I know, this is in the works. Since there's still some time until 1.2, we can definitely gather these numbers.
Well even if it was changed to use MKL, this would not apply here since we're running on OpenBLAS, right?
@cjolivier01 in both cases (https://github.com/apache/incubator-mxnet/issues/9853 and https://github.com/apache/incubator-mxnet/issues/9844), the tests fail in test_bmod. They shouldn't have invoked elemwise_add.
I don't know what is invoked in the process of calling test_bmod(). It could be that elemwise_add() isn't called, or that it's called earlier and corrupts memory, or maybe this has nothing to do with elemwise_add at all. However, we seem to have a lot of tests suddenly failing... any ideas?
I don't have a clue right now. So far we see failures in random generators and binary operators. It's weird that it fails in these simple operators, which seem irrelevant to the MKLDNN operators.
I think the cause of this is that the mod operator does the computation in doubles, while the test forces float32. Also, the modulo operator for floating point seems to give different results on GPU vs. CPU. Why would fmod in CUDA give different results?
According to table 7 here https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction-cuda-dynamic-parallelism
there should be no differences in fmod.
https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L1511
https://github.com/apache/incubator-mxnet/blob/master/src/operator/mshadow_op.h#L402
>>> np.double(1.68) % np.double(1.30123)
0.37876999999999983
>>> np.float32(1.68) % np.float32(1.30123)
0.37877
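For what it's worth, the precision of the baseline matters even more for the backward check than for the forward value: the test's reference gradient for b is -g_out * (a // b) (see the lambda in the traceback above), and the floor can land on a different integer in float32 than in float64 whenever a/b sits right next to an integer. A small illustrative sketch with hand-picked values (0.3 and 0.1 are not taken from the failing test):

import numpy as np

# Hand-picked illustration, not values from the failing test: the float32
# roundings of 0.3 and 0.1 have a quotient just above 3, while the float64
# roundings have a quotient just below 3, so the floor differs by one.
print(np.float32(0.3) // np.float32(0.1))   # -> 3.0
print(np.float64(0.3) // np.float64(0.1))   # -> 2.0

# In the test's baseline gradient for b, -g_out * (a // b), a one-step
# difference in the floor shifts the result by a full |g_out|, far more
# than the last-bit rounding shown in the snippet above.
g_out = np.float32(2.5)                     # arbitrary upstream gradient
print(-g_out * (np.float32(0.3) // np.float32(0.1)))              # -> -7.5
print(-np.float64(g_out) * (np.float64(0.3) // np.float64(0.1)))  # -> -5.0

Whether the actual failing elements hit such a boundary depends on the random data, but it shows why a float32-forced baseline and a double-computing operator can disagree by much more than one ULP.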
I tried increasing the tolerance, but I found one failure where the difference is much bigger than expected: 0.28679015. I think we should look deeper into this.
[-116.15162] <- input
[0.28679015] <- diff
[0.8396868] <- a
[0.0020733] <- b
FAIL
Reproducible 100% with:
export MXNET_TEST_SEED=1688524483
nosetests-3.4 -s -v test_operator_gpu.py:test_binary_op
diff --git a/tests/python/unittest/test_operator.py b/tests/python/unittest/test_operator.py
index 5d38222..04e880c 100644
--- a/tests/python/unittest/test_operator.py
+++ b/tests/python/unittest/test_operator.py
@@ -1429,6 +1429,16 @@ def check_binary_op_backward(symbol, baseline, gen_data, rtol=1e-3, atol=1e-5):
y.forward(is_train=True)
y.backward([mx.nd.array(out)])
assert_allclose(y_1.asnumpy(), x_1, rtol=rtol, atol=atol)
+ z = np.abs(y_2.asnumpy() - x_2)
+ w = np.where(z>atol)
+ if w[0].size > 0:
+ print("d[0].shape: {} d[1].shape: {} baseline_grad2.shape: {}".format(d[0].shape, d[1].shape, baseline_grad2.shape))
+ print(w)
+ print(y_2[w])
+ print(x_2[w])
+ print(z[w])
+ print(d[0][w])
+ print(d[1][w])
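As a quick sanity check while digging into this, one can re-evaluate the test's numpy baseline gradient for b at the values printed above in both float32 and float64. This is only a sketch of the baseline side; it does not model what the operator's CPU/GPU kernel actually computes, so it can only tell whether the baseline's own precision accounts for a gap of this size:

import numpy as np

# Values copied from the debug printout above (input = g_out, then a and b).
g_out = -116.15162
a, b = 0.8396868, 0.0020733

# The test's baseline gradient w.r.t. b, evaluated in both precisions.
grad_b_f32 = -np.float32(g_out) * (np.float32(a) // np.float32(b))
grad_b_f64 = -np.float64(g_out) * (np.float64(a) // np.float64(b))
print(grad_b_f32, grad_b_f64, abs(np.float64(grad_b_f32) - grad_b_f64))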
Just so we don't duplicate effort on this one: I have likely found the root cause of this problem.
seed 1060292419