======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
return func(*arg, **kw)
File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
check_layer_forward_withinput(net, x)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
ctypes.c_size_t(data.size)))
File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [04:48:16] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
Possibly related to:
Failing test: test_gluon_gpu.test_slice_batchnorm: https://github.com/apache/incubator-mxnet/issues/12715
I'm not sure this is a flaky test; I think it's a CUDA/cuDNN or CI environment problem. Could you reproduce it?
@mxnet-label-bot [flaky, Gluon]
Another consecutive run failed on master CI:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1728/pipeline
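For context, the 1073741824 bytes in the error corresponds to the Convolution operator's default workspace of 1024 MB, which caps how much scratch memory cuDNN may use while searching for a forward algorithm. Below is a minimal sketch (not a fix for the test itself; the shapes are illustrative) of how that limit can be raised through the operator's workspace parameter:
import mxnet as mx
# Illustrative tensors; any NCHW input with a matching weight will do.
ctx = mx.gpu(0)
data = mx.nd.random.uniform(shape=(8, 3, 32, 32), ctx=ctx)
weight = mx.nd.random.uniform(shape=(16, 3, 3, 3), ctx=ctx)
bias = mx.nd.zeros((16,), ctx=ctx)
# workspace is given in MB; 2048 doubles the default 1024 MB budget that
# cuDNN gets for its forward-algorithm search.
out = mx.nd.Convolution(data=data, weight=weight, bias=bias,
                        kernel=(3, 3), num_filter=16, workspace=2048)
out.wait_to_read()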
======================================================================
FAIL: test_mkldnn.test_Deconvolution
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 346, in test_Deconvolution
check_Deconvolution_training(stype)
File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 342, in check_Deconvolution_training
check_numeric_gradient(test, in_location, numeric_eps=1e-2, rtol=0.16, atol=1e-4)
File "/work/mxnet/python/mxnet/test_utils.py", line 915, in check_numeric_gradient
("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 3.121914 exceeds tolerance rtol=0.160000, atol=0.000100. Location of maximum error:(2, 1, 5), a=-0.000381, b=-0.001386
NUMERICAL_data: array([[[-0.6184697 , -0.50860643, -0.6415248 , ..., -0.7978529 ,
-0.8801222 , -0.7802248 ],
[-0.26806593, -0.1953423 , -0.14332533, ..., -0.17287433,...
BACKWARD_data: array([[[-0.6174789 , -0.5086705 , -0.6417394 , ..., -0.79945517,
-0.88075024, -0.77997565],
[-0.26776323, -0.19459067, -0.14422962, ..., -0.1742437 ,...
The deconvolution failure is tracked in https://github.com/apache/incubator-mxnet/issues/12579
Flaky test failure.
Please refer to the Jenkins log below:
Log
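For readers unfamiliar with the failing check: check_numeric_gradient compares the operator's backward pass against a finite-difference estimate of the gradient. Here is a minimal sketch of that pattern with the same tolerances as the failing call (the symbol and shapes below are illustrative, not the ones used in test_mkldnn.py):
import mxnet as mx
import numpy as np
# Symbolic Deconvolution whose backward pass we want to cross-check.
data = mx.sym.Variable('data')
weight = mx.sym.Variable('weight')
deconv = mx.sym.Deconvolution(data=data, weight=weight, kernel=(3, 3),
                              num_filter=4, no_bias=True)
# Input values in the order of deconv.list_arguments(): data, then weight
# (Deconvolution weights have shape (in_channels, num_filter, kH, kW)).
in_location = [np.random.normal(size=(2, 3, 8, 8)),
               np.random.normal(size=(3, 4, 3, 3))]
# Perturbs each input by numeric_eps and checks that the finite-difference
# gradient agrees with the analytic backward gradient within rtol/atol.
mx.test_utils.check_numeric_gradient(deconv, in_location,
                                     numeric_eps=1e-2, rtol=0.16, atol=1e-4)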
Another failure:
======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
return func(*arg, **kw)
File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
check_layer_forward_withinput(net, x)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
ctypes.c_size_t(data.size)))
File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [23:05:11] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
Another failure can be seen here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12826/3/pipeline/996
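Every one of these tracebacks fails inside check_layer_forward_withinput (test_gluon.py line 1507), which runs the same input through the network imperatively and hybridized and compares the input gradients. A minimal sketch of that pattern follows; the small Conv2D/BatchNorm network is illustrative, not the one built by the test:
import mxnet as mx
from mxnet import autograd, gluon
ctx = mx.gpu(0)
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Conv2D(4, kernel_size=3), gluon.nn.BatchNorm())
net.initialize(ctx=ctx)
x = mx.nd.random.uniform(shape=(2, 3, 16, 16), ctx=ctx)
x_hybrid = x.copy()
x.attach_grad()
x_hybrid.attach_grad()
# Imperative pass.
with autograd.record():
    out = net(x)
out.backward()
# Hybridized pass with the same parameters and the same input values.
net.hybridize()
with autograd.record():
    out_hybrid = net(x_hybrid)
out_hybrid.backward()
# The failing assertion: input gradients of both paths must agree.
mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(),
                                  rtol=1e-5, atol=1e-6)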
======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
return func(*arg, **kw)
File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
check_layer_forward_withinput(net, x)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
ctypes.c_size_t(data.size)))
File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [21:28:14] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
@lebeg Is there anybody working on this? Tests are still failing.
Another failure for me here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12749/18/pipeline/996
======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
return func(*arg, **kw)
File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
check_layer_forward_withinput(net, x)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
ctypes.c_size_t(data.size)))
File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [00:03:15] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
@lanking520 I proposed a mitigation here until this is fixed: https://github.com/apache/incubator-mxnet/pull/12768. You are welcome to join the discussion and help merge it. Although it will not fix the underlying problem, it could reduce the failure rate.
As far as I know, @nswamy was investigating the root cause.
We have been working on updating the CUDA drivers: https://github.com/apache/incubator-mxnet/pull/12850, but this is blocked until new AMIs with updated CUDA drivers are deployed.
@larroy is currently doing the driver updates.
Duplicate issue: https://github.com/apache/incubator-mxnet/issues/12887
Did you re-enable the test?
On Thu, Nov 1, 2018 at 8:05 AM Anton Chernov (notifications@github.com) wrote:
Closed #12767: https://github.com/apache/incubator-mxnet/issues/12767.
@nswamy I was thinking https://github.com/apache/incubator-mxnet/pull/12986/files would re-enable it.
https://github.com/apache/incubator-mxnet/pull/12986 re-enabled the test.