======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
return func(*arg, **kw)
File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
check_layer_forward_withinput(net, x)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
ctypes.c_size_t(data.size)))
File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [04:48:16] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
Possibly related to:
Failing test: test_gluon_gpu.test_slice_batchnorm: https://github.com/apache/incubator-mxnet/issues/12715
I'm not sure this is a flaky test; I think it's a CUDA/cuDNN or CI environment problem. Could you reproduce it?
@mxnet-label-bot [flaky, Gluon]
Another consecutive run failed on master CI:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1728/pipeline
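For context, the 1073741824 bytes in the error corresponds to the Convolution operator's default workspace of 1024 MB, which caps how much scratch memory cuDNN may use while searching for a forward algorithm. Below is a minimal sketch (not a fix for the test itself; the shapes are illustrative) of how that limit can be raised through the operator's workspace parameter:
import mxnet as mx
# Illustrative tensors; any NCHW input with a matching weight will do.
ctx = mx.gpu(0)
data = mx.nd.random.uniform(shape=(8, 3, 32, 32), ctx=ctx)
weight = mx.nd.random.uniform(shape=(16, 3, 3, 3), ctx=ctx)
bias = mx.nd.zeros((16,), ctx=ctx)
# workspace is given in MB; 2048 doubles the default 1024 MB budget that
# cuDNN gets for its forward-algorithm search.
out = mx.nd.Convolution(data=data, weight=weight, bias=bias,
                        kernel=(3, 3), num_filter=16, workspace=2048)
out.wait_to_read()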
======================================================================
FAIL: test_mkldnn.test_Deconvolution
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 346, in test_Deconvolution
check_Deconvolution_training(stype)
File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 342, in check_Deconvolution_training
check_numeric_gradient(test, in_location, numeric_eps=1e-2, rtol=0.16, atol=1e-4)
File "/work/mxnet/python/mxnet/test_utils.py", line 915, in check_numeric_gradient
("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 3.121914 exceeds tolerance rtol=0.160000, atol=0.000100. Location of maximum error:(2, 1, 5), a=-0.000381, b=-0.001386
NUMERICAL_data: array([[[-0.6184697 , -0.50860643, -0.6415248 , ..., -0.7978529 ,
-0.8801222 , -0.7802248 ],
[-0.26806593, -0.1953423 , -0.14332533, ..., -0.17287433,...
BACKWARD_data: array([[[-0.6174789 , -0.5086705 , -0.6417394 , ..., -0.79945517,
-0.88075024, -0.77997565],
[-0.26776323, -0.19459067, -0.14422962, ..., -0.1742437 ,...
The deconvolution failure is tracked in https://github.com/apache/incubator-mxnet/issues/12579
Flaky test failure.
Please refer to the Jenkins log below:
Log
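For readers unfamiliar with the failing check: check_numeric_gradient compares the operator's backward pass against a finite-difference estimate of the gradient. Here is a minimal sketch of that pattern with the same tolerances as the failing call (the symbol and shapes below are illustrative, not the ones used in test_mkldnn.py):
import mxnet as mx
import numpy as np
# Symbolic Deconvolution whose backward pass we want to cross-check.
data = mx.sym.Variable('data')
weight = mx.sym.Variable('weight')
deconv = mx.sym.Deconvolution(data=data, weight=weight, kernel=(3, 3),
                              num_filter=4, no_bias=True)
# Input values in the order of deconv.list_arguments(): data, then weight
# (Deconvolution weights have shape (in_channels, num_filter, kH, kW)).
in_location = [np.random.normal(size=(2, 3, 8, 8)),
               np.random.normal(size=(3, 4, 3, 3))]
# Perturbs each input by numeric_eps and checks that the finite-difference
# gradient agrees with the analytic backward gradient within rtol/atol.
mx.test_utils.check_numeric_gradient(deconv, in_location,
                                     numeric_eps=1e-2, rtol=0.16, atol=1e-4)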
Another failure:
======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
return func(*arg, **kw)
File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
check_layer_forward_withinput(net, x)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
ctypes.c_size_t(data.size)))
File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [23:05:11] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
Another failure can be seen here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12826/3/pipeline/996
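Every one of these tracebacks fails inside check_layer_forward_withinput (test_gluon.py line 1507), which runs the same input through the network imperatively and hybridized and compares the input gradients. A minimal sketch of that pattern follows; the small Conv2D/BatchNorm network is illustrative, not the one built by the test:
import mxnet as mx
from mxnet import autograd, gluon
ctx = mx.gpu(0)
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Conv2D(4, kernel_size=3), gluon.nn.BatchNorm())
net.initialize(ctx=ctx)
x = mx.nd.random.uniform(shape=(2, 3, 16, 16), ctx=ctx)
x_hybrid = x.copy()
x.attach_grad()
x_hybrid.attach_grad()
# Imperative pass.
with autograd.record():
    out = net(x)
out.backward()
# Hybridized pass with the same parameters and the same input values.
net.hybridize()
with autograd.record():
    out_hybrid = net(x_hybrid)
out_hybrid.backward()
# The failing assertion: input gradients of both paths must agree.
mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(),
                                  rtol=1e-5, atol=1e-6)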
======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
return func(*arg, **kw)
File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
check_layer_forward_withinput(net, x)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
ctypes.c_size_t(data.size)))
File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [21:28:14] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
@lebeg Is there anybody working on this? Tests are still failing.
Another failure for me here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12749/18/pipeline/996
======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
return func(*arg, **kw)
File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
check_layer_forward_withinput(net, x)
File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
ctypes.c_size_t(data.size)))
File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [00:03:15] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
@lanking520 I proposed a mitigation here until this is fixed: https://github.com/apache/incubator-mxnet/pull/12768. You are welcome to join the discussion and help merge it. Although it will not fix the underlying problem, it could reduce the failure rate.
As far as I know, @nswamy was investigating the root cause.
We have been working on updating the CUDA drivers: https://github.com/apache/incubator-mxnet/pull/12850, but this is blocked until new AMIs with updated CUDA drivers are deployed.
@larroy is currently doing the driver updates.
Duplicate issue: https://github.com/apache/incubator-mxnet/issues/12887
Did you re-enable the test?
On Thu, Nov 1, 2018 at 8:05 AM Anton Chernov (notifications@github.com) wrote:
Closed #12767: https://github.com/apache/incubator-mxnet/issues/12767.
@nswamy I was thinking https://github.com/apache/incubator-mxnet/pull/12986/files would re-enable it.
https://github.com/apache/incubator-mxnet/pull/12986 re-enabled the test.