Incubator-mxnet: Flaky test: test_operator_gpu.test_sequence_last causes 'CUDA: unspecified launch failure'

Created on 25 Jun 2018 · 21 comments · Source: apache/incubator-mxnet

Sometimes our slaves get corrupted and suddenly all tests start to fail. This is not directly related to the tests themselves.

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11377/5/pipeline/

======================================================================

ERROR: test_operator_gpu.test_op_roi_align

----------------------------------------------------------------------

Traceback (most recent call last):

  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\case.py", line 197, in runTest

    self.test(*self.arg)

  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\util.py", line 620, in newfunc

    return func(*arg, **kw)

  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\common.py", line 157, in test_new

    orig_test(*args, **kwargs)

  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\test_operator.py", line 6269, in test_op_roi_align

    test_roi_align_value()

  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\test_operator.py", line 6230, in test_roi_align_value

    data = mx.nd.array(np.arange(N*C*W*H).reshape((N,C,H,W)), ctx=ctx, dtype = dtype)

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\utils.py", line 146, in array

    return _array(source_array, ctx=ctx, dtype=dtype)

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 2357, in array

    arr[:] = source_array

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 444, in __setitem__

    self._set_nd_basic_indexing(key, value)

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 710, in _set_nd_basic_indexing

    self._sync_copyfrom(value)

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 876, in _sync_copyfrom

    ctypes.c_size_t(source_array.size)))

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\base.py", line 210, in check_call

    raise MXNetError(py_str(_LIB.MXGetLastError()))

MXNetError: [06:35:08] c:\jenkins_slave\workspace\build-gpu\3rdparty\mshadow\mshadow\./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure

-------------------- >> begin captured logging << --------------------

common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1046236735 to reproduce.

--------------------- >> end captured logging << ---------------------
Labels: Breaking, Disabled test, Flaky, Test, Windows


All 21 comments

@DickJC123

In a private communication, you indicated this was seen on all platforms. Here you tag it as 'Windows'. Please clarify.

Sorry Dick, I just double-checked my database and it seems to happen only on Windows. It looks like I mixed something up; please excuse me for that.

Config: Windows Server 2016, g3.8xlarge, CUDA 8, unknown driver version

This seems to be caused by test_conv:
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/master/runs/1050/nodes/752/steps/1370/log/?start=0

test_operator_gpu.test_ndarray_crop ... ok
test_operator_gpu.test_cell_fill_shape ... ok
test_operator_gpu.test_conv ... [00:06:32] c:\jenkins_slave\workspace\build-gpu\src\operator\nn\cudnn\./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:148: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:227: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:148: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:227: This convolution is not supported by cudnn, MXNET convolution is applied.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=448545343 to reproduce.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1330733257 to reproduce.
ERROR
test_operator_gpu.test_layer_fill_shape ... ERROR
test_operator_gpu.test_normal_generator ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1592512793 to reproduce.
ERROR
test_operator_gpu.test_ndarray_concatenate ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=466881030 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_transpose ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=221598447 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_storage_fallback ... [00:06:32] c:\jenkins_slave\workspace\build-gpu\src\operator\../common/utils.h:417: 
Storage type fallback detected:
operator = broadcast_add
input storage types = [default, default, ]
output storage types = [csr, ]
params = {}
context.dev_mask = gpu
The operator with default storage type will be dispatched for execution. You're seeing this warning message because the operator above is unable to process the given ndarrays with specified storage types, context and parameter. Temporary dense ndarrays are generated in order to execute the operator. This does not affect the correctness of the programme. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1026960206 to reproduce.
ERROR
test_operator_gpu.test_clip ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1647052128 to reproduce.
ERROR
test_operator_gpu.test_convolution_with_type ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1234 to reproduce.
ERROR

I have a lead on the problem. There is an out-of-bound read performed by the SequenceLastKernel. I'll stop here and let the person responsible for this kernel correct the problem. Kernels that read beyond their valid input tensor regions can be problematic, even if the random data read is never used in a subsequent kernel write. The problem surfaces when the reads are outside of valid mapped address ranges, which results in an unservicable TLB miss. The problems can be non-deterministic since the input tensors may have non-deterministic placement within their mapped pages.
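
For reference, here is a minimal, self-contained CUDA sketch of the kind of bounds clamping that avoids this class of out-of-bounds read. It is not MXNet's actual SequenceLastKernel; the kernel name, the [max_len, batch, feat] data layout, and the host demo are assumptions made purely for illustration. It compiles with nvcc and stays within the input tensor even when a sequence length is out of range:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical guarded gather (not MXNet's SequenceLastKernel):
// one thread per (batch, feature) pair copies the last valid time step.
__global__ void seq_last_gather(const float* in, const int* seq_len, float* out,
                                int max_len, int batch, int feat) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= batch * feat) return;                       // guard the flat index
  int b = i / feat, f = i % feat;
  // Clamp the time index so a bogus seq_len can never push the read
  // outside the [max_len, batch, feat] input tensor.
  int t = min(max(seq_len[b] - 1, 0), max_len - 1);
  out[b * feat + f] = in[(t * batch + b) * feat + f];
}

int main() {
  const int max_len = 4, batch = 3, feat = 2;
  std::vector<float> h_in(max_len * batch * feat);
  for (size_t k = 0; k < h_in.size(); ++k) h_in[k] = static_cast<float>(k);
  std::vector<int> h_len = {1, 4, 9};                  // 9 is deliberately out of range
  float *d_in, *d_out; int *d_len;
  cudaMalloc(&d_in, h_in.size() * sizeof(float));
  cudaMalloc(&d_out, batch * feat * sizeof(float));
  cudaMalloc(&d_len, batch * sizeof(int));
  cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_len, h_len.data(), batch * sizeof(int), cudaMemcpyHostToDevice);
  seq_last_gather<<<1, 64>>>(d_in, d_len, d_out, max_len, batch, feat);
  std::vector<float> h_out(batch * feat);
  cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);
  for (int b = 0; b < batch; ++b)
    printf("batch %d -> %.0f %.0f\n", b, h_out[b * feat], h_out[b * feat + 1]);
  cudaFree(d_in); cudaFree(d_out); cudaFree(d_len);
  return 0;
}

An unclamped index computed from sequence-length data is exactly the pattern that produces the "Invalid __global__ read" that cuda-memcheck reports below.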

I debugged the problem by going to the first test that showed the failure in one of the above posts, captured the MXNET_TEST_SEED, and then reproduced the error (on Linux no less) with the following command:

MXNET_TEST_SEED=731510245 cuda-memcheck nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last | c++filt
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1613755850 to reproduce.
[WARNING] *** test-level seed set: all "@with_seed()" tests run deterministically ***
test_operator_gpu.test_sequence_last ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731510245 to reproduce.
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000390 in void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>)
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x7f13f24003f8 is out of bounds
=========     Device Frame:void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>) (void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>) : 0x390)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24cc4d]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 [0x15680]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaLaunch + 0x14e) [0x33c9e]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::op::SequenceLastOp<mshadow::gpu, float>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) + 0xc3a) [0x53384da]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) + 0x363) [0x3214a53]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::exec::StatefulComputeExecutor::Run(mxnet::RunContext, bool) + 0x59) [0x3808f09]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so [0x37d5870]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) + 0x8e5) [0x372ea15]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) + 0xeb) [0x3745b1b]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) + 0x4e) [0x3745d8e]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() + 0x4a) [0x372e01a]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaStreamSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3496d3]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaStreamSynchronize + 0x176) [0x47336]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mshadow::Stream<mshadow::gpu>::Wait() + 0x26) [0x32635a6]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so [0x37d5945]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) + 0x8e5) [0x372ea15]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) + 0xeb) [0x3745b1b]
ERROR

======================================================================
ERROR: test_operator_gpu.test_sequence_last
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/test_operator.py", line 2998, in test_sequence_last
    check_sequence_func("last", axis=0)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/test_operator.py", line 2989, in check_sequence_func
    numeric_eps=1e-2, rtol=1e-2)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/test_utils.py", line 906, in check_numeric_gradient
    eps=numeric_eps, use_forward_train=use_forward_train, dtype=dtype)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/test_utils.py", line 781, in numeric_grad
    f_neps = executor.outputs[0].asnumpy()
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/ndarray/ndarray.py", line 1894, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/base.py", line 210, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [11:31:45] /home/dcarter/mxnet_dev/dgx/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f14f0b6619b]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f14f0b66d08]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0xd8) [0x7f14f31bc658]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(+0x37d5945) [0x7f14f372e945]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f14f3687a15]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f14f369eb1b]
[bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f14f369ed8e]
[bt] (7) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f14f368701a]
[bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f151efb1c80]
[bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f15269af6ba]


-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1613755850 to reproduce.
common: WARNING: *** test-level seed set: all "@with_seed()" tests run deterministically ***
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731510245 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 22.915s

FAILED (errors=1)
terminate called after throwing an instance of 'dmlc::Error'
  what():  [11:31:45] src/storage/./pooled_storage_manager.h:77: CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f14f0b6619b]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f14f0b66d08]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle)+0x8f) [0x7f14f36aa8cf]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::ReleaseAll()+0x95) [0x7f14f36a2ef5]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::~GPUPooledStorageManager()+0x1a) [0x7f14f36aa9ca]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Sp_counted_ptr<mxnet::StorageImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0xa23) [0x7f14f36a9bd3]
[bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::shared_ptr<mxnet::Storage>::~shared_ptr()+0x52) [0x7f14f36aa822]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(+0x39ff8) [0x7f1526617ff8]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(+0x3a045) [0x7f1526618045]
[bt] (9) /usr/bin/python() [0x51dc1f]


=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) + 0x4e) [0x3745d8e]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() + 0x4a) [0x372e01a]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaFree. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3496d3]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaFree + 0x1a0) [0x419b0]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle) + 0x32) [0x3751872]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::ReleaseAll() + 0x95) [0x3749ef5]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::~GPUPooledStorageManager() + 0x1a) [0x37519ca]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Sp_counted_ptr<mxnet::StorageImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xa23) [0x3750bd3]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::shared_ptr<mxnet::Storage>::~shared_ptr() + 0x52) [0x3751822]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x39ff8]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x3a045]
=========     Host Frame:/usr/bin/python [0x11dc1f]
=========     Host Frame:/usr/bin/python [0x11b1b7]
=========     Host Frame:/usr/bin/python (PyErr_PrintEx + 0x2d) [0x11aadd]
=========     Host Frame:/usr/bin/python [0x309d5]
=========     Host Frame:/usr/bin/python (Py_Main + 0x612) [0x93ae2]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
=========     Host Frame:/usr/bin/python (_start + 0x29) [0x933e9]
=========
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found

I'm still blocked by this. Was it really disabled?

I don't think anyone disabled the test. Opening a PR to do so here: https://github.com/apache/incubator-mxnet/pull/11485

@DickJC123 Thank you for investigating this issue! I am not able to reproduce the test failure you pointed out with commit b786ead562590300519a3f0725dafe7d40325edd and the same test seed (731510245). Are you able to reproduce the issue with the latest master?

(I'll answer on behalf of Dick because he told me he will be quite busy around this time.)
Hey Anirudh, could you elaborate on how you tried to reproduce this issue? Please be aware that this is about memory corruption caused by out-of-bounds accesses. This behaviour might not always be reproducible locally. Did you use cuda-memcheck to investigate?

Yes, I am not able to reproduce similar errors with cuda-memcheck either.

Tried the following:

MXNET_TEST_SEED=731510245 nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last

Also did:

MXNET_TEST_SEED=731510245 cuda-memcheck nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last

I couldn't reproduce the "Invalid __global__ read of size 4" error with cuda-memcheck.

How many times did you run it?

I ran it repeatedly, around 2k times.

Unspecified launch failure here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17435/7/pipeline

What's the workaround? I need to get a PR through that has nothing to do with CUDA...

I haven't found one. I kept retriggering and finally gave up.
