Sometimes our Jenkins slaves get corrupted and suddenly all tests start to fail. This is not directly related to the tests themselves.
======================================================================
ERROR: test_operator_gpu.test_op_roi_align
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Anaconda3\envs\py2\lib\site-packages\nose\case.py", line 197, in runTest
self.test(*self.arg)
File "C:\Anaconda3\envs\py2\lib\site-packages\nose\util.py", line 620, in newfunc
return func(*arg, **kw)
File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\common.py", line 157, in test_new
orig_test(*args, **kwargs)
File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\test_operator.py", line 6269, in test_op_roi_align
test_roi_align_value()
File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\test_operator.py", line 6230, in test_roi_align_value
data = mx.nd.array(np.arange(N*C*W*H).reshape((N,C,H,W)), ctx=ctx, dtype = dtype)
File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\utils.py", line 146, in array
return _array(source_array, ctx=ctx, dtype=dtype)
File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 2357, in array
arr[:] = source_array
File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 444, in __setitem__
self._set_nd_basic_indexing(key, value)
File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 710, in _set_nd_basic_indexing
self._sync_copyfrom(value)
File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 876, in _sync_copyfrom
ctypes.c_size_t(source_array.size)))
File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\base.py", line 210, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [06:35:08] c:\jenkins_slave\workspace\build-gpu\3rdparty\mshadow\mshadow\./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1046236735 to reproduce.
--------------------- >> end captured logging << ---------------------
@DickJC123
In a private communication, you indicated this was seen on all platforms. Here you tag it as 'Windows'. Please clarify.
Sorry Dick, I just double-checked my database and it seems to only happen on Windows. It looks like I mixed something up; please excuse me for that.
Config: Windows Server 2016, G3.8xlarge, CUDA8, unknown driver version
Another Jenkins log for flaky failure of test_operator_gpu.test_op_roi_align - http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-10889/runs/17/nodes/751/log/?start=0
This seems to be caused by test_conv:
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/master/runs/1050/nodes/752/steps/1370/log/?start=0
test_operator_gpu.test_ndarray_crop ... ok
test_operator_gpu.test_cell_fill_shape ... ok
test_operator_gpu.test_conv ... [00:06:32] c:\jenkins_slave\workspace\build-gpu\src\operator\nn\cudnn\./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:148: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:227: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:148: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:227: This convolution is not supported by cudnn, MXNET convolution is applied.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=448545343 to reproduce.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1330733257 to reproduce.
ERROR
test_operator_gpu.test_layer_fill_shape ... ERROR
test_operator_gpu.test_normal_generator ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1592512793 to reproduce.
ERROR
test_operator_gpu.test_ndarray_concatenate ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=466881030 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_transpose ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=221598447 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_storage_fallback ... [00:06:32] c:\jenkins_slave\workspace\build-gpu\src\operator\../common/utils.h:417:
Storage type fallback detected:
operator = broadcast_add
input storage types = [default, default, ]
output storage types = [csr, ]
params = {}
context.dev_mask = gpu
The operator with default storage type will be dispatched for execution. You're seeing this warning message because the operator above is unable to process the given ndarrays with specified storage types, context and parameter. Temporary dense ndarrays are generated in order to execute the operator. This does not affect the correctness of the programme. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1026960206 to reproduce.
ERROR
test_operator_gpu.test_clip ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1647052128 to reproduce.
ERROR
test_operator_gpu.test_convolution_with_type ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1234 to reproduce.
ERROR
I have a lead on the problem. There is an out-of-bound read performed by the SequenceLastKernel. I'll stop here and let the person responsible for this kernel correct the problem. Kernels that read beyond their valid input tensor regions can be problematic, even if the randomly read data is never used in a subsequent kernel write. The problem surfaces when the reads fall outside of valid mapped address ranges, which results in an unserviceable TLB miss. The failures can be non-deterministic since the input tensors may have non-deterministic placement within their mapped pages.
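To illustrate the class of bug being described (a purely hypothetical sketch, not the actual SequenceLastKernel: the kernel name, tensor layout and parameters below are made up), a "take the last valid element of each sequence" kernel needs both a thread guard and an index clamp; dropping either one lets some threads load addresses past the end of the input buffer, which is exactly the failure mode cuda-memcheck reports further down.

// Hypothetical sketch of a bounds-guarded "sequence last" kernel (not MXNet's
// actual SequenceLastKernel). Compile with: nvcc -o seq_last seq_last.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sequence_last_guarded(const float* in, float* out,
                                      const int* lengths,
                                      int batch, int max_len, int feat) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per (batch, feature)
  if (i >= batch * feat) return;                  // guard: surplus threads do nothing
  int b = i / feat;
  int f = i % feat;
  // Clamp the sequence position to [0, max_len - 1] so the load below can
  // never leave the (max_len, batch, feat) input tensor.
  int t = min(max(lengths[b] - 1, 0), max_len - 1);
  out[b * feat + f] = in[(t * batch + b) * feat + f];
}

int main() {
  const int batch = 2, max_len = 3, feat = 4;
  float *in, *out;
  int *len;
  cudaMallocManaged(&in, max_len * batch * feat * sizeof(float));
  cudaMallocManaged(&out, batch * feat * sizeof(float));
  cudaMallocManaged(&len, batch * sizeof(int));
  for (int i = 0; i < max_len * batch * feat; ++i) in[i] = float(i);
  len[0] = 3;  // batch 0 uses the full sequence
  len[1] = 1;  // batch 1 has only one valid step
  int threads = 128, blocks = (batch * feat + threads - 1) / threads;
  sequence_last_guarded<<<blocks, threads>>>(in, out, len, batch, max_len, feat);
  cudaDeviceSynchronize();
  for (int i = 0; i < batch * feat; ++i) printf("%g ", out[i]);
  printf("\n");
  return 0;
}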
I debugged the problem by going to the first test that showed the failure in one of the above posts, captured the MXNET_TEST_SEED, and then reproduced the error (on Linux no less) with the following command:
MXNET_TEST_SEED=731510245 cuda-memcheck nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last | c++filt
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1613755850 to reproduce.
[WARNING] *** test-level seed set: all "@with_seed()" tests run deterministically ***
test_operator_gpu.test_sequence_last ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731510245 to reproduce.
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
========= at 0x00000390 in void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>)
========= by thread (2,0,0) in block (0,0,0)
========= Address 0x7f13f24003f8 is out of bounds
========= Device Frame:void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>) (void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>) : 0x390)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24cc4d]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 [0x15680]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaLaunch + 0x14e) [0x33c9e]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::op::SequenceLastOp<mshadow::gpu, float>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) + 0xc3a) [0x53384da]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) + 0x363) [0x3214a53]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::exec::StatefulComputeExecutor::Run(mxnet::RunContext, bool) + 0x59) [0x3808f09]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so [0x37d5870]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) + 0x8e5) [0x372ea15]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) + 0xeb) [0x3745b1b]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) + 0x4e) [0x3745d8e]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() + 0x4a) [0x372e01a]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
========= Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaStreamSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3496d3]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaStreamSynchronize + 0x176) [0x47336]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mshadow::Stream<mshadow::gpu>::Wait() + 0x26) [0x32635a6]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so [0x37d5945]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) + 0x8e5) [0x372ea15]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) + 0xeb) [0x3745b1b]
ERROR
======================================================================
ERROR: test_operator_gpu.test_sequence_last
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/usr/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
return func(*arg, **kw)
File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/common.py", line 157, in test_new
orig_test(*args, **kwargs)
File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/test_operator.py", line 2998, in test_sequence_last
check_sequence_func("last", axis=0)
File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/test_operator.py", line 2989, in check_sequence_func
numeric_eps=1e-2, rtol=1e-2)
File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/test_utils.py", line 906, in check_numeric_gradient
eps=numeric_eps, use_forward_train=use_forward_train, dtype=dtype)
File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/test_utils.py", line 781, in numeric_grad
f_neps = executor.outputs[0].asnumpy()
File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/ndarray/ndarray.py", line 1894, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/base.py", line 210, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [11:31:45] /home/dcarter/mxnet_dev/dgx/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure
Stack trace returned 10 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f14f0b6619b]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f14f0b66d08]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0xd8) [0x7f14f31bc658]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(+0x37d5945) [0x7f14f372e945]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f14f3687a15]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f14f369eb1b]
[bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f14f369ed8e]
[bt] (7) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f14f368701a]
[bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f151efb1c80]
[bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f15269af6ba]
-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1613755850 to reproduce.
common: WARNING: *** test-level seed set: all "@with_seed()" tests run deterministically ***
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731510245 to reproduce.
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Ran 1 test in 22.915s
FAILED (errors=1)
terminate called after throwing an instance of 'dmlc::Error'
what(): [11:31:45] src/storage/./pooled_storage_manager.h:77: CUDA: unspecified launch failure
Stack trace returned 10 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f14f0b6619b]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f14f0b66d08]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle)+0x8f) [0x7f14f36aa8cf]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::ReleaseAll()+0x95) [0x7f14f36a2ef5]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::~GPUPooledStorageManager()+0x1a) [0x7f14f36aa9ca]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Sp_counted_ptr<mxnet::StorageImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0xa23) [0x7f14f36a9bd3]
[bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::shared_ptr<mxnet::Storage>::~shared_ptr()+0x52) [0x7f14f36aa822]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(+0x39ff8) [0x7f1526617ff8]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(+0x3a045) [0x7f1526618045]
[bt] (9) /usr/bin/python() [0x51dc1f]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) + 0x4e) [0x3745d8e]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() + 0x4a) [0x372e01a]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
========= Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaFree.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3496d3]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaFree + 0x1a0) [0x419b0]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle) + 0x32) [0x3751872]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::ReleaseAll() + 0x95) [0x3749ef5]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::~GPUPooledStorageManager() + 0x1a) [0x37519ca]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Sp_counted_ptr<mxnet::StorageImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xa23) [0x3750bd3]
========= Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::shared_ptr<mxnet::Storage>::~shared_ptr() + 0x52) [0x3751822]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x39ff8]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x3a045]
========= Host Frame:/usr/bin/python [0x11dc1f]
========= Host Frame:/usr/bin/python [0x11b1b7]
========= Host Frame:/usr/bin/python (PyErr_PrintEx + 0x2d) [0x11aadd]
========= Host Frame:/usr/bin/python [0x309d5]
========= Host Frame:/usr/bin/python (Py_Main + 0x612) [0x93ae2]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
========= Host Frame:/usr/bin/python (_start + 0x29) [0x933e9]
=========
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found
I'm still blocked by this. Was it really disabled?
I don't think anyone disabled the test. Opening a PR to do so here: https://github.com/apache/incubator-mxnet/pull/11485
@DickJC123 Thank you for investigating this issue! I am not able to reproduce the test failure you pointed out at commit b786ead562590300519a3f0725dafe7d40325edd with the same test seed 731510245. Are you able to reproduce the issue with the latest master?
(I'll answer on behalf of Dick because he told me that he will be quite busy around this time.)
Hey Anirudh, could you elaborate on how you tried to reproduce this issue? Please be aware that this is about memory corruption caused by out-of-bounds kernel accesses; this behaviour might not always be reproducible locally. Did you use cuda-memcheck to investigate?
Yes, I was not able to reproduce similar errors with cuda-memcheck either.
Tried the following:
MXNET_TEST_SEED=731510245 nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last
Also did:
MXNET_TEST_SEED=731510245 cuda-memcheck nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last
I couldn't reproduce the Invalid __global__ read of size 4 with cuda-memcheck.
How often did you run?
I ran it repeatedly, around 2k times.
Is this related? CUDA: unspecified launch failure - http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17444/2/pipeline/
Another Windows GPU CUDA launch error: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17444/4/pipeline #17444
Unspecified launch failure here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17435/7/pipeline
What's the workaround? I need to get a PR through that has nothing to do with CUDA...
I haven't found one. I kept retriggering and finally gave up.
Unspecified Launch Failure:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17435/7/pipeline/#step-169-log-1210
Again here... and in another run I just restarted:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17594/1/pipeline
