Cntk: "CUDA failure 77: an illegal memory access was encountered" over a simple dataset

Created on 23 Nov 2017 · 6 Comments · Source: microsoft/CNTK

We are frequently facing CUDA failures with CNTK.

CUDA failure 77: an illegal memory access was encountered ; GPU=0 ; hostname=HERMES ; expr=cudaMemcpy(&value, SecondaryIndexLocation() + idx, sizeof(GPUSPARSE_INDEX_TYPE), cudaMemcpyDeviceToHost)
cudaStreamDestroy failed (PrefetchGPUDataTransferer dtor): an illegal memory access was encountered (cuda error 77)
cudaStreamDestroy failed (PrefetchGPUDataTransferer dtor): an illegal memory access was encountered (cuda error 77)

To make the problem easy to reproduce, we have put together a BrainScript file and a small binary dataset, both attached to this ticket.

cuda_stream_error.zip

Here is the full output:

C:\foo> cntk configFile=cuda_stream_error.cntk
CNTK 2.2 (HEAD 23878e, Sep 15 2017 07:49:58) at 2017/11/23 17:07:47

C:\local\CNTK-2-2\cntk\CNTK.exe  configFile=cuda_stream_error.cntk
-------------------------------------------------------------------
Build info:

                Built time: Sep 15 2017 07:42:54
                Last modified date: Thu Sep 14 22:33:54 2017
                Build type: Release
                Build target: GPU
                With 1bit-SGD: no
                With ASGD: yes
                Math lib: mkl
                CUDA version: 9.0.10
                CUDNN version: 6.0.21
                Build Branch: HEAD
                Build SHA1: 23878e5d1f73180d6564b6f907b14fe5f53513bb
                MPI distribution: Microsoft MPI
                MPI version: 7.0.12437.6
-------------------------------------------------------------------
-------------------------------------------------------------------
GPU info:

                Device[0]: cores = 960; computeCapability = 5.0; type = "GeForce GTX 960M"; total memory = 2048 MB; free memory = 2019 MB
-------------------------------------------------------------------

##############################################################################
#                                                                            #
# Train command (train action)                                               #
#                                                                            #
##############################################################################

Node 'T1E.arrayOfFunctions[0].W' (LearnableParameter operation) operation: Tensor shape was inferred as [32 x 20].
Node 'T2E.arrayOfFunctions[0].W' (LearnableParameter operation) operation: Tensor shape was inferred as [32 x 20].
Node 'T3E.arrayOfFunctions[0].W' (LearnableParameter operation) operation: Tensor shape was inferred as [16 x 52].
Node 'T4E.arrayOfFunctions[0].W' (LearnableParameter operation) operation: Tensor shape was inferred as [16 x 7].
Node 'L.arrayOfFunctions[0].W' (LearnableParameter operation) operation: Tensor shape was inferred as [256 x 96].
Node 'Para.PlusArgs[0].arrayOfFunctions[0].W' (LearnableParameter operation) operation: Tensor shape was inferred as [1 x 256].

Model has 47 nodes. Using GPU 0.

Training criterion:   lr = ReduceElements

Training 27409 parameters in 12 parameter tensors.

CUDA failure 77: an illegal memory access was encountered ; GPU=0 ; hostname=LOKAD ; expr=cudaMemcpy(&value, SecondaryIndexLocation() + idx, sizeof(GPUSPARSE_INDEX_TYPE), cudaMemcpyDeviceToHost)
cudaStreamDestroy failed (PrefetchGPUDataTransferer dtor): an illegal memory access was encountered (cuda error 77)
cudaStreamDestroy failed (PrefetchGPUDataTransferer dtor): an illegal memory access was encountered (cuda error 77)
terminate_this: aborting.

This is a blocking problem for us. Any help would be highly appreciated.

All 6 comments

You need to set the environment variable CUDA_LAUNCH_BLOCKING=1 to get the precise CUDA error location. Here's the call stack with that:

Cntk.Math-2.2.dll!Microsoft::MSR::CNTK::RuntimeError<char const * __ptr64 & __ptr64,int,char const * __ptr64,int & __ptr64,char const * __ptr64,char const * __ptr64 & __ptr64,char const * __ptr64 & __ptr64>(const char * format, const char * & <_Args_0>, int && <_Args_1>, const char * && <_Args_2>, int & <_Args_3>, const char * && <_Args_4>, const char * & <_Args_5>, const char * & <_Args_6>) Line 114 C++
Cntk.Math-2.2.dll!CudaCall<enum cublasStatus_t>(cublasStatus_t retCode, const char * exprString, const char * libName, cublasStatus_t successCode, const char * msg) Line 701   C++

Cntk.Math-2.2.dll!Microsoft::MSR::CNTK::GPUMatrix::TensorOp(float beta, const Microsoft::MSR::CNTK::GPUMatrix & a, float alpha, Microsoft::MSR::CNTK::ElementWiseOperator op, Microsoft::MSR::CNTK::ElementWiseOperator reductionOp, const std::array & offsets, const Microsoft::MSR::CNTK::SmallVector & regularOpDims, const std::array,2> & regularStrides, const Microsoft::MSR::CNTK::SmallVector & reducingOpDims, const std::array,2> & reducingStrides) Line 4935 C++
Cntk.Math-2.2.dll!Microsoft::MSR::CNTK::Matrix::TensorOp(float beta, const Microsoft::MSR::CNTK::Matrix & a, float alpha, Microsoft::MSR::CNTK::ElementWiseOperator op, Microsoft::MSR::CNTK::ElementWiseOperator reductionOp, const std::array & offsets, const Microsoft::MSR::CNTK::SmallVector & regularOpDims, const std::array,2> & regularStrides, const Microsoft::MSR::CNTK::SmallVector & reducingOpDims, const std::array,2> & reducingStrides) Line 6166 C++
Cntk.Math-2.2.dll!Microsoft::MSR::CNTK::TensorView::DoUnaryOpOf(float beta, const Microsoft::MSR::CNTK::TensorView & a, float alpha, Microsoft::MSR::CNTK::ElementWiseOperator op, Microsoft::MSR::CNTK::ElementWiseOperator reductionOp) Line 250 C++
Cntk.Math-2.2.dll!Microsoft::MSR::CNTK::TensorView::AssignCopyOf(const Microsoft::MSR::CNTK::TensorView & a, float alpha) Line 94 C++
CNTK.exe!Microsoft::MSR::CNTK::PlusNode::BackpropTo(const unsigned __int64 inputIndex, const Microsoft::MSR::CNTK::FrameRange & fr) Line 74 C++

The code path seems to be in the gradient optimization in PlusNode's BackpropTo, when automatically reducing an input of 32x1000 to 32x1. I tried disabling gradient optimization by setting optimizeGradientAccumulation=false, and the problem seems to have gone away. I'll dig a bit more into this.
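
For anyone wanting to repeat the CUDA_LAUNCH_BLOCKING=1 step from a script, here is a minimal Python sketch. It is only an illustration: the helper names `blocking_cuda_env` and `run_cntk` are made up for this example, and it assumes the `cntk` executable is on the PATH and is launched the same way as in the logs above (`cntk configFile=...`).

```python
import os
import subprocess


def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set.

    With this variable set, CUDA kernel launches run synchronously, so an
    error is reported at the call that caused it rather than at a later one.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_cntk(config_file, executable="cntk"):
    """Launch CNTK on config_file with blocking launches; return the exit code.

    Hypothetical helper: 'executable' is assumed to be resolvable on the PATH.
    """
    command = [executable, "configFile=" + config_file]
    return subprocess.run(command, env=blocking_cuda_env()).returncode
```

Calling `run_cntk("cuda_stream_error.cntk")` would then rerun the repro with synchronous launches. Only the child process sees the variable, so other GPU jobs on the machine are not slowed down.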

Thanks a lot @KeDengMS, your follow-up is much appreciated. My team confirms that the option optimizeGradientAccumulation=false seems to be a valid workaround. However, this option increases computation time by 2x on our setups. I am really looking forward to a fix for this one; it would really help.

The option optimizeGradientAccumulation=false does not actually solve all the problems. We are facing crashes again. Attached are a small BrainScript script and a binary file to reproduce the failure.

cuda_stream_error_v2.zip

When using CPU, we observe the failure:

C:\foo> cntk configFile=cuda_stream_error.cntk
CNTK 2.2 (HEAD 23878e, Sep 15 2017 07:49:58) at 2017/11/29 14:32:55

C:\local\CNTK-2-2\cntk\CNTK.exe  configFile=cuda_stream_error.cntk
-------------------------------------------------------------------
Build info:

                Built time: Sep 15 2017 07:42:54
                Last modified date: Thu Sep 14 22:33:54 2017
                Build type: Release
                Build target: GPU
                With 1bit-SGD: no
                With ASGD: yes
                Math lib: mkl
                CUDA version: 9.0.10
                CUDNN version: 6.0.21
                Build Branch: HEAD
                Build SHA1: 23878e5d1f73180d6564b6f907b14fe5f53513bb
                MPI distribution: Microsoft MPI
                MPI version: 7.0.12437.6
-------------------------------------------------------------------
No GPUs found

##############################################################################
#                                                                            #
# Train command (train action)                                               #
#                                                                            #
##############################################################################


Starting from checkpoint. Loading network from 'model.dnn.0'.

Model has 54 nodes. Using CPU.

Training criterion:   lr = ReduceElements

Training 237873 parameters in 16 parameter tensors.

CNTK: Caught Win32 exception 0xc0000005: Access violation.

Then, with GPU, the error message is:

C:\foo> cntk configFile=cuda_stream_error.cntk
CNTK 2.2 (HEAD 23878e, Sep 15 2017 07:49:58) at 2017/11/29 14:33:20

C:\local\CNTK-2-2\cntk\CNTK.exe  configFile=cuda_stream_error.cntk
-------------------------------------------------------------------
Build info:

                Built time: Sep 15 2017 07:42:54
                Last modified date: Thu Sep 14 22:33:54 2017
                Build type: Release
                Build target: GPU
                With 1bit-SGD: no
                With ASGD: yes
                Math lib: mkl
                CUDA version: 9.0.10
                CUDNN version: 6.0.21
                Build Branch: HEAD
                Build SHA1: 23878e5d1f73180d6564b6f907b14fe5f53513bb
                MPI distribution: Microsoft MPI
                MPI version: 7.0.12437.6
-------------------------------------------------------------------
-------------------------------------------------------------------
GPU info:

                Device[0]: cores = 960; computeCapability = 5.0; type = "GeForce GTX 960M"; total memory = 2048 MB; free memory = 2019 MB
-------------------------------------------------------------------

##############################################################################
#                                                                            #
# Train command (train action)                                               #
#                                                                            #
##############################################################################


Starting from checkpoint. Loading network from 'model.dnn.0'.

Model has 54 nodes. Using GPU 0.

Training criterion:   lr = ReduceElements

Training 237873 parameters in 16 parameter tensors.

CUDA failure 77: an illegal memory access was encountered ; GPU=0 ; hostname=LOKAD ; expr=cudaMemcpy(&value, SecondaryIndexLocation() + idx, sizeof(GPUSPARSE_INDEX_TYPE), cudaMemcpyDeviceToHost)
cudaStreamDestroy failed (PrefetchGPUDataTransferer dtor): an illegal memory access was encountered (cuda error 77)
cudaStreamDestroy failed (PrefetchGPUDataTransferer dtor): an illegal memory access was encountered (cuda error 77)
terminate_this: aborting.

Any help would be highly appreciated. Thanks!

We have gathered 4 more cases that crash in various situations.

cuda_stream_error_v3.zip

We have situations where the GPU works but the CPU crashes. We haven't yet included the scenarios where the CPU works and the GPU crashes.

I debugged your CPU crash with a debug build of CNTK.exe, using the repro in error1.cntk. It's caused by a dense (16x52) by sparse (52x1000) matrix multiply, with an invalid row index of 52 (the valid range is 0..51) in the sparse matrix. This caused an out-of-bounds memory access in the dense matrix; since it happens in a tight inner loop, there is no validation before the access.

Please double check your sparse matrix indices and see if that would fix the crash for all repros.
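
As an illustration of the kind of check suggested here, below is a minimal Python sketch that scans the row indices of a sparse matrix before the data is handed to CNTK. It is not CNTK code: the function names are hypothetical, and it assumes the row indices are already available as a plain list of integers, however your data pipeline stores them.

```python
def find_invalid_row_indices(row_indices, num_rows):
    """Return (position, index) pairs for every row index outside 0..num_rows-1."""
    return [
        (position, index)
        for position, index in enumerate(row_indices)
        if not 0 <= index < num_rows
    ]


def check_sparse_rows(row_indices, num_rows):
    """Raise ValueError if any row index falls outside the matrix.

    Hypothetical sanitization helper: it reports the first offending entry
    and the total count, so a bad dataset fails early with a readable message.
    """
    invalid = find_invalid_row_indices(row_indices, num_rows)
    if invalid:
        position, index = invalid[0]
        raise ValueError(
            "row index %d at position %d is outside the valid range 0..%d "
            "(%d invalid in total)" % (index, position, num_rows - 1, len(invalid))
        )
```

For the 52x1000 sparse matrix in this repro, `check_sparse_rows(indices, 52)` would reject an index of 52 while accepting 0 through 51. Doing the check once when the data is written keeps it out of the training loop.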

@KeDengMS You're the best! Yes, it did fix all our issues. It was on our side. We are going to add sanitization layers on our side to avoid this class of problems.
