Incubator-mxnet: NDArray.asscalar(): CUDA: an illegal memory access was encountered

Created on 27 Jul 2018  ·  10 Comments  ·  Source: apache/incubator-mxnet

Description

I've found a flaky crash related to RNNs and CUDA, but I'm not sure exactly what is causing it. I noticed it while working with a larger encoder-decoder network, but I've reduced it to a single script of less than 100 lines of code, including a data generator.

Environment info (Required)

I'm using python3.6 on 2 different environments, windows 10 and ubuntu:

----------Python Info----------
Version      : 3.6.2
Compiler     : MSC v.1900 64 bit (AMD64)
Build        : ('v3.6.2:5fd33b5', 'Jul  8 2017 04:57:36')
Arch         : ('64bit', 'WindowsPE')
------------Pip Info-----------
Version      : 9.0.1
Directory    : E:\SL\Software\Python36\lib\site-packages\pip
----------MXNet Info-----------
Version      : 1.2.0
Directory    : E:\SL\Software\Python36\lib\site-packages\mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Windows-10-10.0.17134-SP0
system       : Windows
node         : STEVE-PC
release      : 10
version      : 10.0.17134
----------Hardware Info----------
machine      : AMD64
processor    : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
Name
Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz

----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0450 sec, LOAD: 1.8009 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0405 sec, LOAD: 0.1900 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0436 sec, LOAD: 0.1333 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0410 sec, LOAD: 0.1675 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0326 sec, LOAD: 0.6613 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0283 sec, LOAD: 0.1286 sec.
----------Python Info----------
Version      : 3.6.6
Compiler     : GCC 5.4.0 20160609
Build        : ('default', 'Jun 28 2018 04:42:43')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 18.0
Directory    : /usr/local/lib/python3.6/dist-packages/pip
----------MXNet Info-----------
Version      : 1.2.1
Directory    : /usr/local/lib/python3.6/dist-packages/mxnet
Commit Hash   : 106391a1f0ee012b1ea38764d711e76774ce77e1
----------System Info----------
Platform     : Linux-4.4.0-124-generic-x86_64-with-Ubuntu-16.04-xenial
system       : Linux
node         : steve-pc
release      : 4.4.0-124-generic
version      : #148-Ubuntu SMP Wed May 2 13:00:18 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD Ryzen 7 1700 Eight-Core Processor
Stepping:              1
CPU MHz:               1550.000
CPU max MHz:           3000.0000
CPU min MHz:           1550.0000
BogoMIPS:              5999.21
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx hw_pstate retpoline retpoline_amd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero ibpb arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0114 sec, LOAD: 0.7577 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0394 sec, LOAD: 0.1501 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1848 sec, LOAD: 0.1442 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0323 sec, LOAD: 0.1578 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0224 sec, LOAD: 0.5354 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0184 sec, LOAD: 0.0980 sec.



Error message (Windows):



Traceback (most recent call last):
  File "asymmetric_encoder_decoder_test.py", line 84, in <module>
    avg_loss = mx.nd.sum(loss).asscalar() / args.batch_size
  File "C:\Python36\lib\site-packages\mxnet\ndarray\ndarray.py", line 1894, in asscalar
    return self.asnumpy()[0]
  File "C:\Python36\lib\site-packages\mxnet\ndarray\ndarray.py", line 1876, in asnumpy
    ctypes.c_size_t(data.size)))
  File "C:\Python36\lib\site-packages\mxnet\base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [22:07:27] c:\jenkins\workspace\mxnet-tag\mxnet\src\storage\pooled_storage_manager.h:77: CUDA: an illegal memory access was encountered



Error message (Ubuntu):



Traceback (most recent call last):
  File "asymmetric_encoder_decoder_test.py", line 84, in <module>
    avg_loss = mx.nd.sum(loss).asscalar() / args.batch_size
  File "/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py", line 1894, in asscalar
    return self.asnumpy()[0]
  File "/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py", line 1876, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [22:41:02] src/storage/./pooled_storage_manager.h:108: cudaMalloc failed: an illegal memory access was encountered

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f9912) [0x7fa9237c5912]
[bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f9ee8) [0x7fa9237c5ee8]
[bt] (2) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f4b6cf) [0x7fa9264176cf]
[bt] (3) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f4fcdc) [0x7fa92641bcdc]
[bt] (4) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2af5369) [0x7fa925fc1369]
[bt] (5) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2b0a5c8) [0x7fa925fd65c8]
[bt] (6) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2b0a6eb) [0x7fa925fd66eb]
[bt] (7) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2942714) [0x7fa925e0e714]
[bt] (8) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x294897b) [0x7fa925e1497b]
[bt] (9) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2948b9e) [0x7fa925e14b9e]
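
Note: MXNet queues GPU operators asynchronously, so the illegal access from an earlier kernel is only reported at the next synchronization point, which is why both tracebacks point at asscalar()/asnumpy() rather than the operator that actually failed. A minimal sketch of two ways to force earlier synchronization while debugging (the array and context below are placeholders, not variables from the script):

import os

# Setting this before importing mxnet disables the asynchronous engine, so a CUDA
# error is raised by the operator that triggered it instead of at a later sync point.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

import mxnet as mx

ctx = mx.gpu(0)
x = mx.nd.ones((4, 4), ctx=ctx)   # placeholder work queued on the GPU
y = mx.nd.sum(x)
print(y.asscalar())               # asscalar() -> asnumpy() blocks until the GPU work finishes,
                                  # which is why earlier kernel failures get reported here
mx.nd.waitall()                   # explicit barrier; calling this after every training step
                                  # helps narrow down which step triggers the illegal access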

Minimum reproducible example

I'm pretty sure this could be reduced further, but given how sensitive the crash is to the exact conditions, I don't want to change it too much: https://gist.githubusercontent.com/ssttevee/1476b2b680df4eb70e02ed04cb643e96

Steps to reproduce

  1. wget https://gist.githubusercontent.com/ssttevee/1476b2b680df4eb70e02ed04cb643e96/raw/c1ac08912d995f8c25f453266d2f457139db95b1/asymmetric_encoder_decoder_test.py
  2. python3.6 asymmetric_encoder_decoder_test.py --data_length=10000 --batch_size=6

You might have to fiddle with the batch size to make it crash. On my Windows machine, it crashes with --data_length=10000 --batch_size=6. On my Ubuntu machine, it crashes with --data_length=1000 --batch_size=64. Those parameters are not the only ones that crash the program; changing the batch size by ±2 or so also causes it to crash.

EDIT: Both systems also crash with --data_length=1000000 --batch_size=1

What have you tried to solve it?

  1. Reinstalled various versions of the NVIDIA drivers and CUDA toolkit
  2. Ran on two different environments
Labels: Backend, Bug, CUDA, Pending Requester Info, Python

All 10 comments

Since it correlates directly with data length, it looks like the GPU ran out of memory?

Sorry, I meant to say that it doesn't crash with a higher data length like --data_length=1000000 --batch_size=2. The crashes seem to happen at arbitrary data length and batch size values.

It wouldn't make much sense for it to be a memory limit anyway, since none of the parameters comes anywhere close to 1 GB of memory, whereas both of my GPUs have well over 1 GB of memory.
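
For reference, one rough way to rule out memory pressure is to log free GPU memory right before the failing asscalar() call. A sketch, with the caveat that mx.context.gpu_memory_info only exists in newer MXNet releases, so this falls back to nvidia-smi on older versions such as 1.2.x:

import subprocess
import mxnet as mx

def log_gpu_memory(device_id=0):
    # Print free/total GPU memory; call this right before the asscalar() line.
    try:
        # Newer MXNet releases expose this helper; it returns (free_bytes, total_bytes).
        free, total = mx.context.gpu_memory_info(device_id)
        print("GPU %d: %.1f MiB free of %.1f MiB" % (device_id, free / 2**20, total / 2**20))
    except AttributeError:
        # Older MXNet (e.g. 1.2.x) lacks gpu_memory_info, so shell out to nvidia-smi instead.
        out = subprocess.check_output(
            ["nvidia-smi", "-i", str(device_id),
             "--query-gpu=memory.used,memory.total", "--format=csv,noheader"])
        print("GPU %d: %s" % (device_id, out.decode().strip()))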

I also encountered this problem. Have you solved it?

@mxnet-label-bot update [Bug, Python, Backend, CUDA]

Also running into this. Very flaky

@ssttevee are you still facing this issue?
@altosaar @lengxia Can you please share your reproduction scripts as well? Thanks!

No, I've since used a different approach in which this does not occur.

@ssttevee could you paste the code snippet of the alternate approach that worked for you?

Hi @ssttevee, I tried to reproduce the issue with the given script on a p2.8xlarge instance. It ran successfully for data lengths of 10000 and 100000.
Generally this error occurs when the GPU runs out of memory or when there is a bug in the data iterator that does not split indices properly across GPUs. If you fixed it, can you share the code snippet that worked for you so that it will be helpful for others?
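
As a side note on the multi-GPU case: gluon.utils.split_and_load is the usual way to shard a batch across devices, and with even_split=True it fails with a clear Python error when a batch cannot be divided evenly, instead of letting a mis-sized slice reach a GPU kernel. A small illustration (the two-GPU context list and batch are just for the example, not from the reproduction script):

import mxnet as mx
from mxnet import gluon

ctx_list = [mx.gpu(0), mx.gpu(1)]          # illustrative; the reproduction script uses one GPU
batch = mx.nd.arange(64).reshape((8, 8))   # stand-in for a real data batch

# Raises a clear error if the batch size is not divisible by the number of contexts.
shards = gluon.utils.split_and_load(batch, ctx_list, batch_axis=0, even_split=True)
for shard in shards:
    print(shard.context, shard.shape)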

I tried to reproduce this issue on a p2.8xlarge instance and wasn't able to. I will close this issue for now. Please reopen if it was closed in error or if the issue still persists.
@altosaar @lengxia It would be great if you could share your reproduction scripts so we can debug this further.
