Currently, the CI tests fail when running MXNet on top of CUDA 10 and cuDNN 7.5, as demonstrated in this PR.
The tests pass when using CUDA 10 and cuDNN 7.3.1.20, as demonstrated in this PR.
The environment is a g3.8xlarge instance with CUDA 10 and NVIDIA driver 410.73 installed.
The code is running inside the CI GPU container based on nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04.
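To confirm which cuDNN release is actually present in that container, a quick check such as the following can be used (a minimal sketch; it assumes cuDNN was installed from the libcudnn7 deb packages, as in the stock nvidia/cuda cudnn7-devel images):
# Check the cuDNN version inside the container
dpkg -l | grep libcudnn
# or read the version macros straight from the header
grep -A 2 "define CUDNN_MAJOR" /usr/include/cudnn.h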
The error is usually: src/operator/./cudnn_rnn-inl.h:759: Check failed: e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH
Here are the steps to reproduce, with example logs inline:
# Launch g3.8xlarge instance with ubuntu 16.04
# ==-_-==-_-== Environment Setup ==-_-==-_-==
sudo apt update
sudo apt-get install -y \
apt-transport-https \
build-essential \
ca-certificates \
curl \
git \
libatlas-base-dev \
libcurl4-openssl-dev \
libjemalloc-dev \
libhdf5-dev \
liblapack-dev \
libopenblas-dev \
libopencv-dev \
libturbojpeg \
libzmq3-dev \
ninja-build \
software-properties-common \
sudo \
unzip \
wget
sudo apt-get install -y python-dev python3-dev virtualenv wget
# the version of pip shipped with Ubuntu may be too old, so install a recent version here
wget -nv https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
sudo python2 get-pip.py
pip2 install --user nose cpplint==1.3.0 pylint==1.9.3 'numpy<=1.15.2,>=1.8.2' nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3
pip3 install --user nose cpplint==1.3.0 pylint==2.1.1 'numpy<=1.15.2,>=1.8.2' nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3
# ==-_-==-_-== CUDA Installation ==-_-==-_-==
wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
chmod +x cuda_10.0.130_410.48_linux && sudo ./cuda_10.0.130_410.48_linux
# Installation excerpt:
# Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
# (y)es/(n)o/(q)uit: y
#
# Do you want to install the OpenGL libraries?
# (y)es/(n)o/(q)uit [ default is yes ]:
#
# Do you want to run nvidia-xconfig?
# This will update the system X configuration file so that the NVIDIA X driver
# is used. The pre-existing X configuration file will be backed up.
# This option should not be used on systems that require a custom
# X configuration, such as systems with multiple GPU vendors.
# (y)es/(n)o/(q)uit [ default is no ]:
#
# Install the CUDA 10.0 Toolkit?
# (y)es/(n)o/(q)uit: y
#
# Enter Toolkit Location
# [ default is /usr/local/cuda-10.0 ]:
#
# Do you want to install a symbolic link at /usr/local/cuda?
# (y)es/(n)o/(q)uit: y
#
# Install the CUDA 10.0 Samples?
# (y)es/(n)o/(q)uit: n
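# Optional: the runfile can also be run non-interactively instead of
# answering the prompts above (a sketch assuming the standard runfile
# installer flags; not re-verified against this exact 10.0 build):
# sudo ./cuda_10.0.130_410.48_linux --silent --driver --toolkit --no-opengl-libs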
# Set LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
# Check installation
nvidia-smi
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 410.48 Driver Version: 410.48 |
# |-------------------------------+----------------------+----------------------+
# | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
# | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
# |===============================+======================+======================|
# | 0 Tesla M60 Off | 00000000:00:1D.0 Off | 0 |
# | N/A 31C P0 43W / 150W | 0MiB / 7618MiB | 0% Default |
# +-------------------------------+----------------------+----------------------+
# | 1 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
# | N/A 34C P0 41W / 150W | 0MiB / 7618MiB | 99% Default |
# +-------------------------------+----------------------+----------------------+
#
# +-----------------------------------------------------------------------------+
# | Processes: GPU Memory |
# | GPU PID Type Process name Usage |
# |=============================================================================|
# | No running processes found |
# +-----------------------------------------------------------------------------+
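# Optionally, also put the toolkit binaries on PATH and confirm that nvcc
# reports the expected release (a small extra sanity check)
export PATH=/usr/local/cuda/bin:${PATH}
nvcc --version
# should report release 10.0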
# ==-_-==-_-== Setup cuDNN ==-_-==-_-==
# https://developer.nvidia.com/rdp/cudnn-download
# Register with NVIDIA and download cudnn-10.0-linux-x64-v7.5.0.56.tgz
# scp it to your instance
# https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html
tar -xvzf cudnn-10.0-linux-x64-v7.5.0.56.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
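# Quick sanity check (sketch): refresh the linker cache and confirm the
# version macros in the copied header (for cuDNN 7.x they live in cudnn.h)
sudo ldconfig
grep -A 2 "define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h
# Expected for this setup: CUDNN_MAJOR 7, CUDNN_MINOR 5, CUDNN_PATCHLEVEL 0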
# ==-_-==-_-== Clone MXNet Repo. ==-_-==-_-==
mkdir -p repositories/apache && cd repositories/apache
git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet
# ==-_-==-_-== Compile MXNet ==-_-==-_-==
make \
DEV=1 \
ENABLE_TESTCOVERAGE=1 \
USE_BLAS=openblas \
USE_MKLDNN=0 \
USE_CUDA=1 \
USE_CUDA_PATH=/usr/local/cuda \
USE_CUDNN=1 \
USE_CPP_PACKAGE=0 \
USE_DIST_KVSTORE=1 \
USE_SIGNAL_HANDLER=1 \
-j$(nproc)
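# Sanity check (sketch): confirm the freshly built library resolves to the
# cuDNN copied under /usr/local/cuda
ldd lib/libmxnet.so | grep -i cudnn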
# ==-_-==-_-== Run failing test ==-_-==-_-==
export PYTHONPATH=./python/
nosetests-3.4 --verbose tests/python/gpu/test_gluon_gpu.py:test_rnn_layers_fp16
# Error excerpt:
# ======================================================================
# ERROR: test_gluon_gpu.test_rnn_layers_fp16
# ----------------------------------------------------------------------
# Traceback (most recent call last):
# File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
# self.test(*self.arg)
# File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
# return func(*arg, **kw)
# File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/common.py", line 110, in test_new
# orig_test(*args, **kwargs)
# File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 545, in test_rnn_layers_fp16
# run_rnn_layers('float16', 'float32', mx.gpu())
# File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 479, in run_rnn_layers
# check_rnn_layer_forward(gluon.rnn.RNN(10, 2, dtype=dtype), mx.nd.ones((8, 3, 20), dtype=dtype), ctx=ctx)
# File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 451, in check_rnn_layer_forward
# np_out = out.asnumpy()
# File "/home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1995, in asnumpy
# ctypes.c_size_t(data.size)))
# File "/home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
# raise MXNetError(py_str(_LIB.MXGetLastError()))
# mxnet.base.MXNetError: [07:41:30] src/operator/./cudnn_rnn-inl.h:759: Check failed: e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH
#
# Stack trace returned 10 entries:
# [bt] (0) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1c7) [0x7fe8ec2eebd7]
# [bt] (1) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7fe8ec2ef082]
# [bt] (2) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNRNNOp<mshadow::half::half_t>::Init(mshadow::Stream<mshadow::gpu>*, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x333c) [0x7fe8f36f8afc]
# [bt] (3) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNRNNOp<mshadow::half::half_t>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x1501) [0x7fe8f3700c61]
# [bt] (4) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x48b) [0x7fe8ef82dd5b]
# [bt] (5) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::LegacyOpForward(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x18) [0x7fe8ef820838]
# [bt] (6) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&), void (*)(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)>::_M_invoke(std::_Any_data const&, mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x20) [0x7fe8ef5d9250]
# [bt] (7) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x2e8) [0x7fe8ef8d7e88]
# [bt] (8) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x25) [0x7fe8ef8d8215]
# [bt] (9) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x5f9056e) [0x7fe8f02a656e]
#
#
# -------------------- >> begin captured logging << --------------------
# common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1716277661 to reproduce.
# --------------------- >> end captured logging << ---------------------
#
# ----------------------------------------------------------------------
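# As the captured logging above notes, the failing test can be re-run with
# the same module seed (a sketch using the seed reported in the log):
MXNET_MODULE_SEED=1716277661 nosetests-3.4 --verbose tests/python/gpu/test_gluon_gpu.py:test_rnn_layers_fp16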
Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Cuda
My suggestion would be to maybe merge the PR with cuDNN v7.3.1.20 (which would at least ensure that MXNet works with cuDNN up to this version); then whoever tackles the v7.5 issue can just update the CI image to use the latest version of cuDNN.
Just so you know, the latest version of CUDA is 10.1.105_418.39 and the latest cuDNN is 10.1-linux-x64-v7.5.0.56.
Why not use this version?
Greetings
Hey,
This requires a bit more work on the AMI side. I'm also not convinced that it will solve the problem.
Once we get on cuDNN 7.5, we can look at updating the AMIs to CUDA 10.1 and then bumping the CI images.
Cheers
Hey @perdasilva
I can tackle the update v7.3.1.20 -> v7.5 and then bump up to CUDA 10.1 if you decide to merge the PR first
Thanks
@mxnet-label-bot add [CUDA, CI]
@stu1130 thank you. It seems that the nvidia drivers on the linux nodes have been bumped to 418 because of the tensorrt issues. This means we should be able to use CUDA 10.1 =) (let me know if it doesn't work)
@perdasilva any updates? Thanks
@stu1130 I'm currently on leave until Thursday. I totally missed that you wanted me to merge the other PR first. I will do that as soon as I'm back. I'm sorry missed that. I'll see about already bumping CI to 10.1 as well - then that's done.
@perdasilva no rush! Thanks a lot for this awesome job!!!
@stu1130 There's no cudnn 7.3 package for cuda 10.1, so I won't be able to update CI to 10.1 in my PR.
I've just done a rebase and I'm putting it through CI =D I'll let you know once it's through.
@stu1130 it's been merged! Feel free to take it away and let me know if I can help you =)
@perdasilva Awesome Thanks a lot!!!
Here is what I found: cudnnGetRNNWorkspaceSize, in here.
This has since been fixed ^^ thanks to @stu1130