Incubator-mxnet: CUDA 10 w/ cuDNN 7.5 Support

Created on 9 Apr 2019 · 15 comments · Source: apache/incubator-mxnet

Description

Currently, the CI tests fail when running MXNet on top of CUDA 10 and cuDNN 7.5, as demonstrated in this PR.

The tests pass when using CUDA 10 and cuDNN 7.3.1.20, as demonstrated in this PR.

Environment info (Required)

g3.8xlarge instance with CUDA 10 and NVIDIA driver 410.73 installed.
The code is running inside the CI GPU container based on nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04.

Error Message:

Usually: src/operator/./cudnn_rnn-inl.h:759: Check failed: e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH
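
The check fires in the fp16 cuDNN RNN path. The snippet below is a minimal sketch of what the failing test exercises, pieced together from the traceback in the steps further down (the layer and input shapes mirror check_rnn_layer_forward; it is not an exact copy of the test):

# fp16 RNN forward on GPU: this routes through CuDNNRNNOp<half_t>, where the
# CUDNN_STATUS_ARCH_MISMATCH check fails on the CI machines.
import mxnet as mx
from mxnet import gluon

ctx = mx.gpu(0)
layer = gluon.rnn.RNN(10, 2, dtype='float16')          # hidden_size=10, num_layers=2
layer.initialize(ctx=ctx)
x = mx.nd.ones((8, 3, 20), dtype='float16', ctx=ctx)   # (seq_len, batch, input_size)
out = layer(x)
print(out.asnumpy().shape)  # asnumpy() synchronises and surfaces the MXNetError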

Here are some example logs:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14611/12/pipeline/

Steps to reproduce

# Launch g3.8xlarge instance with ubuntu 16.04

# ==-_-==-_-== Environment Setup ==-_-==-_-==

sudo apt update
sudo apt-get install -y \
    apt-transport-https \
    build-essential \
    ca-certificates \
    curl \
    git \
    libatlas-base-dev \
    libcurl4-openssl-dev \
    libjemalloc-dev \
    libhdf5-dev \
    liblapack-dev \
    libopenblas-dev \
    libopencv-dev \
    libturbojpeg \
    libzmq3-dev \
    ninja-build \
    software-properties-common \
    sudo \
    unzip \
    wget

sudo apt-get install -y python-dev python3-dev virtualenv wget

# the version of pip shipped with Ubuntu may be too old, so install a recent version here
wget -nv https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
sudo python2 get-pip.py

pip2 install --user nose cpplint==1.3.0 pylint==1.9.3 'numpy<=1.15.2,>=1.8.2' nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3
pip3 install --user nose cpplint==1.3.0 pylint==2.1.1 'numpy<=1.15.2,>=1.8.2' nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3
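
# Optional sanity check (a small sketch, not part of the original steps):
# confirm the pinned Python stack above resolved to importable versions.
python3 - <<'EOF'
import h5py, numpy, requests, scipy
for mod in (numpy, scipy, h5py, requests):
    print(mod.__name__, mod.__version__)
EOF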

# ==-_-==-_-== CUDA Installation ==-_-==-_-==

wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
chmod +x cuda_10.0.130_410.48_linux && sudo ./cuda_10.0.130_410.48_linux

# Installation excerpt:
# Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
# (y)es/(n)o/(q)uit: y
# 
# Do you want to install the OpenGL libraries?
# (y)es/(n)o/(q)uit [ default is yes ]:
#
# Do you want to run nvidia-xconfig?
# This will update the system X configuration file so that the NVIDIA X driver
# is used. The pre-existing X configuration file will be backed up.
# This option should not be used on systems that require a custom
# X configuration, such as systems with multiple GPU vendors.
# (y)es/(n)o/(q)uit [ default is no ]:
# 
# Install the CUDA 10.0 Toolkit?
# (y)es/(n)o/(q)uit: y
#
# Enter Toolkit Location
# [ default is /usr/local/cuda-10.0 ]:
#
# Do you want to install a symbolic link at /usr/local/cuda?
# (y)es/(n)o/(q)uit: y
#
# Install the CUDA 10.0 Samples?
# (y)es/(n)o/(q)uit: n

# Set LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}

# Check installation
nvidia-smi

# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
# |===============================+======================+======================|
# |   0  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
# | N/A   31C    P0    43W / 150W |      0MiB /  7618MiB |      0%      Default |
# +-------------------------------+----------------------+----------------------+
# |   1  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
# | N/A   34C    P0    41W / 150W |      0MiB /  7618MiB |     99%      Default |
# +-------------------------------+----------------------+----------------------+
#
# +-----------------------------------------------------------------------------+
# | Processes:                                                       GPU Memory |
# |  GPU       PID   Type   Process name                             Usage      |
# |=============================================================================|
# |  No running processes found                                                 |
# +-----------------------------------------------------------------------------+
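
# As an extra check beyond nvidia-smi, the CUDA runtime and driver versions can
# be queried through the CUDA runtime API (a small sketch; it assumes
# libcudart.so is reachable via the LD_LIBRARY_PATH export above).
python3 - <<'EOF'
import ctypes

cudart = ctypes.CDLL('libcudart.so')
ver = ctypes.c_int()
cudart.cudaRuntimeGetVersion(ctypes.byref(ver))
print('CUDA runtime:', ver.value)            # 10000 for CUDA 10.0, 10010 for 10.1
cudart.cudaDriverGetVersion(ctypes.byref(ver))
print('driver supports up to:', ver.value)
EOF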

# ==-_-==-_-== Setup cuDNN ==-_-==-_-==

# https://developer.nvidia.com/rdp/cudnn-download
# Register with NVIDIA and download cudnn-10.0-linux-x64-v7.5.0.56.tgz
# scp it to your instance
# https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html
tar -xvzf cudnn-10.0-linux-x64-v7.5.0.56.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
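
# Optional check (a small sketch, not part of the original steps): confirm the
# header just installed reports the expected cuDNN 7.5.0 by parsing the
# CUDNN_MAJOR/MINOR/PATCHLEVEL macros in cudnn.h.
python3 - <<'EOF'
import re

with open('/usr/local/cuda/include/cudnn.h') as f:
    src = f.read()
ver = {k: int(re.search(r'#define CUDNN_%s\s+(\d+)' % k, src).group(1))
       for k in ('MAJOR', 'MINOR', 'PATCHLEVEL')}
print(ver)   # expected: {'MAJOR': 7, 'MINOR': 5, 'PATCHLEVEL': 0}
EOF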

# ==-_-==-_-== Clone MXNet Repo. ==-_-==-_-==
mkdir -p repositories/apache && cd repositories/apache
git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet

# ==-_-==-_-== Compile MXNet ==-_-==-_-==
make \
        DEV=1                                     \
        ENABLE_TESTCOVERAGE=1                     \
        USE_BLAS=openblas                         \
        USE_MKLDNN=0                              \
        USE_CUDA=1                                \
        USE_CUDA_PATH=/usr/local/cuda             \
        USE_CUDNN=1                               \
        USE_CPP_PACKAGE=0                         \
        USE_DIST_KVSTORE=1                        \
        USE_SIGNAL_HANDLER=1                      \
        -j$(nproc)
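
# Before running the failing test, it can help to confirm the fresh build
# imports and can touch the GPU at all (a quick sanity sketch; the plain fp32
# path below does not go through the cuDNN RNN code, so it should pass even
# when the fp16 RNN test fails).
PYTHONPATH=./python/ python3 - <<'EOF'
import mxnet as mx

print(mx.__version__)
a = mx.nd.ones((2, 3), ctx=mx.gpu(0))
print((a * 2).asnumpy())
EOF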

# ==-_-==-_-== Run failing test ==-_-==-_-==
export PYTHONPATH=./python/                                                                                        
nosetests-3.4 --verbose tests/python/gpu/test_gluon_gpu.py:test_rnn_layers_fp16

# Error excerpt:
# ======================================================================
# ERROR: test_gluon_gpu.test_rnn_layers_fp16
# ----------------------------------------------------------------------
# Traceback (most recent call last):
#   File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
#     self.test(*self.arg)
#   File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
#     return func(*arg, **kw)
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/common.py", line 110, in test_new
#     orig_test(*args, **kwargs)
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 545, in test_rnn_layers_fp16
#     run_rnn_layers('float16', 'float32', mx.gpu())
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 479, in run_rnn_layers
#     check_rnn_layer_forward(gluon.rnn.RNN(10, 2, dtype=dtype), mx.nd.ones((8, 3, 20), dtype=dtype), ctx=ctx)
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 451, in check_rnn_layer_forward
#     np_out = out.asnumpy()
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1995, in asnumpy
#     ctypes.c_size_t(data.size)))
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
#     raise MXNetError(py_str(_LIB.MXGetLastError()))
# mxnet.base.MXNetError: [07:41:30] src/operator/./cudnn_rnn-inl.h:759: Check failed: e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH
# 
# Stack trace returned 10 entries:
# [bt] (0) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1c7) [0x7fe8ec2eebd7]
# [bt] (1) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7fe8ec2ef082]
# [bt] (2) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNRNNOp<mshadow::half::half_t>::Init(mshadow::Stream<mshadow::gpu>*, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x333c) [0x7fe8f36f8afc]
# [bt] (3) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNRNNOp<mshadow::half::half_t>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x1501) [0x7fe8f3700c61]
# [bt] (4) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x48b) [0x7fe8ef82dd5b]
# [bt] (5) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::LegacyOpForward(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x18) [0x7fe8ef820838]
# [bt] (6) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&), void (*)(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)>::_M_invoke(std::_Any_data const&, mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x20) [0x7fe8ef5d9250]
# [bt] (7) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x2e8) [0x7fe8ef8d7e88]
# [bt] (8) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, # mxnet::RunContext&&)+0x25) [0x7fe8ef8d8215]
# [bt] (9) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x5f9056e) [0x7fe8f02a656e]
# 
# 
# -------------------- >> begin captured logging << --------------------
# common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1716277661 to reproduce.
# --------------------- >> end captured logging << ---------------------
# 
# ----------------------------------------------------------------------
Labels: CI, CUDA

Most helpful comment

@stu1130 it's been merged! Feel free to take it away and let me know if I can help you =)

All 15 comments

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Cuda

My suggestion would be to maybe merge the PR with cuDNN v7.3.1.20 (which would at least ensure that mxnet works with cuDNN up to this version), then whoever tackles the v7.5 issue can just update the CI image to use the latest version of cuDNN.

Just for your information:

the latest versions are CUDA 10.1.105_418.39 and cuDNN 10.1-linux-x64-v7.5.0.56.

Why not use those versions?

Greetings

Hey,

This requires a bit more work on the AMI side. I'm also not convinced that it will solve the problem.
Once we get on cuDNN 7.5, we can look at updating the AMIs to CUDA 10.1 and then bumping the CI images.

Cheers

Hey @perdasilva
I can tackle the update from v7.3.1.20 to v7.5 and then bump up to CUDA 10.1 if you decide to merge the PR first.
Thanks

@mxnet-label-bot add [CUDA, CI]

@stu1130 thank you. It seems that the NVIDIA drivers on the Linux nodes have been bumped to 418 because of the TensorRT issues. This means we should be able to use CUDA 10.1 =) (let me know if it doesn't work)

@perdasilva any updates? Thanks

@stu1130 I'm currently on leave until Thursday. I totally missed that you wanted me to merge the other PR first. I will do that as soon as I'm back. I'm sorry I missed that. I'll see about bumping CI to 10.1 as well - then that's done.

@perdasilva no rush! Thanks a lot for this awesome job!!!

@stu1130 There's no cuDNN 7.3 package for CUDA 10.1, so I won't be able to update CI to 10.1 in my PR.
I've just done a rebase and I'm putting it through CI =D I'll let you know once it's through.

@stu1130 it's been merged! Feel free to take it away and let me know if I can help you =)

@perdasilva Awesome Thanks a lot!!!

Here is what I found:

  1. The unit test fails on cuDNN 7.5.0 and 7.5.1 but works perfectly on 7.4.2 and 7.3.1. Using a Tesla V100 resolves the problem, i.e. the test works fine there on cuDNN 7.3.1, 7.5.0, and 7.5.1 (see the quick check sketched after this list).
  2. The function that raises the error is cudnnGetRNNWorkspaceSize, called here:
    https://github.com/apache/incubator-mxnet/blob/874fb89cd33b0e4affd7f3fb1b4ae4e09f25ef84/src/operator/rnn-inl.h#L1369
  3. I tried CUDA 10.1 with the latest CUDA driver 418.67 and it is still not working.
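
For anyone re-running this matrix, here is a small sketch (not from the original thread) that labels a given run with the GPU model and the cuDNN build the process actually loads, using nvidia-smi -L and cuDNN's cudnnGetVersion():

# Record which GPU (Tesla M60 vs V100) and which cuDNN (7.3.1/7.4.2/7.5.x)
# a reproduction run is using.
import ctypes
import subprocess

print(subprocess.check_output(['nvidia-smi', '-L']).decode().strip())
libcudnn = ctypes.CDLL('libcudnn.so.7')
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
print('cuDNN', libcudnn.cudnnGetVersion())   # e.g. 7301, 7402, 7500, 7501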

This has since been fixed ^^ thx to @stu1130
