On an Amazon EC2 machine with NVIDIA driver version 361.42, and with nvidia-docker and nvidia-docker-plugin installed and running, the latest DIGITS (4.0) shows the following in its log:
cudaRuntimeGetVersion() failed with error #35
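For context, error 35 is cudaErrorInsufficientDriver: the CUDA runtime inside the container is newer than what the host driver supports. A rough sketch of the comparison (the version numbers below are placeholders, not values from this machine):

```shell
# Sketch: compare the CUDA version the host driver supports against the
# runtime version the container was built for. Placeholder values only.
driver_cuda="7.5"    # e.g. the max CUDA version the installed driver supports
runtime_cuda="8.0"   # e.g. the CUDA runtime inside the container

# sort -V orders version strings numerically; if the driver version is not
# the highest of the pair, the runtime is newer than the driver -> error 35.
newest=$(printf '%s\n%s\n' "$driver_cuda" "$runtime_cuda" | sort -V | tail -n1)
if [ "$newest" != "$driver_cuda" ]; then
    echo "driver too old for runtime (this is what error 35 means)"
fi
```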
nvidia-docker volume ls on my machine shows
nvidia-docker nvidia_driver_361.42
There are no CUDA binaries (e.g. deviceQuery or nvidia-smi) that I could find in the DIGITS container, but running
nvidia-docker run --rm nvidia/cuda nvidia-smi
results in
| NVIDIA-SMI 361.42 Driver Version: 361.42 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 35C P8 17W / 125W | 11MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Trying to nvidia-docker build a Dockerfile based on nvidia/cuda:7.0-cudnn4-devel-ubuntu14.04, which clones the master branch of Caffe and compiles it with cuDNN enabled, fails at the beginning of testing with the following error:
Cuda number of devices: 0
Setting to use device 0
Current device id: 0
Current device name:
Note: Randomizing tests' orders with a seed of 21847 .
[==========] Running 2081 tests from 277 test cases.
[----------] Global test environment set-up.
[----------] 50 tests from NeuronLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN ] NeuronLayerTest/3.TestSigmoidGradient
E0905 10:18:15.161348 263 common.cpp:113] Cannot create Cublas handle. Cublas won't be available.
E0905 10:18:15.162796 263 common.cpp:120] Cannot create Curand generator. Curand won't be available.
F0905 10:18:15.162914 263 syncedmem.hpp:18] Check failed: error == cudaSuccess (35 vs. 0) CUDA driver version is insufficient for CUDA runtime version
But oddly enough, beniz/deepdetect_gpu does seem to work properly with the GPU...
Any ideas?
Looks like your driver wasn't installed properly. How did you install it?
It's Ubuntu 15.10 (GNU/Linux 4.2.0-42-generic x86_64); this is what I did from the beginning:
$ sudo apt-get update
$ sudo apt-get install --no-install-recommends -y gcc make libc-dev
$ wget -P /tmp http://us.download.nvidia.com/XFree86/Linux-x86_64/361.42/NVIDIA-Linux-x86_64-361.42.run
$ sudo sh /tmp/NVIDIA-Linux-x86_64-361.42.run --silent
$ wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0-rc.3/nvidia-docker_1.0.0.rc.3-1_amd64.deb
$ sudo dpkg -i /tmp/nvidia-docker_*.deb && rm /tmp/nvidia-docker_*.deb
$ sudo apt-get install dkms build-essential linux-headers-generic
$ sudo nano /etc/modprobe.d/blacklist-nouveau.conf
adding the following lines:
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
save and quit
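As a side note, the same file can be written non-interactively instead of editing it in nano, e.g.:

```shell
# Non-interactive equivalent of the nano edit above: write the nouveau
# blacklist entries to the same path in one shot.
sudo tee /etc/modprobe.d/blacklist-nouveau.conf > /dev/null <<'EOF'
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF
```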
$ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
$ sudo update-initramfs -u
I may have had to re-run the NVIDIA installer at this stage (exactly the same two lines as before).
And finally
$ sudo usermod -aG docker ubuntu
$ sudo service nvidia-docker start
Made sure both docker and nvidia-docker-plugin services are up:
$ service nvidia-docker status
$ service docker status
And as mentioned above, the nvidia/cuda container is able to run nvidia-smi, the GPU and driver versions show as expected, and beniz/deepdetect_gpu does seem to work properly with the GPU.
What's the output of ldconfig -p | grep libcuda and sudo ls -lR /var/lib/nvidia-docker | grep libcuda?
$ ldconfig -p | grep libcuda
libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so
$ sudo ls -lR /var/lib/nvidia-docker | grep libcuda
lrwxrwxrwx 1 nvidia-docker nvidia-docker 17 Sep 1 09:36 libcuda.so -> libcuda.so.361.42
lrwxrwxrwx 1 nvidia-docker nvidia-docker 17 Sep 1 09:36 libcuda.so.1 -> libcuda.so.361.42
-rwxr-xr-x 2 root root 16881416 Aug 31 22:54 libcuda.so.361.42
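A quick sanity check here (a sketch assuming the paths and layout shown above) is to confirm that the libcuda version exposed by the nvidia-docker volume matches the one the host loader resolves:

```shell
# Sketch: compare the libcuda driver-library version on the host with the
# one in the nvidia-docker volume. Paths assume the layout shown above.
host_lib=$(ldconfig -p | awk '/libcuda\.so\.1/ {print $NF; exit}')
host_ver=$(readlink -f "$host_lib" | sed 's/.*libcuda\.so\.//')
vol_ver=$(sudo find /var/lib/nvidia-docker -name 'libcuda.so.[0-9]*' \
          | sed 's/.*libcuda\.so\.//' | head -n1)
echo "host: $host_ver  volume: $vol_ver"
[ "$host_ver" = "$vol_ver" ] && echo "versions match"
```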
Hmm. I just got a similar unexpected error while playing with a Torch-based docker image.
THCudaCheck FAIL file=/torch/extra/cutorch/lib/THC/THCGeneral.c line=20 error=35 : CUDA driver version is insufficient for CUDA runtime version
/torch/install/bin/luajit: /torch/install/share/lua/5.1/trepl/init.lua:384: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /torch/extra/cutorch/lib/THC/THCGeneral.c:20
stack traceback:
[C]: in function 'error'
/torch/install/share/lua/5.1/trepl/init.lua:384: in function 'require'
neural_style.lua:51: in function 'main'
neural_style.lua:515: in main chunk
[C]: in function 'dofile'
/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
My setup: Ubuntu 16.04, driver 367.48, nvidia-docker 1.0.0~rc.3-1, Docker 1.12.1-0~xenial, image nvidia/cuda:7.5-cudnn5-devel-ubuntu14.04... but a DIGITS image and an NVcaffe image work fine? Not sure what's happening here.
@3XX0 helped me figure out my problem. I was trying to use CUDA while building the image, but the driver isn't available yet at build time. When I changed the last step in my Dockerfile from a RUN to a CMD, everything worked fine. Nevermind!
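That distinction can be sketched in a Dockerfile (the base image tag matches one used earlier in this thread; my_gpu_test is a hypothetical GPU-dependent step):

```dockerfile
FROM nvidia/cuda:7.5-cudnn5-devel-ubuntu14.04

# Build-time steps must not touch the GPU: the driver volume that
# nvidia-docker provides is only mounted at `nvidia-docker run` time.
RUN apt-get update && apt-get install -y --no-install-recommends git

# BAD (fails with error 35 during `nvidia-docker build`):
#   RUN ./my_gpu_test
# GOOD (runs when the container starts, with the driver mounted):
CMD ["./my_gpu_test"]
```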
@Motherboard what does this command do for you?
nvidia-docker run --rm --entrypoint 'digits/device_query.py' nvidia/digits
nvidia-docker doesn't like it when I don't give it all the volumes declared in the Dockerfile, so
$ nvidia-docker run --rm --entrypoint 'digits/device_query.py' nvidia/digits
gives
docker: Error response from daemon: create f64b902e8ee8344f2a45a9e0420aa63b2d70349473229877a65cb9ac47152029: bad volume format: f64b902e8ee8344f2a45a9e0420aa63b2d70349473229877a65cb9ac47152029.
But
$ nvidia-docker run --rm -v /home/ubuntu/notebook:/data -v /home/ubuntu/jobs:/jobs --entrypoint 'digits/device_query.py' nvidia/digits
gives
Device #0:
>>> CUDA attributes:
name GRID K520
totalGlobalMem 4294770688
clockRate 797000
major 3
minor 0
>>> NVML attributes:
Total memory 4095 MB
Used memory 48 MB
Memory utilization 0%
GPU utilization 0%
Temperature 36 C
I don't know what was previously wrong, but I've tried running DIGITS again, and it seems to be fine...
Can't reproduce the error...
I struggled a lot trying to use my GTX 860M in a Lenovo Y70 machine with an i7 and an Intel integrated graphics card, and one of the errors was quite similar to the ones you are getting. I discovered something about how the NVIDIA GPU has to be activated before anything tries to access it through the drivers. Just to give you ideas and open a possible solution path:
When I run NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery, I get:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
But if I try with $ optirun NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery,
the result is the one we want:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 860M"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 4044 MBytes (4240965632 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1020 MHz (1.02 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 860M
Result = PASS
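For reference, on Optimus laptops like this the discrete GPU is powered down until a program is routed through it, which is roughly what optirun does via Bumblebee; that's why the bare deviceQuery fails with error 35. A minimal sketch of a wrapper (the function name is my own invention) that uses optirun when it's available:

```shell
# Sketch for Optimus/Bumblebee systems: wrap GPU-dependent commands so they
# go through optirun when it is installed, and run directly otherwise.
run_on_nvidia() {
    if command -v optirun > /dev/null; then
        optirun "$@"
    else
        "$@"
    fi
}

run_on_nvidia NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
```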
That makes me think all my problems are related to the way I invoke programs. Now I'm investigating how to make it work with Torch for recurrent neural networks, but with the GPU...