Nvidia-docker: docker run fails with signal 11 on Ubuntu 16.04 with CUDA 9.2

Created on 24 May 2018 · 30 comments · Source: NVIDIA/nvidia-docker

1. Issue or feature description

docker run with the NVIDIA runtime produces an ldcache error: ldconfig is terminated with signal 11

  • Ubuntu 16.04
  • CUDA 9.2 with driver 396.26
  • 2 x Titan Xp

2. Steps to reproduce the issue

Any docker run command with the --runtime=nvidia option that runs nvidia-smi

e.g.

docker run --runtime=nvidia --rm nvidia/cuda:9.2-runtime-ubuntu16.04 nvidia-smi

gives

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods --debug=/var/log/nvidia-container-runtime-hook.log configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=5658 /var/lib/docker/overlay2/79f27ebedeff1ab14b3c77ccfbc6fbc6afccfac8635a81f1bf39987426f40d1b/merged]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig.real terminated with signal 11\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled

3. Information to attach (optional if deemed irrelevant)

  • [x] Kernel version from uname -a

    Linux PC-Name 4.4.0-127-generic #153-Ubuntu SMP Sat May 19 10:58:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  • [x] Any relevant kernel output lines from dmesg

[ 8809.443159] device vethed90836 entered promiscuous mode
[ 8809.443331] docker0: port 1(vethed90836) entered forwarding state
[ 8809.443354] docker0: port 1(vethed90836) entered forwarding state
[ 8809.444031] docker0: port 1(vethed90836) entered disabled state
[ 8809.599967] docker0: port 1(vethed90836) entered forwarding state
[ 8809.600011] docker0: port 1(vethed90836) entered forwarding state
[ 8809.719434] traps: nvc:[ldconfig][5471] general protection ip:7efc3ad44196 sp:7ffcad07bc00 error:0 in libc-2.23.so[7efc3ad0d000+1c0000]
[ 8809.923148] docker0: port 1(vethed90836) entered disabled state
[ 8809.950357] docker0: port 1(vethed90836) entered disabled state
[ 8809.952652] docker0: port 1(vethed90836) entered disabled state
  • [x] Driver information from nvidia-smi -a
    nvidia-smi.txt

  • [x] Docker version from docker version

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:20 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:30 2018
  OS/Arch:      linux/amd64
  Experimental: false
  • [x] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
||/ Name                Version        Architecture   Description
+++-===================-==============-==============-============================================
ii  libnvidia-container 1.0.0~rc.1-1   amd64          NVIDIA container runtime library (command-li
ii  libnvidia-container 1.0.0~rc.1-1   amd64          NVIDIA container runtime library
ii  nvidia-396          396.26-0ubuntu amd64          NVIDIA binary driver - version 396.26
ii  nvidia-396-dev      396.26-0ubuntu amd64          NVIDIA binary Xorg driver development files
un  nvidia-common       <none>         <none>         (no description available)
ii  nvidia-container-ru 2.0.0+docker18 amd64          NVIDIA container runtime
ii  nvidia-container-ru 1.3.0-1        amd64          NVIDIA container runtime hook
un  nvidia-current      <none>         <none>         (no description available)
un  nvidia-docker       <none>         <none>         (no description available)
ii  nvidia-docker2      2.0.3+docker18 all            nvidia-docker CLI wrapper
un  nvidia-driver-binar <none>         <none>         (no description available)
un  nvidia-legacy-340xx <none>         <none>         (no description available)
un  nvidia-libopencl1-3 <none>         <none>         (no description available)
un  nvidia-libopencl1-d <none>         <none>         (no description available)
ii  nvidia-modprobe     396.26-0ubuntu amd64          Load the NVIDIA kernel driver and create dev
un  nvidia-opencl-icd   <none>         <none>         (no description available)
ii  nvidia-opencl-icd-3 396.26-0ubuntu amd64          NVIDIA OpenCL ICD
un  nvidia-persistenced <none>         <none>         (no description available)
ii  nvidia-prime        0.8.2          amd64          Tools to enable NVIDIA's Prime
ii  nvidia-settings     396.26-0ubuntu amd64          Tool for configuring the NVIDIA graphics dri
un  nvidia-settings-bin <none>         <none>         (no description available)
un  nvidia-smi          <none>         <none>         (no description available)
un  nvidia-vdpau-driver <none>         <none>         (no description available)
  • [x] NVIDIA container library version from nvidia-container-cli -V
version: 1.0.0
build date: 2018-04-26T22:53+00:00
build revision: 163054a04b21c4455c8cae7e47873d9f2a091f55
build compiler: gcc-5 5.4.0 20160609
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
$ docker run --runtime=nvidia nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods --debug=/var/log/nvidia-container-runtime-hook.log configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=5658 /var/lib/docker/overlay2/79f27ebedeff1ab14b3c77ccfbc6fbc6afccfac8635a81f1bf39987426f40d1b/merged]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig.real terminated with signal 11\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled

same result with

docker run --runtime=nvidia --rm nvidia/cuda:9.2-runtime-ubuntu16.04 nvidia-smi
  • CUDA Version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88

If I am doing something wrong, please let me know. Thank you for your help.

Most helpful comment

I know this issue is closed but I recently had the exact same error and found this issue. @Chachay mentioning having the ESET NOD32 antivirus installed rang a bell because I've had this error since I installed the same antivirus.

I tried killing all the ESET processes on my linux system and it fixed the issue so I can definitely confirm that the issue is linked to ESET NOD32 running on the system. Disabling the real-time protection via the interface does not work by the way, I had to manually kill the ESET processes.

All 30 comments

Thanks for the detailed bug report!

[ 8809.719434] traps: nvc:[ldconfig][5471] general protection ip:7efc3ad44196 sp:7ffcad07bc00 error:0 in libc-2.23.so[7efc3ad0d000+1c0000]

Any idea @3XX0?

@Chachay Can you do sudo ldconfig on the host? Does it work fine? Check $? too
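For example (a quick sanity check on the host; assuming a bash-like shell):

sudo ldconfig
echo $?   # 0 means the host's ldconfig exited cleanly; a crash here would point at the host itself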

@flx42 Thank you for your quick response. sudo ldconfig works fine on the host. And echo $? returns 125 after docker run fails.

Try editing /etc/nvidia-container-runtime/config.toml, changing ldconfig = "@/sbin/ldconfig.real" to ldconfig = "/sbin/ldconfig.real", and see if you still have the issue.
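For reference, the relevant line in /etc/nvidia-container-runtime/config.toml would change like this (a sketch of that one line only; restarting the Docker daemon afterwards may also be needed):

# before
ldconfig = "@/sbin/ldconfig.real"
# after
ldconfig = "/sbin/ldconfig.real"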

I got almost the same result as before.

$ docker run --runtime=nvidia --rm nvidia/cuda:9.2-runtime-ubuntu16.04 nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods --debug=/var/log/nvidia-container-runtime-hook.log configure --ldconfig=/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.2 --pid=7385 /var/lib/docker/overlay2/99baf9e0c0a1e878f48138655562033217c300e621079514a6fe59c9bdc92a65/merged]\\\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig.real terminated with signal 11\\\\n\\\"\"": unknown.

Can you install the following package see if it fixes the issue?

sudo dpkg -i libnvidia-container1_1.0.0~rc.1-1_amd64.deb

libnvidia-container1.zip

I have the same issue. My host is Ubuntu 18.04 with Nvidia 390 (GeForce GTX 1050 Ti Mobile).

I tried installing it with sudo dpkg -i libnvidia-container1_1.0.0~rc.1-1_amd64.deb and restarting the docker service, but the issue is still here.

@hadim did you also try what I mentioned above?

I tried with ldconfig = "/sbin/ldconfig.real" without success (I didn't restart the Docker service, should I?).

sudo dpkg -i libnvidia-container1_1.0.0~rc.1-1_amd64.deb did not work for me either. I rebooted Ubuntu after the dpkg install too.

Note: while waiting for your patch, I removed nvidia-docker2 and worked with nvidia-docker1 (which worked without trouble). Just before testing your package, I switched back to nvidia-docker2 following this instruction and confirmed that the failure still happens as before.

This nvidia-docker1 package worked:
https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb

@hadim @Chachay can you both provide the output of docker info on your systems? Thanks

$ docker info
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 36
Server Version: 17.12.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: nvidia runc
Default Runtime: nvidia
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e-dirty (expected: 9f9c96235cc97674e935002fc3d78361b696a69e)
init version: v0.13.0 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-22-generic
Operating System: Ubuntu 18.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.54GiB
Name: boromir
ID: GCCE:ZJXY:ZJFF:O4EQ:NOZA:IGVM:BBBM:3IC7:3TFW:O7IK:BAYR:4LUX
Docker Root Dir: /home/hadim/.local/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

I followed the above steps but I am still having the problem.

Error Output
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=4290 /var/lib/docker/overlay2/5de1fc82ef1ec5c30c41111e5142a53b668ff9904258e918097696d27b43cad9/merged]\\\\nnvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\\\n\\\"\"": unknown.

lsb_release -a output

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.4 LTS
Release:        16.04
Codename:       xenial

Linux Kernel Information
4.4.0-127-generic
docker info output
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 3
Server Version: 18.03.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: nvidia runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-127-generic
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.4GiB
Name: spencer
ID: A4LD:VXTZ:Q26L:AACU:GAQU:XM5D:ZOA5:MS4G:DVI6:KG5B:GS4A:RBIK
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support
nvidia-smi output

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.24 Driver Version: 396.24 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:01:00.0 On | N/A |
| 23% 40C P8 15W / 250W | 113MiB / 12194MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN V Off | 00000000:02:00.0 Off | N/A |
| 33% 48C P8 30W / 250W | 0MiB / 12066MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1758 G /usr/lib/xorg/Xorg 110MiB |
+-----------------------------------------------------------------------------+

ldconfig output

sudo ldconfig
echo $?
0

@saurabhjha1 your problem is different, please open a new issue.

@hadim can you open a separate issue and provide the requested information in the issue template?

@thdrl @Chachay Can you try this one and let me know if it works?
If not, please attach your log output.

libnvidia-container1.zip

@3XX0, thank you for your follow-up and advice, but I'm sorry that I can't try it, because I got a container running with a weird method and the server admin doesn't want to test right now.

I succeeded in running a container by installing nvidia-docker (not nvidia-docker2). But there was no graphics driver in it (CUDA 9.0 is provided from the host by nvidia-docker), so I installed the graphics driver (nvidia-384) in the container. Everything works fine, just like running a container via nvidia-docker2.

Thank you for your help.

@3XX0 I've done 'sudo dpkg -i ' and still get this error. The log follows. Please tell me if I can help you by downgrading the nvidia driver, etc.
nvidia-container-runtime-hook.log
@flx42 the output of my docker info | egrep -v 'Name:|Proxy' > dockerfix/mydockinfo.txt follows:
mydockinfo.txt

@Chachay I'm honestly at a loss with regard to your problem :)
Anything special about your setup? Some special security configuration? Or is it a plain Ubuntu install?

Ok, this one hopefully solves the problem. @Chachay can you try it?

libnvidia-container1.zip

@3XX0 no, it didn't work. I even reinstalled docker-ce as well.
@flx42 I have ESET NOD32 on my system. Could this possibly interfere with nvidia-docker2? Maybe I should stick with nvidia-docker1?

docker run --runtime=nvidia --rm nvidia/cuda:8.0-devel nvidia-smi # does not work
sudo apt-get remove nvidia-docker2
sudo apt-get autoremove
sudo pkill -SIGHUP dockerd # restart docker even if it may not make sense
sudo dpkg -i nvidia-docker_1.0.1-1_amd64.deb
sudo pkill -SIGHUP dockerd # ditto
nvidia-docker run --rm nvidia/cuda:8.0-devel nvidia-smi # works

If you remove --ldconfig=@/sbin/ldconfig.real from the /etc config file where it is located, it works.
It seems the loader for shared libraries fails to start.
Does it have any other effect than just bloating memory a bit?

@NikolaMandic the ldconfig execution is needed so that the bind-mounted driver libraries are visible to containerized applications at container startup.
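As an illustration (not from this thread), once the hook succeeds you can confirm that those libraries were registered in the container's ld cache, e.g.:

docker run --runtime=nvidia --rm nvidia/cuda:9.2-runtime-ubuntu16.04 sh -c 'ldconfig -p | grep -i nvidia'
# should list entries such as libnvidia-ml.so and libcuda.so; an empty result would mean the cache was not refreshed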

@Chachay we released a new version of libnvidia-container with a bunch of fixes, we hoped that https://github.com/NVIDIA/libnvidia-container/commit/931bd4f08282ce98da4496db0993e2c8f67a05c1 would cover your problem, but it doesn't seem to be the case for you.

Not sure what ESET NOD32 does, you would have to check the logs.

Closing for now, I don't think there is anything else we can do at this point. If you get additional information, feel free to reopen.

I know this issue is closed but I recently had the exact same error and found this issue. @Chachay mentioning having the ESET NOD32 antivirus installed rang a bell because I've had this error since I installed the same antivirus.

I tried killing all the ESET processes on my linux system and it fixed the issue so I can definitely confirm that the issue is linked to ESET NOD32 running on the system. Disabling the real-time protection via the interface does not work by the way, I had to manually kill the ESET processes.
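In case it helps, a rough sketch of what I did (the exact ESET process names differ between versions, so treat these as examples and check what is actually running on your machine):

ps aux | grep -i eset     # list the running ESET processes and note their PIDs
sudo kill <PID>           # kill each of them, then re-run the failing docker command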

@Djoulihen thanks for your comment! This is very helpful because we have no way to test with this configuration.

@flx42 sorry for the late response, I have had to work with docker this way for a few weeks, but it sounds great to finally have a clue! :)

@Djoulihen thank you so much for your comment!!!! I'll definitely check this next week.

I got the same problem. Removing ESET fixes the issue. Apart from that, the newest version of ESET no longer locks the libraries, so either removing or updating your ESET should work.

It worked after killing the ESET processes :+1:
