nvidia-docker on ppc64le-ubuntu16.04

Created on 24 Mar 2017  ·  35 Comments  ·  Source: NVIDIA/nvidia-docker

I am trying to follow this guide to install Docker and nvidia-docker: https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/Using_a_GPU_in_a_Docker_Container_on_POWER?lang=en. I am not sure whether the guide is correct/current in the first place; I hit several issues along the way, but I did manage to get nvidia-docker installed. When I run this:

CUDA driver: 375.20

$ nvidia-docker run -ti cuda:8.0 bash

I get this error in /var/log/syslog:

returned error: create nvidia_driver_375.39: create nvidia_driver_375.39: Error looking up volume plugin nvidia-docker: plugin not found

It looks like nvidia-docker-plugin is not starting properly. Am I going about this the right way? Or have you tested a different driver version with which nvidia-docker works on Ubuntu 16.04?

service nvidia-docker status
● nvidia-docker.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)

Mar 23 17:43:46 ibm-hpc-08 nvidia-docker-plugin[106223]: /usr/bin/nvidia-docker-plugin | 2017/03/23 17:43:46 Error: cuda: all CUDA-capable devices are busy or unavailable
Mar 23 17:43:46 ibm-hpc-08 systemd[1]: nvidia-docker.service: Main process exited, code=exited, status=1/FAILURE
Mar 23 17:43:46 ibm-hpc-08 systemd[1]: nvidia-docker.service: Unit entered failed state.
Mar 23 17:43:46 ibm-hpc-08 systemd[1]: nvidia-docker.service: Failed with result 'exit-code'.
Mar 23 17:43:47 ibm-hpc-08 systemd[1]: nvidia-docker.service: Service hold-off time over, scheduling restart.
Mar 23 17:43:47 ibm-hpc-08 systemd[1]: Stopped NVIDIA Docker plugin.
Mar 23 17:43:47 ibm-hpc-08 systemd[1]: nvidia-docker.service: Start request repeated too quickly.
Mar 23 17:43:47 ibm-hpc-08 systemd[1]: Failed to start NVIDIA Docker plugin.
Mar 23 18:13:11 ibm-hpc-08 systemd[1]: Stopped NVIDIA Docker plugin.
Mar 23 18:13:11 ibm-hpc-08 systemd[1]: Stopped NVIDIA Docker plugin.


@clnperez maybe you have seen this error message already:

Error: cuda: all CUDA-capable devices are busy or unavailable

If not, it could be an issue related to the driver. Make sure the CUDA samples work outside of Docker.
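
For example (assuming the samples were installed to the default /usr/local/cuda location):

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ sudo make
$ ./deviceQuery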

CUDA samples work as expected outside nvidia-docker.

Hm, no. I haven't seen that specifically before. I have seen issues with starting the plugin with the 8.0.54 drivers installed, and moving to 8.0.61 fixed it. What point-release did you install to get your CUDA 8.0 drivers?

Also, what did you get stuck on in the blog post? I'll update it.

Cuda compilation tools, release 8.0, V8.0.61

I had to install Docker from the docker.io package, as I could not find docker-engine even after updating the repos. Also, the version in the doc is older; the current one is 1.13.3. I am not sure which one to use, although neither got me docker-engine.

Which driver version did you get nvidia-docker working with on CUDA 8.0.61?

If you are using POWER8, try using cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el.deb from https://developer.nvidia.com/cuda-downloads-power8 .
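
For reference, the usual flow for such a local repo package would be the standard NVIDIA three-step (the filename here is just the one quoted above):

$ sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el.deb
$ sudo apt-get update
$ sudo apt-get install -y cuda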

@superbug7 -- I don't think this will fix your issue, but you couldn't install docker-engine using this repo?

I'm double-checking the driver version and will get back asap.

@clnperez I could not find the docker-engine package in my apt cache. @pedropgusmao Let me try 8.0.54 ...

@superbug7 There was a mistake in the repo instructions. Can you try:

echo 'deb http://ftp.unicamp.br/pub/ppc64el/ubuntu/16_04/docker-1.13.1-ppc64el/ xenial main' >> /etc/apt/sources.list
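
If the line lands correctly, the follow-up would be the usual apt dance (assuming the unicamp repo layout is unchanged):

$ sudo apt-get update
$ sudo apt-get install -y docker-engine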

@clnperez I did try with 1.13.1 already. That didn't help.

@superbug7 You mean you still couldn't install docker-engine using apt after changing xenial docker-engine to xenial main in that echo command?

With docker-engine downloaded from http://ftp.unicamp.br/pub/ppc64el/ubuntu/16_04, I have built the deb and rpm successfully. You can find the patches discussed at #338.

@clnperez I wanted to weigh in on this.

I was running nvidia-docker on POWER8 machines without a hitch when I had the following levels.
nvidia module -- 361.121

and

cuda -- 8.0.61-1 from the ga2 repo

When I switched to levels similar to @superbug7's, I started seeing
cuda: all CUDA-capable devices are busy or unavailable

My levels are now
nvidia module -- 375.51

and

cuda -- 8.0.61-1 from the ga2 repo

I'm going to keep looking for hints on why the newer driver/CUDA stack isn't behaving on POWER; I'll let you know what I find.

In the nvidia-docker code, I found what looks like missing support for the 375 driver:

# grep -r drivers nvidia-docker-1.0.1
nvidia-docker-1.0.1/README.md:Assuming the NVIDIA drivers and Docker are properly installed (see [installation](https://github.com/NVIDIA/nvidia-docker/wiki/Installation))
nvidia-docker-1.0.1/build/deb/changelog:  * Support for 364 drivers
nvidia-docker-1.0.1/build/deb/changelog:  * Support for 361 drivers
nvidia-docker-1.0.1/build/rpm/SPECS/nvidia-docker.spec:- Support for 364 drivers
nvidia-docker-1.0.1/build/rpm/SPECS/nvidia-docker.spec:- Support for 361 drivers

@junlizhang That may be a red herring, a false positive so to speak. I talked to the nvidia-docker devs yesterday and they assured me the 375 driver works on x86 architectures without these issues, even though it isn't explicitly mentioned in the changelogs.

A few things that I've tried: first, if I run nvidia-docker-plugin as myself, I'd expect to get a permission-denied error when trying to access nvidia-docker.sock (strace does confirm this). However, my stderr output is:

nvidia-docker-plugin | 2017/04/19 15:50:11 Loading NVIDIA unified memory
nvidia-docker-plugin | 2017/04/19 15:50:11 Loading NVIDIA management library
nvidia-docker-plugin | 2017/04/19 15:50:11 Discovering GPU devices
nvidia-docker-plugin | 2017/04/19 15:50:11 Error: cuda: all CUDA-capable devices are busy or unavailable

For those interested, the strace output has

nvidia-docker-plugin | 2017/04/19 15:51:51 Error: listen unix /run/docker/plugins/nvidia-docker.sock: bind: permission denied

In addition, the strace output doesn't seem to contain the "all CUDA-capable devices" error, so that must be coming from the driver itself.

Other notes: running with sudo does appear to make this issue go away; however, rerunning systemctl start nvidia-docker.service does not.
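
For anyone following along, the unit's start/die cycle can be inspected with standard systemd tooling (nothing nvidia-docker specific):

$ systemctl status nvidia-docker.service
$ journalctl -u nvidia-docker.service --no-pager -n 50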

Thanks @dllehr81, so you're saying that with strace you are able to go past the "Discovering GPU devices" part? That's surprising. Could you provide the full strace command and its output?

There seems to be a discrepancy in our driver packaging between x86_64 and ppc64le.

On an x86_64 machine:

$ apt-cache depends cuda-drivers
cuda-drivers
  Depends: nvidia-375
  Depends: nvidia-375-dev
  Depends: libcuda1-375
  Depends: nvidia-modprobe
  Depends: nvidia-settings
  Depends: nvidia-opencl-icd-375
  Depends: <libopencl1>
    ocl-icd-libopencl1
    nvidia-libopencl1-304
    nvidia-libopencl1-340
    nvidia-libopencl1-375

$ apt-cache rdepends nvidia-modprobe
nvidia-modprobe
Reverse Depends:
  cuda-drivers
  cuda-drivers
  cuda-drivers

On ppc64le:

$ apt-cache depends cuda-drivers
cuda-drivers
  Depends: nvidia-375
  Depends: nvidia-375-dev
  Depends: libcuda1-375
$ apt-cache rdepends nvidia-modprobe
nvidia-modprobe
Reverse Depends:
$ # No reverse dependency!

In this case, cuda-drivers doesn't depend on nvidia-modprobe.

This means that on a ppc64le machine, the CUDA samples themselves won't work. This is what I did:

$ sudo apt-get update && sudo apt-get install -y --no-install-recommends linux-headers-generic dkms cuda-drivers cuda-samples-8-0

Then I compiled the samples and ran deviceQuery:

$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery 
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL

With strace:

[...]
geteuid()                               = 1000
stat("/usr/bin/nvidia-modprobe", 0x3ffff3c24870) = -1 ENOENT (No such file or directory)
open("/proc/devices", O_RDONLY)         = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3, "Character devices:\n  1 mem\n  4 /"..., 1024) = 542
close(3)                                = 0
stat("/usr/bin/nvidia-modprobe", 0x3ffff3c24870) = -1 ENOENT (No such file or directory)
open("/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = -1 ENXIO (No such device or address)
open("/dev/nvidia-uvm", O_RDWR)         = -1 ENXIO (No such device or address)
ioctl(-6, _IOC(_IOC_NONE, 0x00, 0x01, 0x1000), 0x3ffff3c249c0) = -1 EBADF (Bad file descriptor)
ioctl(-6, _IOC(_IOC_NONE, 0x00, 0x02, 0x1000), 0) = -1 EBADF (Bad file descriptor)
close(-6)                               = -1 EBADF (Bad file descriptor)
munmap(0x3fff863c0000, 8705232)         = 0
munmap(0x3fff86350000, 444248)          = 0
futex(0x100a0510, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(1, "cudaGetDeviceCount returned 30\n", 31cudaGetDeviceCount returned 30
) = 31
write(1, "-> unknown error\n", 17-> unknown error
)      = 17
write(1, "Result = FAIL\n", 14Result = FAIL
)         = 14
exit_group(1)                           = ?
+++ exited with 1 +++

As root, it will work because root can use /sbin/modprobe to load the NVIDIA kernel modules. As a regular user that's not possible directly, which is what the setuid nvidia-modprobe helper is for.
So sudo ./deviceQuery will work.

I believe manually installing the nvidia-modprobe package should do the trick. Hopefully we will fix the packaging issue soon.
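
Concretely, the manual workaround would be something like this (assuming the package is available in your configured repos); deviceQuery should then work as a regular user:

$ sudo apt-get update
$ sudo apt-get install -y nvidia-modprobe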

The nvidia-modprobe command, which used to be provided by the NVIDIA binary driver in the 361.xx series, is no longer shipped with 375.51; it now comes from the separate nvidia-modprobe package.

After installing nvidia-docker_1.0.1-1_ppc64el.deb, the nvidia-docker service behaves abnormally; with "systemctl status nvidia-docker.service" you can see the process start up and die endlessly.

By changing the User definition in /lib/systemd/system/nvidia-docker.service from nvidia-docker to root, the service works normally. After changing /lib/systemd/system/nvidia-docker.service, you need to issue "systemctl daemon-reload" to reload the unit and then "systemctl restart nvidia-docker.service" to restart the service.

Now nvidia-docker works normally, but I don't know whether the change to /lib/systemd/system/nvidia-docker.service is valid from a development point of view.

Detailed commands:

root@ubuntu:~# which nvidia-modprobe
/usr/bin/nvidia-modprobe
root@ubuntu:~# ls -l /usr/bin/nvidia-modprobe
-rwsr-sr-x 1 root root 67872 Mar 23 05:47 /usr/bin/nvidia-modprobe
root@ubuntu:~# dpkg -S /usr/bin/nvidia-modprobe
nvidia-modprobe: /usr/bin/nvidia-modprobe
root@ubuntu:~# sed -i.ori -e 's:User=nvidia-docker:User=root:g' /lib/systemd/system/nvidia-docker.service
root@ubuntu:~# systemctl daemon-reload
root@ubuntu:~# systemctl start nvidia-docker.service
root@ubuntu:~# useradd junli -g docker -m
root@ubuntu:~# su - junli
junli@ubuntu:~$ id junli
uid=1000(junli) gid=998(docker) groups=998(docker)
junli@ubuntu:~$ nvidia-docker run --rm -ti ppc64le/centos-cuda-devel:8.0 nvidia-smi
Sat Apr 22 21:52:49 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:03.0     Off |                    0 |
| N/A   38C    P8    26W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@junlizhang As I said, just install nvidia-modprobe and it should work.

Hi @flx42. I think nvidia-modprobe definitely fixes one issue, but the one I'm seeing is still present. Here are a few more things I've noticed.
For starters, deviceQuery works for me with nvidia-modprobe; however, nvidia-docker-plugin still fails.

dllehr@dlm04:/usr/local/cuda/samples$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery 
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16276 MBytes (17066885120 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
.
.
.

One thing I've noticed is that nvidia-docker-plugin detects GPUs correctly when I run the locally compiled binary at tools/bin/nvidia-docker-plugin; however, when I use the one installed from the .deb package at /usr/bin/nvidia-docker-plugin, I receive:

nvidia-docker-plugin | 2017/04/24 14:18:27 Error: cuda: all CUDA-capable devices are busy or unavailable

The permissions of the two are:

dllehr@dlm04:~/nvidia-docker$ ls -ltr tools/bin/nvidia-docker-plugin 
-rwxr-xr-x 1 dllehr dllehr 8192256 Apr 24 13:45 tools/bin/nvidia-docker-plugin
dllehr@dlm04:~/nvidia-docker$ ls -ltr /usr/bin/nvidia-docker-plugin 
-rwxr-xr-x 1 root root 8192048 Apr 24 13:45 /usr/bin/nvidia-docker-plugin

As an aside, if it helps, I've diagnosed the failing call: it's in deviceGetByPCIBusID in bindings.go, at

r := C.cudaDeviceGetByPCIBusId(&dev, id)

@dllehr81 it seemed to work for me. Did you lose the capabilities on the binary?

$ getcap /usr/bin/nvidia-docker-plugin
/usr/bin/nvidia-docker-plugin = cap_fowner+ep

@flx42 The capabilities for /usr/bin/nvidia-docker-plugin are the same as yours. Of course, this is ppc64le as well; I'm not sure if you're testing on that too?

Your upstream master branch works just fine on my x86 boxes here; something is unique about the POWER side for some reason.

@dllehr81 mmm, I was actually working on a rebased version of the ppc64le branch. I will try to push this branch this week. Could be related to this fix: https://github.com/NVIDIA/nvidia-docker/commit/16a7d7da6467d7e772dd04e1f28d4786dccb73c9

@flx42 Thanks! For grins I'll try out that fix from 16a7d7d. I had started to merge the ppc64le branch with your master, but if you have a more official version we can wait.

@flx42 Okay, so 16a7d7d didn't fix it. I also noticed a slight issue with the debugging: running strace on nvidia-docker-plugin actually causes it to get past the device lookup issues. This is why I was seeing the Error: listen unix /run/docker/plugins/nvidia-docker.sock: bind: permission denied at the very end. It stops failing in nvidia.go. This smacks of a really odd permissions issue.

The /run path is incorrect; you are probably starting nvidia-docker incorrectly. It should be:

sudo -u nvidia-docker nvidia-docker-plugin -s /var/lib/nvidia-docker
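
Once the plugin is running and a container has been started through nvidia-docker, the driver volume it creates should be visible on the Docker side (the name tracks the installed driver version):

$ docker volume ls | grep nvidia_driver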

I'm now seeing this issue too with the deb package. It could be a driver bug (or a kernel bug?); it's actually related to CAP_FOWNER.

$ cp /usr/local/cuda/samples/1_Utilities/bandwidthTest/bandwidthTest .
$ ./bandwidthTest 
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-SXM2-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     32970.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     32772.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     501768.9

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Let's add CAP_FOWNER:

$ sudo setcap cap_fowner+pe bandwidthTest
$ ./bandwidthTest 
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-SXM2-16GB
 Quick Mode

CUDA error at bandwidthTest.cu:730 code=46(cudaErrorDevicesUnavailable) "cudaEventCreate(&start)"

Now let's try stracing it:

$ strace ./bandwidthTest
[...]
write(1, "Result = PASS\n", 14Result = PASS
)         = 14
write(1, "\n", 1
)                       = 1
write(1, "NOTE: The CUDA Samples are not m"..., 111NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
) = 111
exit_group(0)                           = ?
+++ exited with 0 +++

Note that the above works on x86_64.

@flx42 Thanks for looking into that! You are correct: I'm able to recreate this on my ppc64le box, and as you mentioned, x86_64 behaves correctly. So we have a reproducer that doesn't involve nvidia-docker. How would you prefer to proceed? I'm not sure if you have folks on your side you can reach out to; otherwise we can try to submit a bug via our channel.

I submitted a bug internally; we would need more details before involving more people from IBM.

As predicted by @3XX0, strace drops capabilities, which is why the plugin starts fine under it (but it won't work when creating the volume):

$ grep Cap /proc/$(pidof nvidia-docker-plugin)/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000

On x86_64, without strace:

$ grep Cap /proc/$(pidof nvidia-docker-plugin)/status
CapInh: 0000000000000000
CapPrm: 0000000000000008
CapEff: 0000000000000008
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
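
For anyone decoding those masks: bit 3 is CAP_FOWNER, hence the 0x8 in CapPrm/CapEff on x86_64. capsh from the libcap package can confirm this (assuming it's installed):

$ capsh --decode=0000000000000008
0x0000000000000008=cap_fowner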

@dllehr81 @clnperez when returning from a syscall on ppc64le, do you know where the return value goes (in which register)?

Here's the info I got from Lynn Boger on our toolchain team, @flx42:

In general, if a syscall has an integer return value, it would be in r3. Argument passing and return values for ppc64le are described in the ppc64 v2 ABI, found here: https://openpowerfoundation.org/technical/technical-resources/technical-specifications

Fixed with the latest drivers.

Thank you kindly for all of the work you and your team put in on this!

The Linux kernel fix has been backported for xenial in version 4.4.0-83
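
To check whether a given xenial machine already has the fix (4.4.0-83 being the version quoted above):

$ uname -r
$ apt-cache policy linux-image-generic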
