While creating the volume I get the error below; it seems to be caused by a hard link crossing two different partitions. Is this a known issue?
sudo nvidia-docker volume setup
nvidia-docker-plugin | 2016/07/08 02:12:39 Received remove request for volume 'nvidia_driver_367.27'

nvidia-docker run --rm nvidia/cuda nvidia-smi
nvidia-docker-plugin | 2016/07/08 02:12:52 Received create request for volume 'nvidia_driver_367.27'
nvidia-docker-plugin | 2016/07/08 02:12:52 Error: link /usr/bin/nvidia-cuda-mps-control /var/lib/nvidia-docker/volumes/nvidia_driver/367.27/bin/nvidia-cuda-mps-control: invalid cross-device link
First of all, you shouldn't use volume setup; we removed this command in our latest version. You should use nvidia-docker-plugin instead (it is started automatically if you install nvidia-docker from the deb or the rpm).
And yes, it's a known limitation:
https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin#known-limitations
You can use the -d option of nvidia-docker-plugin to change the path for the volume.
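To confirm that the two paths in the error really are on different filesystems (which is why link(2) fails with EXDEV, reported as "invalid cross-device link"), you can compare their device IDs. A minimal sketch assuming GNU coreutils stat, using the parent directories of the paths from the log above:

```shell
# Print each path's filesystem device ID; if the numbers differ, a hard
# link between the two locations will fail with EXDEV.
# Substitute the actual paths from your own error message.
stat -c '%d  %n' /usr/bin /var/lib
```

If both lines show the same device ID, the cross-device limitation is not your problem and the error has another cause.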
Thanks, I missed the limitations bit. I opted to use /usr/local/nvidia-docker as the default volume location.
I am experiencing this problem as well because I have /var on a separate partition from /usr, where the NVIDIA drivers are located. I would like to switch the default volume location to a folder under /usr as the workaround suggests. However, I cannot, for the life of me, figure out how to accomplish this using nvidia-docker-plugin -d.
I am running:
sudo nvidia-docker-plugin -d /usr/local/nvidia-driver
and the change appears to be taking place,
nvidia-docker-plugin | 2016/07/20 20:00:49 Loading NVIDIA unified memory
nvidia-docker-plugin | 2016/07/20 20:00:49 Loading NVIDIA management library
nvidia-docker-plugin | 2016/07/20 20:00:49 Discovering GPU devices
nvidia-docker-plugin | 2016/07/20 20:00:50 Provisioning volumes at /usr/local/nvidia-driver
nvidia-docker-plugin | 2016/07/20 20:00:50 Serving plugin API at /run/docker/plugins
nvidia-docker-plugin | 2016/07/20 20:00:50 Serving remote API at localhost:3476
nvidia-docker-plugin | 2016/07/20 20:00:50 Error: listen tcp 127.0.0.1:3476: bind: address already in use
but then I run this:
sudo nvidia-docker run --rm nvidia/cuda nvidia-smi
and I still get this error
docker: Error response from daemon: create nvidia_driver_367.35: VolumeDriver.Create: internal error, check logs for details.
See 'docker run --help'.
@flx42, would you be able to point me in the right direction? or, @guilhermehartmann, how were you able to use /usr/local/nvidia-driver as your default volume?
Apologies as I am new to Docker and it seems I have jumped into the deep end of the pool :-).
Thanks!
@dpatschke Look at your log: the nvidia-docker-plugin -d [...] invocation failed:
nvidia-docker-plugin | 2016/07/20 20:00:50 Error: listen tcp 127.0.0.1:3476: bind: address already in use
This is because the nvidia-docker service is still running, so you're still talking to the other instance of the plugin, the one launched without -d. You should modify your service configuration file directly instead. Which OS are you on?
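A quick way to check what is holding the plugin's port before launching a second copy: iproute2's ss -ltn will show the listener, or, with no extra tools, you can look for the port in /proc/net/tcp directly. A minimal sketch of the latter (3476 decimal is 0D94 hex):

```shell
# /proc/net/tcp lists every bound TCP socket with addresses in hex, so a
# matching local-address entry for :0D94 means port 3476 is already taken
# (here, by the plugin instance the nvidia-docker service started).
if grep -q ':0D94 ' /proc/net/tcp; then
    echo "port 3476 is already in use"
else
    echo "port 3476 is free"
fi
```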
Thank you for your response, @flx42.
I am running Ubuntu 16.04. I would love to modify some configuration file and restart Docker, nvidia-docker-plugin, or whatever, but I have been scouring the web and message boards for hours and can't seem to find what I am looking for.
Would you be able to point me to the correct config file to modify? Also, I have no idea how nvidia-docker-plugin is running in the first place. Is the plugin launched when the docker service is started? How do I stop the current plugin and restart it with the -d option?
Thank you very much for your help!!
David
Something like this:
# systemctl edit nvidia-docker
[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-docker-plugin -s $SOCK_DIR -d /usr/local/nvidia-driver
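For reference, systemctl edit writes this drop-in to /etc/systemd/system/nvidia-docker.service.d/override.conf. The first, empty ExecStart= is deliberate: systemd requires clearing the inherited command before a drop-in can redefine it. This sketch assumes $SOCK_DIR is provided by the unit's environment, as in the snippet above:

```ini
# /etc/systemd/system/nvidia-docker.service.d/override.conf
[Service]
# An empty ExecStart= clears the command inherited from the packaged unit.
ExecStart=
# Redefine it with -d pointing at a directory on the same partition as the drivers.
ExecStart=/usr/bin/nvidia-docker-plugin -s $SOCK_DIR -d /usr/local/nvidia-driver
```

If you create or edit this file by hand instead of going through systemctl edit, run systemctl daemon-reload before restarting the service; otherwise systemd keeps using the old command.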
Thank you @flx42 ... unfortunately, I could not get the problem resolved.
I executed the 'edit' command as you suggested and created the file with what you had listed. The 'nano' editor wanted to save it as 'override.conf' with a bunch of additional characters at the end.
I ended up saving the file as /etc/systemd/system/nvidia-docker.service.d/override.conf.
I then restarted the systemd service:
sudo systemctl restart nvidia-docker
I am still getting the old folder when I issue the command:
sudo nvidia-docker-plugin
Here is the output:
nvidia-docker-plugin | 2016/07/20 23:02:53 Loading NVIDIA unified memory
nvidia-docker-plugin | 2016/07/20 23:02:53 Loading NVIDIA management library
nvidia-docker-plugin | 2016/07/20 23:02:53 Discovering GPU devices
nvidia-docker-plugin | 2016/07/20 23:02:53 Provisioning volumes at /var/lib/nvidia-docker/volumes
nvidia-docker-plugin | 2016/07/20 23:02:53 Serving plugin API at /run/docker/plugins
nvidia-docker-plugin | 2016/07/20 23:02:53 Serving remote API at localhost:3476
nvidia-docker-plugin | 2016/07/20 23:02:53 Error: listen tcp 127.0.0.1:3476: bind: address already in use
When I issue the command:
sudo systemctl edit nvidia-docker
I am seeing the new file I created.
Now, when I issue the following command, though:
sudo nvidia-docker run --rm nvidia/cuda nvidia-smi
I get the following error:
docker: Error response from daemon: create nvidia_driver_367.35: create nvidia_driver_367.35: Error looking up volume plugin nvidia-docker: plugin not found.
Don't try to start nvidia-docker-plugin manually, it's handled by systemd.
Try to restart the docker service too.
Yeah ... I did a sudo service docker restart and am still getting the same result: 'plugin not found'. I restarted the entire system ... same result.
When I do a sudo nvidia-docker volume ls it is completely empty. I seem to remember reading somewhere that there should be something present.
I am also still getting the 'address already in use' error.
I don't know where things went wrong but any other suggestions/recommendations would be greatly appreciated.
David
@dpatschke: give me the output of
journalctl -n -u nvidia-docker
Looks like I didn't have the nvidia-docker service started last time. I started it up again, but it was still erroring out.
Here is the output from your recommended command:
Jul 21 00:06:04 Precision-Tower-7910 systemd[1]: Starting NVIDIA Docker plugin...
Jul 21 00:06:04 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:04 Loading NVIDIA unified memory
Jul 21 00:06:04 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:04 Loading NVIDIA management library
Jul 21 00:06:04 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:04 Discovering GPU devices
Jul 21 00:06:04 Precision-Tower-7910 systemd[1]: Started NVIDIA Docker plugin.
Jul 21 00:06:05 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:05 Provisioning volumes at /usr/local/nvidia-driver
Jul 21 00:06:05 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:05 Serving plugin API at /var/lib/nvidia-docker
Jul 21 00:06:05 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:05 Serving remote API at localhost:3476
This looks good, doesn't it? Still getting this error, though, when actually trying to launch nvidia-docker:
docker: Error response from daemon: create nvidia_driver_367.35: VolumeDriver.Create: internal error, check logs for details.
@dpatschke yes it looks good.
At this point, I would advise you to simply purge nvidia-docker the hard way:
apt-get purge nvidia-docker
rm -rf /var/lib/nvidia-docker
Then restart docker, reinstall nvidia-docker from the deb, edit the systemd configuration file again, and reboot.
If you still have the problem after that, please file a new bug with the new output of journalctl -n -u nvidia-docker.
OK, will do it again ... thank you so much for your help and guidance!
Not sure if anyone will find this useful, but there was one last step I had to do to get this working:
Ensure that the directory specified by the -d in the systemd config file exists and is owned by nvidia-docker:
mkdir /usr/local/nvidia-driver
chown -hR nvidia-docker /usr/local/nvidia-driver
chgrp nvidia-docker /usr/local/nvidia-driver
This solution helped me a lot.
I am using CentOS 7.2 with four K40c cards.