While creating the volume I get the error below; it seems to be caused by a hard link crossing two different partitions. Is this a known issue?
sudo nvidia-docker volume setup
nvidia-docker-plugin | 2016/07/08 02:12:39 Received remove request for volume 'nvidia_driver_367.27'

nvidia-docker run --rm nvidia/cuda nvidia-smi
nvidia-docker-plugin | 2016/07/08 02:12:52 Received create request for volume 'nvidia_driver_367.27'
nvidia-docker-plugin | 2016/07/08 02:12:52 Error: link /usr/bin/nvidia-cuda-mps-control /var/lib/nvidia-docker/volumes/nvidia_driver/367.27/bin/nvidia-cuda-mps-control: invalid cross-device link
First of all, you shouldn't use volume setup; we removed this command in our latest version. You should use nvidia-docker-plugin instead (it is started automatically if you install nvidia-docker from the deb or the rpm).
And yes, it's a known limitation:
https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin#known-limitations
You can use the -d option of nvidia-docker-plugin to change the path for the volume.
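To confirm that the two paths in the error really are on different filesystems (which is why link(2) fails with EXDEV, reported as "invalid cross-device link"), you can compare their device IDs. A minimal sketch assuming GNU coreutils stat, using the parent directories of the paths from the log above:

```shell
# Print each path's filesystem device ID; if the numbers differ, a hard
# link between the two locations will fail with EXDEV.
# Substitute the actual paths from your own error message.
stat -c '%d  %n' /usr/bin /var/lib
```

If both lines show the same device ID, the cross-device limitation is not your problem and the error has another cause.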
Thanks, I missed the limitations bit. I opted to use /usr/local/nvidia-docker as the default volume location.
I am experiencing this problem as well because I have /var on a separate partition from /usr, where the NVIDIA drivers are located. I would like to switch the default volume location to a folder under /usr as the workaround suggests. However, I cannot, for the life of me, figure out how to accomplish this using nvidia-docker-plugin -d.
I am running:
sudo nvidia-docker-plugin -d /usr/local/nvidia-driver
and the change appears to be taking place,
nvidia-docker-plugin | 2016/07/20 20:00:49 Loading NVIDIA unified memory
nvidia-docker-plugin | 2016/07/20 20:00:49 Loading NVIDIA management library
nvidia-docker-plugin | 2016/07/20 20:00:49 Discovering GPU devices
nvidia-docker-plugin | 2016/07/20 20:00:50 Provisioning volumes at /usr/local/nvidia-driver
nvidia-docker-plugin | 2016/07/20 20:00:50 Serving plugin API at /run/docker/plugins
nvidia-docker-plugin | 2016/07/20 20:00:50 Serving remote API at localhost:3476
nvidia-docker-plugin | 2016/07/20 20:00:50 Error: listen tcp 127.0.0.1:3476: bind: address already in use
but then I run this:
sudo nvidia-docker run --rm nvidia/cuda nvidia-smi
and I still get this error
docker: Error response from daemon: create nvidia_driver_367.35: VolumeDriver.Create: internal error, check logs for details.
See 'docker run --help'.
@flx42, would you be able to point me in the right direction? or, @guilhermehartmann, how were you able to use /usr/local/nvidia-driver as your default volume?
Apologies as I am new to Docker and it seems I have jumped into the deep end of the pool :-).
Thanks!
@dpatschke Look at your log: the nvidia-docker-plugin -d [...] invocation failed:
nvidia-docker-plugin | 2016/07/20 20:00:50 Error: listen tcp 127.0.0.1:3476: bind: address already in use
This is because the nvidia-docker service is still running, so you're still talking to the other instance of the plugin, the one launched without -d. You should modify your service configuration file directly instead. Which OS are you on?
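A quick way to check what is holding the plugin's port before launching a second copy: iproute2's ss -ltn will show the listener, or, with no extra tools, you can look for the port in /proc/net/tcp directly. A minimal sketch of the latter (3476 decimal is 0D94 hex):

```shell
# /proc/net/tcp lists every bound TCP socket with addresses in hex, so a
# matching local-address entry for :0D94 means port 3476 is already taken
# (here, by the plugin instance the nvidia-docker service started).
if grep -q ':0D94 ' /proc/net/tcp; then
    echo "port 3476 is already in use"
else
    echo "port 3476 is free"
fi
```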
Thank you for your response, @flx42.
I am running Ubuntu 16.04. I would love to modify some configuration file and restart Docker, nvidia-docker-plugin, or whatever, but I have been scouring the web and message boards for hours and can't seem to find what I am looking for.
Would you be able to point me to the correct config file to modify? Also, I have no idea how nvidia-docker-plugin is running in the first place. Is the plugin launched when the docker service is started? How do I stop the current plugin and restart it with the -d option?
Thank you very much for your help!!
David
Something like this:
# systemctl edit nvidia-docker
[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-docker-plugin -s $SOCK_DIR -d /usr/local/nvidia-driver
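For reference, systemctl edit writes this drop-in to /etc/systemd/system/nvidia-docker.service.d/override.conf. The first, empty ExecStart= is deliberate: systemd requires clearing the inherited command before a drop-in can redefine it. This sketch assumes $SOCK_DIR is provided by the unit's environment, as in the snippet above:

```ini
# /etc/systemd/system/nvidia-docker.service.d/override.conf
[Service]
# An empty ExecStart= clears the command inherited from the packaged unit.
ExecStart=
# Redefine it with -d pointing at a directory on the same partition as the drivers.
ExecStart=/usr/bin/nvidia-docker-plugin -s $SOCK_DIR -d /usr/local/nvidia-driver
```

If you create or edit this file by hand instead of going through systemctl edit, run systemctl daemon-reload before restarting the service; otherwise systemd keeps using the old command.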
Thank you @flx42 ... unfortunately, I could not get the problem resolved.
I executed the 'edit' command as you suggested and created the file with what you had listed. The 'nano' editor wanted to save it as 'override.conf' with a bunch of additional characters at the end.
I ended up saving the file as /etc/systemd/system/nvidia-docker.service.d/override.conf.
I then restarted the systemd service:
sudo systemctl restart nvidia-docker
I am still getting the old folder when I issue the command:
sudo nvidia-docker-plugin
Here is the output:
nvidia-docker-plugin | 2016/07/20 23:02:53 Loading NVIDIA unified memory
nvidia-docker-plugin | 2016/07/20 23:02:53 Loading NVIDIA management library
nvidia-docker-plugin | 2016/07/20 23:02:53 Discovering GPU devices
nvidia-docker-plugin | 2016/07/20 23:02:53 Provisioning volumes at /var/lib/nvidia-docker/volumes
nvidia-docker-plugin | 2016/07/20 23:02:53 Serving plugin API at /run/docker/plugins
nvidia-docker-plugin | 2016/07/20 23:02:53 Serving remote API at localhost:3476
nvidia-docker-plugin | 2016/07/20 23:02:53 Error: listen tcp 127.0.0.1:3476: bind: address already in use
When I issue the command:
sudo systemctl edit nvidia-docker
I am seeing the new file I created.
Now, when I issue the following command, though:
sudo nvidia-docker run --rm nvidia/cuda nvidia-smi
I get the following error:
docker: Error response from daemon: create nvidia_driver_367.35: create nvidia_driver_367.35: Error looking up volume plugin nvidia-docker: plugin not found.
Don't try to start nvidia-docker-plugin manually, it's handled by systemd.
Try to restart the docker service too.
Yeah ... I did a sudo service docker restart and am still getting the same result: 'plugin not found'. I restarted the entire system ... same result.
When I do a sudo nvidia-docker volume ls it is completely empty. I seem to remember reading somewhere that there should be something present.
I am also still getting the 'address already in use' error.
I don't know where things went wrong but any other suggestions/recommendations would be greatly appreciated.
David
@dpatschke: give me the output of
journalctl -n -u nvidia-docker
Looks like I didn't have the nvidia-docker service started last time. I started it up again, but it was still erroring out.
Here is the output from your recommended command:
Jul 21 00:06:04 Precision-Tower-7910 systemd[1]: Starting NVIDIA Docker plugin...
Jul 21 00:06:04 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:04 Loading NVIDIA unified memory
Jul 21 00:06:04 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:04 Loading NVIDIA management library
Jul 21 00:06:04 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:04 Discovering GPU devices
Jul 21 00:06:04 Precision-Tower-7910 systemd[1]: Started NVIDIA Docker plugin.
Jul 21 00:06:05 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:05 Provisioning volumes at /usr/local/nvidia-driver
Jul 21 00:06:05 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:05 Serving plugin API at /var/lib/nvidia-docker
Jul 21 00:06:05 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:05 Serving remote API at localhost:3476
This looks good, doesn't it? Still getting this error, though, when actually trying to launch nvidia-docker:
docker: Error response from daemon: create nvidia_driver_367.35: VolumeDriver.Create: internal error, check logs for details.
@dpatschke yes it looks good.
At this point, I would advise you to simply purge nvidia-docker the hard way:
apt-get purge nvidia-docker
rm -rf /var/lib/nvidia-docker
Then restart docker, reinstall nvidia-docker from the deb, edit the systemd configuration file again, and reboot.
If you still have the problem after that, please file a new bug with the new output of journalctl -n -u nvidia-docker.
OK, will do it again ... thank you so much for your help and guidance!
Not sure if anyone will find this useful, but there was one last step I had to do to get this working:
Ensure that the directory specified by the -d in the systemd config file exists and is owned by nvidia-docker:
mkdir /usr/local/nvidia-driver
chown -hR nvidia-docker /usr/local/nvidia-driver
chgrp nvidia-docker /usr/local/nvidia-driver
This solution helped me a lot.
I am using CentOS 7.2 with four K40c cards.