nvidia-docker fails on EC2 after restarting instance

Created on 14 Jul 2016 · 8 Comments · Source: NVIDIA/nvidia-docker

I'm following the instructions on how to Deploy on Amazon EC2. Right after creating the GPU instance, I test with:

nvidia-docker run --rm nvidia/cuda nvidia-smi

and everything works fine.

Then I stop the instance (docker-machine stop aws01), start it again (docker-machine start aws01), and test again:

ubuntu@aws04:~$ nvidia-docker run --rm nvidia/cuda nvidia-smi
nvidia-docker | 2016/07/13 23:40:32 Error: Could not load UVM kernel module

This time it fails. Is this expected behavior?
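
For anyone hitting the same error, a quick way to narrow it down is to look at which GPU kernel modules are actually loaded after the restart. The following is only a sketch: `module_loaded` and `gpu_module_report` are hypothetical helper names, and the module names `nvidia` and `nvidia_uvm` are assumed from how the NVIDIA driver usually appears in /proc/modules.

```shell
#!/bin/sh
# Sketch: report which GPU-related kernel modules are currently loaded.
# Both helpers are hypothetical; the optional file argument exists only so
# the check can be pointed at a saved copy of /proc/modules.

module_loaded() {
    # /proc/modules lists one module per line, with the name in the first field.
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

gpu_module_report() {
    # Assumed module names: nouveau (open-source driver), nvidia, nvidia_uvm.
    for mod in nouveau nvidia nvidia_uvm; do
        if module_loaded "$mod" "${1:-/proc/modules}"; then
            echo "$mod: loaded"
        else
            echo "$mod: not loaded"
        fi
    done
}
```

Calling `gpu_module_report` with no arguments reads the live /proc/modules. If nouveau shows as loaded while nvidia_uvm does not, the NVIDIA modules were most likely unable to load after the reboot.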

documentation

alantrrs

All 8 comments

We ran into it as well. For some reason, after the VM restarts the kernel can change slightly and nouveau gets loaded by default (in the initramfs).

The best way I know of is to upgrade the machine:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade

Blacklist nouveau:

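# Write through tee: with "sudo cat << EOF > file" the redirect runs as the calling user, not root.
sudo tee /etc/modprobe.d/blacklist-nouveau.conf > /dev/null << EOF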
blacklist nouveau
options nouveau modeset=0
EOF

sudo update-initramfs -u
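
Before rebooting, it may be worth confirming that the blacklist entry really landed in a modprobe configuration file. A minimal sketch, assuming the default /etc/modprobe.d location; `nouveau_blacklisted` is a hypothetical helper name, not part of any tool:

```shell
#!/bin/sh
# Sketch: succeed if any *.conf file in the given directory blacklists nouveau.
# nouveau_blacklisted is a hypothetical helper; the directory defaults to
# /etc/modprobe.d and can be overridden for testing.

nouveau_blacklisted() {
    dir="${1:-/etc/modprobe.d}"
    # Match an uncommented "blacklist nouveau" line; -s hides errors when the
    # directory has no *.conf files, in which case the check simply fails.
    grep -Eqs '^[[:space:]]*blacklist[[:space:]]+nouveau[[:space:]]*$' "$dir"/*.conf
}
```

For example, `nouveau_blacklisted && echo "nouveau is blacklisted"` prints the message only when a matching line is found.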

Reboot (just in case) and reinstall the drivers with DKMS:

sudo apt-get install dkms linux-headers-generic
sudo sh /tmp/NVIDIA-Linux-x86_64-361.42.run --dkms --silent
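
To check afterwards that DKMS built the driver for the kernel you are running, something like the sketch below could help. It assumes the installer registers the module under the name "nvidia" and that `dkms status` prints the kernel version and the word "installed" on the same line; the exact output format differs between DKMS versions. `nvidia_dkms_installed_for` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read `dkms status` output on stdin and succeed if an nvidia module
# is reported as installed for the given kernel version.
# Assumption: lines start with "nvidia" and contain the kernel version
# together with the word "installed".

nvidia_dkms_installed_for() {
    grep -E '^nvidia' | grep -F "$1" | grep -q 'installed'
}
```

Usage would be along the lines of `dkms status | nvidia_dkms_installed_for "$(uname -r)" && echo "driver built for running kernel"`.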

From now on it should be fine. I will probably update the docs once I know for sure what's happening.

3XX0 on 14 Jul 2016

👎1

Yeah, created a couple of machines and blacklisting nouveau works. :+1: on adding it to the docs.

alantrrs on 14 Jul 2016

Is it correct that this issue is closed?

It is not documented in https://github.com/NVIDIA/nvidia-docker/wiki/Deploy-on-Amazon-EC2.

pasky on 6 Sep 2016

👍6

@3XX0 ping^
It doesn't look like this step is in the documentation. Did you figure out what was happening?

Thanks for your help!

christikaes on 23 Feb 2017

The documentation is right, but depending on the AMI used you might want to restart the instance after creating it. For example, some Ubuntu AMIs have been snapshotted with a running kernel different from the one that will be used at the next reboot.
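
One way to spot such an AMI before installing the driver is to compare the running kernel with the newest kernel image installed under /boot. This is a sketch under two assumptions: kernel images are named vmlinuz-<version> (as on Ubuntu), and the bootloader picks the newest installed kernel at the next boot. The function names are hypothetical, and `sort -V` requires GNU coreutils.

```shell
#!/bin/sh
# Sketch: detect whether the next boot is likely to use a different kernel.
# Assumptions: images are named vmlinuz-<version> and the newest installed
# version is the one booted next. Both helpers are hypothetical.

newest_installed_kernel() {
    # Print the highest version among the vmlinuz-* images in the directory.
    for img in "${1:-/boot}"/vmlinuz-*; do
        [ -e "$img" ] && echo "${img##*/vmlinuz-}"
    done | sort -V | tail -n 1
}

kernel_will_change() {
    # Succeeds when an installed kernel exists and differs from the running one.
    running="${1:-$(uname -r)}"
    newest="$(newest_installed_kernel "${2:-/boot}")"
    [ -n "$newest" ] && [ "$running" != "$newest" ]
}
```

For example, `kernel_will_change && echo "reboot before installing the driver"` prints the hint only when the running kernel is not the newest one installed.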

3XX0 on 23 Feb 2017

@christinakayastha I elaborated a bit on the installation for the base AMI ami-40d28157 (Ubuntu Server 16.04 LTS) here:
https://github.com/empiricalci/machines/tree/master/gpu-ec2#installing-the-nvidia-driver

alantrrs on 23 Feb 2017

👍1

Ahhh gotcha, thanks a ton!

christikaes on 23 Feb 2017

I added a docker-machine restart line to the tutorial. I advise you to install the driver through our method: you will get the latest driver and any update that is released.