I'm following the instructions on how to Deploy on Amazon EC2. Right after creating the GPU instance I test with:
nvidia-docker run --rm nvidia/cuda nvidia-smi
and everything works fine.
Then I stop the instance (docker-machine stop aws01), start it again (docker-machine start aws01), and test again:
ubuntu@aws04:~$ nvidia-docker run --rm nvidia/cuda nvidia-smi
nvidia-docker | 2016/07/13 23:40:32 Error: Could not load UVM kernel module
This time it fails. Is this expected behavior?
We ran into it as well. For some reason, after the VM restarts the kernel version can change slightly and nouveau gets loaded by default (from the initramfs).
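To see which case a given instance is in, it may help to check which GPU kernel modules are actually loaded after the restart. The following is a minimal sketch, not part of the official instructions: the helper name module_loaded is hypothetical, and the module names nvidia and nvidia_uvm are assumptions about how this driver version registers itself in /proc/modules.

```shell
#!/bin/sh
# module_loaded NAME [FILE]: succeed if NAME is listed as a loaded module.
# FILE defaults to /proc/modules; the optional argument is a hypothetical
# convenience so the check can be run against a saved copy of that file.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# nouveau loaded and nvidia_uvm missing would match the failure described above.
for mod in nouveau nvidia nvidia_uvm; do
    if module_loaded "$mod"; then
        echo "$mod: loaded"
    else
        echo "$mod: not loaded"
    fi
done
```

The match is on the exact module name in the first column, so nvidia_uvm being loaded is not mistaken for nvidia.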
The best way I know of is to upgrade the machine:
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
Blacklist nouveau:
sudo tee /etc/modprobe.d/blacklist-nouveau.conf > /dev/null << EOF
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
Reboot (just in case) and reinstall the drivers with DKMS:
sudo apt-get install dkms linux-headers-generic
sudo sh /tmp/NVIDIA-Linux-x86_64-361.42.run --dkms --silent
From now on it should be fine. I will probably update the doc when I know for sure what's happening.
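As a quick sanity check that the blacklist entry really landed on disk, something like the following could be run before rebooting. This is only a sketch: the helper name nouveau_blacklisted is hypothetical, and it only inspects the .conf files under /etc/modprobe.d, not the contents of the regenerated initramfs.

```shell
#!/bin/sh
# nouveau_blacklisted [DIR]: succeed if any *.conf file in DIR contains a
# "blacklist nouveau" line. DIR defaults to /etc/modprobe.d; the optional
# argument is a hypothetical convenience for checking a copy of the directory.
nouveau_blacklisted() {
    grep -qsE '^[[:space:]]*blacklist[[:space:]]+nouveau[[:space:]]*$' "${1:-/etc/modprobe.d}"/*.conf
}

if nouveau_blacklisted; then
    echo "nouveau is blacklisted in /etc/modprobe.d"
else
    echo "no blacklist entry for nouveau found in /etc/modprobe.d"
fi
```

If the entry is missing even though the heredoc step was run, the file was probably never written, for example because the shell redirection ran without root privileges.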
Yeah, created a couple of machines and blacklisting nouveau works. :+1: on adding it to the docs.
Is it correct that this issue is closed?
It is not documented in https://github.com/NVIDIA/nvidia-docker/wiki/Deploy-on-Amazon-EC2 .
@3XX0 ping^
It doesn't look like this step is in the documentation. Did you figure out what was happening?
Thanks for your help!
The documentation is right, but depending on the AMI used you might want to restart the instance after creating it. For example, some Ubuntu AMIs have been snapshotted with a running kernel different from the one that will be used at the next reboot.
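One rough way to tell whether an instance is in that state is to compare the running kernel with the newest kernel image installed in /boot. This is a sketch under assumptions: the helper name newest_kernel is hypothetical, it assumes the bootloader picks the highest-versioned vmlinuz-* image, and it relies on the -V (version sort) option of GNU sort.

```shell
#!/bin/sh
# newest_kernel [DIR]: print the highest kernel version that has a vmlinuz-*
# image in DIR (default /boot). Prints nothing if no image is found.
newest_kernel() {
    ls "${1:-/boot}"/vmlinuz-* 2>/dev/null | sed 's|.*/vmlinuz-||' | sort -V | tail -n 1
}

running=$(uname -r)
newest=$(newest_kernel)

if [ -z "$newest" ]; then
    echo "no kernel images found in /boot"
elif [ "$running" = "$newest" ]; then
    echo "running kernel $running is the newest installed"
else
    echo "running $running, but $newest is installed: a reboot would likely switch kernels"
fi
```

If the two versions differ, a driver built against the running kernel would not match the kernel used after the next reboot, which is why restarting before installing the driver (or installing with DKMS) is suggested.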
@christinakayastha I elaborated a bit on the installation for a base AMI ami-40d28157 (Ubuntu server 16.04 LTS) here:
https://github.com/empiricalci/machines/tree/master/gpu-ec2#installing-the-nvidia-driver
ahhh gochha, thanks a ton!
I added a docker-machine restart line to the tutorial. I advise you to install the driver through our method; you will get the latest driver and any update that is released.