Standard K8s GPU pods don't get scheduled due to a lack of nvidia.com/gpu
See here:
Hi @tmbdev, indeed your node does not have the "nvidia.com/gpu" capacity. I am testing it now and it should look like this:
"capacity": {
    "cpu": "4",
    "ephemeral-storage": "235852560Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "16312140Ki",
    "nvidia.com/gpu": "1",
    "pods": "110"
},
The "nvidia.com/gpu": "1" is added by the nvidia-device-plugin-daemonset in the kube-system namespace. Could you share the output of microk8s.kubectl logs -n kube-system pod/nvidia-device-plugin-daemonset-4rcjj and microk8s.kubectl describe -n kube-system pod/nvidia-device-plugin-daemonset-4rcjj. Also what happens if you microk8s.disable gpu and then microk8s.enable gpu again? Does the problem persist?
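For anyone who wants to script this check, here is a minimal sketch. It assumes you have saved the output of microk8s.kubectl get node <name> -o json for a single node; the helper name gpu_capacity is made up for illustration and is not part of microk8s.

```python
import json


def gpu_capacity(node_json: str) -> int:
    """Return the nvidia.com/gpu capacity of one node object, or 0 if it is absent."""
    node = json.loads(node_json)
    # A Node object keeps its capacity map under status.capacity.
    capacity = node.get("status", {}).get("capacity", {})
    # Capacities are reported as strings, e.g. "1".
    return int(capacity.get("nvidia.com/gpu", "0"))


# Hypothetical, trimmed-down node object for demonstration only.
sample_node = '{"status": {"capacity": {"cpu": "4", "nvidia.com/gpu": "1", "pods": "110"}}}'
print(gpu_capacity(sample_node))
```

A result of 0 corresponds to the situation in this issue: the device plugin never registered the GPU with the kubelet.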
Apparently, I need to make "Nvidia" the default runtime in /etc/docker (did I miss that in the documentation?). I have done that and restarted both docker and microk8s.
But that still doesn't fix the problem. The problem now is:
2019/06/11 18:28:00 Loading NVML
2019/06/11 18:28:00 Failed to initialize NVML: could not load NVML library.
2019/06/11 18:28:00 If this is a GPU node, did you set the docker default runtime to nvidia?
2019/06/11 18:28:00 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/06/11 18:28:00 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
The library seems to be installed:
$ sudo ldconfig -p | grep nvidia-ml
libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
libnvidia-ml.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
$
@tmbdev switching MicroK8s to the nvidia runtime is done automatically as long as you have run microk8s.enable gpu https://github.com/ubuntu/microk8s/blob/master/microk8s-resources/wrappers/run-containerd-with-args#L23
Where do you see these logs? Do you have the nvidia drivers installed? This is how we try to check that https://github.com/ubuntu/microk8s/blob/master/microk8s-resources/actions/enable.gpu.sh#L8
@ktsakalozos I have the same problem as @tmbdev: the node does not have the 'gpu' label, and I see exactly the same errors in the logs of the nvidia-device-plugin-daemonset-xxxx pod.
I went through the steps to set "nvidia" as the default runtime in Docker: https://www.pugetsystems.com/labs/hpc/How-To-Install-Docker-and-NVIDIA-Docker-on-Ubuntu-19-04-1460/
Now docker info shows "nvidia" as the default runtime, and I can successfully run sudo docker run --rm nvidia/cuda nvidia-smi without specifying the runtime.
Then I realized that microk8s uses Containerd instead of Docker and I assume has different config from Docker. I then tried to run sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi on GPU node and it worked.
Could anyone point me to the further steps for debugging this issue?
@ktaletsk could you share the tarball produced by microk8s.inspect?
Did you run microk8s.enable gpu? This will update the containerd configuration used in microk8s. Most of the work described in the blog post you are following is taken care of by the microk8s.enable gpu command; the only prerequisite is that you have the nvidia drivers already installed.
Okay, this is the microk8s.inspect log. I reset the cluster and only enabled GPU (microk8s.enable gpu), nothing else.
See the logs here: https://github.com/ktaletsk/microk8s-inspection-report.
@ktaletsk the nvidia drivers are not detected, see https://github.com/ktaletsk/microk8s-inspection-report/blob/master/k8s/cluster-info-dump#L1653 . How did you install the nvidia drivers?
My Nvidia drivers came preinstalled with Pop!_OS. Everything is working fine with the Nvidia drivers except for microk8s. I am even able to run the same container in Docker without problems:
$ docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.11
2019/10/26 01:04:27 Loading NVML
2019/10/26 01:04:27 Fetching devices.
2019/10/26 01:04:27 Starting FS watcher.
2019/10/26 01:04:27 Starting OS watcher.
2019/10/26 01:04:27 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2019/10/26 01:04:27 Registered device plugin with Kubelet
When the same container runs in microk8s (after I enable the GPU with microk8s.enable gpu), the logs are:
2019/10/24 14:21:16 Loading NVML
2019/10/24 14:21:16 Failed to initialize NVML: could not load NVML library.
2019/10/24 14:21:16 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2019/10/24 14:21:16 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/10/24 14:21:16 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I still don't understand what's wrong
@ktaletsk I had exactly the same error as you. In the end I'm using docker instead of containerd. For now it seems to be working and the nvidia daemon is loading correctly.
vim /var/snap/microk8s/current/args/kubelet
And added --container-runtime=docker. I'll try with KubeFlow and see if it still works.
_edit:_ kubeflow is working properly, I can use my GPU as expected now

_(screenshot taken from jupyter notebook in kubeflow/jupyter hub)_
I had the same issue, and it emerged while I had a GPU-enabled container up and running. The original setup was simply to install the snap and enable the GPU module, and everything from there worked fine.
The container had been running fine for months, and then... crashed. On restart, I got a 1 Insufficient nvidia.com/gpu. error on the pod and found nvidia.com/gpu: 0 in the node description. Same logs as ktaletsk, and dhassualt's add-to-kubelet-args fix hasn't worked. Rebooting, restarting the service, and disabling/enabling the GPU module also haven't worked.
As far as I know no updates had been installed recently. nvidia drivers 418.56, CUDA 10.1, kubelet v1.16.3, and several RTX 2080 TIs under the hood. I am stumped.
@andyljones could you attach the tarball from microk8s.inspect?
Actually, checking snap changes, there was an auto-refresh of microk8s to 1.16.3 at about the same time as the crash. Unfortunately the logs don't seem to show what the old version was, but
sudo snap remove microk8s
sudo snap install microk8s --channel 1.15/stable --classic
fixes it!
Christ, I didn't even realise snap would kill my microk8s instance to upgrade it. That explains a whole separate issue I was suffering.
e: @ktsakalozos Rolling back to the broken version now, give me a few mins
@andyljones if you want to have a long-lasting MicroK8s deployment you would be safer specifying a channel during installation. If you do not specify a channel, updates will track the latest stable k8s releases. We try our best to transition users to the new version, but if the underlying components change too much this might not be possible.
Thanks, savvy to that now! Here are inspect tarballs from my broken 1.16.3 install and my working 1.15 install:
@andyljones the 1.16 channel should also be safe to use.
The upgrade to containerd 1.3.0 caused this issue. I am reverting this change (moving back to containerd 1.2.5). The fix should be on the latest/edge channel in some hours.
I am really sorry for the trouble we may have caused, apologies.
Not to get into a back-and-forth on the expectations around free software, but I didn't pay for this and yet I'm getting an awful lot of value out of it. I can't complain if it causes me a bit of hassle once in a blue moon. Thanks for all your work, and sorry for not saying this sooner!
I have this "same" issue. I've downgraded to 1.15 to see if that fixes it. It's odd because GPUs are actually working in my containers, but the Nvidia plugin DaemonSet isn't seeing the GPUs. I'm perplexed.
I'm not sure what part fixed it, but I was able to get the nvidia gpu showing up and working with:
docker 19.03.6, nvidia-docker installed from https://github.com/NVIDIA/nvidia-docker#ubuntu-16041804-debian-jessiestretchbuster, and nvidia docker made the default runtime by adding:
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "insecure-registries" : ["localhost:32000"]
}
to /etc/docker/daemon.json
(note: "insecure-registries" : ["localhost:32000"] is only needed if you actually want that, i wanted to use this on my machine so i added it there for another reason, not this problem.)
Then on each node i made sure GPU was free and not used by anything, made the node that didn't work leave the cluster, ran microk8s.enable gpu there first, and then rejoined the cluster, and everything seemed to have worked.
I'm in the same place, just thought I'd contribute the following: trying to mirror commands on my system containerd,
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi
returns the good stuff you would expect, while
microk8s ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi
returns
ctr: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown
I did notice that they are different versions, sudo ctr is 1.2.6, microk8s is 1.2.5
I'm going to try the old 'purge and reinstall everything' technique, will report back.
I just had to update my nvidia drivers! I was using 420 or something, updated to 440 and it works.
I think this means this part of the documentation is incorrect? Or doesn't apply to microk8s?
The list of prerequisites for running the NVIDIA device plugin is described below:
NVIDIA drivers ~= 384.81
My drivers were certainly above that. A note on this subject on the microk8s gpu page would be great.
<3 microk8s btw, much better than minikube!
It totally stopped working for me today. I've tried rebuilding everything, but for some unknown reason the GPUs are never exposed to any containers, including the nvidia-plugin. It is very frustrating.
When I run
microk8s ctr run --rm --gpus 0 docker.io/nvidia/cuda:10.2-base nvidia-smi
I get:
ctr: failed to create temp dir: stat /run/user/0/snap.microk8s: no such file or directory
I can confirm that does not exist; no idea why it's looking for that.
If I do the same with ctr on my system outside of microk8s it works, though there I'm running ctr 1.2.2, while microk8s is running 1.2.5.
@ktaletsk Had exactly the same error as yours. Finally I'm using docker instead of containerd. For now it seems working and the nvidia daemon is loading correctly.
vim /var/snap/microk8s/current/args/kubelet
And added --container-runtime=docker. I'll try with KubeFlow and see if it still works.
_edit:_ kubeflow is working properly, I can use my GPU as expected now
_(screenshot taken from jupyter notebook in kubeflow/jupyter hub)_
This has helped me on a workstation with a single GPU shared across Desktop and microk8s
I had a relapse of this issue today as well, though I don't think it's anything to do with the microk8s team.
I'd installed nvidia-docker and - because the version of docker in snap was too old to support it - I'd installed docker from the docker.com repo too. It seems that one of those two seized the GPUs off of containerd, and suddenly microk8s was showing nvidia.com/gpu: 0.
Anyway, the fix was:
Add --container-runtime=docker to /var/snap/microk8s/current/args/kubelet.
Make nvidia-docker the default runtime. This is @igorbrigadir's fix, adding
"runtimes": {
    "nvidia": {
        "path": "nvidia-container-runtime",
        "runtimeArgs": []
    }
}
to /etc/docker/daemon.json.
nvidia-docker install: that's sudo apt-get install nvidia-container-runtime
sudo systemctl restart docker
With all that done, microk8s.kubectl describe no once again shows nvidia.com/gpu: 2.
Thanks all!
@andyljones what do you do to restart the kubelet for the change to /var/snap/microk8s/current/args/kubelet? I'm still seeing nvidia.com/gpu: 0 after the above changes.
@GdMacmillan My kubelet appeared to pick up the changes automatically; my already-running node updated to nvidia.com/gpu: 2 without me doing anything. Was a bit of a surprise actually.
If it's any help, some things that helped localise my issues were
microk8s.kubectl describe no | grep Runtime
docker info | grep Runtime
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
As an aside, make sure to kill any docker containers you might have up in the background, since they'll grab GPUs too. That was another cause of this symptom for me in the past.
e: For what it's worth, I have restarted the kubelet in the past with sudo snap restart microk8s, just I didn't need to this time.
this is the result of those few commands:
➜ ~ docker info | grep Runtime
Runtimes: nvidia runc
Default Runtime: runc
WARNING: No swap limit support
➜ ~ docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Unable to find image 'nvidia/cuda:10.0-base' locally
10.0-base: Pulling from nvidia/cuda
7ddbc47eeb70: Pull complete
c1bbdc448b72: Pull complete
8c3b70e39044: Pull complete
45d437916d57: Pull complete
d8f1569ddae6: Pull complete
de5a2c57c41d: Pull complete
ea6f04a00543: Pull complete
Digest: sha256:e6e1001f286d084f8a3aea991afbcfe92cd389ad1f4883491d43631f152f175e
Status: Downloaded newer image for nvidia/cuda:10.0-base
Sat Apr 25 15:09:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:2F:00.0  On |                  N/A |
| 27%   43C    P0    40W / 180W |    496MiB /  8116MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
➜ ~ microk8s.kubectl describe no | grep Runtime
Container Runtime Version: docker://19.3.8
Runtimes: nvidia runc
Default Runtime: runc
ie, it's not using the nvidia runtime. Up to you to figure out what's wrong, but my spidey-sense is that either the syntax of daemon.json is wrong, or it's picking up daemon.json from a different location to normal, or you need to restart docker.
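If you want to check this without eyeballing the grep output, here is a small sketch. It assumes you pass it the text printed by docker info; the function name default_runtime is hypothetical.

```python
import re


def default_runtime(docker_info: str):
    """Return the value of the 'Default Runtime:' line from `docker info` text, or None."""
    match = re.search(r"^\s*Default Runtime:\s*(\S+)", docker_info, flags=re.MULTILINE)
    return match.group(1) if match else None


# The two lines quoted above, used as sample input.
sample_info = "Runtimes: nvidia runc\nDefault Runtime: runc\n"
print(default_runtime(sample_info))
```

Anything other than nvidia here means the nvidia runtime is only registered with Docker, not used by default.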
It appears microk8s.kubectl describe no now shows the correct number of GPUs (1). I think my problem was not having the nvidia container runtime installed correctly, and not setting the default runtime. My daemon.json file:
➜ ~ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
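Since a syntax slip or a missing "default-runtime" key came up several times in this thread, here is a rough sketch of a sanity check for the contents of /etc/docker/daemon.json. The function name daemon_json_problems and the wording of its messages are made up for illustration; it only checks the two keys discussed here.

```python
import json


def daemon_json_problems(text: str) -> list:
    """Return a list of human-readable problems found in daemon.json text."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["top level is not a JSON object"]
    problems = []
    # The nvidia runtime has to be registered under "runtimes" ...
    if "nvidia" not in config.get("runtimes", {}):
        problems.append('no "nvidia" entry under "runtimes"')
    # ... and also selected as the default, which is a separate key.
    if config.get("default-runtime") != "nvidia":
        problems.append('"default-runtime" is not "nvidia"')
    return problems


# Registering the runtime without making it the default is flagged.
only_registered = '{"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}}}'
print(daemon_json_problems(only_registered))
```

An empty list means both keys are present; it says nothing about whether the runtime binary itself is installed.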
If it helps my system info:
➜ ~ uname
Linux
➜ ~ uname -n
pop-os
➜ ~ uname -v
#29~1587437458~20.04~2960161-Ubuntu SMP Tue Apr 21 05:16:44 UTC
➜ ~ uname -r
5.4.0-7625-generic
➜ ~ uname -m
x86_64
It's a shame this didn't work so easily; the rest of the microk8s setup process is really beautiful.
20.04 LTS, 10.2, 440.82, 19.03.8, 1.3.3
Before installing microk8s I was able to get access to the GPU with both docker and containerd.
> sudo snap install microk8s --classic
microk8s v1.18.2 from Canonical✓ installed
> microk8s.enable gpu
Enabling NVIDIA GPU
NVIDIA kernel module detected
Enabling DNS
Applying manifest
serviceaccount/coredns created
configmap/coredns created
deployment.apps/coredns created
service/kube-dns created
clusterrole.rbac.authorization.k8s.io/coredns created
clusterrolebinding.rbac.authorization.k8s.io/coredns created
Restarting kubelet
DNS is enabled
Applying manifest
daemonset.apps/nvidia-device-plugin-daemonset created
NVIDIA is enabled
ctr: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown
@illya-havsiyevych Did you see my post above? Try changing the docker/daemon.json file as I and others have suggested. Also make sure you have installed nvidia-container-runtime correctly. This was one of my problems. Another thing you must do is edit /var/snap/microk8s/current/args/kubelet and set --container-runtime=docker as @andyljones suggested above.
@GdMacmillan changes in docker/daemon.json are already in place but are not used by default by microk8s v1.18.2 as it internally switched to containerd.
Pointing kubelet to use docker
microk8s itself in other places starts/stops/expects containerd. I had a similar experience trying the docker route. When I tried to switch kubelet to use docker everything stopped working.
My current understanding: we need to wait for a fix from microk8s to upgrade to containerd 1.3.x.
So I've switched to kubeadm for now on Ubuntu 20.04
@illya-havsiyevych while far from ideal, i got it to work on 1.18/stable release and docker (like others mentioned). However, you do have to actually force docker to use nvidia runtime by default and not just add nvidia runtime in list of options in daemon.json.
My setup is Ubuntu 20.04, nvidia-driver-440 from ubuntu repos, docker v19+ from official repos, with nvidia-container-toolkit and nvidia-container-runtime packages installed.
In /etc/docker/daemon.json :
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "insecure-registries" : ["localhost:32000"]
}
In /var/snap/microk8s/current/args/kubelet :
--kubeconfig=${SNAP_DATA}/credentials/kubelet.config
--cert-dir=${SNAP_DATA}/certs
--client-ca-file=${SNAP_DATA}/certs/ca.crt
--anonymous-auth=false
--network-plugin=cni
--root-dir=${SNAP_COMMON}/var/lib/kubelet
--fail-swap-on=false
--cni-conf-dir=${SNAP_DATA}/args/cni-network/
--cni-bin-dir=${SNAP}/opt/cni/bin/
--feature-gates=DevicePlugins=true
--eviction-hard="memory.available<100Mi,nodefs.available<1Gi,imagefs.available<1Gi"
--container-runtime=docker
--node-labels="microk8s.io/cluster=true"
--cluster-domain=cluster.local
--cluster-dns=10.152.183.10
Above: just the stock kubelet config with --container-runtime=docker instead of remote.
My microk8s plugins on top of the defaults are only dns, storage, and gpu, and I am running with the firewall on.
Make sure to restart the docker daemon, then run microk8s stop; microk8s start, and you should be in business.
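To confirm which runtime the kubelet is actually configured for, a sketch like the following may help. It assumes the args file has one flag per line, as in the listing above; the function name container_runtime is hypothetical.

```python
def container_runtime(kubelet_args: str):
    """Return the value of --container-runtime from kubelet args text, or None."""
    value = None
    for line in kubelet_args.splitlines():
        line = line.strip()
        # Match the exact flag only, not e.g. --container-runtime-endpoint.
        if line.startswith("--container-runtime="):
            value = line.split("=", 1)[1]
    return value


# Sample input taken from the listing above (trimmed).
sample_args = "--fail-swap-on=false\n--container-runtime=docker\n--cluster-domain=cluster.local\n"
print(container_runtime(sample_args))
```

In practice you would read the text from /var/snap/microk8s/current/args/kubelet; a value of remote corresponds to the stock containerd setup.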
Aside: I also run a dockerized jupyterhub/jupyterlab setup and had to do similar shenanigans, because something about docker v19+ doesn't jibe with a lot of existing tooling when it comes to the gpu runtime api. You have to either rewrite parts of that tooling to use the new api, or make it compatible by installing the nvidia-container-runtime package beside nvidia-container-toolkit and then setting up /etc/docker/daemon.json. In stubborn cases you also have to make the nvidia runtime the default one by editing the said daemon.json file.
FYI (for Windows folks): I guess this workaround will not work with Ubuntu on WSL2 (latest insider version), as nvidia is providing the cuda toolkit/docker driver/wrapper and no proper/expected kernel module is installed (?)
Problematic line:
if lsmod | grep "nvidia" &> /dev/null ; then
echo "NVIDIA kernel module detected"
Meanwhile,
docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
works fine and returns
Compute 6.1 CUDA device: [GeForce GTX 1060]
Sounds like the lsmod check is not exactly accurate?
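To see what that check actually keys on, here is a sketch that lists the matching modules instead of just testing for a substring. It assumes you feed it the text printed by lsmod; the function name nvidia_modules is made up, and this is not how microk8s does it.

```python
def nvidia_modules(lsmod_output: str) -> list:
    """List loaded kernel modules whose name starts with 'nvidia'."""
    modules = []
    for line in lsmod_output.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header.
        if not fields or fields[0] == "Module":
            continue
        # Look at the module name column only; grep would match anywhere in the line.
        if fields[0].startswith("nvidia"):
            modules.append(fields[0])
    return modules


# Hypothetical lsmod excerpt for demonstration only.
sample_lsmod = (
    "Module                  Size  Used by\n"
    "nvidia_uvm            970752  0\n"
    "nvidia              20439040  1 nvidia_uvm\n"
    "snd_hda_intel          53248  3\n"
)
print(nvidia_modules(sample_lsmod))
```

An empty list on a machine where docker run --gpus all still works would match the situation described above, where the GPU is usable but the module-based check does not see it.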
I'm still having the exact same issue. Is somebody still actively working on this?
Switching to the docker runtime doesn't do the trick for me either. Well, the GPUs are recognized then but everything else is pretty much breaking down. I guess this issue is preventing a lot of people from using microk8s.