Nvidia-docker: Updating CPU quota causes NVML unknown error

Created on 2 Nov 2017 · 5 comments · Source: NVIDIA/nvidia-docker

I'm testing nvidia-docker 2, starting containers through Zoe Analytics, which talks to the Docker API over the network.
Zoe dynamically adjusts CPU quotas to redistribute spare capacity, but doing so makes nvidia-docker break down:

Start a container (the nvidia plugin is set as default in daemon.json):

$ docker run -d -e NVIDIA_VISIBLE_DEVICES=all -p 8888 gcr.io/tensorflow/tensorflow:1.3.0-gpu-py3

Test with nvidia-smi (it works):

$ docker exec -it 9e nvidia-smi
Thu Nov  2 08:03:25 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   26C    P0    31W / 250W |      0MiB / 16276MiB |      0%      Default |

[...]

Change the CPU quota:

$ docker update --cpu-quota 640000 9e

Test with nvidia-smi (it breaks):

$ docker exec -it 9e nvidia-smi
Failed to initialize NVML: Unknown Error
  • If I set the CPU quota when the container starts, it works.
  • I tried different values for the quota; it always breaks.
  • I could find no messages in the logs.
  • The same happens when updating the memory soft limit (--memory-reservation).
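The symptoms are consistent with the container's devices cgroup whitelist being rewritten by the update. A minimal sketch of what to look for, using hypothetical `devices.list` snapshots (on a real host, read `/sys/fs/cgroup/devices/docker/<container-id>/devices.list`; NVIDIA character devices use major number 195):

```shell
# Hypothetical devices.list snapshots; on a real host read
# /sys/fs/cgroup/devices/docker/<container-id>/devices.list instead.
cat > /tmp/devices.list.before <<'EOF'
c 1:5 rwm
c 195:0 rw
c 195:255 rw
EOF
cat > /tmp/devices.list.after <<'EOF'
c 1:5 rwm
EOF
# Working container: the NVIDIA entries (major 195) are whitelisted.
grep -c '^c 195:' /tmp/devices.list.before
# After `docker update`: the whitelist has been rewritten without them,
# so NVML calls inside the container fail with "Unknown Error".
grep -c '^c 195:' /tmp/devices.list.after || echo "NVIDIA devices no longer whitelisted"
```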
Labels: bug, upstream issue

All 5 comments

Good catch, it looks like Docker is resetting all of the cgroups when it only needs to update one (the CPU quota in this case).
Not sure how we can work around that, though.

Has there been any progress on this? It seems I ran into the same problem while trying to set up the Kubernetes cpu-manager with "static" policy.
(https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/)

@3XX0 I think it is unlikely that this will ever be addressed upstream.

From docker's perspective, they own and control all of the cgroups/devices set up for the containers they launch. If something comes along (in this case, libnvidia-container) and changes those cgroups/device settings outside of docker, then docker should be free to resolve these discrepancies in order to keep its state in sync.

The long-term solution should probably involve making libnvidia-container "docker-aware" in some way so that it can update the necessary state changes via:

https://docs.docker.com/engine/api/v1.25/#operation/ContainerUpdate

I know this goes against the current design (i.e. making libnvidia-container container runtime agnostic), but I don't see any other way around this.
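As a sketch of what such an integration would send, here is the raw HTTP request that `docker update --cpu-quota 640000 9e` translates to against the linked ContainerUpdate endpoint (normally carried over `/var/run/docker.sock`; the field name `CpuQuota` is from the v1.25 API docs, and nothing is actually sent here):

```python
import json

# Body for POST /containers/{id}/update (Docker Engine API v1.25).
# Only the fields being changed need to be present; this mirrors the
# `docker update --cpu-quota 640000 9e` call from the report.
body = json.dumps({"CpuQuota": 640000})
request = (
    "POST /v1.25/containers/9e/update HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Content-Type: application/json\r\n"
    f"Content-Length: {len(body)}\r\n"
    "\r\n"
    f"{body}"
)
print(request)
```

A docker-aware libnvidia-container could watch for such calls (or issue them itself) and re-apply its device setup afterwards.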

For example, if you run docker inspect on a functioning GPU-enabled container today, you will see that its device list is empty, even though the nvidia devices are clearly injected into it and the cgroup access to those devices is set up properly. However, once some external entity hits docker's ContainerUpdate API (whether directly via the CLI or through an API call like the one Kubernetes' CPUManager makes), docker persists this empty device list to disk, essentially "undoing" what libnvidia-container had set up with regard to these devices.

@RenaudWasTaken How does the new --gpus flag for docker handle the fact that libnvidia-container is messing with cgroups/devices outside of docker's control?

@mrjackbo if your setup is constrained such that GPUs will only ever be used by containers that have CPUsets assigned to them via the static allocation policy, then the following patch to Kubernetes will avoid having docker update its cgroups after these containers are initially launched.

diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go
index 4ccddd5..ff3fbdf 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go
@@ -242,7 +242,8 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                        // - policy does not want to track the container
                        // - kubelet has just been restarted - and there is no previous state file
                        // - container has been removed from state by RemoveContainer call (DeletionTimestamp is set)
-                       if _, ok := m.state.GetCPUSet(containerID); !ok {
+                       cset, ok := m.state.GetCPUSet(containerID)
+                       if !ok {
                                if status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
                                        klog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
                                        err := m.AddContainer(pod, &container, containerID)
@@ -258,7 +259,13 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                                }
                        }

-                       cset := m.state.GetCPUSetOrDefault(containerID)
+                       if !cset.IsEmpty() && m.policy.Name() == string(PolicyStatic) {
+                               klog.V(4).Infof("[cpumanager] reconcileState: skipping container; assigned cpuset unchanged (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
+                               success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
+                               continue
+                       }
+
+                       cset = m.state.GetDefaultCPUSet()
                        if cset.IsEmpty() {
                                // NOTE: This should not happen outside of tests.
                                klog.Infof("[cpumanager] reconcileState: skipping container; assigned cpuset is empty (pod: %s, container: %s)", pod.Name, container.Name)

Here's a workaround that might be helpful:
docker run --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 ...
(Replace/repeat nvidia0 with other/more devices as needed.)

This seems to fix the problem with both --runtime=nvidia and the newer --gpus option.
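Since the exact set of device nodes varies by machine, the flags can be generated instead of typed out. A small sketch (the device list below is hypothetical; on a real host you would use whatever actually exists, e.g. `ls /dev/nvidia*`):

```shell
# Sketch: build a --device flag for every NVIDIA device node.
# Hypothetical list; substitute the nodes present on your host.
devices="/dev/nvidiactl /dev/nvidia-uvm /dev/nvidia0"
flags=""
for d in $devices; do
  flags="$flags --device $d:$d"
done
# The assembled command line (image name is a placeholder):
echo "docker run$flags -d -e NVIDIA_VISIBLE_DEVICES=all <image>"
```

Because the devices are then part of docker's own container configuration, they survive a ContainerUpdate instead of being wiped.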
