Nvidia-docker: Updating cpu-manager-policy=static causes NVML unknown error

Created on 25 Apr 2019 · 12 Comments · Source: NVIDIA/nvidia-docker

What happened:

  1. After setting cpu-manager-policy=static in Kubernetes, running nvidia-smi in a pod with a GPU reports an error:

    Failed to initialize NVML: Unknown Error
    
  2. Setting cpu-manager-policy=none in Kubernetes does not cause this error.

  3. Sometimes nvidia-smi succeeds when the pod first starts, then fails when run again about 10 seconds later.

  4. Tracing the failure shows that opening /dev/nvidiactl fails with "Operation not permitted":

    strace -v -a 100 -s 1000 nvidia-smi
    close(3)                                                                                           = 0
    open("/dev/nvidiactl", O_RDWR)                                                                     = -1 EPERM (Operation not permitted)
    open("/dev/nvidiactl", O_RDONLY)                                                                   = -1 EPERM (Operation not permitted)
    fstat(1, {st_dev=makedev(0, 704), st_ino=4, st_mode=S_IFCHR|0620, st_nlink=1, st_uid=0, st_gid=5, st_blksize=1024, st_blocks=0, st_rdev=makedev(136, 1), st_atime=2019/04/23-17:35:28.678347231, st_mtime=2019/04/23-17:35:28.678347231, st_ctime=2019/04/23-17:33:09.682347235}) = 0
    write(1, "Failed to initialize NVML: Unknown Error\n", 41Failed to initialize NVML: Unknown Error
    )                                         = 41
    exit_group(255)                                                                                     = ?
    +++ exited with 255 +++
    
  5. I switched cpu-manager-policy between none and static and created one pod under each setting: test-gpu (policy none, nvidia-smi works) and test-gpu-err (policy static, nvidia-smi reports the error).

    1. Comparing each container's devices.list under /sys/fs/cgroup/devices shows that the two lists differ:

    2. test-gpu (nvidia-smi works)

      root@super8:/sys/fs/cgroup/devices/kubepods/besteffort/pod52c61ec9-65b5-11e9-8cd2-0cc47aea540c/caca989a1f8d1a8c87f67c04d2d63347a98f52d745c44e77895b3ca4dfd9b18f# cat devices.list 
      c 1:5 rwm
      c 1:3 rwm
      c 1:9 rwm
      c 1:8 rwm
      c 5:0 rwm
      c 5:1 rwm
      c *:* m
      b *:* m
      c 1:7 rwm
      c 136:* rwm
      c 5:2 rwm
      c 10:200 rwm
      c 195:255 rw
      c 195:3 rw
      
    3. test-gpu-err (nvidia-smi reports an error)

      root@super8:/sys/fs/cgroup/devices/kubepods/besteffort/podbfa294b1-65aa-11e9-8cd2-0cc47aea540c/771eb2c6d41fe48160000ad481702d09bdda5bfe49d613f96273412e177b449d# cat devices.list 
      c 1:5 rwm
      c 1:3 rwm
      c 1:9 rwm
      c 1:8 rwm
      c 5:0 rwm
      c 5:1 rwm
      c *:* m
      b *:* m
      c 1:7 rwm
      c 136:* rwm
      c 5:2 rwm
      c 10:200 rwm
      
  6. So, after setting cpu-manager-policy=static in Kubernetes, a pod with a GPU can run nvidia-smi for a short time. Then something that runs about once every 10 seconds rewrites the container's devices.list in the devices cgroup, removing read and write access to /dev/nvidiactl (and presumably other device files), and nvidia-smi starts failing.
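
The difference between the two lists can be extracted mechanically. The following is a minimal Python sketch, not part of the original report: the helper name `parse_devices_list` is made up for illustration, and the two inputs are the devices.list contents quoted above. Character major 195 belongs to the NVIDIA driver, and 195:255 is the device number of /dev/nvidiactl.

```python
def parse_devices_list(text):
    """Parse devices.list content into a set of (type, 'major:minor', access) tuples.

    Hypothetical helper for illustration; lines that do not have
    exactly three whitespace-separated fields are ignored.
    """
    rules = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            rules.add(tuple(parts))
    return rules


# devices.list from test-gpu (nvidia-smi works), as quoted in the report
WORKING = """\
c 1:5 rwm
c 1:3 rwm
c 1:9 rwm
c 1:8 rwm
c 5:0 rwm
c 5:1 rwm
c *:* m
b *:* m
c 1:7 rwm
c 136:* rwm
c 5:2 rwm
c 10:200 rwm
c 195:255 rw
c 195:3 rw
"""

# devices.list from test-gpu-err: the same rules without the last two lines
BROKEN = "\n".join(WORKING.splitlines()[:-2]) + "\n"

# Rules present in the working container but absent from the failing one
missing = sorted(parse_devices_list(WORKING) - parse_devices_list(BROKEN))
for dev_type, dev_id, access in missing:
    print(dev_type, dev_id, access)
# prints:
# c 195:255 rw
# c 195:3 rw
```

The only rules missing from the failing container are the two NVIDIA entries, which matches the EPERM returned when opening /dev/nvidiactl in the strace output.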


All 12 comments

Unfortunately, this is a known issue. It was first reported here:
https://github.com/NVIDIA/nvidia-docker/issues/515

The underlying issue is that libnvidia-container injects some devices and modifies some cgroups out-of-band of the container engine it is operating on behalf of when setting a container up for use with GPUs. This causes the internal state of the container engine to be out of sync with what has actually been set up for the container.

For example, if you do a docker inspect on a functioning GPU-enabled container today, you will see that its device list is empty, even though it clearly has the nvidia devices injected into it and the cgroup access to those devices is set up properly.

This has not been an issue until now because everything works fine at initial container creation time. These settings are modified by libnvidia-container only after a container has already been set up by docker and no further updates to the cgroups are necessary.

The problem comes when some external entity hits docker's ContainerUpdate API (whether directly via the CLI or through an API call, as the CPUManager in Kubernetes does). When this API is invoked, docker writes its empty device list back out to the cgroup, essentially "undoing" what libnvidia-container had set up for these devices.
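
The sequence described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyEngine` and `inject_out_of_band` are hypothetical names, not docker or libnvidia-container APIs, and the real cgroup handling is far more involved. It shows only why an update that rewrites the cgroup from the engine's own records drops rules the engine was never told about.

```python
class ToyEngine:
    """Hypothetical stand-in for a container engine (not docker's real API)."""

    def __init__(self, base_rules):
        self.base_rules = set(base_rules)  # default device rules the engine knows
        self.known_devices = set()         # extra devices recorded by the engine
        self.cgroup_allow = set()          # what the devices cgroup really allows
        self.cpuset = None

    def _write_cgroup(self):
        # The engine regenerates the cgroup from its own state only.
        self.cgroup_allow = self.base_rules | self.known_devices

    def create(self):
        self._write_cgroup()

    def update(self, cpuset):
        # Models an update request that changes only the cpuset,
        # yet still rewrites the device rules as a side effect.
        self.cpuset = cpuset
        self._write_cgroup()


def inject_out_of_band(engine, rules):
    """Models a hook that edits the cgroup without informing the engine."""
    engine.cgroup_allow |= set(rules)  # engine.known_devices stays empty


NVIDIACTL = "c 195:255 rw"

engine = ToyEngine(base_rules={"c 1:3 rwm", "c 1:5 rwm"})
engine.create()
inject_out_of_band(engine, {NVIDIACTL})
allowed_before = NVIDIACTL in engine.cgroup_allow

engine.update(cpuset="2-3")  # e.g. a periodic cpuset reconciliation
allowed_after = NVIDIACTL in engine.cgroup_allow

print(allowed_before, allowed_after)  # prints: True False
```

In this model the container works until the first update, which fits the roughly 10 second delay seen in the report.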

We need to come up with a solution that allows libnvidia-container to take control of managing these devices on behalf of docker (or any container engine) while properly informing it so that it can keep its internal state in sync.

If your setup is constrained such that GPUs will only ever be used by containers that have CPUSets assigned to them via the static allocation policy, then the following patch to Kubernetes will avoid having docker update its cgroups after these containers are initially launched. This is not a generic enough solution to work in all cases though, unfortunately.

diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go
index 4ccddd5..ff3fbdf 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go
@@ -242,7 +242,8 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                        // - policy does not want to track the container
                        // - kubelet has just been restarted - and there is no previous state file
                        // - container has been removed from state by RemoveContainer call (DeletionTimestamp is set)
-                       if _, ok := m.state.GetCPUSet(containerID); !ok {
+                       cset, ok := m.state.GetCPUSet(containerID)
+                       if !ok {
                                if status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
                                        klog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
                                        err := m.AddContainer(pod, &container, containerID)
@@ -258,7 +259,13 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                                }
                        }

-                       cset := m.state.GetCPUSetOrDefault(containerID)
+                       if !cset.IsEmpty() && m.policy.Name() == string(PolicyStatic) {
+                               klog.V(4).Infof("[cpumanager] reconcileState: skipping container; assigned cpuset unchanged (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
+                               success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
+                               continue
+                       }
+
+                       cset = m.state.GetDefaultCPUSet()
                        if cset.IsEmpty() {
                                // NOTE: This should not happen outside of tests.
                                klog.Infof("[cpumanager] reconcileState: skipping container; assigned cpuset is empty (pod: %s, container: %s)", pod.Name, container.Name)
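
The branch the patch adds can be restated as a small decision function. This is a simplified Python model, not the kubelet code: `reconcile_decision` is a hypothetical name, cpusets are modeled as frozensets of CPU ids, and the final "update" outcome assumes that the code following the hunk applies the cpuset to the container, which the diff above does not show.

```python
def reconcile_decision(policy_name, assigned_cpuset, default_cpuset):
    """Model of the patched reconcile step for one container.

    assigned_cpuset: CPUs exclusively assigned to the container
                     (empty if the container has no entry in the state).
    default_cpuset:  the shared pool used by all other containers.
    Returns a (action, cpuset) tuple.
    """
    # New in the patch: a container that already holds exclusive CPUs under
    # the static policy is reported as reconciled without being touched.
    if assigned_cpuset and policy_name == "static":
        return ("skip", assigned_cpuset)

    # Otherwise fall back to the default (shared) cpuset.
    if not default_cpuset:
        return ("skip-empty", default_cpuset)

    # Assumption: the code after the hunk pushes this cpuset to the
    # container runtime, which is the update that breaks GPU access.
    return ("update", default_cpuset)


exclusive = frozenset({2, 3})
shared = frozenset({0, 1})

print(reconcile_decision("static", exclusive, shared)[0])    # prints: skip
print(reconcile_decision("static", frozenset(), shared)[0])  # prints: update
print(reconcile_decision("none", frozenset(), shared)[0])    # prints: update
```

Under this model only containers with exclusively assigned CPUs avoid the update, which is why the patch is described as not generic enough for all cases.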

@klueska Thanks so much. It works when we apply your patch, set the CPU manager policy to static, and run the container with the "Guaranteed" QoS class.

However, is there any way to get GPUs working for containers that are not "Guaranteed" while the CPU manager policy is set to static? Is this something you intend to develop, or is there a way for me to work around it?

@klueska As we know, LXD 3.0.0 already supports NVIDIA runtime passthrough.
https://discuss.linuxcontainers.org/t/lxd-3-0-0-has-been-released/1491
If Kubernetes could create LXD 3.0.0 containers instead of docker containers, could we bypass the current problem, since LXD 3.0.0 is not 'docker' and may not require libnvidia-container?

Any change made to kubernetes is always going to be a workaround. The real fix needs to come in libnvidia-container or docker or some combination of both.

I've never tried LXD with Kubernetes, so I'm not in a position to say how well it would work or not. I do know that LXD still uses libnvidia-container under the hood though, so it may exhibit the same problems.

Again, the underlying problem is that docker is not told about the devices that libnvidia-container injects into it, so if you come up with a workaround that updates docker's internal state with this information, that should be sufficient.

@klueska fixed this in the latest NVIDIA device-plugin (beta5) release, though the fix is a workaround.

Note, to use this workaround you will need to use the new daemonset spec nvidia-device-plugin-compat-with-cpumanager.yml instead of the default one.

This spec does two things differently from the default one:
1) It passes a new argument, --pass-device-specs, to the plugin executable
2) It launches the plugin as --privileged

If you don't want to use the --privileged flag, then things will still "work" in terms of allowing pods with GPUs to run, but you will see the plugin restart anytime a container with guaranteed CPUs from the CPUManager starts. If you are OK with this restart, then launching the daemonset as --privileged is not strictly necessary.

@klueska Is there a PR of device plugin for this issue? I would like to learn about the fix.

@klueska nvidia-device-plugin-compat-with-cpumanager.yml is using nvidia/k8s-device-plugin:v0.7.1. My current cluster is using nvidia/k8s-device-plugin:1.11. Is k8s-device-plugin:v0.7.1 a newer version than k8s-device-plugin:1.11? Can I upgrade the ds directly without breaking things on my production?

@zionwu, Yes v0.7.1 is newer than 1.11 and you should be able to upgrade without any issues.

Please see https://github.com/NVIDIA/k8s-device-plugin#versioning for info on versioning / upgrading.

Also, keep in mind that the semantics of deploying the plugin on nodes that do not have GPUs have changed slightly. You may need to set this flag to false in your daemonset if you rely on being able to deploy it on non-GPU nodes without it erroring out:
https://github.com/NVIDIA/k8s-device-plugin/commit/2a9b835b64edd8782e6beb662bbb2979a7e0cb0d

Got it. Thank you ! @klueska
