Hi, when I execute nvidia-smi inside the container, it shows the following (no processes listed):
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
But when I execute nvidia-smi on the physical host, it shows the process with PID 28290:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 28290 C python 1012MiB |
+-----------------------------------------------------------------------------+
How can I show the running processes inside the container?
I figured out one way, which is to run the container with --pid=host. Are there any other, more graceful ways?
It's a current limitation of our driver, sorry about that!
As you realized, it is related to PID namespaces: the driver is not aware of the PID namespace, so nvidia-smi inside the container doesn't see any processes running.
Thanks for your reply, @flx42.
So we could consider adding PID namespace awareness to the driver.
mark it
@fredy12 your way works
This should be implemented. 👍
How can we help make this happen?
Checking GPU usage is part of the workflow when running large training jobs. Without support for nvidia-smi inside the container, there is no good way to keep an eye on usage.
Are there plans to fix this soon? BTW, this is a problem not only with the nvidia-smi tool, but also when calling NVML directly.
Any news about this? Seems to me quite fundamental.
nvmlDeviceGetAccountingStats does not work when given the getpid() of the process itself inside nvidia-docker.
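For illustration, here is a minimal pynvml sketch of the kind of call that fails (GPU index 0 and accounting mode being enabled on the device are assumptions, not part of the original report):

import os
import pynvml as N

N.nvmlInit()
handle = N.nvmlDeviceGetHandleByIndex(0)  # GPU index 0, assumed
try:
    # Inside a container with its own PID namespace, getpid() returns the
    # container-local PID, which the driver does not know about, so the
    # lookup fails even though this process is using the GPU.
    stats = N.nvmlDeviceGetAccountingStats(handle, os.getpid())
    print("max GPU memory used (bytes):", stats.maxMemoryUsage)
except N.NVMLError as err:
    print("accounting lookup failed:", err)
finally:
    N.nvmlShutdown()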
@flx42 Can we track this issue somewhere? It seems that this one is closed.
+1
Please reopen this
Re-opening. It's a driver limitation and it still exists. There is a mediocre workaround using node-manager, but it is unsatisfactory for some workflows.
The requirement for closure is for nvidia-smi to at least show the processes running in the same container. The underlying libraries and drivers can't support this under the current architecture, so don't expect this to be fixed any time soon. No ETA, sorry!
Is there any update on this?
How can I get the GPU memory usage of a process inside a docker container?
Are there any workarounds?
We definitely need this. A typical example is when using PyTorch and running more than one training process: OOM will often kill the process without any info if we cannot monitor the GPU. Even assuming we use TF with a GPU memory fraction, like here:
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
Within the container, I need the GPU memory usage in order to decide at run time which policy to adopt for running processes such as training, or for parallelizing inferences, etc.
FYI:
As a workaround, I start the Docker container with the --pid=host flag, which works perfectly fine with e.g. Python's os.getpid().
@maaft once you have the PID, what do you do with nvidia-smi? Thanks
Oh sorry, I guess I missed the topic on this one a bit.
Anyway, if using python is an option for you:
import sys
import pynvml as N

MB = 1024 * 1024

def get_usage(device_index, my_pid):
    # Report the GPU memory (in MiB) used by the given PID on the given device.
    N.nvmlInit()
    handle = N.nvmlDeviceGetHandleByIndex(device_index)
    usage = [nv_process.usedGpuMemory // MB for nv_process in
             N.nvmlDeviceGetComputeRunningProcesses(handle) +
             N.nvmlDeviceGetGraphicsRunningProcesses(handle)
             if nv_process.pid == my_pid]
    if len(usage) == 1:
        return usage[0]
    raise KeyError("PID not found")

if __name__ == "__main__":
    # Command-line arguments arrive as strings, so convert them to integers.
    print(get_usage(int(sys.argv[1]), int(sys.argv[2])))
Instead of calling nvidia-smi from your process, you could just call this little script with the device index of your GPU and the process's PID as arguments, using popen or anything else, and read from stdout.
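For example, a minimal sketch of such a call (assuming the script above is saved as gpu_usage.py, a placeholder name, and you are querying GPU 0 for your own process):

import os
import subprocess

# Run the helper script with the device index and our own PID, then read stdout.
result = subprocess.run(
    ["python", "gpu_usage.py", "0", str(os.getpid())],
    capture_output=True, text=True, check=True,
)
print("GPU memory used by this process (MiB):", result.stdout.strip())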
No cigar. N.nvmlDeviceGetGraphicsRunningProcesses(handle) always returns the empty list for me, no matter what the actual load on the referenced GPU is.
@te0006 I just included it for completeness.
If your PID is not in N.nvmlDeviceGetGraphicsRunningProcesses(handle) it should be in N.nvmlDeviceGetComputeRunningProcesses(handle).
If both lists are empty: are you sure you started your Docker container with --pid=host?
Not really sure. Have to check with a colleague who does the administration. Thanks for the hint.
Can't believe it is still not working after 4 years. Now people are running GPU inference/training jobs in every Kubernetes cluster, and we really need a way to determine GPU usage inside Docker containers.
@dragonly you can use the workaround stated above in the meantime.
After consulting with my colleague and doing some testing, I can confirm that the above workaround works:
After he added 'hostPID: true' to the pod specification and restarted the container, nvidia-smi now shows the GPU-using Python processes correctly, with PID and GPU memory usage. Querying the GPU usage with maaft's Python code above works as well.
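For anyone constructing pods programmatically, here is a minimal sketch using the official kubernetes Python client; the pod name, container name, image, and command are placeholders, and host_pid=True is the programmatic equivalent of setting 'hostPID: true' in the pod YAML:

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-train"),  # placeholder name
    spec=client.V1PodSpec(
        host_pid=True,  # share the host PID namespace so NVML sees host PIDs
        containers=[
            client.V1Container(
                name="trainer",                  # placeholder name
                image="nvidia/cuda:11.0-base",   # placeholder image
                command=["python", "train.py"],  # placeholder command
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)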
Given the options available to you, I am closing this:
Some of the information people want to see (like memory and GPU usage) might be better represented in a monitoring system such as Prometheus, where a graph can provide better visualization across multiple nodes in a cluster. I would love to see a way to monitor processes cluster-wide, and not simply for a particular node (a bit out of scope for this issue).
If there is a specific feature or enhancement to one of the options above, please open a new issue.