Hi, when I execute nvidia-smi inside the container, it shows the following (no processes listed):
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
But when I execute nvidia-smi on the physical host, it shows the process with PID 28290:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 28290 C python 1012MiB |
+-----------------------------------------------------------------------------+
How can I show the running processes inside the container?
I figured out one way, which is to run the container with --pid=host. Are there any other, more graceful ways?
It's a current limitation of our driver, sorry about that!
As you realized, it is related to PID namespaces: the driver is not aware of the PID namespace, so nvidia-smi inside the container doesn't see any processes running.
Thanks for your reply, @flx42.
So we could consider adding PID namespace awareness to the driver.
mark it
@fredy12 your way works
This should be implemented. 👍
How can we help make this happen?
Checking GPU usage is part of the workflow when running large training jobs. Without support for nvidia-smi inside the container, there is no good way to keep an eye on usage.
Are there plans to fix this soon? BTW, this is a problem not only with the nvidia-smi tool, but also when calling NVML directly.
Any news about this? Seems to me quite fundamental.
nvmlDeviceGetAccountingStats does not work when given the getpid() of the process itself inside nvidia-docker.
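For illustration, here is a minimal pynvml sketch of the kind of call that fails (GPU index 0 and accounting mode being enabled on the device are assumptions, not part of the original report):

import os
import pynvml as N

N.nvmlInit()
handle = N.nvmlDeviceGetHandleByIndex(0)  # GPU index 0, assumed
try:
    # Inside a container with its own PID namespace, getpid() returns the
    # container-local PID, which the driver does not know about, so the
    # lookup fails even though this process is using the GPU.
    stats = N.nvmlDeviceGetAccountingStats(handle, os.getpid())
    print("max GPU memory used (bytes):", stats.maxMemoryUsage)
except N.NVMLError as err:
    print("accounting lookup failed:", err)
finally:
    N.nvmlShutdown()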
@flx42 Can we track this issue somewhere? It seems that this one is closed.
+1
Please reopen this
Re-opening. It's a driver limitation and it still exists. There is a mediocre workaround using node-manager, but it is unsatisfactory for some workflows.
The requirement for closure is for nvidia-smi to at least show the processes running in the same container. The underlying libraries and drivers can't support this under the current architecture, so don't expect this to be fixed any time soon. No ETA, sorry!
Is there any update on this?
How can I get the GPU memory usage of a process inside a docker container?
Are there any workarounds?
We definitely need this. A typical example is when using PyTorch and running more than one training process: OOM will often kill the process without any info if we cannot monitor the GPU. Even assuming we use TF with a GPU memory fraction, like here:
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
Within the container, I need the GPU memory usage in order to decide at run time which policy to adopt for running processes such as training, or for parallelizing inferences, etc.
FYI:
As a workaround, I start the Docker container with the --pid=host flag, which works perfectly fine with e.g. Python's os.getpid().
@maaft once you have the PID, what do you do with nvidia-smi? Thanks
Oh sorry, I guess I missed the topic on this one a bit.
Anyway, if using python is an option for you:
import sys
import pynvml as N

MB = 1024 * 1024

def get_usage(device_index, my_pid):
    # Report the GPU memory (in MiB) used by the given PID on the given device.
    N.nvmlInit()
    handle = N.nvmlDeviceGetHandleByIndex(device_index)
    usage = [nv_process.usedGpuMemory // MB for nv_process in
             N.nvmlDeviceGetComputeRunningProcesses(handle) +
             N.nvmlDeviceGetGraphicsRunningProcesses(handle)
             if nv_process.pid == my_pid]
    if len(usage) == 1:
        return usage[0]
    raise KeyError("PID not found")

if __name__ == "__main__":
    # Command-line arguments arrive as strings, so convert them to integers.
    print(get_usage(int(sys.argv[1]), int(sys.argv[2])))
Instead of calling nvidia-smi from your process, you could just call this little script with the device index of your GPU and the process's PID as arguments, using popen or anything else, and read from stdout.
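For example, a minimal sketch of such a call (assuming the script above is saved as gpu_usage.py, a placeholder name, and you are querying GPU 0 for your own process):

import os
import subprocess

# Run the helper script with the device index and our own PID, then read stdout.
result = subprocess.run(
    ["python", "gpu_usage.py", "0", str(os.getpid())],
    capture_output=True, text=True, check=True,
)
print("GPU memory used by this process (MiB):", result.stdout.strip())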
No cigar. N.nvmlDeviceGetGraphicsRunningProcesses(handle) always returns the empty list for me, no matter what the actual load on the referenced GPU is.
@te0006 I just included it for completeness.
If your PID is not in N.nvmlDeviceGetGraphicsRunningProcesses(handle) it should be in N.nvmlDeviceGetComputeRunningProcesses(handle).
If both lists are empty: are you sure you started your Docker container with --pid=host?
Not really sure. Have to check with a colleague who does the administration. Thanks for the hint.
Can't believe it is still not working after 4 years. Now people are running GPU inference/training jobs in every Kubernetes cluster, and we really need a way to determine GPU usage inside Docker containers.
@dragonly you can use the workaround stated above in the meantime.
After consulting with my colleague and doing some testing, I can confirm that the above workaround works:
After he added 'hostPID: true' to the pod specification and restarted the container, nvidia-smi now shows the GPU-using Python processes correctly, with PID and GPU memory usage. Querying the GPU usage with maaft's Python code above works as well.
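For anyone constructing pods programmatically, here is a minimal sketch using the official kubernetes Python client; the pod name, container name, image, and command are placeholders, and host_pid=True is the programmatic equivalent of setting 'hostPID: true' in the pod YAML:

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-train"),  # placeholder name
    spec=client.V1PodSpec(
        host_pid=True,  # share the host PID namespace so NVML sees host PIDs
        containers=[
            client.V1Container(
                name="trainer",                  # placeholder name
                image="nvidia/cuda:11.0-base",   # placeholder image
                command=["python", "train.py"],  # placeholder command
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)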
Given the options available to you, I am closing this:
Some of the information people want to see (like memory and GPU usage) might be better represented in a monitoring system such as Prometheus, where a graph can provide better visualization across multiple nodes in a cluster. I would love to see a way to monitor processes cluster-wide, and not simply for a particular node (a bit out of scope for this issue).
If there is a specific feature or enhancement to one of the options above, please open a new issue.