Cadvisor: Not able to collect metrics for nvidia GPU

Created on 23 Mar 2018 · 8 comments · Source: google/cadvisor

I am using an Amazon EC2 instance with a GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   42C    P8    25W / 149W |     11MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I followed https://github.com/google/cadvisor/blob/b26bf6ebb2999de15e6af43337687859a531c4ee/docs/running.md
and am running everything as "root" inside the EC2 instance.

  1. One container is running with the GPU in Docker, started with the nvidia-docker command
  2. cAdvisor is running as a separate container, started with:

sudo nvidia-docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  google/cadvisor:latest
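As a quick sanity check (my own sketch, not from the thread), you can list the NVIDIA device nodes present on the host and make sure the `--device` flags above match them:

```python
import glob

def nvidia_device_nodes():
    """List NVIDIA device nodes on the host (e.g. /dev/nvidia0,
    /dev/nvidiactl, /dev/nvidia-uvm); empty if no driver is loaded."""
    return sorted(glob.glob("/dev/nvidia*"))

# Pass one --device flag per node to the cAdvisor container.
```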

  3. I can see the container at HOST:8080/containers/ in the sub-containers metrics section; it shows CPU usage, but no GPU metrics.

--- Questions ---

  1. Can you add a screenshot showing how the metrics look in the UI, or name the API to check whether stats are being collected?
  2. Is there an endpoint to check the metrics in cAdvisor?
  3. Does a container that uses the GPU also need the --device flags?
  4. Can you add detailed documentation with some sample GPU images so that anybody can reproduce the steps? Is there a sample image to try?
  5. Will this work in a Kubernetes cluster, given that kubelets have cAdvisor built in?
  6. Where can I check which cAdvisor version is used by which Kubernetes version?
  7. Where can I find the cAdvisor logs? Will they say whether GPU stats are being collected?

All 8 comments

cc @mindprince

Does cAdvisor have access to NVML? From https://github.com/google/cadvisor/blob/v0.29.1/docs/running.md#hardware-accelerator-monitoring:

If you are running cAdvisor inside a container, you will need to do the following to give the container access to the NVML library:

  -e LD_LIBRARY_PATH=<path-where-nvml-is-present>
  --volume <above-path>:<above-path>

Answers to your questions:

  1. GPU metrics don't show up in the UI.
  2. You can look at the /api/v1.3/subcontainers path.
  3. Yes, but I think nvidia-docker does that for you. Note that this integration is tested only with Kubernetes, not with nvidia-docker.
  4. Any image based on https://hub.docker.com/r/nvidia/cuda/tags/ should work.
  5. Yes, this is tested in Kubernetes. If you request GPUs through the Kubernetes API (resource nvidia.com/gpu), things should work automatically. See the OSS docs or the GKE docs.
  6. Please look at Godeps.json in https://github.com/kubernetes/kubernetes for the Kubernetes version you want to check.
  7. The location of the cAdvisor logs depends on how you run cAdvisor. They do tell you whether it is able to collect GPU stats.
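To act on answer 2, here is a minimal sketch (mine, not from the thread) that walks a /api/v1.3/subcontainers response and collects any accelerator samples. The `accelerators` field inside each stats sample matches cAdvisor's v1.3 API as I understand it, but verify the exact layout against your cAdvisor version:

```python
def accelerator_stats(containers):
    """Collect accelerator samples from a decoded cAdvisor v1.3
    subcontainers response (a list of container objects, each with an
    optional `stats` list whose samples may carry `accelerators`)."""
    found = []
    for container in containers:
        for sample in container.get("stats") or []:
            for acc in sample.get("accelerators") or []:
                found.append((container.get("name"), acc))
    return found

# Example usage against a live cAdvisor (hypothetical host/port):
#   import json; from urllib.request import urlopen
#   containers = json.load(urlopen("http://localhost:8080/api/v1.3/subcontainers"))
#   for name, acc in accelerator_stats(containers):
#       print(name, acc.get("make"), acc.get("model"), acc.get("duty_cycle"))
```

If this returns an empty list while GPU load is running, cAdvisor is most likely not finding NVML.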

Thanks for the reply.
I am able to collect the GPU metrics without k8s.

Here is the cAdvisor docker command:

sudo nvidia-docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --volume=/var/lib/nvidia-docker/volumes/nvidia_driver/387.26:/usr/local/nvidia \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  -e LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \
  google/cadvisor:latest

The EC2 instance should use an "nvidia cuda" base image.
The GPU code ran on the same machine where the cAdvisor docker container is running.
I modified the code to run in a loop to create some load; you will need CUDA and conda installed:
https://github.com/siddharthsharmanv/cudacasts/blob/master/InstallingCUDAPython/VectorAdd.py

Sample GPU output from calling http://{HOST}:8080/metrics:

# HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
# TYPE cadvisor_version_info gauge
cadvisor_version_info{cadvisorRevision="1e567c2",cadvisorVersion="v0.28.3",dockerVersion="1.13.1",kernelVersion="4.4.111-k8s",osVersion="Alpine Linux v3.4"} 1
# HELP container_accelerator_duty_cycle Percent of time over the past sample period during which the accelerator was actively processing.
# TYPE container_accelerator_duty_cycle gauge
container_accelerator_duty_cycle{acc_id="GPU-642094d0-7acf-a6cc-0e15-790e1d269839",id="/docker/6ec56ca39196ff3e013be895fc9f6d46e9956fbd373ad559afee11a340537b6f",image="google/cadvisor:latest",make="nvidia",model="Tesla K80",name="cadvisor"} 0
# HELP container_accelerator_memory_total_bytes Total accelerator memory.
# TYPE container_accelerator_memory_total_bytes gauge
container_accelerator_memory_total_bytes{acc_id="GPU-642094d0-7acf-a6cc-0e15-790e1d269839",id="/docker/6ec56ca39196ff3e013be895fc9f6d46e9956fbd373ad559afee11a340537b6f",image="google/cadvisor:latest",make="nvidia",model="Tesla K80",name="cadvisor"} 1.1995578368e+10
# HELP container_accelerator_memory_used_bytes Total accelerator memory allocated.
# TYPE container_accelerator_memory_used_bytes gauge
container_accelerator_memory_used_bytes{acc_id="GPU-642094d0-7acf-a6cc-0e15-790e1d269839",id="/docker/6ec56ca39196ff3e013be895fc9f6d46e9956fbd373ad559afee11a340537b6f",image="google/cadvisor:latest",make="nvidia",model="Tesla K80",name="cadvisor"} 1.2058624e+07
# HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
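A small sketch (mine, not from the thread) for pulling the accelerator samples out of that /metrics text programmatically; for real use, the prometheus_client library's text parser is the safer choice:

```python
import re

def parse_accelerator_metrics(text):
    """Extract container_accelerator_* samples from a Prometheus
    exposition page. Returns (metric_name, labels_dict, value) tuples.
    Hand-rolled for the simple label format shown above; it does not
    handle escaped quotes or commas inside label values."""
    sample_re = re.compile(r'^(container_accelerator_\w+)\{(.*)\}\s+(\S+)$')
    label_re = re.compile(r'(\w+)="([^"]*)"')
    samples = []
    for line in text.splitlines():
        m = sample_re.match(line)
        if m:
            name, raw_labels, value = m.groups()
            samples.append((name, dict(label_re.findall(raw_labels)), float(value)))
    return samples
```

Watching `container_accelerator_duty_cycle` while the vector-add loop runs is a quick way to confirm the GPU stats are live.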

I checked Kubernetes versions against Godeps.json in https://github.com/kubernetes/kubernetes and see that GPU metrics support exists from Kubernetes 1.9 onward: the "ImportPath": "github.com/google/cadvisor/accelerators" entry appears only in 1.9+.
Can somebody confirm this?

Yes, GPU monitoring support in Kubernetes was added in 1.9.

/close

@dashpole This can be closed.

I have also verified this with Kubernetes 1.9 and Heapster 1.5.1, on a cluster deployed in GCP.
