Cadvisor: Not able to collect metrics for nvidia GPU

Created on 23 Mar 2018 · 8 comments · Source: google/cadvisor

I am using an Amazon EC2 instance with a GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   42C    P8    25W / 149W |     11MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I followed https://github.com/google/cadvisor/blob/b26bf6ebb2999de15e6af43337687859a531c4ee/docs/running.md
and am running everything as "root" inside the EC2 instance.

  1. One container is running with the GPU in Docker, started with the nvidia-docker command
  2. cAdvisor is running as a separate container, started with:

sudo nvidia-docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  google/cadvisor:latest
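As a quick sanity check (my own sketch, not from the thread), you can list the NVIDIA device nodes present on the host and make sure the `--device` flags above match them:

```python
import glob

def nvidia_device_nodes():
    """List NVIDIA device nodes on the host (e.g. /dev/nvidia0,
    /dev/nvidiactl, /dev/nvidia-uvm); empty if no driver is loaded."""
    return sorted(glob.glob("/dev/nvidia*"))

# Pass one --device flag per node to the cAdvisor container.
```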

  3. I can see the container at HOST:8080/containers/ in the sub-containers metrics section; it shows CPU usage, but no GPU metrics.

--- Questions ---

  1. Can you add a screenshot showing how the metrics look in the UI, or name the API to check whether stats are being collected?
  2. Is there an endpoint to check the metrics in cAdvisor?
  3. Does a container that uses the GPU also need the --device flags?
  4. Can you add detailed documentation with some sample GPU images so that anybody can reproduce the steps? Is there a sample image to try?
  5. Will this work in a Kubernetes cluster, given that kubelets have cAdvisor built in?
  6. Where can I check which cAdvisor version is used by which Kubernetes version?
  7. Where can I find the cAdvisor logs? Will they say whether GPU stats are being collected?

All 8 comments

cc @mindprince

Does cAdvisor have access to NVML? From https://github.com/google/cadvisor/blob/v0.29.1/docs/running.md#hardware-accelerator-monitoring:

If you are running cAdvisor inside a container, you will need to do the following to give the container access to the NVML library:

  -e LD_LIBRARY_PATH=<path-where-nvml-is-present>
  --volume <above-path>:<above-path>

Answers to your questions:

  1. GPU metrics don't show up in the UI.
  2. You can look at the /api/v1.3/subcontainers path.
  3. Yes, but I think nvidia-docker does that for you. Note that this integration is tested only with Kubernetes, not with nvidia-docker.
  4. Any image based on https://hub.docker.com/r/nvidia/cuda/tags/ should work.
  5. Yes, this is tested in Kubernetes. If you request GPUs through the Kubernetes API (resource nvidia.com/gpu), things should work automatically. See the OSS docs or the GKE docs.
  6. Please look at Godeps.json in https://github.com/kubernetes/kubernetes for the Kubernetes version you want to check.
  7. The location of the cAdvisor logs depends on how you run cAdvisor. They do tell you whether it is able to collect GPU stats.
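To act on answer 2, here is a minimal sketch (mine, not from the thread) that walks a /api/v1.3/subcontainers response and collects any accelerator samples. The `accelerators` field inside each stats sample matches cAdvisor's v1.3 API as I understand it, but verify the exact layout against your cAdvisor version:

```python
def accelerator_stats(containers):
    """Collect accelerator samples from a decoded cAdvisor v1.3
    subcontainers response (a list of container objects, each with an
    optional `stats` list whose samples may carry `accelerators`)."""
    found = []
    for container in containers:
        for sample in container.get("stats") or []:
            for acc in sample.get("accelerators") or []:
                found.append((container.get("name"), acc))
    return found

# Example usage against a live cAdvisor (hypothetical host/port):
#   import json; from urllib.request import urlopen
#   containers = json.load(urlopen("http://localhost:8080/api/v1.3/subcontainers"))
#   for name, acc in accelerator_stats(containers):
#       print(name, acc.get("make"), acc.get("model"), acc.get("duty_cycle"))
```

If this returns an empty list while GPU load is running, cAdvisor is most likely not finding NVML.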

Thanks for the reply.
I am able to collect the GPU metrics without k8s.

Here is the cAdvisor docker command:

sudo nvidia-docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --volume=/var/lib/nvidia-docker/volumes/nvidia_driver/387.26:/usr/local/nvidia \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  -e LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \
  google/cadvisor:latest

The EC2 instance should use an "nvidia cuda" base image.
The GPU code ran on the same machine where the cAdvisor docker container is running.
I modified the code to run in a loop to create some load; you will need CUDA and conda installed:
https://github.com/siddharthsharmanv/cudacasts/blob/master/InstallingCUDAPython/VectorAdd.py

Sample GPU output from calling http://{HOST}:8080/metrics:

# HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
# TYPE cadvisor_version_info gauge
cadvisor_version_info{cadvisorRevision="1e567c2",cadvisorVersion="v0.28.3",dockerVersion="1.13.1",kernelVersion="4.4.111-k8s",osVersion="Alpine Linux v3.4"} 1
# HELP container_accelerator_duty_cycle Percent of time over the past sample period during which the accelerator was actively processing.
# TYPE container_accelerator_duty_cycle gauge
container_accelerator_duty_cycle{acc_id="GPU-642094d0-7acf-a6cc-0e15-790e1d269839",id="/docker/6ec56ca39196ff3e013be895fc9f6d46e9956fbd373ad559afee11a340537b6f",image="google/cadvisor:latest",make="nvidia",model="Tesla K80",name="cadvisor"} 0
# HELP container_accelerator_memory_total_bytes Total accelerator memory.
# TYPE container_accelerator_memory_total_bytes gauge
container_accelerator_memory_total_bytes{acc_id="GPU-642094d0-7acf-a6cc-0e15-790e1d269839",id="/docker/6ec56ca39196ff3e013be895fc9f6d46e9956fbd373ad559afee11a340537b6f",image="google/cadvisor:latest",make="nvidia",model="Tesla K80",name="cadvisor"} 1.1995578368e+10
# HELP container_accelerator_memory_used_bytes Total accelerator memory allocated.
# TYPE container_accelerator_memory_used_bytes gauge
container_accelerator_memory_used_bytes{acc_id="GPU-642094d0-7acf-a6cc-0e15-790e1d269839",id="/docker/6ec56ca39196ff3e013be895fc9f6d46e9956fbd373ad559afee11a340537b6f",image="google/cadvisor:latest",make="nvidia",model="Tesla K80",name="cadvisor"} 1.2058624e+07
# HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
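A small sketch (mine, not from the thread) for pulling the accelerator samples out of that /metrics text programmatically; for real use, the prometheus_client library's text parser is the safer choice:

```python
import re

def parse_accelerator_metrics(text):
    """Extract container_accelerator_* samples from a Prometheus
    exposition page. Returns (metric_name, labels_dict, value) tuples.
    Hand-rolled for the simple label format shown above; it does not
    handle escaped quotes or commas inside label values."""
    sample_re = re.compile(r'^(container_accelerator_\w+)\{(.*)\}\s+(\S+)$')
    label_re = re.compile(r'(\w+)="([^"]*)"')
    samples = []
    for line in text.splitlines():
        m = sample_re.match(line)
        if m:
            name, raw_labels, value = m.groups()
            samples.append((name, dict(label_re.findall(raw_labels)), float(value)))
    return samples
```

Watching `container_accelerator_duty_cycle` while the vector-add loop runs is a quick way to confirm the GPU stats are live.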

I checked Kubernetes versions against Godeps.json in https://github.com/kubernetes/kubernetes and see that GPU metrics support exists from Kubernetes 1.9 onward: the "ImportPath": "github.com/google/cadvisor/accelerators" entry appears only in 1.9+.
Can somebody confirm this?

Yes, GPU monitoring support in Kubernetes was added in 1.9.

/close

@dashpole This can be closed.

I have also verified this with Kubernetes 1.9 and Heapster 1.5.1, on a cluster deployed in GCP.
