I am using an Amazon EC2 instance with a GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26 Driver Version: 387.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 42C P8 25W / 149W | 11MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I followed this guide: https://github.com/google/cadvisor/blob/b26bf6ebb2999de15e6af43337687859a531c4ee/docs/running.md
I am running this as "root" inside the EC2 instance:
sudo nvidia-docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia-uvm:/dev/nvidia-uvm \
google/cadvisor:latest
--- Questions ---
cc @mindprince
Does cAdvisor have access to NVML? From https://github.com/google/cadvisor/blob/v0.29.1/docs/running.md#hardware-accelerator-monitoring
If you are running cAdvisor inside a container, you will need to do the following to give the container access to the NVML library:
-e LD_LIBRARY_PATH=<path-where-nvml-is-present> --volume <above-path>:<above-path>
Answers to your questions:
The metrics are available at the /api/v1.3/subcontainers path.
nvidia-docker does it for you. Note that this integration is not tested with nvidia-docker, only with Kubernetes (if you request nvidia.com/gpu, things should work automatically). See the OSS docs or GKE docs.
Thanks for the reply.
I am able to collect the GPU metrics without k8s.
Here is the cAdvisor Docker command:
sudo nvidia-docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--volume=/var/lib/nvidia-docker/volumes/nvidia_driver/387.26:/usr/local/nvidia \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia-uvm:/dev/nvidia-uvm \
-e LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \
google/cadvisor:latest
The EC2 instance should use an NVIDIA CUDA base image.
The GPU code runs on the same machine where the cAdvisor container is running.
I modified the code below to run in a loop to create some load.
You will need CUDA and Conda installed:
https://github.com/siddharthsharmanv/cudacasts/blob/master/InstallingCUDAPython/VectorAdd.py
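The load loop is essentially the following. This is only a NumPy stand-in sketch of the vector add (the linked VectorAdd.py performs the same element-wise addition as a CUDA kernel, which is what actually drives the GPU duty cycle); the function names and iteration counts here are illustrative, not from the original script.

```python
import numpy as np

def vector_add(a, b):
    # Stand-in for the CUDA VectorAdd kernel: element-wise addition.
    # In the linked VectorAdd.py this runs on the GPU via a CUDA kernel.
    return a + b

def generate_load(iterations=1000, n=1_000_000):
    # Repeat the vector add in a loop so the accelerator duty cycle
    # stays non-zero long enough for cAdvisor to sample it.
    a = np.ones(n, dtype=np.float32)
    b = np.full(n, 2.0, dtype=np.float32)
    c = None
    for _ in range(iterations):
        c = vector_add(a, b)
    return c

if __name__ == "__main__":
    result = generate_load(iterations=10, n=1000)
    print(result[:3])  # [3. 3. 3.]
```

With the CUDA version of vector_add, running this loop while cAdvisor is up makes container_accelerator_duty_cycle climb above zero.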
Sample output for the GPU metrics.
Call http://{HOST}:8080/metrics:
# HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
# TYPE cadvisor_version_info gauge
cadvisor_version_info{cadvisorRevision="1e567c2",cadvisorVersion="v0.28.3",dockerVersion="1.13.1",kernelVersion="4.4.111-k8s",osVersion="Alpine Linux v3.4"} 1
# HELP container_accelerator_duty_cycle Percent of time over the past sample period during which the accelerator was actively processing.
# TYPE container_accelerator_duty_cycle gauge
container_accelerator_duty_cycle{acc_id="GPU-642094d0-7acf-a6cc-0e15-790e1d269839",id="/docker/6ec56ca39196ff3e013be895fc9f6d46e9956fbd373ad559afee11a340537b6f",image="google/cadvisor:latest",make="nvidia",model="Tesla K80",name="cadvisor"} 0
# HELP container_accelerator_memory_total_bytes Total accelerator memory.
# TYPE container_accelerator_memory_total_bytes gauge
container_accelerator_memory_total_bytes{acc_id="GPU-642094d0-7acf-a6cc-0e15-790e1d269839",id="/docker/6ec56ca39196ff3e013be895fc9f6d46e9956fbd373ad559afee11a340537b6f",image="google/cadvisor:latest",make="nvidia",model="Tesla K80",name="cadvisor"} 1.1995578368e+10
# HELP container_accelerator_memory_used_bytes Total accelerator memory allocated.
# TYPE container_accelerator_memory_used_bytes gauge
container_accelerator_memory_used_bytes{acc_id="GPU-642094d0-7acf-a6cc-0e15-790e1d269839",id="/docker/6ec56ca39196ff3e013be895fc9f6d46e9956fbd373ad559afee11a340537b6f",image="google/cadvisor:latest",make="nvidia",model="Tesla K80",name="cadvisor"} 1.2058624e+07
# HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
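To pull just the accelerator series out of that endpoint, a small parser over the Prometheus text format works. This is a sketch: the metric names match the output above, but the trimmed label set in the demo string and the localhost:8080 usage noted in the comment are assumptions.

```python
import re

def parse_accelerator_metrics(text):
    # Extract container_accelerator_* samples from Prometheus text format.
    # Sample lines look like: name{label="value",...} 1.23e+07
    pattern = re.compile(r'^(container_accelerator_\w+)\{(.*)\}\s+(\S+)$')
    samples = []
    for line in text.splitlines():
        m = pattern.match(line)
        if m:
            name, labels, value = m.groups()
            samples.append((name, labels, float(value)))
    return samples

if __name__ == "__main__":
    # Demo on a trimmed copy of the output above. Against a live cAdvisor you
    # would fetch the body from http://{HOST}:8080/metrics instead, e.g. with
    # urllib.request.urlopen(...).read().decode().
    sample = (
        'container_accelerator_duty_cycle{acc_id="GPU-642094d0",make="nvidia"} 0\n'
        'container_accelerator_memory_total_bytes{acc_id="GPU-642094d0"} 1.1995578368e+10\n'
        '# HELP container_cpu_load_average_10s ...\n'
    )
    for name, labels, value in parse_accelerator_metrics(sample):
        print(name, value)
```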
I checked the Kubernetes versions against Godeps.json in https://github.com/kubernetes/kubernetes, and I see that GPU metrics support is present from k8s 1.9 onward.
The "ImportPath": "github.com/google/cadvisor/accelerators" entry only appears from 1.9 onward, so accelerator support starts there.
Can somebody just confirm this?
Yes, GPU monitoring support in Kubernetes was added in 1.9.
/close
@dashpole This can be closed.
I have also verified this with k8s 1.9 and Heapster 1.5.1, with the cluster deployed on GCP.