Cadvisor: GPU metrics didn't show on the `/metrics` endpoint

Created on 8 Nov 2017 · 9 Comments · Source: google/cadvisor

Hi @mindprince ,

Thanks for your GPU metrics PR. I have tested v0.28.0, and the log shows that NVML initialized OK:

NVML initialized. Number of nvidia devices: 4

but /metrics doesn't expose the GPU metrics, such as container_accelerator_duty_cycle. I'm not using nvidia-docker; I just installed the NVIDIA driver. Is there anything else I need to do?

Colstuwjx

All 9 comments

Hi @Colstuwjx, Thanks for trying this out.

NVML initialized. Number of nvidia devices: 4

This means everything was initialized correctly and metrics should show up.

A couple of things to note:

There are no machine-level metrics, so metrics won't show up unless a container with an accelerator attached is running.
Metrics only show up if accelerators are explicitly attached to the container, for example by passing the --device /dev/nvidia0:/dev/nvidia0 flag to docker. If nothing is explicitly attached to the container, metrics will not show up.
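
To make the second point concrete, here is a minimal Python sketch that assembles a docker run command line with one --device flag per GPU. The helper name device_flags, the image name my-gpu-workload, and the assumption that device nodes are named /dev/nvidia0, /dev/nvidia1, and so on are illustrative and not taken from cAdvisor itself.

```python
import shlex


def device_flags(gpu_indices):
    """Return docker --device flags that attach each listed GPU device node."""
    flags = []
    for i in gpu_indices:
        node = f"/dev/nvidia{i}"  # assumed device node naming
        flags += ["--device", f"{node}:{node}"]
    return flags


# Hypothetical workload image; replace with a real one.
cmd = ["docker", "run", "--rm"] + device_flags([0]) + ["my-gpu-workload"]
print(shlex.join(cmd))
```

A container started without any such flag would, per the notes above, produce no accelerator metrics even though NVML initialized correctly.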

I will use this issue to track adding documentation about GPU metrics.

mindprince on 8 Nov 2017

👍2

BTW, after testing on the host, I confirmed that cAdvisor running in a container does not show the metrics, while running it directly on the host works. @mindprince

Colstuwjx on 9 Nov 2017

Yeah, cAdvisor running inside the container will not find NVML unless you mount the path containing the library and update LD_LIBRARY_PATH. I will add this to the documentation as well.

mindprince on 9 Nov 2017

Hi @mindprince, I have almost solved the problem of collecting GPU metrics via cadvisor running inside a container, but there are still some questions that need to be answered or documented:

I need to run the cadvisor container with --privileged. Is there any way to avoid a fully privileged, unsafe container, for example by granting specific capabilities or sysctl arguments instead?
I added -e LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 --volume=/usr/local/lib/nvidia_drivers:/usr/local/nvidia/lib64 when starting cadvisor. We should document this. I'm not sure whether these are the full requirements, but it works.
While I could not collect the GPU metrics from cadvisor, the ONLY log line I could see was "NVML initialized. Number of nvidia devices: 4". We should make this clearer, for example by adding more logging along the path initialize devices -> add nvidia collector -> containerData updateStats, to tell us that we have some nvidia devices but can NOT collect metrics, or that we cannot recognize the nvidia device. Something like adding logging at line 132 would be good.
BTW, after custom-building cadvisor with more debug logging, I found that building cadvisor also seems to need the nvidia libraries? After moving to a GPU host, it built OK. Maybe we also need to update the build page? I'm not sure, haha.
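
The flags reported in this comment can be collected into a single command. The Python sketch below only assembles the argument list and prints it. The image tag google/cadvisor:v0.28.0 is an assumption, the usual cAdvisor mounts and port mapping are omitted, and the host path /usr/local/lib/nvidia_drivers is specific to the reporter's machine, so treat this as a starting point rather than a verified recipe.

```python
import shlex

# Library search path and host driver directory exactly as reported above.
LD_LIBRARY_PATH = ":".join([
    "/usr/local/cuda/extras/CUPTI/lib64",
    "/usr/local/nvidia/lib",
    "/usr/local/nvidia/lib64",
])
HOST_DRIVER_DIR = "/usr/local/lib/nvidia_drivers"


def cadvisor_gpu_args(image="google/cadvisor:v0.28.0"):  # image tag is an assumption
    """Build a docker run argument list using the flags from this comment."""
    return [
        "docker", "run",
        "--privileged",  # reported as required; a narrower alternative is an open question
        "-e", f"LD_LIBRARY_PATH={LD_LIBRARY_PATH}",
        f"--volume={HOST_DRIVER_DIR}:/usr/local/nvidia/lib64",
        image,
    ]


print(shlex.join(cadvisor_gpu_args()))
```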

Anyway, thanks again for your PR. It's great to be able to use cadvisor natively to monitor GPU containers!

Colstuwjx on 10 Nov 2017

👍2

@mindprince @Colstuwjx could you tell me where the documentation for using cadvisor to monitor GPUs is?

pineking on 18 Nov 2017

Can someone please add documentation on how to find GPU information in the cAdvisor web UI? I can see GPU info in the metrics but cannot find it in the web UI.

TuranTimur on 11 Jan 2018

It's not on the web UI, only in the API.

PRs welcome!

mindprince on 11 Jan 2018

👍1

Hello @mindprince. Thanks for the clear answer! Then is the metrics API the only way to extract GPU data per container? I know that cAdvisor has a RESTful API and storage backends like InfluxDB, and I would like to use those rather than the metrics endpoint.

TuranTimur on 11 Jan 2018

The GPU metrics should be available both under /metrics and through the cAdvisor REST API.
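
One quick way to check the /metrics half of this is to filter the endpoint's output for accelerator metrics. The Python sketch below assumes cAdvisor is reachable at localhost:8080 and that GPU metric names start with container_accelerator_ (as container_accelerator_duty_cycle does). The function names and the labels in the sample text are hypothetical.

```python
from urllib.request import urlopen

PREFIX = "container_accelerator_"


def accelerator_samples(metrics_text):
    """Return sample lines from Prometheus text output whose metric name starts with PREFIX."""
    return [
        line for line in metrics_text.splitlines()
        if line.startswith(PREFIX)  # skips "# HELP"/"# TYPE" comments and other metrics
    ]


def fetch_metrics(url="http://localhost:8080/metrics"):  # assumed address
    """Download the raw metrics text from a running cAdvisor instance."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


# Illustrative input; the label set is made up for the example.
sample = """# TYPE container_accelerator_duty_cycle gauge
container_accelerator_duty_cycle{id="/docker/abc"} 37
container_cpu_usage_seconds_total{id="/docker/abc"} 1.5
"""
for line in accelerator_samples(sample):
    print(line)
```

Against a live instance, accelerator_samples(fetch_metrics()) returning an empty list while the log reports NVML as initialized would be consistent with the cases discussed earlier in this thread, such as no running container having a GPU device explicitly attached.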

mindprince on 11 Jan 2018

👍1
