Cadvisor: GPU metrics didn't show on the `/metrics` endpoint

Created on 8 Nov 2017 · 9 Comments · Source: google/cadvisor

Hi @mindprince ,

Thanks for your GPU metrics PR. I tested v0.28.0, and the log shows that NVML initialized OK:

NVML initialized. Number of nvidia devices: 4

but /metrics doesn't expose the GPU metrics, such as container_accelerator_duty_cycle. I'm not using nvidia-docker; I only installed the NVIDIA driver. Is there anything else I need to do?

All 9 comments

Hi @Colstuwjx, Thanks for trying this out.

NVML initialized. Number of nvidia devices: 4

This means everything was initialized correctly and metrics should show up.

Couple of things to note:

  • There are no machine level metrics. So, metrics won't show up if no container with accelerator attached is running.
  • Metrics will only show up if accelerators are explicitly attached to the container, for example by passing the --device /dev/nvidia0:/dev/nvidia0 flag to docker. If nothing is explicitly attached to the container, metrics will not show up.
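
To check whether any accelerator metrics are actually being exposed, something like the following could be used. It is only a sketch: it filters a Prometheus-style text dump for metric names starting with container_accelerator_ (the prefix of the container_accelerator_duty_cycle metric mentioned above). The function names and the default URL (localhost, port 8080) are assumptions for illustration, not part of cAdvisor.

```python
from urllib.request import urlopen

# Prefix shared by the GPU metrics discussed in this thread.
ACCELERATOR_PREFIX = "container_accelerator_"


def accelerator_samples(metrics_text):
    """Return the sample lines whose metric name starts with the
    accelerator prefix, skipping comments and blank lines."""
    samples = []
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith(ACCELERATOR_PREFIX):
            samples.append(line)
    return samples


def fetch_accelerator_samples(url="http://localhost:8080/metrics"):
    # Hypothetical helper: the URL is an assumption, so point it at
    # your own cAdvisor instance.
    with urlopen(url, timeout=5) as response:
        return accelerator_samples(response.read().decode("utf-8"))
```

An empty result while NVML reports devices would be consistent with the two notes above: no running container has an accelerator explicitly attached.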

I will use this issue to track adding documentation about GPU metrics.

BTW, after testing on the host, I confirmed that cAdvisor running in a container does not show the metrics, while running it directly on the host works. @mindprince

Yeah, cAdvisor running inside the container will not find NVML unless you mount the path containing the library and update LD_LIBRARY_PATH. I will add this to the documentation as well.

Hi @mindprince, I have almost solved the problem of collecting GPU metrics via cAdvisor running inside a container, but some questions still need to be answered or documented:

  • I had to run the cAdvisor container with --privileged. Is there a way to avoid a fully privileged, unsafe container, for example by granting specific capabilities or sysctl settings instead?

  • I added -e LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 --volume=/usr/local/lib/nvidia_drivers:/usr/local/nvidia/lib64 when starting cAdvisor. We should document this. I'm not sure whether this is the full set of requirements, but it works.

  • While cAdvisor could not collect the GPU metrics, the ONLY related log line I could see was NVML initialized. Number of nvidia devices: 4. We should make this clearer by adding logging along the path initialize devices -> add nvidia collector -> containerData updateStats, so the log tells us either that NVIDIA devices were found but metrics could NOT be collected, or that the NVIDIA device was not recognized. Adding logging at line 132, for example, would help.

  • BTW, after building a custom cAdvisor with more debug logging, I found that building cAdvisor also seems to need the NVIDIA libraries. After moving to a GPU host, the build succeeded. Maybe we also need to update the build page? I'm not sure, haha.

Anyway, thanks again for your PR. It's great to be able to use cAdvisor natively to monitor GPU containers!
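
The flags from the comment above can be assembled into a docker command line roughly as follows. This is a sketch under assumptions: the function name is hypothetical, the image tag google/cadvisor:v0.28.0 is assumed from the version mentioned in this thread, the host driver directory is specific to the commenter's machine, and the usual cAdvisor volume mounts are left out for brevity.

```python
import shlex


def cadvisor_gpu_run_args(driver_dir, image="google/cadvisor:v0.28.0"):
    """Build a `docker run` argument list with the GPU-related flags
    reported to work in this thread. The image tag is an assumption."""
    # Library search path inside the container, as given in the comment.
    lib_dirs = [
        "/usr/local/cuda/extras/CUPTI/lib64",
        "/usr/local/nvidia/lib",
        "/usr/local/nvidia/lib64",
    ]
    return [
        "docker", "run",
        # The commenter needed full privileges; no narrower option
        # is confirmed in this thread.
        "--privileged",
        "-e", "LD_LIBRARY_PATH=" + ":".join(lib_dirs),
        # Mount the host directory that holds the NVIDIA driver libraries.
        "--volume=" + driver_dir + ":/usr/local/nvidia/lib64",
        image,
    ]


if __name__ == "__main__":
    # Host path taken from the comment; adjust it for your own host.
    print(shlex.join(cadvisor_gpu_run_args("/usr/local/lib/nvidia_drivers")))
```

Building the arguments as a list keeps the quoting explicit, and the same list can be passed to subprocess.run without going through a shell.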

@mindprince @Colstuwjx could you tell me where the documentation for using cAdvisor to monitor GPUs is?

Can someone please add to the documentation how to find GPU information in the cAdvisor web UI? I can see GPU info in the metrics but cannot find it in the web UI.

It's not on the web UI, only in the API.

PRs welcome!

Hello @mindprince. Thanks for the clear answer! So is the metrics API the only way to extract per-container GPU data? I know that cAdvisor has a RESTful API and storage backends like InfluxDB, and I would like to use those rather than the metrics endpoint.

The GPU metrics should be available both under /metrics and through the cAdvisor REST API.
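
For the REST API route, a sketch like the one below could pull per-accelerator values out of an already decoded container info object. The JSON shape is an assumption: it supposes each entry in "stats" may carry an "accelerators" list with "id", "duty_cycle", "memory_used" and "memory_total" fields, which should be verified against the response from your own cAdvisor version. The function name is hypothetical.

```python
def accelerator_stats(container_info):
    """Flatten accelerator readings from a decoded container info dict.

    Assumed shape (verify against your cAdvisor version):
    {"stats": [{"timestamp": ..., "accelerators": [{"id": ..., ...}]}]}
    """
    rows = []
    for sample in container_info.get("stats", []):
        # Containers without an attached accelerator may omit the key
        # or set it to null, so fall back to an empty list.
        for acc in sample.get("accelerators") or []:
            rows.append({
                "timestamp": sample.get("timestamp"),
                "id": acc.get("id"),
                "duty_cycle": acc.get("duty_cycle"),
                "memory_used": acc.get("memory_used"),
                "memory_total": acc.get("memory_total"),
            })
    return rows
```

The flattened rows can then be written to whatever datastore is preferred, one row per accelerator per sample.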
