Can we gather GPU metrics for the host system?
Currently, only the mesos plugin, AFAIK, has any method of reporting GPU utilization.
Much like CPU and memory utilization, it would be great to get host-level metrics on how much a GPU is being utilized.
The main use case, from the person in charge of the hardware, is to know whether additional hardware is required for our software to run optimally. Also, from a development standpoint, the metrics gathered could be used by developers to better optimize their software.
:+1:
Out of curiosity what kind of GPUs are you looking to query from? I've been working on something for nvidia-related setups, not sure how hard that would be to extend beyond that.
Currently looks like Nvidia.
I'm afraid this will be vendor-specific. For nvidia you can use something like:
$ nvidia-smi --format=csv --query-gpu=power.draw,utilization.gpu,fan.speed,temperature.gpu
power.draw [W], utilization.gpu [%], fan.speed [%], temperature.gpu
11.04 W, 0 %, 33 %, 34
or without header and units:
$ nvidia-smi --format=csv,noheader,nounits --query-gpu=power.draw,utilization.gpu,fan.speed,temperature.gpu
10.80, 0, 33, 34
and just parse comma separated values.
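As a rough illustration of the "just parse comma separated values" step, here is a minimal Python sketch. It assumes the same four fields, in the same order, as the second command above; the function name `parse_gpu_csv` and the choice to map non-numeric values to `None` are illustrative assumptions, not part of any existing plugin.

```python
import csv
import io

# Field names mirror the --query-gpu list used above (assumed order).
FIELDS = ["power.draw", "utilization.gpu", "fan.speed", "temperature.gpu"]


def parse_gpu_csv(text, fields=FIELDS):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for record in reader:
        if not record:
            continue  # skip blank lines
        values = {}
        for name, raw in zip(fields, record):
            try:
                values[name] = float(raw.strip())
            except ValueError:
                # Assumption: any non-numeric value is treated as missing.
                values[name] = None
        rows.append(values)
    return rows
```

Each output line corresponds to one GPU, so a multi-GPU host yields a list with several dicts.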
Can we still add it, even if it's vendor-specific?
Some plugins already exist (Windows): https://dev.sigpipe.me/dashie/telegraf-plugins/src/master/nvidia_smi/nvidia_smi.go
Vendor specific is not a problem. BTW, for radeon cards some of this is in /sys/class/drm/card0/device/ as well as through lm-sensors.
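For the Radeon case, a sketch of reading values from that sysfs directory might look like the following. The file names are assumptions on my part: `gpu_busy_percent` and `hwmon/hwmon*/temp1_input` (millidegrees Celsius) are what the amdgpu driver typically exposes, but older radeon drivers or other kernels may not provide them. The function and key names are hypothetical.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a sysfs-style file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def read_radeon_metrics(device_dir="/sys/class/drm/card0/device"):
    """Collect whatever GPU metrics are present under device_dir."""
    device = Path(device_dir)
    metrics = {}

    # Assumption: amdgpu exposes utilization as a percentage in this file.
    busy = read_int(device / "gpu_busy_percent")
    if busy is not None:
        metrics["utilization_gpu"] = busy

    # Assumption: the first hwmon temp1_input is the GPU temperature,
    # reported in millidegrees Celsius as is usual for hwmon.
    for temp_file in sorted(device.glob("hwmon/hwmon*/temp1_input")):
        millideg = read_int(temp_file)
        if millideg is not None:
            metrics["temperature_gpu"] = millideg / 1000.0
            break

    return metrics
```

Missing files are simply skipped, so the same function returns an empty dict on hosts without such a card.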
I use inputs.exec to collect GPU metrics and created a dashboard to display them. However, it seems that the dashboard can't be displayed on the host page together with CPU, disk, and other system metrics?
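For readers wondering what an inputs.exec collector could look like, here is a minimal sketch (not the script used in this comment). It converts `nvidia-smi` CSV rows into InfluxDB line protocol, which inputs.exec can ingest when its `data_format` is set to `"influx"`. The measurement name `gpu`, the `index` tag, and the field names are illustrative assumptions.

```python
import subprocess

# "index" is queried first so each GPU can be tagged; the rest follow the
# fields used earlier in this thread.
QUERY = ["index", "power.draw", "utilization.gpu", "fan.speed", "temperature.gpu"]


def to_line_protocol(row, measurement="gpu"):
    """Turn one CSV row into a line-protocol string, or None if no field parses."""
    index, *values = [cell.strip() for cell in row.split(",")]
    fields = []
    for name, raw in zip(QUERY[1:], values):
        try:
            number = float(raw)
        except ValueError:
            continue  # assumption: skip values that are not numeric
        fields.append(f"{name.replace('.', '_')}={number}")
    if not fields:
        return None
    # No timestamp is written, so the receiver assigns the collection time.
    return f"{measurement},index={index} {','.join(fields)}"


def main():
    """Run nvidia-smi once and print one line per GPU."""
    output = subprocess.run(
        ["nvidia-smi", "--format=csv,noheader,nounits",
         "--query-gpu=" + ",".join(QUERY)],
        capture_output=True, text=True, check=True,
    ).stdout
    for row in output.splitlines():
        if row.strip():
            line = to_line_protocol(row)
            if line:
                print(line)
```

The sketch only defines the functions; a script referenced from the `commands` list of inputs.exec would end with a call to `main()`.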
@samson-wang I can't help with the dashboard problem here; why don't you try asking over at the InfluxData Community site? Make sure to mention what application you are using to create your dashboards.
@danielnelson Thanks for your tip. It's my first day using the TICK stack.
I use Chronograf as my frontend. Will the GPU metrics be added to the default host view in v1.4.0? Or is that entirely Chronograf's decision?
@samson-wang You should drop a feature request in Chronograf's GitHub. The "default" host view is generated from existing measurements, so if you are collecting data on that host it will show up. Default views of the data still need to be created; if you have some queries that work for you, this would be super easy.
The docs are at https://github.com/influxdata/chronograf/blob/master/docs/LAYOUT.md
@samson-wang -- also, it would be great to understand what stats you are gathering and using from the GPUs. Can you share more details? You are currently using a generic collection mechanism. To render stats from a generic source is going to be tricky in Chronograf - but if we can get some additional inputs/insight and engagement that would help determine how best to proceed with building a more specific plug-in that could be easily rendered. Thanks!
@nhaugo Thanks for your tip. I created the gpu pre-canned template. It works great.
@timhallinflux I would like to monitor GPU memory usage and other stats. Though it's not a final version, you can refer to https://gist.github.com/samson-wang/6de1f19c0bea3741c150a3b54fd97dd7
If you don't mind experimental native code, I've thrown this together: https://github.com/influxdata/telegraf/compare/master...jbboehr:nvml?expand=1
It requires being built with -tags nvml and uses my fork of https://github.com/davidr/go-nvml
We're in the process of switching from newrelic, where we previously used https://github.com/jbboehr/newrelic-nvidia-plugin, to signalfx.
This is a working plugin for nvidia-smi. Would be nice to see it merged:
https://github.com/datamachines/telegraf-nvidia-smi
I've got a PR open that turns the above code into a plugin.
I tried using the nvidia-smi on Windows 10 and didn't have any luck. Is it included in the 1.6.4 download?
@duffyjp -- No. It's in the 1.7rc's and soon 1.7 final.
@timhallinflux cool thanks. Looking forward to it!
1.7 is available now.