I'm working on collecting metrics from ZFS disk pools and using them in a system like Prometheus's node_exporter: things like the number of spares available and in use, the number of disks currently being repaired, the number of corrupted files, and so on.
As far as I can tell, there's no standard way to retrieve these metrics and use them, apart from running shell commands and parsing their output.
An idea I had to solve this was to publish these metrics to sysfs and have node_exporter read them from there. If such a feature does not exist, I would be happy to implement it in ZFS.
sysfs really doesn't work well for this data. Most of the stats are currently
available via procfs and node_exporter can get them.
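For example, the kstats under /proc/spl/kstat/zfs are plain text and trivial to read; a rough sketch (the exact header layout can vary by OpenZFS version):

```python
# Rough sketch of reading an OpenZFS kstat from procfs; the first two lines
# of each kstat file are header rows, the rest are "name type value" entries.
def read_kstat(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:   # skip the kstat header lines
            name, _kind, value = line.split()
            stats[name] = int(value)
    return stats

print(read_kstat()["size"])   # current ARC size in bytes
```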
The pool configuration information and stats, as seen via the zpool
command, are gathered via an ioctl. I'm considering adding
https://github.com/richardelling/zpool_prometheus
to the ZFS repo under contrib, once it stabilizes a bit. Check it out and see
if that does what you want.
Also, all of the sysfs symbols are GPL-only exports, which the CDDL-licensed ZFS modules can't use.
Thanks everyone for your feedback!
@richardelling If I understand correctly, the repo you've linked essentially collects the metrics printed by zpool status and prints them to STDOUT in a Prometheus-friendly fashion?
In general, think of it as a zpool command replacement for TSDBs.
It collects, at a minimum, the metrics you'll see in zpool status and zpool iostat.
NB, it is not feasible to fully screen-scrape zpool status.
I've found this repo that exposes a lot of metrics that I require, and then some.
Is it acceptable if I implement similar metric-fetch calls and publish the resulting metrics to /proc/spl/kstat/zfs, so that node_exporter can read them from there?
I think this method makes the most sense, rather than running a separate binary and relying on its output to collect metrics.
As I tried to explain in the zpool_prometheus readme, it is not suitable to put an ioctl
reader in generic collectors like node_exporter, because they can block forever. So it
should remain an external, single-purpose program that can hang without impacting
other things.
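As an illustration of "can hang without impacting other things", a caller would wrap the external program with a timeout; this is only a sketch, and the zpool_prometheus invocation shown here is an assumption about how such a collector would be run:

```python
# Sketch: run the external ioctl-based collector in its own process with a
# timeout, so a hung ioctl costs one scrape instead of wedging the exporter.
import subprocess

def scrape_zfs(timeout=10):
    proc = subprocess.Popen(["zpool_prometheus"], stdout=subprocess.PIPE,
                            text=True)
    try:
        out, _ = proc.communicate(timeout=timeout)
        return out
    except subprocess.TimeoutExpired:
        proc.kill()   # best effort; a child stuck in an ioctl may linger
        return ""     # skip this scrape rather than block the caller
```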
That said, if what you really want is a single program serving a Prometheus-style
endpoint, then:
Solving 2 & 3 adds technical debt, as the C-to-Python and C-to-Go interfaces must now
be maintained in coordination with core C changes, while many of the core C devs aren't
also delivering Python and Go projects.
I'm open to working on this sort of project and I've scoped some of the work to do all of
the above. Meanwhile, does zpool_prometheus do what you want and can it be an
interim solution?
@richardelling If I understand this correctly, the information I need has to be collected using an ioctl call, which could potentially block forever, and thus it does not make sense for it to be included as part of the ZFS repo?
If that's the case, then I agree that a stand-alone binary that collects the metrics would be the better approach. The repo that I linked above already collects most of the metrics I need, and it is written in Go, so I think I will use it in conjunction with node_exporter.
Thank you!
Cool. I've got some updates for node_exporter to collect more ZFS stats; I'll try
to send a PR soon. These will be similar to https://github.com/richardelling/telegraf/tree/zfs_linux_4/plugins/inputs/zfs,
which includes objset performance stats.
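Parsing those per-dataset objset kstats is straightforward; roughly like this (the path layout and field names are as of recent OpenZFS releases, so treat it as a sketch):

```python
# Sketch: gather per-dataset objset kstats for a pool from procfs.
# Each objset-* file has two header lines, then "name type value" rows,
# including a dataset_name entry identifying the dataset.
import glob

def objset_stats(pool):
    results = {}
    for path in glob.glob(f"/proc/spl/kstat/zfs/{pool}/objset-*"):
        fields = {}
        with open(path) as f:
            for line in f.readlines()[2:]:   # skip the kstat header lines
                name, _kind, value = line.split(None, 2)
                fields[name] = value.strip()
        results[fields.get("dataset_name", path)] = fields
    return results

# e.g. objset_stats("tank")["tank/home"]["nwritten"]  -> bytes written
```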
Similarly, cstor commit https://github.com/openebs/cstor/commit/4c1ad8131d1c7c38b2cd8e39b4901832427d1cc7
adds a zpool dump command to output the config as raw JSON. This is definitely not user-friendly, but it is simple to implement.
@richardelling After a lot of deliberation, I've come to the conclusion that the approach that makes the most sense for me would be to extend pyzfs and create an interface for libzfs in Python, similar to how libzfs_core has been implemented using CFFI.
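Roughly, the CFFI side I'm picturing looks like this (an untested sketch; the soname and the exact prototypes still need to be checked against the installed libzfs headers):

```python
# Untested sketch of a minimal CFFI binding for the pool-listing part of
# libzfs; declarations are abbreviated from the libzfs headers.
import cffi

ffi = cffi.FFI()
ffi.cdef("""
typedef struct libzfs_handle libzfs_handle_t;
typedef struct zpool_handle zpool_handle_t;
typedef int (*zpool_iter_f)(zpool_handle_t *, void *);

libzfs_handle_t *libzfs_init(void);
void libzfs_fini(libzfs_handle_t *);
int zpool_iter(libzfs_handle_t *, zpool_iter_f, void *);
const char *zpool_get_name(zpool_handle_t *);
""")
libzfs = ffi.dlopen("libzfs.so.2")   # soname differs across distros

pools = []

@ffi.callback("int(zpool_handle_t *, void *)")
def _append_pool(zhp, _data):
    pools.append(ffi.string(libzfs.zpool_get_name(zhp)).decode())
    return 0

hdl = libzfs.libzfs_init()
try:
    libzfs.zpool_iter(hdl, _append_pool, ffi.NULL)
finally:
    libzfs.libzfs_fini(hdl)

print(pools)
```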
Using the new Python interface for libzfs, I then plan to export the metrics I require to a text file and have node_exporter pick them up from there, with the script run periodically (60 s interval).
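For the text-file step, this is essentially the node_exporter textfile collector pattern; a minimal sketch, where the metric name and the get_pool_health() helper are placeholders for whatever the libzfs interface ends up returning:

```python
# Minimal sketch of the textfile step, assuming node_exporter is started with
# --collector.textfile.directory pointing at the directory used below.
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

def get_pool_health():
    # Placeholder: would come from the libzfs interface sketched above.
    return {"tank": 0, "backup": 1}   # 0 = ONLINE, non-zero = degraded/faulted

registry = CollectorRegistry()
health = Gauge("zfs_pool_health", "Pool health (0 = ONLINE)",
               ["pool"], registry=registry)
for pool, state in get_pool_health().items():
    health.labels(pool=pool).set(state)

# write_to_textfile() writes to a temp file and renames it, so node_exporter
# never reads a partially written file.
write_to_textfile("/var/lib/node_exporter/textfile_collector/zfs.prom", registry)
```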
Obviously, I will create a PR to merge the Python interface for libzfs into pyzfs.
I'm still deliberating on how best to do this :-)
I think you have a good approach. If the daemon hangs, then it won't affect node_exporter.
Others can add info on pyzfs, but IIRC it was originally intended to be a libzfs_core consumer only.
Those are stable interfaces, whereas libzfs is not stable. Once you get it to work, let's discuss
what would be required to add a libzfs_core stable interface for those metrics.
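For comparison, the pyzfs surface today is just the stable lzc_* layer; a consumer looks roughly like this (the dataset name is made up):

```python
import libzfs_core as lzc   # pyzfs, from contrib/pyzfs in the ZFS repo

# Stable libzfs_core call: does the named dataset exist?
print(lzc.lzc_exists(b"tank/home"))
```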
@richardelling I've gotten quite far into creating an interface for libzfs in Python 3 using CFFI. I've also created a corresponding node exporter using the Prometheus client libs.
However, I've run into an issue whereby I cannot get the list of all the zpools + datasets using libzfs. Do you know of any way to do so?
It is difficult to do design work in GitHub issues; can you contact me directly at richard.[email protected]?
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.