There are currently many metrics which don't follow our naming guidelines, e.g. they're missing _total suffixes or units like _bytes.
This will be a breaking change, but for the better. As node_exporter is one of the most popular exporters, it should also lead by example.
@juliusv @brian-brazil
One thing to keep in mind is that we're not going to be able to apply this to all metrics that we're exporting here, as it's not practical to determine which are/aren't counters and what the units are (e.g. netstat). There's a balance with all exporters between maintainability and exposition perfection. Anything procedurally generated usually ends up on the maintainability end of things. We should be able to fix up most of the core metrics though.
it is quite unfortunate that difference exist even in a single collector between platforms, such as node_memory_Active for linux meminfo vs node_memory_active for freebsd meminfo module.
the longer the differences exist, the more painful it will be to switch...
@mischief That's one of the ones we can't fix unfortunately. I'd imagine memory is also one of the areas where the semantics will be subtly different.
@brian-brazil what does 'cant fix' mean here? not willing to change because of breaking compat?
@mischief What we have to work with is /proc/meminfo, which has 42 fields which appear to be largely undocumented and it's not clear how that file has changed in the past and will change in the future. In that sort of situation the best thing to do is dump them out with minimal munging.
What if in this case we at least converted CamelCase to under_scores (and downcase everything in general)? There are already some other underscores in that file, but the result should still be ok. Or too risky?
Generally it's a little risky to go to underscores for a dump like that, as the original name is no longer apparent to be searched for online. In this case we've also a few that'd end up messy like NFS_Unstable, HugePages_Total, Hugepagesize and DirectMap2M.
The original name can be put in the Help text. I'd vote to export metrics according to our naming standards.
Hmm, even with the name in the help text, huge_pages_total vs. hugepage_size would still look messy, and the _total also would make it look like a counter.
I'd vote to export metrics according to our naming standards.
The challenge is that requires hand choosing metric names in the case of exporters, so it's not practical when more than a handful of metrics are involved (and that's presuming they don't change much over time).
This problem is not unique to the node exporter, it's a tradeoff we have to make with all exporters.
Hum thought I commented after setting to accepted:
After thinking about this again and again I also think we should expose them according to our naming standards. I know that it requires some maintenance to keep this up to date but I don't think it's that much work either and I believe those changes are usually straight forward and something people can easily fix and contribute back.
The remaining question is just when and how to change it. It's a lot of work and IMO requires some test infra to make sure all collectors still work.
There are over 450 metrics exposed by default, and we don't even know the names of all of them in advance. That's more than a little maintenance.
@brian-brazil We don't necessarily need to change all but we should do it at least for the most common ones (cpu/memory/disk/net/).
Part of the consolidation should be having consistent metric names for the same things, see #414
We can only sanely fix things where we fully decide the name in the first place. CPU is one of those, memory is not.
Would it be useful to investigate how some of the older host monitoring solutions (Zabbix, etc) do this? They may have figured out a practical approach which also lends itself well to cross platform capable dashboards.
Based on what I've seen, I imagine they're hand picking 10-20 metrics. We're at a scale where that doesn't work.
Cross-platform is very unlikely to work for anything beyond the simplest of metrics. Even something like load average varies significantly across platforms.
No worries at all. :smile:
I'd like to point out that the decision to rename node exporter metrics only to better follow naming conventions, might make sense from a developer point of view, but from a user point of view this is going to cause a catastrophic experience. I wish not breaking stuff for users would be considered more important than cosmetic considerations.
How are people supposed to deploy these changes? Deploy new dashboards and new node exporters on all hosts at the same time?
@schweikert Dashboards like Grafana can be configured to make multiple queries. You can set them up to query both the old and new names, which will provide a reasonably seamless
Same goes for recording rules, you can setup rules to deal with both names, and even record the new name into the old rule name.
For example:
rules:
# The count of CPUs per node, useful for getting CPU time as a percent of total.
- record: instance:node_cpus:count
expr: count(node_cpu{mode="idle"}) without (cpu,mode)
- record: instance:node_cpus:count
expr: count(node_cpu_seconds_total{mode="idle"}) without (cpu,mode)
You could also use or in a single expression.
@SuperQ, @brian-brazil: thanks for the tips, that's good to know. I still worry about how users are going to take this. The reputation cost there is going to be quite significant.
@schweikert It's definitely painful, but it's a tradeoff between having broken metric names forever or having this temporary pain once (and as long as it's still a 0.x version, might not be possible later, depending on metrics stability guarantees in a 1.x version).
From a usage and teaching perspective the node_cpu metric has always annoyed me, as I had to explain to people why it was named that way, and that it's not how you're supposed to name metrics in Prometheus... and yet it's one of the most commonly used metrics.
Maybe it would help the transition to have an option to publish the old metric names as well, in addition to the new ones.
I am thinking of our company, where we publish a node exporter RPM package for consumption by various teams. If we would publish a new package with the new metric names in the yum repo, it would immediately start breaking things for our users.
If we had an option to expose both metric names, we could communicate that there is a transition period, and remove that option only a few months later in our package.
Doubling the load on user's Prometheus servers would likely take many of them out. There's always metric_relabel_configs if you want a smoother transition.
The metric_relabel_configs option does not provide our desired properties of a smooth transition. We need to keep our dashboards accessible for historical data. I re-opened #830 and added a comment with a few solutions I see to support the upgrade process.
Most helpful comment
it is quite unfortunate that difference exist even in a single collector between platforms, such as
node_memory_Activefor linux meminfo vsnode_memory_activefor freebsd meminfo module.the longer the differences exist, the more painful it will be to switch...