Node_exporter: Consolidate metrics names to adhere to our guide lines

Created on 30 Oct 2015 · 26Comments · Source: prometheus/node_exporter

There are currently many metrics which don't follow our naming guidelines, e.g. they're missing _total suffixes or units like _bytes.

This will be a breaking change, but for the better. As node_exporter is one of the most popular exporters, it should also lead by example.

@juliusv @brian-brazil

accepted enhancement

Source

grobie

Most helpful comment

it is quite unfortunate that difference exist even in a single collector between platforms, such as node_memory_Active for linux meminfo vs node_memory_active for freebsd meminfo module.

the longer the differences exist, the more painful it will be to switch...

mischief on 30 Oct 2015

👍5

All 26 comments

One thing to keep in mind is that we're not going to be able to apply this to all metrics that we're exporting here, as it's not practical to determine which are/aren't counters and what the units are (e.g. netstat). There's a balance with all exporters between maintainability and exposition perfection. Anything procedurally generated usually ends up on the maintainability end of things. We should be able to fix up most of the core metrics though.

brian-brazil on 30 Oct 2015

it is quite unfortunate that difference exist even in a single collector between platforms, such as node_memory_Active for linux meminfo vs node_memory_active for freebsd meminfo module.

the longer the differences exist, the more painful it will be to switch...

mischief on 30 Oct 2015

👍5

@mischief That's one of the ones we can't fix unfortunately. I'd imagine memory is also one of the areas where the semantics will be subtly different.

brian-brazil on 30 Oct 2015

@brian-brazil what does 'cant fix' mean here? not willing to change because of breaking compat?

mischief on 30 Oct 2015

@mischief What we have to work with is /proc/meminfo, which has 42 fields which appear to be largely undocumented and it's not clear how that file has changed in the past and will change in the future. In that sort of situation the best thing to do is dump them out with minimal munging.

brian-brazil on 30 Oct 2015

What if in this case we at least converted CamelCase to under_scores (and downcase everything in general)? There are already some other underscores in that file, but the result should still be ok. Or too risky?

juliusv on 30 Oct 2015

Generally it's a little risky to go to underscores for a dump like that, as the original name is no longer apparent to be searched for online. In this case we've also a few that'd end up messy like NFS_Unstable, HugePages_Total, Hugepagesize and DirectMap2M.

brian-brazil on 30 Oct 2015

The original name can be put in the Help text. I'd vote to export metrics according to our naming standards.

grobie on 30 Oct 2015

Hmm, even with the name in the help text, huge_pages_total vs. hugepage_size would still look messy, and the _total also would make it look like a counter.

juliusv on 30 Oct 2015

I'd vote to export metrics according to our naming standards.

The challenge is that requires hand choosing metric names in the case of exporters, so it's not practical when more than a handful of metrics are involved (and that's presuming they don't change much over time).

This problem is not unique to the node exporter, it's a tradeoff we have to make with all exporters.

brian-brazil on 30 Oct 2015

Hum thought I commented after setting to accepted:
After thinking about this again and again I also think we should expose them according to our naming standards. I know that it requires some maintenance to keep this up to date but I don't think it's that much work either and I believe those changes are usually straight forward and something people can easily fix and contribute back.

The remaining question is just when and how to change it. It's a lot of work and IMO requires some test infra to make sure all collectors still work.

discordianfish on 4 Jan 2017

There are over 450 metrics exposed by default, and we don't even know the names of all of them in advance. That's more than a little maintenance.

brian-brazil on 4 Jan 2017

@brian-brazil We don't necessarily need to change all but we should do it at least for the most common ones (cpu/memory/disk/net/).

discordianfish on 10 Jan 2017

Part of the consolidation should be having consistent metric names for the same things, see #414

discordianfish on 10 Jan 2017

We can only sanely fix things where we fully decide the name in the first place. CPU is one of those, memory is not.

brian-brazil on 10 Jan 2017

Would it be useful to investigate how some of the older host monitoring solutions (Zabbix, etc) do this? They may have figured out a practical approach which also lends itself well to cross platform capable dashboards.

justinclift on 10 Jan 2017

Based on what I've seen, I imagine they're hand picking 10-20 metrics. We're at a scale where that doesn't work.

Cross-platform is very unlikely to work for anything beyond the simplest of metrics. Even something like load average varies significantly across platforms.

brian-brazil on 10 Jan 2017

No worries at all. :smile:

justinclift on 10 Jan 2017

I'd like to point out that the decision to rename node exporter metrics only to better follow naming conventions, might make sense from a developer point of view, but from a user point of view this is going to cause a catastrophic experience. I wish not breaking stuff for users would be considered more important than cosmetic considerations.

How are people supposed to deploy these changes? Deploy new dashboards and new node exporters on all hosts at the same time?

schweikert on 20 Mar 2018

@schweikert Dashboards like Grafana can be configured to make multiple queries. You can set them up to query both the old and new names, which will provide a reasonably seamless

Same goes for recording rules, you can setup rules to deal with both names, and even record the new name into the old rule name.

For example:

  rules:
  # The count of CPUs per node, useful for getting CPU time as a percent of total.
  - record: instance:node_cpus:count
    expr: count(node_cpu{mode="idle"}) without (cpu,mode)
  - record: instance:node_cpus:count
    expr: count(node_cpu_seconds_total{mode="idle"}) without (cpu,mode)

SuperQ on 20 Mar 2018

You could also use or in a single expression.

brian-brazil on 20 Mar 2018

@SuperQ, @brian-brazil: thanks for the tips, that's good to know. I still worry about how users are going to take this. The reputation cost there is going to be quite significant.

schweikert on 20 Mar 2018

@schweikert It's definitely painful, but it's a tradeoff between having broken metric names forever or having this temporary pain once (and as long as it's still a 0.x version, might not be possible later, depending on metrics stability guarantees in a 1.x version).

From a usage and teaching perspective the node_cpu metric has always annoyed me, as I had to explain to people why it was named that way, and that it's not how you're supposed to name metrics in Prometheus... and yet it's one of the most commonly used metrics.

juliusv on 20 Mar 2018

Maybe it would help the transition to have an option to publish the old metric names as well, in addition to the new ones.

I am thinking of our company, where we publish a node exporter RPM package for consumption by various teams. If we would publish a new package with the new metric names in the yum repo, it would immediately start breaking things for our users.

If we had an option to expose both metric names, we could communicate that there is a transition period, and remove that option only a few months later in our package.

schweikert on 21 Mar 2018

Doubling the load on user's Prometheus servers would likely take many of them out. There's always metric_relabel_configs if you want a smoother transition.

brian-brazil on 21 Mar 2018

The metric_relabel_configs option does not provide our desired properties of a smooth transition. We need to keep our dashboards accessible for historical data. I re-opened #830 and added a comment with a few solutions I see to support the upgrade process.

grobie on 21 Mar 2018

Was this page helpful?

0 / 5 - 0 ratings