Telegraf: Prometheus output reports Error: collected metric has label dimensions inconsistent with previously collected metrics in the same metric family

Created on 17 May 2017 · 8 Comments · Source: influxdata/telegraf

Bug report

Hello,

We have a regression between telegraf 1.2.1 and 1.3.0 (with the same configuration).

Relevant telegraf.conf:

[[outputs.prometheus_client]]
  listen = ":9126"
[agent]
 interval = "120s"
 debug    = false
[[inputs.ntpq]]
  dns_lookup = false

System info:

  • Telegraf v1.3.0 (git: release-1.3 2bc5594b44145368823d7aa78bfb753ab51e9235)
  • Ubuntu 16.04.2 LTS

Steps to reproduce:

  1. Install telegraf 1.3.0, with the configuration above
  2. Run curl http://localhost:9126/metrics five or more times
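
Step 2 can be scripted. The following is a minimal sketch, assuming the endpoint from the configuration above; `http_fetch`, `poll_metrics`, and the `fetch` parameter are hypothetical helpers introduced here for illustration and are not part of Telegraf. It requests the page several times and counts the responses that contain the collection error text.

```python
import urllib.error
import urllib.request

# Text that opens the error page quoted under "Additional info".
ERROR_MARKER = "An error has occurred during metrics collection"


def http_fetch(url):
    """Return the response body as text, including bodies of HTTP error responses."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Assumption: the error page may arrive with a non-2xx status,
        # so the body of the error response is read as well.
        return err.read().decode("utf-8", errors="replace")


def poll_metrics(url, attempts=5, fetch=http_fetch):
    """Fetch `url` `attempts` times and return how many responses contain the error."""
    failures = 0
    for _ in range(attempts):
        if ERROR_MARKER in fetch(url):
            failures += 1
    return failures
```

Calling `poll_metrics("http://localhost:9126/metrics")` on the affected host returns the number of failing scrapes out of five; a result above zero reproduces the problem.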

Expected behavior:

Telegraf should expose the collected metrics through the /metrics endpoint.

Actual behavior:

Telegraf fails to display any metrics and returns the error message shown below.

Additional info:

curl http://localhost:9126/metrics
An error has occurred during metrics collection:

5 error(s) occurred:
* collected metric ntpq_jitter label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_delay label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_poll label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:64 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_offset label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_reach label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 >  has label dimensions inconsistent with previously collected metrics in the same metric family

Use case:

  • This is a regression between telegraf 1.2.1 and 1.3.0

Thanks in advance!

bug regression

All 8 comments

Seems to be caused by the version change in github.com/prometheus/client_golang

This happens because the ntpq input generates points whose set of tag keys varies; in particular, the state_prefix tag key is not always present:

ntpq,refid=.POOL.,remote=3.debian.pool.n,stratum=16,type=p delay=0,jitter=0,offset=0,poll=64i,reach=0i 1495059325000000000
ntpq,refid=204.9.54.119,remote=209.242.224.117,state_prefix=-,stratum=2,type=u delay=66.056,jitter=0.681,offset=2.246,poll=1024i,reach=37i,when=298i 1495059325000000000

This can be verified by excluding the tag:

[[outputs.prometheus_client]]
  tagexclude = ["state_prefix"]
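
The mismatch can also be checked directly on the line-protocol output. The following is a minimal sketch, assuming simple points with no escaped commas or spaces; `tag_keys` and `inconsistent_measurements` are hypothetical helper names, and this is not code from Telegraf or client_golang. It groups points by measurement and reports tag keys that are missing from some points.

```python
from collections import defaultdict


def tag_keys(line):
    """Return (measurement, frozenset of tag keys) for one line-protocol point.

    Simplified parser: assumes no escaped commas or spaces in names or tags.
    """
    series = line.split(" ", 1)[0]
    measurement, *tags = series.split(",")
    return measurement, frozenset(tag.split("=", 1)[0] for tag in tags)


def inconsistent_measurements(lines):
    """Map each measurement with more than one distinct tag-key set to those sets."""
    seen = defaultdict(set)
    for line in lines:
        measurement, keys = tag_keys(line)
        seen[measurement].add(keys)
    return {m: sets for m, sets in seen.items() if len(sets) > 1}


# The two sample points quoted above.
points = [
    "ntpq,refid=.POOL.,remote=3.debian.pool.n,stratum=16,type=p delay=0,jitter=0,offset=0,poll=64i,reach=0i 1495059325000000000",
    "ntpq,refid=204.9.54.119,remote=209.242.224.117,state_prefix=-,stratum=2,type=u delay=66.056,jitter=0.681,offset=2.246,poll=1024i,reach=37i,when=298i 1495059325000000000",
]

for measurement, key_sets in inconsistent_measurements(points).items():
    union = frozenset().union(*key_sets)
    common = frozenset.intersection(*key_sets)
    print(measurement, "tag keys not present on every point:", sorted(union - common))
```

For the two sample points this reports `state_prefix` as the only tag key that is not present on every ntpq point, which matches the tag excluded in the configuration above.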

@danielnelson danielnelson modified the milestone: 1.3.2, 1.3.1 12 hours ago
:(((

@freeseacher Please take a look at #2857 and comment if that fix will work for you.

@danielnelson, yep, that fixes the issue for me.
Telegraf v26055d5 (git: fix-prometheus-output-labels 26055d5)
has been running for about an hour on ~40 servers without that bug.

@danielnelson, any updates?

I'm still getting reports that the fix is not sufficient, so I'm trying to get an improved version out today.

Merged fix; 1.3.2
