Hello,
We have a regression between Telegraf 1.2.1 and 1.3.0 (with the same configuration):
[[outputs.prometheus_client]]
listen = ":9126"
[agent]
interval = "120s"
debug = false
[[inputs.ntpq]]
dns_lookup = false
System info: Telegraf v1.3.0 (git: release-1.3 2bc5594b44145368823d7aa78bfb753ab51e9235), Ubuntu 16.04.2 LTS.
Steps to reproduce: run curl http://localhost:9126/metrics five or more times.
Expected behavior: Telegraf should expose the collected metrics through the /metrics endpoint.
Actual behavior: Telegraf fails to display any metrics and returns this error message:
curl http://localhost:9126/metrics
An error has occurred during metrics collection:
5 error(s) occurred:
* collected metric ntpq_jitter label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 > has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_delay label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 > has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_poll label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:64 > has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_offset label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 > has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_reach label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 > has label dimensions inconsistent with previously collected metrics in the same metric family
Thanks in advance!
This seems to be caused by the version change in github.com/prometheus/client_golang.
This happens because the ntpq input generates points whose set of tag keys changes; in particular, the state_prefix tag key is not always present:
ntpq,refid=.POOL.,remote=3.debian.pool.n,stratum=16,type=p delay=0,jitter=0,offset=0,poll=64i,reach=0i 1495059325000000000
ntpq,refid=204.9.54.119,remote=209.242.224.117,state_prefix=-,stratum=2,type=u delay=66.056,jitter=0.681,offset=2.246,poll=1024i,reach=37i,when=298i 1495059325000000000
This can be verified by excluding the tag:
[[outputs.prometheus_client]]
tagexclude = ["state_prefix"]
:(((
@freeseacher Please take a look at #2857 and comment if that fix will work for you.
@danielnelson, yep, that fixes the issue for me.
Telegraf v26055d5 (git: fix-prometheus-output-labels 26055d5)
It has been working for about an hour on ~40 servers without that bug.
@danielnelson, any updates?
I'm still getting reports that the fix is not sufficient; I'm trying to get an improved version out today.
Merged the fix for 1.3.2.