Telegraf: Telegraf stops reporting CPU and instead reports a 0 metric

Created on 13 Mar 2018 · 14 Comments · Source: influxdata/telegraf

Bug report

A previously working Telegraf / InfluxDB instance has stopped reporting CPU information and now reports 0 for the CPU metrics. No changes were made to the instance beforehand. The telegraf.log now logs the following repeatedly:

2018-03-13T20:01:50Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
2018-03-13T20:02:00Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
2018-03-13T20:02:10Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time

This is similar to #3555, except that InfluxDB is the output instead of Elasticsearch. It is also similar to #721, except that it does not resolve itself automatically.
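
For readers unfamiliar with the message, the check behind it can be sketched roughly as follows. This is a minimal Python illustration that assumes the plugin sums the per-state CPU time counters and compares the sum with the previous sample. The function names and sample values are hypothetical, and this is not Telegraf's actual Go source.

```python
# Hypothetical sketch of a "total CPU time went backwards" check.
# Field order follows the cpu line of /proc/stat on Linux.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def total_time(sample):
    """Sum every counter in one sample (a dict keyed by FIELDS)."""
    return sum(sample[name] for name in FIELDS)


def check_total(previous, current):
    """Raise if the summed counters decreased between two samples."""
    if total_time(current) < total_time(previous):
        raise ValueError(
            "current total CPU time is less than previous total CPU time")


# Made-up values: only the steal counter drops, yet the total drops with it.
prev = {**dict.fromkeys(FIELDS, 0), "idle": 1000, "steal": 500}
cur = {**dict.fromkeys(FIELDS, 0), "idle": 1100, "steal": 100}

try:
    check_total(prev, cur)
except ValueError as err:
    print("E!", err)
```

With a check of this shape, a single counter that moves backwards far enough is sufficient to make every collection interval fail.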

Relevant telegraf.conf:

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  ## The HTTP or UDP URL for your InfluxDB instance.  Each item should be
  ## of the form:
  ##   scheme "://" host [ ":" port]
  ##
  ## Multiple urls can be specified as part of the same cluster,
  ## this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://localhost:8089"] # UDP endpoint example
  urls = ["http://localhost:8086"] # required

[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states.
  report_active = false

System info:

AWS Linux Kernel 4.9.85-37.85 on a t2.medium instance
Telegraf version 1.5.2 (upgraded from 1.4.0)
InfluxDB version 1.5.0 (upgraded from 1.3.5)
GoLang version 1.9.2

Steps to reproduce:

Unknown, as this was working and suddenly broke with no known cause.

Expected behavior:

CPU metrics should be properly reported.

Actual behavior:

Zero (0) CPU Metric is being reported and an error is being logged.

Additional info:

The remediation steps suggested in other issues have been attempted:

Upgrading InfluxDB (with service restarts of the relevant services)
Upgrading Telegraf (with service restarts of the relevant services) as suggested in #3555
Cleaning tsm1_cache series from InfluxDB as suggested by the community site
Waiting two weeks for the issue to resolve itself (#721 suggests that it typically does, but it has not in this scenario)

Source

chronosis

All 14 comments

Per issue #2871 -- Rebooting seems to address the issue.

Here are the before and after of grep '^cpu' /proc/stat

Before

cpu  9318294 4818 1179580 1434778026 1136958 0 105134 1473836475502 0 0
cpu0 9318294 4818 1179580 1434778026 1136958 0 105134 1473836475502 0 0

After

cpu  945 0 249 21052 328 0 6 400 0 0
cpu0 945 0 249 21052 328 0 6 400 0 0

The server had an uptime of several months when the issue began occurring. It seems that the issue might be related to an integer overflow problem rather than the kernel issue suggested in #2871.
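
One quick sanity check on the overflow theory is to convert the counters in the "Before" line into wall-clock time. The sketch below assumes the eighth field is steal time (per the Linux /proc/stat layout) and that the counters tick 100 times per second (the usual USER_HZ, assumed here and not measured on this host).

```python
# Values copied from the "Before" cpu line above.
before = [9318294, 4818, 1179580, 1434778026, 1136958,
          0, 105134, 1473836475502, 0, 0]

USER_HZ = 100            # assumption: typical Linux tick rate for /proc/stat
SECONDS_PER_DAY = 86400

idle_days = before[3] / USER_HZ / SECONDS_PER_DAY
steal_days = before[7] / USER_HZ / SECONDS_PER_DAY

print(f"idle  ~ {idle_days:.0f} days")
print(f"steal ~ {steal_days:.0f} days")

# Compare the steal value against common counter widths.
print("steal above 2**32:", before[7] > 2**32)
print("steal below 2**64:", before[7] < 2**64)
```

Under those assumptions the idle counter corresponds to a few months, which fits the reported uptime, while the steal counter corresponds to several centuries on a host that shows a single CPU line. That points at a bad steal value rather than a counter that simply grew too large.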

chronosis on 13 Mar 2018

If you see this error again, can you cat /proc/stat twice before rebooting, separated by whatever amount of time your interval is set to? From what I have seen in the past, I think there is nothing we can really do to fix this. It seems to be an AWS issue, or at least that is the only place we see it where it does not recover.

I believe that when the values overflow you can see this message once.

I think we need an option to only collect raw values, so we can kick the can on this issue and deal with it at query time. This way we can continue to report in all situations.

danielnelson on 14 Mar 2018

It may take some time for this issue to occur again, because it took several months of uptime before it occurred the last time. Since rebooting resolved the issue for the moment, I propose this issue be closed for now. I can re-open it and update it with the relevant information when the error reappears.

chronosis on 14 Mar 2018

Perhaps when telegraf detects this, instead of aborting the gather, it could skip the usage_* fields. The time_* fields that it has gathered are still valid and usable, and can also be used for diagnosis on issues like this.
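
A rough sketch of that idea, in Python rather than Telegraf's Go, with a hypothetical gather function and made-up sample values. It always emits the raw time_* fields and only adds usage_* fields when the summed counters moved forward.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def gather(previous, current):
    """Return the metric fields for one interval.

    time_* fields are always emitted; usage_* fields are emitted only
    when the summed counters increased since the previous sample.
    """
    fields = {f"time_{name}": current[name] for name in FIELDS}
    total_delta = sum(current[n] - previous[n] for n in FIELDS)
    if total_delta <= 0:
        return fields  # counters regressed or stalled: skip usage_*
    for name in FIELDS:
        delta = current[name] - previous[name]
        fields[f"usage_{name}"] = 100.0 * delta / total_delta
    return fields


# Made-up samples: "ok" advances normally, "bad" has steal going backwards.
base = dict.fromkeys(FIELDS, 0)
prev = {**base, "user": 100, "idle": 1000, "steal": 500}
ok = {**base, "user": 110, "idle": 1090, "steal": 500}
bad = {**base, "user": 110, "idle": 1090, "steal": 100}

print(gather(prev, ok)["usage_idle"])
print(sorted(k for k in gather(prev, bad) if k.startswith("usage_")))
```

The sketch still reports a misleading usage value if one counter regresses by less than the others advance, so a real implementation would probably want a per-field check as well.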

phemmer on 14 Mar 2018

If this happens in the future, we would only be able to fix it if it is a rollover issue. If the /proc/stat values are just changing wildly I don't know how we could address it. We have seen this from time to time on our AWS infrastructure and it has always been the latter.

Let's keep this issue open and close it when we do the following:

Report all the raw cpu times first as @phemmer suggests
Provide an option to only collect raw times (in case you want to avoid the error message)
Add example queries that compute usage from raw times to the README.
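
As a rough illustration of the last item, usage can be derived from raw times after collection. The sketch below is Python rather than a real query, uses made-up samples, and the helper name is hypothetical. It drops any interval in which a counter went backwards, similar in spirit to a non-negative derivative.

```python
def idle_percent(samples):
    """Yield (timestamp, idle %) for each consecutive pair of samples.

    samples: list of (timestamp, {state: cumulative_ticks}) in time order.
    Intervals in which any counter decreased are skipped.
    """
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        deltas = {k: cur[k] - prev[k] for k in cur}
        if any(d < 0 for d in deltas.values()):
            continue  # a counter regressed: no trustworthy usage here
        total = sum(deltas.values())
        if total > 0:
            yield ts, 100.0 * deltas["idle"] / total


# Made-up raw times; steal drops between t=10 and t=20.
samples = [
    (0,  {"user": 100, "system": 50, "idle": 1000, "steal": 900}),
    (10, {"user": 120, "system": 60, "idle": 1070, "steal": 900}),
    (20, {"user": 130, "system": 65, "idle": 1155, "steal": 400}),
    (30, {"user": 150, "system": 75, "idle": 1225, "steal": 400}),
]

for ts, pct in idle_percent(samples):
    print(ts, round(pct, 1))
```

Because the raw values are stored, the bad interval only leaves a gap in the derived series instead of stopping collection.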

danielnelson on 14 Mar 2018

I was able to find another instance in our AWS environments where this behavior was identified. Below are the cat /proc/stat results separated by 10 seconds (the interval period). I am also including the uptime for reference.

Initial (0s)

cpu  5653582 827 3220601 1281540974 85144 0 90281 35312117399 0 0
cpu0 5653582 827 3220601 1281540974 85144 0 90281 35312117399 0 0
intr 899204760 66070952 9 0 0 1013 0 0 0 0 0 0 0 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 530618431 0 0 0 0 0 266 294809355 7663033 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 2366469154
btime 1508183562
processes 41902240
procs_running 1
procs_blocked 0
softirq 1112470762 0 428497139 192964335 101535996 0 0 41741 0 0 389431551

First Interval (+10s)

cpu  5653582 827 3220601 1281542136 85144 0 90281 11924027608 0 0
cpu0 5653582 827 3220601 1281542136 85144 0 90281 11924027608 0 0
intr 899205741 66071187 9 0 0 1013 0 0 0 0 0 0 0 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 530618915 0 0 0 0 0 266 294809613 7663037 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 2366471217
btime 1508183562
processes 41902273
procs_running 1
procs_blocked 0
softirq 1112471736 0 428497532 192964481 101536089 0 0 41742 0 0 389431892

Second Interval (+20s)

cpu  5653582 827 3220601 1281543220 85144 0 90281 1833527471829 0 0
cpu0 5653582 827 3220601 1281543220 85144 0 90281 1833527471829 0 0
intr 899206809 66071407 9 0 0 1013 0 0 0 0 0 0 0 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 530619394 0 0 0 0 0 266 294809976 7663043 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 2366473652
btime 1508183562
processes 41902322
procs_running 1
procs_blocked 0
softirq 1112472888 0 428497921 192964717 101536219 0 0 41743 0 0 389432288

Uptime

19:23:31 up 149 days, 23:30,  1 user,  load average: 0.00, 0.01, 0.00
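
To judge whether this looks like a rollover or like values changing wildly, the change in the eighth field (steal, assuming the standard /proc/stat field order) can be compared with the most ticks a CPU could accrue in one interval. The tick rate of 100 per second and the slack factor are assumptions.

```python
USER_HZ = 100      # assumption: typical tick rate for /proc/stat counters
INTERVAL_S = 10    # sampling interval stated above
NUM_CPUS = 1       # only cpu0 appears in the output
SLACK = 2          # samples were taken by hand, so allow timing error

# Eighth field of the aggregate cpu line at 0 s, +10 s and +20 s.
steal = [35312117399, 11924027608, 1833527471829]

max_ticks = USER_HZ * INTERVAL_S * NUM_CPUS * SLACK
deltas = [later - earlier for earlier, later in zip(steal, steal[1:])]

for delta in deltas:
    plausible = 0 <= delta <= max_ticks
    print(f"delta={delta:+d} plausible={plausible}")
```

The first delta is negative and the second is far larger than any interval could produce, so under these assumptions neither step looks like a counter that wrapped once and then carried on normally.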

chronosis on 15 Mar 2018

Here is another set of data --

cpu  5653582 827 3220601 1281558689 85144 0 90281 1504706785701 0 0
cpu0 5653582 827 3220601 1281558689 85144 0 90281 1504706785701 0 0

cpu  5653582 827 3220601 1281559710 85144 0 90281 1481358336740 0 0
cpu0 5653582 827 3220601 1281559710 85144 0 90281 1481358336740 0 0

cpu  5653582 827 3220601 1281560732 85144 0 90281 1461300076681 0 0
cpu0 5653582 827 3220601 1281560732 85144 0 90281 1461300076681 0 0

The eighth value (the steal counter), and with it the total CPU time, seems to be counting down instead of up --
1504706785701 > 1481358336740 > 1461300076681

So it does appear to be an issue at the kernel level on servers with high uptime.
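
To see which counter is responsible, the three cpu lines above can be compared field by field. The sketch assumes the field order documented for the Linux /proc/stat cpu line (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice).

```python
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")

# The three aggregate cpu lines from this comment, in order.
raw = """\
cpu  5653582 827 3220601 1281558689 85144 0 90281 1504706785701 0 0
cpu  5653582 827 3220601 1281559710 85144 0 90281 1481358336740 0 0
cpu  5653582 827 3220601 1281560732 85144 0 90281 1461300076681 0 0
"""


def parse(line):
    """Turn one cpu line into a dict keyed by field name."""
    parts = line.split()
    return dict(zip(FIELDS, map(int, parts[1:])))


samples = [parse(line) for line in raw.splitlines()]

# A field is flagged if it decreased between any two consecutive samples.
decreasing = [name for name in FIELDS
              if any(b[name] < a[name] for a, b in zip(samples, samples[1:]))]
print(decreasing)
```

Only one field is flagged, and it moves backwards by far more than the idle counter moves forwards, which is why the summed total goes down.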

chronosis on 15 Mar 2018

This is a kernel bug. Looks like there are a few discussions around on the subject:
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/
https://bugs.launchpad.net/linux/+bug/1494350
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785557;msg=61

phemmer on 15 Mar 2018

@phemmer thanks for the confirmation. I've reported the bug to the AWS Kernel team.

chronosis on 15 Mar 2018

❤1

Seeing this on Windows hosts - not VMs, but bare-metal servers.

lgwapnitsky on 7 May 2018

👍2

@chronosis is there a link that we can track to check progress on the AWS side?

leucos on 1 Jun 2018

👍2

@leucos -- Unfortunately, it's behind the Amazon AWS Support portal. But the most recent update from the support team was on April 15th, 2018 --

Thank you for your patience.

I have been actively following up with our internal team regarding a fix for this issue. And as per the latest update, they have created a fix for the steal time bug and it is expected to be deployed in stages. Please note that we, here at Amazon, follow systematic and quality testing procedures before deploying any fix and hence, it is expected to take some time.

We will update you once the fix has been successfully deployed. Thank you again for your patience while working through this issue.

I'll try to provide an update here when the AWS team closes or updates the support ticket on their end

chronosis on 1 Jun 2018

👍4

Update for anyone experiencing this on AWS -- The AWS Kernel team provided the following update on the 8th of July, 2018 --

AWS Internal team have deployed the patch to fix this issue on the underlying hosts. However, you will need to reboot any affected instances for the Kernel updates to take effect.

After some testing, it appears that if the AWS team hasn't hot-patched your kernel for you, then you may need to move any deployments to the updated kernel image where they have released the fix.

chronosis on 16 Jul 2018

👍2

Hi @chronosis, may I know which kernel version fixed the bug?