Telegraf: Logparser plugin don't process new lines after telegraf configuration reload.

Created on 12 Dec 2017  路  3Comments  路  Source: influxdata/telegraf

Bug report

The newly started telegraf works well, but after configuration reload via SIGHUP, new lines in log aren't parsed or sent to output plugin. Other metrics aren't affected.

Relevant telegraf.conf:

/etc/telegraf/telegraf.conf

[global_tags]

[agent]
  hostname = "hostname"
  omit_hostname = false
  interval = "30s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "30s"
  flush_jitter = "0s"
  logfile = ""
  debug = false
  quiet = false

#
# OUTPUTS:
#
[[outputs.influxdb]]
  database = "telegraf"
  insecure_skip_verify = true
  urls = ["https://influx.hostname"]

#
# INPUTS:
#
[[inputs.cpu]]
  percpu = true
  totalcpu = true
[[inputs.disk]]
[[inputs.io]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.swap]]
[[inputs.system]]

/etc/telegraf/telegraf.d/logparser.conf

[[inputs.logparser]]
  files = ["/var/log/nginx/web-*-access.log"]
  from_beginning = false
[inputs.logparser.grok]
  measurement = "nginx_access_log"
  patterns = ["\\[[0-9a-z-]+\\] %{CLIENT:client_ip} - %{NOTSPACE} \\[%{HTTPDATE:timestamp:ts-httpd}\\] \"(?:%{WORD:verb:tag} %{NOTSPACE:request}(?: HTTP/%{NUMBER:http_version:float})?|%{DATA})\" %{NUMBER:resp_code:tag} (?:%{NUMBER:resp_bytes:int}|-) \"%{NOTSPACE:referer}\" \"[^\"]+\" \"(?:%{CLIENT:forwarded_for}|-)\" \"(?:%{CLIENT:real_ip}|-)\" %{NOTSPACE} rt=%{NUMBER:request_time:float} uht=(?:%{NUMBER:upstream_header_time:float}|-) urt=(?:%{NUMBER:upstream_response_time:float}|-)"]
  timezone = ""

System info:

Centos 7.2
Telegraf 1.4.3,1.4.5

Steps to reproduce:

  1. Start telegraf
  2. Write new line to log file: echo '[-] 192.168.1.1 - - [12/Dec/2017:14:00:47 +0300] "GET /app HTTP/1.1" 501 178 "-" "axios/0.16.2" "-" "127.0.0.1" (.) rt=1.001 uht=1.001 urt=1.001' >> /var/log/nginx/web-api-access.log
  3. Wait 30 seconds and select last value from influx select last("request_time") from "nginx_access_log" limit 1 must be 1.001
  4. Reload telegraf: kill -HUP $PID
  5. Write new line: echo '[-] 192.168.1.1 - - [12/Dec/2017:14:05:47 +0300] "GET /app HTTP/1.1" 502 178 "-" "axios/0.16.2" "-" "127.0.0.1" (.) rt=2.002 uht=2.002 urt=2.002' >> /var/log/nginx/web-api-access.log
  6. Wait 30 seconds and select again. New value must be 2.002, but 1.001 is still here.

Additional info:

Telegraf log output:

Dec 12 14:46:33 web2 [1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Dec 12 14:46:33 web2 systemd[1]: Starting The plugin-driven server agent for reporting metrics into InfluxDB...
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Starting Telegraf v1.4.5
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Loaded outputs: influxdb
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Loaded inputs: inputs.system inputs.cpu inputs.disk inputs.diskio inputs.mem inputs.net inputs.swap inputs.logparser
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Tags enabled: host=web2
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Agent Config: Interval:30s, Quiet:false, Hostname:"web2", Flush Interval:30s
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Reloading Telegraf config
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Hang on, flushing any cached metrics before shutdown
Dec 12 14:51:41 web2 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics into InfluxDB.
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Starting Telegraf v1.4.5
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Loaded outputs: influxdb
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Loaded inputs: inputs.net inputs.swap inputs.system inputs.cpu inputs.disk inputs.diskio inputs.mem inputs.logparser
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Tags enabled: host=web2
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Agent Config: Interval:30s, Quiet:false, Hostname:"web2", Flush Interval:30s
aretail bug

Most helpful comment

To work around bug https://github.com/influxdata/telegraf/issues/3906 I do a service telegraf reload in my preporcessing script (that works around bug https://github.com/influxdata/telegraf/issues/3613 ). This leads to the effect that my 2nd logparser stops working.
I have one logparser which uses from_beginning = true and reads a preprocessed file. This one still works after the SIGHUP.
But the second logparser that reads from the live /var/log/nginx/access.log in conjunction with the basicstats aggregator stops working after SIGHUP. Here I use from_beginning = false.
Bothh logparsers use inotify.

All 3 comments

To work around bug https://github.com/influxdata/telegraf/issues/3906 I do a service telegraf reload in my preporcessing script (that works around bug https://github.com/influxdata/telegraf/issues/3613 ). This leads to the effect that my 2nd logparser stops working.
I have one logparser which uses from_beginning = true and reads a preprocessed file. This one still works after the SIGHUP.
But the second logparser that reads from the live /var/log/nginx/access.log in conjunction with the basicstats aggregator stops working after SIGHUP. Here I use from_beginning = false.
Bothh logparsers use inotify.

This issue is still valid :(

I have found the issue though. In hpcloud/tail (influxdata/tail) there is a problem:

In this function, it is possible for shared.watchNums[fname] to go negative.

When debugging, I found that it would decrement once because of tomb.Tomb's Dying() triggering:

github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.remove at inotify_tracker.go:102
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.RemoveWatch at inotify_tracker.go:87
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.(*InotifyFileWatcher).ChangeEvents.func1 at inotify.go:94
runtime.goexit at asm_amd64.s:1337
 - Async stack trace
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.(*InotifyFileWatcher).ChangeEvents at inotify.go:77

and again due to tail.Cleanup() being called

github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.remove at inotify_tracker.go:102
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.RemoveWatch at inotify_tracker.go:87
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.Cleanup at inotify_tracker.go:129
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail.(*Tail).Cleanup at tail.go:453
github.com/influxdata/telegraf/plugins/inputs/tail.(*Tail).Stop at tail.go:219
github.com/influxdata/telegraf/agent.(*Agent).stopServiceInputs at agent.go:651
github.com/influxdata/telegraf/agent.(*Agent).Run.func1 at agent.go:75
runtime.goexit at asm_amd64.s:1337
 - Async stack trace
github.com/influxdata/telegraf/agent.(*Agent).Run at agent.go:66
Was this page helpful?
0 / 5 - 0 ratings

Related issues

corentingi picture corentingi  路  3Comments

fahimeh2010 picture fahimeh2010  路  3Comments

aihysp picture aihysp  路  3Comments

mrcheeky123 picture mrcheeky123  路  3Comments

IxDay picture IxDay  路  3Comments