The newly started telegraf works well, but after configuration reload via SIGHUP, new lines in log aren't parsed or sent to output plugin. Other metrics aren't affected.
/etc/telegraf/telegraf.conf
[global_tags]
[agent]
hostname = "hostname"
omit_hostname = false
interval = "30s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "30s"
flush_jitter = "0s"
logfile = ""
debug = false
quiet = false
#
# OUTPUTS:
#
[[outputs.influxdb]]
database = "telegraf"
insecure_skip_verify = true
urls = ["https://influx.hostname"]
#
# INPUTS:
#
[[inputs.cpu]]
percpu = true
totalcpu = true
[[inputs.disk]]
[[inputs.io]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.swap]]
[[inputs.system]]
/etc/telegraf/telegraf.d/logparser.conf
[[inputs.logparser]]
files = ["/var/log/nginx/web-*-access.log"]
from_beginning = false
[inputs.logparser.grok]
measurement = "nginx_access_log"
patterns = ["\\[[0-9a-z-]+\\] %{CLIENT:client_ip} - %{NOTSPACE} \\[%{HTTPDATE:timestamp:ts-httpd}\\] \"(?:%{WORD:verb:tag} %{NOTSPACE:request}(?: HTTP/%{NUMBER:http_version:float})?|%{DATA})\" %{NUMBER:resp_code:tag} (?:%{NUMBER:resp_bytes:int}|-) \"%{NOTSPACE:referer}\" \"[^\"]+\" \"(?:%{CLIENT:forwarded_for}|-)\" \"(?:%{CLIENT:real_ip}|-)\" %{NOTSPACE} rt=%{NUMBER:request_time:float} uht=(?:%{NUMBER:upstream_header_time:float}|-) urt=(?:%{NUMBER:upstream_response_time:float}|-)"]
timezone = ""
Centos 7.2
Telegraf 1.4.3,1.4.5
echo '[-] 192.168.1.1 - - [12/Dec/2017:14:00:47 +0300] "GET /app HTTP/1.1" 501 178 "-" "axios/0.16.2" "-" "127.0.0.1" (.) rt=1.001 uht=1.001 urt=1.001' >> /var/log/nginx/web-api-access.logselect last("request_time") from "nginx_access_log" limit 1 must be 1.001kill -HUP $PIDecho '[-] 192.168.1.1 - - [12/Dec/2017:14:05:47 +0300] "GET /app HTTP/1.1" 502 178 "-" "axios/0.16.2" "-" "127.0.0.1" (.) rt=2.002 uht=2.002 urt=2.002' >> /var/log/nginx/web-api-access.logTelegraf log output:
Dec 12 14:46:33 web2 [1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Dec 12 14:46:33 web2 systemd[1]: Starting The plugin-driven server agent for reporting metrics into InfluxDB...
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Starting Telegraf v1.4.5
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Loaded outputs: influxdb
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Loaded inputs: inputs.system inputs.cpu inputs.disk inputs.diskio inputs.mem inputs.net inputs.swap inputs.logparser
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Tags enabled: host=web2
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Agent Config: Interval:30s, Quiet:false, Hostname:"web2", Flush Interval:30s
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Reloading Telegraf config
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Hang on, flushing any cached metrics before shutdown
Dec 12 14:51:41 web2 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics into InfluxDB.
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Starting Telegraf v1.4.5
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Loaded outputs: influxdb
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Loaded inputs: inputs.net inputs.swap inputs.system inputs.cpu inputs.disk inputs.diskio inputs.mem inputs.logparser
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Tags enabled: host=web2
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Agent Config: Interval:30s, Quiet:false, Hostname:"web2", Flush Interval:30s
To work around bug https://github.com/influxdata/telegraf/issues/3906 I do a service telegraf reload in my preporcessing script (that works around bug https://github.com/influxdata/telegraf/issues/3613 ). This leads to the effect that my 2nd logparser stops working.
I have one logparser which uses from_beginning = true and reads a preprocessed file. This one still works after the SIGHUP.
But the second logparser that reads from the live /var/log/nginx/access.log in conjunction with the basicstats aggregator stops working after SIGHUP. Here I use from_beginning = false.
Bothh logparsers use inotify.
Closing based on https://github.com/influxdata/telegraf/pull/5347#issuecomment-507465612
This issue is still valid :(
I have found the issue though. In hpcloud/tail (influxdata/tail) there is a problem:
In this function, it is possible for shared.watchNums[fname] to go negative.
When debugging, I found that it would decrement once because of tomb.Tomb's Dying() triggering:
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.remove at inotify_tracker.go:102
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.RemoveWatch at inotify_tracker.go:87
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.(*InotifyFileWatcher).ChangeEvents.func1 at inotify.go:94
runtime.goexit at asm_amd64.s:1337
- Async stack trace
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.(*InotifyFileWatcher).ChangeEvents at inotify.go:77
and again due to tail.Cleanup() being called
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.remove at inotify_tracker.go:102
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.RemoveWatch at inotify_tracker.go:87
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.Cleanup at inotify_tracker.go:129
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail.(*Tail).Cleanup at tail.go:453
github.com/influxdata/telegraf/plugins/inputs/tail.(*Tail).Stop at tail.go:219
github.com/influxdata/telegraf/agent.(*Agent).stopServiceInputs at agent.go:651
github.com/influxdata/telegraf/agent.(*Agent).Run.func1 at agent.go:75
runtime.goexit at asm_amd64.s:1337
- Async stack trace
github.com/influxdata/telegraf/agent.(*Agent).Run at agent.go:66
Most helpful comment
To work around bug https://github.com/influxdata/telegraf/issues/3906 I do a
service telegraf reloadin my preporcessing script (that works around bug https://github.com/influxdata/telegraf/issues/3613 ). This leads to the effect that my 2nd logparser stops working.I have one logparser which uses
from_beginning = trueand reads a preprocessed file. This one still works after the SIGHUP.But the second logparser that reads from the live
/var/log/nginx/access.login conjunction with the basicstats aggregator stops working after SIGHUP. Here I usefrom_beginning = false.Bothh logparsers use inotify.