Telegraf: Create a "logstreamer" plugin

Created on 10 Aug 2015 · 27 Comments · Source: influxdata/telegraf

Inspired by issue #48, create a plugin for aggregating and pushing data from log files, allowing user-defined regex filters.

This would behave in a similar manner to heka's logstreamer plugin: https://hekad.readthedocs.org/en/v0.9.2/pluginconfig/logstreamer.html#logstreamerplugin

/cc @steverweber

All 27 comments

:+1:

perhaps something simpler: tail a file.
some code like: https://github.com/hpcloud/tail
and add some processing options like:

  • count matches of a regex
  • send raw text that a regex matches

this could be used in many ways! let's say you want to know how many 404s nginx is returning per second. OR perhaps send raw error.log messages.. The raw log lines would be nice in grafana when the table plugin is added.

Where do we start?

Tail code looks interesting, but it may even be overkill for this situation. A telegraf plugin being able to handle a constant stream of messages is something that I've implemented in the statsd plugin that has a PR open now #237. So it's possible, but I think for this situation we might be able to just cache the position in the file, and then start reading from that position on the next call to Gather()

There is also a plugin in a PR that does exactly as @steverweber described (counting status codes of a webserver log), but I probably won't be merging it because it's very specific to that use-case and the author has not written unit tests for it, see #176.

I think that more ideally this plugin should be a general use-case where a user can input any regex that will be counted when matched (or output a string as @steverweber suggested). I'm thinking configuration would look something like this:

[logstreamer]
    [[logfile]]
    measurement = "bazbars"
    file = "/var/log/foo.log"
    regex = ".*bar.*|.*baz.*"
    # Type of output. Can be "string" or "counter"
    type = "counter"

    [[logfile]]
    measurement = "webserver_404"
    regex = ".*404.*"
    [...]
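The matching behind those two output types could look something like this sketch (function names are made up for illustration; "counter" counts matching lines, "string" passes the raw matching lines through):

```go
package main

import (
	"fmt"
	"regexp"
)

// countMatches implements the "counter" output type: how many lines
// match the user-supplied pattern.
func countMatches(lines []string, pattern string) int {
	re := regexp.MustCompile(pattern)
	n := 0
	for _, l := range lines {
		if re.MatchString(l) {
			n++
		}
	}
	return n
}

// matchingLines implements the "string" output type: the raw text of
// every line that matches the pattern.
func matchingLines(lines []string, pattern string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, l := range lines {
		if re.MatchString(l) {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	lines := []string{"a bar b", "plain", "baz!"}
	fmt.Println(countMatches(lines, `.*bar.*|.*baz.*`)) // 2
}
```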

+1

:+1:

keep in mind the logstreamer should recover if a file is:

  • deleted and recreated
  • truncated
  • partway through a line write

perhaps make it so multiple logstreamers are not needed for each metric
(we only want to read the log file once)

[streamer]

    [[file]]
    name = "/var/log/nginx/accept.log"
    delimiter = '\n' # default: '\n'

        [[[measurement]]]
        name = "nginx_requests"
        type = "counter" # counter(default)

        [[[measurement]]]
        name = "nginx_404"
        regex = ".*404.*"


    [[file]]
    name = "/var/log/nginx/error.log"

        [[[measurement]]]
        name = "nginx_errors"

        [[[measurement]]]
        name = "nginx_error_msg"
        regex = "<ignore timestamp> (<msg>.*)"
        type = "string"

perhaps file could even be a network stream... this could open up support for syslog:
file = "udp://127.0.0.1:4880"

some of the code in heka might be helpful for udp input:
https://github.com/mozilla-services/heka/blob/dev/plugins/udp/udp_input.go

fyi: I feel telegraf's objectives would be further along if it forked or contributed to: https://github.com/mozilla-services/heka - http://hekad.readthedocs.org/en/v0.10.0b1/
options are good though :)

How about (sample config):

[logstreamer]
dirs = ["/tmp/logs"]
    [[logstreamer.group]]
    mask = "^.*log$"
    rules = ['\s\[(?P<date>\d{1,2}/\w*/\d+:\d+:\d+:\d+ [+-]?\d+)\]\s.*?"\s(?P<code>\d{3})\s(?P<size_value>\d+)']
    name = "nginx"
    date_format = "02/Jan/2006:15:04:05 -0700"

The plugin recursively walks the specified directories and looks for all files that match the "mask".
Then it starts tailing them.

There are rules to parse and extract data, where regex named groups are used.
The name "date" is special: it requires date_format (a golang time.Parse layout) so it can be parsed and translated into the metric's timestamp.
Names that end with _value are metrics. The rest are tags.
So, for example, after parsing nginx log with the rules above we get:

time                            code  dc         group  host      size
2015-10-10T08:22:09.169981459Z  200   us-east-1  nginx  c7.local  753832
2015-10-10T08:24:19.17656864Z   200   us-east-1  nginx  c7.local  753832
2015-10-10T08:28:59.828478721Z  200   us-east-1  nginx  c7.local  753832
2015-10-10T08:39:40.812079491Z  200   us-east-1  nginx  c7.local  753832
2015-10-10T08:42:14.991151971Z  200   us-east-1  nginx  c7.local  753832
2015-10-10T08:46:19.562880205Z  200   us-east-1  nginx  c7.local  753832
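The named-group convention described above (the "date" group becomes the metric timestamp, groups ending in _value become fields, all other named groups become tags) could be sketched like this; parseLine is a hypothetical name:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// The rule from the sample config, using Go regexp named groups.
var rule = regexp.MustCompile(
	`\s\[(?P<date>\d{1,2}/\w*/\d+:\d+:\d+:\d+ [+-]?\d+)\]\s.*?"\s(?P<code>\d{3})\s(?P<size_value>\d+)`)

const dateFormat = "02/Jan/2006:15:04:05 -0700"

// parseLine splits the named captures of one log line into tags,
// fields, and a timestamp, following the convention above.
func parseLine(line string) (tags, fields map[string]string, ts time.Time, err error) {
	m := rule.FindStringSubmatch(line)
	if m == nil {
		return nil, nil, time.Time{}, fmt.Errorf("no match: %q", line)
	}
	tags = map[string]string{}
	fields = map[string]string{}
	ts = time.Now() // fallback when the rule has no date group
	for i, name := range rule.SubexpNames() {
		switch {
		case name == "date":
			if ts, err = time.Parse(dateFormat, m[i]); err != nil {
				return nil, nil, time.Time{}, err
			}
		case strings.HasSuffix(name, "_value"):
			fields[strings.TrimSuffix(name, "_value")] = m[i]
		case name != "":
			tags[name] = m[i]
		}
	}
	return tags, fields, ts, nil
}

func main() {
	line := `127.0.0.1 - - [10/Oct/2015:08:22:09 +0000] "GET / HTTP/1.1" 200 753832`
	tags, fields, ts, err := parseLine(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(tags["code"], fields["size"], ts.UTC().Format(time.RFC3339))
	// 200 753832 2015-10-10T08:22:09Z
}
```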

I like the idea of reading the datetime from the log, however I think it should be optional. Keep in mind some time offsetting should be included to maintain the order of the log messages if not using the actual timestamps in the log.

I also like the idea of including a tag or field name in the regex/rule.

@ekini I'd like an option to add a straight filename in addition to the "mask"

Also, +1 to date parsing being optional, some people are only going to care about a count within an interval, not a point for every single instance of a regex match.

So you should support that as well, as in my original example above

Of course, date is optional, as well as date_format. Timestamps will be time.Now() then.
And yes, maybe walking through directories is overkill.

There is one more concern. If you want to cache the position in a file, and parse it to the end at each Gather, what happens if the file is big? Also, what happens if telegraf gets restarted?

My test code constantly reads files and sends parsed content to a buffered channel; each call to Gather then takes as much as possible from the channel within a specified timeout interval.

what happens if the file is big?

tailing/seeking to the end of a file is usually not a problem when it's big...
perhaps you are referring to many writes within the timespan of a gather().
should have some limit... perhaps 1MB for a string buffer. the tail code I linked above uses a "leaky bucket"
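A crude sketch of that kind of limit, using a bounded channel that drops the oldest line when full ("push" is a hypothetical name, and this assumes a single producer goroutine):

```go
package main

import "fmt"

// push enqueues a line on a bounded buffer; if the buffer is full, the
// oldest buffered line is dropped to make room. Not safe with multiple
// concurrent producers, but fine for one tailing goroutine.
func push(buf chan string, line string) {
	for {
		select {
		case buf <- line:
			return
		default:
			select {
			case <-buf: // drop the oldest line
			default:
			}
		}
	}
}

func main() {
	buf := make(chan string, 2)
	push(buf, "a")
	push(buf, "b")
	push(buf, "c") // buffer full: "a" is dropped
	fmt.Println(<-buf, <-buf) // b c
}
```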

Also, what happens if telegraf gets restarted?

it gets restarted and jumps to the end of the file... we don't care if we lose some data in between. Keeping state data is kinda overkill.

There is still a question of what to do if the file is truncated. One option would be to make a ServicePlugin that runs the tail code @steverweber linked in the background.

This probably wouldn't be possible until I merge the statsd code

the https://github.com/hpcloud/tail code seems to handle this well.
https://github.com/hpcloud/tail/blob/master/cmd/gotail/gotail.go

t, err := tail.TailFile("/var/log/nginx.log", tail.Config{
    Follow: true, // keep reading as the file grows, like tail -f
    ReOpen: true, // reopen when the file is rotated, like tail -F
    Poll:   true})
if err != nil {
    log.Fatal(err)
}
for line := range t.Lines {
    fmt.Println(line.Text)
}

Config.ReOpen is analogous to tail -F (capital F):

-F      The -F option implies the -f option, but tail will also check to see if the file being followed has been
        renamed or rotated. The file is closed and reopened when tail detects that the filename being read from
        has a new inode number. The -F option is ignored if reading from standard input rather than a file.

ref: http://stackoverflow.com/questions/10135738/reading-log-files-as-theyre-updated-in-go

@ekini you mentioned you had some working code for this a couple weeks ago, do you happen to have anything I can take a look at? I'm interested in getting something working for this

@sparrc yes, I've got something working at ekini@04f4b72182eaf5533275433be6933d15932af480
It's based on the hpcloud/tail code mentioned above.
It works, but there are plenty of sharp edges.

a little trick I've been toying with. (The heredoc delimiter needs quoting so $(hostname) and $line expand at runtime rather than when the script is written, and line protocol string fields need double quotes.)

cat > /cron_mon_log <<'EOFXX'
#!/bin/bash
tail -F -n0 /var/log/syslog | while read line; do
    curl -X POST 'http://mon-dev-1.private.xxxx.ca:8086/write?db=db' --data-binary "log_mon,hostname=$(hostname) value=\"$line\""
done
EOFXX

chmod +x /cron_mon_log
echo '@reboot  root  /cron_mon_log' >> /etc/crontab

might need work, but thought it worth sharing.

Maybe simpler with rsyslog?

rsyslog.conf:
*.* @127.0.0.1:1514

And listen on 1514 port for example.

Would be great if this could make it to telegraf. :+1:

:+1:

This will most likely start as a telegraf tail plugin that will accept the currently-available data input formats.

Recently came across this log analyzer project that looks like it has a pretty solid format for creating templates and parsing arbitrary logfile formats: https://github.com/trustpath/sequence

Right now it's discontinued, but influxdata could probably fork and take over that project if it turns out to be useful.

I am very interested in this plugin, primarily to monitor the response codes of Apache httpd.
Is there already an alpha version to try?

:+1:

Would love to see this. Mostly for parsing Apache/Nginx logs (response codes, top URLs, etc).

It would be a useful feature
