Hi,
Following this discussion on the filebeat forum, I would like to ask whether it is possible to implement a solution to easily backfill old gzipped logs with filebeat.
The proposed solution mentioned in the topic is to add a new dedicated input_type.
It is also mentioned in the topic that when filebeat reaches the end of input on stdin it does not exit but keeps waiting for new lines, which makes backfilling hard to script.
What are your thoughts on this?
Thanks for your hard work.
I would see the implementation as follows:
This would be nice to have, but I think it is not at the top of our priority list. A community contribution here would be more than welcome.
For your second issue about running filebeat only until first completion, let's refer to this issue: https://github.com/elastic/filebeat/issues/219
Thanks for the fast reply and the pointer to the issue.
I will try to look at your implementation proposal. I am still quite new to Golang and the project, so I can't promise anything :)
@lminaudier Always here to help.
This would be a great feature addition. Currently the splunk-forwarder does something similar and will automatically index log-rotated files that have been gzipped.
+1. Is anyone working on this? If not I could possibly take it up..
@Ragsboss I don't think anyone is working on that. Would be great to have a PR for that to discuss the details.
@Ragsboss @ruflin I'm delighted to see there's someone looking to pick this up. Is this happening? The reason I ask is that it may be possible for me to spend some time helping out with this in lieu of building another solution for gzipped logs to use internally.
@cFire please feel free to take this up. I've started looking at the code, familiarizing myself with Go in general and the FileBeat code, but I haven't started the real work yet. I'll try to help you in any way you want, just let me know.
A few thoughts I had. From a purely functional viewpoint, defining a new input_type doesn't seem ideal, as that would force users to author a new prospector in the config file. Instead, I felt it may be better for the code to automatically deal with compressed files as long as they match the given filename patterns in the config file. The code could instantiate a different harvester (IIUC) based on the file extension/type. But from an implementation viewpoint, if this is turning out to be difficult, I think it's ok to burden/ask the users for some extra config...
From an implementation point of view I think we should go the route of having a separate input type. Currently filebeat is designed so that a prospector and harvester type are tightly coupled, so a prospector only starts one type of harvester. It is ok if the gzip harvester reuses lots of code from the log harvester (which I think it will), but log tailing and reading a file completely once are, from my perspective, two quite different behaviours. The question that will also be raised is whether the gzip files change their name over time (meaning they have to be tracked based on inode/device) or whether it is enough to just store the filename and a read/unread flag in the registry.
Here is the PR related to the above discussions: https://github.com/elastic/beats/pull/2227
We would like filebeat to be able to read gzipped files, too.
Our main use of filebeat would be to take a set of rotated, gzipped logs that represent the previous day's events, and send them to elasticsearch.
No tailing or running as a service is required, so a "batch mode" would also be good, but other workarounds solely using filebeat would also be acceptable.
@willsheppard Thanks for the details. For the batch mode you will be interested in https://github.com/elastic/beats/pull/2456 (in progress)
Now that https://github.com/elastic/beats/pull/2456 is merged this feature got even more interesting :-)
Hello,
Has there been any update regarding support for gzip files? Please let me know.
Thanks.
Idk about the others, but I've not gotten any time to work on this.
No progress yet on this from my side.
This gzip input filter would be the killer feature for us. We're being forced to consider writing an Elasticsearch ingest script from scratch which writes to the Bulk API, because we need to operate on logs in-place (no space to unzip them), and we would be using batch-mode (#2456) to ingest yesterday's logs from our web clusters.
Has there been any movement on this? This is the killer feature for me too.
Throwing in my support for this feature.
Would be an awesome feature to have.
There is this open PR here that still needs some work and also involves quite some discussions: https://github.com/elastic/beats/pull/3070
Harvesters would only be opened based on filenames (no inode etc.).
@ruflin I believe inodes may need to be tracked, because logrotate (assuming this is a target use case) renames files and reuses file names, unless another tracking mechanism is used (for example, deciding when 'hello.txt.1.gz' below is a "new file").
Example:
```
% ls -il /tmp/hello.txt*
103196 -rw-rw-r--. 1 jls jls 12 Jan 24 03:17 /tmp/hello.txt
103131 -rw-rw-r--. 1 jls jls 32 Jan 24 03:16 /tmp/hello.txt.1.gz
% cat test.conf
/tmp/hello.txt {
  rotate 5
  compress
}
% logrotate -s /tmp/example.logrotate -f test.conf
% ls -il /tmp/hello.txt*
103218 -rw-rw-r--. 1 jls jls 32 Jan 24 03:17 /tmp/hello.txt.1.gz
103131 -rw-rw-r--. 1 jls jls 32 Jan 24 03:16 /tmp/hello.txt.2.gz
```
^^ Above, 'hello.txt.2.gz' is the same file (inode) as previous 'hello.txt.1.gz'.
We can probably achieve this without tracking inodes (tracking modification time and only reading .gz files after they have been idle for some small time?), but I think the filename alone is not enough because file names are reused, right?
I hit exactly this issue during the implementation. That is why the implementation in https://github.com/elastic/beats/pull/3070 is not fully consistent with the initial theory in this thread.
The main difference from a "normal" file is that a gz file is expected to never change, and if it does change, the complete file is read from the beginning again.
+1 vote for gzip input
+1 for gzip support
@jpdaigle @smsajid Could you share more details on how exactly you would use this feature?
@ruflin In our case, log files from hundreds of servers are streamed to a central log processor, which then outputs gzip-ed logs in chunks of a few seconds on a host that's accessible to developers. This is where a filebeat would come into play: prospect for the appearance of new .gz files, grab them and process them through a logstash pipeline.
The above is pretty "single use-case specific" though, in more generic terms, how we would use a .gz input filter is for grabbing the output of one logging system and "gluing" it to the input of a logstash pipeline.
We would also be interested in this feature as we also have scenarios where zipped files are generated by an intermediate system and where we can't read the plain files directly. It would be great if this can be combined with #4373 so that we can specify how many days back and/or based on file name patterns we can read zipped files.
@ruflin My use case requires back-filling of gzipped log files. We have a requirement to keep original log files for a minimum of 6 months. Right now, the only option for us is to unzip the log files from the central location to a staging location and read the files from there using filebeat. This has the disadvantage that we need additional storage and regular cleanup of unzipped files once they are processed.
Very similar requirement here as well, which seems to be not uncommon in regulated environments.
@jpdaigle @smsajid @christiangalsterer Thanks for sharing. The use case that you mention where you "only" have gzip files and no other files could actually be covered by https://github.com/elastic/beats/pull/3070 Now that we have prospectors better abstracted out, it should also be possible to make it a separate prospector type.
@christiangalsterer Which issue did you want to refer to above? Not sure if the one you linked is the right one.

+1 for gzip support
@eperpetuo Could you also share your detailed use case? I want to keep collecting data on how people would use the feature to then validate a potential implementation.
@ruflin In my case, during the log rotation, files are automatically gzipped. That is, the current log file is compressed and renamed to _foo.log.gz_ and a new _foo.log_ file is created. New log events start to be written in this new _foo.log_ file.
Now, I have experienced some delay during high throughput events. Imagine filebeat is 2 seconds behind the last line in the log file. When log rotation occurs and the file is gzipped, filebeat is not able to continue reading the file.
Although new lines written to the just-created _foo.log_ file are perfectly collected by filebeat from the beginning, the last few lines of the now gzipped file are never shipped to Elasticsearch, and in this case there is loss of information.
We currently use Splunk and it also presents some delay. However, the splunk-forwarder is able to collect all events even during log rotation and no message loss occurs, which is the most important.
This situation is preventing us from moving the solution to the production environment because we can't afford to lose messages.
@eperpetuo For now, a workaround could be to set up logrotate to only gzip the second time a file is rotated (i.e. you end up with log, log.1, log.2.gz, log.3.gz, etc.). This is how many Linux distros do rotation as well.
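For reference, a minimal logrotate stanza for that workaround might look like this (a sketch only; the path and rotation count are placeholders):

```
/var/log/myapp/foo.log {
  daily
  rotate 5
  compress
  # delaycompress keeps the most recently rotated file uncompressed until the
  # next rotation, giving filebeat time to finish reading it before it is gzipped.
  delaycompress
}
```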
@praseodym Thanks. We'll look into this workaround. :metal:
@eperpetuo If you disable close_renamed and close_removed, filebeat should continue reading the file as it will keep it open. This should give you the behaviour you expect. Feel free to open a topic on Discuss to discuss it further.
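A minimal sketch of such an input configuration (the path is a placeholder, and depending on the Filebeat version the top-level key may be `filebeat.prospectors` with `input_type` instead of `type`):

```yaml
filebeat.inputs:
- type: log
  paths:
    - /var/log/myapp/*.log
  # Keep the file handle open across rename/removal during rotation,
  # so the remaining lines can still be read after the file is rotated.
  close_renamed: false
  close_removed: false
```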
+1 for gzip support, we have a ton of XML backlogs that are gzipped that I would like to send to logstash. Unzipping everything would take up 30x more space.
+1 for gzip support. I have a pile of alerts.json.gz files I would like to re-run.
+1 as well. Our use case is that our CDN provider (Edgecast) gzips logs before shipping them to us, so we have no way of receiving them raw. We could have a process that takes the gzipped objects and decompresses them to a second directory that filebeat watches, but that's frankly wasteful when we're dealing with terabytes of logs.
@sly4 @jhnoor @brennentsmith The cases you mentioned should be easier to cover, as there is no overlap between zipped and unzipped files. Thanks for sharing the use cases.
+1 on behalf of the DFIR world. ELK is/was my go-to for ingesting large dumps of log data in whatever crazy format a customer provides them in. I have no use for "tailing" a log file; I only need to read in static files. Those files are almost always gzipped, and rarely is decompressing them a viable option. As an example, the case I'm working right now involves (among many other things) about 200G of compressed log evidence spanning the last 18 months. Decompression is not a viable option. I'd love to adopt the beats way of doing things, if it becomes possible to do so.
+1
+1 for "reading gzipped files should be supported out of the box"
+1
Any updates on this?
+1 for gz files as input
Are there any updates on this?
If I understand the documentation for filebeat here, filebeat can take logs from stdin or a UDP socket.
So for most of the cases talked about here, you could set up the config file appropriately and do something in bash like:
```sh
for f in *.gz ; do
  # filebeat needs to exit at end of stdin for the loop to advance
  # (see the -once flag mentioned further down in this thread)
  zcat "$f" | filebeat
done
```
The important use case I see for reading gzip files is logs that are rotated before they are shipped, in the case of a prolonged network outage.
For example, on my busy server I have log rotation that happens every minute, and I have delaycompress enabled. If there is a network outage to the destination where filebeat sends the logs for more than a minute, it could end up missing logs because they were rotated into a gzip file.
I understand the difficulty of tracking between the original file and the gzipped file, and you don't want to re-read logs from a gzipped file that have already been read. So I think it would be great if filebeat could take care of doing the rotating. :-D
That way it could update the registry info with the file that it caused to be gzipped.
The rotation could have nice rich support for rotating with date formatted path names and cleanup.
filebeat's `-once` flag ("_Run filebeat only once until all harvesters reach EOF_") makes it stop at the end of stdin.
I just used it successfully with the following filebeat configuration file to import multiple old gzipped log files to my logstash instance:
```yaml
filebeat.inputs:
- type: stdin
  enabled: true
  ignore_older: 0
  tags: ["web_access_log"]
  fields:
    server: "foo-server.example.com"
    webserver: "apache"
  fields_under_root: true
  json.overwrite_keys: true

output.logstash:
  hosts: ["logstash.example.com:8088"]
```
And with the following command:
zcat "file.gz" | filebeat -once -e -c "filebeat-std.config.yml"
+1 for gzip support
+1 for gzip support
+1 for gzip support
After three years of waiting, we continue to +1 for gzip support
+1
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1
Is there any update/progress for this issue?
+1 for gzip support
+1 from me for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
Gentlemen, Splunk can read *.gz as is, and has for years. My problem is that my *.gz files are ~20GB and beyond my control. Currently I run in the scheduler:

```
7z.exe e -oc *.gz* | logstash.bat -f sensioproxy.grok
```

or:

```
7z.exe e -oc *.gz* | filebeat.exe -once -e -c "sensioproxy-filebeat.yml"
```

to a logstash on the ELK server.
This works somewhat, but the zcat / `7z.exe e -oc` side is way faster than the logstash/filebeat side, so it takes all 16GB of memory on the clients before the data has been transferred.
Another issue is the handling of duplicates, since sincedb is not working. I currently have an ugly Python script to handle duplicates.
The logstash-codec-gzip_lines plugin does not output a proper format for my files the way zcat / `7z.exe e -oc` does.
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
Is there any update on this feature? +1 for gzip support.
Log compression is the standard. It would be really great if filebeat could read gz-logs.
+1 gzip
+1
+1
This issue has been open for 3 years. Is the best approach to just use logstash for .gz files?
+1 gzip
> This issue has been open for 3 years. Is the best approach to just use logstash for .gz files?

I recommend unpacking it before giving it to logstash. Another option is to use the _bulk endpoint to insert directly into Elasticsearch (one line of index metadata and one line of data) at a far faster rate.
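For anyone considering that route, the _bulk request body is newline-delimited JSON where each document line is preceded by an action line. Roughly something like this (the host, the index name, and the assumption that each log line is already a JSON object are all placeholders/assumptions):

```sh
# Turn a gzipped log of JSON lines into a bulk payload:
# one {"index": ...} action line followed by the document line itself.
zcat app.log.gz | while IFS= read -r line; do
  printf '{"index":{"_index":"backfill-logs"}}\n'
  printf '%s\n' "$line"   # assumes each log line is already a JSON object
done > bulk.ndjson

# POST it to the _bulk endpoint (the body must end with a newline,
# which the loop above guarantees).
curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk.ndjson
```

For anything large you would probably want to split the payload into chunks of a few thousand lines per request.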
+1 gzip.
Also a comment from my side is that consistency breaks if you create a new input type for zipped logs. I do understand that plain log files are different from gz files but it is much more "readable" to have one entry for all log files of the same application.
any update :(
+1 for gzip
Is there any update on this?
I have many log files in .gz format that I'd love to have decompressed automatically rather than running custom decompression scripts first. I have hundreds of file locations that I would need to create scripts for, and I wish I didn't have to.
any update?
+1 for .gz file
I think there is no news regarding this, right?
I hope someone does progress on this.
agreed, looking for this functionality if possible.
+1 please
This would be an awesome feature :+1:
+1 for gzip please
what about https://github.com/elastic/beats/pull/2227 ?
+1 for gzip please
+1 for gzip file processing
Please refrain from "+1" comments, it's spam. Use the emoji reaction button if you want, but either way it won't make development any faster.
Pinging @elastic/integrations (Team:Integrations)
+1 for gzip please
As an alternative, you can use mkfifo, point filebeat at the named pipe, and pipe all your files into it. That way you do not need to worry about when you are finished ingesting a file, because you can just start piping the next file into it. So my zcats go into `zcat ... > fb_input`, and in another terminal I have `filebeat < fb_input`.
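Roughly, the setup looks like this (untested sketch; the pipe name, the glob of archive files, and the stdin-input config file `filebeat-stdin.yml` are placeholders):

```sh
# Terminal 1: a Filebeat with a stdin input, reading from a named pipe.
mkfifo fb_input
filebeat -e -c filebeat-stdin.yml < fb_input

# Terminal 2: feed the decompressed files into the pipe.
# Grouping the commands keeps the pipe open between files,
# so filebeat sees one continuous stream.
{
  for f in /var/log/archive/*.gz; do
    zcat "$f"
  done
} > fb_input
```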
Any update on gzip files? Using zcat solves some issues, but it has to be done manually every time.
Pinging @elastic/integrations-services (Team:Services)
+1 to adding this feature...is there any update on the timeline for this being added to Filebeat?
Edit: Sorry for "spam" just saw that comment... : )