Hi,
Following this discussion on the filebeat forum, I would like to ask whether it is possible to implement a solution to easily backfill old gzipped logs with filebeat.
The proposed solution mentioned in the topic is to add a new dedicated input_type.
It is also mentioned in the topic that when filebeat reaches the end of input on stdin it does not exit but keeps waiting for new lines, which makes backfilling hard to script.
What are your thoughts on this?
Thanks for your hard work.
I would see the implementation as follows:
This would be nice to have, but I think it is not at the top of our priority list. A community contribution here would be more than welcome.
For your second issue about running filebeat only until first completion, let's refer to this issue: https://github.com/elastic/filebeat/issues/219
Thanks for the fast reply and the pointer to the issue.
I will try to look at your implementation proposal. I am still quite new to Golang and the project, so I can't promise anything :)
@lminaudier Always here to help.
This would be a great feature addition. Currently the splunk-forwarder does something similar and will automatically index log-rotated files that have been gzipped.
+1. Is anyone working on this? If not I could possibly take it up..
@Ragsboss I don't think anyone is working on that. Would be great to have a PR for that to discuss the details.
@Ragsboss @ruflin I'm delighted to see there's someone looking to pick this up. Is this happening? The reason I ask is that it may be possible for me to spend some time helping out with this in lieu of building another solution for gzipped logs to use internally.
@cFire please feel free to take this up. I've started looking at the code, familiarizing myself with Go in general and the FileBeat code, but I haven't started the real work yet. I'll try to help you in any way you want, just let me know.
A few thoughts I had. From a purely functional viewpoint, defining a new input_type doesn't seem ideal, as that would force users to author a new prospector in the config file. Instead, I felt it may be better for the code to automatically deal with compressed files as long as they match the given filename patterns in the config file. The code could instantiate a different harvester (IIUC) based on the file extension/type. But from an implementation viewpoint, if this is turning out to be difficult, I think it's ok to burden/ask the users for some extra config...
From an implementation point of view I think we should go the route of having a separate input type. Currently filebeat is designed so that a prospector and harvester type are tightly coupled, so a prospector only starts one type of harvester. It is ok if the gzip harvester reuses lots of code from the log harvester (which I think it will), but log tailing and reading a file completely once are, from my perspective, two quite different behaviours. The question that will also be raised is whether the gzip files change their name over time (meaning they have to be tracked based on inode/device) or whether it is enough to just store the filename and a read/unread flag in the registry.
Here is the PR related to the above discussions: https://github.com/elastic/beats/pull/2227
We would like filebeat to be able to read gzipped files, too.
Our main use of filebeat would be to take a set of rotated, gzipped logs that represent the previous day's events, and send them to elasticsearch.
No tailing or running as a service is required, so a "batch mode" would also be good, but other workarounds solely using filebeat would also be acceptable.
@willsheppard Thanks for the details. For the batch mode you will be interested in https://github.com/elastic/beats/pull/2456 (in progress)
Now that https://github.com/elastic/beats/pull/2456 is merged this feature got even more interesting :-)
Hello,
Has there been any update regarding support for gzip files? Please let me know.
Thanks.
Idk about the others, but I've not gotten any time to work on this.
No progress yet on this from my side.
This gzip input filter would be the killer feature for us. We're being forced to consider writing an Elasticsearch ingest script from scratch which writes to the Bulk API, because we need to operate on logs in-place (no space to unzip them), and we would be using batch-mode (#2456) to ingest yesterday's logs from our web clusters.
Has there been any movement on this? This is the killer feature for me too.
Throwing in my support for this feature.
Would be an awesome feature to have.
There is this open PR here that still needs some work and also involves quite some discussions: https://github.com/elastic/beats/pull/3070
Harvesters would only be opened based on filenames (no inode etc.).
@ruflin I believe inodes may need to be tracked, because logrotate (assuming this is a target use case) renames files and reuses file names, unless another tracking mechanism is used (for example, deciding when 'hello.txt.1.gz' below is a "new file").
Example:
```
% ls -il /tmp/hello.txt*
103196 -rw-rw-r--. 1 jls jls 12 Jan 24 03:17 /tmp/hello.txt
103131 -rw-rw-r--. 1 jls jls 32 Jan 24 03:16 /tmp/hello.txt.1.gz
% cat test.conf
/tmp/hello.txt {
  rotate 5
  compress
}
% logrotate -s /tmp/example.logrotate -f test.conf
% ls -il /tmp/hello.txt*
103218 -rw-rw-r--. 1 jls jls 32 Jan 24 03:17 /tmp/hello.txt.1.gz
103131 -rw-rw-r--. 1 jls jls 32 Jan 24 03:16 /tmp/hello.txt.2.gz
```
^^ Above, 'hello.txt.2.gz' is the same file (inode) as previous 'hello.txt.1.gz'.
We can probably achieve this without tracking inodes (tracking modification time and only reading .gz files after they have been idle for some small time?), but I think the filename alone is not enough because file names are reused, right?
I hit exactly this issue during the implementation. That is why the implementation in https://github.com/elastic/beats/pull/3070 is not fully consistent with the initial theory in this thread.
The main difference from a "normal" file is that a gz file is expected to never change, and if it does change, the complete file is read from the beginning again.
+1 vote for gzip input
+1 for gzip support
@jpdaigle @smsajid Could you share more details on how exactly you would use this feature?
@ruflin In our case, log files from hundreds of servers are streamed to a central log processor, which then outputs gzip-ed logs in chunks of a few seconds on a host that's accessible to developers. This is where a filebeat would come into play: prospect for the appearance of new .gz files, grab them and process them through a logstash pipeline.
The above is pretty "single use-case specific" though, in more generic terms, how we would use a .gz input filter is for grabbing the output of one logging system and "gluing" it to the input of a logstash pipeline.
We would also be interested in this feature as we also have scenarios where zipped files are generated by an intermediate system and where we can't read the plain files directly. It would be great if this can be combined with #4373 so that we can specify how many days back and/or based on file name patterns we can read zipped files.
@ruflin My use case requires back-filling of gzipped log files. We have a requirement to keep original log files for a minimum of 6 months. Right now, the only option for us is to unzip the log files from the central location to a staging location and read the files from there using filebeat. This has the disadvantage that we need additional storage and regular cleanup of unzipped files once they are processed.
Very similar requirement here as well, which seems to be not uncommon in regulated environments.
@jpdaigle @smsajid @christiangalsterer Thanks for sharing. The use case that you mention where you "only" have gzip files and no other files could actually be covered by https://github.com/elastic/beats/pull/3070 Now that we have prospectors better abstracted out, it should also be possible to make it a separate prospector type.
@christiangalsterer Which issue did you want to refer to above? Not sure if the one you linked is the right one.

+1 for gzip support
@eperpetuo Could you also share your detailed use case? I want to keep collecting data on how people would use the feature to then validate a potential implementation.
@ruflin In my case, during the log rotation, files are automatically gzipped. That is, the current log file is compressed and renamed to _foo.log.gz_ and a new _foo.log_ file is created. New log events start to be written in this new _foo.log_ file.
Now, I have experienced some delay during high throughput events. Imagine filebeat is 2 seconds behind the last line in the log file. When log rotation occurs and the file is gzipped, filebeat is not able to continue reading the file.
Although new lines written to the just-created _foo.log_ file are perfectly collected by filebeat from the beginning, the last few lines of the now gzipped file are never shipped to Elasticsearch, and in this case there is loss of information.
We currently use Splunk and it also presents some delay. However, the splunk-forwarder is able to collect all events even during log rotation and no message loss occurs, which is the most important.
This situation is preventing us from moving the solution to the production environment because we can't afford to lose messages.
@eperpetuo For now, a workaround could be to set up logrotate to only gzip the second time a file is rotated (i.e. you end up with log, log.1, log.2.gz, log.3.gz, etc.). This is how many Linux distros do rotation as well.
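For reference, a minimal logrotate stanza for that workaround might look like this (a sketch only; the path and rotation count are placeholders):

```
/var/log/myapp/foo.log {
  daily
  rotate 5
  compress
  # delaycompress keeps the most recently rotated file uncompressed until the
  # next rotation, giving filebeat time to finish reading it before it is gzipped.
  delaycompress
}
```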
@praseodym Thanks. We'll look into this workaround. :metal:
@eperpetuo If you disable close_renamed and close_removed, filebeat should continue reading the file as it will keep it open. This should give you the behaviour you expect. Feel free to open a topic on Discuss to discuss it further.
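A minimal sketch of such an input configuration (the path is a placeholder, and depending on the Filebeat version the top-level key may be `filebeat.prospectors` with `input_type` instead of `type`):

```yaml
filebeat.inputs:
- type: log
  paths:
    - /var/log/myapp/*.log
  # Keep the file handle open across rename/removal during rotation,
  # so the remaining lines can still be read after the file is rotated.
  close_renamed: false
  close_removed: false
```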
+1 for gzip support, we have a ton of XML backlogs that are gzipped that I would like to send to logstash. Unzipping everything would take up 30x more space.
+1 for gzip support. I have a pile of alerts.json.gz files I would like to re-run.
+1 as well. Our use case is that our CDN provider (Edgecast) gzips logs before shipping them to us, so we have no way of receiving them raw. We could have a process that takes the gzipped objects and decompresses them to a second directory that filebeat watches, but that's frankly wasteful when we're dealing with terabytes of logs.
@sly4 @jhnoor @brennentsmith The cases you mentioned should be easier to cover, as there is no overlap between zipped and unzipped files. Thanks for sharing the use cases.
+1 on behalf of the DFIR world. ELK is/was my go-to for ingesting large dumps of log data in whatever crazy format a customer provides them in. I have no use for "tailing" a log file; I only need to read in static files. Those files are almost always gzipped, and rarely is decompressing them a viable option. As an example, the case I'm working right now involves (among many other things) about 200G of compressed log evidence spanning the last 18 months. Decompression is not a viable option. I'd love to adopt the beats way of doing things, if it becomes possible to do so.
+1
+1 for "reading gzipped files should be supported out of the box"
+1
Any updates on this?
+1 for gz files as input
Are there any updates on this?
If I understand the documentation for filebeat here, filebeat can take logs from stdin or a UDP socket.
So for most of the cases talked about here, you could set up the config file appropriately and do something in bash like:
```sh
for f in *.gz ; do
  # filebeat needs to exit at end of stdin for the loop to advance
  # (see the -once flag mentioned further down in this thread)
  zcat "$f" | filebeat
done
```
The important use case I see for reading gzip files is logs that are rotated before they are shipped, in the case of a prolonged network outage.
For example, on my busy server I have log rotation that happens every minute, and I have delaycompress enabled. If there is a network outage to the destination where filebeat sends the logs for more than a minute, it could end up missing logs because they were rotated into a gzip file.
I understand the difficulty of tracking between the original file and the gzipped file, and you don't want to re-read logs from a gzipped file that have already been read. So I think it would be great if filebeat could take care of doing the rotating. :-D
That way it could update the registry info with the file that it caused to be gzipped.
The rotation could have nice rich support for rotating with date formatted path names and cleanup.
filebeat's `-once` flag ("_Run filebeat only once until all harvesters reach EOF_") makes it stop at the end of stdin.
I just used it successfully with the following filebeat configuration file to import multiple old gzipped log files to my logstash instance:
```yaml
filebeat.inputs:
- type: stdin
  enabled: true
  ignore_older: 0
  tags: ["web_access_log"]
  fields:
    server: "foo-server.example.com"
    webserver: "apache"
  fields_under_root: true
  json.overwrite_keys: true

output.logstash:
  hosts: ["logstash.example.com:8088"]
```
And with the following command:
zcat "file.gz" | filebeat -once -e -c "filebeat-std.config.yml"
+1 for gzip support
+1 for gzip support
+1 for gzip support
After three years of waiting, we continue to +1 for gzip support
+1
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1
Is there any update/progress for this issue?
+1 for gzip support
+1 from me for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
Gentlemen, Splunk can read *.gz as is, and has for years. My problem is that my *.gz files are ~20GB and beyond my control. Currently I run in the scheduler:

```
7z.exe e -oc *.gz* | logstash.bat -f sensioproxy.grok
```

or:

```
7z.exe e -oc *.gz* | filebeat.exe -once -e -c "sensioproxy-filebeat.yml"
```

to a logstash on the ELK server.
This works somewhat, but the zcat / `7z.exe e -oc` side is way faster than the logstash/filebeat side, so it takes all 16GB of memory on the clients before the data has been transferred.
Another issue is the handling of duplicates, since sincedb is not working. I currently have an ugly Python script to handle duplicates.
The logstash-codec-gzip_lines plugin does not output a proper format for my files the way zcat / `7z.exe e -oc` does.
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
+1 for gzip support
Is there any update on this feature? +1 for gzip support.
Log compression is the standard. It would be really great if filebeat could read gz-logs.
+1 gzip
+1
+1
This issue has been open for 3 years. Is the best approach to just use logstash for .gz files?
+1 gzip
> This issue has been open for 3 years. Is the best approach to just use logstash for .gz files?

I recommend unpacking it before giving it to logstash. Another option is to use the _bulk endpoint to insert directly into Elasticsearch (one line of index metadata and one line of data) at a far faster rate.
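For anyone considering that route, the _bulk request body is newline-delimited JSON where each document line is preceded by an action line. Roughly something like this (the host, the index name, and the assumption that each log line is already a JSON object are all placeholders/assumptions):

```sh
# Turn a gzipped log of JSON lines into a bulk payload:
# one {"index": ...} action line followed by the document line itself.
zcat app.log.gz | while IFS= read -r line; do
  printf '{"index":{"_index":"backfill-logs"}}\n'
  printf '%s\n' "$line"   # assumes each log line is already a JSON object
done > bulk.ndjson

# POST it to the _bulk endpoint (the body must end with a newline,
# which the loop above guarantees).
curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk.ndjson
```

For anything large you would probably want to split the payload into chunks of a few thousand lines per request.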
+1 gzip.
Also a comment from my side is that consistency breaks if you create a new input type for zipped logs. I do understand that plain log files are different from gz files but it is much more "readable" to have one entry for all log files of the same application.
any update :(
+1 for gzip
Is there any update on this?
I have many log files in .gz format that I'd love to have decompressed automatically rather than running custom decompression scripts first. I have hundreds of file locations that I would need to create scripts for, and I wish I didn't have to.
any update?
+1 for .gz file
I think there is no news regarding this, right?
I hope someone does progress on this.
agreed, looking for this functionality if possible.
+1 please
This would be an awesome feature :+1:
+1 for gzip please
what about https://github.com/elastic/beats/pull/2227 ?
+1 for gzip please
+1 for gzip file processing
Please refrain from "+1" comments, it's spam. Use the emoji reaction button if you want, but either way it won't make development any faster.
Pinging @elastic/integrations (Team:Integrations)
+1 for gzip please
As an alternative, you can use mkfifo, point filebeat at the named pipe, and pipe all your files into it. That way you do not need to worry about when you are finished ingesting a file, because you can just start piping the next file into it. So my zcats go into `zcat ... > fb_input`, and in another terminal I have `filebeat < fb_input`.
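Roughly, the setup looks like this (untested sketch; the pipe name, the glob of archive files, and the stdin-input config file `filebeat-stdin.yml` are placeholders):

```sh
# Terminal 1: a Filebeat with a stdin input, reading from a named pipe.
mkfifo fb_input
filebeat -e -c filebeat-stdin.yml < fb_input

# Terminal 2: feed the decompressed files into the pipe.
# Grouping the commands keeps the pipe open between files,
# so filebeat sees one continuous stream.
{
  for f in /var/log/archive/*.gz; do
    zcat "$f"
  done
} > fb_input
```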
Any update on gzip files? Using zcat solves some issues, but it has to be done manually every time.
Pinging @elastic/integrations-services (Team:Services)
+1 to adding this feature...is there any update on the timeline for this being added to Filebeat?
Edit: Sorry for "spam" just saw that comment... : )