Hi!
Already posted on discuss.
I hope I haven't made any mistakes, because it took me quite a long time to test it and to report it here.
I also hope it deserves pioneer's. ;)
Thanks in advance for your kind help.
Regards
FilebeatB reads from "filebeat.log" and writes to "final.log", rotating every 3MB (rotation on this file can actually be disabled, but the problem persists).
If rotation renames the file by appending .X (eg: file.X) and FilebeatB is listening to file*, most events will be re-processed (ie: processed twice).
FilebeatB seems to be losing events registered by FilebeatA due to rotation.
Some events registered in filebeat.log.X don't exist in final.log*.
It seems FilebeatA rotates the file before FilebeatB can read the file's last events.
logstash 5.0.0-alpha3-1
filebeat 1.2.3 and 5.0.0-alpha3
logstash.yml
input {
  generator {
    add_field => { "from" => "${HOSTNAME}" }
  }
}
filter {}
output {
  stdout {}
  file {
    path => "/tmp/generator.log"
    flush_interval => 0
  }
}
filebeatA.yml
filebeat:
  prospectors:
    -
      paths:
        - /tmp/log*
      input_type: log
      scan_frequency: 0s
output:
  file:
    path: "/tmp"
    filename: filebeat.log
    rotate_every_kb: 1000
    number_of_files: 100
  console:
shipper:
  name: filebeatA
  tags: ["service-X", "web-tier"]
logging:
  to_syslog: false
  to_files: false
  level: info
filebeatB.yml
filebeat:
  prospectors:
    -
      paths:
        - /tmp/filebeat.log
      input_type: log
      scan_frequency: 0s
output:
  file:
    path: "/tmp"
    filename: final.log
    rotate_every_kb: 3000
    number_of_files: 10
  console:
shipper:
  name: filebeatB
  tags: ["service-X", "web-tier"]
logging:
  to_syslog: false
  to_files: false
  level: info
This command checks for gaps in the sequence numbers used by the logstash generator.
ls -v filebeat.log* | xargs -i grep generator '{}' | sed 's/.*sequence\"://g' | sed 's/,.*//g' | sort -n | awk -v "p=1" '(p==1){p=$1-1}($1!=p+1){print p+1"-"$1-1}{p=$1}END{print "ok"}'
This command checks for gaps in the sequence numbers received from filebeat.
ls -v final.log* | xargs -i grep generator '{}' | sed 's/.*sequence\\"://g' | sed 's/,.*//g' | sort -n | awk -v "p=1" '(p==1){p=$1-1}($1!=p+1){print p+1"-"$1-1}{p=$1}END{print "ok"}'
If the last command above returns at least a pair of numbers, you can check that those exist in filebeat* but don't exist in final.log*.
For example, if it printed 344356-352529:
grep -H generator filebeat.log* | grep 344356 prints one line.
grep -H generator final.log* | grep 344356 doesn't print anything.
In fact, doing the query for the previous sequence value (344355) shows that the filebeat.log.X filename (from grep -H) is different (just truncated).
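For what it's worth, the same gap check can be sketched in Python (the paths here are hypothetical; it assumes the same "sequence":N fields the generator writes into every event):

```python
import glob
import re

# Gap-check sketch: collect every generator sequence number from the
# files matching `pattern` and report the ranges that are missing.
# Matches both "sequence":N and the escaped \"sequence\":N form.
SEQ_RE = re.compile(r'"sequence\\?":(\d+)')

def find_gaps(pattern):
    seqs = set()
    for path in glob.glob(pattern):
        with open(path, errors="replace") as fh:
            for line in fh:
                for match in SEQ_RE.finditer(line):
                    seqs.add(int(match.group(1)))
    ordered = sorted(seqs)
    # a gap exists wherever two consecutive numbers differ by more than 1
    return [(a + 1, b - 1) for a, b in zip(ordered, ordered[1:]) if b - a > 1]

# Hypothetical usage, equivalent to the awk one-liner above:
for lo, hi in find_gaps("/tmp/final.log*"):
    print(f"{lo}-{hi}")
```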
I'm currently trying to reproduce this. Some notes so far:
- scan_frequency: 1s
- /tmp/generator.log. You have /tmp/log*, not sure why this one is matching?
- /tmp/filebeat.log. This should be /tmp/filebeat* to make sure it also crawls the rotated files.
- In your first command, the last ' is missing.
Didn't manage to reproduce this yet. So far I'm not sure if this is just an issue of files rotating faster than filebeat can read them. To make sure we are talking about the same thing:
As Beats treat the output like a log output with rotating files, the above is the expected behaviour. If I understand you correctly, what you have is more a feature request so that the file output does not rotate files but creates them with a timestamp. The side effect of this is that the storage would grow over time.
One important question for me is to understand why you use the file output in the first place? Was this to only reproduce a potential bug (which is great)?
Please let me know if we are talking about the same thing here or if I miss some points above. Thanks for putting all the effort into this, I'm happy to investigate further.
Hi
The filebeatA.yml path I use is /tmp/generator.log. You have /tmp/log*, not sure why this one is matching?
At the beginning I tried with multiple generators, but in the end I used just one. Ignore it, as there are no more log* files.
In filebeatB you use /tmp/filebeat.log. This should be /tmp/filebeat* to make sure it also crawls the rotated files
That's not what I'm doing. Notice the "_Please, notice that:_" section.
If filebeatB is "listening" to filebeat.log* and filebeat.log is rotated (creating a new filebeat.XX file), wouldn't filebeatB read all events from the newly created filebeat.XX? Wouldn't that mean all events being duplicated? (Hope I explained myself properly.)
I'm trying it with a fixed number of events generated to keep it reproducible
I'm able to reproduce it with the generator running for a while, until I get to filebeat.log.99 (approx.). Event loss happens 100% of the time.
Both Filebeat instances are probably using the same registry file. This should never be the case, as otherwise they overwrite each other's state. It should not be an issue here as filebeat is never restarted, but worth mentioning.
Using Docker, so less to worry about.
In your first command, the last ' is missing.
Corrected. Thanks.
filebeat A, which reads the generator.log file, persists it to disk. Each file is 1MB, max 100 files are created. If 100 files are exceeded, the oldest data is removed.
That's correct
filebeat B reads the persisted filebeat A data. It persists to disk with 3MB per file and max 10 files. This means a total storage of 30MB, compared to filebeat A with a total storage of 100MB. So in case A writes more than 30MB, B will always only have a subset of the data.
That's correct
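Spelled out, using the rotate settings from the two configs above:

```python
# Retained storage per instance = rotate_every_kb * number_of_files
filebeat_a_kb = 1000 * 100  # filebeat A: 100,000 KB, i.e. ~100 MB
filebeat_b_kb = 3000 * 10   # filebeat B:  30,000 KB, i.e. ~30 MB

# Once A has written more than B's 30 MB cap, B can only ever hold
# the most recent subset of A's data.
assert filebeat_b_kb < filebeat_a_kb
print(filebeat_a_kb // 1000, "MB vs", filebeat_b_kb // 1000, "MB")
```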
Over time this will be the most recent data: in the beginning it will just read all files in parallel (no priority), but then only the most recent files will be updated, so recent data is on top.
I'm reading only "one file" (filebeat.log), due to my concerns about events being duplicated.
As Beats treat the output like a log output with rotating files, the above is the expected behaviour. If I understand you correctly, what you have is more a feature request so that the file output does not rotate files but creates them with a timestamp. The side effect of this is that the storage would grow over time.
If Beats is intelligent enough to detect files being rotated, and hence reads only the events that were unread from the last known position in filebeat.log (now named filebeat.log.1) while it starts reading filebeat.log from offset 0, then I have to test again with filebeat.log*.
Otherwise, you should be able to reproduce this.
If using filebeat.log* reads all filebeat.log.XX again (hence duplicating events), then I'm requesting a change to how rotation is done.
If using filebeat.log* reads all events in the truncated filebeat.log plus the last events that were not read from filebeat.log.1, then that's great and I should test again.
One important question for me is to understand why you use the file output in the first place? Was this to only reproduce a potential bug (which is great)?
That's correct. We're trying to get 0% loss from a fast-growing apache access file.
Please let me know if we are talking about the same thing here or if I miss some points above. Thanks for putting all the effort into this, I'm happy to investigate further.
Thank you for your help
Filebeat keeps track of rotated files, so it should never send duplicate events. So in case filebeat.log is renamed to filebeat.log.1, filebeat detects that and just continues at the old position. It is important that the file is renamed and not a new one created, but that is the case with all log rotation algorithms I know of so far.
Following rotation is crucial precisely to avoid data loss, because it is not guaranteed that filebeat has finished reading when you rotate the file. Of course, you will lose data if filebeat can never catch up before the file is actually removed from disk.
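The rename tracking described above can be illustrated with a tiny sketch (this is not Filebeat's actual code, just the idea): read state is keyed by device/inode rather than by path, so a rename does not reset the offset:

```python
import os

# Sketch only: a registry keyed by (device, inode) survives renames,
# because renaming a file changes its path but not its inode.
registry = {}  # (st_dev, st_ino) -> byte offset already read

def read_new_lines(path):
    st = os.stat(path)
    key = (st.st_dev, st.st_ino)
    offset = registry.get(key, 0)  # after a rename the old offset is still found
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
        registry[key] = fh.tell()
    return data.splitlines()
```

After a rename like mv filebeat.log filebeat.log.1, reading filebeat.log.1 continues from the old offset instead of re-emitting the whole file, which is why rotation by rename (rather than copy-and-truncate) avoids duplicates.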
I assume the above note means you should test again with filebeat.*?
I assume the above note means you should test again with filebeat.*?
Indeed.
On the other hand, if rotation were done using creation timestamps (as I proposed at the beginning of this topic), it could solve any issues related to rotation, and perhaps it's worth considering.
Be right back.
Thanks ;)
@mostolog I think / hope your issues are not related to rotation.
A general thing unrelated to this issue that this brought up for me is the question of whether it makes sense to treat the file output the same as writing logs. The nice thing about log-format output with rotation is that it is possible to set a maximum on the data stored; at the same time I'm not 100% sure that is always what is expected. In general, file output is normally only used for testing and not in production, but it's still a topic worth discussing in another thread.
Let me know what your outcome is.
Hi
if it makes sense to treat the file output the same as writing logs
I'm not sure I understood this. Do you mean filebeat's output being a log file itself?
I sadly report that it seems it's still happening.
To detail it as much as possible, here's a trace of what I'm doing:
Dockerfile
filebeat.dockerfile.txt
Build image
docker build -f filebeat.dockerfile.txt -t filebeat .
Run and check version
docker run --rm -it filebeat /usr/share/filebeat/bin/filebeat -version
filebeat version 5.0.0-alpha3 (amd64), libbeat 5.0.0-alpha3
Delete before testing
sudo rm filebeat.log* final.log log*
Run filebeat1
filebeat1.yml.txt
docker run -d --name filebeat1 -h filebeat1 -v ~/beat/filebeat.yml:/usr/share/filebeat/bin/filebeat1.yml -v /tmp:/tmp/filebeat filebeat /usr/share/filebeat/bin/filebeat -c /usr/share/filebeat/bin/filebeat.yml
Run filebeat2
filebeat2.yml.txt
docker run -d --name filebeat2 -h filebeat2 -v ~/beat/filebeat2.yml:/usr/share/filebeat/bin/filebeat.yml -v /tmp:/tmp/filebeat filebeat /usr/share/filebeat/bin/filebeat -c /usr/share/filebeat/bin/filebeat.yml
Run generator1
generator.yml.txt
docker run -d --name generator1 -h generator1 -v ~/logstash/_generator/config:/etc/logstash/config -v /tmp:/tmp/filebeat logstash:5.0.0-alpha3-1 /usr/share/logstash/bin/logstash --path.settings=/etc/logstash/config
When there are ~80 filebeat.log files, stop the generator container
docker rm -v -f generator1
Check for gaps in filebeat.log*
ls -v filebeat.log* | xargs -i grep generator1 '{}' | sed 's/.*sequence\"://g' | sed 's/,.*//g' | sort -n | awk -v "p=1" '(p==1){p=$1-1}($1!=p+1){print p+1"-"$1-1}{p=$1}END{print "ok"}'
Check for gaps in final.log*
ls -v final.log* | xargs -i grep generator1 '{}' | sed 's/.*sequence\\"://g' | sed 's/,.*//g' | sort -n | awk -v "p=1" '(p==1){p=$1-1}($1!=p+1){print p+1"-"$1-1}{p=$1}END{print "ok"}'
106980-109856 118489-118487 118490-118488 118491-118489 118492-11849 ...
Test results looking for a specific sequence number
grep -H 106980, filebeat.log*
filebeat.log.60:{"@timestamp":"2016-06-15T17:27:22.393Z","beat":{"hostname":"filebeat1","name":"filebeat1"},"input_type":"log","message":"{\"sequence\":106980,\"@timestamp\":\"2016-06-15T17:27:21.802Z\",\"@version\":\"1\",\"host\":\"generator1\",\"from\":\"generator1\",\"message\":\"Hello world!\"}","offset":14866230,"source":"/tmp/filebeat/loggenerator1","type":"log"}
grep -H 106980, final.log
(No results)
I have uploaded the log files in case they're somehow useful... (hope it works: http://s000.tinyupload.com/index.php?file_id=56525537947986187956)
Regards
Thanks for sharing the log files. I did one specific check for the sequence number 118489-118487 you listed above. 118487 is the last entry in log file 57, 118488, 118489 are in log file 56.
In the final log I found sequence 118487 on line 115494 and sequence 118488 and following at line 104388. Did you expect the sequence numbers to be in order in final.log as well?
Not 100% sure I read the output above correctly, but I expected that the numbers you list above have a gap in between?
I also checked for 106980 and this one I couldn't find in final.log. It seems the events that are potentially missing are the first ones in a new file?
An additional note here: the last lines of the final.log which I downloaded have some weird lines inside where it seems something went wrong with the encoding. Same on your side?
Checking further, it seems like the complete log.60 is missing. I'm curious if this could be some internal race condition. You run the generation and crawling in parallel, correct? If you run them separately, meaning first run filebeatA and then run filebeatB, do you still have missing events?
Can you share the registry file that was generated during the tests? It would be interesting to see if log.60 shows up or not, and if yes, what the offset is.
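To inspect the registry, a small helper like this could dump the recorded offsets. This is a hypothetical sketch: it assumes the 1.x-style registry layout (a top-level JSON object keyed by source path); the 5.0 format may differ.

```python
import json

def registry_offsets(path):
    """Return {source_path: offset} from a Filebeat 1.x-style registry file.

    Assumes the registry is a JSON object whose keys are source paths and
    whose values are state objects containing an "offset" field.
    """
    with open(path) as fh:
        reg = json.load(fh)
    return {src: state.get("offset") for src, state in reg.items()}
```

For example, registry_offsets(".filebeat") would show whether log.60 was ever recorded, and at what offset.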
@mostolog I just realised that this issue is actually in the logstash repo instead of the beats repo. We can keep it here for the moment, but as soon as we find the potential bug, we have to open a new issue in the beats repo.
Hi
Thanks for sharing the log files. I did one specific check for the sequence number 118489-118487 you listed above. 118487 is the last entry in log file 57, 118488, 118489 are in log file 56.
That's correct.
In the final log I found sequence 118487 on line 115494 and sequence 118488 and following at line 104388.
With _grep -nH 118489 final.log_ I got 2 relevant results:
final.log:104388:{"@timestamp":"2016-06-15T17:27:24.238Z","beat":{"hostname":"filebeat2","name":"filebeat2"},"input_type":"log","message":"{\"@timestamp\":\"2016-06-15T17:27:23.956Z\",\"beat\":{\"hostname\":\"filebeat1\",\"name\":\"filebeat1\"},\"input_type\":\"log\",\"message\":\"{\"sequence\":118488,\"@timestamp\":\"2016-06-15T17:27:22.380Z\",\"@version\":\"1\",\"host\":\"generator1\",\"from\":\"generator1\",\"message\":\"Hello world!\"}\",\"offset\":16477350,\"source\":\"/tmp/filebeat/loggenerator1\",\"type\":\"log\"}","offset":356,"source":"/tmp/filebeat/filebeat.log","type":"log"}
final.log:115836:{"@timestamp":"2016-06-15T17:27:25.239Z","beat":{"hostname":"filebeat2","name":"filebeat2"},"input_type":"log","message":"{\"@timestamp\":\"2016-06-15T17:27:23.956Z\",\"beat\":{\"hostname\":\"filebeat1\",\"name\":\"filebeat1\"},\"input_type\":\"log\",\"message\":\"{\"sequence\":118488,\"@timestamp\":\"2016-06-15T17:27:22.380Z\",\"@version\":\"1\",\"host\":\"generator1\",\"from\":\"generator1\",\"message\":\"Hello world!\"}\",\"offset\":16477350,\"source\":\"/tmp/filebeat/loggenerator1\",\"type\":\"log\"}","offset":356,"source":"/tmp/filebeat/filebeat.log.8","type":"log"}
which means the sequence 118489 was processed twice (due to rotation), doesn't it?
Did you expect the sequence numbers to be in order also in the final.log?
Not if using filebeat* as the prospector path, but it seemed buggy also when using "filebeat.log" (without the *).
Not 100% sure I read the output above correctly, but I expected that the numbers you list above have a gap in between?
Using "filebeat.log" I didn't expect gaps, as it only reads one file. Using filebeat.log*, as far as I can understand, it's possible to have numbers out of order, hence having gaps in between.
But, please, notice that:
grep -nH sequence\\":106980 filebeat.log*
filebeat.log.60:1:{"@timestamp":"2016-06-15T17:27:22.393Z","beat":{"hostname":"filebeat1","name":"filebeat1"},"input_type":"log","message":"{\"sequence\":106980,\"@timestamp\":\"2016-06-15T17:27:21.802Z\",\"@version\":\"1\",\"host\":\"generator1\",\"from\":\"generator1\",\"message\":\"Hello world!\"}","offset":14866230,"source":"/tmp/filebeat/loggenerator1","type":"log"}
while grep -nH 106980 final.log
final.log:100660:{"@timestamp":"2016-06-15T17:27:24.075Z","beat":{"hostname":"filebeat2","name":"filebeat2"},"input_type":"log","message":"{\"@timestamp\":\"2016-06-15T17:27:20.270Z\",\"beat\":{\"hostname\":\"filebeat1\",\"name\":\"filebeat1\"},\"input_type\":\"log\",\"message\":\"{\"sequence\":76918,\"@timestamp\":\"2016-06-15T17:27:20.212Z\",\"@version\":\"1\",\"host\":\"generator1\",\"from\":\"generator1\",\"message\":\"Hello world!\"}\",\"offset\":10698006,\"source\":\"/tmp/filebeat/loggenerator1\",\"type\":\"log\"}","offset":631900,"source":"/tmp/filebeat/filebeat.log","type":"log"}
final.log doesn't have a sequence number 106980. That is: it is losing events.
Regards
It seems the events that are potentially missing are the first ones in a new file?
I'll test a few times and let you know if it happens "too often"
An additional note here: the last lines of the final.log which I downloaded have some weird lines inside where it seems something went wrong with the encoding. Same on your side?
Same on my side
Checking further, it seems like the complete log.60 is missing. I'm curious if this could be some internal race condition. You run the generation and crawling in parallel, correct?
I didn't notice that. Yes: the generation, filebeatA and filebeatB are running at the same time.
If you run them separately, meaning first run filebeatA and then run filebeatB, do you still have missing events?
Haven't tried. It probably won't happen.
I just realised now, that this issue is actually in the logstash repo instead of the beats. We can keep it here for the moment but as soon as we find the potential bug, we have to open a new issue in the beats repo
Glad to hear there's a chance of it being a bug.
Regards
I didn't realise that it shows up twice. You are right that this definitely should not happen. Interestingly, it was crawled once as filebeat.log and once as filebeat.log.8, which means it was picked up again at a later stage. This could speak for a race condition, meaning two harvesters picked it up independently of each other.
There is definitely the potential for a bug here :-) The problem with these race conditions is that it is really hard to find a simple test that reproduces them every time. I really appreciate all the effort you put into this.
For future testing, can we make sure to have the following:
BTW: when you check for the first events in a new file to be missing, can you check whether the complete file is missing? I get more and more the impression we don't have an issue of single events missing, but actually of complete files missing or being read twice.
I really appreciate all the effort you put into this.
Thank YOU.
Define separate registry files for each filebeat instance
You mean directories? Please elaborate.
Use Filebeat 5.0.0-alpha3 for testing
Already ;)
Use scan_frequency > 0
Although the provided example states:
Scan frequency in seconds.
How often these files should be checked for changes. In case it is set
to 0s, it is done as often as possible. Default: 10s
:stuck_out_tongue_winking_eye:
BTW: when you check for the first events in a new file to be missing, can you check whether the complete file is missing?
Do you mean "before doing grep -nH 106980 final.log, check that all filebeat.log* are present in final.log"? Otherwise, I didn't understand.
I get more and more the impression we don't have an issue related to single events missing, but actually complete files missing or read twice.
And now is when I forward you to:
the log.60 above, which is completely missing. This is to confirm my theory that we are not talking about missing events, but missing files.
About your last two points:
Tomorrow (or next Monday) I'll test a few times setting different registry values, although using Docker that shouldn't be a worry, should it?
I'll add "check if any filebeat.log* is missing from final.log" to the script.
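That check could be sketched like this (the paths are hypothetical; it matches on the outer "source" field that Filebeat writes into each event, since the nested, escaped copy will not match the pattern):

```python
import glob
import os
import re

# The outer "source" field in final.log events is unescaped; the nested,
# escaped copy (\"source\") does not match this pattern, so only the
# reporting instance's source file is captured.
SOURCE_RE = re.compile(r'"source":"([^"\\]+)"')

def missing_source_files(rotated_glob, final_glob):
    """Return input files (by basename) that never appear as a "source"
    in any output event, i.e. files the second Filebeat skipped entirely."""
    expected = {os.path.basename(p) for p in glob.glob(rotated_glob)}
    seen = set()
    for path in glob.glob(final_glob):
        with open(path, errors="replace") as fh:
            for line in fh:
                for m in SOURCE_RE.finditer(line):
                    seen.add(os.path.basename(m.group(1)))
    return sorted(expected - seen)
```

Running it over /tmp/filebeat.log* and /tmp/final.log* would immediately show cases like the missing filebeat.log.60 discussed above.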
Identifiers: we are discussing using fingerprints for files in the future, and not only inode/device. This should also improve some problems like inode reuse. See elastic/beats#1022
Now my approach seems simpler to me...
Our fingerprint approach is not meant to solve output issues with filebeat, but to solve challenges with lots of other logging systems. Relying only on the path for file detection is definitely also a configuration option we are discussing.
For Docker: assuming you run each filebeat instance in its own docker container and the registry directory is not a shared volume, it should not matter. TBH, it would be better to do the tests without shared volumes in Docker. I used shared volumes in the past for some system tests and it led to some crazy things, as inodes and devices sometimes just randomly changed. But that was on OS X, mounting local volumes. I'm just mentioning this to make sure in the end we don't discover a docker volume issue :-)
Ok.
I'll let you know of any update / issue opened at logstash.
Thanks a lot!
@mostolog any updates from your side here?
I was thinking of writing a small python script to test this, so we could even add it to our test suite. The simplified version would be to use a Python logger that generates about 100 rotated files with a sequence number; filebeat keeps reading, and when it's finished, python reads the files, loads all sequence numbers into memory and checks whether any is missing.
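A rough sketch of such a script, using Python's stock RotatingFileHandler as the generator (the names and parameters here are made up for illustration; in the real test, missing_sequences would run against Filebeat's output rather than the generator's own files):

```python
import glob
import logging
import logging.handlers
import os
import re

def write_rotated_sequences(directory, total, max_bytes=200, backups=200):
    """Write `total` numbered lines through a rotating logger, producing
    many small rotated files (gen.log, gen.log.1, ...)."""
    handler = logging.handlers.RotatingFileHandler(
        os.path.join(directory, "gen.log"),
        maxBytes=max_bytes, backupCount=backups)
    logger = logging.getLogger("rotation-test")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    for seq in range(total):
        logger.info("sequence:%d", seq)
    logger.removeHandler(handler)
    handler.close()

def missing_sequences(directory, total):
    """Read every rotated file back and return sequence numbers never seen."""
    found = set()
    for path in glob.glob(os.path.join(directory, "gen.log*")):
        with open(path) as fh:
            for line in fh:
                m = re.search(r"sequence:(\d+)", line)
                if m:
                    found.add(int(m.group(1)))
    return sorted(set(range(total)) - found)
```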
@mostolog Based on your script, I wrote some python tests. It seems that especially under heavy rotation, something goes wrong: https://github.com/elastic/beats/pull/1954 Can we move this issue to the Beats repo?
On my way! (Thanks, I have been quite busy these days)
@mostolog Thanks, closing this in favor of https://github.com/elastic/beats/issues/1956 AFAIK it is not possible to move issues :-(
I'm not sure whether I'm throwing this comment out of context, but here goes. Please let me know if this is the wrong place to ask. In our production env we're doing log rotation through log4j, renaming json.log to json.log.n, where n is between 1 and 5 and the size is fixed at 100MB (disk space usage and whatnot). The Filebeat version is 1.2.2, configured to pick up the single log file which log4j generates and rotates. The output is aimed at a logstash listener over the wire. Any idea whether this issue will be hit under such conditions?
@sacasumo Did it happen that you lost some events? There are some edge cases that can happen in the 1.x release because of the way states are stored, but events should not be lost. Please use https://discuss.elastic.co/c/beats/filebeat for questions and add additional details like the config file etc. In case you hit some issues, please try 5.0.0-alpha5, as all known issues are fixed there: https://www.elastic.co/downloads/beats/filebeat
We haven't investigated this in a lot of detail, so we can't really say whether we're losing logs through the cracks. We just stumbled across the issue and reached out for info :smile: We haven't put in an external systematic check to figure out whether this happens. What edge cases do you know of? We've pondered using 5.0.0-alpha5, however this is in production, and the alpha tag gives me the shivers :see_no_evil:
The issue can happen if the rotation of multiple files happens during the file system scan, exactly between reading the info and putting it into memory. To put this into perspective: to reproduce it I had to do 100 file rotations per second and set scan_frequency to 0.1s, which is not recommended at all. Obviously this "race" condition can also happen with much lower file rotation rates and scan_frequency times, but it is much less likely. That is why it was initially quite tricky to even find the problem. To prevent these kinds of state race conditions, state handling was completely rewritten and restructured in the 5.0 release. From this point of view I can recommend the alpha5 release, as all these known edge cases no longer happen or are handled gracefully. But obviously we do not recommend alphas for production. If it helps, there is currently no major issue in the filebeat alpha5 version that I know of ;-)
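The check-then-use window described here can be sketched abstractly (illustrative only, not Filebeat's actual code): the scan first stats the files, and only later records the state, so a rename landing between the two phases leaves stale state behind.

```python
import os

# Sketch of the scan race: phase 1 gathers file info, phase 2 records it.
# If rotation renames files between the two phases, the recorded
# path -> inode mapping is already stale by the time it is stored.

def check(paths):
    """Phase 1: stat every file and remember its inode."""
    return [(p, os.stat(p).st_ino) for p in paths]

def commit(infos, state):
    """Phase 2: record the mapping (stale if a rename happened since check)."""
    for path, inode in infos:
        state[path] = inode
```

The faster the rotation and the lower the scan_frequency, the more often this window between the two phases is hit, which matches the 100-rotations-per-second repro described above.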
Thanks for the in-depth response, in such context! In our case rotation happens much less frequently, just about once an hour, with the default scan_frequency.
Thanks again!