After a restart, Logstash fails to start with the message:
[2017-06-26T16:24:04,802][ERROR][logstash.pipeline ] Logstash failed to create queue {"exception"=>"Page file size 0 different to configured page capacity 262144000 for /path/to/data/queue/main/page.136", "backtrace"=>["org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary.java:136:in `open'", "/path/to/install/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:41:in `with_queue'", "/path/to/install/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:30:in `create_file_based'", "/path/to/install/logstash/logstash-core/lib/logstash/queue_factory.rb:29:in `create'", "/path/to/install/logstash/logstash-core/lib/logstash/pipeline.rb:148:in `initialize'", "/path/to/install/logstash/logstash-core/lib/logstash/agent.rb:286:in `create_pipeline'", "/path/to/install/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'", "/path/to/install/logstash/logstash-core/lib/logstash/runner.rb:274:in `execute'", "/path/to/install/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", "/path/to/install/logstash/logstash-core/lib/logstash/runner.rb:185:in `run'", "/path/to/install/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", "/path/to/install/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}
[2017-06-26T16:24:04,810][ERROR][logstash.agent ] Cannot create pipeline {:reason=>"Page file size 0 different to configured page capacity 262144000 for /path/to/data/queue/main/page.136"}
queue.type: persisted
queue.checkpoint.writes: 1024
queue.max_bytes: 1024mb
page.136 is indeed 0 bytes.

I can confirm this. If you try to adjust queue.page_capacity while there are any events stored in the on-disk queue, this error will be triggered. The value can be adjusted up and down only if the disk queue is empty.
@nicknameforever Thanks for the report. A few questions:
- Did you change queue.page_capacity as @shoggeh mentioned?
- Is page.136 the last page?

@shoggeh thanks for your input. Did you in fact experience this specific error by just changing queue.page_capacity? I will try to reproduce that case.
"Page file size 0 different to configured page capacity xxx for ...
Specifically the "Page file size 0" ?
This is very weird. Reviewing the related code, it seems the only way a page of size 0 can exist is if an exception is thrown right after the page file creation and before or in the map() call in https://github.com/elastic/logstash/blob/b3b9e605ac2e35eaedf0cc9fc0668ca6427fca19/logstash-core/src/main/java/org/logstash/ackedqueue/io/MmapPageIO.java#L71-L73
The new RandomAccessFile(this.file, "rw") call will create the page file with size 0, and then this.channel.map(FileChannel.MapMode.READ_WRITE, 0, this.capacity) should extend it to the configured page capacity.
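That two-step initialization can be sketched as follows. This is a hypothetical Ruby simulation (not the real Java code; the extend step is simulated with truncate rather than an actual mmap, and the capacity is scaled down):

```ruby
require "tmpdir"

# Simulates the page-file initialization sequence: the open creates the
# file at size 0, and only the later map()-equivalent step extends it to
# the configured capacity. An exception between the two steps would
# leave a zero-byte page file behind on disk.
CAPACITY = 262_144  # scaled-down stand-in for queue.page_capacity

Dir.mktmpdir do |dir|
  path = File.join(dir, "page.0")

  # Step 1: equivalent of `new RandomAccessFile(file, "rw")` --
  # the file now exists but is empty.
  f = File.open(path, "w+")
  puts File.size(path)   # => 0

  # Step 2: equivalent of FileChannel.map(READ_WRITE, 0, capacity),
  # which grows the file to the page capacity.
  f.truncate(CAPACITY)
  puts File.size(path)   # => 262144
  f.close
end
```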
@nicknameforever @shoggeh can you see any error or exception in the logs that would indicate such a condition?
@colinsurprenant
Our queue.page_capacity is 250M. The queue dir contains:
250M page.135
34 checkpoint.135
34 checkpoint.136
0 page.136
34 checkpoint.head
250M page.137
[2017-06-24T02:18:30,480][ERROR][logstash.pipeline ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"Java heap space", "backtrace"=>[]}
...
[2017-06-24T02:18:28,756][ERROR][logstash.filters.grok ] Error while attempting to check/cancel excessively long grok patterns {:message=>"Java heap space", :class=>"Java::JavaLang::OutOfMemoryError", :backtrace=>[]}
[2017-06-24T02:18:35,271][ERROR][logstash.pipeline ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"Java heap space", "backtrace"=>[]}
...
This could be a separate issue, but is most likely due to the issue https://github.com/elastic/logstash/issues/6970
@colinsurprenant yes, I think I'm running into a different issue than @nicknameforever, although it results in a similar error.
I changed the queue.page_capacity size on a logstash instance processing data (tested 256mb, 512mb and 1024mb). The situation was as follows:

With nothing in the queue at the time of the page_capacity size change, it was POSSIBLE to restart logstash without any issue. By "nothing" I mean the _node/stats monitoring endpoint showing 0 under pipeline.queue.events. With an example change from 256mb to 512mb:
PRIOR to change:
"queue" : {
"events" : 0,
"type" : "persisted",
"capacity" : {
"page_capacity_in_bytes" : 262144000,
<< actual change >>
[2017-06-28T16:21:57,852][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
"queue" : {
"events" : 0,
"type" : "persisted",
"capacity" : {
"page_capacity_in_bytes" : 536870912,
However, the size of the page file remained at its former value:
-rw-r--r-- 1 logstash logstash 250M Jun 28 16:20 /var/lib/logstash/queue/main/page.505
END RESULT: Logstash resumed all of the operations properly.
With events in the queue, the only way to recover was either to revert queue.page_capacity to its former value, or to delete /var/lib/logstash/queue and let logstash recreate it. PRIOR TO CHANGE:
"queue" : {
"events" : 22237,
"type" : "persisted",
"capacity" : {
"page_capacity_in_bytes" : 262144000,
<< actual change >>
[2017-06-28T16:44:16,848][ERROR][logstash.agent ] Cannot create pipeline {:reason=>"Page file size 262144000 different to configured page capacity 536870912 for /var/lib/logstash/queue/main/page.505"}
END RESULT: Logstash not starting.
If it's not possible to avoid enforcing the page_capacity size on already-created pages, then maybe it's worth mentioning in the docs that the queue needs to be drained before you can change it?
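Until then, the drain check described above can be scripted. A minimal sketch, assuming the _node/stats JSON shape quoted earlier in this thread (in practice you would fetch it from the monitoring API on port 9600 rather than from a string):

```ruby
require "json"

# Sample _node/stats fragment mirroring the output quoted above.
stats_json = <<~JSON
  {"pipeline": {"queue": {"events": 0, "type": "persisted",
    "capacity": {"page_capacity_in_bytes": 262144000}}}}
JSON

# Only change queue.page_capacity once pipeline.queue.events is 0.
events = JSON.parse(stats_json).dig("pipeline", "queue", "events")
if events.zero?
  puts "queue drained; safe to change queue.page_capacity"
else
  puts "#{events} events still queued; drain before changing page_capacity"
end
```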
@shoggeh ok, thanks for the details - so yes, this is a different issue than the zero page file size which seems to be triggered by an exception which prevents the correct page file "initialization".
But your situation is definitely a problem we need to also fix, either with better documentation, better error message or by revisiting this limitation.
Would you mind creating a separate issue with your specific problem description here? - basically copy-paste your previous comment.
@nicknameforever thanks for the info. OK, I agree we potentially have 2 different issues here (one leading to the other): the OOM condition from LS_HEAP_SIZE vs LS_JAVA_OPTS as described in #6970, and the resulting PQ corruption. We will address both separately and keep the focus on the PQ corruption in this issue.
How can we get logstash to start again in case of an empty page file?
$ du -sh /run/shm/logstash/queue/main/page*
250M /run/shm/logstash/queue/main/page.3470
0 /run/shm/logstash/queue/main/page.3471
250M /run/shm/logstash/queue/main/page.3511
250M /run/shm/logstash/queue/main/page.3525
250M /run/shm/logstash/queue/main/page.3898
4.0K /run/shm/logstash/queue/main/page.3899
$ tail -2 /var/log/logstash/logstash-plain.log
[2017-07-14T16:47:57,434][ERROR][logstash.pipeline ] Logstash failed to create queue {"exception"=>"Page file size 0 different to configured page capacity 262144000 for /run/shm/logstash/queue/main/page.3471", "backtrace"=>["org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary.java:133:in `open'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:41:in `with_queue'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:30:in `create_file_based'", "/usr/share/logstash/logstash-core/lib/logstash/queue_factory.rb:29:in `create'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:159:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:286:in `create_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:314:in `execute'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:209:in `run'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", "/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}
[2017-07-14T16:47:57,439][ERROR][logstash.agent ] Cannot create pipeline {:reason=>"Page file size 0 different to configured page capacity 262144000 for /run/shm/logstash/queue/main/page.3471"}
@martijnvermaat could you please run the following command from the logstash root dir and report the result?
$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
@martijnvermaat sorry, I should have given more explanation: the above command will just dump the content of the checkpoint files and will help me in seeing if we can come up with a manual way to repair the broken queue state. The checkpoint files do not contain any ingested data but only metadata about the queue state.
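For reference, the binary layout that unpack string assumes can be illustrated by round-tripping a synthetic checkpoint (the values below are made up, not from a real queue):

```ruby
# Checkpoint field layout assumed by the one-liner above:
# version (n, uint16) | pageNum (N, uint32) | firstUnackedPage (N) |
# firstUnackedSeq (Q>, uint64) | minSeq (Q>) | elementCount (N) | crc32 (N)
# Total: 2 + 4 + 4 + 8 + 8 + 4 + 4 = 34 bytes, which matches the 34-byte
# checkpoint files in the directory listings above.
fields = [1, 3471, 3470, 1_670_260_156, 1_669_780_636, 479_520, 0]
packed = fields.pack("nNNQ>Q>NN")
puts packed.bytesize  # => 34

version, page, first_unacked_page, first_unacked_seq,
  min_seq, element_count, crc32 = packed.unpack("nNNQ>Q>NN")

puts "page.#{page}: minSeq=#{min_seq} elementCount=#{element_count} " \
     "firstUnackedSeq=#{first_unacked_seq}"
```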
@colinsurprenant I since moved the queue dir to some other location and restarted logstash (twice, because after ~ 15min the same thing happened). Here are the contents of the checkpoint files I moved:
At this point it's not a priority to recover the events, but it's a bit worrying that this happened twice.
@martijnvermaat Great, thanks. In this checkpoints.txt, which page is now 0 bytes? (Or you could also include an ls -la of data/queue/main/.)
Also, have you seen any error logs prior to this happening? Was Logstash somehow abnormally terminated? crashed? interrupted?
Thanks.
@colinsurprenant Hi! Colleague of @martijnvermaat here, jumping in since he's away for a few days.
Here's the directory listing for the queue mentioned above: queue_ls.txt
The reason for restarting Logstash was that the process was "hanging", meaning the pipeline(s) stopped.
[2017-07-14T15:23:11,978][INFO ][logstash.outputs.file ] Opening file {:path=>"/store/logstash/2017/07/14/message.log.gz"}
[2017-07-14T15:23:11,980][ERROR][logstash.pipeline ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"0", "backtrace"=>["org.logstash.dissect.DissectorErrorUtils.backtrace(DissectorErrorUtils.java:16)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.logException(JavaDissectorLibrary.java:224)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.dissect(JavaDissectorLibrary.java:130)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.dissect_multi(JavaDissectorLibrary.java:140)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect$INVOKER$i$2$0$dissect_multi.call(JavaDissectorLibrary$RubyDissect$INVOKER$i$2$0$dissect_multi.gen)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:202)", "rubyjit.LogStash::Filters::Dissect$$multi_filter_1299793c6e70636e916ab55a5dfbe2aaaf57a1ae1808253012.__file__(/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dissect-1.0.11/lib/logstash/filters/dissect.rb:182)", "rubyjit.LogStash::Filters::Dissect$$multi_filter_1299793c6e70636e916ab55a5dfbe2aaaf57a1ae1808253012.__file__(/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dissect-1.0.11/lib/logstash/filters/dissect.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", "org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:326)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:170)", "rubyjit.LogStash::FilterDelegator$$multi_filter_43640ebf68de601b56cb618392ab9de0b4f8c58a1808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:43)", "rubyjit.LogStash::FilterDelegator$$multi_filter_43640ebf68de601b56cb618392ab9de0b4f8c58a1808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", 
"org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)", "org.jruby.ast.DAsgnNode.interpret(DAsgnNode.java:110)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.IfNode.interpret(IfNode.java:118)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:157)", "org.jruby.runtime.Block.yield(Block.java:142)", "org.jruby.RubyArray.eachCommon(RubyArray.java:1606)", "org.jruby.RubyArray.each(RubyArray.java:1613)", "org.jruby.RubyArray$INVOKER$i$0$0$each.call(RubyArray$INVOKER$i$0$0$each.gen)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", "org.jruby.ast.CallNoArgBlockNode.interpret(CallNoArgBlockNode.java:64)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.RubyProc.call19(RubyProc.java:281)", "org.jruby.RubyProc$INVOKER$i$0$0$call19.call(RubyProc$INVOKER$i$0$0$call19.gen)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", 
"org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)", "org.jruby.ast.DAsgnNode.interpret(DAsgnNode.java:110)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.internal.runtime.methods.ProcMethod.call(ProcMethod.java:64)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.block_0$RUBY$__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:383)", "rubyjit$LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012$block_0$RUBY$__file__.call(rubyjit$LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012$block_0$RUBY$__file__)", "org.jruby.runtime.CompiledBlock19.yield(CompiledBlock19.java:159)", "org.jruby.runtime.CompiledBlock19.call(CompiledBlock19.java:87)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.RubyProc.call19(RubyProc.java:281)", "org.jruby.RubyProc$INVOKER$i$0$0$call19.call(RubyProc$INVOKER$i$0$0$call19.gen)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", 
"org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.block_0$RUBY$__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:312)", "rubyjit$LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012$block_0$RUBY$__file__.call(rubyjit$LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012$block_0$RUBY$__file__)", "org.jruby.runtime.CompiledBlock19.yield(CompiledBlock19.java:135)", "org.jruby.runtime.Block.yield(Block.java:142)", "org.jruby.RubyHash$13.visit(RubyHash.java:1355)", "org.jruby.RubyHash.visitLimited(RubyHash.java:648)", "org.jruby.RubyHash.visitAll(RubyHash.java:634)", "org.jruby.RubyHash.iteratorVisitAll(RubyHash.java:1306)", "org.jruby.RubyHash.each_pairCommon(RubyHash.java:1351)", "org.jruby.RubyHash.each19(RubyHash.java:1342)", "org.jruby.RubyHash$INVOKER$i$0$0$each19.call(RubyHash$INVOKER$i$0$0$each19.gen)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:311)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:161)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", 
"rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.chained_0_rescue_1$RUBY$SYNTHETIC__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:382)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.FCallOneArgNode.interpret(FCallOneArgNode.java:36)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.ast.WhileNode.interpret(WhileNode.java:131)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)", "org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:225)", "org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:219)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:202)", "org.jruby.ast.FCallTwoArgNode.interpret(FCallTwoArgNode.java:38)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", 
"org.jruby.RubyProc.call(RubyProc.java:230)", "org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:103)", "java.lang.Thread.run(Thread.java:748)"]}
I suspect the actual hang of the pipeline to be related to some JVM config (perhaps the issue raised above) or some overflow related to the high volume throughput through a single TCP input. Not sure how to reproduce it yet.
@onnos @martijnvermaat Thanks, very good info, this is helpful.
@martijnvermaat @onnos Which version of LS are you running and on what platform? Thanks.
@colinsurprenant LS 2.3.2 on Ubuntu 14.04.
@martijnvermaat Are you sure? Persistent Queue was only introduced in LS 5.4 !? (in 5.1 in beta)
That's the retired setup, this is 5.5.0. :) Sorry for the confusion, we're transitioning between setups. Running on Ubuntu 14.04 with Xenial 4.4.0 kernel.
Again referring to the problem of the initial hang: I split the incoming stream into two separate TCP inputs (from rsyslog to logstash), and this has helped throughput under load tremendously. I suspect we won't be able to recreate the crash.
@martijnvermaat @onnos Here's what I can make of your queue state:
The oldest page is 3470 which contains the sequence numbers range of [1669303167..1669780635] and the First Unacked Sequence Num (FUS) is 1669595327, so there are 185309 (+/- 1, unimportant here) unacked events in that page.
The next page 3471 is the one with the 0 byte size and contains the sequence number range [1669780636..1670260155] and FUS is 1670260156 which means it is fully acked.
The next pages, 3472+ are all purged and the checkpoint says they are all fully acked up to page 3899 which is the head page.
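The unacked count for page 3470 follows directly from those sequence numbers. A quick check of the arithmetic (the page covers [minSeq .. maxSeq], and everything from the First Unacked Sequence num onward is unacked):

```ruby
# Numbers taken from the page 3470 analysis above.
min_seq = 1_669_303_167
max_seq = 1_669_780_635
fus     = 1_669_595_327  # First Unacked Sequence num

unacked = max_seq - fus + 1
puts unacked  # => 185309
```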
I see 2 problems here:
1- Page 3470 is stuck behind with a bunch of unacked events. The only explanation I see is that the batch at seq num 1669595327 got stuck in a stalled output up until LS crashed or was restarted. I suspect that if it weren't for the 0-byte page size, these events in page 3470 would have been replayed upon restart and the queue would have been correctly purged up to the head page. It is definitely worth evaluating such scenarios and thinking about how to mitigate them. At first thought, I'd say this is more a stuck-output problem than a queue problem per se. I will follow up on this separately.
2- Page 3471 is in fact fully acked and should have been purged completely, but it is zero bytes. I suspect the delete operation was, for some reason, not completed, or something like that. I will investigate in that area and see if I can spot such a condition.
The good news in this case is that simply deleting page.3471 should fix the problem.
The other good news is that I will see if we can add a condition at queue initialization: if a page file is zero bytes and the checkpoint says it is fully acked, we can just clean it up.
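That initialization check could look roughly like the following. This is a hypothetical sketch (not the actual Logstash fix); the checkpoint layout follows the unpack one-liner earlier in this thread, and a page is taken as fully acked when firstUnackedSeq is past the last element:

```ruby
require "tmpdir"

# True if the checkpoint says its page is fully acked AND the page file
# on disk is zero bytes -- the case where the page can safely be removed.
def safe_to_purge?(dir, checkpoint_path)
  _v, page, _fup, first_unacked_seq, min_seq, element_count, _crc =
    File.binread(checkpoint_path).unpack("nNNQ>Q>NN")
  fully_acked = first_unacked_seq >= min_seq + element_count
  page_file = File.join(dir, "page.#{page}")
  fully_acked && File.exist?(page_file) && File.size(page_file).zero?
end

# Demo with a synthetic fully-acked checkpoint and a zero-byte page file
# (numbers modeled on the page 3471 analysis above; firstUnackedPage is
# arbitrary here).
Dir.mktmpdir do |dir|
  File.binwrite(File.join(dir, "checkpoint.3471"),
                [1, 3471, 3472, 1_670_260_156, 1_669_780_636, 479_520, 0]
                  .pack("nNNQ>Q>NN"))
  File.write(File.join(dir, "page.3471"), "")
  puts safe_to_purge?(dir, File.join(dir, "checkpoint.3471"))  # => true
end
```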
@nicknameforever do you still have your problematic queue dir intact? If so would it also be possible for you to run the same command to list the content of the checkpoints (adjust queue dir path if necessary) and see if we have the same situation here?
$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
@colinsurprenant Unfortunately not.
Created issue #7809 specifically for the zero-byte & fully-acked page file case.
PR for #7809 in #7810
Above I mentioned a stalled-plugin condition; issue #7796 was created to get more insight into these.
@colinsurprenant I'm running into the same problem as the original reporter. Didn't change the queue size, just restarted logstash and now it's failing to start with the page file being 0 bytes. Running 5.6.4 of logstash from the elastic repos.
@andrewmiskell thanks for the report. How did LS shut down prior to the restart? Did it complete a clean shutdown, or was it force-stopped or did it crash? Do you have any logs related to that?
@colinsurprenant Not sure if I have the logs, but it was observed after issuing a simple service logstash restart command on both my logstash nodes.
I still have the queues, I moved them to a temporary location for the time being so I could see if they could be fixed and imported into ElasticSearch later.
@andrewmiskell ah! Great that you preserved the queue dir. I am optimistic that we should be able to recover it, and it will help diagnose the problem. In the meantime, I'd really appreciate it if you could check the logs to see whether LS completed a clean/normal shutdown or the service stop ended up killing the LS process.
Could you list the queue dir content and also run the following command from LS home dir, (adjust the queue dir path) to display the content of the checkpoints:
$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
@colinsurprenant Attached to comment. It looks like there was a hung grok thread and it was killed forcefully by the service script.
@andrewmiskell could you also list the queue dir files showing the file sizes please? (ls -la)
Thanks.
@andrewmiskell better yet, could you run this new command from the LS home dir, adjust the dir = "data/queue/main" part to your actual queue dir path.
vendor/jruby/bin/jruby -rpp -e 'dir = "data/queue/main"; Dir.glob("#{dir}/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); fa = firstUnackedSeq >= (minSeq + elementCount); ps = File.exist?("#{dir}/page.#{page}") ? File.size("#{dir}/page.#{page}") : nil; print("#{File.basename(checkpoint)}, #{fa ? "FA" : "UN"}, size: #{ps ? ps : "NA"}, "); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
For some reason, #7809, which originally fixed the Page file size 0 different to configured ... startup exception, was incorrectly merged into the 5.6 branch, and this is why we see this problem again. I re-injected the fix and it will be part of the upcoming 5.6.5 release. This is really unfortunate.
I will go ahead and close this issue, feel free to re-open if needed.