After a restart, Logstash fails to start with the message:
[2017-06-26T16:24:04,802][ERROR][logstash.pipeline ] Logstash failed to create queue {"exception"=>"Page file size 0 different to configured page capacity 262144000 for /path/to/data/queue/main/page.136", "backtrace"=>["org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary.java:136:in `open'", "/path/to/install/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:41:in `with_queue'", "/path/to/install/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:30:in `create_file_based'", "/path/to/install/logstash/logstash-core/lib/logstash/queue_factory.rb:29:in `create'", "/path/to/install/logstash/logstash-core/lib/logstash/pipeline.rb:148:in `initialize'", "/path/to/install/logstash/logstash-core/lib/logstash/agent.rb:286:in `create_pipeline'", "/path/to/install/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'", "/path/to/install/logstash/logstash-core/lib/logstash/runner.rb:274:in `execute'", "/path/to/install/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", "/path/to/install/logstash/logstash-core/lib/logstash/runner.rb:185:in `run'", "/path/to/install/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", "/path/to/install/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}
[2017-06-26T16:24:04,810][ERROR][logstash.agent ] Cannot create pipeline {:reason=>"Page file size 0 different to configured page capacity 262144000 for /path/to/data/queue/main/page.136"}
queue.type: persisted
queue.checkpoint.writes: 1024
queue.max_bytes: 1024mb
page.136 is indeed 0 bytes.

I can confirm this. If you try to adjust queue.page_capacity while there are any events stored in the on-disk queue, this error will be triggered. The value can be adjusted up and down only if the disk queue is empty.
@nicknameforever Thanks for the report. A few questions:
- Did you change queue.page_capacity as @shoggeh mentioned?
- Is page.136 the last page?

@shoggeh thanks for your input. Did you in fact experience this specific error by just changing queue.page_capacity? I will try to reproduce that case.
"Page file size 0 different to configured page capacity xxx for ...
Specifically the "Page file size 0" ?
This is very weird. Reviewing the related code, it seems the only way a page of size 0 can exist is if an exception is thrown right after the page file creation and before or in the map() call in https://github.com/elastic/logstash/blob/b3b9e605ac2e35eaedf0cc9fc0668ca6427fca19/logstash-core/src/main/java/org/logstash/ackedqueue/io/MmapPageIO.java#L71-L73
The new RandomAccessFile(this.file, "rw") call will create the page file with size 0, and then this.channel.map(FileChannel.MapMode.READ_WRITE, 0, this.capacity) should extend it to the configured page capacity.
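That two-step initialization can be sketched as follows. This is a hypothetical Ruby simulation (not the real Java code; the extend step is simulated with truncate rather than an actual mmap, and the capacity is scaled down):

```ruby
require "tmpdir"

# Simulates the page-file initialization sequence: the open creates the
# file at size 0, and only the later map()-equivalent step extends it to
# the configured capacity. An exception between the two steps would
# leave a zero-byte page file behind on disk.
CAPACITY = 262_144  # scaled-down stand-in for queue.page_capacity

Dir.mktmpdir do |dir|
  path = File.join(dir, "page.0")

  # Step 1: equivalent of `new RandomAccessFile(file, "rw")` --
  # the file now exists but is empty.
  f = File.open(path, "w+")
  puts File.size(path)   # => 0

  # Step 2: equivalent of FileChannel.map(READ_WRITE, 0, capacity),
  # which grows the file to the page capacity.
  f.truncate(CAPACITY)
  puts File.size(path)   # => 262144
  f.close
end
```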
@nicknameforever @shoggeh can you see any error or exception in the logs that would indicate such a condition?
@colinsurprenant
Our queue.page_capacity is 250M. The queue dir contains:
250M page.135
34 checkpoint.135
34 checkpoint.136
0 page.136
34 checkpoint.head
250M page.137
[2017-06-24T02:18:30,480][ERROR][logstash.pipeline ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"Java heap space", "backtrace"=>[]}
...
[2017-06-24T02:18:28,756][ERROR][logstash.filters.grok ] Error while attempting to check/cancel excessively long grok patterns {:message=>"Java heap space", :class=>"Java::JavaLang::OutOfMemoryError", :backtrace=>[]}
[2017-06-24T02:18:35,271][ERROR][logstash.pipeline ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"Java heap space", "backtrace"=>[]}
...
This could be a separate issue, but is most likely due to the issue https://github.com/elastic/logstash/issues/6970
@colinsurprenant yes, I think I'm running into a different issue than @nicknameforever, although it results in a similar error.
I changed the queue.page_capacity size on a logstash instance processing data (tested 256mb, 512mb and 1024mb). The situation was as follows:

With nothing in the queue at the time of the page_capacity size change, it was POSSIBLE to restart logstash without any issue. By "nothing" I mean the _node/stats monitoring endpoint showing 0 under pipeline.queue.events. With an example change from 256mb to 512mb:
PRIOR to change:
"queue" : {
"events" : 0,
"type" : "persisted",
"capacity" : {
"page_capacity_in_bytes" : 262144000,
<< actual change >>
[2017-06-28T16:21:57,852][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
"queue" : {
"events" : 0,
"type" : "persisted",
"capacity" : {
"page_capacity_in_bytes" : 536870912,
However, the size of the page file remained at its former value:
-rw-r--r-- 1 logstash logstash 250M Jun 28 16:20 /var/lib/logstash/queue/main/page.505
END RESULT: Logstash resumed all of the operations properly.
With events in the queue, the only way to recover was either to revert queue.page_capacity to its former value, or to delete /var/lib/logstash/queue and let logstash recreate it. PRIOR TO CHANGE:
"queue" : {
"events" : 22237,
"type" : "persisted",
"capacity" : {
"page_capacity_in_bytes" : 262144000,
<< actual change >>
[2017-06-28T16:44:16,848][ERROR][logstash.agent ] Cannot create pipeline {:reason=>"Page file size 262144000 different to configured page capacity 536870912 for /var/lib/logstash/queue/main/page.505"}
END RESULT: Logstash not starting.
If it's not possible to avoid enforcing the page_capacity size on already-created pages, then maybe it's worth mentioning in the docs that the queue needs to be drained before you can change it?
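Until then, the drain check described above can be scripted. A minimal sketch, assuming the _node/stats JSON shape quoted earlier in this thread (in practice you would fetch it from the monitoring API on port 9600 rather than from a string):

```ruby
require "json"

# Sample _node/stats fragment mirroring the output quoted above.
stats_json = <<~JSON
  {"pipeline": {"queue": {"events": 0, "type": "persisted",
    "capacity": {"page_capacity_in_bytes": 262144000}}}}
JSON

# Only change queue.page_capacity once pipeline.queue.events is 0.
events = JSON.parse(stats_json).dig("pipeline", "queue", "events")
if events.zero?
  puts "queue drained; safe to change queue.page_capacity"
else
  puts "#{events} events still queued; drain before changing page_capacity"
end
```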
@shoggeh ok, thanks for the details - so yes, this is a different issue than the zero page file size which seems to be triggered by an exception which prevents the correct page file "initialization".
But your situation is definitely a problem we need to also fix, either with better documentation, better error message or by revisiting this limitation.
Would you mind creating a separate issue with your specific problem description here? - basically copy-paste your previous comment.
@nicknameforever thanks for the info. OK, I agree we potentially have 2 different issues here (one leading to the other): the OOM condition from LS_HEAP_SIZE vs LS_JAVA_OPTS as described in #6970, and the resulting PQ corruption. We will address both separately and keep the focus on the PQ corruption in this issue.
How can we get logstash to start again in case of an empty page file?
$ du -sh /run/shm/logstash/queue/main/page*
250M /run/shm/logstash/queue/main/page.3470
0 /run/shm/logstash/queue/main/page.3471
250M /run/shm/logstash/queue/main/page.3511
250M /run/shm/logstash/queue/main/page.3525
250M /run/shm/logstash/queue/main/page.3898
4.0K /run/shm/logstash/queue/main/page.3899
$ tail -2 /var/log/logstash/logstash-plain.log
[2017-07-14T16:47:57,434][ERROR][logstash.pipeline ] Logstash failed to create queue {"exception"=>"Page file size 0 different to configured page capacity 262144000 for /run/shm/logstash/queue/main/page.3471", "backtrace"=>["org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary.java:133:in `open'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:41:in `with_queue'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:30:in `create_file_based'", "/usr/share/logstash/logstash-core/lib/logstash/queue_factory.rb:29:in `create'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:159:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:286:in `create_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:314:in `execute'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:209:in `run'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", "/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}
[2017-07-14T16:47:57,439][ERROR][logstash.agent ] Cannot create pipeline {:reason=>"Page file size 0 different to configured page capacity 262144000 for /run/shm/logstash/queue/main/page.3471"}
@martijnvermaat could you please run the following command from the logstash root dir and report the result?
$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
@martijnvermaat sorry, I should have given more explanation: the above command will just dump the content of the checkpoint files and will help me in seeing if we can come up with a manual way to repair the broken queue state. The checkpoint files do not contain any ingested data but only metadata about the queue state.
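For reference, the binary layout that unpack string assumes can be illustrated by round-tripping a synthetic checkpoint (the values below are made up, not from a real queue):

```ruby
# Checkpoint field layout assumed by the one-liner above:
# version (n, uint16) | pageNum (N, uint32) | firstUnackedPage (N) |
# firstUnackedSeq (Q>, uint64) | minSeq (Q>) | elementCount (N) | crc32 (N)
# Total: 2 + 4 + 4 + 8 + 8 + 4 + 4 = 34 bytes, which matches the 34-byte
# checkpoint files in the directory listings above.
fields = [1, 3471, 3470, 1_670_260_156, 1_669_780_636, 479_520, 0]
packed = fields.pack("nNNQ>Q>NN")
puts packed.bytesize  # => 34

version, page, first_unacked_page, first_unacked_seq,
  min_seq, element_count, crc32 = packed.unpack("nNNQ>Q>NN")

puts "page.#{page}: minSeq=#{min_seq} elementCount=#{element_count} " \
     "firstUnackedSeq=#{first_unacked_seq}"
```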
@colinsurprenant I since moved the queue dir to some other location and restarted logstash (twice, because after ~ 15min the same thing happened). Here are the contents of the checkpoint files I moved:
At this point it's not a priority to recover the events, but it's a bit worrying that this happened twice.
@martijnvermaat Great, thanks. In this checkpoints.txt, which page is now 0 bytes? (Or you could also include an ls -la of data/queue/main/.)
Also, have you seen any error logs prior to this happening? Was Logstash somehow abnormally terminated? crashed? interrupted?
Thanks.
@colinsurprenant Hi! Colleague of @martijnvermaat here, jumping in since he's away for a few days.
Here's the directory listing for the queue mentioned above: queue_ls.txt
The reason for restarting Logstash was that the process was "hanging", meaning the pipeline(s) stopped.
[2017-07-14T15:23:11,978][INFO ][logstash.outputs.file ] Opening file {:path=>"/store/logstash/2017/07/14/message.log.gz"}
[2017-07-14T15:23:11,980][ERROR][logstash.pipeline ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"0", "backtrace"=>["org.logstash.dissect.DissectorErrorUtils.backtrace(DissectorErrorUtils.java:16)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.logException(JavaDissectorLibrary.java:224)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.dissect(JavaDissectorLibrary.java:130)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.dissect_multi(JavaDissectorLibrary.java:140)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect$INVOKER$i$2$0$dissect_multi.call(JavaDissectorLibrary$RubyDissect$INVOKER$i$2$0$dissect_multi.gen)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:202)", "rubyjit.LogStash::Filters::Dissect$$multi_filter_1299793c6e70636e916ab55a5dfbe2aaaf57a1ae1808253012.__file__(/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dissect-1.0.11/lib/logstash/filters/dissect.rb:182)", "rubyjit.LogStash::Filters::Dissect$$multi_filter_1299793c6e70636e916ab55a5dfbe2aaaf57a1ae1808253012.__file__(/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dissect-1.0.11/lib/logstash/filters/dissect.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", "org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:326)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:170)", "rubyjit.LogStash::FilterDelegator$$multi_filter_43640ebf68de601b56cb618392ab9de0b4f8c58a1808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:43)", "rubyjit.LogStash::FilterDelegator$$multi_filter_43640ebf68de601b56cb618392ab9de0b4f8c58a1808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", 
"org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)", "org.jruby.ast.DAsgnNode.interpret(DAsgnNode.java:110)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.IfNode.interpret(IfNode.java:118)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:157)", "org.jruby.runtime.Block.yield(Block.java:142)", "org.jruby.RubyArray.eachCommon(RubyArray.java:1606)", "org.jruby.RubyArray.each(RubyArray.java:1613)", "org.jruby.RubyArray$INVOKER$i$0$0$each.call(RubyArray$INVOKER$i$0$0$each.gen)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", "org.jruby.ast.CallNoArgBlockNode.interpret(CallNoArgBlockNode.java:64)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.RubyProc.call19(RubyProc.java:281)", "org.jruby.RubyProc$INVOKER$i$0$0$call19.call(RubyProc$INVOKER$i$0$0$call19.gen)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", 
"org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)", "org.jruby.ast.DAsgnNode.interpret(DAsgnNode.java:110)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.internal.runtime.methods.ProcMethod.call(ProcMethod.java:64)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.block_0$RUBY$__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:383)", "rubyjit$LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012$block_0$RUBY$__file__.call(rubyjit$LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012$block_0$RUBY$__file__)", "org.jruby.runtime.CompiledBlock19.yield(CompiledBlock19.java:159)", "org.jruby.runtime.CompiledBlock19.call(CompiledBlock19.java:87)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.RubyProc.call19(RubyProc.java:281)", "org.jruby.RubyProc$INVOKER$i$0$0$call19.call(RubyProc$INVOKER$i$0$0$call19.gen)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", 
"org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.block_0$RUBY$__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:312)", "rubyjit$LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012$block_0$RUBY$__file__.call(rubyjit$LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012$block_0$RUBY$__file__)", "org.jruby.runtime.CompiledBlock19.yield(CompiledBlock19.java:135)", "org.jruby.runtime.Block.yield(Block.java:142)", "org.jruby.RubyHash$13.visit(RubyHash.java:1355)", "org.jruby.RubyHash.visitLimited(RubyHash.java:648)", "org.jruby.RubyHash.visitAll(RubyHash.java:634)", "org.jruby.RubyHash.iteratorVisitAll(RubyHash.java:1306)", "org.jruby.RubyHash.each_pairCommon(RubyHash.java:1351)", "org.jruby.RubyHash.each19(RubyHash.java:1342)", "org.jruby.RubyHash$INVOKER$i$0$0$each19.call(RubyHash$INVOKER$i$0$0$each19.gen)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:311)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:161)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", 
"rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.chained_0_rescue_1$RUBY$SYNTHETIC__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:382)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.FCallOneArgNode.interpret(FCallOneArgNode.java:36)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.ast.WhileNode.interpret(WhileNode.java:131)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)", "org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:225)", "org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:219)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:202)", "org.jruby.ast.FCallTwoArgNode.interpret(FCallTwoArgNode.java:38)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", 
"org.jruby.RubyProc.call(RubyProc.java:230)", "org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:103)", "java.lang.Thread.run(Thread.java:748)"]}
I suspect the actual hang of the pipeline to be related to some JVM config (perhaps the issue raised above) or some overflow related to the high volume throughput through a single TCP input. Not sure how to reproduce it yet.
@onnos @martijnvermaat Thanks, very good info, this is helpful.
@martijnvermaat @onnos Which version of LS are you running and on what platform? Thanks.
@colinsurprenant LS 2.3.2 on Ubuntu 14.04.
@martijnvermaat Are you sure? Persistent Queue was only introduced in LS 5.4 !? (in 5.1 in beta)
That's the retired setup, this is 5.5.0. :) Sorry for the confusion, we're transitioning between setups. Running on Ubuntu 14.04 with Xenial 4.4.0 kernel.
Again referring to the problem of the initial hang: I split the incoming stream into two separate TCP inputs (from rsyslog to logstash), and this has helped throughput under load tremendously. I suspect we won't be able to recreate the crash.
@martijnvermaat @onnos Here's what I can make of your queue state:
The oldest page is 3470 which contains the sequence numbers range of [1669303167..1669780635] and the First Unacked Sequence Num (FUS) is 1669595327, so there are 185309 (+/- 1, unimportant here) unacked events in that page.
The next page 3471 is the one with the 0 byte size and contains the sequence number range [1669780636..1670260155] and FUS is 1670260156 which means it is fully acked.
The next pages, 3472+ are all purged and the checkpoint says they are all fully acked up to page 3899 which is the head page.
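The unacked count for page 3470 follows directly from those sequence numbers. A quick check of the arithmetic (the page covers [minSeq .. maxSeq], and everything from the First Unacked Sequence num onward is unacked):

```ruby
# Numbers taken from the page 3470 analysis above.
min_seq = 1_669_303_167
max_seq = 1_669_780_635
fus     = 1_669_595_327  # First Unacked Sequence num

unacked = max_seq - fus + 1
puts unacked  # => 185309
```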
I see 2 problems here:
1- Page 3470 is stuck behind with a bunch of unacked events. The only explanation I see is that the batch at seq num 1669595327 got stuck in a stalled output up until LS crashed or was restarted. I suspect that if it weren't for the 0-byte page size, these events in page 3470 would have been replayed upon restart and the queue would have been correctly purged up to the head page. It is definitely worth evaluating such scenarios and thinking about how to mitigate them. At first thought, I'd say this is more a stuck-output problem than a queue problem per se. I will follow up on this separately.
2- Page 3471 is in fact fully acked and should have been purged completely, but it is zero bytes. I suspect the delete operation was, for some reason, not completed, or something like that. I will investigate in that area and see if I can spot such a condition.
The good news in this case is that simply deleting page.3471 should fix the problem.
The other good news is that I will see if we can add a condition at queue initialization: if a page file is zero bytes and the checkpoint says it is fully acked, we can just clean it up.
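That initialization check could look roughly like the following. This is a hypothetical sketch (not the actual Logstash fix); the checkpoint layout follows the unpack one-liner earlier in this thread, and a page is taken as fully acked when firstUnackedSeq is past the last element:

```ruby
require "tmpdir"

# True if the checkpoint says its page is fully acked AND the page file
# on disk is zero bytes -- the case where the page can safely be removed.
def safe_to_purge?(dir, checkpoint_path)
  _v, page, _fup, first_unacked_seq, min_seq, element_count, _crc =
    File.binread(checkpoint_path).unpack("nNNQ>Q>NN")
  fully_acked = first_unacked_seq >= min_seq + element_count
  page_file = File.join(dir, "page.#{page}")
  fully_acked && File.exist?(page_file) && File.size(page_file).zero?
end

# Demo with a synthetic fully-acked checkpoint and a zero-byte page file
# (numbers modeled on the page 3471 analysis above; firstUnackedPage is
# arbitrary here).
Dir.mktmpdir do |dir|
  File.binwrite(File.join(dir, "checkpoint.3471"),
                [1, 3471, 3472, 1_670_260_156, 1_669_780_636, 479_520, 0]
                  .pack("nNNQ>Q>NN"))
  File.write(File.join(dir, "page.3471"), "")
  puts safe_to_purge?(dir, File.join(dir, "checkpoint.3471"))  # => true
end
```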
@nicknameforever do you still have your problematic queue dir intact? If so would it also be possible for you to run the same command to list the content of the checkpoints (adjust queue dir path if necessary) and see if we have the same situation here?
$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
@colinsurprenant Unfortunately not.
Created issue #7809 specifically for the zero-byte & fully-acked page file case.
PR for #7809 in #7810
Above I mentioned a stalled-plugin condition; issue #7796 was created to get more insight into these.
@colinsurprenant I'm running into the same problem as the original reporter. Didn't change the queue size, just restarted logstash and now it's failing to start with the page file being 0 bytes. Running 5.6.4 of logstash from the elastic repos.
@andrewmiskell thanks for the report. How did LS shut down prior to the restart? Did it complete a clean shutdown, or was it force-stopped or did it crash? Do you have any logs related to that?
@colinsurprenant Not sure if I have the logs, but it was observed after issuing a simple service logstash restart command on both my logstash nodes.
I still have the queues, I moved them to a temporary location for the time being so I could see if they could be fixed and imported into ElasticSearch later.
@andrewmiskell ah! Great that you preserved the queue dir. I am optimistic that we should be able to recover it, and it will help diagnose the problem. In the meantime, I'd really appreciate it if you could check the logs to see whether LS completed a clean/normal shutdown or the service stop ended up killing the LS process.
Could you list the queue dir content and also run the following command from LS home dir, (adjust the queue dir path) to display the content of the checkpoints:
$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
@colinsurprenant Attached to comment. It looks like there was a hung grok thread and it was killed forcefully by the service script.
@andrewmiskell could you also list the queue dir files showing the file sizes please? (ls -la)
Thanks.
@andrewmiskell better yet, could you run this new command from the LS home dir, adjust the dir = "data/queue/main" part to your actual queue dir path.
vendor/jruby/bin/jruby -rpp -e 'dir = "data/queue/main"; Dir.glob("#{dir}/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); fa = firstUnackedSeq >= (minSeq + elementCount); ps = File.exist?("#{dir}/page.#{page}") ? File.size("#{dir}/page.#{page}") : nil; print("#{File.basename(checkpoint)}, #{fa ? "FA" : "UN"}, size: #{ps ? ps : "NA"}, "); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
For some reason, #7809, which originally fixed the Page file size 0 different to configured ... startup exception, was incorrectly merged into the 5.6 branch, and this is why we see this problem again. I re-injected the fix and it will be part of the upcoming 5.6.5 release. This is really unfortunate.
I will go ahead and close this issue, feel free to re-open if needed.