Logstash: Persistent queues exception causes data loss/pipeline stop

Created on 18 Sep 2017 · 33 comments · Source: elastic/logstash

  • Version: Logstash 6 beta 1
  • Operating System: Ubuntu 16.04

I've got two pipelines. Pipeline 1 gets inputs from several locations and sends some of them (after some modification) to pipeline #2, which outputs to the elasticsearch output.
The second I enable the persistent queue on any of these pipelines, data isn't sent to ES. The queues aren't full and are defined as 1gb in size. The second I disable the queues, data is sent to ES again.
So I know that the pipelines aren't stuck, BUT the elasticsearch output pipeline isn't sending the data and keeps writing events to disk until, at some point, the queues are full and new events are ignored.

I saw in debug mode the following exception:

@task=#<...:0x14741235 @observers=#<...:0x2cbe9635 @observers={}>,
@timeout_interval=120.0,
@running=#<Concurrent::AtomicBoolean:0xfc8 value:true>,
@StoppedEvent=#<... @iteration=0>,
@execution_interval=5.0,
@do_nothing_on_deref=true,
@run_now=nil,
@freeze_on_deref=nil,
@executor=#<...:0x2487c202 @task=#<...>,
@exception_class=StandardError>,
@StopEvent=#<... @iteration=0>,
@value=nil,
@copy_on_deref=nil,
@dup_on_deref=nil>,
@metric=#<...:0x7f1c77f4 @collector=#<...:0x69413f62 @agent=nil,
@metric_store=#<...:0x6e26f702 @store=#<...>,
@structured_lookup_mutex=#<...>,
@fast_lookup=#<...>>>>,
@options={:polling_interval=>5, :polling_timeout=>120}>,
:result=>nil,
:polling_timeout=>120,
:polling_interval=>5,
:exception=>Concurrent::TimeoutError,
:executed_at=>2017-09-14 09:13:47 UTC}

Labels: Pioneer Program, bug

All 33 comments

I'd like to point out that I'm using several machines with the same conf, and only one machine gives me these errors. The machines are basically the same and the conf is the same.
Any insight as to why this isn't working would be great.

If you require more info please let me know.

@randude so PQ enabled works on other machines and only fails on one specific one?

2 machines fail, 5 machines work; same conf, same OS/memory/disk, etc.

Is the exception related?

@randude looks like it, seems something dead-locks in PQ.

So just to recap:

  • You have multiple machines, all of them run fine with PQ enabled, except for one?
  • That one machine works fine without PQ enabled?

If those two are true we (you I guess, sorry) have to find the difference between your machines :)
But ... what you can do regardless is maybe provide a stack-trace of the blocked Logstash?

Yes, you are correct. We tried to figure out what is different between the machines (2 are stuck, not one, and 5 are OK). The ES server is 6 beta 2, if that makes any difference.

How can I provide you a stack trace? What conf do I change and where?

You can get the stack trace using jstack ${PID}, where ${PID} is the PID of your running Logstash
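
For example, a one-liner along these lines works, assuming there is only a single Logstash JVM running on the machine (otherwise look up the PID by hand and redirect wherever you like):

# dump all JVM thread stacks to a file; assumes exactly one process matches "logstash"
jstack $(pgrep -f logstash) > logstash_stack.txt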

@randude permissions and free disk space are ruled out as possible issues? :)

@original-brownbear Yes, we checked those out. The queues seem to be growing over time.

I've attached 2 files here. You can see that many threads are "waiting/parking".
The first log is from when the process is up; the second log is after one minute of missing logs (logs aren't sent to ES for 1 minute).

logstash_first_run.txt
logstash_after_1_min_no_output.txt

I was able to reproduce long pauses in PQ processing in stress tests now. Also, we have some PQ Jenkins runs that took extremely long to finish on the PQ stress tests (40 min vs. the usual 3 min) => looking into this.
Thanks for the report!

@original-brownbear Thanks! That sounds great.
However, I'd like to point out that in my case Logstash doesn't output anything to ES even after 2 days. That's how we found it: we saw a huge gap in Kibana from those machines.

@randude thanks, that's good to know. I did fix one issue that could explain your behaviour in #8186. Are you able to build and test master by any chance?
I'm not 100% confident it's this one though because the condition that was fixed in #8186 is somewhat unlikely, but it could be that your environment is such (event size relative to other configs) that it became more likely.

@original-brownbear I'd rather wait for the next official build; I've currently disabled PQ in our deployment. Do you know when this fix will be officially out?

@randude I don't know yet, sorry. But I have a question:

What file system and hard drive type are you using? It's not AWS EBS by any chance, is it?

We are using ext4 disks on VMware on-prem machines.

@randude thanks :) also, I finally figured this one out in #8293 (hard to reproduce but an obvious concurrency issue).

@original-brownbear That's great news! Deadlocks are nasty bugs; I'm glad I could help you guys before this happened in production ;)

@randude haha yea :) fortunately, backporting the fix to all branches worked out fine, so it will be in the next release :)

@original-brownbear Hey, so is this fixed in 6.0.0-rc1?

@randude yes, it's in the RC1 https://github.com/elastic/logstash/commit/7920b260f536c545ce896d6e472f02fc796e11a3 :)

@original-brownbear Sorry to be the bearer of bad news, but it seems that the issue still persists in my environment. Last time only 2 machines out of 6 were stuck. Now ALL machines are stuck and don't output data to elasticsearch :(

@randude same error message from the periodic poller still?

I attached the stack trace below.
Anyways, I did some research .We have 6 machines. 5 send data to the 6th. The 6th has 2 pipelines. The first gets inputs from the local machine and the other machines and sends it over udp to the second pipeline.
The second pipeline gets the logs from the udp and sends it to elasticsearch. Very simple.
I disabled All the persistent qeues, deleted the data folders and restarted logstash.
Everything was working fine, i could see in kibana data from all machines.
I then enabled the persistent queue only on the pipeline that sends to elasticsearch, i.e one pipeline out of two on 1 machines out of 6.
Immediately i stopped seeing logs from 4 machines but for some reason logs from one machines were still displayed. Any help is appriciated :)

If it helps, the first pipeline, the one that gets inputs from all machines, also uses the "pipe" input - we send data between machines via a Python script that sends logs over SSL.
Let me know if you need more info.
Thanks and sorry again :(

logstash.log

@randude no need to be sorry! Will look into this later today.

@randude are you sure you attached the correct stack trace? The one attached contains only references to the standard in-memory queue like this:

"[main_worker]>worker3" #73 daemon prio=5 os_prio=0 tid=0x00007f117c026800 nid=0x1572 runnable [0x00007f114d1e8000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000e38240e0> (a java.util.concurrent.SynchronousQueue$TransferStack)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
    at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
    at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
    at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(JavaMethod.java:468)
    at org.jruby.javasupport.JavaMethod.invokeDirect(JavaMethod.java:326)
    at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:57)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:193)
    at usr.share.logstash.logstash_minus_core.lib.logstash.util.wrapped_synchronous_queue.invokeOther1:poll(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:23)

Attaching file with the [elasticsearch_worker] pipeline.
stack.log
stack2.log

The second file has both pipelines with pq enabled.

I confirmed that we're getting the same behavior as before. No data is getting to ES (the part about one machine is a mistake; apparently it sent data to ES by itself, it was a conf mistake on my part). However, by setting Logstash to debug mode, I can see that the first worker is still getting logs from all machines and runs filters on them. So that means the PQ is OK there; maybe the problem is the combination of PQ & the ES output?
Enabling PQ on the elasticsearch_worker by itself is enough to stop seeing logs in Kibana. Enabling both made no apparent difference, and from logstash.log I can see that data is still processed in the main_worker pipeline.

@randude interesting, stacktrace number 2 seems to be completely broken. I can see neither a PQ nor a normal Queue running there.
=> Question (sorry if I misread and it's somewhere in the above):

  • Does PQ break for you in single pipeline configurations as well, or is it always multi-pipeline configurations that break in combination with PQ?

Enabling just one PQ, on the elasticsearch_worker, immediately stops the log flow.

Here's my logstash.yml:
path.data: /var/lib/logstash
path.settings: /etc/logstash
path.logs: /var/log/logstash
dead_letter_queue.enable: false
log.level: info

and my pipelines.yml:

- pipeline.id: main_worker
  path.config: "/etc/logstash/main_pipeline.d/*.conf"
  queue.type: persisted
  queue.max_bytes: 1gb
- pipeline.id: elasticsearch_worker
  path.config: "/etc/logstash/elastic_pipeline.d/*.conf"
  queue.type: persisted
  queue.max_bytes: 1gb
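
The pipeline .conf files referenced above aren't shown in this thread, but going by the description (main_worker forwards over UDP, elasticsearch_worker ships to ES), a minimal sketch of what /etc/logstash/elastic_pipeline.d/ could contain might look roughly like this; the port, codec and hosts are placeholders, not values taken from this setup:

input {
  udp {
    port  => 5044     # placeholder port for the pipeline-to-pipeline UDP hop
    codec => json     # assuming events are forwarded as JSON
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]   # placeholder ES endpoint
  }
}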

@original-brownbear What do you mean you don't see the worker's stack? If you search for the string [elasticsearch_worker] or [main_worker] you can see the relevant stacks. Each has a manager thread and one thread.

@randude sorry I misspoke there, what I can't see is blocking/waiting on any Queue in the second thread dump. That's why it's a little confusing to me how PQ is at fault here.
In fact, the Elasticsearch worker seems to be blocked on input I/O:

"[elasticsearch_worker]<udp" #33 daemon prio=5 os_prio=0 tid=0x00007f37e41c2000 nid=0x2a16 runnable [0x00007f38117ed000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
    - locked <0x00000000d861cff0> (a sun.nio.ch.Util$3)
    - locked <0x00000000d861cfd8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x00000000db4c5ad0> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
    at org.jruby.util.io.SelectExecutor$1.run(SelectExecutor.java:466)

As for the pipelines question: you are running a multi-pipeline configuration (multiple pipelines inside one LS/JVM), right? Does PQ still block when running just a single pipeline in that LS/JVM (not only a single PQ one, really just one overall :))?

So you are running a multi-pipeline configuration (multiple pipelines inside one LS/JVM), right? - Yes.
Does PQ still block if running just a single pipeline in that LS/JVM (not only a single PQ one, really just one overall)? - I didn't try that yet.

@randude it would be very helpful if you could try with a single pipeline to narrow this down.
I just experimented a little with multiple pipelines and PQ and it worked fine, so I'm a little stuck here without further information.
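
For that single-pipeline test, a pipelines.yml trimmed down to one entry should be enough - for instance, reusing the elasticsearch_worker settings quoted above:

- pipeline.id: elasticsearch_worker
  path.config: "/etc/logstash/elastic_pipeline.d/*.conf"
  queue.type: persisted
  queue.max_bytes: 1gb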

@original-brownbear I tested the deployment in a different environment and everything was working fine there. However, that environment had a lower log volume. I believe this is somehow related to our environment, which somehow messes up the PQ.
I'll let you know if I find anything. Thanks again for the help thus far.

@original-brownbear see https://github.com/elastic/logstash/issues/8462
I was able to find a workaround for my case, but this seems to be another bug.
