Logstash: [meta] kv and grok filters timeout handling

Created on 18 Jul 2019  路  13Comments  路  Source: elastic/logstash

Description

Both the kv and grok filters share a similar strategy for supporting the timeout option to abort long running regex operations.

The strategy, initially put in place for grok, is to rely on the Java interrupt status flag. The basic idea is to create a watchdog thread which monitor the time spent from the worker thread inside the regex parsing calls (that are ultimately handled by the Java joni regex library). Upon figuring that a worker thread has spent too much time in a regex call, to abort it, it sets that worker thread underlying Java thread interrupt status by using Thread.interrupt(). See kv implementation for example.

Problem

We have realized that logstash was sometimes randomly crashing with InterruptedException and figured these were tied to the grok and kv filters timeout handling and also figured that the timeout was not working at all in other cases.

The problem seems to be related to the direct manipulation of the Java interrupt status flag on the underlying Java thread from a JRuby execution context. JRuby threads + runtime has its own layer to handle the Java interrupt status flag, see the RubyThread implementation. We can see that the RubyThread.pollThreadEvents() method calls RubyThreads.anyInterrupted() which is turn call Thread.interrupted() which reads and resets the interrupt status flag :

The interrupted status of the thread is cleared by this method. In other words, if this method were to be called twice in succession, the second call would return false

The problem is that that the JRuby RubyThread.pollThreadEvents() methods is invoked in many places in the JRuby runtime and it is very hard to predict the behaviour of explicitly setting a Java thread interrupt flag from a JRuby context because once set in Java from JRuby, the JRuby runtime will continue to be invoked up until the final joni Java code is executed.

Also looking at this JRuby issue we can see that figuring the right interruption semantics between Java and Ruby in not trivial and that these implementation details can change. Relying on such implementation details can become future problems if these implementation details change.

Solution

I am not sure. There are possibly two paths;

  • Finding a fix for the current Ruby plugins implementations of grok and kv, maybe look into using the Timeout class?
  • Rewriting both plugins in pure Java using the Java plugins API and make sure that setting the Java thread interrupt flag on a thread is never inter winded with JRuby execution on that same thread?

All 13 comments

/cc @headius @kares do you think this is a correct assessment of a strategy of directly using Java Thread.interrupt() from JRuby?

I see that this strategy was actually suggested back then (1.7.3) by @headius in https://github.com/jruby/jruby/issues/792 because Timeout did not seem to be working correctly with joni - I will have to see what the state of this is.

will need to look at details, but there's a somehow recent interrupt related change, altough pbly not regexp related (certainly smt where logstash was considered). anyways, there's some support in joni itself for regexp timeouts that isn't used much but could be fine tuned if it would help logstash's case ...

@kares thanks - you probably refer to https://github.com/jruby/jruby/pull/5683 ?

While ultimately the end goal is about interrupting a long regex call in joni (and obviously all suggestions are welcome) my question is more about the potential dangers of using Java Thread.interrupt() from JRuby? My assessment is that it is dangerous and can lead to hard to predict behaviours. Feedback about that would be useful in helping to decide if we should spend more energy in trying to make it work correctly or is it probably better to avoid that?

Finding a fix for the current Ruby plugins implementations of grok and kv, maybe look into using the Timeout class?

From what I tested I see no current way of interrupting joni from jruby, be it using interrupt or ruby's Timeout.

FYI I found some history on joni's interruptibility; https://github.com/jruby/joni/pull/6 and digging further I realize that an option was added to control joni's interruptability from JRuby https://github.com/jruby/jruby/blob/493822c134fe74e9630d2dc49325e291008b3101/core/src/main/java/org/jruby/util/cli/Options.java#L147 and it is off by default.

@jsvd running your above interrupt example using

jruby -Xregexp.interruptible=true using_interrupt.rb

actually works - so that solves the mystery of why joni's interruption stopped working at some point.

yep, this makes both java interrupt and ruby's Timeout work again.

pushed kv logstash-plugins/logstash-filter-kv#84 and grok logstash-plugins/logstash-filter-grok#147 modifications

馃憦 Well done, gentlemen. I know a few sites I've visited had kv timeout issues, and this is a tremendous bit of investigation and correction. Bravo!

Also related and somewhat full-circle logstash-plugins/logstash-filter-grok#130 logstash-plugins/logstash-filter-grok#131

@kares thanks - you probably refer to jruby/jruby#5683 ?

right, my shady memory forgot that the work-around was specific to mutex-es. shouldn't matter here.

While ultimately the end goal is about interrupting a long regex call in joni (and obviously all suggestions are welcome) my question is more about the potential dangers of using Java Thread.interrupt() from JRuby? My assessment is that it is dangerous and can lead to hard to predict behaviours. Feedback about that would be useful in helping to decide if we should spend more energy in trying to make it work correctly or is it probably better to avoid that?

believe you are correct esp. as you dissected JRuby's Thread.interrupted() usage.
there's some historical reasons for that, I am not sure JRuby could change that in 9.2 (to keep the native interrupted status) maybe in a new major but take my words lightly.
if logstash needs it we could do some experiments.

that being said the interruptible regexp is probably a better way, the reason why its not on by default is likely just performance (slows down matching - not sure how much pbly worth an isolated performance test).

grok v4.1.0 and kv v4.4.0 have been published both using Timeout.timeout instead direct Java Thread.interrupt() for timeout handling.

For reference, we traced back to v6.7.0 (where JRuby was updated from 9.1.13.0 to 9.2.6.0) when joni interruptability was disabled by default and resulted in both the kv and grok filters timeout options just not working. This is tracked in #10978

I am closing this issue since a fix was pushed.

Note that we discovered a substantial performance regression with the current fix as described in #11302

Was this page helpful?
0 / 5 - 0 ratings