Jruby: Thread kill does not release lock

Created on 21 Sep 2020  路  6Comments  路  Source: jruby/jruby

Hit a deadlock where jstack shows a number of threads waiting on a lock but no thread is holding the lock. We were able to reproduce it with the code snippet below. This snippet works (runs indefinitely) on jruby-9.1.17.0 and MRI 2.5.8, but deadlocks on jruby-9.2.9.0 and jruby-9.2.13.0

The suspicion is that the t.kill is killing the thread without releasing the mutex, possible related to this change https://github.com/jruby/jruby/commit/cc6bf49f8075e8de3acb0031753340bac994f179#diff-7768018ea9b120637b1da8461cb5df2b

def test_it
  mutex = Mutex.new
  sleep_div = 500
  starts = 0
  ends = 0
  2000.times do |i|
    ts = 3.times.map do
      Thread.new do
        sleep(rand / sleep_div)
        mutex.synchronize{ starts += 1; sleep(rand / sleep_div); ends += 1 }
      end
    end
    3.times do
      sleep(rand / sleep_div)
      t = ts.sample
      t.kill
    end
    if i % 10 == 0
      mutex.synchronize do
        puts "iter #{i}, starts: #{starts}, ends: #{ends}"
      end
    end
  end
end

if __FILE__ == $0
  test_it
end

Most helpful comment

@stalbot Given your excellent reproduction script, I found the issue in about five minutes! Thank you! I will push a PR shortly.

All 6 comments

We only use thread kill in a few places so right now it looks like this isn't a blocker for us, we're going to try to refactor it out.

Commit cc6bf49 was also implicated by me in #6326 but the reporter was unable to provide a reproduction. Maybe now we can fix it!

I got lucky finding this, as we had just one instance of an actually exercised Thread#kill in our production code, right around a point where we were hitting a "re-locking by same thread" exception in a traceback.

Interestingly, cc6bf49f8075e8de3acb0031753340bac994f179 was my guess about a culprit but I could also repro this on 9.2.8.0 but not 9.2.7.0. So that suggests it might be f1c7836096c81896a825fdaef283a69f74fb0ef6 instead? Pretty sure I double-checked 9.2.8.0 really did have the issue.

@stalbot Given your excellent reproduction script, I found the issue in about five minutes! Thank you! I will push a PR shortly.

@stalbot PR is in, but clearly we should add some kind of test for this.

The issue came up with a thread that had just acquired a lock was interrupted for a thread event like kill or raise. It would see that event in the finally logic for RubyThread.executeTask and in both cases raise an exception, skipping the lock-adding code in RubyThread.lockInterruptibly that comes after executeTask.

The issue was introduced in bea9ad42f62e24a3963e135f2138c75a716f7790 (released in 9.2.8.0) by adding the executeTask wrapper around the lock acquisition, but failing to keep the lock add right next to it.

I have adapted your reproduction as a regression spec in #6407.

Was this page helpful?
0 / 5 - 0 ratings