Openj9: Bug in OpenJ9's JITC?

Created on 13 Oct 2017 · 63 Comments · Source: eclipse/openj9

Hi all,

I've been running my stuff on OpenJ9 recently, now that a lot of the earlier issues (like the one blocking IntelliJ) have been fixed. Unfortunately, in one of the apps I wrote using Akka, I'm hitting an issue where a benchmark I'm running freezes and the program stops functioning. The issue is related to OpenJ9: it disappears on Zulu 9, but appears within 2-5 iterations on OpenJ9.

I'm trying to nail down what exactly is going on here, but I don't know how to write a minimized test case that you can reproduce. Is there any profiling or thread information I can extract from OpenJ9 that would help me figure out where the issue is, so I can produce a minimized test case for this bug?

Thanks.

OpenJ9 version string:

openjdk 9-internal
OpenJDK Runtime Environment (build 9-internal+0-adhoc.jenkins.openjdk)
Eclipse OpenJ9 VM (build 2.9, JRE 9 Linux amd64-64 Compressed References 20171013_26 (JIT enabled, AOT enabled)
OpenJ9   - ef663e0
OMR      - f79d16d
OpenJDK  - c37e708 based on )

Update: using -Xint makes the issue disappear, though I'm not sure if this is because the bench is slower and less demanding on OpenJ9, or because of an issue with the JIT compiler. I'll retest this on a slower connection and see whether OpenJ9 still works...

Labels: bug, jit

Most helpful comment

@markehammons Good news, the build is available on AdoptOpenJDK at https://adoptopenjdk.net/nightly.html?variant=openjdk9-openj9
The build is dated 16 November 2017 and the filename is OpenJDK9-OPENJ9_x64_Linux_20171611.tar.gz. I've downloaded it and confirmed that it doesn't hang with the test case you provided.
Many thanks for helping us fix this problem.

All 63 comments

The easiest way I can think of is to send kill -3 to the process periodically; that will cause the JVM to dump javacore files, which are text files that contain Java backtraces for every application thread, the state of locks, etc. You can usually determine whether forward progress is being made or whether one or more threads are stuck waiting.
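If it's awkward to send kill -3 from a second terminal, OpenJ9 can also trigger the same javacore files from inside the process. Here is a minimal sketch, assuming the com.ibm.jvm.Dump API that ships with OpenJ9/IBM JDKs (the class name, the 30-second interval and the placeholder sleep at the end are illustrative, not part of your benchmark):

// Triggers a javacore every 30 seconds on a daemon thread while the workload runs.
public class PeriodicJavacore {
    public static void main(String[] args) throws Exception {
        Thread trigger = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(30_000);          // wait 30 seconds between dumps
                    com.ibm.jvm.Dump.JavaDump();   // writes a javacore*.txt for this process
                }
            } catch (InterruptedException ignored) {
                // stop triggering dumps when interrupted
            }
        }, "javacore-trigger");
        trigger.setDaemon(true);
        trigger.start();

        // ... start the benchmark / workload under investigation here ...
        Thread.sleep(120_000);                     // placeholder for the real run
    }
}

Comparing two or three javacores taken a minute or so apart usually makes it obvious whether the application threads are moving or parked on the same lock.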

I'm back home with my significantly slower internet and I've confirmed this seems to be a JIT bug. Disabling jit with -Xnojit fixes the problem, and the program fails after a few warmup iterations when I have the default options enabled.

I've attached the jit log from a run that exhibited the hang, as well as two core dumps (one possibly before the problem manifested, one definitely after)

bench.zip

I hope this is helpful, because I'm not sure I can make a minimal test case for you. Unfortunately, this code accesses parts of my job's infrastructure, so I can't share it with you as-is.

When it hangs, is it with high cpu or idle?

I used your core analyzer to look at the coredumps I gave you, and it looks like a jit thread is hanging or something. It's the biggest consumer of cpu time of any thread, and its cpu time grows between the two dumps.

@JamesKingdon It's idle/near idle. Still some slight cpu usage in the 1-2% range.

That (jit thread consuming cpu) is not necessarily diagnostic; the jit may just be working through the backlog of methods to be compiled. The verbose log has the same number of compilation starts as it does compilation ends, so it's not stuck in a compilation as far as I can see.

Ok, low cpu would also suggest we're stuck on a lock somewhere.

Hmm, lock dump doesn't immediately look suspicious. Which thread would you expect to be working in this app?

@JamesKingdon The thread titled mySystem-irods-dispatcher takes nearly 70% of the cpu time when -Xnojit is passed to OpenJ9.

The eu.bioemergences.utils.irods.IRODSServiceBench,listcontents thread and the main thread are the next two biggest, and all others take minimal time.

Hmm, that thread is sat at

3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at jdk/internal/misc/Unsafe.park(Native Method)
4XESTACKTRACE                at java/util/concurrent/locks/LockSupport.parkNanos(LockSupport.java:234)
4XESTACKTRACE                at java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2104)
4XESTACKTRACE                at java/util/concurrent/LinkedBlockingQueue.poll(LinkedBlockingQueue.java:460)
4XESTACKTRACE                at java/util/concurrent/ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1091)
4XESTACKTRACE                at java/util/concurrent/ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152(Compiled Code))
4XESTACKTRACE                at java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)

Edit: wild cards weren't showing up

Would it be possible to run a test with the following?
-Xjit:exclude={java/util/concurrent/*},dontInline={java/util/concurrent/*}

I get the same results. Here's the core dump from that run:
javacore.20171013.204324.7707.0001.txt

Ah, that was my fault: in my first attempt at posting the exclude I forgot to quote the text, and the formatting hid the wildcards. You need a '*' at the end of each method filter for it to match anything.

@JamesKingdon That didn't fix it either, but switching to -Xjit:exclude={sun/misc/*}... did. I tried -Xjit:exclude={sun/misc/Unsafe},dontInline={sun/misc/Unsafe} to see if that narrowing still fixed things but it didn't.

Did you wild card the end of {sun/misc/Unsafe*} ?

@JamesKingdon no I hadn't. -Xjit:exclude={sun/misc/Unsafe*},dontInline={sun/misc/Unsafe*} fixes the issue.

OK, the only jit compiled method from sun/misc in your original verbose log is
sun/misc/Unsafe.getObjectVolatile, so if excluding sun/misc/* works reliably, we might assume that excluding and not inlining getObjectVolatile should also work:
(this will probably need quotes depending on how you launch the JVM)
-Xjit:exclude={sun/misc/Unsafe.getObjectVolatile(*},dontInline={sun/misc/Unsafe.getObjectVolatile(*}
If the program works with that, the next step is to get a compilation log. We would need to remove the exclude/dontInline options and replace with:
-Xjit:{sun/misc/Unsafe.getObjectVolatile(*}(tracefull,log=getObjectVolatile.log)
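For example, the exclude run could be launched like this from bash, wrapping the whole -Xjit option in single quotes so the shell doesn't mangle the braces and wildcards (the classpath and main class below are placeholders for however you normally start the bench; only the -Xjit option is the real one from above):

java -Xjit:'exclude={sun/misc/Unsafe.getObjectVolatile(*},dontInline={sun/misc/Unsafe.getObjectVolatile(*}' \
     -cp bench.jar MyBenchMain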

It appears it wasn't getObjectVolatile. Excluding it didn't fix the problem; however, the jit log of the new run that failed showed another sun.misc.Unsafe method: compareAndSwapObject. Excluding that one fixed execution. I'm creating the compilation log now.

Interesting. I hate bugs that move...

This is all I can get from running both:
-Xjit:{sun/misc/Unsafe.compareAndSwapObject(*}(tracefull,log=getObjectVolatile.log)
and
-Xjit:exclude={sun/misc/Unsafe.getObjectVolatile*},dontInline={sun/misc/Unsafe.getObjectVolatile*},{sun/misc/Unsafe.compareAndSwapObject(*}(tracefull,log=compareAndSwapObject.log)

compareAndSwapObject.log

As you can see, nothing particularly useful there. I'm not sure why it's not compiling, but excluding that method stops the hang from happening...

here's the jit log where I found compareAndSwapObject being jitted:
unsafe.20171013.211151.13867.log

@0dvictor you might be able to help since this is one of the primitives we probably do something special for in the code generator when we do encounter it - maybe you can provide some guidance?

The trace log is not complete. It seems that multiple log files were generated; do you mind double-checking and attaching all of them? These log files may have names like compareAndSwapObject.log.1, compareAndSwapObject.log.2, compareAndSwapObject.log.3, compareAndSwapObject.log.4, compareAndSwapObject.log.5, compareAndSwapObject.log.6.

They do not exist. I've double- and triple-checked, and there are no such logs.

As I said, the jit log for compareAndSwapObject is not particularly helpful.

Might be worth combining a run which logs anything from sun/misc/Unsafe with a verbose log so we can confirm what did get compiled during the run:
-Xjit:{sun/misc/Unsafe.*}(tracefull,log=Unsafe.log),verbose={compile*},vlog=jit.log
This might get large, but at least in the one verbose log I have there was only one method compiled from Unsafe, so hopefully it will be ok.

Here's that:
Unsafe.20171013.215123.17207.log

You'll notice compareAndSwapObject isn't even in the log...

True, but getObjectVolatile was. Was there a compilation log (tracefile) to go with the verbose log? I see you used the same name for both; I'm not sure how that would work out...

Another two configs that may be worth trying:

  1. Export environment variable TR_Options=disableUnsafe
  2. Export environment variable TR_DisableCASInlining=1

These two trials may help to isolate the issue.
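For example, from a bash shell you could set the variable just for one run (the launch command itself is a placeholder for however you normally start the bench; only the variable names and values matter):

TR_Options=disableUnsafe java -cp bench.jar MyBenchMain
TR_DisableCASInlining=1 java -cp bench.jar MyBenchMain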

@JamesKingdon only one of the runs produced a tracefile:
bench_1.zip

I'll try those @0dvictor. I assume I should do those one at a time?

Yes, they should be two separate tests.

DisableUnsafe made the bench run perfectly fine.

DisableCASInlining caused it to hard crash instead of hanging

DisableCASInlining.zip
DisableUnsafe.zip

I included the coredump with the inlining zip

Thank you. The result indicates the defect might be related to runtime/tr.source/trj9/optimizer/InlinerTempForJ9.cpp.
Unfortunately I cannot tell anything from these two logs. Instead of trying to figure out which function to log, it may be simpler to log every compiled method. To achieve this, one can use -Xjit:tracefull,log=Unsafe.log,verbose={compile*},vlog=jit.log; and please unset the two environment variables when logging.

Depending on how many methods, logging everything could generate quite a lot of data.

@0dvictor when I do -Xjit:exclude={sun/misc/Unsafe.compareAndSwapObject(*},dontInline={sun/misc/Unsafe.compareAndSwapObject(*} the hanging problem disappears entirely.

I think something is happening during the jit compilation of that method. Even though it rarely shows up in the jit logs, excluding it from jit makes the problem go away.

I'm compressing the logs you asked for right now.

Edit: here are the logs as a tar.xz.gz (use gunzip, followed by tar xvf, to unpack):
FullTrace.tar.xz.gz

It's about 400+ MB unpacked.

A quick note: -Xjit:dontInline={sun/misc/Unsafe.compareAndSwapObject(*} alone fixes the problem as well.

When running -Xjit:dontInline={sun/misc/Unsafe.compareAndSwapObject*},{sun/misc/Unsafe.*}(tracefull,log=compareAndSwapObject.trace.log),verbose={compile*},vlog=compareAndSwapObject.log, these are the files produced. You'll notice that, unlike in many other jit logs, compareAndSwapObject is actually compiled by the JITC. Even though it's compiled, no trace is produced for some reason...

CompareAndSwapObject.zip

Edit: I've done a number of runs of this now. compareAndSwapObject never generates trace data, but it is always compiled by the JITC for my program under these settings. Considering how rarely it showed up before in my jit logs, I am almost certain that inlining it was hanging the JITC and halting execution of my program (which is why I had abnormal cpu time devoted to a jit thread in my coredumps).

Thanks Mark - did the hang happen in that last run that you provided logs for?

@JamesKingdon not at all. The bench ran perfectly.

Also, I managed to finally get a tracefile for compareAndSwapObject.
CompareAndSwapObjectWTrace.zip

I'm going to rerun the bench a bunch of times with -Xjit:{sun/misc/Unsafe.compareAndSwapObject(*}(tracefull,log=compareAndSwapObjectInlined.trace.log),verbose={compile*},vlog=compareAndSwapObjectInlined.log and see if I can't manage to get a trace for compareAndSwapObject with inlining enabled. For some reason OpenJ9 doesn't seem to want to produce traces for it, considering only 1 run in 10 actually produced any trace despite all runs compiling it.

Yes, I'm not sure what's going on with the trace. The CompareAndSwapObject.zip files you sent in showed several methods from sun/misc/Unsafe getting compiled that all should have matched the trace filter, and yet no output was present in the trace file. I'm guessing it's some part of the way we handle Unsafe that's getting in the way, but I'll defer to Victor on that.
I'm about to head out for the weekend - thanks for all the data!

Something very weird: I don't see any method involving compareAndSwapObject in the full trace log.

Another environment variable might be worth trying: TR_Options=disableUnsafeFastPath
The reason compareAndSwapObject was not shown in the log might be that it was transformed by UnsafeFastPath.

OK, I'll give that a shot. In the meantime, though, I managed to get a trace of compareAndSwapObject with -Xjit:{sun/misc/Unsafe.compareAndSwapObject(*}(tracefull,log=compareAndSwapObjectInlined.trace.log),verbose={compile*},vlog=compareAndSwapObjectInlined.log
here you go:
CompareAndSwapInline.zip

Unfortunately, TR_Options=disableUnsafeFastPath doesn't make compareAndSwapObject have a trace. Here's one of the jit logs + trace files...

DisableUnsafeFastPath.zip

Hey guys, good news! It turns out this flaw is not in OpenJ9 specifically; it's actually present in OpenJDK 9 itself! I think OpenJ9 is just more likely to trigger it than HotSpot JVMs at the moment, but I managed to trigger the error running the benchmark on Zulu 9. So I guess I need to report this to the OpenJDK team!

Ah, that's good to know - thanks. It raises really interesting questions about why the -Xjit options we tried change the behaviour. The most likely explanation is that it's a narrow-window race condition and the options change the timing.
If you open an issue with OpenJDK please can you drop a link here? I'd like to follow the progress. I started working on some low level tests to try and reproduce the problem, but haven't got very far as yet. (Chances are I was looking in the wrong place, given your latest news!)

Hmm, actually there was a problem with my benchmark code that was triggering an issue that I mistook as the same. Let me retest with the fixed code and make sure I'm still getting the same problems.

@markehammons and @JamesKingdon just wanted to say thanks for helping to continue investigation on this issue. I have certainly seen cases where different VMs/JITs are more or less prone to deadlocks etc due to subtle timing differences. There could still be an OpenJ9 issue here so I look forward to seeing the next round of test results to see if they continue to point to a problem elsewhere or not...

So, to be clear: the issue in my benchmark code was in a new bench I added to test a service I'm developing; the bench OpenJ9 was failing on is untouched. Since I fixed the new bench I have not seen that thread hang appear again. The new bench had a thread hang because I threw an exception that was not caught, and the benchmark code did not shut down the threading system on close like it should have. Having fixed that issue, OpenJDK 9 works as expected, while OpenJ9 still hangs on the first bench. So unfortunately, the issue is in fact related to OpenJ9. Sorry to get your hopes up, guys :(.

No worries, I'll keep looking :)

Tell me if you need more info. I can even see if there's a way to rig up a dummy version of the innards of this system, so I can provide you an open test case to poke at and use yourselves...

@markehammons thanks for the offer and we may well take you up on that. I think @JamesKingdon was thinking of just writing some small sample code and doing an inspection of the generated code for anything obvious and then trying some kind of small microbenchmark/test to see if we could provoke the problem. If that doesn't work then we may need your help to try and dummy something up that can be used to help debug the problem. Thanks again for bringing this to our attention and for helping with the problem determination.

@andrewcraik here's a reduced version of the bench that still triggers the bug, but doesn't rely on our servers (except for the build, but that's fine): https://github.com/markehammons/OpenJ9Issue

You'll need sbt to build and run it, which you can get here

Got it, investigating. Many thanks.

This is the expected failure behaviour?

[info] # Run progress: 0.00% complete, ETA 00:37:00
[info] # Fork: 1 of 3
[info] # Warmup Iteration 1:
[info] java.util.concurrent.TimeoutException: Futures timed out after [20000 milliseconds]

Yes, the system will time out if it doesn't receive a response within 20 seconds. The hang is that no response is received under OpenJ9 after a varying number of runs, while OpenJDK shows no such issue.

Likewise, if you disable inlining of compareAndSwapObject the issue should disappear.

edit: hmm, there is a new issue with this bench. -Xnojit makes it run properly, but -Xjit:dontInline on sun.misc.Unsafe* doesn't solve the issue anymore. Maybe it's revealing other bugs? This puts a lot more stress on the JVM than the old bench, since I use dummy methods that take around 0.016 ms/op instead of 30 ms/op.

In any case, openjdk9 runs the bench without issue.

Great, this should be a big help.

Likewise, if you disable inlining of compareAndSwapObject the issue should disappear.
edit: hmm, there is a new issue with this bench. -Xnojit makes it run properly, but -Xjit:dontInline on sun.misc.Unsafe* doesn't solve the issue anymore. Maybe it's revealing other bugs?

I confirmed that disabling inlining of compareAndSwapObject makes the test pass for me, as does disabling inlining on the Unsafe class, which must include the compareAndSwapObject method. It might be a test environment difference, but the syntax of the filters is finicky, so possibly a typo crept into your run? -Xjit:dontInline={sun/misc/Unsafe.*}
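For reference, the class part of these filters uses the JVM's internal slash-separated class names, so for example:

-Xjit:dontInline={sun.misc.Unsafe.*}    (dots in the package name - matches nothing)
-Xjit:dontInline={sun/misc/Unsafe.*}    (slashes between package components - matches the Unsafe methods)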
Victor just dropped by and gave me an idea to test...

You're right, I forgot to use / for the package names instead of . 👎

The Akka framework uses Unsafe extensively, making this an excellent stress test for those parts of Unsafe that are used for implementing synchronization primitives. Unfortunately it means we have a lot of code to look at to figure out what's going on. I'm trying to narrow it down by looking for methods that were jit compiled with compareAndSwapObject inlined during the hanging case, but my first choice didn't turn up anything interesting. Moving on to the next one...
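For anyone following along, this is roughly the shape of code involved: a minimal sketch of a compare-and-swap retry loop built directly on sun.misc.Unsafe.compareAndSwapObject (it is not taken from Akka's sources; the class and field names are made up for illustration). If the inlined CAS misbehaves, a loop like this can leave callers spinning or blocked even though the surrounding code is correct:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Minimal Treiber-style stack whose head is only ever updated through
// Unsafe.compareAndSwapObject, the intrinsic under suspicion here.
public class CasSketch {
    private static final Unsafe UNSAFE;
    private static final long HEAD_OFFSET;

    static final class Node {
        final Object value;
        final Node next;
        Node(Object value, Node next) { this.value = value; this.next = next; }
    }

    private volatile Node head;   // top of the stack

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            HEAD_OFFSET = UNSAFE.objectFieldOffset(CasSketch.class.getDeclaredField("head"));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void push(Object value) {
        Node oldHead, newHead;
        do {
            oldHead = head;                        // volatile read of the current top
            newHead = new Node(value, oldHead);
            // Retry until the swap from oldHead to newHead succeeds atomically.
        } while (!UNSAFE.compareAndSwapObject(this, HEAD_OFFSET, oldHead, newHead));
    }

    public static void main(String[] args) {
        CasSketch stack = new CasSketch();
        stack.push("hello");
        System.out.println(stack.head.value);      // prints "hello"
    }
}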

OK, there have been a few more twists and turns along the path than I was expecting, but I think we're getting closer to the answer now. I have a working prototype fix and I just need to flesh out the details to make it real.
The problem is around the fact that in Java9 the sun.misc.Unsafe implementation was moved to the jdk.internal.misc package, with a set of wrappers remaining in sun.misc.Unsafe for compatibility. The JIT recognises most of the methods in Unsafe so that it can generate fast inline equivalents instead of making expensive JNI calls. For Java9 it has been updated to recognise the Unsafe methods in both the old and new packages. For most of the methods this is working as expected, but in a small number of cases there is some additional complexity that got overlooked. The issue is that it was sometimes assumed that the Unsafe method was a JNI call, which is no longer true for the wrapper methods that remain in sun.misc.Unsafe. This resulted in both a performance issue (the Unsafe call didn't get properly optimised) and the functional problem that causes the test to hang.
So far it looks like the problem only occurs with the compareAndSwap calls, but I'll be doing some additional testing to verify that.
The fix is to work through the various places that deal with the Unsafe methods and add a test to differentiate between the implementation classes (the JNI methods) and the wrappers. We'd like the same code to work both for Java8 classes (impl in sun.misc) and Java9 (impl in jdk.internal.misc) which makes it a little more fiddly than it would otherwise be.
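As a side note, the wrapper arrangement is easy to check with a few lines of reflection. This is just a sketch; the expected results noted in the comments follow from the explanation above rather than from any particular build:

import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Checks whether sun.misc.Unsafe.compareAndSwapObject is still a native (JNI)
// method or has become a plain Java wrapper, as described above.
public class UnsafeWrapperCheck {
    public static void main(String[] args) throws Exception {
        Method cas = Class.forName("sun.misc.Unsafe").getMethod(
                "compareAndSwapObject",
                Object.class, long.class, Object.class, Object.class);
        // Expected: true on a Java 8 JDK (JNI native), false on Java 9, where the
        // sun.misc method forwards to the jdk.internal.misc implementation.
        System.out.println("native? " + Modifier.isNative(cas.getModifiers()));
    }
}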

Thanks for your hard work @JamesKingdon! I look forward to switching back to OpenJ9 as the JVM for my development system! 👍

Hi Mark, just to update the status - I'm waiting on the two PRs to work their way through the build/test hierarchy and finally appear on adoptopenjdk.net. Once there is a build available I'll download, test it and update this issue again.

@r30shah this may be of interest to you.

Looks like AdoptOpenJDK has been having infrastructure problems recently - no new builds since November 6th. I've seen some discussions on the Adopt Slack channels, so hopefully things will get unblocked fairly soon.

@markehammons Good news, the build is available on AdoptOpenJDK at https://adoptopenjdk.net/nightly.html?variant=openjdk9-openj9
The build is dated 16 November 2017 and the filename is OpenJDK9-OPENJ9_x64_Linux_20171611.tar.gz. I've downloaded it and confirmed that it doesn't hang with the test case you provided.
Many thanks for helping us fix this problem.

I've also confirmed that it no longer hangs in the full version of the bench. Thanks for the hard work @JamesKingdon 👍
