openjdk version "14.0.1" 2020-04-14
OpenJDK Runtime Environment AdoptOpenJDK (build 14.0.1+7)
Eclipse OpenJ9 VM AdoptOpenJDK (build openj9-0.20.0, JRE 14 Windows 10 amd64-64-Bit Compressed References 20200422_50 (JIT enabled, AOT enabled)
OpenJ9 - 05fa2d361
OMR - d4365f371
JCL - 5757187cae based on jdk-14.0.1+7)
Occasionally a JUnit test that lets a stubbed method throw an exception from a thread obtained from a thread pool and then unpacks the Future result shows a NullPointerException happening in Throwable.getStackTrace:
Caused by: java.lang.NullPointerException: null
at java.base/java.lang.Throwable.getStackTrace(Throwable.java:217)
at org.mockito.internal.exceptions.stacktrace.ConditionalStackTraceFilter.filter(ConditionalStackTraceFilter.java:22)
at org.mockito.internal.stubbing.answers.ThrowsException.answer(ThrowsException.java:50)
at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:39)
This must be caused by java.lang.J9VMInternals#getStackTrace returning null, which should never happen AFAICT.
See https://1drv.ms/u/s!AqgVRjTS12sYjArXYp0N-8KgW-58?e=5cbfdr for several dumps created during a single test run.
Produced using -Xdump:system+java:events=throw+systhrow,filter=java/lang/NullPointerException#java/lang/Throwable.getStackTrace,request=exclusive+prepwalk+preempt
Per Jason Feng's request i ran the test again with an additional -Xjit:exclude={java/lang/Throwable.getStackTrace*} argument, butI could still reproduce the error.
Dumps for this run can be found here: https://1drv.ms/u/s!AqgVRjTS12sYjAvFfmA7amtyseM1?e=eUAct7
I managed to reproduce the problem with a small sample app:
j9-npe.zip
I can run the test from IntelliJ or from Gradle on the command line, and in both cases it will often trigger the NPE to be thrown. Hope this helps in reproducing the issue on your side as well.
Also tried to run the sample app with -Xjit:exclude={java/lang/Throwable.get*StackTrace*}, but am still able to reproduce the error (both from IntelliJ and using Gradle).
Beware that using gradlew test won't actually run the test again if it succeeded and nothing changed, which might cause you to believe that the error can't be reproduced. You can use gradlew --info clean test to recompile first and see more output.
Followed @jkuipers instruction gradlew --info clean test, I am able to reproduce the NPE reported:
https://github.com/eclipse/openj9/blob/1c2b61fa5ddb7945963919a93e8cc517455c2ba9/jcl/src/java.base/share/classes/java/lang/Throwable.java#L219
in which it calls
https://github.com/eclipse/openj9/blob/1c2b61fa5ddb7945963919a93e8cc517455c2ba9/jcl/src/java.base/share/classes/java/lang/Throwable.java#L285-L296
Putting printfs within native https://github.com/eclipse/openj9/blob/1c2b61fa5ddb7945963919a93e8cc517455c2ba9/runtime/jcl/common/jclexception.c#L311
which doesn't return null when the NPE was thrown.
Tried a few JIT options like -Xjit:exclude={java/lang/Throwable.*} without luck, -Xint seems pass the test though.
@andrewcraik Request JIT help investigation.
Slack discussion is https://app.slack.com/client/T8312L63V/C8312LCV9/thread/C8312LCV9-1590178252.161200
@rpshukla could you have a look?
I was able to reproduce a failure with the following options: -Xshareclasses:none -Xjit:tracefull,log=log,limit={java/util/ImmutableCollections\$SetN\$SetNIterator.next()Ljava/lang/Object;}.
I'm looking into what goes wrong when that method is compiled.
Any update @rpshukla ?
A few updates:
lastoptindex=-1 so the problem doesn't seem to be caused by an optimization.-Xint.I am still not sure why the compilation of that iterator method seems to be causing a NullPointerException in an interpreted method (Throwable.getStackTrace). It is possible that the iterator is being called while the array is being cloned and something goes wrong there. However, I'm not sure why the NPE would appear in getStackTrace and not clone.
It seems that the native method getStackTrace may be returning null.
I modified Throwable.getStackTrace as follows so that the result of getInternalStackTrace is stored to a temporary variable.
public StackTraceElement[] getStackTrace() {
StackTraceElement[] tmp1 = getInternalStackTrace();
Object tmp2 = tmp1.clone();
StackTraceElement[] tmp3 = (StackTraceElement[])tmp1;
return tmp3;
//return (StackTraceElement[])getInternalStackTrace().clone();
}
Using DDR, the top of the stack looks like this when the NPE is thrown. Note that t1 is null:
<a1e100> *** BEGIN STACK WALK, flags = 00400001 walkThread = 0x0000000000A1E100 ***
<a1e100> ITERATE_O_SLOTS
<a1e100> RECORD_BYTECODE_PC_OFFSET
<a1e100> Initial values: walkSP = 0x0000000000AFC5A8, PC = 0x0000000000000001, literals = 0x0000000000000000, A0 = 0x0000000000AFC5C0, j2iFrame = 0x0000000000000000, ELS = 0x00007FB320207BA0, decomp = 0x0000000000000000
<a1e100> Generic special frame: bp = 0x0000000000AFC5C0, sp = 0x0000000000AFC5A8, pc = 0x0000000000000001, cp = 0x0000000000000000, arg0EA = 0x0000000000AFC5C0, flags = 0x0000000000000000
<a1e100> Bytecode frame: bp = 0x0000000000AFC5D8, sp = 0x0000000000AFC5C8, pc = 0x00007FB33C11D302, cp = 0x0000000000070780, arg0EA = 0x0000000000AFC5F8, flags = 0x0000000000000000
<a1e100> Method: java/lang/Throwable.getStackTrace()[Ljava/lang/StackTraceElement; !j9method 0x0000000000070F80
<a1e100> Bytecode index = 6
<a1e100> Using local mapper
<a1e100> Locals starting at 0x0000000000AFC5F8 for 0x0000000000000004 slots
<a1e100> I-Slot: a0[0x0000000000AFC5F8] = 0x00000000FFCE3B08
<a1e100> O-Slot: t1[0x0000000000AFC5F0] = 0x0000000000000000
<a1e100> I-Slot: t2[0x0000000000AFC5E8] = 0x00000000009D7E28
<a1e100> I-Slot: t3[0x0000000000AFC5E0] = 0x0000000000000001
@JasonFengJ9 it looks like this might not be a jit issue. Although I'm not sure why it doesn't appear with -Xint.
It seems that the native method
getStackTracemay be returningnull.
The native method should only be able to return NULL when there is a pending exception (OOM or ClassNotFoundError) which should be thrown on return from the native.
Has the native been compiled for directToJNI?
Has the native been compiled for directToJNI?
I limited the compile to only java/util/ImmutableCollections$SetN$SetNIterator.next()Ljava/lang/Object; and looking at the verbose log this was the only method compiled.
As per experiment I did at https://github.com/eclipse/openj9/issues/9663#issuecomment-633586616, the instrumental log within native https://github.com/eclipse/openj9/blob/1c2b61fa5ddb7945963919a93e8cc517455c2ba9/runtime/jcl/common/jclexception.c#L311
shows that the native doesn't return null when the NPE was thrown.
@rpshukla did you see something different?
As per experiment I did at #9663 (comment), the instrumental log within native
shows that the native doesn't return
nullwhen the NPE was thrown.@rpshukla did you see something different?
I used printf to print the return value of the native getStackTrace and I also saw that it doesn't return null. So that native method itself doesn't seem to be causing the problem.
I added some print statements in getInternalStackTrace like so:
@@ -285,6 +290,7 @@ private static int countDuplicates(StackTraceElement[] currentStack, StackTraceE
StackTraceElement[] getInternalStackTrace() {
if (disableWritableStackTrace) {
+ System.out.println("getInternalStackTrace: return ZeroStrackTraceElementArray " + Arrays.toString(ZeroStackTraceElementArray));
return ZeroStackTraceElementArray;
}
@@ -292,6 +298,7 @@ StackTraceElement[] getInternalStackTrace() {
stackTrace = J9VMInternals.getStackTrace(this, true);
}
+ System.out.println("getInternalStackTrace: return " + Arrays.toString(stackTrace));
return stackTrace;
}
When I ran the test I saw some lines in the output like this:
getInternalStackTrace: return null
So it looks like J9VMInternals.getStackTrace is returning null even though when we add printfs in the C code we do not see the C function returning null.
Edit: I also noticed that after I added the calls to System.out.println in getInternalStackTrace, the NPE seems to be thrown in every run (it was only thrown about 1 in 20 times previously).
There's something very strange here as reading the code, I don't see any path through that returns null without having a pending exception.
Can you try disabling the directToJNI path for this native? Just trying to simplify the flow on return.
There's something very strange here as reading the code, I don't see any path through that returns null without having a pending exception.
Can you try disabling the directToJNI path for this native? Just trying to simplify the flow on return.
I ran the test with -Xjit:disableDirectToJNI and still saw the failure.
@gacholio any thoughts on what could be going wrong here? I don't see a path through the native that returns NULL unless there's a pending exception. Yet somehow the stackTrace is null
I see a lot of conflicting information here. Does the issue occur with -Xint or not? If the native was instrumented to prove it didn't return NULL, then then the issue must be in the caller.
One of the comments indicates that compiling only the native thunk results in the error. Is that true?
One (extremely) outside possibility is that the local ref creation failed and returned NULL - we're not checking that in the native.
There are also drop to frame checks throughout internalFindKnownClass, which may not be rigorously checked in the callers - any debug happening here?
Does the issue occur with -Xint or not? If the native was instrumented to prove it didn't return NULL, then then the issue must be in the caller.
The NPE is not seen with -Xint.
One (extremely) outside possibility is that the local ref creation failed and returned NULL - we're not checking that in the native.
Put following log
printf("Java_java_lang_J9VMInternals_getStackTrace 2 returns result (%p) \n", result);
Assert_JCL_notNull(result);
just before return result; https://github.com/eclipse/openj9/blob/5bdc9a55e1e96e1668ab9f8f65e3aeb2ecb7141a/runtime/jcl/common/java_lang_J9VMInternals.cpp#L34-L46
The console output are:
Java_java_lang_J9VMInternals_getStackTrace 2 returns result (0xec35ea0)
Java_java_lang_J9VMInternals_getStackTrace
Gradle Test Executor 11 finished executing tests.
> Task :test FAILED
J9NpeTest > allExceptionsPropagated() FAILED
java.lang.NullPointerException
at java.base/java.lang.Throwable.getStackTrace(Throwable.java:217)
at org.mockito.internal.exceptions.stacktrace.ConditionalStackTraceFilter.filter(ConditionalStackTraceFilter.java:22)
at org.mockito.internal.stubbing.answers.ThrowsException.answer(ThrowsException.java:50)
at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:39)
at org.mockito.internal.handler.MockHandlerImpl.handle(MockHandlerImpl.java:96)
at org.mockito.internal.handler.NullResultGuardian.handle(NullResultGuardian.java:29)
at org.mockito.internal.handler.InvocationNotifierHandler.handle(InvocationNotifierHandler.java:33)
at org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:61)
at org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:49)
at org.mockito.internal.creation.bytebuddy.MockMethodInterceptor$DispatcherDefaultingToRealMethod.interceptSuperCallable(MockMethodInterceptor.java:108)
at rs.kuipe.j9npe.Worker$MockitoMock$1272740891.doSomeWork(Unknown Source)
at rs.kuipe.j9npe.AsyncClient.doSomeWorkAsync(AsyncClient.java:20)
at rs.kuipe.j9npe.AsyncClient$$FastClassBySpringCGLIB$$aafb8426.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:771)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
at org.springframework.aop.interceptor.AsyncExecutionInterceptor.lambda$invoke$0(AsyncExecutionInterceptor.java:115)
at org.springframework.aop.interceptor.AsyncExecutionInterceptor$$Lambda$588/0000000000000000.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:851)
2 returns result (0xeb89bc0)
Java_java_lang_J9VMInternals_getStackTrace 2 returns result (0xeb58b00)
Java_java_lang_J9VMInternals_getStackTrace 2 returns result (0xec20c00)
One (extremely) outside possibility is that the local ref creation failed and returned NULL - we're not checking that in the native.
local ref creation can only return null if passed null. Otherwise, we'd fatalExit before returning null
It seems clear the native is not returning the NULL, so I'm not sure what more there is to look at here.
I was able to reproduce a failure with the following options:
-Xshareclasses:none -Xjit:tracefull,log=log,limit={java/util/ImmutableCollections\$SetN\$SetNIterator.next()Ljava/lang/Object;}.I'm looking into what goes wrong when that method is compiled.
@rpshukla can you share the full commandline used to reproduce this? and the generated trace log? Along with -Xjit:verbose output? Something very strange is happening here so it would be good to get consistent data for everyone to look at the same thing
Here are some trace logs and verbose logs. The script I used to run the test is run.sh. output-0 is the std out from a run. There is additional output in that file from the logging that I mentioned in a previous comment
Next step is probably to start limitting the opts and/or disabling things like inline allocation, recompilation, etc to simplify the output code (which is already pretty simple). Does it reproduce with count=0 and a fixed opt level?
It can be reproduced with count=0. The logs above are from runs with -Xjit:lastoptindex=-1 (i.e. it fails even when no opts are run).
How about the opposite question - if you exclude that seemingly-unrelated method from compilation, does the problem go away?
How about the opposite question - if you exclude that seemingly-unrelated method from compilation, does the problem go away?
The failure still occurs when that method is excluded with exclude={java/util/ImmutableCollections\$SetN\$SetNIterator.next()Ljava/lang/Object;}
Update: I am able to reproduce the failure with -Xint, but only after I added the following logging:
diff --git a/jcl/src/java.base/share/classes/java/lang/Throwable.java b/jcl/src/java.base/share/classes/java/lang/Throwable.java
index 64069ff84..6b0727a0a 100644
--- a/jcl/src/java.base/share/classes/java/lang/Throwable.java
+++ b/jcl/src/java.base/share/classes/java/lang/Throwable.java
@@ -24,6 +24,7 @@ package java.lang;
import java.io.*;
import java.util.ArrayList;
+import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.IdentityHashMap;
@@ -285,6 +286,7 @@ private static int countDuplicates(StackTraceElement[] currentStack, StackTraceE
StackTraceElement[] getInternalStackTrace() {
if (disableWritableStackTrace) {
+ System.out.println("getInternalStackTrace: return ZeroStrackTraceElementArray " + Arrays.toString(ZeroStackTraceElementArray));
return ZeroStackTraceElementArray;
}
@@ -292,6 +294,7 @@ StackTraceElement[] getInternalStackTrace() {
stackTrace = J9VMInternals.getStackTrace(this, true);
}
+ System.out.println("getInternalStackTrace: return " + Arrays.toString(stackTrace));
return stackTrace;
}
I ran the test with these options:
export OPENJ9_JAVA_OPTIONS="-Xint"
./gradlew --info clean test
Before adding this logging, I had run the test 50+ times with -Xint and saw no failures. When I limited the compile to only java/util/ImmutableCollections\$SetN\$SetNIterator.next()Ljava/lang/Object;, I saw failures in about 1/20 runs. I also saw about the same rate of failure when I removed the limit from the command line options and allowed the jit to compile any method.
As I mentioned here, after I added this logging, the test failed every run. However, I didn't try running it again with -Xint until now.
Something strange is happening here. It seems that the conditions for this failure can be triggered by either adding extra logging or allowing one or more methods to be compiled.
Reproduced the failure with -Xint option as well, it takes ~50 runs while it failed in first one or two runs w/o it. Like previous experiments, the native log didn't show a null returning value in Java_java_lang_J9VMInternals_getStackTrace().
Preliminary result:
There is a time-hole within j.l.Throwable.getInternalStackTrace()
https://github.com/eclipse/openj9/blob/50ba90c7297c641d9af24697cf87b0a585df1813/jcl/src/java.base/share/classes/java/lang/Throwable.java#L285-L296
between J9VMInternals.getStackTrace(this, true) and return stackTrace.
If j.l.Throwable.setStackTrace(trace) sets null within that time window, a NPE can occur while the native returns a non-null value.
Indeed, a log shows that (StackTraceElement[])trace.clone() returns null and eventually leads to NPE reported.
The test code in question does invoke j.l.Throwable.setStackTrace(trace) regularly:
https://github.com/mockito/mockito/blob/e734fe6697e1f397b952b304c5935cf86add6eba/src/main/java/org/mockito/internal/exceptions/stacktrace/ConditionalStackTraceFilter.java#L12-L25
Edit: assigning J9VMInternals.getStackTrace(this, true) result to a local variable and return it, no NPE observed so far.
How is the stackTrace being written? setStackTrace doesn't allow setting null:
https://github.com/eclipse/openj9/blob/50ba90c7297c641d9af24697cf87b0a585df1813/jcl/src/java.base/share/classes/java/lang/Throwable.java#L229-L233
Edit: assigning
J9VMInternals.getStackTrace(this, true)result to a local variable and return it, noNPEobserved so far.
I think this is a good thing to do but I'm not sure I understand the race window here.
@DanHeidinga ~it appears null value is from~ https://github.com/eclipse/openj9/blob/0edda637754386e320c0d39c6f5bd91135805bb3/jcl/src/java.base/share/classes/java/lang/Throwable.java#L244
~Here is the race window roughly:~
if (stackTrace == null) {
stackTrace = J9VMInternals.getStackTrace(this, true); --> non-null stackTrace
}
---> setStackTrace() assign a null value (still investigating why this happened).
return stackTrace; --> null value is returned and cause NPE later at (StackTraceElement[])getInternalStackTrace().clone();
Edit: it is not setStackTrace() setting null stackTrace value, currently leaning towards memory ordering issue. Assigning J9VMInternals.getStackTrace(this, true) result to a local variable to avoid refetching the instance variable #9918 can make the test pass.
Thanks to everyone - @rpshukla , @gacholio , @JasonFengJ9 - for tracking this down.
@jkuipers The fixes should appear in the AdoptOpenJDK nightly builds in next day or so. Would you be able to validate a nightly build fixes the issue?
Yeah, that shouldn't be a problem. Thanks a lot for investigating and fixing this, that was quite a quest ;)
I have relabeled this since the fix is in an area of the VM for tracking purposes.
@DanHeidinga should this be closed by https://github.com/eclipse/openj9/pull/9918 ?
yes