Openj9: GPF in j9vm29.dll with openj9-0.12.0 on Windows Server 2016

Created on 28 Feb 2019  路  33Comments  路  Source: eclipse/openj9

Hello,

We have a GPF (Global Protection Fault) crash of one of our Java utility running with AdoptOpenJDK 11.0.2 with OpenJ9 on Windows Server 2016, with a java dump showing that it occurred in j9vm29.dll.

I found the following page from an IBM site giving some hints about how to analyze the related Java dump:
https://www.ibm.com/support/knowledgecenter/en/SSYKE2_8.0.0/com.ibm.java.vm.80.doc/docs/javadump_tags_info.html

Based on this page, my understanding is that the crash took place in j9vm29.dll library which is part of the OpenJ9 JVM, and we have no JNI in our Java application.

This issue is not systematic, and when it occurred, the machine's memory usage wasn't heavy, and we found no interesting events in Windows events logs.

Here's the java full file:
javacore.20190227.222307.7284.0002.txt

if this crash report isn't sufficient for OpenJ9 developers to spot where the issue is, then would you please let me know what kind options I could add to 'java' command to get more traces in case this crash takes place again?

Kind regards,
Alexandre Vermeerbergen

vm userRaised

Most helpful comment

@JasonFengJ9 will look at this if @ChengJin01 is still busy with other issues.

All 33 comments

@avermeer is there core file created as part of the diagnostic output, and are you able to compress and share it via a private message?

Hello @pshipton,

Unfortunately no, I do not have a core file other than the java core file attached to the incident.

But our crash is on Windows 2016, isn't "core" file a notion specific to UNIX or Linux operating system?

(sorry if the question sounds trivial)

Kind regards,
Alexandre Vermeerbergen

@avermeer np. Windows has the same concept, and regardless of what Windows calls the file, the JVM renames it by default to core.<date>.<time>.<pid>.<seq>.dmp i.e. like core.20190228.125330.6080.0001.dmp. For a gpf, the default behavior is to create a number of diagnostic files. A javacore, snap (trace output), core, jitdump.

It is a bit surprising you only got the javacore file and nothing else. Usually we need the core file to diagnose a gpf problem. If there is console output, it shows the diagnostic files written, and where they were written. Like so:

JVMDUMP039I Processing dump event "vmstart", detail "" at 2019/02/28 12:53:30 - please wait.
JVMDUMP032I JVM requested System dump using 'C:\sdks\core.20190228.125330.6080.0001.dmp' in response to an event
JVMDUMP010I System dump written to C:\sdks\core.20190228.125330.6080.0001.dmp

Are you using any -Xdump options that would change the default behavior of the diagnostics? Or do you have the IBM_JAVA_OPTIONS or OPENJ9_JAVA_OPTIONS environment variable set with some options?

Is there enough disk space to create a system dump (core file).

(to answer my own question) looking that javacore there are no changes to the default -Xdump behavior, so a bit of a mystery why the file wasn't produced.

Hello @pshipton,

I confirm that there was no dump file generated by this crash, and we have not tweaked any options that could affect this dumping (the environment variables are listed in the java core, no such option seems to be set).

Not sure it can help, but I was told that the crash seems to be related to code doing extensive XSLT processing. And in the java core which I attached to this incident, it is indeed visible that couple of threads are in thing kind of XSLT related code ...

Kind regards,
Alexandre

I'm not sure what we can do without the core file. Do you have the console output from when the crash occurred?

Hello @pshipton ,

Unfortunately no I do not have this console output.

Is the crash repeatable?

The crash occurs randomly about every two weeks (our tool runs nightly over a large set of XML files).

Is there anything that can be done to preserve the console output when the crash occurs? There should be a core file, and the console output may help either show where it was written or what is going wrong when trying to create it.

@pshipton, I have some good news: we have been able to create a minimal sample to reproduce this GPF, and with this sample we are able to reproduce the crash on both Linux and Windows.

I will send the material & instructions before end of this week.

Stay tuned :)

@avermeer Please give it a try on the latest nightly build just in case it is already fixed. https://adoptopenjdk.net/nightly.html?variant=openjdk11&jvmVariant=openj9

Hello @pshipton,

Sorry for having taken that long to come back. I confirm the GPF crash can be reproduced with latest nightly build (I just tried again with today's build of the 20190317) and core file are generated!

Material to reproduce:

  • TestXsltFat.jar file in the attached ZIP file if you are in a hurry
  • GPFSampleSources.tgz in the attached ZIP file if you want to build the .jar from its source (On Linux, extract the .tgz using tar zxvf GPFSampleSources.tgz , then cd GPFSampleSources;mvm package => the TestXsltFat.jar will be generated in target/ subdirectory.

Steps to reproduce:

  • Very simple: just run:
    java.exe -jar TestXsltFat.jar
    withAdoptOpenJDK 11 with OpenJ9 11.0.2. But see above: I was also able to reproduce the GPF with latest nightly build of AdoptOpenJDK 11 with OpenJ9
  • Good news is that with this sample, the core files are generated. However, there's one that's huge. Please let me know if I should upload these core files and how give their size...
  • I was told by people working with me that using this sample it's also possible to reproduce the GPF on Linux, but I haven't yet been able to reproduce with my CentOS 6.7 bare metal PC. I will try later with CentOS 7.x on AWS, and let you know if I confirm the crash on Linux
  • Very important : you may need to run several times the sample to reproduce the crash: on a Window Server 2012 R2 machine, it took me 4 to 5 attempts before getting a a crash with core using OpenJ9 11.0.2, then I switched to 20190317 nightly build and I got immediately another crash. On my on Windows 10 powered laptop I got the crash at 1st attempt, so experience can vary.
  • Other people around me mentioned that once they got the crash, they could even trigger another crash by just running java -version. Weird, isn't it ?

Hope this will help finding out these GPF's cause and hopefully make OpenJ9 better!

Kind regards,
Alexandre Vermeerbergen

CleanIncident.zip

Hello,

More good news: I have been able to reproduce the GPF crash on Linux with the sample provided in my previous post, by just running the sample thru a cron every 15 minutes on a CentOS 7.2 Linux VM on AWS, using AdoptOpenJDK11 with OpenJ9 nighly build of the 2019/03/17.

The output shows that core files have been indeed been generated, see below.
Please let me know if you need me to upload these or to run some commands to help you narrowing the issue's cause:

Type=Bus error vmState=0x00000000
J9Generic_Signal_Number=00000008 Signal_Number=00000007 Error_Value=00000000 Signal_Code=00000080
Handler1=00007F1051E90DC0 Handler2=00007F105178D4C0 InaccessibleAddress=0000000000000000
RDI=00007F0FF31EF130 RSI=00007F0FF31EF138 RAX=00007F1008079A40 RBX=00007F100806CD78
RCX=0000000000000000 RDX=00007F104C011E10 R8=000000000000000C R9=0000000000000074
R10=0000000000000000 R11=0000000000000018 R12=00007F0FF31EF188 R13=00007F0FF31EF1C8
R14=00007F0FFC1A7690 R15=00007F0FFC12BAA0
RIP=00007F1051E6F0A4 GS=0000 FS=0000 RSP=00007F0FF31EF130
EFlags=0000000000010206 CS=0033 RBP=5B30405ABADBAD21 ERR=0000000000000000
TRAPNO=000000000000000C OLDMASK=0000000000000000 CR2=0000000000000000
xmm0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm1 2f6665722f6d7464 (f: 795702400.000000, d: 2.361059e-80)
xmm2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm3 43e0000000000000 (f: 0.000000, d: 9.223372e+18)
xmm4 402c4c91fbe63ee3 (f: 4226170624.000000, d: 1.414955e+01)
xmm5 3fe80691a7ca06c2 (f: 2815035136.000000, d: 7.508019e-01)
xmm6 402c4c92bf1bc122 (f: 3206267136.000000, d: 1.414956e+01)
xmm7 3bbcc86800000000 (f: 0.000000, d: 6.095003e-21)
xmm8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm9 3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
xmm10 bee0400000000000 (f: 0.000000, d: -7.748604e-06)
xmm11 3d735816bc5ca07e (f: 3160186880.000000, d: 1.099585e-12)
xmm12 bcb8000000000000 (f: 0.000000, d: -3.330669e-16)
xmm13 bc74400000000000 (f: 0.000000, d: -1.756408e-17)
xmm14 bc50f76c57075e9e (f: 1460100736.000000, d: -3.679023e-18)
xmm15 402bb9d3beb8c600 (f: 3199780352.000000, d: 1.386294e+01)
Module=/mnt/jdk-11.0.3+2/lib/compressedrefs/libj9vm29.so
Module_base_address=00007F1051DF2000
Target=2_90_20190315_169 (Linux 3.10.0-862.11.6.el7.x86_64)
CPU=amd64 (2 logical CPUs) (0x1c949f000 RAM)
----------- Stack Backtrace -----------
(0x00007F1051E6F0A4 [libj9vm29.so+0x7d0a4])
(0x00007F1051F343ED [libj9vm29.so+0x1423ed])
(0x00007F1051ED5626 [libj9vm29.so+0xe3626])
(0x00007F1051EC276B [libj9vm29.so+0xd076b])
(0x00007F1051EC2F11 [libj9vm29.so+0xd0f11])
(0x00007F1051EB7801 [libj9vm29.so+0xc5801])
(0x00007F1051EB9802 [libj9vm29.so+0xc7802])
(0x00007F1051E15C4C [libj9vm29.so+0x23c4c])
(0x00007F1051E0AE10 [libj9vm29.so+0x18e10])
(0x00007F1051ECBD02 [libj9vm29.so+0xd9d02])
---------------------------------------
JVMDUMP039I Processing dump event "gpf", detail "" at 2019/03/17 23:15:05 - please wait.
JVMDUMP032I JVM requested System dump using '/root/core.20190317.231505.14779.0001.dmp' in response to an event
Warning: unable to move dump to "/root/core.20190317.231505.14779.0001.dmp" across file systems (check kernel core_pattern). Using alternate file location "/mnt/core.20190317.231505.14779.0001.dmp"
JVMDUMP010I System dump written to /mnt/core.20190317.231505.14779.0001.dmp
JVMDUMP032I JVM requested Java dump using '/root/javacore.20190317.231505.14779.0002.txt' in response to an event
JVMDUMP010I Java dump written to /root/javacore.20190317.231505.14779.0002.txt
JVMDUMP032I JVM requested Snap dump using '/root/Snap.20190317.231505.14779.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /root/Snap.20190317.231505.14779.0003.trc
JVMDUMP010I JIT dump written to /root/jitdump.20190317.231505.14779.0004.dmp
JVMDUMP013I Processed dump event "gpf", detail "".
<forked VM failed with exit code 255>
<stdout last='20 lines'>
</stdout>
<stderr last='20 lines'>
(0x00007F1051ED5626 [libj9vm29.so+0xe3626])
(0x00007F1051EC276B [libj9vm29.so+0xd076b])
(0x00007F1051EC2F11 [libj9vm29.so+0xd0f11])
(0x00007F1051EB7801 [libj9vm29.so+0xc5801])
(0x00007F1051EB9802 [libj9vm29.so+0xc7802])
(0x00007F1051E15C4C [libj9vm29.so+0x23c4c])
(0x00007F1051E0AE10 [libj9vm29.so+0x18e10])
(0x00007F1051ECBD02 [libj9vm29.so+0xd9d02])
---------------------------------------
JVMDUMP039I Processing dump event "gpf", detail "" at 2019/03/17 23:15:05 - please wait.
JVMDUMP032I JVM requested System dump using '/root/core.20190317.231505.14779.0001.dmp' in response to an event
Warning: unable to move dump to "/root/core.20190317.231505.14779.0001.dmp" across file systems (check kernel core_pattern). Using alternate file location "/mnt/core.20190317.231505.14779.0001.dmp"
JVMDUMP010I System dump written to /mnt/core.20190317.231505.14779.0001.dmp
JVMDUMP032I JVM requested Java dump using '/root/javacore.20190317.231505.14779.0002.txt' in response to an event
JVMDUMP010I Java dump written to /root/javacore.20190317.231505.14779.0002.txt
JVMDUMP032I JVM requested Snap dump using '/root/Snap.20190317.231505.14779.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /root/Snap.20190317.231505.14779.0003.trc
JVMDUMP007I JVM Requesting JIT dump using '/root/jitdump.20190317.231505.14779.0004.dmp'
JVMDUMP010I JIT dump written to /root/jitdump.20190317.231505.14779.0004.dmp
JVMDUMP013I Processed dump event "gpf", detail "".

Hope it helps,
Alexandre Vermeerbergen

I was able to recreate the gpf on Linux after a few runs.

#13 0x00007f99ed4540a4 in moduleHashEqualFn () from /home/pete/jdk-11.0.3+2/lib/compressedrefs/libj9vm29.so
#14 0x00007f99ed5193ed in hashTableFind () from /home/pete/jdk-11.0.3+2/lib/compressedrefs/libj9vm29.so
#15 0x00007f99ed4ba626 in isAllowedReadAccessToModule () from /home/pete/jdk-11.0.3+2/lib/compressedrefs/libj9vm29.so
#16 0x00007f99ed4a776b in checkModuleAccess () from /home/pete/jdk-11.0.3+2/lib/compressedrefs/libj9vm29.so
#17 0x00007f99ed4a7f11 in checkVisibility () from /home/pete/jdk-11.0.3+2/lib/compressedrefs/libj9vm29.so
#18 0x00007f99ed49d8d6 in resolveInstanceFieldRefInto () from /home/pete/jdk-11.0.3+2/lib/compressedrefs/libj9vm29.so
#19 0x00007f99ed49dd24 in resolveInstanceFieldRef () from /home/pete/jdk-11.0.3+2/lib/compressedrefs/libj9vm29.so
#20 0x00007f99ed40a905 in VM_BytecodeInterpreter::run(J9VMThread*) ()
   from /home/pete/jdk-11.0.3+2/lib/compressedrefs/libj9vm29.so

Diagnostic files copied to /team/triage/openj9-4921

@DanHeidinga

@ChengJin01 Can you take a look at this? Peter can hopefully share the cores with you if needed.

Hello, is there a need for more inputs from my side to get this GPF analyzed ?

@ChengJin01 have you had a chance to look at this?

@JasonFengJ9 will look at this if @ChengJin01 is still busy with other issues.

Still working on other crash issue at #4904 on Windows. Will double-check this crash on Windows when I got a chance.

@avermeer , I tried a couple of times with the same build 20190317 you mentioned above but unable to reproduce it on Windows.

Could you try to compress & upload the core dumps to Google Drive and share it with me via [email protected] ? I will double-check to see whether it ended up with the same native stack as Linux.

Reproduced the crash at Linux AMD, got stacktrace like https://github.com/eclipse/openj9/issues/4921#issuecomment-474097699

Update:

The native stacktrace in question:

#13 <signal handler called>
#14 0x00007fee8105da12 in moduleHashEqualFn (tableNode=0x7fee04115e48, queryNode=0x7fee3971fb58, userData=0x7fee7c011b80) at ModularityHashTables.c:78
#15 0x00007fee8114a21d in hashTableFindNodeSpaceOpt (head=<optimized out>, entry=<optimized out>, table=<optimized out>) at hashtable.c:468

The corresponding code:

    const J9Module* const tableNodeModule = *((J9Module**)tableNode);
    j9object_t tableNodeModuleName = tableNodeModule->moduleName;  --> crashed here

DDR output:

> !j9module 0x43dcaa72badbad21
J9Module at 0x43dcaa72badbad21 {
  Fields for J9Module:
    0x0: struct J9Object* moduleName = <FAULT>  --> crashed at this reference
    0x8: struct J9Object* moduleObject = <FAULT>

This occurred when checking if from module jdk.translet is allowed to access to module java.xml via hashTableFind(toModule->readAccessHashTable, &fromModule). Apparently there are some unloaded modules within the readAccessHashTable which are supposed to be removed via jvmfree.c:freeJ9Module(J9Module *j9module) through j9module->removeAccessHashTable.

Next step is to find out why those unloaded modules weren't removed from toModule->readAccessHashTable.

There were quite a few module object with name jdk.translet:

> !j9module 0x7f6df8090950
J9Module at 0x7f6df8090950 {
  Fields for J9Module:
    0x0: struct J9Object* moduleName = !j9object 0x00000000E033B0D8 // java/lang/String
    0x18: struct J9ClassLoader* classLoader = !j9classloader 0x00007F6DF8098C00
}
Module name: jdk.translet

> !j9classloader 0x00007F6DF8098C00
J9ClassLoader at 0x7f6df8098c00 {
  Fields for J9ClassLoader:
    0x18: struct J9Object* classLoaderObject = !j9object 0x00000000FFD19438 // com/sun/org/apache/xalan/internal/xsltc/trax/TemplatesImpl$TransletClassLoader
}

jdk.translet module objects are created by com/sun/org/apache/xalan/internal/xsltc/trax/TemplatesImpl$TransletClassLoader which is a local instance and is subject to GC.
It appears some jdk.translet module objects are GCed but their references are still within java.xml module readAccessHashTable and eventually caused the crash.

@JasonFengJ9 any progress?

The initial thought was the J9Module in question is freed after TransletClassLoader and jdk.translet are GCed. However from the snaptrace, freeJ9Module() wasn't be invoked and apparently the classloader wasn't unloaded either.
Currently looking at two possibilities, one is the scenario that module object is unloaded but the classloader wasn't unloaded hencefreeJ9Module() is not invoked; the other is that the hashtable entries might be confused due to multiple modules with same name.
Due to the intermittent failure nature, I have to keep adding trace points to find out where the bad object is created and how it becomes invalid. The build and test cycle is a bit long (~one hour to reproduce the failure) and there were distraction from other workitems as well.

Update:
The problem appears caused by multiple module objects with same name which is interned and shared.

Following two different j9module have different moduleObject but same moduleName

> !j9module 0x7f1e90030190
J9Module at 0x7f1e90030190 {
  Fields for J9Module:
    0x0: struct J9Object* moduleName = !j9object 0x00000000FD132E08 // java/lang/String
    0x8: struct J9Object* moduleObject = !j9object 0x00000000FD262F78 // java/lang/Module
}
Module name: jdk.translet

> !j9module 0x00007f1e881b4788
J9Module at 0x7f1e881b4788 {
  Fields for J9Module:
    0x0: struct J9Object* moduleName = !j9object 0x00000000FD132E08 // java/lang/String
    0x8: struct J9Object* moduleObject = !j9object 0x00000000FFDB5888 // java/lang/Module
}
Module name: jdk.translet

The hashModuleTable uses the moduleName as the moduleHashEqualFn:
https://github.com/eclipse/openj9/blob/eadc32af8ec8645afd86f5e166f4f94b9eeab843/runtime/vm/ModularityHashTables.c#L35-L52

jdk.translet is a dynamically generated module and one such module will be generated for each call to com.sun.org.apache.xalan.internal.xsltc.trax.TemplatesImpl.defineTransletClasses(). jdk.translet is required to read module java.xml and need to be added to java.xml readAccessHashTable, first time it was performed as expected, afterwards, the newly created jdk.translet won't be added because of java11vmi.c:hashTableAtPut(table, value, collisionIsFailure = false). Later usage of jdk.translet actually refer the object added first until it becomes invalid and causes crash.
https://github.com/eclipse/openj9/blob/eadc32af8ec8645afd86f5e166f4f94b9eeab843/runtime/j9vm/java11vmi.c#L94-L119

Next step is to verify the theory via module hash function modification.

Great news!

Question: I see that it's closed on 0.14.0 release, but current nightly builds have 0.13.0 version.
Am I missing something?
Is there a way this fix could be part of upcoming nightly builds?

Kind regards,
Alexandre

@avermeer the PR was just merged and will be available in a nightly build next day or two. I think 0.14.0 release build is not ready yet.

@JasonFengJ9 alright, I will wait a little bit to test using nightly build then.
BTW, is there a way to know when a given PR is part of a given nightly built?
For example, is there any file in the binary distribution that gets the list of all fixes?

@avermeer thanks for your patience. This PR commit has SHA c11ff94, SDK -version gives a OpenJ9 SHA. If the PR SHA is a parent of the SDK OpenJ9 SHA, the fix is included.
I think the SDK build has a way to show the PR included since last build, not sure if there is such standalone file showing the commits since last binary distribution, nice feature though.

@avermeer the nightly build should not show 0.13 or another OpenJ9 release version. This is the latest nightly build for xlinux. It has dates and SHAs.
The 0.14 release will be created after April 16th. We can't commit to an exact date until the OpenJDK changes for the release are known.

openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.3+2-201904021826)
Eclipse OpenJ9 VM AdoptOpenJDK (build master-127ba34f1, JRE 11 Linux amd64-64-Bit Compressed References 20190402_189 (JIT enabled, AOT enabled)
OpenJ9   - 127ba34f1
OMR      - 5ceecf19
JCL      - ff6f49a4ff based on )
Was this page helpful?
0 / 5 - 0 ratings