Openj9: Segmentation error when running a test using 11.0.4+11

Created on 28 Aug 2019  路  37Comments  路  Source: eclipse/openj9

Java version information from core file:

1CIJAVAVERSION JRE 11 Linux amd64-64 (build 11.0.4+11)
1CIVMVERSION   20190717_286
1CIJ9VMTAG     openj9-0.15.1
1CIJ9VMVERSION 0f66c6431
1CIJITVERSION  0f66c6431
1CIOMRVERSION  ec782f26_CMPRSS
1CIJCLVERSION  fa49279450 based on jdk-11.0.4+11
1CIJITMODES    JIT enabled, AOT enabled, FSD disabled, HCR enabled
1CIRUNNINGAS   Running as a standalone JVM
1CIVMIDLESTATE VM Idle State: ACTIVE
1CICONTINFO    Running in container : FALSE
1CICGRPINFO    JVM support for cgroups enabled : TRUE

Summary of problem:

[junit] Unhandled exception [junit] Type=Segmentation error vmState=0x00000000 when running a WebSphere Liberty unit test; the problem did not occur until trying to upgrade to 11.0.4+11.

Java core file is at https://gist.github.com/wraschke/88cf3b243e640b3b1591f7ce002bd1c8

jit userRaised

Most helpful comment

I shared a zip of these files with Henry at https://ibm.ent.box.com/file/516304039514.

All 37 comments

Abort is occurring on the IProfiler thread:
https://gist.github.com/wraschke/88cf3b243e640b3b1591f7ce002bd1c8#file-javacore-L1842-L1886

FYI @andrewcraik @hzongaro

I've annotated the call stack on the IProfiler thread with the methods I believe are on the call stack when the segmentation error happens:

4XENATIVESTACK               abort+0x16a (0x00007F622801B02A [libc.so.6+0x3702a])
4XENATIVESTACK               (0x00007F622645F5CB [libj9prt29.so+0x1e5cb])  masterSynchSignalHandler+171
4XENATIVESTACK               (0x00007F62287D4390 [libpthread.so.0+0x11390])
4XENATIVESTACK               (0x00007F62245AAD90 [libj9jit29.so+0x204d90]) _ZN12TR_IProfiler21searchForMethodSampleEP20TR_OpaqueMethodBlocki+32   TR_IProfiler::searchForMethodSample
4XENATIVESTACK               (0x00007F62245AAE12 [libj9jit29.so+0x204e12]) _ZN12TR_IProfiler23findOrCreateMethodEntryEP8J9MethodS1_bj+98   TR_IProfiler::findOrCreateMethodEntry
4XENATIVESTACK               (0x00007F62245B02C0 [libj9jit29.so+0x20a2c0]) _ZN12TR_IProfiler11parseBufferEP10J9VMThreadPKhmb+1520   TR_IProfiler::parseBuffer
4XENATIVESTACK               (0x00007F62245B0467 [libj9jit29.so+0x20a467]) _ZN12TR_IProfiler19processWorkingQueueEv+263   TR_IProfiler::processWorkingQueue
4XENATIVESTACK               (0x00007F62245B053A [libj9jit29.so+0x20a53a]) _ZL19iprofilerThreadProcPv+186  iprofilerThreadProc
4XENATIVESTACK               (0x00007F62268B90E6 [libj9thr29.so+0xe0e6])
4XENATIVESTACK               (0x00007F62287CA6BA [libpthread.so.0+0x76ba])
4XENATIVESTACK clone+0x6d (0x00007F62280EB41D [libc.so.6+0x10741d])

This is the code for ~TR_IProfiler::findOrCreateMethodEntry~ TR_IProfiler::searchForMethodSample. I'm just trying to work out precisely which of the pointer dereferences in there goes wrong.

Wendy @wraschke, would it be possible to produce a core file for the segmentation error?

@hzongaro I'm not completely understanding your question.

Are you asking if I can recreate the segmentation error and get another javacore generated?

If so, I can.

Are you asking if I can recreate the segmentation error and get another javacore generated?

I think @hzongaro is asking for a system core he core work in a debugger. There may already be one from the previous abort on your system.

I think @hzongaro is asking for a system core he core work in a debugger. There may already be one from the previous abort on your system.

That's right. Thanks for clarifying, Dan @DanHeidinga

I'm totally showing my ignorance. :sweat_smile:

From previous segmentation error, I have files

Snap.20190823.204140.9905.0005.trc
core.20190823.204140.9905.0001.dmp
core.20190823.204142.9905.0004.dmp

Any of those you're after?

These are the system cores:

core.20190823.204140.9905.0001.dmp
core.20190823.204142.9905.0004.dmp

Can you upload them somewhere (ie: Box) and provide a link? Or send the details on where to download the cores directly to @hzongaro

I shared a zip of these files with Henry at https://ibm.ent.box.com/file/516304039514.

Thanks, Wendy @wraschke! I was able to download them. I'll continue investigating.

Here's where I am with core.20190823.204142.9905.0004.dmp. On entry to TR_IProfiler::searchForMethodSample, omb is in $rsi which has the value 0x00000000013c9960. That refers to J9Method org/objectweb/asm/util/CheckClassAdapter.visitEnd, and bucket is in $rdx and has the value 0x220:

R_IPMethodHashTableEntry *
TR_IProfiler::searchForMethodSample(TR_OpaqueMethodBlock *omb, int32_t bucket)
   {
//   printf("Searching for method %p in bucket %d\n",omb,bucket);
   TR_IPMethodHashTableEntry *entry;
   for (entry = _methodHashTable[bucket]; entry; entry = entry->_next)
      {
//      printf("entry = %p entry->_method = %p\n",entry,entry->_method);
      if (omb == entry->_method)
         return entry;
      }
   return NULL;
   }

The IProfiler instance is in $rdi with the value 0x00007f62153d9290. The _methodHashTable field is at offset 0x0c8 and the address it contains is 00007f621541c9c0. The initial value of entry in the loop at _methodHashTable[0x220] is 00007f61f682b350. Walking through the loop,

When entry == 0x00007f61f682b350
entry->_next == 00007f6224ff3168 and entry->_method == NULL
When entry == 0x00007f6224ff3168
entry->_next == 00007f62245b1f40 and entry->_method == 00007f62245b1f50
When entry == 0x00007f62245b1f40
entry->_next == 1f0f2e66c320478b and entry->_method == 9066000000000084
When entry == 1f0f2e66c320478b
entry->_method yields a segmentation error.

The error happens on this instruction, and $rax does contain 0x1f0f2e66c320478b at that point.

0x7f62245aad90 {libj9jit29.so}{_ZN12TR_IProfiler21searchForMethodSampleEP20TR_OpaqueMethodBlocki} +32  48397008             cmp       qword ptr [rax+8], rsi <<< ^+44

Now working on understanding how either the entry in _methodHashTable might come to be corrupted or the entry _methodHashTable[0x220] points to is corrupted.

Update 2019-09-04: The object that's stored in _methodHashTable[0x220] appears to be a reference to an instance of TR_IPBCDataFourBytes, rather than an instance of TR_IPMethodHashTableEntry, as would be expected. A reference to the same object appears in _bcHashTable[0x6007].

Wendy @wraschke, does the problem recur 100% of the time or is it intermittent? Also, may I ask you to try running that test with the Java option -Xshare:off to check whether the problem can be reproduced with that option? I'm just trying to pin down where things are going wrong.

Actually, rather than -Xshare:off would it be possible to check whether the problem can be reproduced with the -Xjit:dontUsePersistentIprofiler option? Sorry for the churn, Wendy @wraschke!

Hi, Henry, sorry for the delay. I will try to get to this later today.

@hzongaro finally! Please see https://ibm.ent.box.com/file/521929023954, which I shared with you.

Thanks for those new core and javacore files, Wendy @wraschke. I was hoping that -Xjit:dontUserPersistentIprofiler would avoid the problem, and that would give a strong hint to where the problem lies. Unfortunately, that didn't work out.

I think we have two ways of proceeding. Either I can try adding some debugging code to the library for the JIT compiler, and ask you to use that in some more runs of this test, or I can try setting up the test to run locally myself. With the latter option, I can hopefully investigate without having to bug you too much, except for some initial help with getting Liberty and the test set up.

Which would you prefer?

By the way, does the problem occur 100% of the time in running this test, or is it intermittent? If it's intermittent, how frequently does it occur?

@hzongaro actually, with that flag, a different unittest failed, while another one that had been failing before the flag still failed.

Running yourself in your own environment would be too complicated: you'd have to download two repos, build one one repo and then do another, and this doesn't include getting you access and etc.

Yes, please supply the debug in the JIT compiler and then my colleague will run the unittest with it, as he'll be taking over this effort from me.

Thanks so much!

Sounds good, Wendy @wraschke. I've uploaded a zip file that contains a debug version of the JIT compiler.

Unzip the file in /somedirectory. Then export the environment variable

export traceIProfilerHashing=yes

and run the test with the option -XXjitdirectory=/somedirectory/traceIProfilingHashJIT.r1.

The debug output will be directed to stderr. If that's inconvenient, I can rework it a bit to direct the output to a file. When it fails, please send me the contents of stderr and the core file.

Please let me know if your colleague needs access to the zip file in Box. Thanks!

Hi @hzongaro I'm working on this with @wraschke as she's a bit busy. I can't download the zip file you posted, could you share it with me as well please?

I can't download the zip file you posted, could you share it with me as well please?

Hi, Andrew @Azquelt - OK - you should now have access.

Thanks @hzongaro, that seems to have worked, I can see a bunch of trace in the log.

The log and dumps are here: https://ibm.ent.box.com/folder/87487164511

Thank you, Andrew! This time the crash occurred in a slightly different place within IProfiler. I've uploaded a zip file containing another JIT compiler with additional debugging code to try to capture more information about what might be going wrong.

As before, unzip the file in /somedirectory. Then export the environment variable

export traceIProfilerHashing=yes

and run the test with the option -XXjitdirectory=/somedirectory/traceIProfilingHashJIT.r2 - note the change in the directory.

@hzongaro that's done, logs are in the same directory as above: https://ibm.ent.box.com/folder/87487164511

Update: Sorry - there are gaps in the debugging code added to the JIT compiler described here. I'll submit another.

Thanks, Andrew! I've added more debugging information to the JIT compiler, and I've uploaded it to https://ibm.box.com/s/5lrj2gml1yg0xbgjqp1wsp36vt4tyswx~~

As before, unzip the file into /somedirectory. Then export the environment variable

~~ export traceIProfilerHashing=yes ~~

and run the test with the option -XXjitdirectory=/somedirectory/traceIProfilingHashJIT.r3.

This one adds some additional debugging to a lower level memory management routine. I'm hoping it doesn't affect your ability to reproduce the problem or result in log files that are much too large.

I've encountered a similar error (https://github.com/OpenLiberty/open-liberty/issues/9039) through a different scenario.

Core dump and output files from Linux and Mac are here: https://ibm.box.com/s/bobo9x1feyj3143k2f9jtyx725k0kqxl

On both Linux and Mac, the issue occurs with OpenJDK 11.0.4+11. It did not seem to occur on the 11.0.5+7 nightly build (although I did encounter a core dump once through slightly different steps but could not reproduce it again and did not retain the dump files).

Steps to reproduce:

  1. Set $JAVA_HOME to OpenJDK 11.0.4+11 on OpenJ9.

  2. Run the following

git clone https://github.com/OpenLiberty/guide-getting-started.git
cd guide-getting-started/start
mvn clean install liberty:run
  1. After you see CWWKF0011I: The GettingStartedServer server is ready to run a smarter planet. in the terminal, add the following into target/liberty/wlp/usr/servers/GettingStartedServer/server.xml between the <featureManager> and </featureManager> tags and save the file.
<feature>mpHealth-1.0</feature>
  1. Add the following after (outside of) the </featureManager> tag and save the file again.
<logging traceSpecification="com.ibm.ws.microprofile.health.*=all" />
  1. If the core dump doesn't happen, delete the added lines from the two above steps and save the file again.

I've just uploaded a JIT compiler with additional debugging in issue6875.traceIProfilingHashJIT.r4.zip.

As before, unzip the file into /somedirectory. Then export the environment variable

export traceIProfilerHashing=yes

and run the test with the option -XXjitdirectory=/somedirectory/traceIProfilingHashJIT.r4

This one adds some additional debugging to lower level memory management routines. I'm hoping it doesn't affect your ability to reproduce the problem or result in log files that are much too large.

Thanks, Eric @ericglau! I opened up one of the core files you uploaded, and I was able to confirm that this is the same problem. Next I'll try reproducing the problem as you've described.

@hzongaro ... just FYI, Andrew is away this week. Hopefully Eric can help you make some progress with this... not sure that Wendy or myself will get enough time.

Thanks, Tom @tevans78 - I'm able to reproduce the problem fairly consistently with the instructions that Eric @ericglau provided, so I shouldn't need any further help tracking down the problem. Once I have a possible fix, I'll likely ask Andrew @Azquelt and Eric to try it out one more time.

Eric @ericglau and Andrew @Azquelt - I think I've tracked down the source of the problem, and it turns out it was recently fixed under pull request #6932. May I ask you to try a more recent nightly build to see whether it resolves the segmentation errors you were seeing? You can try this recent nightly build, for instance: https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk11u-2019-09-24-05-15/OpenJDK11U-jdk_x64_linux_openj9_2019-09-24-05-15.tar.gz

@hzongaro I'll give that a try and get back to you.

@hzongaro I don't see the segmentation errors anymore on the latest nightly builds for Linux (11.0.5+8-201909260313) and Mac (11.0.5+8-201909260314).

Closing this one as already fixed.

I haven't quite completed my testing (in a remote build, which takes a while and can sometimes be flakey). Can we leave this open until I've verified?

reopening

@wraschke I'm just going to close it. If you find a problem pls let us know and we can re-open at that time. If there is still a problem there likely isn't time to get it resolved for the 0.17 release.

Thank you, @pshipton! I am just confirming that I no longer see those errors when I run the previously failing tests.

But my org does need the actual released JDK. How can we get https://github.com/eclipse/openj9/pull/6932 in an "official" build?

How can we get #6932 in an "official" build?

This fix is in the 0.16 release of Java 13 which was released in Sept and will be in the 0.17 release of Java 8, 11, 13 which will be released in mid October.

Binaries will be available from AdoptOpenJDK as usual

Was this page helpful?
0 / 5 - 0 ratings