The issue is a stack misalignment when storing an XMM register in C code that has been called directly from the code cache:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf7d4ab40 (LWP 11187)]
0xf65f1fc7 in TR_EmbeddedHashTable<unsigned int, 2u>::rearrange(TR_HashTableProfilerInfo<unsigned int>::HashFunction&) ()
from /bluebird/builds/bld_426736/sdk/xi3280/jre/lib/i386/default/libj9jit29.so
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.i686 libgcc-4.8.5-39.el7.i686
(gdb) bt
#0 0xf65f1fc7 in TR_EmbeddedHashTable<unsigned int, 2u>::rearrange(TR_HashTableProfilerInfo<unsigned int>::HashFunction&) ()
from /bluebird/builds/bld_426736/sdk/xi3280/jre/lib/i386/default/libj9jit29.so
#1 0xf65f3163 in TR_EmbeddedHashTable<unsigned int, 2u>::addKey(unsigned int) ()
from /bluebird/builds/bld_426736/sdk/xi3280/jre/lib/i386/default/libj9jit29.so
#2 0xf65f48aa in _jProfile32BitValue () from /bluebird/builds/bld_426736/sdk/xi3280/jre/lib/i386/default/libj9jit29.so
#3 0x5563bff5 in ?? ()
#4 0xf7409fa0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) frame 0
#0 0xf65f1fc7 in TR_EmbeddedHashTable<unsigned int, 2u>::rearrange(TR_HashTableProfilerInfo<unsigned int>::HashFunction&) ()
from /bluebird/builds/bld_426736/sdk/xi3280/jre/lib/i386/default/libj9jit29.so
(gdb) x/20i $pc
=> 0xf65f1fc7 <_ZN20TR_EmbeddedHashTableIjLj2EE9rearrangeERN24TR_HashTableProfilerInfoIjE12HashFunctionE+55>:
movaps %xmm0,0x30(%esp)
esp 0xd2f05644 <- not 16-byte aligned
(gdb) frame 3
#3 0x5563bff5 in ?? ()
(gdb) x/20i $pc-14
0x5563bfe7: mov %edx,%edi
0x5563bfe9: mov $0xd544dc08,%edx
0x5563bfee: mov %edi,%ecx
0x5563bff0: call 0xf65f48a0 <_jProfile32BitValue>
=> 0x5563bff5: mov %edi,%edx
0x5563bff7: jmp 0x5563a397
esp 0xd2f05734 <- not 16-byte aligned
The calling compiled method is:
+ (profiled very-hot) j9vm/test/hashTable/HashTableCorrectnessTest.addNumberOfStringsToTree(Ljava/util/TreeSet;I)V @ 55638824-5563C686 OrdinaryMethod 30.30% T Q_SZ=0 Q_SZI=0 QW=100 j9m=D316C2A0 bcsz=64 JPROF compThread=0 CpuLoad=200%(100%avg) JvmCpu=199%
I've verified that both of the C functions are maintaining correct stack alignment.
The issue appears to be that the incorrect linkage is being used to call the helpers, so the stack is not being aligned or switched back to the native stack.
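As a side note on the faulting instruction: movaps requires a 16-byte-aligned memory operand, so anything equivalent to the aligned-store intrinsic below faults the same way when handed an address that is off by 4. This is a minimal standalone illustration, not code from the JIT.

#include <xmmintrin.h>

// movaps is the aligned 128-bit store: it faults (seen as SIGSEGV on Linux)
// if dst is not 16-byte aligned, whereas the unaligned form
// (movups / _mm_storeu_ps) would succeed.
void storeAligned(float *dst, __m128 value)
   {
   _mm_store_ps(dst, value); // typically compiles to movaps
   }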
@liqunl @r30shah
@gacholio The linkage is good; the problem is that we don't align the stack for the C helper linkage.
thanks @liqunl - good catch.
@gacholio There is a problem with stack alignment on 32-bit in general; it is in https://github.com/eclipse/openj9/blob/master/runtime/compiler/x/codegen/X86PrivateLinkage.cpp#L90. Like you said, the test still fails after changing 4 to 16. I'm still investigating.
I'm pretty sure Victor got this working for the VM-provided helpers (like fast_jitNewObject). Maybe start looking there.
I think the problem is that the C++ function's stack is not aligned. The following instructions push 16 bytes onto the stack, then make the call. The stack was 16-byte aligned before calling TR_EmbeddedHashTable<unsigned int, 2u>::addKey(unsigned int); after the call at 0x01066beb the stack is not aligned. @gacholio Could it be related to the C compiler and the compiler flags we're using?
(kca) (0x01066beb-11)/10i
0x01066be0 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +0 55 push ebp
0x01066be1 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +1 89e5 mov ebp, esp
0x01066be3 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +3 57 push edi
0x01066be4 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +4 56 push esi
0x01066be5 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +5 53 push ebx
0x01066be6 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +6 e85193ebff call 0x0f1ff3c ^{libj9jit29.so}{__x86.get_pc_thunk.di} +0
0x01066beb {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +11 81c715349b00 add edi, 0x9b3415
0x01066bf1 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +17 83ec3c sub esp, 0x3c
0x01066bf4 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +20 8b7508 mov esi, dword ptr [ebp+8]
0x01066bf7 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +23 65a114000000 mov eax, dword ptr gs:[0x14]
The C linkage requires that the stack be 16-byte aligned when running in the C stack frame (in the x86-32 case, this means ESP must be 16-byte aligned at the CALL instruction, so on entry to the callee ESP+4 is 16-byte aligned due to the pushed return address).
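A minimal standalone sketch of that invariant, with purely illustrative names and nothing taken from the codebase:

#include <cassert>
#include <cstdint>

// x86-32 SysV convention: ESP is 16-byte aligned at the CALL instruction, so
// after the 4-byte return address is pushed the callee sees ESP+4 aligned.
void checkCallSiteAlignment(uintptr_t espAtCall)
   {
   assert(espAtCall % 16 == 0);          // 16-byte aligned at the CALL
   uintptr_t espOnEntry = espAtCall - 4; // the CALL pushes a 4-byte return address
   assert((espOnEntry + 4) % 16 == 0);   // what the callee is entitled to assume
   }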
If you really don't want to switch stacks, there are really only two solutions:
1) Keep the java stack at a known alignment on entry to compiled methods (wastes some java stack)
2) Forcibly align the SP (means recording the original SP across the call) - see the sketch at the end of this comment
I believe the lowest overall impact to compiled code would be to switch the stack when calling the helpers. Again, what did Victor do for the existing ones?
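For option 2, the per-call-site fix boils down to something like the following. This is C++ pseudocode for what would really be a couple of emitted instructions; the function and slot are made up for illustration only.

#include <cstdint>

// Sketch only: round the SP down to a 16-byte boundary right before the
// helper CALL, keeping the original value somewhere (a register or scratch
// slot) so it can be restored after the helper returns.
uintptr_t forceAlignForHelperCall(uintptr_t javaSP, uintptr_t *savedSlot)
   {
   *savedSlot = javaSP;                          // record the original SP across the call
   return javaSP & ~static_cast<uintptr_t>(0xF); // 16-byte aligned at the CALL
   }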
Sorry, my understanding of C stack alignment was wrong. I thought it was the same as what we do in the prologue for the java stack, where the stack is N-aligned on entry to jitted code on x86; currently N is 4 bytes for 32-bit and 16 bytes for 64-bit. But this java stack alignment is also wrong: the stack should be 16-byte aligned before the call on X.
I was under the impression that the java stack is double-slot aligned on entry to jitted code called from the interpreter, and the stack alignment was implemented under this assumption to achieve local object alignment. Then Victor used it to align the stack for C helper calls on x86-64. @gacholio explained to me that the assumption is true except on X: the stack on X is aligned to 2 slots before the return address is pushed.
So the stack has to be 16-byte aligned before the call instruction for both 32-bit and 64-bit on X.
Since the alignment requirements for helper calls and local objects are similar, we can fix this issue by excluding the return address from the calculation. I'll also fix the local object alignment.
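As a rough standalone sketch of one way to account for the return-address slot when computing prologue padding (illustrative only, not the actual linkage code), assuming the caller's SP is 16-byte aligned at the CALL so the callee starts out one return-address slot below a 16-byte boundary:

#include <cstdint>

// Illustrative: padding the prologue must add so SP lands back on an
// "alignment"-byte boundary at outgoing call sites / for aligned locals.
// The return-address slot pushed by the caller's CALL has to be accounted
// for along with the frame itself.
uint32_t alignmentPadding(uint32_t frameSize, uint32_t returnAddressSize, uint32_t alignment)
   {
   uint32_t occupied = frameSize + returnAddressSize;
   uint32_t remainder = occupied % alignment;
   return remainder == 0 ? 0 : alignment - remainder;
   }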
Note the stack is currently not 16-byte aligned on x86-32 (it's 8) when the interpreter calls compiled code. If you really want to call on the java stack, the interpreter/stackwalker/DDR will have to be fixed as well.
Probably also worth verifying that I'm not lying about the misalignment - compile a few methods with breakOnEntry and see if the ESP is 8-aligned or not when it breaks.
Actually, looking at the code, I think the stackwalker will accept any alignment, so it's just the transition code in the interpreter that would need to be changed (about 5 minutes work if I'm right).
Also found some alignment code when allocating/growing the stack (if the stack gets copied, the alignment of each frame must be maintained) and DLT (a slightly different JIT transition).
These are all of the places I believe would need changing (and I'd rather do this bit myself):
https://github.com/eclipse/openj9/blob/f4b39c5bb479decd612954b9d7a6055d73e13660/runtime/vm/BytecodeInterpreter.hpp#L790
https://github.com/eclipse/openj9/blob/f4b39c5bb479decd612954b9d7a6055d73e13660/runtime/vm/growstack.cpp#L138
https://github.com/eclipse/openj9/blob/f4b39c5bb479decd612954b9d7a6055d73e13660/runtime/vm/vmthread.c#L1380
https://github.com/eclipse/openj9/blob/f4b39c5bb479decd612954b9d7a6055d73e13660/runtime/codert_vm/dlt.c#L194
Before we commit to this solution, I'd like some comments from @andrewcraik. Running C code on the java stack is not a great idea - if the code is complex, it could blow the stack (and there will be no checking). If stack checking (the gcc stack protector) is enabled in the compiler, I suspect running on the non-C stack will explode on any function call (any large one at least, where large is an arbitrary number defined by whatever compiler version we happen to land on).
As this is already a cold path (I sure hope you're not calling C in any hot path), I would strongly suggest switching stacks instead.
@DanHeidinga
@r30shah/@liqunl we will need to prototype doing the stack swap. If the swap is too costly we should do a two-version fast/slow helper: we constrain the fast version like we do other helpers and then stack swap if we need to do a rearrange or similar. Since there are other more urgent issues this one may have to wait a day or so, but we should check the perf impact of always stack swapping. If it is negligible I am fine with just doing the swap and saving further complexity.
How do the direct C helper calls like fast_jitInstanceOf work on x86 without doing a stack swap? On Z we have two different registers for the Java stack pointer and the system stack pointer, so we don't have to do a stack swap, but before calling out either through the system linkage or the C helper linkage we store the Java stack pointer in its slot in the vmThread.
If you do decide we need to align the java stack, the VM code is here:
https://github.com/gacholio/openj9/tree/align
Currently aligning to 16 on all platforms.
Talked to @r30shah: on x86, we don't swap the stack for fast direct C calls. The fast calls include all JProfiling functions, jitInstanceOf, jitCheckAssignable, etc. These C functions won't GC, so they are fast-call only. Other platforms always swap the stack before calling C helpers, so it's an X-only issue. Will measure the overhead of swapping the stack for fast calls on X.
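For reference, a rough sketch of what the swap amounts to conceptually. The struct and function are hypothetical stand-ins (not the real J9VMThread layout), and in practice the SP switches are a few emitted instructions around the call rather than C++:

#include <cstdint>

// Hypothetical sketch of the swap-to-system-stack transition around a direct
// C helper call; field names are illustrative only.
struct ThreadSketch
   {
   uintptr_t javaSP;   // java stack pointer, published across the call
   uintptr_t systemSP; // ABI-aligned C/system stack to run the helper on
   };

void callHelperOnSystemStack(ThreadSketch *thread, uintptr_t currentJavaSP,
                             void (*helper)(uint32_t), uint32_t arg)
   {
   thread->javaSP = currentJavaSP; // 1. save the java SP into its vmThread slot
   // 2./3. emitted code would switch SP to thread->systemSP here and make the
   //       call on that properly aligned stack; this sketch just calls through
   helper(arg);
   // 4. emitted code reloads SP from the saved java SP before resuming compiled code
   }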
So there are two issues here: 1) get JProfiling working, and 2) look at the stack swap for helper calls on x86. The design of running the C code on the Java stack has been that way since the C helpers were introduced. While helpers are not generally on the 'fast path', they do show up on profiles, so changes to the call sequence need some care and performance measurement.
Based on discussions with @liqunl, the primary issue JProfiling is seeing is due to the unaligned stack on x86-32. I suggest we get that alignment sorted and we can then assess how to proceed with the stack swap, since that has broader performance implications and is not new (it has been this way for some time).
@andrewcraik The C helpers run on the C stack normally - only the ones called directly from the code cache are breaking the rules.
@gacholio I am not saying you are wrong. What I am saying is that the implementation for the JIT has always called them on the Java stack and has done so since inception (probably due to misunderstandings when the changes were made). I do understand your point and we do need to look at it, but it has much bigger implications for performance that we need to assess, since the current state of affairs has been sufficient for a number of releases (despite the incorrectness of the approach at a fundamental level). I'm suggesting we fix the alignment in one PR - that can enable JProfiling to make progress and we can work in parallel on the stack swap, which does need to be looked at.
Sounds like a ticking time bomb in case any modifications are made on the C side (or even changes to the gcc compiler or flags), as any stack accesses are going to cause very weird bugs.
Completely agree - we do need to address this.
I'm suggesting we fix the alignment in one PR - that can enable JProfiling to make progress and we can work in parallel on the stack swap which does need to be looked at.
To confirm, this is to allow the jprofiling implementation to progress but NOT to release the feature running on the java stack?
If it's released running on the java stack, it will be difficult to accept a performance delta after the fact. Better to resolve it upfront and get an accurate performance view than to have to try and walk it back later.
Correct - we need to get testing started to see if there are any other issues, and that can't be done without the stack alignment. We can work on the stack swap in parallel to ensure that isn't seen in the field. If we can't address the general stack swap in time, we can do it for JProfiling to ensure we don't see a regression when we do sort it.