The issue is a stack misalignment when storing an XMM register in C code that has been called directly from the code cache:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf7d4ab40 (LWP 11187)]
0xf65f1fc7 in TR_EmbeddedHashTable<unsigned int, 2u>::rearrange(TR_HashTableProfilerInfo<unsigned int>::HashFunction&) ()
from /bluebird/builds/bld_426736/sdk/xi3280/jre/lib/i386/default/libj9jit29.so
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.i686 libgcc-4.8.5-39.el7.i686
(gdb) bt
#0 0xf65f1fc7 in TR_EmbeddedHashTable<unsigned int, 2u>::rearrange(TR_HashTableProfilerInfo<unsigned int>::HashFunction&) ()
from /bluebird/builds/bld_426736/sdk/xi3280/jre/lib/i386/default/libj9jit29.so
#1 0xf65f3163 in TR_EmbeddedHashTable<unsigned int, 2u>::addKey(unsigned int) ()
from /bluebird/builds/bld_426736/sdk/xi3280/jre/lib/i386/default/libj9jit29.so
#2 0xf65f48aa in _jProfile32BitValue () from /bluebird/builds/bld_426736/sdk/xi3280/jre/lib/i386/default/libj9jit29.so
#3 0x5563bff5 in ?? ()
#4 0xf7409fa0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) frame 0
#0 0xf65f1fc7 in TR_EmbeddedHashTable<unsigned int, 2u>::rearrange(TR_HashTableProfilerInfo<unsigned int>::HashFunction&) ()
from /bluebird/builds/bld_426736/sdk/xi3280/jre/lib/i386/default/libj9jit29.so
(gdb) x/20i $pc
=> 0xf65f1fc7 <_ZN20TR_EmbeddedHashTableIjLj2EE9rearrangeERN24TR_HashTableProfilerInfoIjE12HashFunctionE+55>:
movaps %xmm0,0x30(%esp)
esp 0xd2f05644 <- not 16-byte aligned
(gdb) frame 3
#3 0x5563bff5 in ?? ()
(gdb) x/20i $pc-14
0x5563bfe7: mov %edx,%edi
0x5563bfe9: mov $0xd544dc08,%edx
0x5563bfee: mov %edi,%ecx
0x5563bff0: call 0xf65f48a0 <_jProfile32BitValue>
=> 0x5563bff5: mov %edi,%edx
0x5563bff7: jmp 0x5563a397
esp 0xd2f05734 <- not 16-byte aligned
The calling compiled method is:
+ (profiled very-hot) j9vm/test/hashTable/HashTableCorrectnessTest.addNumberOfStringsToTree(Ljava/util/TreeSet;I)V @ 55638824-5563C686 OrdinaryMethod 30.30% T Q_SZ=0 Q_SZI=0 QW=100 j9m=D316C2A0 bcsz=64 JPROF compThread=0 CpuLoad=200%(100%avg) JvmCpu=199%
I've verified that both of the C functions are maintaining correct stack alignment.
The issue appears to be that the incorrect linkage is being used to call the helpers, so the stack is not being aligned or switched back to the native stack.
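As a side note on the faulting instruction: movaps requires a 16-byte-aligned memory operand, so anything equivalent to the aligned-store intrinsic below faults the same way when handed an address that is off by 4. This is a minimal standalone illustration, not code from the JIT.

#include <xmmintrin.h>

// movaps is the aligned 128-bit store: it faults (seen as SIGSEGV on Linux)
// if dst is not 16-byte aligned, whereas the unaligned form
// (movups / _mm_storeu_ps) would succeed.
void storeAligned(float *dst, __m128 value)
   {
   _mm_store_ps(dst, value); // typically compiles to movaps
   }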
@liqunl @r30shah
@gacholio The linkage is good; the problem is that we don't align the stack for the C helper linkage.
thanks @liqunl - good catch.
@gacholio There is a problem with stack alignment on 32-bit in general; it is in https://github.com/eclipse/openj9/blob/master/runtime/compiler/x/codegen/X86PrivateLinkage.cpp#L90. Like you said, the test still fails after changing 4 to 16. I'm still investigating.
I'm pretty sure Victor got this working for the VM-provided helpers (like fast_jitNewObject). Maybe start looking there.
I think the problem is that the C++ function's stack is not aligned. The following instructions push 16 bytes onto the stack, then make the call. The stack was 16-byte aligned before calling TR_EmbeddedHashTable<unsigned int, 2u>::addKey(unsigned int); after the call at 0x01066beb the stack is not aligned. @gacholio Could it be related to the C compiler and the compiler flags we're using?
(kca) (0x01066beb-11)/10i
0x01066be0 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +0 55 push ebp
0x01066be1 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +1 89e5 mov ebp, esp
0x01066be3 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +3 57 push edi
0x01066be4 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +4 56 push esi
0x01066be5 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +5 53 push ebx
0x01066be6 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +6 e85193ebff call 0x0f1ff3c ^{libj9jit29.so}{__x86.get_pc_thunk.di} +0
0x01066beb {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +11 81c715349b00 add edi, 0x9b3415
0x01066bf1 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +17 83ec3c sub esp, 0x3c
0x01066bf4 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +20 8b7508 mov esi, dword ptr [ebp+8]
0x01066bf7 {libj9jit29.so}{_ZN20TR_EmbeddedHashTableIjLj2EE6addKeyEj} +23 65a114000000 mov eax, dword ptr gs:[0x14]
The C linkage requires that the stack be 16-byte aligned when running in the C stack frame (in the x86-32 case, this means ESP must be 16-byte aligned at the CALL instruction, so on entry to the callee ESP+4 is 16-byte aligned due to the pushed return address).
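A minimal standalone sketch of that invariant, with purely illustrative names and nothing taken from the codebase:

#include <cassert>
#include <cstdint>

// x86-32 SysV convention: ESP is 16-byte aligned at the CALL instruction, so
// after the 4-byte return address is pushed the callee sees ESP+4 aligned.
void checkCallSiteAlignment(uintptr_t espAtCall)
   {
   assert(espAtCall % 16 == 0);          // 16-byte aligned at the CALL
   uintptr_t espOnEntry = espAtCall - 4; // the CALL pushes a 4-byte return address
   assert((espOnEntry + 4) % 16 == 0);   // what the callee is entitled to assume
   }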
If you really don't want to switch stacks, there are really only two solutions:
1) Keep the java stack at a known alignment on entry to compiled methods (wastes some java stack)
2) Forcibly align the SP (means recording the original SP across the call) - see the sketch at the end of this comment
I believe the lowest overall impact to compiled code would be to switch the stack when calling the helpers. Again, what did Victor do for the existing ones?
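For option 2, the per-call-site fix boils down to something like the following. This is C++ pseudocode for what would really be a couple of emitted instructions; the function and slot are made up for illustration only.

#include <cstdint>

// Sketch only: round the SP down to a 16-byte boundary right before the
// helper CALL, keeping the original value somewhere (a register or scratch
// slot) so it can be restored after the helper returns.
uintptr_t forceAlignForHelperCall(uintptr_t javaSP, uintptr_t *savedSlot)
   {
   *savedSlot = javaSP;                          // record the original SP across the call
   return javaSP & ~static_cast<uintptr_t>(0xF); // 16-byte aligned at the CALL
   }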
Sorry, my understanding of C stack alignment was wrong. I thought it was the same as what we do in the prologue for the java stack, where the stack is N-aligned on entry to jitted code on x86; currently N is 4 bytes for 32-bit and 16 bytes for 64-bit. But this java stack alignment is also wrong: the stack should be 16-byte aligned before the call on X.
I was under the impression that the java stack is double-slot aligned on entry to jitted code called from the interpreter, and the stack alignment was implemented under this assumption to achieve local object alignment. Then Victor used it to align the stack for C helper calls on x86-64. @gacholio explained to me that the assumption is true except on X: the stack on X is aligned to 2 slots before the return address is pushed.
So the stack has to be 16-byte aligned before the call instruction for both 32-bit and 64-bit on X.
Since the alignment requirements for helper calls and local objects are similar, we can fix this issue by excluding the return address from the calculation. I'll also fix the local object alignment.
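As a rough standalone sketch of one way to account for the return-address slot when computing prologue padding (illustrative only, not the actual linkage code), assuming the caller's SP is 16-byte aligned at the CALL so the callee starts out one return-address slot below a 16-byte boundary:

#include <cstdint>

// Illustrative: padding the prologue must add so SP lands back on an
// "alignment"-byte boundary at outgoing call sites / for aligned locals.
// The return-address slot pushed by the caller's CALL has to be accounted
// for along with the frame itself.
uint32_t alignmentPadding(uint32_t frameSize, uint32_t returnAddressSize, uint32_t alignment)
   {
   uint32_t occupied = frameSize + returnAddressSize;
   uint32_t remainder = occupied % alignment;
   return remainder == 0 ? 0 : alignment - remainder;
   }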
Note the stack is currently not 16-byte aligned on x86-32 (it's 8) when the interpreter calls compiled code. If you really want to call on the java stack, the interpreter/stackwalker/DDR will have to be fixed as well.
Probably also worth verifying that I'm not lying about the misalignment - compile a few methods with breakOnEntry and see if the ESP is 8-aligned or not when it breaks.
Actually, looking at the code, I think the stackwalker will accept any alignment, so it's just the transition code in the interpreter that would need to be changed (about 5 minutes work if I'm right).
Also found some alignment code when allocating/growing the stack (if the stack gets copied, the alignment of each frame must be maintained) and DLT (a slightly different JIT transition).
These are all of the places I believe would need changing (and I'd rather do this bit myself):
https://github.com/eclipse/openj9/blob/f4b39c5bb479decd612954b9d7a6055d73e13660/runtime/vm/BytecodeInterpreter.hpp#L790
https://github.com/eclipse/openj9/blob/f4b39c5bb479decd612954b9d7a6055d73e13660/runtime/vm/growstack.cpp#L138
https://github.com/eclipse/openj9/blob/f4b39c5bb479decd612954b9d7a6055d73e13660/runtime/vm/vmthread.c#L1380
https://github.com/eclipse/openj9/blob/f4b39c5bb479decd612954b9d7a6055d73e13660/runtime/codert_vm/dlt.c#L194
Before we commit to this solution, I'd like some comments from @andrewcraik. Running C code on the java stack is not a great idea - if the code is complex, it could blow the stack (and there will be no checking). If stack checking (the gcc stack protector) is enabled in the compiler, I suspect running on the non-C stack will explode on any function call (any large one at least, where large is an arbitrary number defined by whatever compiler version we happen to land on).
As this is already a cold path (I sure hope you're not calling C in any hot path), I would strongly suggest switching stacks instead.
@DanHeidinga
@r30shah/@liqunl we will need to prototype doing the stack swap. If the swap is too costly we should do a two-version fast/slow helper: we constrain the fast version like we do other helpers and then stack swap if we need to do a rearrange or similar. Since there are other more urgent issues this one may have to wait a day or so, but we should check the perf impact of always stack swapping. If it is negligible I am fine with just doing the swap and saving further complexity.
How do the direct C helper calls like fast_jitInstanceOf work on x86 without doing a stack swap? On Z we have two different registers for the Java stack pointer and the system stack pointer, so we don't have to do a stack swap, but before calling out either through the system linkage or the C helper linkage we store the Java stack pointer in its slot in the vmThread.
If you do decide we need to align the java stack, the VM code is here:
https://github.com/gacholio/openj9/tree/align
Currently aligning to 16 on all platforms.
Talked to @r30shah: on x86, we don't swap the stack for fast direct C calls. The fast calls include all JProfiling functions, jitInstanceOf, jitCheckAssignable, etc. These C functions won't GC, so they are fast-call only. Other platforms always swap the stack before calling C helpers, so it's an X-only issue. Will measure the overhead of swapping the stack for fast calls on X.
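For reference, a rough sketch of what the swap amounts to conceptually. The struct and function are hypothetical stand-ins (not the real J9VMThread layout), and in practice the SP switches are a few emitted instructions around the call rather than C++:

#include <cstdint>

// Hypothetical sketch of the swap-to-system-stack transition around a direct
// C helper call; field names are illustrative only.
struct ThreadSketch
   {
   uintptr_t javaSP;   // java stack pointer, published across the call
   uintptr_t systemSP; // ABI-aligned C/system stack to run the helper on
   };

void callHelperOnSystemStack(ThreadSketch *thread, uintptr_t currentJavaSP,
                             void (*helper)(uint32_t), uint32_t arg)
   {
   thread->javaSP = currentJavaSP; // 1. save the java SP into its vmThread slot
   // 2./3. emitted code would switch SP to thread->systemSP here and make the
   //       call on that properly aligned stack; this sketch just calls through
   helper(arg);
   // 4. emitted code reloads SP from the saved java SP before resuming compiled code
   }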
So there are two issues here: 1) get JProfiling working, and 2) look at the stack swap for helper calls on x86. The design of running the C code on the Java stack has been that way since the C helpers were introduced. While helpers are not generally on the 'fast path', they do show up on profiles, so changes to the call sequence need some care and performance measurement.
Based on discussions with @liqunl, the primary issue JProfiling is seeing is due to the unaligned stack on x86-32. I suggest we get that alignment sorted and we can then assess how to proceed with the stack swap, since that has broader performance implications and is not new (it has been this way for some time).
@andrewcraik The C helpers run on the C stack normally - only the ones called directly from the code cache are breaking the rules.
@gacholio I am not saying you are wrong. What I am saying is that the implementation for the JIT has always called them on the Java stack and has done so since inception (probably due to misunderstandings when the changes were made). I do understand your point and we do need to look at it, but it has much bigger implications for performance that we need to assess, since the current state of affairs has been sufficient for a number of releases (despite the incorrectness of the approach at a fundamental level). I'm suggesting we fix the alignment in one PR - that can enable JProfiling to make progress and we can work in parallel on the stack swap, which does need to be looked at.
Sounds like a ticking time bomb in case any modifications are made on the C side (or even changes to the gcc compiler or flags), as any stack accesses are going to cause very weird bugs.
Completely agree - we do need to address this.
I'm suggesting we fix the alignment in one PR - that can enable JProfiling to make progress and we can work in parallel on the stack swap which does need to be looked at.
To confirm, this is to allow the jprofiling implementation to progress but NOT to release the feature running on the java stack?
If it's released running on the java stack, it will be difficult to accept a performance delta after the fact. Better to resolve it upfront and get an accurate performance view than to have to try and walk it back later.
Correct - we need to get testing started to see if there are any other issues, and that can't be done without the stack alignment. We can work on the stack swap in parallel to ensure that isn't seen in the field. If we can't address the general stack swap in time, we can do it for JProfiling to ensure we don't see a regression when we do sort it.