Omr: Power and Z code generators incorrectly spills sub-32bit integer arguments on big-endian platforms

Created on 29 Jan 2019 · 13Comments · Source: eclipse/omr

When spilling arguments that come in registers onto the stack, the Z code generator incorrectly handles sub-32bit integer arguments.

For example, the following Trees:

(method return=Int32 args=[Int8]
  (block
    (ireturn
      (b2i
        (bload parm=0) ) ) ) )

result in the following generated code (GRA is disabled to unsure argument registers get spilled):

1:  stg %r15,120(%r15)
2:  lay %r15,-160(%r15)
3:  st  %r2,176(%r15)
4:  lb  %r2,176(%r15)
5:  lbr %r2,%r2
6:  lgfr    %r2,%r2
7:  lg  %r15,280(%r15)
8:  br  %r14

Note that the st instruction (line 3) does a word (32-bit) store of r2 to 176(%r15). On the next line, lb is used to load a byte from the same address back into r2 (the inefficiency here is due to GRA having been disabled). Because Z is big endian, instead of loading the 8-bit value passed as an argument, this instruction sequence loads the high 8-bits of the argument register.

176(%r15)
   |
   +-------------stored--------------+
   v                                 v
    31    24                 7      0
   +--------+---------------+--------+
   |xxxxxxxx|xxxxxxxxxxxxxxx|  data  | <- argument passed in r2
   +--------+---------------+--------+
   ^        ^
   +-loaded-+

power z backlog bug compiler help wanted

Source

Leonardo2718

Most helpful comment

Wouldn't it make more sense to just spill the whole register regardless of the data size currently being held?

I believe this is exactly what is happening. The Linux ABI in this case mandates that incoming parameters be at least 32-bits wide. Our internal stack slots are also 32-bits, so we spill the 32-bit register containing a 16-bit value onto the stack. However the problem is when we load from the stack we are not checking the width of the load and compensating for that in the big endian memory model so we load the wrong 16-bits in this case, i.e. the 16 most significant bits of the 32-bit spill slot instead of the 16 least significant bits. So the issue is with the sload in the comment preceding.

The solution to me seems what @andrewcraik described basically. We should:

Spill the full 32-bit incoming parameter (including any "dirty bits")
On stack loads load the entire 32-bit value (including any "dirty bits")
- We should not attempt to load sub-32-bit values from the stack since all our stack slots are at least 32-bits wide

fjeremic on 7 May 2019

👍2

All 13 comments

This issue was discovered in #3499 by changing the Tril backend from JitBuilder to the Test Compiler. The former runs GRA while the latter does not. Running GRA hides this issue by causing arguments to be loaded directly from the argument registers without spilling and reloading.

Leonardo2718 on 29 Jan 2019

@Leonardo2718 what does Power BE do? Looking at their bloadEvaluator [1] they seem to do the same thing. We've also had a very similar discussion in the past on this in [2]. See also the corresponding discussions in #1901. As things currently stand in the Z, and presumably P code generators it is not valid to load 8 bits from a 32 bit stack slot.

This comes back to the whole debate on whether the IL should be ABI centric or ABI agnostic. On little endian platforms things "just work" but on big endian we would need that above tree sequence to look different:

(method return=Int32 args=[Int32]
  (block
    (ireturn
        (iload parm=0) ) ) ) )

There is no such thing as an Int8 argument on Linux on Z ABI. All sub 32-bit types get sign/zero extended to 32-bits. See [3] for reference.

I'm not sure how we would tackle this problem. Subscribing @joransiu, @ymanton, @mstoodle for thoughts. Also @dchopra001 FYI as we spoke about this today.

[1] https://github.com/eclipse/omr/blob/97b275f64bc264273790fc3b292da34d2615337a/compiler/p/codegen/OMRTreeEvaluator.cpp#L774-L798
[2] https://github.com/eclipse/omr/issues/1943#issuecomment-342210193
[3] https://github.com/eclipse/omr/issues/1943#issuecomment-344375862

fjeremic on 30 Jan 2019

👍2

Presumably we should be able to identify that the symbol stored in the stack slot is an Int8, even though the stack slot itself would be 32-bits. We should then be able to generate a memory reference in the bloadEvaluator which references the correct offset into the slot. Just a thought. Not sure what kinds of things this may break or what other issues may arise as a result.

fjeremic on 30 Jan 2019

@fjeremic

I've only seen failures on Power LE but those are because of slightly different reasons (still need to open an issue about them). I believe AIX is the only OS that supports Power BE so I haven't seen any failures there because Tril doesn't run on AIX (yet).

Presumably we should be able to identify that the symbol stored in the stack slot is an Int8, even though the stack slot itself would be 32-bits. We should then be able to generate a memory reference in the bloadEvaluator which references the correct offset into the slot. Just a thought. Not sure what kinds of things this may break or what other issues may arise as a result.

I thought about something like this as well. I think it would but the approach feels error prone (that could just be me).

An alternative (assuming instructions exist for doing so) would be check whether the argument is 8 bits wide and just spill the register using a byte-store if it is. I believe that logic similar to this already exists for using different instructions for 32-bit and 64-bit arguments. No offset/address re-calculation would then be required. However, this does not solve the problem of deciding whether to sign- or zero- extend after the load.

Leonardo2718 on 30 Jan 2019

However, this does not solve the problem of deciding whether to sign- or zero- extend after the load.

Right, that would require knowing the signedness of the parameter. Do we even have UInt8 as a parameter type?

fjeremic on 30 Jan 2019

Right, that would require knowing the signedness of the parameter. Do we even have UInt8 as a parameter type?

Not as far as I know.

While looking into #3535, it occurred to me that we could unconditionally sign/zero-extending a value when it is used in an operation that cares. For example, the evaluator for bshr would unconditionally generate a sign-extension instruction before the instruction for doing the right-shift. A bushr would generate a zero-extension instead. In some cases, the additional instructions would be redundant but I think a peephole might be able to eliminate such cases.

Leonardo2718 on 31 Jan 2019

@yanluo7 Recently introduced some tril tests for *ternaryIL that also exposed this problem.

The symptoms are almost identical to what @Leonardo2718 describes in https://github.com/eclipse/omr/issues/3525#issue-404491302 (i.e. we spill the arguments as ints onto the stack, and then load back as shorts).

The tril trees in this case look like this:
https://github.com/eclipse/omr/blob/285d29efc21e7c6b3d3f03a2f887a5c707b7e9cd/fvtest/compilertriltest/TernaryTest.cpp#L390-L403

Similar to the original problem, this issue also disappears when we enable GRA since we don't end up spilling the arguments to the stack and load them directly from the argument registers.

@andrewcraik FYI

dchopra001 on 3 May 2019

Hmm interesting - I am wondering why the register spill is concerned about the size of thing the register contains? It seems to me people could be doing fancy things with a value and we shouldn't be guessing about what that value is - we just do what the program says in terms of widening, truncating, sign extending and zero extending. The issue, at least to me, seems to be that the register spilling code is trying to be smart and only spill the active bytes in a register rather than spilling 'the register' including the inactive bytes. Wouldn't it make more sense to just spill the whole register regardless of the data size currently being held? The high bytes may contain garbage or they may not but spilling and restoring the whole thing all the time would save this kind of concern and, I would have thought, limited performance implications. The advantage being that we don't then have to worry about extensions or anything associated with restores - you just bring the whole register back and let the program continue using it.

andrewcraik on 6 May 2019

Wouldn't it make more sense to just spill the whole register regardless of the data size currently being held?

The solution to me seems what @andrewcraik described basically. We should:

Spill the full 32-bit incoming parameter (including any "dirty bits")
On stack loads load the entire 32-bit value (including any "dirty bits")
- We should not attempt to load sub-32-bit values from the stack since all our stack slots are at least 32-bits wide

fjeremic on 7 May 2019

👍2

I think all the codegens should check this - essentially a save and restore for a spilled register should always spill the 'whole' register (including any "dirty bits" as @fjeremic put it). The codegen may make implicit assumptions about the "dirty bits" and for spilling to be transparent we must always restore the full register to preserve these implicit assumptions made based on the ISA.

andrewcraik on 7 May 2019

FYI @gita-omr and @0xdaryl

andrewcraik on 7 May 2019

@0xdaryl please add arch:p label to this.

gita-omr on 12 May 2019

I think I've addressed this issue over in #4733 via d6fe683. Rather than loading dirty bits as proposed in the previous comment (https://github.com/eclipse/omr/issues/3525#issuecomment-490101982) we instead spill the entire GPR on entry if needed, and adjust the parameter offsets to point to the correct data.

So for example if we have a C function calling a JitBuilder function char foo(char x) then when we compile foo we spill the entire GPR in the prologue of necessary, since our stack slots are 64-bits wide this is fine. Any use of the parameter inside the method will have to be a bload to be type-correct so all we do is inside of the linkage when we initialize the parameter offsets we add +7 in this case.

I've checked this fixes most of the tests we currently have disabled due to this issue. I'll have to dig deeper and validate that we do indeed fix all of them.

fjeremic on 24 Jan 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Investigate leveraging cpu_features project in port library

0xdaryl · 5Comments

Should byte-reversed load/store opcodes be moved to OMR?

aviansie-ben · 3Comments

Proposal to shorten Tril commoned node identifiers

0xdaryl · 3Comments

Investigate why scalarizing of arraycopies is disabled by default

fjeremic · 3Comments

Apply consistent use of OS macros in compiler

0xdaryl · 6Comments