Runtime: RyuJit: avoid conditional jumps using cmov and similar instructions

Created on 2 Oct 2016  ·  52 Comments  ·  Source: dotnet/runtime

Conditional jumps, especially those that are hard to predict, are fairly expensive, so they should be avoided if possible. One way to avoid them is to use conditional moves and similar instructions (like sete). As far as I can tell, RyuJIT never does this and I think it should.

For example, take these two methods:

``` c#
[MethodImpl(MethodImplOptions.NoInlining)]
static long sete_or_mov(bool cond) {
return cond ? 4 : 0;
}

[MethodImpl(MethodImplOptions.NoInlining)]
static long cmov(long longValue) {
long tmp1 = longValue & 0x00000000ffffffff;
return tmp1 == 0 ? longValue : tmp1;
}
```

For both of them, RyuJIT generates a conditional jump:

``` asm
; Assembly listing for method Program:sete_or_mov(bool):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,   3  )    bool  ->  rcx
;  V01 tmp0         [V01,T01] (  3,   2  )     int  ->  rax
;# V02 OutArgs      [V02    ] (  1,   1  )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0

G_M60330_IG01:

G_M60330_IG02:
       84C9                 test     cl, cl
       7504                 jne      SHORT G_M60330_IG03
       33C0                 xor      eax, eax
       EB05                 jmp      SHORT G_M60330_IG04

G_M60330_IG03:
       B804000000           mov      eax, 4

G_M60330_IG04:
       4863C0               movsxd   rax, eax

G_M60330_IG05:
       C3                   ret

; Total bytes of code 17, prolog size 0 for method Program:sete_or_mov(bool):long
; ============================================================
; Assembly listing for method Program:cmov(long):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,   3.5)    long  ->  rcx
;  V01 loc0         [V01,T01] (  3,   2.5)    long  ->  rax
;# V02 OutArgs      [V02    ] (  1,   1  )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0

G_M53075_IG01:

G_M53075_IG02:
       B8FFFFFFFF           mov      eax, 0xFFFFFFFF
       4823C1               and      rax, rcx
       4885C0               test     rax, rax
       7401                 je       SHORT G_M53075_IG04

G_M53075_IG03:
       C3                   ret

G_M53075_IG04:
       488BC1               mov      rax, rcx

G_M53075_IG05:
       C3                   ret

; Total bytes of code 18, prolog size 0 for method Program:cmov(long):long
; ============================================================
```

For comparison, here are the same methods compiled with GCC and Clang at -O1 (via Compiler Explorer):

GCC 6.2:

``` asm
sete_or_mov(bool):
        test    dil, dil
        setne   al
        movzx   eax, al
        sal     rax, 2
        ret
cmov(unsigned long):
        mov     eax, edi
        test    rax, rax
        cmove   rax, rdi
        ret
```

Clang 3.9.0:

``` asm
sete_or_mov(bool):                       # @sete_or_mov(bool)
        movzx   eax, dil
        shl     rax, 2
        ret

cmov(unsigned long):                               # @cmov(unsigned long)
        mov     eax, edi
        mov     ecx, 4294967295
        and     rcx, rdi
        cmove   rax, rdi
        ret
```

category:cq
theme:basic-cq
skill-level:expert
cost:large

area-CodeGen-coreclr enhancement optimization tenet-performance


All 52 comments

There is code for it I think but @jamesqo noticed it was never getting executed in https://github.com/dotnet/coreclr/pull/6647#issuecomment-238382597

Function called CodeGen::genCodeForQmarkWithCMOV

@benaadams That code is in codegenlegacy.cpp, which, as far as I can tell, is the pre-RyuJIT codegen and does not even support x64. So I'm not sure how relevant that is for RyuJIT on x64.

Might be why it's not called then 😆

@benaadams I believe I tested it with x86 builds, without the x86 RyuJIT (which is disabled by default) enabled. Feel free to check for yourself, though, in case I messed something up.

There does seem to be a lot of stuff for cmov in the code; I have never seen it output, though.

We used to generate CMOV/SETCC from our internal QMark operator; however, we ended up expanding QMark into short-circuit flow after the importer to simplify handling in downstream phases. As I recall, the plan was to enable a more general if-conversion phase to identify simple cmov/setcc patterns before lowering to reconstitute them. /cc @dotnet/jit-contrib @russellhadley

I believe JIT32 (used today on x86) does generate at least SETCC, if not CMOV.

Regardless of the current state, I think we all agree with the previous comments that both are pretty powerful features of the IA architecture and worth generating. They often result in smaller and faster code without the possibility of expensive mis-prediction penalties.

I agree, it's pretty important to handle both of these cases. In terms of priority @svick where is this coming from? Is it in a particular workload or benchmark? What's the opportunity here?

@russellhadley I noticed this when I tried to modify this code in Kestrel to use cmov and failed.

But that was just "Hey, this code could be faster! No? Too bad.", I don't know if it would actually have measurable impact on the overall performance of Kestrel. So I don't think you should prioritize this based on me.

@svick It's still Kestrel and it's in one of the methods we want to get inlined into the MemoryAllocatorIterator::Seek routine and optimized - so it had some priority. Thanks for the heads up. :)

@svick, that code is probably faster with bsf/tzcnt (count trailing zeroes) and lshl (logical shift left) instead of cmov/conditional ifs. Whether the JIT knows to generate them is another question.

@choikwa but ternary operators, for example, are probably a much more common pattern across the ecosystem as a first step to improve; and likely simpler to recognise, since it's a language construct.

Have an improvement for that particular section of code in Kestrel in a PR https://github.com/aspnet/KestrelHttpServer/pull/1138

@cmckinsey @russellhadley Anyone working on this? FYI I'm trying to add support for CMOV to RyuJIT's codegen for a different issue. Maybe I can take a look at this one after that.

AFAIK nobody is looking into this at the moment. Feel free to dive in :)

@mikedn I think @benaadams has provided a branchless version of the code that motivated this in Kestrel, but getting to the point where we can generate CMOVs is still interesting. @sivarv has been working on some ideas around how to improve this kind of instruction selection. It would be good to make sure this effort dovetails.

FWIW I've made a prototype that survives diffing and testing. It detects simple half/full hammocks where the true/false BBs contain a single assignment (or a return) that involves only local variables and constants. It only works with i4 and i8 but with some care we might be able to handle small integer types.
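
For illustration (my sketch, not code from the prototype), the source shapes that correspond to such half and full hammocks look roughly like this, assuming the locals are plain `int`/`long`:

```c#
// Half hammock: only a "true" block, containing a single assignment of a
// local (or constant) to a local.
static int HalfHammock(bool c, int x, int y)
{
    if (c)
        x = y;
    return x;
}

// Full hammock: both a "true" and a "false" block, each with a single
// assignment to the same local.
static int FullHammock(bool c, int y, int z)
{
    int x;
    if (c)
        x = y;
    else
        x = z;
    return x;
}
```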

I'll probably try to extend this to indirs though it's problematic due to side-effects. A quick and dirty experiment that survives diffing could probably show if there are enough opportunities for this to be useful.

The conversion is currently done when lowering GT_JTRUE. It could be easily moved to a separate phase before lowering but I'm wondering if this shouldn't be done much earlier. After conversion we have fewer basic blocks and we may be able to eliminate some variables as well. And we may miss other optimizations if we do this conversion late - for example dotnet/runtime#6748 requires CSE.

FX diff shows a 2k reduction in code size but it could be better. I had to turn off an and-cmp-to-test optimization that caused a lot of trouble, and that results in some code size regressions. Also, using cmov can result in pretty bad code if variables are not already in registers. Something like:

mov eax, [ebp-8]
mov edx, 1
cmov eax, edx
mov [ebp-8], eax

Codegen for GT_SELECT should probably detect such cases and fall back to a jcc and mov [ebp-8], 1.

The 2 examples from the initial post generate:

G_M65174_IG02:
       84D2                 test     dl, dl
       0F95C0               setne    al
       0FB6C0               movzx    rax, al
       C1E002               shl      eax, 2
       4863C0               movsxd   rax, eax
G_M65174_IG03:
       C3                   ret

G_M65174_IG02:
       B8FFFFFFFF           mov      eax, 0xFFFFFFFF
       4823C2               and      rax, rdx
       4885C0               test     rax, rax
       480F44C2             cmove    rax, rdx
G_M65174_IG03:
       C3                   ret

The code is similar to what the native compilers generate, except for a couple of warts. The first example has a widening cast that can be easily removed. The second example has a useless test that could be removed as well.

It's worth noting that Clang assumes that bools are 0/1 and generates better code for the first example. I don't think we can do that, CLR bools are 0/non-zero.
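
To illustrate why (my example, not from the original comment): a CLR `bool` read from memory can carry any non-zero byte, so the Clang-style "shift the bool" trick would compute the wrong value for representations other than 1. This sketch assumes `System.Runtime.CompilerServices.Unsafe` is available:

```c#
using System;
using System.Runtime.CompilerServices;

class BoolRepresentation
{
    static void Main()
    {
        byte raw = 2;                              // a perfectly legal "true" in the CLR
        bool b = Unsafe.As<byte, bool>(ref raw);   // reinterpret the byte as bool

        Console.WriteLine(b);                      // True
        Console.WriteLine(b ? 4L : 0L);            // 4
        Console.WriteLine((long)raw << 2);         // 8 - what a naive shl of the raw value would produce
    }
}
```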

The Math.Max example from dotnet/runtime#4326 generates:

G_M65175_IG02:
       8B4208               mov      eax, dword ptr [rdx+8]
       83F82A               cmp      eax, 42
       BA2A000000           mov      edx, 42
       0F4CC2               cmovl    eax, edx

so we get rid of a useless jump by accident (those kinds of jumps can probably appear in other situations that aren't handled by if-conversion).

The bool conversion example from dotnet/runtime#4399 isn't currently detected but a variant of it works:

```c#
for (int i = 0; i < N; i++)
    sum = (bits[i] ? 1 : 0) + sum;
```

This currently generates:

```asm
   470FB6440810         movzx    r8, byte  ptr [r8+r9+16]
   4585C0               test     r8d, r8d
   410F95C0             setne    r8b
   450FB6C0             movzx    r8, r8b
   4103C0               add      eax, r8d
```

The reason why the original code (`sum += bits[i] ? 1 : 0`) isn't handled is that the IL stack is not empty at the entry of the true/false blocks, and as a result an additional assignment is added to those blocks.

With some additional work this could probably turn into:

```asm
   movzx    r8, byte  ptr [r8+r9+16]
   neg      r8d
   adc      eax, 0
```

The nice thing about if-conversion is that once you generate a GT_SELECT you can look around and detect all kinds of patterns relatively easily.

I'll try to do some perf measurements but the perf characteristics of converting branches are probably well known to anyone familiar with the subject. If the branch isn't predictable then you get good performance (I've seen numbers in the range 2-5x but I have to double check). If the branch is predictable then performance may regress slightly. Maybe such conversions should be driven by profile data but it seems that all native compilers generate cmov by default.

As a simple perf test let's sum all the positive numbers from an array:

```c#
int sum = 0;
for (long i = 0; i < nums.Length; i++)
    sum += (nums[i] > 0 ? nums[i] : 0);
return sum;
```

As mentioned above, use of += blocks if-conversion so we don't get CMOV:

G_M65174_IG03:
       493BC8               cmp      rcx, r8
       731F                 jae      SHORT G_M65174_IG06
       448B4C8A10           mov      r9d, dword ptr [rdx+4*rcx+16]
       4585C9               test     r9d, r9d
       7F05                 jg       SHORT G_M65174_IG04
       4533C9               xor      r9d, r9d
       EB00                 jmp      SHORT G_M65174_IG04

G_M65174_IG04:
       4103C1               add      eax, r9d
       48FFC1               inc      rcx
       4C3BC1               cmp      r8, rcx
       7FE1                 jg       SHORT G_M65174_IG03

With 1,000,000 numbers this runs in 630us if all numbers are positive and 4300us if we have a random mix of positive and negative numbers. Poor branch prediction takes its toll.

If we change the sum expression so that if-conversion kicks in we get CMOV:

G_M65174_IG03:
       493BC8               cmp      rcx, r8
       731F                 jae      SHORT G_M65174_IG05
       448B4C8A10           mov      r9d, dword ptr [rdx+4*rcx+16]
       4533D2               xor      r10d, r10d
       4585C9               test     r9d, r9d
       450F4ECA             cmovle   r9d, r10d
       4103C1               add      eax, r9d
       48FFC1               inc      rcx
       4C3BC1               cmp      r8, rcx
       7FE1                 jg       SHORT G_M65174_IG03

Now we always get 720us no matter what the numbers are. We're slightly slower in the "all positive" case and 6 (yes, six!) times faster in the random case.

Should something be done about the slowdown we see in the "all positive" case? I don't know, as a developer I would use the following code if the numbers are known to be mostly positive:

```c#
for (long i = 0; i < nums.Length; i++)
    if (nums[i] > 0)
        sum += nums[i];
```

If-conversion doesn't kick in and this runs in 490us if all numbers are positive, that's faster than all of the above.
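
For reference, here is a minimal sketch of how such numbers can be reproduced (my assumption of the setup, not the author's actual harness): time the loop with a Stopwatch over an all-positive array and a randomly mixed array.

```c#
using System;
using System.Diagnostics;

class Bench
{
    static int SumPositive(int[] nums)
    {
        int sum = 0;
        for (long i = 0; i < nums.Length; i++)
            sum += (nums[i] > 0 ? nums[i] : 0);
        return sum;
    }

    static void Main()
    {
        var rng = new Random(42);
        var allPositive = new int[1_000_000];
        var mixed = new int[1_000_000];
        for (int i = 0; i < allPositive.Length; i++)
        {
            allPositive[i] = rng.Next(1, 100);
            mixed[i] = rng.Next(-100, 100);
        }

        foreach (var (name, data) in new[] { ("all positive", allPositive), ("mixed", mixed) })
        {
            SumPositive(data);                     // warm-up so the method is jitted
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100; i++)
                SumPositive(data);
            sw.Stop();
            // 100 iterations -> microseconds per iteration = total ms * 10
            Console.WriteLine($"{name}: {sw.Elapsed.TotalMilliseconds * 10:F0} us per iteration");
        }
    }
}
```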

I know. The prototype I have does generate CMOV in the MinTernary method but it doesn't yet work in the Ternary method because the result is assigned to an array rather than a local variable.

@mikedn For your prototype did you just try and re-enable the qmark support? (not expand them early into flow)

@russellhadley Nope, I'm generating GT_SELCC nodes during GT_JTRUE lowering. GT_SELCC is a new operator I've added to represent CMOV. It's a binary operator that expects the condition flags to be set appropriately.

@mikedn Does it take 2 full trees to be evaluated or is it restricted to taking lclVars?

@russellhadley lclVars and constants. It recognizes:
```C#
if (whatever) { lclX = lclY (or const); }
// generates STORE_LCL(lclX, SELCC(lclY, lclX))

if (whatever) { lclX = lclY (or const); } else { lclX = lclZ (or const); }
// generates STORE_LCL(lclX, SELCC(lclY, lclZ))

if (whatever) { return lclY (or const); } else { return lclZ (or const); }
// generates RETURN(SELCC(lclY, lclZ))
```
It should be possible to handle more patterns but there are all kinds of issues:

  • In the first case lclX could also be an indir if we know that the address is not null. CMOV faults even if the condition is false (see the sketch after this list).
  • In the second example lclX could be an indir but we have to ensure that both indirs have the same address. I suppose it should be possible to an extent but I haven't looked into this.
  • In all cases the RHS could probably be any side effect free tree but that's problematic from a performance point of view since it is evaluated even if the condition is false. Maybe we could allow one additional simple operation (addition, bitwise ops etc.)
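
For the first bullet, a minimal sketch (mine, not from the prototype) of why the indir case needs the non-null guarantee:

```c#
// If-converting this half hammock would turn it into an unconditional
// "*p = c ? y : *p", i.e. *p is read and written even when c is false.
// That is only legal if p is known to be non-null; likewise, a CMOV with a
// memory operand accesses that memory regardless of the condition.
static unsafe void ConditionalStore(bool c, int* p, int y)
{
    if (c)
        *p = y;   // as written, *p is only touched when c is true
}
```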

I think (haven't looked at it too much) that GT_QMARK allows more or less arbitrary trees and that makes it problematic, it looks like some kind of control flow embedded in an expression tree. GT_SELCC looks like any other binary operator except that it has a hidden dependency on condition flags.

I mixed up my code with a few other experiments, I'll try to clean it up a bit tomorrow and push it to my GitHub fork for reference. Though at 500 lines of code (including empty lines) it may not be a pretty sight. Needs reorganization.

Thinking about it, given that it is complicated to catch every CMOV opportunity, why not tackle only the 'easy' cases (where the performance improvement is certain) and also provide an intrinsic version in case you need it in any of the dubious cases.

@russellhadley My experimental branch is here: https://github.com/mikedn/coreclr/commits/if-convert2

@redknightlois I'm not sure how an intrinsic would help. Once it works for local variables it's up to you to ensure that the code has the right pattern to make use of CMOV. Or create a method like T Select<T>(bool condition, T trueValue, T falseValue) => condition ? trueValue : falseValue;. The moment you write Select(c, a[i], b[i]) you accept that both a[i] and b[i] will be evaluated.
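
For reference, a minimal sketch of the helper being described (hypothetical; nothing in the runtime treats it specially today):

```c#
// By the time Select is entered, both trueValue and falseValue have already been
// evaluated, so the body is a pure value selection that a JIT could lower to CMOV.
static T Select<T>(bool condition, T trueValue, T falseValue)
    => condition ? trueValue : falseValue;

// Writing Select(c, a[i], b[i]) makes it explicit that both array loads
// (including their range checks and any exceptions) happen unconditionally.
static int Sample(bool c, int[] a, int[] b, int i)
    => Select(c, a[i], b[i]);
```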

@mikedn Looks like the experimental branch was deleted. Do you have a draft somewhere?

@sdmaclea See the "If conversion" commit in this branch https://github.com/mikedn/coreclr/commits/relop-lower-2

@mikedn, as for providing an intrinsic (a bit late, I know), I didn't think the runtime had a rule stating that a variable that is passed directly to a method (without being stored to a local in the interim) has to be evaluated (that's just the way it happens to work today, since we don't expose very many, if any, "intrinsics" like this).

That is, it should be entirely possible for the runtime to expose a method in Corlib that has a special contract where Select(c, a[i], b[i]) will only evaluate a[i] or b[i] based on the result of c. The method could be exposed in the new System.Runtime.Intrinsics namespace (or maybe System.Runtime.CompilerServices.Unsafe) and would be considered a "special" use-case for power-users (just like the rest of S.R.Intrinsics or S.R.CS.Unsafe).

That being said, just ensuring we recognize c ? a[i] : b[i] is probably good enough. The downside being, that there is no guarantee it will be transformed, depending on various other constraints the JIT might be under (this has been brought up in various different threads, including the one where I requested we document recognized patterns 😉).

@tannergooding I don't quite understand what you are saying. In particular, the "Select(c, a[i], b[i]) will only evaluate a[i] or b[i] based on the result of c" makes no sense. The whole point of adding the Select method I suggested is to allow people to clearly state that they are OK with both a[i] and b[i] being evaluated. The "special contract" you suggest seems to do exactly the opposite (and sounds like a C/C++ macro abomination).

That being said, just ensuring we recognize c ? a[i] : b[i] is probably good enough

It will never recognize that. It's simply impossible to express the semantics of such an expression using CMOV.

It will never recognize that. It's simply impossible to express the semantics of such an expression using CMOV.

C/C++ recognizes it:

int Select(bool condition, int trueValue, int falseValue)
{
    return condition ? trueValue : falseValue;
}

is transformed to:

test    cl,  cl
cmovne  r8d, edx
mov     eax, r8d
ret     0

Realistically, there is no reason the JIT shouldn't recognize that pattern as well.

The "special contract" you suggest seems to do exactly the opposite (and sounds like a C/C++ macro abomination).

Not really, even just using regular C/C++ methods, the inliner recognizes we are just indexing into an array (which for non-volatile locals/parameters should be non side-effecting) and does the right thing:

int Select(bool condition, int trueValue, int falseValue)
{
    return condition ? trueValue : falseValue;
}

int MyMethod(bool condition, int* a, int* b)
{
    return Select(condition, a[5], b[6]);
}

transforms to:

add     rdx, 20
lea     rax, qword ptr [r8 + 24]
test    cl,  cl
cmovne  rax, rdx
mov     eax, dword ptr [rax]
ret     0

So, as you can see, it is a cross-method call that, once inlined, defers the array access until after condition has been evaluated.

I see no reason why the JIT could not also recognize and optimize this appropriately, at the very least for something like accessing a local or an array index... Properties and methods are more questionable since they can be potentially side-effecting.

Having a method of the form Select(bool, T, T), where T can fit into a register, and having the JIT guarantee that it will inline and be transformed to a cmov (or equivalent) if the platform supports it, does not seem unreasonable to me.

Directly doing return condition ? a[5] : b[6] (or even return condition ? a[i] : b[i]) also works and is transformed to the equivalent of the MyMethod output.

I did test the various outputs on MSVC++ and Clang

Does it work when condition is false and a is nullptr?

Does it work when condition is false and a is nullptr?

@briansull, The code it generates for MyMethod is consistent regardless of what a, b, or c are.

However, when MyMethod(false, null, b) is called, it is able to statically determine that c is always false and will only ever return b. So it will inline that appropriately.

So, for example:

int MyMethod2(int* b)
{
    return MyMethod(false, nullptr, b);
}

will transform to:

mov eax, dword ptr [rcx+24]
ret 0

and that itself can be inlined to just mov <register>, dword ptr [<b>+24]

C++ semantics allow suppressing faults in expressions whose results are otherwise unused. MSIL semantics do not.

C/C++ recognizes it:

So does my experimental if-conversion implementation. But that's a different expression than the one you used in your previous post.

int MyMethod(bool condition, int* a, int* b)

That's not really the same as what you were suggesting. It's easier to observe if you look at the callsite - it would be MyMethod(c, &a[i], &b[i]), not MyMethod(c, a[i], b[i]). Besides, examples coming from the C++ world are not very relevant. In C/C++ there are no array range checks. In C/C++ dereferencing a null pointer results in undefined behavior, not in a NullReferenceException. In C/C++ the expression evaluation order is not well defined. And so on.

Having a method of the form Select(bool, T, T), where T can fit into a register, and having the JIT guarantee that it will inline and be transformed to a cmov (or equivalent) if the platform supports it, does not seem unreasonable to me.

Who said it is unreasonable? You jumped in and suggested something that defies the standard language behavior and is not really needed. The use of a Select method to allow the developer to deal with side effects in an explicit manner was suggested earlier by me.

Directly doing return condition ? a[5] : b[6] (or even return condition ? a[i] : b[i]) also works and is transformed to the equivalent of the MyMethod output.

Again, we're talking about C++, a very different language. In C# that transform can only be done if a and b are known to be non-null and i is a valid index for both arrays. Probably doable but these conditions severely limit the scope.

I was actually considering doing something like this for field access. c ? o.x : o.y can always be if-converted, even if o is not known to be non-null.

You jumped in and suggested something that defies the standard language behavior and is not really needed.

I would say that it is needed, in order to produce more performant code, for the cases it is necessary (or desired).

The JIT has its own rules for safety (null checks, range checks, etc). There are certainly ways around this (such as using unsafe code), but the JIT already has trouble generating optimal code in a timely manner and some of the analysis will be impossible to do in the timeframe allotted. Doing things that are "out of the ordinary" probably won't help with generating optimal code.

The general response to this is to use AOT (which isn't always available) or write it in native code (which has its own overhead for p/invoke, or can have other issues if you are writing pure native code). Adding new IL instructions is also not very "viable" as it sets a new bar, breaks back-compat, requires compiler support to generate the code, etc.

So the remaining options are:

  • Just live with it
  • Provide special intrinsic functionality

I'd rather not "Just live with it", so to me, that just leaves providing special intrinsics. Doing this, is not a new concept, many modern compilers provide some form of intrinsics or compiler hints to help produce better codegen.

Today, we provide several intrinsics, but they are intrinsic via implementation detail rather than intrinsic via contract (Math.Sqrt is implementation detail, Sse2.Sqrt will be by contract). We also provide hints in the form of MethodImplOptions, but that is really only used for forcing or preventing inlining (and, as a hint, it isn't always respected).

As with most intrinsics and hints, they would end up producing "unsafe" code that does not necessarily follow the language spec. Instead, they are designed for performance and place the burden on the end-user to ensure that what they are doing is "safe".

Something like a Select method or an ElideBoundsChecks hint is just the type of thing that would allow condition ? a[5] : b[6] to be emitted as

add     rdx, 20
lea     rax, qword ptr [r8 + 24]
test    cl,  cl
cmovne  rax, rdx
mov     eax, dword ptr [rax]
ret     0

rather than as:

00007FFA2B0F0570  sub         rsp,28h  
00007FFA2B0F0574  test        cl,cl  
00007FFA2B0F0576  jne         00007FFA2B0F0588  
00007FFA2B0F0578  cmp         dword ptr [r8+8],6  
00007FFA2B0F057D  jbe         00007FFA2B0F0596  
00007FFA2B0F057F  mov         eax,dword ptr [r8+28h]  
00007FFA2B0F0583  add         rsp,28h  
00007FFA2B0F0587  ret
00007FFA2B0F0588  cmp         dword ptr [rdx+8],5  
00007FFA2B0F058C  jbe         00007FFA2B0F0596  
00007FFA2B0F058E  mov         eax,dword ptr [rdx+24h]  
00007FFA2B0F0591  add         rsp,28h  
00007FFA2B0F0595  ret
00007FFA2B0F0596  call        00007FFA8AD264D0  
00007FFA2B0F059B  int         3

or worse, as:

00007FFA2B0D0500  sub         rsp,28h  
00007FFA2B0D0504  cmp         dword ptr [rdx+8],5  
00007FFA2B0D0508  jbe         00007FFA2B0D0527  
00007FFA2B0D050A  mov         eax,dword ptr [rdx+24h]  
00007FFA2B0D050D  cmp         dword ptr [r8+8],6  
00007FFA2B0D0512  jbe         00007FFA2B0D0527  
00007FFA2B0D0514  mov         edx,dword ptr [r8+28h]  
00007FFA2B0D0518  test        cl,cl  
00007FFA2B0D051A  jne         00007FFA2B0D051E  
00007FFA2B0D051C  jmp         00007FFA2B0D0520  
00007FFA2B0D051E  mov         edx,eax  
00007FFA2B0D0520  mov         eax,edx  
00007FFA2B0D0522  add         rsp,28h  
00007FFA2B0D0526  ret  
00007FFA2B0D0527  call        00007FFA8AD264D0  
00007FFA2B0D052C  int         3

NOTE: The last block here is actually generated when inlining a select call (which does return condition ? trueValue : falseValue) into a method that does return Select(condition, a[5], b[6]), when the first block should probably be generated instead. The difference is whether both range checks are done up front, or whether only one range check is done, and only when it is actually needed. The first block works for null a (when condition is false) or null b (when condition is true). The last block fails for null a or null b, regardless of condition. It is also worth noting that the first block is what is generated when the select call is manually inlined.

Now this is almost turning (if it hasn't already) into new proposal territory, so I'll take my grumbling elsewhere

Provide special intrinsic functionality

The specific Select behavior you are suggesting has nothing to do with intrinsics; it does not map to features available in the supported instruction sets.

The Select method that I suggested may be treated as an intrinsic and that would avoid the need to do if-conversion in the JIT. That's certainly an option.

Something like a Select method or ElideBoundsChecks hint are just the type of things that would allow condition ? a[5] : b[6] to be emitted as

Select/CMOV/IfConversion has nothing to do with range check elimination. If you don't want those range checks then use unsafe code or whatever other means are available. Or suggest new ones. But again, this has nothing to do with this issue.

The difference being, whether both range checks are done first, or whether only one range check is done, and only when it is actually neeeded.

The difference is normal and expected. Please try to understand how things work before attempting to imply that one variant is somehow worse or better than the other. They're different. And again, this is not related to this issue.

Now this is almost turning (if it hasn't already) into new proposal territory, so I'll take my grumbling elsewhere

I guess it's good that you finally figured it out. Still, your behavior is kind of dubious for a Microsoft employee. They usually put more thought into their contributions.

@tannergooding

As most intrinsics and hints, they would end up producing "unsafe" code that does not necessarily follow the language spec.

Which existing or proposed intrinsics or hints actually do not follow the language spec?


Also, with the Select you proposed, how is the runtime going to figure out which side-effects can be ignored? IL doesn't really capture which expression was part of the Select call (and so it can be optimized) and which wasn't (and so it presumably can't be optimized).

For example, consider:

```c#
static int M1(bool c, int[] a, int[] b, int i)
{
    return Select(c, a[i], b[i]);
}

static int M2(bool c, int[] a, int[] b, int i)
{
    var c0 = c;
    var ai = a[i];
    return Select(c0, ai, b[i]);
}
```

Are you sure M1 and M2 will produce different IL, which the runtime can use to figure out if it can avoid evaluating a[i]? (Right now, the IL is different, but the only difference is a stloc ldloc pair. I think a future version of the compiler could easily change that.)

Or is it okay if a[i] is not evaluated even in M2?

Still, your behavior is kind of dubious for a Microsoft employee. They usually put more thought into their contributions.

@mikedn, We can have misunderstandings about areas that aren't our expertise as well. We can also have bad ideas. It just means we sometimes have a route to discuss bad ideas or misunderstandings before it goes on the internet :wink:

The difference is normal and expected. Please try to understand how things work before attempting to imply that one variant is somehow worse or better than the other. They're different. And again, this is not related to this issue.

Why is the difference expected?

Given two methods:
```c#
int Select(bool condition, int trueValue, int falseValue)
{
    return condition ? trueValue : falseValue;
}

int MyMethod(bool condition, int[] a, int[] b)
{
    return Select(condition, a[5], b[6]);
}
```

If no inlining is done, then both a[5] and b[6] must be evaluated before Select can be called.

However, if inlining is done, I don't see why the method cannot be transformed so the inlined code is better.

(I.12.6.4 Optimization) indicates

Conforming implementations of the CLI are free to execute programs using any technology that
guarantees, within a single thread of execution, that side-effects and exceptions generated by a
thread are visible in the order specified by the CIL. For this purpose only volatile operations
(including volatile reads) constitute visible side-effects. (Note that while only volatile operations
constitute visible side-effects, volatile operations also affect the visibility of non-volatile
references.)

[Rationale: An optimizing compiler is free to reorder side-effects and synchronous exceptions to
the extent that this reordering does not change any observable program behavior. end rationale]

[Note: An implementation of the CLI is permitted to use an optimizing compiler, for example, to
convert CIL to native machine code provided the compiler maintains (within each single thread
of execution) the same order of side-effects and synchronous exceptions.

This is a stronger condition than ISO C++ (which permits reordering between a pair of sequence
points) or ISO Scheme (which permits reordering of arguments to functions). end note]

Additionally, the spec indicates

In general, arguments and local variables are only visible to the executing thread, while instance and static fields and array elements can be visible to
multiple threads, and modification of such values is considered a side-effect.

Based on this, reading the element of a non-volatile array should be considered non side-effecting.

For many cases, the method may be too complex for the JIT to do the appropriate analysis and determine that x or an alias of x is not assigned.

However, for a simple case, such as the above, it should be able to determine that both a[5] and b[6] are non-volatile and that they are only read.

It should then be able to determine that the range checks don't need to be done at the top of the method and can be inlined to the paths where they are relevant.

how is the runtime going to figure out which side-effects can be ignored

@svick, in your example, the difference is whether or not the evaluation is stored into a local first.

For M1, the values are directly loaded onto the stack and then consumed from the Select call. Therefore, they are only available in that context.

For M2, the values are stored in locals before being stored on the stack. Additional analysis would have to be done to determine that the local exists, but is immediately consumed (and only consumed once) in a specific context and can therefore be inlined.

My understanding is that reading of an array should be considered non side-effecting, but modification of an array is considered side-effecting.

@tannergooding is this more accurate of what you are after?

int Select(bool condition, ref int trueValue, ref int falseValue)
{
    return condition ? ref trueValue : ref falseValue;
}

@benaadams, no. At this point I'm just trying to understand why

This:
```c#
int Select(bool condition, int trueValue, int falseValue)
{
    return condition ? trueValue : falseValue;
}

int MyMethod(bool condition, int[] a, int[] b)
{
    return Select(condition, a[5], b[6]);
}
```

cannot be transformed to be:

```c#
int MyMethod(bool condition, int[] a, int[] b)
{
    return condition ? a[5] : b[6];
}
```

and generate the appropriate code. My understanding of the optimizations the spec allows, indicates it should be.

Basically, if the range checks could be elided by the JIT, I would expect this to generate:

add     rdx, 20
lea     rax, qword ptr [r8 + 24]
test    cl,  cl
cmovne  rax, rdx
mov     eax, dword ptr [rax]
ret     0

If they could not be elided, I would expect it to generate:

00007FFA2B0F0570  sub         rsp,28h  
00007FFA2B0F0574  test        cl,cl  
00007FFA2B0F0576  jne         00007FFA2B0F0588  
00007FFA2B0F0578  cmp         dword ptr [r8+8],6  
00007FFA2B0F057D  jbe         00007FFA2B0F0596  
00007FFA2B0F057F  mov         eax,dword ptr [r8+28h]  
00007FFA2B0F0583  add         rsp,28h  
00007FFA2B0F0587  ret
00007FFA2B0F0588  cmp         dword ptr [rdx+8],5  
00007FFA2B0F058C  jbe         00007FFA2B0F0596  
00007FFA2B0F058E  mov         eax,dword ptr [rdx+24h]  
00007FFA2B0F0591  add         rsp,28h  
00007FFA2B0F0595  ret
00007FFA2B0F0596  call        00007FFA8AD264D0  
00007FFA2B0F059B  int         3

However, today, the codegen seems like it would be closer to (at least based on the current codegen):

00007FFA2B0F0570  sub         rsp,28h  
00007FFA2B0F0574  test        cl,cl  
00007FFA2B0F0576  jne         00007FFA2B0F0588  
00007FFA2B0F0578  cmp         dword ptr [r8+8],6  
00007FFA2B0F057D  jbe         00007FFA2B0F0596  

add     rdx, 20
lea     rax, qword ptr [r8 + 24]
test    cl,  cl
cmovne  rax, rdx
mov     eax, dword ptr [rax]

00007FFA2B0F0591  add         rsp,28h  
00007FFA2B0F0595  ret
00007FFA2B0F0596  call        00007FFA8AD264D0  
00007FFA2B0F059B  int         3

I believe the latter, although it uses cmov, is arguably worse since it forces a NullReference or IndexOutOfRange exception when condition is true and b is invalid or when condition is false and a is invalid. But the equivalent manually inlined code does not.

I brought up the Select intrinsic as a way to force the JIT to produce cmov, regardless of what it thinks should be generated. A ElideBoundsCheck hint would additionally allow the range checks to be dropped, but that begins getting into new proposal territory.

@tannergooding

For M2, the values are stored in locals before being stored on the stack. Additional analysis would have to be done to determine that the local exists, but is immediately consumed (and only consumed once) in a specific context and can therefore be inlined.

There is no local for c0 in the IL, so I think the C# compiler already does that analysis in some cases. And I think the runtime can't rely on the compiler not performing that analysis for ai (even if it doesn't do it today).

There's also a super-convoluted case where the compiler produces the same IL with or without a local in the source today:

```c#
static int M3(bool c, int[] a, int[] b, int i)
{
    return Select(a: a[i], b: b[i], c: c);
}

static int M4(bool c, int[] a, int[] b, int i)
{
    var ai = a[i];
    return Select(a: ai, b: b[i], c: c);
}
```

The IL is exactly the same for `M2` and `M3`, and it stores `a[i]` (and `b[i]`) in a local:

```il
IL_0000:  ldarg.1     
IL_0001:  ldarg.3     
IL_0002:  ldelem.i4   
IL_0003:  stloc.0     
IL_0004:  ldarg.2     
IL_0005:  ldarg.3     
IL_0006:  ldelem.i4   
IL_0007:  stloc.1     
IL_0008:  ldarg.0     
IL_0009:  ldloc.0     
IL_000A:  ldloc.1     
IL_000B:  call        Select<Int32>
IL_0010:  ret
```

@svick, yes.

I was indicating that, when locals are involved, additional analysis for those values has to be done (something the JIT probably doesn't have time to do), so the transformation would not occur and the evaluation would happen before the method call. (There would need to be a post-compilation, but pre-JIT/AOT, IL optimization tool for these transformations.)

However, for the cases when no locals are involved, the JIT should be able to determine the value is passed directly to the method and can then attempt the appropriate transforms.

This is starting to go pretty far off the rails. Let's back up and answer @tannergooding's high-level question:

At this point I'm just trying to understand why this:

```c#
int Select(bool condition, int trueValue, int falseValue)
{
return condition ? trueValue : falseValue;
}

int MyMethod(bool condition, int[] a, int[] b)
{
return Select(condition, a[5], b[6]);
}

cannot be transformed to be:

```c#
int MyMethod(bool condition, int[] a, int[] b)
{
    return condition ? a[5] : b[6];
}

This transformation cannot be performed in the way that you suggest because it changes the behavior of the program. In the first case, the evaluation of condition, the array accesses, and any associated side-effects occur before the call to Select; in the second case, only one array access is performed, thus dropping the side-effects of the other access. As a rule, optimizations performed by the compiler must not affect the behavior of the compiled program. The only exceptions occur when the behavior of a program is undefined, in which case the compiler is allowed to do pretty much whatever it wants. This does not occur very often in IL, and in any case should generally be treated as a bug rather than a feature.
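
A concrete illustration of that behavioral difference (my example, not from the original comment):

```c#
static int Select(bool condition, int trueValue, int falseValue)
    => condition ? trueValue : falseValue;

static void Demo(bool condition)
{
    int[] a = { 0, 1, 2, 3, 4, 5 };
    int[] b = null;

    // Ternary form: only the chosen operand is evaluated, so with condition == true
    // this returns a[5] and never touches b.
    int viaTernary = condition ? a[5] : b[6];

    // Call form: a[5] and b[6] are both evaluated before Select is entered, so this
    // throws NullReferenceException no matter what condition is. Rewriting one form
    // into the other therefore changes observable behavior.
    int viaCall = Select(condition, a[5], b[6]);
}
```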

@tannergooding,

I'm just trying to understand why this ... cannot be transformed to be ... My understanding of the optimizations the spec allows, indicates it should be.

It's in the part of I.12.6.4 that you quoted. Adding emphasis (note "and exceptions"):

Conforming implementations of the CLI are free to execute programs using any technology that
guarantees, within a single thread of execution, that side-effects and exceptions generated by a
thread are visible in the order specified by the CIL. For this purpose only volatile operations
(including volatile reads) constitute visible side-effects. (Note that while only volatile operations
constitute visible side-effects, volatile operations also affect the visibility of non-volatile
references.)

The JIT must guarantee that the NullReferenceException or IndexOutOfRangeException thrown by the ldelem is visible.

@pgavlin, @JosephTremoulet. Thanks.

Am I correctly understanding the subsequent section on E-relaxed methods to mean that, if a new CompilationRelaxations value were exposed, the JIT would be allowed to perform such a transformation and delay the check into the inlined code?

@tannergooding, by my reading, that would still run afoul of:

The check’s exception is thrown some time in the sequence, unless the sequence throws
another exception. When multiple relaxed checks fail, it is unspecified as to which
exception is thrown by the VES

@JosephTremoulet. Hmm, it seems that the further explanation of relaxations in Annex F specifically calls out the scenario I am interested in:

If this method is relaxed, and the caller is also relaxed, then the caller would be allowed, in the
absence of constraining data or control dependences, to interleave the call with other instructions
in the caller. If one of those other interleaved instructions faults, then any or all of M’s side
effects might be suppressed. Thus, method M should be marked as strict if it is important to
prevent a fault from destroying the invariant.

This “interference” from the caller is potentially annoying, but seems to be intrinsic to any
definition of relaxed exceptions that permits both:

  1. instruction reordering and
  2. inlined method calls are at least as fast as manual inlining.

In either case, thanks for helping me understand this better.

I'm going to create a new issue so I can continue the discussion without polluting this thread anymore.

I think, even if the runtime does not "technically" allow it today (although I think it might), it should support these types of optimizations.

@tannergooding, translating "caller" and "M" to your example, that's saying that (if both are relaxed) a fault in MyMethod may result in some arbitrary subset of Select 's side-effects being suppressed. The fault in MyMethod still must be made visible.

HW_INTRINSIC-based implementation for CMOVnn: https://github.com/EgorBo/runtime-1/commit/1271fe536cc3274867a7306424d45c8db76be8ca

```c#
static int Test1(int x)
{
    return x == 42 ? 1000 : 2000;
}

static int Test2(int x, int a, int b)
{
    return x == 42 ? a : b;
}
```

```asm
; Method Tests:Test1(int):int
G_M56601_IG01:
G_M56601_IG02:
       83F92A               cmp      ecx, 42
       B8E8030000           mov      eax, 0x3E8
       BAD0070000           mov      edx, 0x7D0
       0F45C2               cmovne   eax, edx
G_M56601_IG03:
       C3                   ret
; Total bytes of code: 17

; Method Tests:Test2(int,int,int):int
G_M50938_IG01:
G_M50938_IG02:
       83F92A               cmp      ecx, 42
       8BC2                 mov      eax, edx
       410F45C0             cmovne   eax, r8d
G_M50938_IG03:
       C3                   ret
; Total bytes of code: 10
```

Works better with PGO (COMPlus_TieredPGO=1) 🙂.
