Conditional jumps, especially those that are hard to predict, are fairly expensive, so they should be avoided where possible. One way to avoid them is to use conditional moves and similar instructions (like `sete`). As far as I can tell, RyuJIT never does this, and I think it should.
For example, take these two methods:
``` c#
[MethodImpl(MethodImplOptions.NoInlining)]
static long sete_or_mov(bool cond) {
return cond ? 4 : 0;
}
[MethodImpl(MethodImplOptions.NoInlining)]
static long cmov(long longValue) {
long tmp1 = longValue & 0x00000000ffffffff;
return tmp1 == 0 ? longValue : tmp1;
}
```
For both of them, RyuJIT generates a conditional jump:
``` asm
; Assembly listing for method Program:sete_or_mov(bool):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) bool -> rcx
; V01 tmp0 [V01,T01] ( 3, 2 ) int -> rax
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0
G_M60330_IG01:
G_M60330_IG02:
84C9 test cl, cl
7504 jne SHORT G_M60330_IG03
33C0 xor eax, eax
EB05 jmp SHORT G_M60330_IG04
G_M60330_IG03:
B804000000 mov eax, 4
G_M60330_IG04:
4863C0 movsxd rax, eax
G_M60330_IG05:
C3 ret
; Total bytes of code 17, prolog size 0 for method Program:sete_or_mov(bool):long
; ============================================================
; Assembly listing for method Program:cmov(long):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 4, 3.5) long -> rcx
; V01 loc0 [V01,T01] ( 3, 2.5) long -> rax
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0
G_M53075_IG01:
G_M53075_IG02:
B8FFFFFFFF mov eax, 0xFFFFFFFF
4823C1 and rax, rcx
4885C0 test rax, rax
7401 je SHORT G_M53075_IG04
G_M53075_IG03:
C3 ret
G_M53075_IG04:
488BC1 mov rax, rcx
G_M53075_IG05:
C3 ret
; Total bytes of code 18, prolog size 0 for method Program:cmov(long):long
; ============================================================
```
For comparison, here are the same methods compiled using Clang and GCC with -O1 (via Compiler Explorer):
GCC 6.2:
``` asm
sete_or_mov(bool):
        test    dil, dil
        setne   al
        movzx   eax, al
        sal     rax, 2
        ret
cmov(unsigned long):
        mov     eax, edi
        test    rax, rax
        cmove   rax, rdi
        ret
```
Clang 3.9.0:
``` asm
sete_or_mov(bool):                       # @sete_or_mov(bool)
        movzx   eax, dil
        shl     rax, 2
        ret
cmov(unsigned long):                     # @cmov(unsigned long)
        mov     eax, edi
        mov     ecx, 4294967295
        and     rcx, rdi
        cmove   rax, rdi
        ret
```
There is code for it, I think, but @jamesqo noticed it was never getting executed: https://github.com/dotnet/coreclr/pull/6647#issuecomment-238382597
The function is called `CodeGen::genCodeForQmarkWithCMOV`.
@benaadams That code is in codegenlegacy.cpp, which, as far as I can tell, is the pre-RyuJIT codegen and does not even support x64. So I'm not sure how relevant that is for RyuJIT on x64.
Might be why it's not called then 😆
@benaadams I believe I tested it with x86 builds, without enabling the x86 RyuJIT (which is disabled by default). Feel free to check for yourself, though, in case I messed something up.
There does seem to be a lot of stuff for cmov in the code, though I've never seen it in the output.
We used to generate CMOV/SETCC from our internal QMark operator; however, we ended up expanding QMark after the importer into short-circuit flow to simplify handling in downstream phases. As I recall, the plan was to enable a more general if-conversion phase to identify simple cmov/setcc patterns before lowering and reconstitute them. /cc @dotnet/jit-contrib @russellhadley
I believe JIT32 (used today on x86) does generate at least SETCC, if not CMOV.
Regardless of the current state, I think we all agree with the previous comments: both are pretty powerful features of the IA architecture and worth generating. They often result in smaller and faster code without the possibility of expensive misprediction penalties.
I agree, it's pretty important to handle both of these cases. In terms of priority, @svick, where is this coming from? Is it in a particular workload or benchmark? What's the opportunity here?
@russellhadley I noticed this when I tried to modify this code in Kestrel to use cmov and failed.
But that was just "Hey, this code could be faster! No? Too bad.", I don't know if it would actually have measurable impact on the overall performance of Kestrel. So I don't think you should prioritize this based on me.
@svick It's still Kestrel and it's in one of the methods we want to get inlined into the MemoryAllocatorIterator::Seek routine and optimized - so it had some priority. Thanks for the heads up. :)
@svick, that code is probably faster with bsf/tzcnt (count trailing zeroes) and lshl (logical shift left) instead of cmov/condIf's. Whether the JIT knows to generate them is another question.
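For reference, a sketch of what `bsf`/`tzcnt` compute (the `trailing_zeros` helper and its fallback loop are illustrative, not from this thread; `__builtin_ctz` is the GCC/Clang intrinsic that maps to these instructions):

```c
#include <stdint.h>

/* Count trailing zero bits, which x64 can do in a single bsf/tzcnt
   instruction. The fallback loop spells out the semantics. */
static int trailing_zeros(uint32_t x)
{
    if (x == 0)
        return 32;          /* tzcnt defines this case; bsf leaves it undefined */
#if defined(__GNUC__)
    return __builtin_ctz(x);
#else
    int n = 0;
    while ((x & 1u) == 0) { /* shift until the lowest set bit reaches bit 0 */
        x >>= 1;
        n++;
    }
    return n;
#endif
}
```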
@choikwa but ternary operators, for example, are probably a much more common pattern across the ecosystem as a first step to improve; and likely simpler to recognise, since it's a language construct.
Have an improvement for that particular section of code in Kestrel in a PR https://github.com/aspnet/KestrelHttpServer/pull/1138
@cmckinsey @russellhadley Anyone working on this? FYI I'm trying to add support for CMOV to RyuJIT's codegen for a different issue. Maybe I can take a look at this one after that.
AFAIK nobody is looking into this at the moment. Feel free to dive in :)
@mikedn I think @benaadams has provided a branchless version of the code that motivated this in Kestrel, but getting to the point where we can generate CMOVs is still interesting. @sivarv has been working on some ideas around how to improve this kind of instruction selection. It would be good to make sure this effort dovetails.
FWIW I've made a prototype that survives diffing and testing. It detects simple half/full hammocks where the true/false BBs contain a single assignment (or a return) that involves only local variables and constants. It only works with i4 and i8 but with some care we might be able to handle small integer types.
I'll probably try to extend this to indirs though it's problematic due to side-effects. A quick and dirty experiment that survives diffing could probably show if there are enough opportunities for this to be useful.
The conversion is currently done when lowering `GT_JTRUE`. It could easily be moved to a separate phase before lowering, but I'm wondering if this shouldn't be done much earlier. After conversion we have fewer basic blocks and we may be able to eliminate some variables as well. And we may miss other optimizations if we do this conversion late; for example, dotnet/runtime#6748 requires CSE.
FX diff shows a 2k reduction in code size, but it could be better. I had to turn off an and-cmp-to-test optimization that caused a lot of trouble, and that results in code size regressions. Also, using `cmov` can result in pretty bad code if variables are not already in registers. Something like:
``` asm
mov eax, [ebp-8]
mov edx, 1
cmov eax, edx
mov [ebp-8], eax
```
Codegen for `GT_SELECT` should probably detect such cases and fall back to a `jcc` and `mov [ebp-8], 1`.
The two examples from the initial post generate:
``` asm
G_M65174_IG02:
84D2     test dl, dl
0F95C0   setne al
0FB6C0   movzx rax, al
C1E002   shl eax, 2
4863C0   movsxd rax, eax
G_M65174_IG03:
C3       ret
```
``` asm
G_M65174_IG02:
B8FFFFFFFF mov eax, 0xFFFFFFFF
4823C2   and rax, rdx
4885C0   test rax, rax
480F44C2 cmove rax, rdx
G_M65174_IG03:
C3       ret
```
The code is similar to what the native compilers generate, except for a couple of warts. The first example has a widening cast that can be easily removed. The second example has a useless `test` that could be removed as well.
It's worth noting that Clang assumes that bools are 0/1 and generates better code for the first example. I don't think we can do that; CLR bools are 0/non-zero.
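To illustrate the 0/non-zero point, a hedged C sketch (function names made up for this example): a CLR-style bool must be normalized before arithmetic, which is what the `setne`/`movzx` pair does, while Clang's 0/1 assumption permits a bare shift:

```c
#include <stdint.h>

/* A CLR bool is a byte where any non-zero value means true, so the JIT must
   normalize it (the test/setne/movzx sequence) before doing arithmetic. */
static int64_t normalized(uint8_t cond)
{
    return (int64_t)(cond != 0) << 2;   /* always 0 or 4 */
}

/* Clang's 0/1 assumption lets it skip the normalization and just shift;
   that would be wrong for a non-canonical CLR bool such as 7. */
static int64_t unnormalized(uint8_t cond)
{
    return (int64_t)cond << 2;
}
```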
The `Math.Max` example from dotnet/runtime#4326 generates:
``` asm
G_M65175_IG02:
8B4208     mov eax, dword ptr [rdx+8]
83F82A     cmp eax, 42
BA2A000000 mov edx, 42
0F4CC2     cmovl eax, edx
```
so we get rid of a useless jump by accident (those kinds of jumps can probably appear in other situations that aren't handled by if-conversion).
The `bool` conversion example from dotnet/runtime#4399 isn't currently detected, but a variant of it works:
``` C#
for (int i = 0; i < N; i++)
    sum = (bits[i] ? 1 : 0) + sum;
```
``` asm
470FB6440810 movzx r8, byte ptr [r8+r9+16]
4585C0     test r8d, r8d
410F95C0   setne r8b
450FB6C0   movzx r8, r8b
4103C0     add eax, r8d
```
The reason why the original code (`sum += bits[i] ? 1 : 0`) isn't handled is that the IL stack is not empty at the entry of the true/false blocks, and as a result an additional assignment is added to those blocks.
With some additional work this could probably turn into:
``` asm
movzx r8, byte ptr [r8+r9+16]
neg r8d
adc eax, 0
```
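As a sanity check on the suggested `neg`/`adc` sequence, a small C model (names illustrative): `neg` sets the x86 carry flag exactly when its operand was non-zero, so `adc eax, 0` adds `(b != 0)` without a branch:

```c
#include <stdint.h>

/* Models the neg/adc idiom: `neg r` sets CF iff r was non-zero, and
   `adc sum, 0` then adds exactly that carry bit to the sum. */
static int32_t add_bool(int32_t sum, uint8_t b)
{
    int carry = (b != 0);   /* CF after `neg b` */
    return sum + carry;     /* `adc sum, 0` */
}
```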
The nice thing about if-conversion is that once you generate a GT_SELECT you can look around and detect all kind of patterns relatively easily.
I'll try to do some perf measurements, but the perf characteristics of converting branches are probably well known to anyone familiar with the subject. If the branch isn't predictable, then you get good performance (I've seen numbers in the range of 2-5x but I have to double check). If the branch is predictable, then performance may regress slightly. Maybe such conversions should be driven by profile data, but it seems that all native compilers generate `cmov` by default.
As a simple perf test let's sum all the positive numbers from an array:
``` C#
int sum = 0;
for (long i = 0; i < nums.Length; i++)
    sum += (nums[i] > 0 ? nums[i] : 0);
return sum;
```
As mentioned above, the use of `+=` blocks if-conversion, so we don't get CMOV:
``` asm
G_M65174_IG03:
493BC8     cmp rcx, r8
731F       jae SHORT G_M65174_IG06
448B4C8A10 mov r9d, dword ptr [rdx+4*rcx+16]
4585C9     test r9d, r9d
7F05       jg SHORT G_M65174_IG04
4533C9     xor r9d, r9d
EB00       jmp SHORT G_M65174_IG04
G_M65174_IG04:
4103C1     add eax, r9d
48FFC1     inc rcx
4C3BC1     cmp r8, rcx
7FE1       jg SHORT G_M65174_IG03
```
With 1,000,000 numbers this runs in 630us if all numbers are positive and 4300us if we have a random mix of positive and negative numbers. Poor branch prediction takes its toll.
If we change the sum expression so that if-conversion kicks in we get CMOV:
``` asm
G_M65174_IG03:
493BC8     cmp rcx, r8
731F       jae SHORT G_M65174_IG05
448B4C8A10 mov r9d, dword ptr [rdx+4*rcx+16]
4533D2     xor r10d, r10d
4585C9     test r9d, r9d
450F4ECA   cmovle r9d, r10d
4103C1     add eax, r9d
48FFC1     inc rcx
4C3BC1     cmp r8, rcx
7FE1       jg SHORT G_M65174_IG03
```
Now we always get 720us no matter what the numbers are. We're slightly slower in the "all positive" case and 6 (yes, six!) times faster in the random case.
Should something be done about the slowdown we see in the "all positive" case? I don't know; as a developer, I would use the following code if the numbers are known to be mostly positive:
``` C#
for (long i = 0; i < nums.Length; i++)
    if (nums[i] > 0)
        sum += nums[i];
```
If-conversion doesn't kick in, and this runs in 490us if all numbers are positive; that's faster than all of the above.
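The CMOV version's effect can be modeled in C; a minimal branchless sketch (the sign-mask trick stands in for `cmovle`, and assumes arithmetic right shift on signed ints, which mainstream compilers provide):

```c
#include <stdint.h>
#include <stddef.h>

/* Branchless sum of positive elements, equivalent to the if-converted loop. */
static int32_t sum_positive(const int32_t *nums, size_t len)
{
    int32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        int32_t x = nums[i];
        /* (x >> 31) is all-ones when x < 0 and zero otherwise, so the mask
           keeps x for non-negative values and yields 0 for negatives. */
        sum += x & ~(x >> 31);
    }
    return sum;
}
```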
Also relevant: http://aakinshin.net/en/blog/dotnet/perfex-min/
I know. The prototype I have does generate CMOV in the `MinTernary` method, but it doesn't yet work in the `Ternary` method because the result is assigned to an array rather than a local variable.
@mikedn For your prototype, did you just try to re-enable the qmark support? (not expand them early into flow)
@russellhadley Nope, I'm generating `GT_SELCC` nodes during `GT_JTRUE` lowering. `GT_SELCC` is a new operator I've added to represent CMOV. It's a binary operator that expects the condition flags to be set appropriately.
@mikedn Does it take 2 full trees to be evaluated or is it restricted to taking lclVars?
@russellhadley lclVars and constants. It recognizes:
```C#
if (whatever) { lclX = lclY (or const); }
// generates STORE_LCL(lclX, SELCC(...))
if (whatever) { lclX = lclY (or const); } else { lclX = lclZ (or const); }
// generates STORE_LCL(lclX, SELCC(...))
if (whatever) { return lclY (or const); } else { return lclZ (or const); }
// generates RETURN(SELCC(...))
```
It should be possible to handle more patterns, but there are all kinds of issues; for example, an indir can't be speculated if its address may be `null`. CMOV faults even if the condition is false.
I think (haven't looked at it too much) that GT_QMARK allows more or less arbitrary trees, and that makes it problematic; it looks like some kind of control flow embedded in an expression tree. GT_SELCC looks like any other binary operator except that it has a hidden dependency on condition flags.
I mixed up my code with a few other experiments, I'll try to clean it up a bit tomorrow and push it to my GitHub fork for reference. Though at 500 lines of code (including empty lines) it may not be a pretty sight. Needs reorganization.
Thinking about it: given that it's complicated to catch every CMOV opportunity, why not tackle only the 'easy' cases (where a performance improvement is certain) and also provide an intrinsic version for the dubious cases?
@russellhadley My experimental branch is here: https://github.com/mikedn/coreclr/commits/if-convert2
@redknightlois I'm not sure how an intrinsic would help. Once it works for local variables, it's up to you to ensure that the code has the right pattern to make use of CMOV. Or create a method like `T Select<T>(bool condition, T trueValue, T falseValue) => condition ? trueValue : falseValue;`. The moment you write `Select(c, a[i], b[i])` you accept that both `a[i]` and `b[i]` will be evaluated.
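A small C model of that contract (illustrative names only): both arguments of a plain Select-style call are evaluated before the selection, side effects included, which is exactly the shape CMOV needs:

```c
/* Counter makes the evaluation of each argument observable. */
static int eval_count = 0;

static int load_elem(const int *p, int i)
{
    eval_count++;           /* side effect of evaluating a[i] or b[i] */
    return p[i];
}

/* Plain value-based select: by the time we choose, both sides exist. */
static int select_val(int cond, int true_val, int false_val)
{
    return cond ? true_val : false_val;
}
```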
@mikedn Looks like the experimental branch was deleted. Do you have a draft somewhere?
@sdmaclea See the "If conversion" commit in this branch https://github.com/mikedn/coreclr/commits/relop-lower-2
@mikedn, as for providing an intrinsic (a bit late, I know), I didn't think the runtime had a rule stating that a variable passed directly to a method (without being stored to a local in the interim) has to be evaluated (that's just the way it happens to work today, since we don't expose very many, if any, "intrinsics" like this).
That is, it should be entirely possible for the runtime to expose a method in Corlib that has a special contract where `Select(c, a[i], b[i])` will only evaluate `a[i]` or `b[i]` based on the result of `c`. The method could be exposed in the new `System.Runtime.Intrinsics` namespace (or maybe `System.Runtime.CompilerServices.Unsafe`) and would be considered a "special" use case for power users (just like the rest of `S.R.Intrinsics` or `S.R.CS.Unsafe`).
That being said, just ensuring we recognize `c ? a[i] : b[i]` is probably good enough. The downside is that there is no guarantee it will be transformed, depending on various other constraints the JIT might be under (this has been brought up in various different threads, including the one where I requested we document recognized patterns 😉).
@tannergooding I don't quite understand what you are saying. In particular, the "Select(c, a[i], b[i]) will only evaluate a[i] or b[i] based on the result of c" makes no sense. The whole point of adding the Select method I suggested is to allow people to clearly state that they are OK with both `a[i]` and `b[i]` being evaluated. The "special contract" you suggest seems to do exactly the opposite (and sounds like a C/C++ macro abomination).
> That being said, just ensuring we recognize c ? a[i] : b[i] is probably good enough
It will never recognize that. It's simply impossible to express the semantics of such an expression using CMOV.
> It will never recognize that. It's simply impossible to express the semantics of such an expression using CMOV.
C/C++ recognizes it:
```cpp
int Select(bool condition, int trueValue, int falseValue)
{
    return condition ? trueValue : falseValue;
}
```
is transformed to:
``` asm
test cl, cl
cmovne r8d, edx
mov eax, r8d
ret 0
```
Realistically, there is no reason the JIT couldn't recognize that pattern as well.
> The "special contract" you suggest seems to do exactly the opposite (and sounds like a C/C++ macro abomination).
Not really, even just using regular C/C++ methods, the inliner recognizes we are just indexing into an array (which for non-volatile locals/parameters should be non side-effecting) and does the right thing:
```cpp
int Select(bool condition, int trueValue, int falseValue)
{
    return condition ? trueValue : falseValue;
}

int MyMethod(bool condition, int* a, int* b)
{
    return Select(condition, a[5], b[6]);
}
```
transforms to:
``` asm
add rdx, 20
lea rax, qword ptr [r8 + 24]
test cl, cl
cmovne rax, rdx
mov eax, dword ptr [rax]
ret 0
```
So, as you can see, it is a cross-method call that inlines so that the array access doesn't happen until after `condition` has been evaluated.
I see no reason why the JIT could not also recognize and optimize this appropriately, at the very least for something like accessing a local or an array index. Properties and methods are more questionable since they can be potentially side-effecting.
Having a method of the form `Select(bool, T, T)`, where T can fit into a register, and having the JIT guarantee that it will inline and be transformed to a `cmov` (or equivalent) if the platform supports it, does not seem unreasonable to me.
Directly doing `return condition ? a[5] : b[6]` (or even `return condition ? a[i] : b[i]`) also works and is transformed to the equivalent of the `MyMethod` output.
I did test the various outputs on MSVC++ and Clang.
Does it work when `condition` is false and `a` is nullptr?
> Does it work when condition is false and a is nullptr?
@briansull, the code it generates for `MyMethod` is consistent regardless of what `a`, `b`, or `c` are.
However, when `MyMethod(false, null, b)` is called, it is able to statically determine that `c` is always false and will only ever return `b`. So it will inline that appropriately.
So, for example:
```cpp
int MyMethod2(int* b)
{
    return MyMethod(false, nullptr, b);
}
```
will transform to:
``` asm
mov eax, dword ptr [rcx+24]
ret 0
```
and that itself can be inlined to just `mov <register>, dword ptr [<b>+24]`.
C++ semantics allow suppressing faults in expressions whose results are otherwise unused. MSIL semantics do not.
> C/C++ recognizes it:
So does my experimental if-conversion implementation. But that's a different expression than the one you used in your previous post.
> int MyMethod(bool condition, int* a, int* b)
That's not really the same as what you were suggesting. It's easier to observe if you look at the call site: it would be `MyMethod(c, &a[i], &b[i])`, not `MyMethod(c, a[i], b[i])`. Besides, examples coming from the C++ world are not very relevant. In C/C++ there are no array range checks. In C/C++, dereferencing a null pointer results in undefined behavior, not in a NullReferenceException. In C/C++, the expression evaluation order is not well defined. And so on.
> Having a method of the form Select(bool, T, T), where T can fit into a register, and having the JIT guarantee that it will inline and be transformed to a cmov (or equivalent) if the platform supports it, does not seem unreasonable to me.
Who said it is unreasonable? You jumped in and suggested something that defies the standard language behavior and is not really needed. The use of a Select method to allow the developer to deal with side effects in an explicit manner was suggested earlier by me.
> Directly doing return condition ? a[5] : b[6] (or even return condition ? a[i] : b[i]) also works and is transformed to the equivalent of the MyMethod output.
Again, we're talking about C++, a very different language. In C# that transform can only be done if `a` and `b` are known to be non-null and `i` is a valid index for both arrays. Probably doable, but these conditions severely limit the scope.
I was actually considering doing something like this for field access. `c ? o.x : o.y` can always be if-converted, even if `o` is not known to be non-null.
> You jumped in and suggested something that defies the standard language behavior and is not really needed.
I would say that it is needed, in order to produce more performant code, for the cases it is necessary (or desired).
The JIT has its own rules for safety (null checks, range checks, etc). There are certainly ways around this (such as using unsafe code), but the JIT already has trouble generating optimal code in a timely manner and some of the analysis will be impossible to do in the timeframe allotted. Doing things that are "out of the ordinary" probably won't help with generating optimal code.
The general response to this is to use AOT (which isn't always available) or write it in native code (which has its own overhead for p/invoke, or can have other issues if you are writing pure native code). Adding new IL instructions is also not very "viable" as it sets a new bar, breaks back-compat, requires compiler support to generate the code, etc.
So the remaining options are:

* Just live with it
* Provide special intrinsic functionality

I'd rather not "just live with it", so to me, that just leaves providing special intrinsics. Doing this is not a new concept; many modern compilers provide some form of intrinsics or compiler hints to help produce better codegen.
Today, we provide several intrinsics, but they are intrinsic via implementation detail rather than intrinsic via contract (`Math.Sqrt` is implementation detail, `Sse2.Sqrt` will be by contract). We also provide hints in the form of `MethodImplOptions`, but that is really only used for forcing or preventing inlining (and, as a hint, it isn't always respected).
As with most intrinsics and hints, they would end up producing "unsafe" code that does not necessarily follow the language spec. Instead, they are designed for performance and place the burden on the end user to ensure that what they are doing is "safe".
Something like a `Select` method or an `ElideBoundsChecks` hint is just the type of thing that would allow `condition ? a[5] : b[6]` to be emitted as
``` asm
add rdx, 20
lea rax, qword ptr [r8 + 24]
test cl, cl
cmovne rax, rdx
mov eax, dword ptr [rax]
ret 0
```
rather than as:
``` asm
00007FFA2B0F0570  sub rsp,28h
00007FFA2B0F0574  test cl,cl
00007FFA2B0F0576  jne 00007FFA2B0F0588
00007FFA2B0F0578  cmp dword ptr [r8+8],6
00007FFA2B0F057D  jbe 00007FFA2B0F0596
00007FFA2B0F057F  mov eax,dword ptr [r8+28h]
00007FFA2B0F0583  add rsp,28h
00007FFA2B0F0587  ret
00007FFA2B0F0588  cmp dword ptr [rdx+8],5
00007FFA2B0F058C  jbe 00007FFA2B0F0596
00007FFA2B0F058E  mov eax,dword ptr [rdx+24h]
00007FFA2B0F0591  add rsp,28h
00007FFA2B0F0595  ret
00007FFA2B0F0596  call 00007FFA8AD264D0
00007FFA2B0F059B  int 3
```
or worse, as:
``` asm
00007FFA2B0D0500  sub rsp,28h
00007FFA2B0D0504  cmp dword ptr [rdx+8],5
00007FFA2B0D0508  jbe 00007FFA2B0D0527
00007FFA2B0D050A  mov eax,dword ptr [rdx+24h]
00007FFA2B0D050D  cmp dword ptr [r8+8],6
00007FFA2B0D0512  jbe 00007FFA2B0D0527
00007FFA2B0D0514  mov edx,dword ptr [r8+28h]
00007FFA2B0D0518  test cl,cl
00007FFA2B0D051A  jne 00007FFA2B0D051E
00007FFA2B0D051C  jmp 00007FFA2B0D0520
00007FFA2B0D051E  mov edx,eax
00007FFA2B0D0520  mov eax,edx
00007FFA2B0D0522  add rsp,28h
00007FFA2B0D0526  ret
00007FFA2B0D0527  call 00007FFA8AD264D0
00007FFA2B0D052C  int 3
```
NOTE: The last block here is actually generated when inlining a select call (which does `return condition ? trueValue : falseValue`) into a method that does `return Select(condition, a[5], b[6])`, when the first block should probably be generated instead. The difference is whether both range checks are done first, or whether only one range check is done, and only when it is actually needed. The former works for null `a` (on false) or null `b` (on true). The latter fails for null `a` or null `b`, regardless of condition. It is also worth noting that the former is generated when the select call is manually inlined.
Now this is almost turning (if it hasn't already) into new proposal territory, so I'll take my grumbling elsewhere
> Provide special intrinsic functionality
The specific Select behavior you are suggesting has nothing to do with intrinsics; it does not map to features available in the supported instruction sets.
The Select method that I suggested may be treated as an intrinsic and that would avoid the need to do if-conversion in the JIT. That's certainly an option.
> Something like a Select method or ElideBoundsChecks hint are just the type of things that would allow condition ? a[5] : b[6] to be emitted as
Select/CMOV/if-conversion has nothing to do with range check elimination. If you don't want those range checks, then use unsafe code or whatever other means are available. Or suggest new ones. But again, this has nothing to do with this issue.
> The difference being, whether both range checks are done first, or whether only one range check is done, and only when it is actually needed.
The difference is normal and expected. Please try to understand how things work before attempting to imply that one variant is somehow worse or better than the other. They're different. And again, this is not related to this issue.
> Now this is almost turning (if it hasn't already) into new proposal territory, so I'll take my grumbling elsewhere
I guess it's good that you finally figured it out. Still, your behavior is kind of dubious for a Microsoft employee. They usually put more thought into their contributions.
@tannergooding
> As most intrinsics and hints, they would end up producing "unsafe" code that does not necessarily follow the language spec.
Which existing or proposed intrinsics or hints actually do not follow the language spec?
Also, with the `Select` you proposed, how is the runtime going to figure out which side effects can be ignored? IL doesn't really capture which expression was part of the `Select` call (and so can be optimized) and which wasn't (and so presumably can't be).
For example, consider:
```c#
static int M1(bool c, int[] a, int[] b, int i)
{
return Select(c, a[i], b[i]);
}
static int M2(bool c, int[] a, int[] b, int i)
{
var c0 = c;
var ai = a[i];
return Select(c0, ai, b[i]);
}
```
Are you sure `M1` and `M2` will produce different IL, which the runtime can use to figure out if it can avoid evaluating `a[i]`? (Right now, the IL is different, but the only difference is a `stloc`/`ldloc` pair. I think a future version of the compiler could easily change that.)
Or is it okay if `a[i]` is not evaluated even in `M2`?
> Still, your behavior is kind of dubious for a Microsoft employee. They usually put more thought into their contributions.
@mikedn, We can have misunderstandings about areas that aren't our expertise as well. We can also have bad ideas. It just means we sometimes have a route to discuss bad ideas or misunderstandings before it goes on the internet :wink:
> The difference is normal and expected. Please try to understand how things work before attempting to imply that one variant is somehow worse or better than the other. They're different. And again, this is not related to this issue.
Why is the difference expected?
Given two methods:
```C#
int Select(bool condition, int trueValue, int falseValue)
{
return condition ? trueValue : falseValue;
}
int MyMethod(bool condition, int[] a, int[] b)
{
return Select(condition, a[5], b[6]);
}
```
If no inlining is done, then both `a[5]` and `b[6]` must be evaluated before `Select` can be called.
However, if inlining is done, I don't see why the method cannot be transformed so the inlined code is better.
The spec (I.12.6.4 Optimization) indicates:
> Conforming implementations of the CLI are free to execute programs using any technology that guarantees, within a single thread of execution, that side-effects and exceptions generated by a thread are visible in the order specified by the CIL. For this purpose only volatile operations (including volatile reads) constitute visible side-effects. (Note that while only volatile operations constitute visible side-effects, volatile operations also affect the visibility of non-volatile references.)
>
> [Rationale: An optimizing compiler is free to reorder side-effects and synchronous exceptions to the extent that this reordering does not change any observable program behavior. end rationale]
>
> [Note: An implementation of the CLI is permitted to use an optimizing compiler, for example, to convert CIL to native machine code provided the compiler maintains (within each single thread of execution) the same order of side-effects and synchronous exceptions. This is a stronger condition than ISO C++ (which permits reordering between a pair of sequence points) or ISO Scheme (which permits reordering of arguments to functions). end note]
Additionally, the spec indicates:
> In general, arguments and local variables are only visible to the executing thread, while instance and static fields and array elements can be visible to multiple threads, and modification of such values is considered a side-effect.
Based on this, reading an element of a non-volatile array should be considered non-side-effecting.
For many cases, the method may be too complex for the JIT to do the appropriate analysis and determine that `x` or an alias of `x` is not assigned.
However, for a simple case, such as the above, it should be able to determine that both `a[5]` and `b[6]` are non-volatile and that they are only read.
It should then be able to determine that the range checks don't need to be done at the top of the method and can be inlined to the paths where they are relevant.
> how is the runtime going to figure out which side-effects can be ignored
@svick, in your example, the difference is whether or not the evaluation is stored into a local first.
For `M1`, the values are directly loaded onto the stack and then consumed by the `Select` call. Therefore, they are only available in that context.
For `M2`, the values are stored in locals before being pushed on the stack. Additional analysis would have to be done to determine that the local exists but is immediately consumed (and only consumed once) in a specific context and can therefore be inlined.
My understanding is that reading from an array should be considered non-side-effecting, but modifying an array is considered side-effecting.
@tannergooding is this more like what you are after?
```C#
int Select(bool condition, ref int trueValue, ref int falseValue)
{
    return condition ? ref trueValue : ref falseValue;
}
```
@benaadams, no. At this point I'm just trying to understand why this:
```C#
int Select(bool condition, int trueValue, int falseValue)
{
return condition ? trueValue : falseValue;
}
int MyMethod(bool condition, int[] a, int[] b)
{
return Select(condition, a[5], b[6]);
}
```
cannot be transformed to be:
```C#
int MyMethod(bool condition, int[] a, int[] b)
{
return condition ? a[5] : b[6];
}
```
and generate the appropriate code. My understanding of the optimizations the spec allows indicates it should be.
Basically, if the range checks could be elided by the JIT, I would expect this to generate:
``` asm
add rdx, 20
lea rax, qword ptr [r8 + 24]
test cl, cl
cmovne rax, rdx
mov eax, dword ptr [rax]
ret 0
```
If they could not be elided, I would expect it to generate:
``` asm
00007FFA2B0F0570  sub rsp,28h
00007FFA2B0F0574  test cl,cl
00007FFA2B0F0576  jne 00007FFA2B0F0588
00007FFA2B0F0578  cmp dword ptr [r8+8],6
00007FFA2B0F057D  jbe 00007FFA2B0F0596
00007FFA2B0F057F  mov eax,dword ptr [r8+28h]
00007FFA2B0F0583  add rsp,28h
00007FFA2B0F0587  ret
00007FFA2B0F0588  cmp dword ptr [rdx+8],5
00007FFA2B0F058C  jbe 00007FFA2B0F0596
00007FFA2B0F058E  mov eax,dword ptr [rdx+24h]
00007FFA2B0F0591  add rsp,28h
00007FFA2B0F0595  ret
00007FFA2B0F0596  call 00007FFA8AD264D0
00007FFA2B0F059B  int 3
```
However, based on the current codegen, it seems like today it would be closer to:
``` asm
00007FFA2B0F0570  sub rsp,28h
00007FFA2B0F0574  test cl,cl
00007FFA2B0F0576  jne 00007FFA2B0F0588
00007FFA2B0F0578  cmp dword ptr [r8+8],6
00007FFA2B0F057D  jbe 00007FFA2B0F0596
add rdx, 20
lea rax, qword ptr [r8 + 24]
test cl, cl
cmovne rax, rdx
mov eax, dword ptr [rax]
00007FFA2B0F0591  add rsp,28h
00007FFA2B0F0595  ret
00007FFA2B0F0596  call 00007FFA8AD264D0
00007FFA2B0F059B  int 3
```
I believe the latter, although it uses `cmov`, is arguably worse since it forces a `NullReference` or `IndexOutOfRange` exception when `condition` is true and `b` is invalid, or when `condition` is false and `a` is invalid. But the equivalent manually inlined code does not.
I brought up the `Select` intrinsic as a way to force the JIT to produce `cmov`, regardless of what it thinks should be generated. An `ElideBoundsCheck` hint would additionally allow the range checks to be dropped, but that begins getting into new-proposal territory.
@tannergooding
For `M2`, the values are stored in locals before being stored on the stack. Additional analysis would have to be done to determine that the local exists, but is immediately consumed (and only consumed once) in a specific context and can therefore be inlined.
There is no local for `c0` in the IL, so I think the C# compiler already does that analysis in some cases. And I think the runtime can't rely on the compiler not performing that analysis for `ai` (even if it doesn't do it today).
There's also a super-convoluted case where the compiler produces the same IL with or without a local in the source today:
```c#
static int M3(bool c, int[] a, int[] b, int i)
{
return Select(a: a[i], b: b[i], c: c);
}
static int M4(bool c, int[] a, int[] b, int i)
{
var ai = a[i];
return Select(a: ai, b: b[i], c: c);
}
```
The IL is exactly the same for `M3` and `M4`, and it stores `a[i]` (and `b[i]`) in a local:
```il
IL_0000: ldarg.1
IL_0001: ldarg.3
IL_0002: ldelem.i4
IL_0003: stloc.0
IL_0004: ldarg.2
IL_0005: ldarg.3
IL_0006: ldelem.i4
IL_0007: stloc.1
IL_0008: ldarg.0
IL_0009: ldloc.0
IL_000A: ldloc.1
IL_000B: call Select<Int32>
IL_0010: ret
```
@svick, yes.
I was indicating that, when locals are involved, additional analysis for those values has to be done (something the JIT probably doesn't have time to do), so the transformation would not occur and the evaluation would happen before the method call. (There would need to be a post-compilation, but pre-JIT/AOT, IL optimization tool for these transformations.)
However, for the cases when no locals are involved, the JIT should be able to determine the value is passed directly to the method and can then attempt the appropriate transforms.
This is starting to go pretty far off the rails. Let's back up and answer @tannergooding's high-level question:
At this point I'm just trying to understand why this:
```c#
int Select(bool condition, int trueValue, int falseValue)
{
    return condition ? trueValue : falseValue;
}

int MyMethod(bool condition, int[] a, int[] b)
{
    return Select(condition, a[5], b[6]);
}
```
cannot be transformed to be:
```c#
int MyMethod(bool condition, int[] a, int[] b)
{
    return condition ? a[5] : b[6];
}
```
This transformation cannot be performed in the way that you suggest because it changes the behavior of the program. In the first case, the evaluation of `condition`, the array accesses, and any associated side-effects occur before the call to `Select`; in the second case, only one array access is performed, thus dropping the side-effects of the other access. As a rule, optimizations performed by the compiler must not affect the behavior of the compiled program. The only exceptions occur when the behavior of a program is undefined, in which case the compiler is allowed to do pretty much whatever it wants. This does not occur very often in IL, and in any case should generally be treated as a bug rather than a feature.
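The difference in evaluation order is easy to demonstrate outside of IL. Here is a C sketch (hypothetical names, with a counter standing in for the side-effects of the array accesses):

```c
static int eval_count = 0;      /* counts how many "array accesses" ran */

/* Stand-in for a[5] / b[6]: an argument expression with a side-effect. */
static int touch(int value)
{
    ++eval_count;
    return value;
}

static int select_int(int condition, int true_value, int false_value)
{
    return condition ? true_value : false_value;
}

/* Call-site form: BOTH arguments are evaluated before select_int runs. */
static int via_call(int condition)
{
    return select_int(condition, touch(4), touch(9));
}

/* Manually inlined form: only the chosen operand is evaluated. */
static int via_ternary(int condition)
{
    return condition ? touch(4) : touch(9);
}
```

Both forms return the same value, but the call form performs two evaluations where the ternary performs one, which is exactly the side-effect the transformation would drop.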
@tannergooding,
I'm just trying to understand why this ... cannot be transformed to be ... My understanding of the optimizations the spec allows, indicates it should be.
It's in the part of I.12.6.4 that you quoted. Adding emphasis (note "and exceptions"):
Conforming implementations of the CLI are free to execute programs using any technology that
guarantees, within a single thread of execution, that side-effects and exceptions generated by a
thread are visible in the order specified by the CIL. For this purpose only volatile operations
(including volatile reads) constitute visible side-effects. (Note that while only volatile operations
constitute visible side-effects, volatile operations also affect the visibility of non-volatile
references.)
The JIT must guarantee that the `NullReferenceException` or `IndexOutOfRangeException` thrown by the `ldelem` is visible.
@pgavlin, @JosephTremoulet. Thanks.
Am I correctly understanding the subsequent section on `E-relaxed` methods that, if a new `CompilationRelaxations` value was exposed, the JIT would be allowed to perform such a transformation and delay the check into the inlined code?
@tannergooding, by my reading, that would still run afoul of:
The check’s exception is thrown some time in the sequence, unless the sequence throws
another exception. When multiple relaxed checks fail, it is unspecified as to which
exception is thrown by the VES
@JosephTremoulet. Hmm, it seems that the further explanation of relaxations in Annex F specifically calls out the scenario I am interested in:
If this method is relaxed, and the caller is also relaxed, then the caller would be allowed, in the
absence of constraining data or control dependences, to interleave the call with other instructions
in the caller. If one of those other interleaved instructions faults, then any or all of M’s side
effects might be suppressed. Thus, method M should be marked as strict if it is important to
prevent a fault from destroying the invariant.

This "interference" from the caller is potentially annoying, but seems to be intrinsic to any
definition of relaxed exceptions that permits both:
- instruction reordering and
- inlined method calls are at least as fast as manual inlining.
In either case, thanks for helping me understand this better.
I'm going to create a new issue so I can continue the discussion without polluting this thread anymore.
I think, even if the runtime does not "technically" allow it today (although I think it might), it should support these types of optimizations.
@tannergooding, translating "caller" and "M" to your example, that's saying that (if both are relaxed) a fault in `MyMethod` may result in some arbitrary subset of `Select`'s side-effects being suppressed. The fault in `MyMethod` still must be made visible.
`HW_INTRINSIC`-based implementation for `CMOVnn`: https://github.com/EgorBo/runtime-1/commit/1271fe536cc3274867a7306424d45c8db76be8ca
``` c#
static int Test1(int x)
{
    return x == 42 ? 1000 : 2000;
}

static int Test2(int x, int a, int b)
{
    return x == 42 ? a : b;
}
```
``` asm
; Method Tests:Test1(int):int
G_M56601_IG01:
G_M56601_IG02:
       83F92A       cmp  ecx, 42
       B8E8030000   mov  eax, 0x3E8
       BAD0070000   mov  edx, 0x7D0
       0F45C2       cmovne eax, edx
G_M56601_IG03:
       C3           ret
; Total bytes of code: 17

; Method Tests:Test2(int,int,int):int
G_M50938_IG01:
G_M50938_IG02:
       83F92A       cmp  ecx, 42
       8BC2         mov  eax, edx
       410F45C0     cmovne eax, r8d
G_M50938_IG03:
       C3           ret
; Total bytes of code: 10
```
Works better with PGO (`COMPlus_TieredPGO=1`) 🙂:
Most helpful comment
FWIW I've made a prototype that survives diffing and testing. It detects simple half/full hammocks where the true/false BBs contain a single assignment (or a return) that involves only local variables and constants. It only works with i4 and i8 but with some care we might be able to handle small integer types.
I'll probably try to extend this to indirs though it's problematic due to side-effects. A quick and dirty experiment that survives diffing could probably show if there are enough opportunities for this to be useful.
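Concretely, the half/full hammock shapes described above correspond to source patterns like these (a C sketch with hypothetical names; the actual detection works on basic blocks, not source):

```c
/* Full hammock: both arms of the branch assign the same local, so the
   whole diamond can collapse into a single select (GT_SELECT). */
static int full_hammock(int c, int a, int b)
{
    int x;
    if (c)
        x = a;
    else
        x = b;
    return x;
}

/* Half hammock: one arm assigns, the other falls through keeping the
   initial value - a conditional move over the existing value. */
static int half_hammock(int c, int a, int b)
{
    int x = b;
    if (c)
        x = a;
    return x;
}
```

Both involve only local variables and constants, matching the single-assignment constraint the prototype imposes.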
The conversion is currently done when lowering `GT_JTRUE`. It could easily be moved to a separate phase before lowering, but I'm wondering if this shouldn't be done much earlier. After conversion we have fewer basic blocks and we may be able to eliminate some variables as well. And we may miss other optimizations if we do this conversion late - for example dotnet/runtime#6748 requires CSE.

FX diff shows a 2k reduction in code size, but it could be better. I had to turn off an and-cmp to test optimization that caused a lot of trouble, and that results in code size regressions. Also, using `cmov` can result in pretty bad code if variables are not already in registers. Something like:

Codegen for `GT_SELECT` should probably detect such cases and fall back to a `jcc` and `mov [ebp-8], 1`.

The 2 examples from the initial post generate:
The code is similar to what the native compilers generate, except for a couple of warts. The first example has a widening cast that can be easily removed. The second example has a useless `test` that could be removed as well.

It's worth noting that Clang assumes that bools are 0/1 and generates better code for the first example. I don't think we can do that; CLR bools are 0/non-zero.
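The practical consequence of that representation difference: a 0/1 bool converts to an integer with a plain move, while a 0/non-zero bool needs an explicit normalization first. A C sketch (hypothetical helper names):

```c
/* Correct for any 0/non-zero representation (the CLR guarantee):
   normalizes to exactly 0 or 1, typically a test + setne. */
static int bool_to_int_safe(unsigned char b)
{
    return b != 0;
}

/* Only valid if b is already exactly 0 or 1 - the assumption Clang
   makes for C _Bool. For b == 2 this returns 2, not 1. */
static int bool_to_int_assumed(unsigned char b)
{
    return b;
}
```

If the JIT assumed 0/1 the way Clang does, a bool holding 2 (legal under the 0/non-zero rule) would silently change the arithmetic.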
The `Math.Max` example from dotnet/runtime#4326 generates:

so we get rid of a useless jump by accident (those kinds of jumps can probably appear in other situations that aren't handled by if-conversion).
The `bool` conversion example from dotnet/runtime#4399 isn't currently detected, but a variant of it works:
``` c#
for (int i = 0; i < N; i++)
    sum = (bits[i] ? 1 : 0) + sum;
```
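Written out in C for reference (hypothetical name), the loop body is a single assignment whose ternary involves only local data, which is exactly the shape the if-conversion pass can pick up:

```c
/* Counts non-zero entries without a data-dependent branch in the
   loop body once if-converted: the ternary lowers to a setne/cmov-
   style sequence rather than a jump. */
static int count_set(const unsigned char *bits, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum = (bits[i] ? 1 : 0) + sum;
    return sum;
}
```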
The nice thing about if-conversion is that once you generate a GT_SELECT you can look around and detect all kind of patterns relatively easily.
I'll try to do some perf measurements, but the perf characteristics of converting branches are probably well known to anyone familiar with the subject. If the branch isn't predictable then you get good performance (I've seen numbers in the range 2-5x, but I have to double check). If the branch is predictable then performance may regress slightly. Maybe such conversions should be driven by profile data, but it seems that all native compilers generate `cmov` by default.