Math.Abs is slow when the value is negative
BenchmarkDotNet=v0.10.10, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.192)
Processor=Intel Core i7-7700K CPU 4.20GHz (Kaby Lake), ProcessorCount=8
.NET Core SDK=2.1.2
[Host] : .NET Core 2.0.3 (Framework 4.6.25815.02), 64bit RyuJIT [AttachedDebugger]
DefaultJob : .NET Core 2.0.3 (Framework 4.6.25815.02), 64bit RyuJIT
| Method | x | Mean | Error | StdDev |
|-------- |----- |----------:|----------:|----------:|
| If | -100 | 0.2354 ns | 0.0050 ns | 0.0044 ns |
| MyAbs | -100 | 0.2350 ns | 0.0037 ns | 0.0034 ns |
| MathAbs | -100 | 1.0940 ns | 0.0019 ns | 0.0016 ns |
| If | 100 | 0.2307 ns | 0.0013 ns | 0.0011 ns |
| MyAbs | 100 | 0.2297 ns | 0.0011 ns | 0.0010 ns |
| MathAbs | 100 | 0.1939 ns | 0.0009 ns | 0.0008 ns |
public class AbsBench
{
[Params(100,-100)]
public int x;
public int y;
[Benchmark]
public void If()
{
y = x < 0 ? (x == int.MinValue ? throw new OverflowException() : -x) : x;
}
[Benchmark]
public void MyAbs()
{
y = AbsHelper.Abs(x);
}
[Benchmark]
public void MathAbs()
{
y = Math.Abs(x);
}
}
public static class AbsHelper
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int Abs(int x)
=> x < 0 ? (x == int.MinValue ? throw new OverflowException() : -x) : x;
}
@tannergooding
This is probably because Math.Abs
is written in two parts:
```C#
public static int Abs(int value)
{
return (value >= 0) ? value : AbsHelper(value);
}
private static int AbsHelper(int value)
{
Debug.Assert(value < 0, "AbsHelper should only be called for negative values! (workaround for JIT inlining)");
if (value == int.MinValue)
{
throw new OverflowException(SR.Overflow_NegateTwosCompNum);
}
return -value;
}
```
Based on the assert, I would guess that having it all in one part would either prevent inlining or result in too much code bloat when it is inlined (possibly due to the thrown exception)
throw new OverflowException(SR.Overflow_NegateTwosCompNum);
is probably going to stop it from inlining; could push it out to a non returning method?
Yeah. Something like:
```C#
public static int Abs(int value)
{
if (value >= 0)
{
return value;
}
else if (value == int.MinValue)
{
AbsHelper();
}
return -value;
}
private static void AbsHelper()
{
throw new OverflowException(SR.Overflow_NegateTwosCompNum);
}
```
@AndyAyersMS is it still necessary to move "throw" out of a method to make it inlinable? Is it simply that it is slightly less IL to call a method rather than load the string, construct the exception and throw it?
is it still necessary to move "throw" out of a method to make it inlinable?
The main reason to move the throw
out is not to make the method inlinable but to avoid bloating the caller method with all the code that throw
generates.
We should probably change Math.Abs
to something like
C#
static int Abs(int value)
{
if (value < 0)
{
value = -value;
if (value < 0)
{
AbsHelper();
}
}
return value;
}
This produces the smallest amount of code with the current JIT and there's some room for improvement as well.
The current implementation is curious. Someone automagically decided that negative numbers are uncommon and thus only the positive case should be handled efficiently 馃槙
The general guidance is to separate out throws into helper methods that do the work of creating the exception object and any related data (eg formatted exception messages) and then unconditionally throw. The jit's inline performance heuristic will then block inlining of the helper. This has a number of performance benefits:
Native codegen for exception throws that use resource based strings is surprisingly large.
There is no "correctness" reason preventing methods with throws from being inlined, and methods that conditionally throw (like the original AbsHelper
above) may end up getting inlined, as they might contain a mixture of hot and cold code. Methods that unconditionally throw are much less likely to contain any hot code.
Another funny thing about the current implementation is that it doesn't quite save code size as one may hope. Abs
gets inlined and AbsHelper
does not. Too bad, because of the AbsHelper
call you may end up with larger function prolog/epilog in some cases:
```C#
static int Test(int[] a)
{
int s = 0;
foreach (int x in a)
s += Math.Abs(x);
return s;
}
```asm
G_M33786_IG01:
57 push rdi
56 push rsi
55 push rbp
53 push rbx
4883EC28 sub rsp, 40
G_M33786_IG02:
33F6 xor esi, esi
488BF9 mov rdi, rcx
33DB xor ebx, ebx
8B6F08 mov ebp, dword ptr [rdi+8]
85ED test ebp, ebp
7E1C jle SHORT G_M33786_IG06
G_M33786_IG03:
4863CB movsxd rcx, ebx
8B4C8F10 mov ecx, dword ptr [rdi+4*rcx+16]
; begin abs
85C9 test ecx, ecx
7D07 jge SHORT G_M33786_IG04
E802E26D5C call System.Math:AbsHelper(int):int
EB02 jmp SHORT G_M33786_IG05
G_M33786_IG04:
8BC1 mov eax, ecx
G_M33786_IG05:
; end abs
03F0 add esi, eax
FFC3 inc ebx
3BEB cmp ebp, ebx
7FE4 jg SHORT G_M33786_IG03
G_M33786_IG06:
8BC6 mov eax, esi
G_M33786_IG07:
4883C428 add rsp, 40
5B pop rbx
5D pop rbp
5E pop rsi
5F pop rdi
C3 ret
; Total bytes of code 61, prolog size 8 for method C:Test(ref):int
The Abs
version I posted above needs AggressiveInlining
to inline and then produces code that's only one byte larger:
G_M33787_IG01:
4883EC28 sub rsp, 40
G_M33787_IG02:
33C0 xor eax, eax
488BD1 mov rdx, rcx
33C9 xor ecx, ecx
448B4208 mov r8d, dword ptr [rdx+8]
4585C0 test r8d, r8d
7E1F jle SHORT G_M33787_IG05
G_M33787_IG03:
4C63C9 movsxd r9, ecx
468B4C8A10 mov r9d, dword ptr [rdx+4*r9+16]
; begin abs
4585C9 test r9d, r9d
7D08 jge SHORT G_M33787_IG04
41F7D9 neg r9d
4585C9 test r9d, r9d
7C0F jl SHORT G_M33787_IG06
; end abs
G_M33787_IG04:
4103C1 add eax, r9d
FFC1 inc ecx
443BC1 cmp r8d, ecx
7FE1 jg SHORT G_M33787_IG03
G_M33787_IG05:
4883C428 add rsp, 40
C3 ret
G_M33787_IG06:
; begin abs
E8B3FBFFFF call C:AbsHelper()
CC int3
; end abs
; Total bytes of code 62, prolog size 4 for method C:Test(ref):int
And I think it's possible to get rid of the test
that follows neg
and then this code would end up being 2 bytes shorter.
@mikedn Is there there an open issue already for that unnecessary neg+test
codegen (and inc+test
, etc)?
Is there there an open issue already for that unnecessary neg+test codegen (and inc+test, etc)?
Hmm, afaik there's nothing related to this specific pattern. It's problematic because the result of neg
is stored in a variable and that prevents test
from "seeing" what's there.
Such patterns often occur when the first computation (neg/inc/dec/sub...) is stored in a variable and then followed by a related test/cmp. Otherwise the IL would only contain a cmp/test generating instruction in the first place (assuming properly optimized code).
Such patterns often occur when the first computation (neg/inc/dec/sub...) is stored in a variable and then followed by a related test/cmp.
Apart from the issue with the intermediary variables there's another issue with these patterns - many are not valid when integer overflow has well defined behavior as it has in C# and IL. I already screwed up this once and reverted the change, see https://github.com/dotnet/coreclr/issues/14321 for more details. And I'm pretty sure neg
+ test
+ jl
is in the same boat, it can be done but we need to generate neg
+ js
instead of neg
+ jl
.
Feel free to open an issue in the coreclr repository. But it would be nice to include some actual code examples where these optimizations are valid and potentially useful.
Note that the JIT already handles cases involving ==
and !=
(e.g. -x == 0
, x + 2 != 0
etc.). Though these are too blocked if intermediary variables are present...
@mikedn @benaadams It was mentioned earlier that this code path is only optimized for the non-negative input case. Would branchless code like the below address this?
public static int Abs(int value) {
int sign = (value >> 31); // -1 if negative; 0 if non-negative
value ^= sign; // ~value if negative; value if non-negative
value -= sign; // ~value + 1 (= -value) if negative; value if non-negative
if (value < 0) { /* throw - should pretty much always evaluate to false in branch predictor */ }
return value;
}
In my own testing this has a slight perf hit over the merged implementation of Math.Abs
when the branch predictor can perfectly guess whether the input value is negative. It has a significant perf gain over the merged implementation for random input data.
In my own testing this has a slight perf hit over the merged implementation of Math.Abs when the branch predictor can perfectly guess whether the input value is negative. It has a significant perf gain over the merged implementation for random input data.
That's a complicated problem.
Yes, if you run into branch mispredictions then the performance goes down the drain. Yes, in many cases the perf hit of the branchless version is small (though it's there pretty much all the time). But it's also likely that you can construct examples where the branchless version perf hit is higher. I've seen that with CMOV - 1.5x slower and even 2x slower. This already makes thing problematic.
And then the branch version is also small. And it may be even smaller with an additional JIT optimization.
And it's more likely that the JIT can understand it and do something with it (whatever that is, including applying some kind of if-conversion). It's less likely that the JIT will be able to do anything useful with the branchless version. It certainly won't be able to convert it back to the branch version if it somehow figures out that the branch version won't suffer from mispredictions.
In the best case the JIT learns how to do proper PGO and decides to if-convert the branch code based on that (though even then you may have cases where the input data is random now and non-random next time). Failing that, the JIT might blindly decide to if-convert the branch version, at least that will produce slightly shorter code than the simulated version (I think, I haven't actually checked).
I've seen that with CMOV - 1.5x slower and even 2x slower.
Hmm, maybe it's not that bad. 1.5x - 2x tends to happen with max
patterns because there the branch version can have the fortunate effect of breaking long dependency chains. It seems that abs
doesn't have suffer from this particular problem, the branch version can't break a dependency chain and the branchless version should not add more than 2 cycles to an existing dependency chain, when done right.
In any case, this needs to be approached from the JIT side first since it could probably generate a shorter sequence such as neg
/cmovs
. Though this still needs a temp register, unlike the branch version...
Most helpful comment
The general guidance is to separate out throws into helper methods that do the work of creating the exception object and any related data (eg formatted exception messages) and then unconditionally throw. The jit's inline performance heuristic will then block inlining of the helper. This has a number of performance benefits:
Native codegen for exception throws that use resource based strings is surprisingly large.
There is no "correctness" reason preventing methods with throws from being inlined, and methods that conditionally throw (like the original
AbsHelper
above) may end up getting inlined, as they might contain a mixture of hot and cold code. Methods that unconditionally throw are much less likely to contain any hot code.