Runtime: Math.Abs is slow

Created on 10 Jan 2018 · 17 comments · Source: dotnet/runtime

Math.Abs is slow when the value is negative

```
BenchmarkDotNet=v0.10.10, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.192)
Processor=Intel Core i7-7700K CPU 4.20GHz (Kaby Lake), ProcessorCount=8
.NET Core SDK=2.1.2
  [Host]     : .NET Core 2.0.3 (Framework 4.6.25815.02), 64bit RyuJIT  [AttachedDebugger]
  DefaultJob : .NET Core 2.0.3 (Framework 4.6.25815.02), 64bit RyuJIT
```


| Method | x | Mean | Error | StdDev |
|-------- |----- |----------:|----------:|----------:|
| If | -100 | 0.2354 ns | 0.0050 ns | 0.0044 ns |
| MyAbs | -100 | 0.2350 ns | 0.0037 ns | 0.0034 ns |
| MathAbs | -100 | 1.0940 ns | 0.0019 ns | 0.0016 ns |
| If | 100 | 0.2307 ns | 0.0013 ns | 0.0011 ns |
| MyAbs | 100 | 0.2297 ns | 0.0011 ns | 0.0010 ns |
| MathAbs | 100 | 0.1939 ns | 0.0009 ns | 0.0008 ns |

```C#
public class AbsBench
{
    [Params(100, -100)]
    public int x;
    public int y;

    [Benchmark]
    public void If()
    {
        y = x < 0 ? (x == int.MinValue ? throw new OverflowException() : -x) : x;
    }

    [Benchmark]
    public void MyAbs()
    {
        y = AbsHelper.Abs(x);
    }

    [Benchmark]
    public void MathAbs()
    {
        y = Math.Abs(x);
    }
}

public static class AbsHelper
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int Abs(int x)
        => x < 0 ? (x == int.MinValue ? throw new OverflowException() : -x) : x;
}
```
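For anyone wanting to reproduce the numbers, a minimal entry point for running the benchmark with BenchmarkDotNet might look like the sketch below (the `Program` class is mine, not part of the original report):

```C#
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main()
    {
        // Discovers the [Benchmark] methods on AbsBench and prints the summary table.
        BenchmarkRunner.Run<AbsBench>();
    }
}
```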
Labels: tenet-performance, up-for-grabs

All 17 comments

@tannergooding

This is probably because Math.Abs is written in two parts:

```C#
public static int Abs(int value)
{
    return (value >= 0) ? value : AbsHelper(value);
}

private static int AbsHelper(int value)
{
    Debug.Assert(value < 0, "AbsHelper should only be called for negative values! (workaround for JIT inlining)");

    if (value == int.MinValue)
    {
        throw new OverflowException(SR.Overflow_NegateTwosCompNum);
    }

    return -value;
}
```

Based on the assert, I would guess that having it all in one part would either prevent inlining or result in too much code bloat when it is inlined (possibly due to the thrown exception)

`throw new OverflowException(SR.Overflow_NegateTwosCompNum);` is probably going to stop it from inlining; could push it out to a non-returning method?

Yeah. Something like:

```C#
public static int Abs(int value)
{
    if (value >= 0)
    {
        return value;
    }
    else if (value == int.MinValue)
    {
        AbsHelper();
    }

    return -value;
}

private static void AbsHelper()
{
    throw new OverflowException(SR.Overflow_NegateTwosCompNum);
}
```

@AndyAyersMS is it still necessary to move "throw" out of a method to make it inlinable? Is it simply that it is slightly less IL to call a method rather than load the string, construct the exception and throw it?

is it still necessary to move "throw" out of a method to make it inlinable?

The main reason to move the throw out is not to make the method inlinable but to avoid bloating the caller method with all the code that throw generates.

We should probably change Math.Abs to something like:

```C#
static int Abs(int value)
{
    if (value < 0)
    {
        value = -value;

        if (value < 0)
        {
            AbsHelper();
        }
    }

    return value;
}
```

This produces the smallest amount of code with the current JIT and there's some room for improvement as well. (Negating `int.MinValue` wraps back to `int.MinValue`, so the second `value < 0` check is what catches the overflow case.)

The current implementation is curious. Someone automagically decided that negative numbers are uncommon and thus only the positive case should be handled efficiently.

The general guidance is to separate out throws into helper methods that do the work of creating the exception object and any related data (eg formatted exception messages) and then unconditionally throw. The jit's inline performance heuristic will then block inlining of the helper. This has a number of performance benefits:

  • overall code size savings when there are multiple callers or callers with multiple throw sites
  • call sites to helper are considered "rare" and so moved into the caller's cold code region
  • helper IL is only jitted if an exception is about to be thrown, so the caller jits faster
  • caller's prolog/epilog may be simplified with fewer register saves/restores

Native codegen for exception throws that use resource-based strings is surprisingly large.

There is no "correctness" reason preventing methods with throws from being inlined, and methods that conditionally throw (like the original AbsHelper above) may end up getting inlined, as they might contain a mixture of hot and cold code. Methods that unconditionally throw are much less likely to contain any hot code.
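To make that guidance concrete, here is a minimal sketch of the pattern (the helper name and message are illustrative, not the actual corelib code):

```C#
public static int Abs(int value)
{
    if (value < 0)
    {
        value = -value;

        // Second sign check catches value == int.MinValue (negation wrapped).
        if (value < 0)
        {
            ThrowNegateTwosCompOverflow();
        }
    }

    return value;
}

// Unconditionally throws, so the inline heuristic treats it as rare/cold and the
// exception-construction code (and its message string) stays out of every caller.
private static void ThrowNegateTwosCompOverflow()
    => throw new OverflowException("Negating the minimum value of a twos complement number is invalid.");
```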

Another funny thing about the current implementation is that it doesn't quite save code size as one may hope. Abs gets inlined and AbsHelper does not. Too bad, because of the AbsHelper call you may end up with larger function prolog/epilog in some cases:
```C#
static int Test(int[] a)
{
    int s = 0;
    foreach (int x in a)
        s += Math.Abs(x);
    return s;
}
```

```asm
G_M33786_IG01:
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC28             sub      rsp, 40
G_M33786_IG02:
       33F6                 xor      esi, esi
       488BF9               mov      rdi, rcx
       33DB                 xor      ebx, ebx
       8B6F08               mov      ebp, dword ptr [rdi+8]
       85ED                 test     ebp, ebp
       7E1C                 jle      SHORT G_M33786_IG06
G_M33786_IG03:
       4863CB               movsxd   rcx, ebx
       8B4C8F10             mov      ecx, dword ptr [rdi+4*rcx+16]
; begin abs
       85C9                 test     ecx, ecx
       7D07                 jge      SHORT G_M33786_IG04
       E802E26D5C           call     System.Math:AbsHelper(int):int
       EB02                 jmp      SHORT G_M33786_IG05
G_M33786_IG04:
       8BC1                 mov      eax, ecx
G_M33786_IG05:
; end abs
       03F0                 add      esi, eax
       FFC3                 inc      ebx
       3BEB                 cmp      ebp, ebx
       7FE4                 jg       SHORT G_M33786_IG03
G_M33786_IG06:
       8BC6                 mov      eax, esi
G_M33786_IG07:
       4883C428             add      rsp, 40
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret
; Total bytes of code 61, prolog size 8 for method C:Test(ref):int
```

The Abs version I posted above needs `AggressiveInlining` to inline and then produces code that's only one byte larger:

```asm
G_M33787_IG01:
       4883EC28             sub      rsp, 40
G_M33787_IG02:
       33C0                 xor      eax, eax
       488BD1               mov      rdx, rcx
       33C9                 xor      ecx, ecx
       448B4208             mov      r8d, dword ptr [rdx+8]
       4585C0               test     r8d, r8d
       7E1F                 jle      SHORT G_M33787_IG05
G_M33787_IG03:
       4C63C9               movsxd   r9, ecx
       468B4C8A10           mov      r9d, dword ptr [rdx+4*r9+16]
; begin abs
       4585C9               test     r9d, r9d
       7D08                 jge      SHORT G_M33787_IG04
       41F7D9               neg      r9d
       4585C9               test     r9d, r9d
       7C0F                 jl       SHORT G_M33787_IG06
; end abs
G_M33787_IG04:
       4103C1               add      eax, r9d
       FFC1                 inc      ecx
       443BC1               cmp      r8d, ecx
       7FE1                 jg       SHORT G_M33787_IG03
G_M33787_IG05:
       4883C428             add      rsp, 40
       C3                   ret
G_M33787_IG06:
; begin abs
       E8B3FBFFFF           call     C:AbsHelper()
       CC                   int3
; end abs
; Total bytes of code 62, prolog size 4 for method C:Test(ref):int
```

And I think it's possible to get rid of the test that follows neg and then this code would end up being 2 bytes shorter.

@mikedn Is there an open issue already for that unnecessary neg+test codegen (and inc+test, etc)?

Is there an open issue already for that unnecessary neg+test codegen (and inc+test, etc)?

Hmm, afaik there's nothing related to this specific pattern. It's problematic because the result of neg is stored in a variable and that prevents test from "seeing" what's there.

Such patterns often occur when the first computation (neg/inc/dec/sub...) is stored in a variable and then followed by a related test/cmp. Otherwise the IL would only contain a cmp/test generating instruction in the first place (assuming properly optimized code).

Such patterns often occur when the first computation (neg/inc/dec/sub...) is stored in a variable and then followed by a related test/cmp.

Apart from the issue with the intermediary variables there's another issue with these patterns - many are not valid when integer overflow has well-defined behavior, as it does in C# and IL. I already screwed this up once and reverted the change, see https://github.com/dotnet/coreclr/issues/14321 for more details. And I'm pretty sure neg + test + jl is in the same boat, it can be done but we need to generate neg + js instead of neg + jl.
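To illustrate why `jl` after `neg` would be wrong here (my example, not from the thread): in unchecked C#/IL, negation of `int.MinValue` wraps, so the source-level sign check must still observe a negative result:

```C#
int x = int.MinValue;
int y = -x;               // unchecked negation wraps: y == int.MinValue

Console.WriteLine(y < 0); // True
// x86 `neg` on int.MinValue sets both SF and OF, so `jl` (taken when SF != OF)
// would not take the branch, while `js` (taken when SF == 1) matches the C# result.
```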

Feel free to open an issue in the coreclr repository. But it would be nice to include some actual code examples where these optimizations are valid and potentially useful.

Note that the JIT already handles cases involving == and != (e.g. -x == 0, x + 2 != 0 etc.). Though these too are blocked if intermediary variables are present...

@mikedn @benaadams It was mentioned earlier that this code path is only optimized for the non-negative input case. Would branchless code like the below address this?

```C#
public static int Abs(int value) {
    int sign = (value >> 31); // -1 if negative; 0 if non-negative
    value ^= sign;            // ~value if negative; value if non-negative
    value -= sign;            // ~value + 1 (= -value) if negative; value if non-negative
    if (value < 0) { /* throw - should pretty much always evaluate to false in branch predictor */ }
    return value;
}
```
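As a quick sanity check of the bit trick (an illustrative snippet, not part of the proposal):

```C#
int value = -100;
int sign = value >> 31;                    // arithmetic shift: -1 for negatives, 0 otherwise
Console.WriteLine(value ^ sign);           // 99  == ~(-100)
Console.WriteLine((value ^ sign) - sign);  // 100 == ~(-100) + 1 == -(-100)
```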

In my own testing this has a slight perf hit over the merged implementation of Math.Abs when the branch predictor can perfectly guess whether the input value is negative. It has a significant perf gain over the merged implementation for random input data.

In my own testing this has a slight perf hit over the merged implementation of Math.Abs when the branch predictor can perfectly guess whether the input value is negative. It has a significant perf gain over the merged implementation for random input data.

That's a complicated problem.

Yes, if you run into branch mispredictions then the performance goes down the drain. Yes, in many cases the perf hit of the branchless version is small (though it's there pretty much all the time). But it's also likely that you can construct examples where the branchless version's perf hit is higher. I've seen that with CMOV - 1.5x slower and even 2x slower. This already makes things problematic.

And then the branch version is also small. And it may be even smaller with an additional JIT optimization.

And it's more likely that the JIT can understand it and do something with it (whatever that is, including applying some kind of if-conversion). It's less likely that the JIT will be able to do anything useful with the branchless version. It certainly won't be able to convert it back to the branch version if it somehow figures out that the branch version won't suffer from mispredictions.

In the best case the JIT learns how to do proper PGO and decides to if-convert the branch code based on that (though even then you may have cases where the input data is random now and non-random next time). Failing that, the JIT might blindly decide to if-convert the branch version, at least that will produce slightly shorter code than the simulated version (I think, I haven't actually checked).

I've seen that with CMOV - 1.5x slower and even 2x slower.

Hmm, maybe it's not that bad. 1.5x - 2x tends to happen with max patterns because there the branch version can have the fortunate effect of breaking long dependency chains. It seems that abs doesn't suffer from this particular problem: the branch version can't break a dependency chain, and the branchless version should not add more than 2 cycles to an existing dependency chain, when done right.

In any case, this needs to be approached from the JIT side first since it could probably generate a shorter sequence such as neg/cmovs. Though this still needs a temp register, unlike the branch version...
