Runtime: Performance issues for Math.Min, Math.Max for double/float

Created on 2 Mar 2020 · 24 comments · Source: dotnet/runtime

Issue Title

Using Math.Min in a loop is almost 4 times slower than before.

General

While porting our libraries to .NET Core 3.1, we noticed performance issues with Math.Min. Min is called in a loop with millions of iterations, and the loop execution was almost 4 times slower than before.

I looked at the source code and realized that the AggressiveInlining option (MethodImplAttribute) is missing on Math.Min and Math.Max for floating-point arguments.
Of course the function has changed to be compliant with the new IEEE standard, but I guess it can still be inlined.

area-System.Numerics tenet-performance up-for-grabs

All 24 comments

cc: @tannergooding

Marked as up-for-grabs as it should be a trivial fix.

We should also add tests covering Min/Max to dotnet/performance. @adamsitnik might be able to help with this side of things.

Our existing math tests are here: https://github.com/dotnet/performance/tree/master/src/benchmarks/micro/runtime/Math/Functions

Math.Min(double, double) produces quite a lot of asm (see methods M1 and M2). Do we really want to force the inline?


A thought:
If we know from the use case that neither the first nor the second argument will be NaN, there could be a "fast" version that emits minsd (on x86) for full speed (method M0 should ideally emit minsd).
The JIT can be tricked into this by using hw-intrinsics, but it should do this by itself.

For compliance with IEEE 754:2019 we have Math.Min.


add tests covering Min/Max to dotnet/performance

I'll open a PR over there.

The asm isn't quite as bad when you look at the x64 version.

The main bloat comes from the IsNaN check, which was cleaned up back in October and should be much smaller in .NET 5 (which sharplab doesn't support right now).

If we know from the use case that neither the first nor the second argument will be NaN, there could be a "fast" version that emits minsd (on x86) for full speed (method M0 should ideally emit minsd).

This isn't a valid transformation because minsd doesn't correctly handle 0.0 vs -0.0; it always returns the second operand, even when IEEE 754 requires that -0.0 be returned.

I had previously implemented a branchless version of this code which uses intrinsics, but the results weren't consistent on all platforms: https://github.com/dotnet/coreclr/pull/22965 (this was also a version which didn't propagate NaN, so it would need to be tweaked slightly).

We might also be able to gain some perf by ensuring the non-zero/non-NaN case is the "predicted" path, if the former isn't viable.

I tried the code with current _master_ (from a checked build).


C# code (just copied together)

```c#
using System.Runtime.CompilerServices;

namespace ConsoleApp4
{
    class Program
    {
        static int Main(string[] args)
        {
            double a = 3d;
            double b = 4d;
            double min = Do(a, b);

            return min == a ? 0 : 1;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static double Do(double a, double b)
        {
            return Min(a, b);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static double Min(double val1, double val2)
        {
            // This matches the IEEE 754:2019 `minimum` function
            //
            // It propagates NaN inputs back to the caller and
            // otherwise returns the lesser of the inputs. It
            // treats +0 as greater than -0 as per the specification.

            if ((val1 < val2) || IsNaN(val1))
            {
                return val1;
            }

            if (val1 == val2)
            {
                return IsNegative(val1) ? val1 : val2;
            }

            return val2;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static unsafe bool IsNaN(double d)
        {
            // A NaN will never equal itself so this is an
            // easy and efficient way to check for NaN.

#pragma warning disable CS1718
            return d != d;
#pragma warning restore CS1718
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static unsafe bool IsNegative(double d)
        {
            return DoubleToInt64Bits(d) < 0;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static unsafe long DoubleToInt64Bits(double value)
        {
            return *((long*)&value);
        }
    }
}
```

It is quite "jumpy".
```asm
; Assembly listing for method ConsoleApp4.Program:Do(double,double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  9,  5   )  double  ->  mm0
;  V01 arg1         [V01,T01] (  6,  4.25)  double  ->  mm1
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T02] (  5,  3   )  double  ->  mm0         "Inline return value spill temp"
;  V04 tmp2         [V04,T03] (  2,  1   )  double  ->  [rbp-0x08]   do-not-enreg[F] ld-addr-op "Inlining Arg"
;
; Lcl frame size = 16

G_M65265_IG01:
       55                   push     rbp
       4883EC10             sub      rsp, 16
       C5F877               vzeroupper
       488D6C2410           lea      rbp, [rsp+10H]
                        ;; bbWeight=1    PerfScore 2.75
G_M65265_IG02:
       C5F92EC8             vucomisd xmm1, xmm0
       770A                 ja       SHORT G_M65265_IG04
                        ;; bbWeight=1    PerfScore 2.00
G_M65265_IG03:
       C5F92EC0             vucomisd xmm0, xmm0
       7A03                 jp       SHORT G_M65265_IG04
       7403                 je       SHORT G_M65265_IG05
                        ;; bbWeight=0.25 PerfScore 0.75
G_M65265_IG04:
       EB24                 jmp      SHORT G_M65265_IG09
                        ;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG05:
       C5F92EC1             vucomisd xmm0, xmm1
       7A19                 jp       SHORT G_M65265_IG08
       7517                 jne      SHORT G_M65265_IG08
       C5FB1145F8           vmovsd   qword ptr [rbp-08H], xmm0
       48837DF800           cmp      qword ptr [rbp-08H], 0
       7C09                 jl       SHORT G_M65265_IG07
                        ;; bbWeight=0.25 PerfScore 1.38
G_M65265_IG06:
       C5F828C1             vmovaps  xmm0, xmm1
       EB08                 jmp      SHORT G_M65265_IG09
                        ;; bbWeight=0.50 PerfScore 1.12
G_M65265_IG07:
       EB05                 jmp      SHORT G_M65265_IG09
                        ;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG08:
       C5F828C1             vmovaps  xmm0, xmm1
                        ;; bbWeight=0.50 PerfScore 0.12
G_M65265_IG09:
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret
                        ;; bbWeight=1    PerfScore 2.00

; Total bytes of code 67, prolog size 13, PerfScore 19.43, (MethodHash=0702010e) for method ConsoleApp4.Program:Do(double,double):double
; ============================================================
```

The stack work for DoubleToInt64Bits isn't needed (it could be a movq); this is already tracked by https://github.com/dotnet/runtime/issues/12733#issuecomment-495709826.
A workaround is easy.

I'll try something to improve this...

An attempt for better codegen for Math.Min(double, double) is in https://github.com/dotnet/runtime/compare/master...gfoidl:math-max-min-double

During the work I noticed that some test cases in https://github.com/dotnet/runtime/blob/4f9ae42d861fcb4be2fcd5d3d55d5f227d30e723/src/libraries/System.Runtime.Extensions/tests/System/Math.cs#L1150 are missing for the double.NaN case (or I just missed other unit tests for Math.Min).

With the latest solution codegen looks pretty good.


codegen for current master

```asm
; Assembly listing for method ConsoleApp4.Program:Do(double,double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  9,  5   )  double  ->  mm0        
;  V01 arg1         [V01,T01] (  6,  4.25)  double  ->  mm1        
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T02] (  5,  3   )  double  ->  mm0         "Inline return value spill temp"
;  V04 tmp2         [V04,T03] (  2,  1   )  double  ->  [rbp-0x08]   do-not-enreg[F] ld-addr-op "Inlining Arg"
;
; Lcl frame size = 16

G_M65265_IG01:
       55                   push     rbp
       4883EC10             sub      rsp, 16
       C5F877               vzeroupper 
       488D6C2410           lea      rbp, [rsp+10H]
                        ;; bbWeight=1    PerfScore 2.75
G_M65265_IG02:
       C5F92EC8             vucomisd xmm1, xmm0
       770A                 ja       SHORT G_M65265_IG04
                        ;; bbWeight=1    PerfScore 2.00
G_M65265_IG03:
       C5F92EC0             vucomisd xmm0, xmm0
       7A03                 jp       SHORT G_M65265_IG04
       7403                 je       SHORT G_M65265_IG05
                        ;; bbWeight=0.25 PerfScore 0.75
G_M65265_IG04:
       EB24                 jmp      SHORT G_M65265_IG09
                        ;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG05:
       C5F92EC1             vucomisd xmm0, xmm1
       7A19                 jp       SHORT G_M65265_IG08
       7517                 jne      SHORT G_M65265_IG08
       C5FB1145F8           vmovsd   qword ptr [rbp-08H], xmm0
       48837DF800           cmp      qword ptr [rbp-08H], 0
       7C09                 jl       SHORT G_M65265_IG07
                        ;; bbWeight=0.25 PerfScore 1.38
G_M65265_IG06:
       C5F828C1             vmovaps  xmm0, xmm1
       EB08                 jmp      SHORT G_M65265_IG09
                        ;; bbWeight=0.50 PerfScore 1.12
G_M65265_IG07:
       EB05                 jmp      SHORT G_M65265_IG09
                        ;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG08:
       C5F828C1             vmovaps  xmm0, xmm1
                        ;; bbWeight=0.50 PerfScore 0.12
G_M65265_IG09:
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret      
                        ;; bbWeight=1    PerfScore 2.00

; Total bytes of code 67, prolog size 13, PerfScore 19.43, (MethodHash=0702010e) for method ConsoleApp4.Program:Do(double,double):double
; ============================================================
```



codegen for 1af43447e2da78426c31030f6a5eb9e6c930876c

```asm
; Assembly listing for method ConsoleApp4.Program:Do(double,double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T03] (  3,  3   )  double  ->  mm0        
;  V01 arg1         [V01,T02] (  5,  3.50)  double  ->  mm1        
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T01] (  8,  8.50)  double  ->  mm0         "Inlining Arg"
;  V04 tmp2         [V04,T00] (  2,  0.50)    long  ->  rax         "Inline return value spill temp"
;* V05 tmp3         [V05    ] (  0,  0   )  double  ->  zero-ref    ld-addr-op "Inlining Arg"
;
; Lcl frame size = 0

G_M65265_IG01:
       55                   push     rbp
       C5F877               vzeroupper 
       488BEC               mov      rbp, rsp
                        ;; bbWeight=1    PerfScore 2.25
G_M65265_IG02:
       C5F92EC8             vucomisd xmm1, xmm0
       7725                 ja       SHORT G_M65265_IG05
                        ;; bbWeight=1    PerfScore 2.00
G_M65265_IG03:
       C5F92EC0             vucomisd xmm0, xmm0
       7A1E                 jp       SHORT G_M65265_IG05
       C5F92EC1             vucomisd xmm0, xmm1
       7A13                 jp       SHORT G_M65265_IG04
       7511                 jne      SHORT G_M65265_IG04
       C5F828D0             vmovaps  xmm2, xmm0
       C4E1F97ED0           vmovd    rax, xmm2
       4885C0               test     rax, rax
       7C08                 jl       SHORT G_M65265_IG05
                        ;; bbWeight=0.25 PerfScore 1.88
G_M65265_IG04:
       C5F828C1             vmovaps  xmm0, xmm1
                        ;; bbWeight=0.25 PerfScore 0.06
G_M65265_IG05:
       5D                   pop      rbp
       C3                   ret      
                        ;; bbWeight=1    PerfScore 1.50

; Total bytes of code 47, prolog size 7, PerfScore 12.89, (MethodHash=0702010e) for method ConsoleApp4.Program:Do(double,double):double
; ============================================================
```

What I don't like about this solution are the gotos, and it will be pretty fragile as the JIT's codegen improves, since ideally the JIT should lay out the method that way by itself.

With regard to branch prediction, I also tried to split Math.Min(double, double) into a fast path and a cold path -- see https://github.com/dotnet/runtime/commit/5bfaf8093224ec5ce5c436838b362ab97f70856c (please ignore the redundant check in the slow method).
This should bring an improvement iff val1 < val2, but not in the general case.

Any thoughts on this?
Maybe we should revisit https://github.com/dotnet/coreclr/pull/22965 and support for NaN-propagation?

are missing for the double.NaN case (or I just missed other unit tests for Math.Min).

We have `non-NaN, NaN returns NaN` covered by `+Inf, NaN`. We are missing `NaN, non-NaN returns NaN`. However, having tests that also cover non-infinity values is beneficial anyway. So thanks for catching this!

An attempt for better codegen for Math.Min(double, double) is in master...gfoidl:math-max-min-double

Unfortunately this version doesn't handle NaN correctly and will return 1.0 for NaN, 1.0 when it should return NaN.

What I don't like about this solution are the gotos, and it will be pretty fragile as the JIT's codegen improves, since ideally the JIT should lay out the method that way by itself.

There are a few other places this is done in the framework for similar reasons.

Maybe we should revisit dotnet/coreclr#22965 and support for NaN-propagation?

In order to correctly implement the function, we have to pick the lesser of x or y, treating -0 as smaller than +0 and propagating NaN. That means we will either have branch-heavy code -or- branchless logic -or- possibly a hybrid approach.

It would likely be good to get implementations for all three cases done and then run benchmarks on a few machines.

With regard to the hybrid approach, I mean that minsd is safe to use if we know that neither input is NaN and the inputs aren't both 0; or that we might be able to make the branchless logic smaller if we assume NaN, or both inputs being 0, is an edge case.

I can continue next week... I am out of office the next days.

@tannergooding side question: should the workaround for BitConverter double <-> long be handled separately? (Cf. https://github.com/dotnet/runtime/issues/12733#issuecomment-495709826).
I would do so, to keep the changes focused...

In https://github.com/gfoidl/runtime/commit/b2fe76da672858479d57c5722c9c6aae82d43bff I added more tests for the edge cases.

Tested variants are here:

  • default (current master)
  • variant from https://github.com/dotnet/runtime/issues/33057#issuecomment-594614187
  • a version based on https://github.com/dotnet/coreclr/pull/22965

All three pass the tests (on Intel x64).
Benchmark results (win-x64) and generated asm (linux-x64) are in this folder.

The raytracer -- as mentioned in https://github.com/dotnet/runtime/issues/12159 -- is hacked together and gives the following numbers on my machine (Intel x64):

Default:            48021 ms
InlinedOptimized:   38675 ms
Vectorized:         47637 ms

(Quite interesting that the branchless vectorized version doesn't perform better (on my machine) -- I didn't investigate why...maybe later)

Unfortunately I don't have access to other CPU platforms (AMD, ARM) to run tests on.
Are there any other real-world benchmarks we can run for min/max?

So based on these numbers, I would go with the InlinedOptimized variant.

Thank you for this change. I'm sure it will help.
I see this class has an IsNaN implementation, and I'm not sure about it, since it differs from double.IsNaN. The only benefit I see is inlining, but I'm unsure whether Math.IsNaN(double.NaN) behaves correctly.


this class has an IsNaN implementation

It's just copied over from double.IsNaN, along with other methods, for easier tweaking of the code.

should the workaround for BitConverter double <-> long be handled separate? (Cf. #12733 (comment)).

Yes, it should be a separate PR with its own jit-diff and potentially perf data.

So based on these numbers, I would go with the InlinedOptimized variant.

Thanks @gfoidl, I'm going to run some numbers on my AMD and Intel boxes in a bit and will share back as well.

Variants tested and disassembly: https://gist.github.com/tannergooding/d81a6cd7530ec1cdc27e08530922f02a

Notably, there are a few places where codegen could be improved; but inlining and reordering things to assume inputs aren't the same or NaN provides the easiest win.

Ryzen 3900X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT

| Method | Mean | Error | StdDev |
|--------------------------------------- |---------:|----------:|----------:|
| MathMin | 9.383 us | 0.0170 us | 0.0159 us |
| MathMinInlined | 4.694 us | 0.0133 us | 0.0118 us |
| MinReorder | 4.471 us | 0.0066 us | 0.0061 us |
| MinReorderWithMinScalar | 3.527 us | 0.0080 us | 0.0071 us |
| MinVectorized | 4.715 us | 0.0046 us | 0.0041 us |

Ryzen 1800X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 7 1800X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT

| Method | Mean | Error | StdDev |
|--------------------------------------- |----------:|----------:|----------:|
| MathMin | 16.147 us | 0.1826 us | 0.1708 us |
| MathMinInlined | 5.484 us | 0.0148 us | 0.0131 us |
| MinReorder | 4.902 us | 0.0114 us | 0.0107 us |
| MinReorderWithMinScalar | 4.120 us | 0.0228 us | 0.0214 us |
| MinVectorized | 5.166 us | 0.0168 us | 0.0149 us |

i7-7700

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT

| Method | Mean | Error | StdDev | Median |
|--------------------------------------- |----------:|----------:|----------:|----------:|
| MathMin | 12.560 us | 0.2606 us | 0.5664 us | 12.244 us |
| MathMinInlined | 7.369 us | 0.0230 us | 0.0192 us | 7.371 us |
| MinReorder | 6.063 us | 0.0257 us | 0.0240 us | 6.057 us |
| MinReorderWithMinScalar | 6.043 us | 0.0235 us | 0.0220 us | 6.037 us |
| MinVectorized | 7.833 us | 0.0788 us | 0.0699 us | 7.810 us |

That's why we need a Fast Math mode :-) to be able to just use a single instruction (e.g. minsd) and not care about NaNs, etc. 🙂

Some dotnet/perf benchmarks are up to 3x faster with Fast-Math mode on Mono-LLVM (e.g. the Burgers ones)

@tannergooding I updated my benchmarks with your variants -> results.

All variants use optimized BitConverter.

From these numbers (for double), InlinedOptimized is still the fastest variant (for my benchmarks).

For the raytracer I get the following numbers (float):

InlinedOptimized:                   35525 ms
DefaultReorderedVectorized:         40959 ms
DefaultReorderedVectorizedHotCold:  41835 ms

For AMD it seems that MinScalar has a much greater positive effect than on Intel.

Can you run your benchmarks with InlinedOptimized-variant as well, please?

I see the following. Part of the difference in numbers is due to the inputs given, so I've run the benchmark with 4 different input variations and listed them below.

From the numbers, you can see that MinVectorized is fairly stable regardless of what inputs are given, while the other algorithms really start shining when the branch predictor can start working but may perform worse when it can't.

All inputs: val1 < val2

Ryzen 3900X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT

| Method | Mean | Error | StdDev |
|------------------------ |----------:|----------:|----------:|
| MathMin | 10.856 us | 0.0392 us | 0.0367 us |
| MathMinInlined | 4.716 us | 0.0020 us | 0.0016 us |
| MinReorder | 4.745 us | 0.0139 us | 0.0124 us |
| MinReorderWithMinScalar | 3.564 us | 0.0103 us | 0.0096 us |
| MinVectorized | 4.742 us | 0.0069 us | 0.0062 us |
| MinGfoidl | 4.627 us | 0.0145 us | 0.0135 us |

i7-7700

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT

| Method | Mean | Error | StdDev |
|------------------------ |----------:|----------:|----------:|
| MathMin | 12.110 us | 0.0280 us | 0.0260 us |
| MathMinInlined | 5.051 us | 0.0834 us | 0.0780 us |
| MinReorder | 4.847 us | 0.0063 us | 0.0056 us |
| MinReorderWithMinScalar | 6.034 us | 0.0096 us | 0.0090 us |
| MinVectorized | 7.793 us | 0.0545 us | 0.0455 us |
| MinGfoidl | 6.063 us | 0.0232 us | 0.0217 us |

1/2 Inputs: val1 < val2; 1/2 Inputs: val2 < val1

Ryzen 3900X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT

| Method | Mean | Error | StdDev |
|------------------------ |---------:|----------:|----------:|
| MathMin | 8.869 us | 0.0481 us | 0.0375 us |
| MathMinInlined | 4.163 us | 0.0349 us | 0.0326 us |
| MinReorder | 4.749 us | 0.0229 us | 0.0214 us |
| MinReorderWithMinScalar | 3.791 us | 0.0541 us | 0.0506 us |
| MinVectorized | 4.651 us | 0.0203 us | 0.0189 us |
| MinGfoidl | 4.235 us | 0.0059 us | 0.0052 us |

i7-7700

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT

| Method | Mean | Error | StdDev |
|------------------------ |----------:|----------:|----------:|
| MathMin | 11.010 us | 0.0890 us | 0.0830 us |
| MathMinInlined | 5.897 us | 0.0199 us | 0.0186 us |
| MinReorder | 5.743 us | 0.0066 us | 0.0062 us |
| MinReorderWithMinScalar | 5.991 us | 0.0190 us | 0.0168 us |
| MinVectorized | 7.716 us | 0.0223 us | 0.0198 us |
| MinGfoidl | 5.397 us | 0.0162 us | 0.0151 us |

1/3 Inputs: val1 < val2; 1/3 Inputs: val2 < val1; 1/3 Inputs: NaN

Ryzen 3900X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT

| Method | Mean | Error | StdDev |
|------------------------ |----------:|----------:|----------:|
| MathMin | 10.553 us | 0.2106 us | 0.4488 us |
| MathMinInlined | 3.846 us | 0.0094 us | 0.0088 us |
| MinReorder | 4.099 us | 0.0113 us | 0.0101 us |
| MinReorderWithMinScalar | 3.750 us | 0.0164 us | 0.0145 us |
| MinVectorized | 4.416 us | 0.0198 us | 0.0185 us |
| MinGfoidl | 3.812 us | 0.0138 us | 0.0129 us |

i7-7700

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT

| Method | Mean | Error | StdDev |
|------------------------ |----------:|----------:|----------:|
| MathMin | 11.850 us | 0.0470 us | 0.0420 us |
| MathMinInlined | 5.050 us | 0.0108 us | 0.0101 us |
| MinReorder | 5.136 us | 0.0100 us | 0.0088 us |
| MinReorderWithMinScalar | 5.565 us | 0.0094 us | 0.0088 us |
| MinVectorized | 6.881 us | 0.0080 us | 0.0075 us |
| MinGfoidl | 5.221 us | 0.0082 us | 0.0073 us |

1/4 Inputs: val1 < val2; 1/4 Inputs: val2 < val1; 1/4 Inputs: NaN; 1/4 Inputs: val1 == val2

Ryzen 3900X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT

| Method | Mean | Error | StdDev |
|------------------------ |----------:|----------:|----------:|
| MathMin | 11.400 us | 0.1980 us | 0.1850 us |
| MathMinInlined | 3.839 us | 0.0144 us | 0.0135 us |
| MinReorder | 3.789 us | 0.0153 us | 0.0143 us |
| MinReorderWithMinScalar | 3.764 us | 0.0245 us | 0.0229 us |
| MinVectorized | 4.327 us | 0.0082 us | 0.0072 us |
| MinGfoidl | 3.766 us | 0.0151 us | 0.0142 us |

i7-7700

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT

| Method | Mean | Error | StdDev |
|------------------------ |----------:|----------:|----------:|
| MathMin | 11.740 us | 0.0400 us | 0.0350 us |
| MathMinInlined | 5.046 us | 0.0203 us | 0.0180 us |
| MinReorder | 5.309 us | 0.0120 us | 0.0107 us |
| MinReorderWithMinScalar | 5.416 us | 0.0099 us | 0.0093 us |
| MinVectorized | 6.663 us | 0.0067 us | 0.0056 us |
| MinGfoidl | 4.967 us | 0.0141 us | 0.0132 us |

I put the numbers into a chart (without the MathMin case, as it's the worst):
[chart: math-min]

For Ryzen, MinReorderWithMinScalar performs best and is quite stable across the inputs.
For Intel, MinReorder performs best.

So which one should we pick?
Do you know why minsd seems to be noticeably slower on Intel, or how this can be circumvented?

other algorithms really start shining when the branch predictor can start working but may perform worse when it can't

This is what I would expect, but the chart looks somewhat different in places. The more "random" the inputs, the less time it takes. This feels a bit strange to me.

Do you know why minsd seems to be noticeably slower on Intel, or how this can be circumvented?

Looking at https://www.agner.org/optimize/instruction_tables.pdf, Skylake has a 4 cycle latency while Ryzen only has a 1 cycle latency (for minss/minsd and maxss/maxsd).

This is what I would expect, but the chart looks somewhat different in places. The more "random" the inputs, the less time it takes. This feels a bit strange to me.

It might have to do with internal mechanics of how the branch predictor works. It could also be a side effect of the test or some other factor.

Math.Min and Math.Max are really small functions, so we can't accurately test them using just a benchmark that calls them. This is because they take less time than the resolution of the high-frequency timer on the system (which is sub-1-microsecond, but generally no more accurate than 100ns). Instead, we have to scale the number of operations done during the test, which can skew results.

In either case we need a software fallback that isn't MinReorderWithMinScalar or MinVectorized, so we should just pick the one that is best overall, on average. We can then optimize other cases as needed.
I imagine all of these algorithms (minus MinVectorized) may behave differently depending on the surrounding code due to the number of branches/return statements they have.
