Core: Slow performance with System.Numerics.Vector

Created on 2 Feb 2019 · 22 comments · Source: dotnet/core

I have benchmarked System.Numerics.Vector against a regular array loop with the test below:
```c#
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

namespace SIMD
{
    [SimpleJob(launchCount: 1, warmupCount: 2, targetCount: 10)]
    public class TestSuite
    {
        // BenchmarkDotNet requires benchmark methods to be public instance methods.
        [Benchmark]
        public void AddSIMD()
        {
            // 64M ints = 256 MB per array.
            var arr1 = new int[64 * 1024 * 1024];
            var arr2 = new int[64 * 1024 * 1024];

            SIMDArrayAddition(arr1, arr2);
        }

        [Benchmark]
        public void Add()
        {
            var arr1 = new int[64 * 1024 * 1024];
            var arr2 = new int[64 * 1024 * 1024];

            ArrayAdd(arr1, arr2);
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public static int[] ArrayAdd(int[] lhs, int[] rhs)
        {
            var result = new int[lhs.Length];
            for (var i = 0; i < lhs.Length; i++)
            {
                result[i] = lhs[i] + rhs[i];
            }
            return result;
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public static int[] SIMDArrayAddition(int[] lhs, int[] rhs)
        {
            var simdLength = Vector<int>.Count; // 8 ints with 256-bit AVX
            var l = lhs.Length - simdLength;
            var result = new int[lhs.Length];
            for (var i = 0; i <= l; i += simdLength)
            {
                var va = new Vector<int>(lhs, i);
                var vb = new Vector<int>(rhs, i);
                (va + vb).CopyTo(result, i);
            }
            return result;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<TestSuite>();
        }
    }
}
```

According to the results below, the `Vector<int>` (AVX) version runs in the same time as the non-AVX version:

| Method  | Mean     | Error    | StdDev   |
|-------- |---------:|---------:|---------:|
| AddSIMD | 480.9 ms | 20.88 ms | 13.81 ms |
| Add     | 487.9 ms | 24.14 ms | 14.37 ms |


All 22 comments

@tannergooding any comments?
cc @adamsitnik

Using `Vector<T>` with arrays is kind of problematic because the range checks are not eliminated and loop cloning is not performed. You basically go from:

```asm
       4863CA               movsxd   rcx, edx
       448B448F10           mov      r8d, dword ptr [rdi+4*rcx+16]
       4403448E10           add      r8d, dword ptr [rsi+4*rcx+16]
       4489448810           mov      dword ptr [rax+4*rcx+16], r8d
       FFC2                 inc      edx
       3BDA                 cmp      ebx, edx
       7FE8                 jg       SHORT G_M55886_IG04
```

to this, where each pair of `jae` branches is a range check guarding one vector access:

```asm
G_M55886_IG03:
       3BD3                 cmp      edx, ebx
       7350                 jae      SHORT G_M55886_IG05
       8D4A07               lea      ecx, [rdx+7]
       3BCB                 cmp      ecx, ebx
       7349                 jae      SHORT G_M55886_IG05
       C5FD10449710         vmovupd  ymm0, ymmword ptr[rdi+4*rdx+16]
       448B4608             mov      r8d, dword ptr [rsi+8]
       413BD0               cmp      edx, r8d
       733A                 jae      SHORT G_M55886_IG05
       413BC8               cmp      ecx, r8d
       7335                 jae      SHORT G_M55886_IG05
       C5FD104C9610         vmovupd  ymm1, ymmword ptr[rsi+4*rdx+16]
       C5FDFEC1             vpaddd   ymm0, ymm1
       448B4008             mov      r8d, dword ptr [rax+8]
       413BD0               cmp      edx, r8d
       7327                 jae      SHORT G_M55886_IG06
       413BC8               cmp      ecx, r8d
       7327                 jae      SHORT G_M55886_IG07
       C5FD11449010         vmovupd  ymmword ptr[rax+4*rdx+16], ymm0
       83C208               add      edx, 8
       3BD5                 cmp      edx, ebp
       7EBC                 jle      SHORT G_M55886_IG03
```

@mikedn, so how should we use it then? Via `unsafe`?

Well, `unsafe` won't work very well with `Vector<T>` because there's no way to load a `Vector<T>` from a pointer. You could use `Span<Vector<int>>` and `MemoryMarshal.Cast`:
```C#
var vlhs = MemoryMarshal.Cast<int, Vector<int>>(lhs);
var vrhs = MemoryMarshal.Cast<int, Vector<int>>(rhs);
var vres = MemoryMarshal.Cast<int, Vector<int>>(result);
for (var i = 0; i < vlhs.Length; i++)
    vres[i] = vlhs[i] + vrhs[i];
```
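A fuller version of that idea, for completeness (a sketch of mine, not from the original comment): `MemoryMarshal.Cast` truncates to whole vectors, so a scalar tail loop is needed when the length is not a multiple of `Vector<int>.Count`.

```C#
using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static int[] SpanVectorAdd(int[] lhs, int[] rhs)
{
    // Assumes lhs.Length == rhs.Length.
    var result = new int[lhs.Length];
    var vlhs = MemoryMarshal.Cast<int, Vector<int>>(lhs);
    var vrhs = MemoryMarshal.Cast<int, Vector<int>>(rhs);
    var vres = MemoryMarshal.Cast<int, Vector<int>>(result);

    // Whole vectors: Cast yields lhs.Length / Vector<int>.Count elements.
    for (var i = 0; i < vlhs.Length; i++)
        vres[i] = vlhs[i] + vrhs[i];

    // Scalar tail for the remaining 0..Vector<int>.Count-1 elements.
    for (var i = vlhs.Length * Vector<int>.Count; i < lhs.Length; i++)
        result[i] = lhs[i] + rhs[i];

    return result;
}
```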

`unsafe` would work well with the SSE/AVX intrinsics:
```C#
fixed (int* plhs = &lhs[0])
fixed (int* prhs = &rhs[0])
fixed (int* pres = &result[0])
{
    for (var i = 0; i < lhs.Length; i += 8)
        Avx2.Store(pres + i, Avx2.Add(Avx.LoadVector256(plhs + i), Avx2.LoadVector256(prhs + i)));
}
```
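(Editor's note, not part of the original comment: this sketch assumes `Avx2.IsSupported` and a length that is a multiple of 8; production code would check both and fall back to a scalar loop.)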

But a quick test on my machine shows that all variants have basically the same performance. That shouldn't be too surprising: those arrays are 256 megabytes each, and the cost of accessing all that memory is so high that everything else is pretty much irrelevant.

@mikedn, can you mention one case where using vectors will boost the speed compared to arrays? If there is none, that means I should always use arrays instead and vectors are useless.
Also, for similar loops, C++ does auto-vectorization.

> can you mention one case where using vectors will boost the speed compared to arrays?

Shorter arrays; accessing 256MB is going to dominate over the CPU instructions for that size of array as it can't fit into CPU cache

https://gist.github.com/jboner/2841832

```
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
```
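Back-of-envelope arithmetic (mine, using the gist's ~2012 figures): reading two 256 MB sources and writing one 256 MB result moves roughly 768 MB; at 1 MB per 250,000 ns (~4 GB/s) that is on the order of 190 ms of pure memory traffic, before counting allocation and zeroing, so an 8-wide ALU speed-up has almost nothing left to save.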

> Well, unsafe won't work very well with Vector<T> because there's no way to load Vector<T> from a pointer.

This should be possible in the latest C# 8 preview, with the unmanaged constructed types feature. The new feature lets you take the address of, and have a pointer to, a generic struct like `Vector<int>`, if the compiler can statically determine it is unmanaged.

It would probably have about the same perf as using Unsafe.As or Unsafe.Read/Write, however.
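For illustration (a sketch of mine, assuming a C# 8+ compiler with unmanaged constructed types): once `Vector<int>*` is legal, the loop can be written directly against pointers.

```C#
using System.Numerics;

public static unsafe void PointerVectorAdd(int* plhs, int* prhs, int* pres, int length)
{
    // Assumes length is a multiple of Vector<int>.Count.
    var vlhs = (Vector<int>*)plhs;
    var vrhs = (Vector<int>*)prhs;
    var vres = (Vector<int>*)pres;

    for (var i = 0; i < length / Vector<int>.Count; i++)
        vres[i] = vlhs[i] + vrhs[i];
}
```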

> Shorter arrays; accessing 256MB is going to dominate over the CPU instructions for that size of array as it can't fit into CPU cache

Probably a good case for HWIntrinsics and the NonTemporal store functions?
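A rough sketch of that idea (mine, not from the thread): non-temporal stores write around the cache, which can help when the freshly written result won't be read again soon. It assumes `Avx2.IsSupported`, a length that is a multiple of 8, and a 32-byte-aligned destination.

```C#
using System.Runtime.Intrinsics.X86;

public static unsafe void AddNonTemporal(int* plhs, int* prhs, int* pres, int length)
{
    // pres must be 32-byte aligned for StoreAlignedNonTemporal (vmovntdq).
    for (var i = 0; i < length; i += 8)
    {
        var sum = Avx2.Add(Avx.LoadVector256(plhs + i), Avx.LoadVector256(prhs + i));
        Avx.StoreAlignedNonTemporal(pres + i, sum);
    }

    // Non-temporal stores are weakly ordered; fence before the data is read elsewhere.
    Sse.StoreFence();
}
```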

Actually in the question the arrays were 64 MB. I just changed it to 64 KB; still no difference! This really doesn't sound right. I remember I tested System.Numerics.Vector a few years ago for a similar case, and it was fast. Obviously the assembly code above is correct: it has to copy each item to a ymm register and do the addition. I think I am misinformed about something.

At its most ugly, using `Unsafe`:

```C#
public static int[] SIMDArrayAddition(int[] lhs, int[] rhs)
{
    if (lhs.Length > rhs.Length)
    {
        throw new ArgumentOutOfRangeException();
    }
    else if ((lhs.Length & (Vector<int>.Count - 1)) != 0)
    {
        throw new ArgumentOutOfRangeException("Not a multiple of vector length");
    }
    else if (lhs.Length == 0)
    {
        return Array.Empty<int>();
    }

    var result = new int[lhs.Length];

    ref var resultStart = ref result[0];
    ref var lhsStart = ref lhs[0];
    ref var rhsStart = ref rhs[0];

    for (var i = 0; i <= (result.Length - Vector<int>.Count); i += Vector<int>.Count)
    {
        var va = Unsafe.ReadUnaligned<Vector<int>>(ref Unsafe.As<int, byte>(ref Unsafe.Add(ref lhsStart, i)));
        var vb = Unsafe.ReadUnaligned<Vector<int>>(ref Unsafe.As<int, byte>(ref Unsafe.Add(ref rhsStart, i)));
        ref var r = ref Unsafe.As<int, Vector<int>>(ref Unsafe.Add(ref resultStart, i));
        r = (va + vb);
    }
    return result;
}
```

Gives you an inner loop of:

```asm
G_M40479_IG07:
       movsxd   r9, r8d
       shl      r9, 2
       vmovupd  ymm0, ymmword ptr[rdi+r9]
       vmovupd  ymm1, ymmword ptr[rsi+r9]
       add      r9, rdx
       vpaddd   ymm0, ymm1
       vmovupd  ymmword ptr[r9], ymm0
       add      r8d, 8
       cmp      ecx, r8d
       jge      SHORT G_M40479_IG07
```

> Shorter arrays; accessing 256MB is going to dominate over the CPU instructions for that size of array as it can't fit into CPU cache

Indeed. And that's only if the array memory is already in cache (probably reasonable to assume: the data must have been written there previously, and if you're not using tons of memory it's probably still in the cache).

And of course, another case is when you have more expensive computations, though you probably need quite a bit of computation to rise above the memory access cost.

> Also, for similar loops, C++ does auto-vectorization.

Yeah, they tend to be overly optimistic and perform optimizations that aren't necessarily needed/useful.

> Probably a good case for HWIntrinsics and the NonTemporal store functions?

Seems to help a bit but not a lot, around 10%. And some of that (3-4%) seems to come from alignment.

> Actually in the question the arrays were 64 MB

They're `int` so it's 64M elements * 4 bytes = 256MB. Doesn't really matter, they still don't fit in any CPU cache.

> I just changed it to 64 KB; still no difference! This really doesn't sound right. I remember I tested System.Numerics.Vector a few years ago for a similar case, and it was fast. Obviously the assembly code above is correct: it has to copy each item to a ymm register and do the addition. I think I am misinformed about something.

Well, it might depend on what exactly you are measuring. Your benchmarks include the cost of array allocation, whatever that cost might be. If you allocate those arrays only once and reuse them then the results might be different.

With pre-allocated 65536-element arrays, the AVX variant I posted above is ~2x faster than the scalar version. Sure, 2x is not 8x, but getting an ideal speed-up from SIMD code isn't exactly the norm.
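A minimal BDN sketch of that setup (mine, with an assumed class name): allocating once in `[GlobalSetup]` keeps allocation and zeroing out of the measurement.

```C#
using System.Numerics;
using BenchmarkDotNet.Attributes;

public class PreallocatedSuite
{
    private int[] _lhs, _rhs, _result;

    [Params(65536)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        // Allocated once per benchmark case, reused across all iterations.
        _lhs = new int[Length];
        _rhs = new int[Length];
        _result = new int[Length];
    }

    [Benchmark]
    public void VectorAdd()
    {
        for (var i = 0; i <= _lhs.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            var va = new Vector<int>(_lhs, i);
            var vb = new Vector<int>(_rhs, i);
            (va + vb).CopyTo(_result, i);
        }
    }
}
```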

@mikedn

> > Actually in the question the arrays were 64 MB
>
> They're `int` so it's 64M elements * 4 bytes = 256MB. Doesn't really matter, they still don't fit in any CPU cache.

Oh yes, how dumb I am :) And yes, moving the array allocation out made it approximately 2x faster, so it was my bad way of benchmarking, regardless of the array size!
@benaadams
That code is very impressive; however, from BDN I see the following output:

```asm
mov     rcx,qword ptr [rbp-30h]
mov     edx,dword ptr [rbp-3Ch]
call    System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
mov     rcx,rax
call    System.Runtime.CompilerServices.Unsafe.As[[System.Int32, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Int32 ByRef)
mov     qword ptr [rbp-0A0h],rax
mov     rdx,qword ptr [rbp-0A0h]
lea     rcx,[rbp-70h]
call    System.Runtime.CompilerServices.Unsafe.ReadUnaligned[[System.Numerics.Vector`1[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]](Byte ByRef)
mov     rcx,qword ptr [rbp-38h]
mov     edx,dword ptr [rbp-3Ch]
call    System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
mov     rcx,rax
call    System.Runtime.CompilerServices.Unsafe.As[[System.Int32, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Int32 ByRef)
mov     qword ptr [rbp-0A8h],rax
mov     rdx,qword ptr [rbp-0A8h]
lea     rcx,[rbp-90h]
call    System.Runtime.CompilerServices.Unsafe.ReadUnaligned[[System.Numerics.Vector`1[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]](Byte ByRef)
mov     rcx,qword ptr [rbp-28h]
mov     edx,dword ptr [rbp-3Ch]
call    System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
mov     rcx,rax
call    System.Runtime.CompilerServices.Unsafe.As[[System.Int32, System.Private.CoreLib],[System.Numerics.Vector`1[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]](Int32 ByRef)
mov     qword ptr [rbp-98h],rax
vmovupd ymm0,ymmword ptr [rbp-70h]
vmovupd ymm1,ymmword ptr [rbp-90h]
vpaddd  ymm0,ymm0,ymm1
mov     rax,qword ptr [rbp-98h]
vmovupd ymmword ptr [rax],ymm0
mov     eax,dword ptr [rbp-3Ch]
add     eax,8
mov     dword ptr [rbp-3Ch],eax
```

I will close the ticket but I am willing to discuss further if you have time.

> That code is very impressive; however, from BDN I see the following output:

@OnurGumus, are you testing a Debug configuration? Almost all of those calls should have been inlined.

@tannergooding nope, it is Release; BenchmarkDotNet doesn't allow debug benchmarking anyway.
That said, I have added the following:

```xml
<DebugType>pdbonly</DebugType>
<DebugSymbols>true</DebugSymbols>
```

to the project file, which is required to print the disassembly. But it is being built in Release mode.

Tried switching off TieredCompilation:

```
set COMPlus_TieredCompilation=0
```
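(Editor's note, not from the thread: the same can be set per project with the MSBuild property below, or per method with `MethodImplOptions.AggressiveOptimization` as used earlier.)

```xml
<PropertyGroup>
  <TieredCompilation>false</TieredCompilation>
</PropertyGroup>
```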

Well, there is definitely something up there. Maybe tiering, but I had thought BDN took care of that as well...

@tannergooding yes, my bad again. It was tiering. BDN would take care of it if you run it more than 30 times; I don't wait that much. I added aggressive optimization, and now it is fast!

Might be worth adding that the AVX version I posted doesn't generate ideal code:

```asm
G_M54511_IG03:
       4D63DA               movsxd   r11, r10d
       49C1E302             shl      r11, 2
       4A8D3419             lea      rsi, [rcx+r11]
       C5FE6F06             vmovdqu  ymm0, ymmword ptr[rsi]
       4A8D341A             lea      rsi, [rdx+r11]
       C5FDFE06             vpaddd   ymm0, ymm0, ymmword ptr[rsi]
       4D03D9               add      r11, r9
       C4C17E7F03           vmovdqu  ymmword ptr[r11], ymm0
       4183C208             add      r10d, 8
       413BC2               cmp      eax, r10d
       7FD8                 jg       SHORT G_M54511_IG03
```

Those LEAs should not be there, that's a limitation of the current JIT. But it's unlikely that it has any measurable impact. Other versions, such as:
```C#
int* plhs1 = plhs;
int* plhsEnd = plhs1 + lhs.Length;
int* prhs1 = prhs;
int* pres1 = pres;

for (; plhs1 < plhsEnd; plhs1 += 8, prhs1 += 8, pres1 += 8)
Avx2.Store(pres1, Avx2.Add(Avx.LoadVector256(plhs1), Avx2.LoadVector256(prhs1)));
```
have similar performance.

Is it possible to do loop unrolling? I tried hand-unrolling the loop 4 times; with LoadVector256 all of them still use YMM0, while with Vector<T> they use different YMM registers, though I have not observed any perf difference.

> Is it possible to do loop unrolling?

The JIT doesn't do that today.

> I tried hand-unrolling the loop 4 times; with LoadVector256 all of them still use YMM0, while with Vector<T> they use different YMM registers, though I have not observed any perf difference.

And as you already found, it doesn't help. I actually tried that too earlier and came to the same result. Though what I did:

```C#
for (; plhs1 < plhsEnd; plhs1 += 32, prhs1 += 32, pres1 += 32)
{
    Avx2.Store(pres1, Avx2.Add(Avx.LoadVector256(plhs1), Avx2.LoadVector256(prhs1)));
    Avx2.Store(pres1 + 8, Avx2.Add(Avx.LoadVector256(plhs1 + 8), Avx2.LoadVector256(prhs1 + 8)));
    Avx2.Store(pres1 + 16, Avx2.Add(Avx.LoadVector256(plhs1 + 16), Avx2.LoadVector256(prhs1 + 16)));
    Avx2.Store(pres1 + 24, Avx2.Add(Avx.LoadVector256(plhs1 + 24), Avx2.LoadVector256(prhs1 + 24)));
}
```

also suffers from that extra-LEAs issue. Perhaps if that's fixed it would work better. But I don't expect wonders.

I guess the CPU renames registers anyway. Lol, old CPUs were more predictable.

> I guess the CPU renames registers anyway. Lol, old CPUs were more predictable.

Ah, if you're referring to the use of the same register multiple times in the loop, then yes, modern CPUs tend to handle that well via register renaming.

> Lol, old CPUs were more predictable.

In a way, yes. New CPUs are pretty predictable too, but unfortunately that's @#$% complicated and not well documented :)
