I have benchmarked System.Numerics.Vector with a regular array as in below test:
```c#
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
namespace SIMD
{
    [SimpleJob(launchCount: 1, warmupCount: 2, targetCount: 10)]
    public class TestSuite
    {
        [Benchmark]
        public static void AddSIMD()
        {
            var arr1 = new int[64 * 1024 * 1024];
            var arr2 = new int[64 * 1024 * 1024];
            SIMDArrayAddition(arr1, arr2);
        }

        [Benchmark]
        public static void Add()
        {
            var arr1 = new int[64 * 1024 * 1024];
            var arr2 = new int[64 * 1024 * 1024];
            ArrayAdd(arr1, arr2);
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public static int[] ArrayAdd(int[] lhs, int[] rhs)
        {
            var result = new int[lhs.Length];
            for (var i = 0; i < lhs.Length; i++)
            {
                result[i] = lhs[i] + rhs[i];
            }
            return result;
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public static int[] SIMDArrayAddition(int[] lhs, int[] rhs)
        {
            var simdLength = Vector<int>.Count;
            var l = lhs.Length - simdLength;
            var result = new int[lhs.Length];
            for (var i = 0; i <= l; i += simdLength)
            {
                var va = new Vector<int>(lhs, i);
                var vb = new Vector<int>(rhs, i);
                (va + vb).CopyTo(result, i);
            }
            return result;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<TestSuite>();
        }
    }
}
```

According to the results below, the AVX version runs in the same time as the non-AVX one:

Method  |     Mean |    Error |   StdDev |
------- |---------:|---------:|---------:|
AddSIMD | 480.9 ms | 20.88 ms | 13.81 ms |
Add     | 487.9 ms | 24.14 ms | 14.37 ms |
@tannergooding any comments?
cc @adamsitnik
Using Vector<T> with arrays is kind of problematic because the range checks are not eliminated and loop cloning is not performed. You basically go from:
```asm
4863CA               movsxd   rcx, edx
448B448F10           mov      r8d, dword ptr [rdi+4*rcx+16]
4403448E10           add      r8d, dword ptr [rsi+4*rcx+16]
4489448810           mov      dword ptr [rax+4*rcx+16], r8d
FFC2                 inc      edx
3BDA                 cmp      ebx, edx
7FE8                 jg       SHORT G_M55886_IG04
```
to
```asm
G_M55886_IG03:
3BD3                 cmp      edx, ebx
7350                 jae      SHORT G_M55886_IG05
8D4A07               lea      ecx, [rdx+7]
3BCB                 cmp      ecx, ebx
7349                 jae      SHORT G_M55886_IG05
C5FD10449710         vmovupd  ymm0, ymmword ptr[rdi+4*rdx+16]
448B4608             mov      r8d, dword ptr [rsi+8]
413BD0               cmp      edx, r8d
733A                 jae      SHORT G_M55886_IG05
413BC8               cmp      ecx, r8d
7335                 jae      SHORT G_M55886_IG05
C5FD104C9610         vmovupd  ymm1, ymmword ptr[rsi+4*rdx+16]
C5FDFEC1             vpaddd   ymm0, ymm1
448B4008             mov      r8d, dword ptr [rax+8]
413BD0               cmp      edx, r8d
7327                 jae      SHORT G_M55886_IG06
413BC8               cmp      ecx, r8d
7327                 jae      SHORT G_M55886_IG07
C5FD11449010         vmovupd  ymmword ptr[rax+4*rdx+16], ymm0
83C208               add      edx, 8
3BD5                 cmp      edx, ebp
7EBC                 jle      SHORT G_M55886_IG03
```
@mikedn, so how should we use it then? Via unsafe?
Well, unsafe won't work very well with Vector<T> because there's no way to load Vector<T> from a pointer. You could use Span<Vector<int>> and MemoryMarshal.Cast:
```C#
var vlhs = MemoryMarshal.Cast<int, Vector<int>>(lhs.AsSpan());
var vrhs = MemoryMarshal.Cast<int, Vector<int>>(rhs.AsSpan());
var vres = MemoryMarshal.Cast<int, Vector<int>>(result.AsSpan());

for (var i = 0; i < vlhs.Length; i++)
    vres[i] = vlhs[i] + vrhs[i];
```
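One caveat worth noting: `MemoryMarshal.Cast` truncates, so if `lhs.Length` is not a multiple of `Vector<int>.Count`, the last few elements are silently skipped. A minimal sketch of the scalar tail loop, assuming the `lhs`/`rhs`/`result` arrays from the snippet above:

```C#
// Finish off the tail elements that didn't fill a whole Vector<int>.
for (var i = vlhs.Length * Vector<int>.Count; i < lhs.Length; i++)
    result[i] = lhs[i] + rhs[i];
```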
`unsafe` would work well with the SSE/AVX intrinsics:
```C#
fixed (int* plhs = &lhs[0])
fixed (int* prhs = &rhs[0])
fixed (int* pres = &result[0])
{
    for (var i = 0; i < lhs.Length; i += 8)
        Avx2.Store(pres + i, Avx2.Add(Avx.LoadVector256(plhs + i), Avx2.LoadVector256(prhs + i)));
}
```
But a quick test on my machine shows that all variants have basically the same performance. Shouldn't be too surprising, those arrays have 256 megabytes each. The cost of accessing all that memory is so high that everything else is pretty much irrelevant.
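(Back-of-envelope: 64 Mi ints × 4 B = 256 MB per array, so each call streams roughly 3 × 256 MB = 768 MB; at an assumed ~15 GB/s of effective bandwidth that is ~50 ms of pure memory traffic, before counting the allocation, zeroing, and page faults for the freshly allocated arrays.)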
@mikedn, can you mention one case where using vectors will boost the speed compared to arrays? If there is none, that means I would always use arrays instead and vectors are useless.
Also, for similar loops, C++ does auto-vectorization.
> can you mention one case where using vectors will boost the speed compared to arrays?

Shorter arrays; accessing 256MB is going to dominate over the CPU instructions for that size of array as it can't fit into the CPU cache.
https://gist.github.com/jboner/2841832
```
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                       0.5 ns
Branch mispredict                        5   ns
L2 cache reference                       7   ns     14x L1 cache
Mutex lock/unlock                       25   ns
Main memory reference                  100   ns     20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy         3,000   ns      3 us
Send 1K bytes over 1 Gbps network   10,000   ns     10 us
Read 4K randomly from SSD*         150,000   ns    150 us   ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000   ns    250 us
```
> Well, unsafe won't work very well with Vector<T> because there's no way to load Vector<T> from a pointer.
This should be possible in the latest C# 8 preview, with the unmanaged constructed types feature. The new feature lets you take the address of and have a pointer to a generic struct, like Vector<int>, if the compiler can statically determine it is unmanaged.
It would probably have about the same perf as using Unsafe.As or Unsafe.Read/Write, however.
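A minimal sketch of what that could look like once the feature is available (untested against the preview compiler; assumes `lhs` is an `int[]` whose length is a multiple of `Vector<int>.Count`):

```C#
unsafe
{
    fixed (int* p = lhs)
    {
        // With unmanaged constructed types, Vector<int> counts as unmanaged,
        // so taking a typed pointer to it becomes legal.
        var pv = (Vector<int>*)p;
        Vector<int> first = pv[0];
    }
}
```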
> Shorter arrays; accessing 256MB is going to dominate over the CPU instructions for that size of array as it can't fit into the CPU cache.
Probably a good case for HWIntrinsics and the NonTemporal store functions?
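For illustration, a sketch of that idea grafted onto the AVX loop above (assumes `pres` is 32-byte aligned, which a plain `new int[]` does not guarantee):

```C#
// StoreAlignedNonTemporal emits a non-temporal store that bypasses the cache,
// which can help when the destination won't be read again soon.
for (var i = 0; i < lhs.Length; i += 8)
    Avx.StoreAlignedNonTemporal(pres + i,
        Avx2.Add(Avx.LoadVector256(plhs + i), Avx.LoadVector256(prhs + i)));
```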
Actually in the question the arrays were 64 MB. I just changed it to 64 KB; still no difference! This really doesn't sound right. I remember I tested System.Numerics.Vector a few years ago for a similar case. It was fast. Obviously the above assembly code is correct: it has to copy each item to a ymm register and do the addition. I think I am misinformed about something.
At its most ugly, using `Unsafe`, this:
```C#
public static int[] SIMDArrayAddition(int[] lhs, int[] rhs)
{
    if (lhs.Length > rhs.Length)
    {
        throw new ArgumentOutOfRangeException();
    }
    else if ((lhs.Length & (Vector<int>.Count - 1)) != 0)
    {
        throw new ArgumentOutOfRangeException("Not a multiple of vector length");
    }
    else if (lhs.Length == 0)
    {
        return Array.Empty<int>();
    }

    var result = new int[lhs.Length];

    ref var resultStart = ref result[0];
    ref var lhsStart = ref lhs[0];
    ref var rhsStart = ref rhs[0];

    for (var i = 0; i <= (result.Length - Vector<int>.Count); i += Vector<int>.Count)
    {
        var va = Unsafe.ReadUnaligned<Vector<int>>(ref Unsafe.As<int, byte>(ref Unsafe.Add(ref lhsStart, i)));
        var vb = Unsafe.ReadUnaligned<Vector<int>>(ref Unsafe.As<int, byte>(ref Unsafe.Add(ref rhsStart, i)));
        ref var r = ref Unsafe.As<int, Vector<int>>(ref Unsafe.Add(ref resultStart, i));
        r = va + vb;
    }

    return result;
}
```
gives you an inner loop of:
```asm
G_M40479_IG07:
movsxd   r9, r8d
shl      r9, 2
vmovupd  ymm0, ymmword ptr[rdi+r9]
vmovupd  ymm1, ymmword ptr[rsi+r9]
add      r9, rdx
vpaddd   ymm0, ymm1
vmovupd  ymmword ptr[r9], ymm0
add      r8d, 8
cmp      ecx, r8d
jge      SHORT G_M40479_IG07
```
> Shorter arrays; accessing 256MB is going to dominate over the CPU instructions for that size of array as it can't fit into the CPU cache.

Indeed. And even that holds only if the array memory is already in the cache (probably reasonable to assume: the data must have been written there previously, and if you're not using tons of memory it's probably still in the cache).

And of course, another case is when you have more expensive computations, though you probably need quite a bit of computation to rise above the memory access cost.
> Also, for similar loops, C++ does auto-vectorization.
Yeah, they tend to be overly optimistic and perform optimizations that aren't necessarily needed/useful.
> Probably a good case for HWIntrinsics and the NonTemporal store functions?
Seems to help a bit but not a lot, around 10%. And some of that (3-4%) seems to come from alignment.
> Actually in the question the arrays were 64 MB

They're ints, so it's 64M elements * 4 bytes = 256MB. Doesn't really matter; they still don't fit in any CPU cache.

> I just changed it to 64 KB; still no difference! This really doesn't sound right. I remember I tested System.Numerics.Vector a few years ago for a similar case. It was fast. Obviously the above assembly code is correct: it has to copy each item to a ymm register and do the addition. I think I am misinformed about something.

Well, it might depend on what exactly you are measuring. Your benchmarks include the cost of array allocation, whatever that cost might be. If you allocate those arrays only once and reuse them, then the results might be different.

With pre-allocated 65536-element arrays, the AVX variant I posted above is ~2x faster than the scalar version. Sure, 2x is not 8x, but getting an ideal speed-up from SIMD code isn't exactly the norm.
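A sketch of how to hoist the allocation out of the measurement with BDN's `[GlobalSetup]` (field and method names are illustrative):

```C#
private int[] _lhs, _rhs;

[GlobalSetup]
public void Setup()
{
    // Allocated once per benchmark run; only the addition itself is measured.
    _lhs = new int[65536];
    _rhs = new int[65536];
}

[Benchmark]
public int[] AddSIMD() => SIMDArrayAddition(_lhs, _rhs);
```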
@mikedn
> > Actually in the question the arrays were 64 MB
>
> They're ints, so it's 64M elements * 4 bytes = 256MB. Doesn't really matter; they still don't fit in any CPU cache.

Oh yes, how dumb I am :) And yes, moving the array allocation out made it approx. 2x faster, so it was my bad way of benchmarking, regardless of the array size!
@benaadams
That code is very impressive; however, from BDN I see the following output:
```asm
mov      rcx, qword ptr [rbp-30h]
mov      edx, dword ptr [rbp-3Ch]
call     System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
mov      rcx, rax
call     System.Runtime.CompilerServices.Unsafe.As[[System.Int32, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Int32 ByRef)
mov      qword ptr [rbp-0A0h], rax
mov      rdx, qword ptr [rbp-0A0h]
lea      rcx, [rbp-70h]
call     System.Runtime.CompilerServices.Unsafe.ReadUnaligned[[System.Numerics.Vector`1[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]](Byte ByRef)
mov      rcx, qword ptr [rbp-38h]
mov      edx, dword ptr [rbp-3Ch]
call     System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
mov      rcx, rax
call     System.Runtime.CompilerServices.Unsafe.As[[System.Int32, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Int32 ByRef)
mov      qword ptr [rbp-0A8h], rax
mov      rdx, qword ptr [rbp-0A8h]
lea      rcx, [rbp-90h]
call     System.Runtime.CompilerServices.Unsafe.ReadUnaligned[[System.Numerics.Vector`1[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]](Byte ByRef)
mov      rcx, qword ptr [rbp-28h]
mov      edx, dword ptr [rbp-3Ch]
call     System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
mov      rcx, rax
call     System.Runtime.CompilerServices.Unsafe.As[[System.Int32, System.Private.CoreLib],[System.Numerics.Vector`1[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]](Int32 ByRef)
mov      qword ptr [rbp-98h], rax
vmovupd  ymm0, ymmword ptr [rbp-70h]
vmovupd  ymm1, ymmword ptr [rbp-90h]
vpaddd   ymm0, ymm0, ymm1
mov      rax, qword ptr [rbp-98h]
vmovupd  ymmword ptr [rax], ymm0
mov      eax, dword ptr [rbp-3Ch]
add      eax, 8
mov      dword ptr [rbp-3Ch], eax
```
I will close the ticket but I am willing to discuss further if you have time.
> That code is very impressive; however, from BDN I see the following output:
@OnurGumus, are you testing a Debug configuration? Almost all of those calls should have been inlined.
@tannergooding nope, it is Release; BenchmarkDotNet doesn't allow debug benchmarking anyway.
That said, I have added the following:

```xml
<DebugType>pdbonly</DebugType>
<DebugSymbols>true</DebugSymbols>
```

to the project file, which is required to print the disassembly. But it is being built in Release mode.
Tried switching off TieredCompilation:

```
set COMPlus_TieredCompilation=0
```
Well, there is definitely something up there. Maybe tiering, but I had thought BDN took care of that as well...
@tannergooding yes, my bad again. It was tiering. BDN would take care of it if you run it more than 30 times; I don't wait that much. I added aggressive optimization and now it is fast!
Might be worth adding that the AVX version I posted doesn't generate ideal code:
```asm
G_M54511_IG03:
4D63DA               movsxd   r11, r10d
49C1E302             shl      r11, 2
4A8D3419             lea      rsi, [rcx+r11]
C5FE6F06             vmovdqu  ymm0, ymmword ptr[rsi]
4A8D341A             lea      rsi, [rdx+r11]
C5FDFE06             vpaddd   ymm0, ymm0, ymmword ptr[rsi]
4D03D9               add      r11, r9
C4C17E7F03           vmovdqu  ymmword ptr[r11], ymm0
4183C208             add      r10d, 8
413BC2               cmp      eax, r10d
7FD8                 jg       SHORT G_M54511_IG03
```
Those LEAs should not be there; that's a limitation of the current JIT. But it's unlikely that they have any measurable impact. Other versions, such as:
```C#
int* plhs1 = plhs;
int* plhsEnd = plhs1 + lhs.Length;
int* prhs1 = prhs;
int* pres1 = pres;

for (; plhs1 < plhsEnd; plhs1 += 8, prhs1 += 8, pres1 += 8)
    Avx2.Store(pres1, Avx2.Add(Avx.LoadVector256(plhs1), Avx2.LoadVector256(prhs1)));
```
have similar performance.
Is it possible to do loop unrolling? I tried hand-unrolling the loop 4 times; when using LoadVector256 all of them still use YMM0, whereas with Vector<T> they use different YMM registers, though I have not observed any perf difference.
> Is it possible to do loop unrolling?
The JIT doesn't do that today.
> I tried hand-unrolling the loop 4 times; when using LoadVector256 all of them still use YMM0, whereas with Vector<T> they use different YMM registers, though I have not observed any perf difference.

And as you already found, it doesn't help. I actually tried that too earlier and came to the same result. Though what I did:
```C#
for (; plhs1 < plhsEnd; plhs1 += 32, prhs1 += 32, pres1 += 32)
{
    Avx2.Store(pres1,      Avx2.Add(Avx.LoadVector256(plhs1),      Avx2.LoadVector256(prhs1)));
    Avx2.Store(pres1 + 8,  Avx2.Add(Avx.LoadVector256(plhs1 + 8),  Avx2.LoadVector256(prhs1 + 8)));
    Avx2.Store(pres1 + 16, Avx2.Add(Avx.LoadVector256(plhs1 + 16), Avx2.LoadVector256(prhs1 + 16)));
    Avx2.Store(pres1 + 24, Avx2.Add(Avx.LoadVector256(plhs1 + 24), Avx2.LoadVector256(prhs1 + 24)));
}
```
also suffers from the extra-LEA issue. Perhaps if that's fixed it would work better. But I don't expect wonders.
I guess the CPU renames registers anyway. Lol, old CPUs were more predictable.
> I guess the CPU renames registers anyway.
Ah, if you're referring to the use of the same register multiple times in the loop then yes, modern CPUs tend to handle that well via register renaming.
> Lol, old CPUs were more predictable.

In a way, yes. New CPUs are pretty predictable too, but unfortunately that's @#$% complicated and not well documented :)