I have a loop I manually unrolled. Let's even move the loop body to a separate method for simplicity:
unsafe void Set(int* a, int length)
{
    for (int i = 0; i < length; i += 4)
        Iteration(a, i);
}
unsafe void Iteration(int* a, int i)
{
    a[i] = 0; // any constant
    a[i + 1] = 0; // any constant
    a[i + 2] = 0; // any constant
    a[i + 3] = 0; // any constant
}
Current codegen for Iteration:
G_M64981_IG01:
G_M64981_IG02:
movsxd rax, r8d
xor ecx, ecx
mov dword ptr [rdx+4*rax], ecx
lea eax, [r8+1]
movsxd rax, eax
mov dword ptr [rdx+4*rax], ecx
lea eax, [r8+2]
movsxd rax, eax
mov dword ptr [rdx+4*rax], ecx
add r8d, 3
movsxd rax, r8d
mov dword ptr [rdx+4*rax], ecx
G_M64981_IG03:
ret
; Total bytes of code: 39
LLVM codegen (e.g. Mono-LLVM):
movsxd rax, esi
vxorps xmm0, xmm0, xmm0
vmovups xmmword ptr [rdi + 4*rax], xmm0
ret
So we basically can replace it with just:
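unsafe void Iteration(int* a, int i)
{
    // Note: this stores at a, which matches the scalar version only for i == 0;
    // the general equivalent would be Sse2.Store(a + i, ...).
    Sse2.Store(a, Vector128<int>.Zero);
}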
/*
vzeroupper
vxorps xmm0, xmm0, xmm0
vmovdqu xmmword ptr [rdx], xmm0
ret
; Total bytes of code: 12
*/
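For reference, here is a minimal sketch of what the whole loop could look like when written by hand. It is not taken from the prototype: the `ZeroFill` class name is hypothetical, the store uses `a + i` so that it matches the scalar body, and the scalar tail loop is an addition of mine for lengths that are not a multiple of 4.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class, not part of the original post.
// Requires compiling with unsafe code enabled (AllowUnsafeBlocks).
static unsafe class ZeroFill
{
    // Zeroes `length` ints starting at `a`.
    public static void Set(int* a, int length)
    {
        int i = 0;
        if (Sse2.IsSupported)
        {
            // Vector128<int>.Count is 4, so each store covers a[i] .. a[i + 3].
            for (; i <= length - Vector128<int>.Count; i += Vector128<int>.Count)
                Sse2.Store(a + i, Vector128<int>.Zero);
        }

        // Scalar tail; this is also the full fallback when SSE2 is unavailable.
        for (; i < length; i++)
            a[i] = 0;
    }
}
```

It can be called on a managed array via `fixed (int* p = data) ZeroFill.Set(p, data.Length);`.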
I mean we have a single basic block with 4 statements to recognize (the count depends on the element type and on SSE/AVX availability):
***** BB01
STMT00000 (IL 0x000...0x007)
N009 ( 8, 8) [000009] -A-XG------- * ASG int $VN.Void
N007 ( 6, 6) [000008] *--X---N---- +--* IND int $44
N006 ( 4, 5) [000006] -------N---- | \--* ADD long $142
N001 ( 1, 1) [000000] ------------ | +--* LCL_VAR long V01 arg1 u:1 $80
N005 ( 3, 4) [000005] -------N---- | \--* LSH long $141
N003 ( 2, 3) [000002] ------------ | +--* CAST long <- int $140
N002 ( 1, 1) [000001] ------------ | | \--* LCL_VAR int V02 arg2 u:1 $c0
N004 ( 1, 1) [000004] ------------ | \--* CNS_INT long 2 $180
N008 ( 1, 1) [000007] ------------ \--* CNS_INT int 0 $44
***** BB01
STMT00001 (IL 0x008...0x011)
N011 ( 10, 10) [000021] -A-XG------- * ASG int $VN.Void
N009 ( 8, 8) [000020] *--X---N---- +--* IND int $44
N008 ( 6, 7) [000018] -------N---- | \--* ADD long $145
N001 ( 1, 1) [000010] ------------ | +--* LCL_VAR long V01 arg1 u:1 $80
N007 ( 5, 6) [000017] -------N---- | \--* LSH long $144
N005 ( 4, 5) [000014] ------------ | +--* CAST long <- int $143
N004 ( 3, 3) [000013] ------------ | | \--* ADD int $200
N002 ( 1, 1) [000011] ------------ | | +--* LCL_VAR int V02 arg2 u:1 $c0
N003 ( 1, 1) [000012] ------------ | | \--* CNS_INT int 1 $40
N006 ( 1, 1) [000016] ------------ | \--* CNS_INT long 2 $180
N010 ( 1, 1) [000019] ------------ \--* CNS_INT int 0 $44
***** BB01
STMT00002 (IL 0x012...0x01B)
N011 ( 10, 10) [000033] -A-XG------- * ASG int $VN.Void
N009 ( 8, 8) [000032] *--X---N---- +--* IND int $44
N008 ( 6, 7) [000030] -------N---- | \--* ADD long $148
N001 ( 1, 1) [000022] ------------ | +--* LCL_VAR long V01 arg1 u:1 $80
N007 ( 5, 6) [000029] -------N---- | \--* LSH long $147
N005 ( 4, 5) [000026] ------------ | +--* CAST long <- int $146
N004 ( 3, 3) [000025] ------------ | | \--* ADD int $201
N002 ( 1, 1) [000023] ------------ | | +--* LCL_VAR int V02 arg2 u:1 $c0
N003 ( 1, 1) [000024] ------------ | | \--* CNS_INT int 2 $41
N006 ( 1, 1) [000028] ------------ | \--* CNS_INT long 2 $180
N010 ( 1, 1) [000031] ------------ \--* CNS_INT int 0 $44
***** BB01
STMT00003 (IL 0x01C...0x025)
N011 ( 10, 10) [000045] -A-XG------- * ASG int $VN.Void
N009 ( 8, 8) [000044] *--X---N---- +--* IND int $44
N008 ( 6, 7) [000042] -------N---- | \--* ADD long $14b
N001 ( 1, 1) [000034] ------------ | +--* LCL_VAR long V01 arg1 u:1 (last use) $80
N007 ( 5, 6) [000041] -------N---- | \--* LSH long $14a
N005 ( 4, 5) [000038] ------------ | +--* CAST long <- int $149
N004 ( 3, 3) [000037] ------------ | | \--* ADD int $202
N002 ( 1, 1) [000035] ------------ | | +--* LCL_VAR int V02 arg2 u:1 (last use) $c0
N003 ( 1, 1) [000036] ------------ | | \--* CNS_INT int 3 $45
N006 ( 1, 1) [000040] ------------ | \--* CNS_INT long 2 $180
N010 ( 1, 1) [000043] ------------ \--* CNS_INT int 0 $44
and replace it with just:
***** BB01
STMT00000 (IL 0x000...0x00B)
N003 ( 3, 3) [000002] -A-XG------- * HWIntrinsic void int Store $101
N001 ( 1, 1) [000000] ------------ +--* LCL_VAR long V01 arg1 u:1 (last use) $80
N002 ( 1, 1) [000001] ------------ \--* HWIntrinsic simd16 int get_Zero $100
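To make the recognition step concrete, here is a toy sketch of the check. It does not use any real JIT data structures: `ConstStore` and `StorePattern` are hypothetical names, and each statement is modeled only as "base[index + Offset] = Value". A real implementation would also have to reason about aliasing, side effects, and statement ordering, which this sketch ignores.

```csharp
using System.Collections.Generic;

// Hypothetical, simplified model of one "base[index + Offset] = Value" statement.
public sealed record ConstStore(string BaseVar, string IndexVar, int Offset, int Value);

public static class StorePattern
{
    // Returns true when `stores` holds exactly `lanes` stores through the same
    // base and index variables, with offsets 0, 1, ..., lanes - 1 in order,
    // all writing the same constant.
    public static bool IsContiguousConstRun(IReadOnlyList<ConstStore> stores, int lanes)
    {
        if (lanes <= 0 || stores.Count != lanes)
            return false;

        ConstStore first = stores[0];
        for (int k = 0; k < lanes; k++)
        {
            ConstStore s = stores[k];
            if (s.BaseVar != first.BaseVar || s.IndexVar != first.IndexVar)
                return false;
            if (s.Offset != k || s.Value != first.Value)
                return false;
        }
        return true;
    }
}
```

For the four statements in the dump above, the model would be four `ConstStore("V01", "V02", k, 0)` entries with `k` from 0 to 3, and `lanes` would be 4 for `int` with 128-bit vectors.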
Hand-made prototype: https://github.com/EgorBo/coreclr/commit/aa216e690240297b48a006807fc2221773408f57

I'm all for using HWIntrinsics to help support autovectorization (and hopefully others are as well).
I would say (assuming this effort is approved and worked on) we should try to write the infrastructure so that it is easily shared between x86 and ARM, and so that ARM support comes online as soon as possible, since we are working on exposing those intrinsics for .NET 5.
@tannergooding I agree. The only problem is that it's not easy to emit Vector.Create, since it's not an intrinsic 🙂.
Btw, I noticed that in the following code:
static void MyTest(float* array, int length)
{
    for (int i = 0; i < length; i += 8)
        Avx.Store(array, Vector256<float>.Zero);
}
Vector256<float>.Zero is not moved out of the loop (loop hoisting). Is there a feature request or issue for that?
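Until the JIT hoists it on its own, one possible workaround is to hoist the constant by hand into a local. The sketch below is only an illustration: `HoistExample` and `MyTestHoisted` are hypothetical names, it assumes AVX is supported and that `length` is a multiple of 8, and unlike the snippet above it stores at `array + i` so that each iteration writes a different block. Whether the local actually stays in a register is up to the JIT.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical class and method names, not part of the original post.
// Requires compiling with unsafe code enabled (AllowUnsafeBlocks).
static unsafe class HoistExample
{
    // Assumes Avx.IsSupported and that length is a multiple of 8.
    public static void MyTestHoisted(float* array, int length)
    {
        // Read the zero vector once, outside the loop.
        Vector256<float> zero = Vector256<float>.Zero;

        for (int i = 0; i < length; i += 8)
            Avx.Store(array + i, zero);
    }
}
```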
This is a dupe of previous proposals to implement auto-vectorization. The conclusion was that we need a general design discussion to find the best approach, as the problem is far from as simple as described here. I can't find the previous thread, though.
There are multiple issues, including (but not limited to): https://github.com/dotnet/coreclr/issues/20486
@4creators @tannergooding ok closing as a dup.
Yeah, I was not planning to do it myself; I mostly wanted to share my prototype (who knows, maybe it will help, or demonstrate that it's a bit easier to implement now that we have HW nodes 🙂).
@EgorBo
The more of us push for the design and implementation of this feature, the better the chances that we get it sooner. Anyway, handcrafted vectorization of simple loops should be a thing of the past for the JIT and crossgen, hopefully soon.