Runtime: JIT optimization: vectorization of a manually unrolled loop

Created on 24 Oct 2019 · 7 comments · Source: dotnet/runtime

I have a loop I manually unrolled. Let's even move the loop body to a separate method for simplicity:

unsafe void Set(int* a, int length)
{
    for (int i = 0; i < length; i += 4)
        Iteration(a, i);
}

unsafe void Iteration(int* a, int i)
{
    a[i] = 0;     // any constant
    a[i + 1] = 0; // any constant
    a[i + 2] = 0; // any constant
    a[i + 3] = 0; // any constant
}

Current codegen for Iteration:

G_M64981_IG01:
G_M64981_IG02:
       movsxd   rax, r8d
       xor      ecx, ecx
       mov      dword ptr [rdx+4*rax], ecx
       lea      eax, [r8+1]
       movsxd   rax, eax
       mov      dword ptr [rdx+4*rax], ecx
       lea      eax, [r8+2]
       movsxd   rax, eax
       mov      dword ptr [rdx+4*rax], ecx
       add      r8d, 3
       movsxd   rax, r8d
       mov      dword ptr [rdx+4*rax], ecx
G_M64981_IG03:
       ret      
; Total bytes of code: 39

LLVM codegen (e.g. Mono-LLVM):

       movsxd  rax, esi
       vxorps  xmm0, xmm0, xmm0
       vmovups xmmword ptr [rdi + 4*rax], xmm0
       ret

So we can basically replace it with just:

unsafe void Iteration(int* a, int i)
{
    Sse2.Store(a, Vector128<int>.Zero);
}
/*
vzeroupper 
vxorps   xmm0, xmm0, xmm0
vmovdqu  xmmword ptr [rdx], xmm0
ret      
; Total bytes of code: 12
*/
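
Putting the pieces together, the whole Set loop would then effectively become the following hand-vectorized form (a sketch; SetVectorized is a hypothetical name, it assumes SSE2 is available and, like the original Set, that length is a multiple of 4; note the store goes to a + i so each iteration covers its own four elements):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Sketch of the whole Set loop once each unrolled body is replaced by a single
// 128-bit store (SetVectorized is a hypothetical name; assumes SSE2 and, like
// the original Set, that length is a multiple of 4).
static unsafe void SetVectorized(int* a, int length)
{
    for (int i = 0; i < length; i += 4)
        Sse2.Store(a + i, Vector128<int>.Zero); // one (v)movdqu instead of four scalar movs
}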

I mean, we have a single basic block with just 4 statements (the exact count depends on the element type and SSE/AVX availability) to recognize:

***** BB01
STMT00000 (IL 0x000...0x007)
N009 (  8,  8) [000009] -A-XG-------              *  ASG       int    $VN.Void
N007 (  6,  6) [000008] *--X---N----              +--*  IND       int    $44
N006 (  4,  5) [000006] -------N----              |  \--*  ADD       long   $142
N001 (  1,  1) [000000] ------------              |     +--*  LCL_VAR   long   V01 arg1         u:1 $80
N005 (  3,  4) [000005] -------N----              |     \--*  LSH       long   $141
N003 (  2,  3) [000002] ------------              |        +--*  CAST      long <- int $140
N002 (  1,  1) [000001] ------------              |        |  \--*  LCL_VAR   int    V02 arg2         u:1 $c0
N004 (  1,  1) [000004] ------------              |        \--*  CNS_INT   long   2 $180
N008 (  1,  1) [000007] ------------              \--*  CNS_INT   int    0 $44

***** BB01
STMT00001 (IL 0x008...0x011)
N011 ( 10, 10) [000021] -A-XG-------              *  ASG       int    $VN.Void
N009 (  8,  8) [000020] *--X---N----              +--*  IND       int    $44
N008 (  6,  7) [000018] -------N----              |  \--*  ADD       long   $145
N001 (  1,  1) [000010] ------------              |     +--*  LCL_VAR   long   V01 arg1         u:1 $80
N007 (  5,  6) [000017] -------N----              |     \--*  LSH       long   $144
N005 (  4,  5) [000014] ------------              |        +--*  CAST      long <- int $143
N004 (  3,  3) [000013] ------------              |        |  \--*  ADD       int    $200
N002 (  1,  1) [000011] ------------              |        |     +--*  LCL_VAR   int    V02 arg2         u:1 $c0
N003 (  1,  1) [000012] ------------              |        |     \--*  CNS_INT   int    1 $40
N006 (  1,  1) [000016] ------------              |        \--*  CNS_INT   long   2 $180
N010 (  1,  1) [000019] ------------              \--*  CNS_INT   int    0 $44

***** BB01
STMT00002 (IL 0x012...0x01B)
N011 ( 10, 10) [000033] -A-XG-------              *  ASG       int    $VN.Void
N009 (  8,  8) [000032] *--X---N----              +--*  IND       int    $44
N008 (  6,  7) [000030] -------N----              |  \--*  ADD       long   $148
N001 (  1,  1) [000022] ------------              |     +--*  LCL_VAR   long   V01 arg1         u:1 $80
N007 (  5,  6) [000029] -------N----              |     \--*  LSH       long   $147
N005 (  4,  5) [000026] ------------              |        +--*  CAST      long <- int $146
N004 (  3,  3) [000025] ------------              |        |  \--*  ADD       int    $201
N002 (  1,  1) [000023] ------------              |        |     +--*  LCL_VAR   int    V02 arg2         u:1 $c0
N003 (  1,  1) [000024] ------------              |        |     \--*  CNS_INT   int    2 $41
N006 (  1,  1) [000028] ------------              |        \--*  CNS_INT   long   2 $180
N010 (  1,  1) [000031] ------------              \--*  CNS_INT   int    0 $44

***** BB01
STMT00003 (IL 0x01C...0x025)
N011 ( 10, 10) [000045] -A-XG-------              *  ASG       int    $VN.Void
N009 (  8,  8) [000044] *--X---N----              +--*  IND       int    $44
N008 (  6,  7) [000042] -------N----              |  \--*  ADD       long   $14b
N001 (  1,  1) [000034] ------------              |     +--*  LCL_VAR   long   V01 arg1         u:1 (last use) $80
N007 (  5,  6) [000041] -------N----              |     \--*  LSH       long   $14a
N005 (  4,  5) [000038] ------------              |        +--*  CAST      long <- int $149
N004 (  3,  3) [000037] ------------              |        |  \--*  ADD       int    $202
N002 (  1,  1) [000035] ------------              |        |     +--*  LCL_VAR   int    V02 arg2         u:1 (last use) $c0
N003 (  1,  1) [000036] ------------              |        |     \--*  CNS_INT   int    3 $45
N006 (  1,  1) [000040] ------------              |        \--*  CNS_INT   long   2 $180
N010 (  1,  1) [000043] ------------              \--*  CNS_INT   int    0 $44

and replace it with just:

***** BB01
STMT00000 (IL 0x000...0x00B)
N003 (  3,  3) [000002] -A-XG-------              *  HWIntrinsic void   int Store $101
N001 (  1,  1) [000000] ------------              +--*  LCL_VAR   long   V01 arg1         u:1 (last use) $80
N002 (  1,  1) [000001] ------------              \--*  HWIntrinsic simd16 int get_Zero $100
area-CodeGen-coreclr

All 7 comments

I'm all for using HWIntrinsics to help support autovectorization (and hopefully others are as well).

I would say (assuming this effort is approved and worked on) we should try to write the infrastructure so it can be easily shared between x86 and ARM, and so ARM support can come online as soon as possible, since we are working on exposing those intrinsics for .NET 5.
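
For illustration, this is roughly the per-ISA dispatch a user has to write by hand today for the same four-element zero store; a shared recognizer could emit either form itself (a sketch; StoreZero4 is a hypothetical helper and it assumes the AdvSimd surface planned for .NET 5):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Sketch: hand-written x86/ARM dispatch for zeroing four ints, the kind of code
// a shared auto-vectorization pass could emit on both architectures
// (StoreZero4 is a hypothetical helper; assumes the .NET 5 AdvSimd surface).
static unsafe void StoreZero4(int* p)
{
    if (Sse2.IsSupported)
        Sse2.Store(p, Vector128<int>.Zero);    // (v)movdqu on x86/x64
    else if (AdvSimd.IsSupported)
        AdvSimd.Store(p, Vector128<int>.Zero); // str q-register on ARM64
    else
    {
        // scalar fallback
        p[0] = 0; p[1] = 0; p[2] = 0; p[3] = 0;
    }
}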

@tannergooding I agree, the only thing is that it's not easy to emit Vector.Create since it's not an intrinsic 🙂.
Btw, I noticed that in the following code:

static unsafe void MyTest(float* array, int length)
{
    for (int i = 0; i < length; i += 8)
        Avx.Store(array + i, Vector256<float>.Zero);
}

Vector256<float>.Zero is not hoisted out of the loop (loop-invariant code motion); is there a feature request/issue for that?
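
Until that is addressed, the constant can be hoisted by hand into a local; a sketch of that workaround (MyTestHoisted is a hypothetical name, and whether the JIT then keeps the zero in a register across the loop may still depend on the JIT version):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Sketch of the manual workaround: materialize the zero vector once before the
// loop instead of re-evaluating Vector256<float>.Zero in every iteration
// (MyTestHoisted is a hypothetical name).
static unsafe void MyTestHoisted(float* array, int length)
{
    Vector256<float> zero = Vector256<float>.Zero; // hoisted by hand
    for (int i = 0; i < length; i += 8)
        Avx.Store(array + i, zero);
}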

This is a dupe of previous proposals to implement auto-vectorization. The conclusion was that we need a general design discussion to find the best approach, as the issue is far from as simple as described here. I can't find the previous thread, though.

There are multiple issues, including (but not limited to): https://github.com/dotnet/coreclr/issues/20486

@4creators @tannergooding ok, closing as a dup.
Yeah, I was not planning to do it myself, more like just sharing my prototype (who knows, maybe it will help, or demonstrate that now that we have HW intrinsic nodes it's a bit easier to implement 🙂).

@EgorBo

The more of us push for that feature's design and implementation, the better the chances we get it sooner. Anyway, hand-written vectorization of simple loops should hopefully soon be a thing of the past for the JIT and crossgen.
