Runtime: Optimize System.Runtime.Intrinsics using arm64 intrinsics

Created on 11 Mar 2020  ·  8 Comments  ·  Source: dotnet/runtime

This item tracks the conversion of the System.Runtime.Intrinsics APIs to use arm64 intrinsics.

Related: #33308

arch-arm64 area-System.Numerics

All 8 comments

@BruceForstall, how does this compare to https://github.com/dotnet/runtime/issues/33495?

Ah, based on #33308, this should be System.Runtime.Intrinsics and the original comment should be updated.

@TamarChristinaArm, is there an existing reference for how to efficiently create a Vector64/Vector128 from non-constant inputs?

We have methods both like:

public static unsafe Vector128<byte> Create(byte value);
public static unsafe Vector128<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7, byte e8, byte e9, byte e10, byte e11, byte e12, byte e13, byte e14, byte e15);

In the case of the former, the given value is duplicated to all elements in the Vector.
In the case of the latter, we are given the value for each element separately.
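To make the two shapes concrete, here is a minimal sketch of how the two overloads behave from the caller's side. The class and method names (`CreateDemo`, `Splat`, `Lanes`) are hypothetical and only there for illustration; `Vector128.Create` and `GetElement` are the existing System.Runtime.Intrinsics APIs.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class CreateDemo
{
    // Hypothetical helper: broadcast form, all 16 lanes hold the same value.
    public static Vector128<byte> Splat(byte value) => Vector128.Create(value);

    // Hypothetical helper: per-element form, each lane gets its own value (e0 first).
    public static Vector128<byte> Lanes() => Vector128.Create(
        (byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

    public static void Main()
    {
        // Read back lane 5 from each vector to show the difference.
        Console.WriteLine(Splat(7).GetElement(5)); // same value in every lane
        Console.WriteLine(Lanes().GetElement(5));  // the value passed as e5
    }
}
```

Nothing here is platform specific: which instructions get emitted for each call is up to the JIT on the target architecture.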

I don't see any existing reference that can be used on the C++ side. But, on the x86/x64 side the former is basically a single broadcast instruction and the latter is a series of inserts.

Would the same be the correct thing to do for ARM64?

> @TamarChristinaArm, is there an existing reference for how to efficiently create a Vector64/Vector128 from non-constant inputs?

Hmm, not that I'm aware of, though we have a very limited number of instructions for this.

> In the case of the former, the given value is duplicated to all elements in the Vector.
> In the case of the latter, we are given the value for each element separately.
>
> I don't see any existing reference that can be used on the C++ side. But, on the x86/x64 side the former is basically a single broadcast instruction and the latter is a series of inserts.
>
> Would the same be the correct thing to do for ARM64?

Yeah, the former case is just a dup and the latter is indeed a series of ins, where the first one is an fmov to create the vector.
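As a rough illustration of that mapping, the sketch below writes the same two patterns by hand with the `AdvSimd` APIs. It is not how the runtime implements `Create`; the class and helper names are hypothetical, it uses four `int` lanes instead of sixteen bytes to stay short, and it falls back to `Vector128.Create` where `AdvSimd` is not supported. The instruction names in the comments are the ones the calls are intended to correspond to, not a guarantee of what the JIT emits.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class Arm64CreateSketch
{
    // Hypothetical helper: one value duplicated to every lane (the "dup" case).
    public static Vector128<int> Broadcast(int value)
    {
        if (AdvSimd.IsSupported)
        {
            return AdvSimd.DuplicateToVector128(value);
        }
        return Vector128.Create(value); // portable fallback off ARM64
    }

    // Hypothetical helper: a vector built lane by lane (the "series of ins" case).
    public static Vector128<int> FromElements(int e0, int e1, int e2, int e3)
    {
        if (AdvSimd.IsSupported)
        {
            // Start the vector from the first scalar, then insert the rest.
            Vector128<int> result = Vector128.CreateScalar(e0);
            result = AdvSimd.Insert(result, 1, e1);
            result = AdvSimd.Insert(result, 2, e2);
            result = AdvSimd.Insert(result, 3, e3);
            return result;
        }
        return Vector128.Create(e0, e1, e2, e3); // portable fallback off ARM64
    }

    public static void Main()
    {
        Console.WriteLine(Broadcast(42));
        Console.WriteLine(FromElements(10, 20, 30, 40));
    }
}
```

Keeping the lane indices as constants matters here, since `AdvSimd.Insert` expects a constant index to map cleanly onto a single instruction.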

Though the context they're used in is important, since e.g. doing a load followed by Create(byte value); should ideally result in a ld1r.

> Though the context they're used in is important, since e.g. doing a load followed by Create(byte value); should ideally result in a ld1r.

Right, and if all or most of the inputs are constant, we could just construct a 64-bit or 128-bit constant (or simplify codegen in other ways) and load that instead.

But those should hopefully be optimizations handled in the JIT 😄

Yup :) And if you are creating one of the YxZ_t types, like int32x4x4_t, then the Create call itself should be a no-op: the register allocator should, if it can, just arrange for the values to be placed in the correct registers when they're created, so it's zero cost :)

@BruceForstall - All the APIs under System.Runtime.Intrinsics are optimized. Thank you @tannergooding , @echesakovMSFT and @TamarChristinaArm for your valuable feedback throughout.

@kunalspathak That's great! Thanks for all the work!
