Runtime: API Proposal: Arm64 Simd Insert and Extract elements

Created on 5 Jan 2018 · 33 comments · Source: dotnet/runtime

@eerhardt @CarolEidt @RussKeldorph

Target Framework netcoreapp2.1

```C#
namespace System.Runtime.Intrinsics.Arm.Arm64
{
/// <summary>
/// This class provides access to the Arm64 AdvSIMD intrinsics.
///
/// Arm64 CPUs indicate support for this feature by setting
/// ID_AA64PFR0_EL1.AdvSIMD == 0 or better.
/// </summary>
public static class Simd
{
    public static bool IsSupported { get { throw null; } }

    /// <summary>
    /// Vector extract item
    ///
    /// result = vector[index]
    ///
    /// Note: In order to be inlined, the index must be a JIT-time constant expression, which is used to
    /// populate the literal immediate field. Using a non-constant index will result in the generation of a switch table.
    ///
    /// Corresponds to vector forms of ARM64 MOV
    /// </summary>
    public static T Extract<T>(Vector64<T>  vector, byte index) where T : struct { throw null; }
    public static T Extract<T>(Vector128<T> vector, byte index) where T : struct { throw null; }

    /// <summary>
    /// Vector insert item
    ///
    /// result = vector;
    /// result[index] = data;
    ///
    /// Note: In order to be inlined, the index must be a JIT-time constant expression, which is used to
    /// populate the literal immediate field. Using a non-constant index will result in the generation of a switch table.
    ///
    /// Corresponds to vector forms of ARM64 INS
    /// </summary>
    public static Vector64<T>  Insert<T>(Vector64<T>  vector, byte index, T data) where T : struct { throw null; }
    public static Vector128<T> Insert<T>(Vector128<T> vector, byte index, T data) where T : struct { throw null; }
}

}
```
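
For illustration, here is a hypothetical use of the proposed API. The `SetAllVector128` helper is assumed only as a way to construct a vector and is not part of this proposal; since the bodies above are `throw null` placeholders, this is a sketch of intended usage rather than code that runs today.

```C#
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm.Arm64;

class Example
{
    static int Demo()
    {
        if (!Simd.IsSupported)
            return -1;                              // fall back on unsupported hardware

        Vector128<int> v = Simd.SetAllVector128(0); // hypothetical helper, not in this proposal
        v = Simd.Insert(v, 3, 42);                  // v[3] = 42; index must be a JIT-time constant
        return Simd.Extract(v, 3);                  // reads v[3], i.e. 42
    }
}
```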

Argument order and parameter names should be consistent between x86 and ARM64; they currently are not.

api-approved arch-arm64 area-System.Runtime.Intrinsics


All 33 comments

Same methods are called Insert* and Extract* in Intel intrinsics.

Updated for generics and to match XARCH naming conventions.

```C#
public static Vector64<T> Insert<T>(Vector64<T> left, byte index, T right) where T : struct { throw null; }
```

(nit) Are left and right good names for the parameters here? Maybe vector and value?

```C#
public static Vector64<T> Insert<T>(Vector64<T> left, byte index1, Vector64<T> right, byte index2) where T : struct { throw null; }
```

Should index1 and index2 be renamed to leftIndex and rightIndex?

@eerhardt Done

For reference, x86 is using value and data, rather than vector and value. I don't have a preference either way, but it might be nice for them to be named consistently.

Good call - I don't have a preference either way, either. Consistency is my preference. I just wanted something better than left and right.

I don't have a preference either way, either. Consistency is my preference. I just wanted something better than left and right.

Agreed.

For reference, x86 is using value and data, rather than vector and value

@tannergooding Like this?
```C#
Extract(value, valueIndex);
Insert(value, valueIndex, data);
Insert(value, valueIndex, data, dataIndex);
```

I don't think x86 has equivalents for Insert(value, valueIndex, data, dataIndex) (shuffle and unpack are generally used for these). But yes, there are basically the following signatures:

```C#
Vector256<T> Insert<T>(Vector256<T> value, T data, byte index)
Vector256<T> Insert<T>(Vector256<T> value, Vector128<T> data, byte index)
Vector256<T> Insert<T>(Vector256<T> value, T* address, byte index)
```

Like @eerhardt, I don't have a preference whether we use value/data or vector/value.

CC @fiigii

Like @eerhardt, I don't have a preference whether we use value/data or vector/value.

IMHO vector/value would better describe the parameters.

Consistency is my preference

What scope do you intend for that statement? Methods with similar usage? All methods?

For simple binary operands, left and right are simple and can be consistent.
result = left operator right makes a lot of sense, and x86 is already using this naming.

IMHO vector/value would better describe the parameters

Arguments starting with the same letter will slow comprehension slightly. People look at the first and last letters and word length to recognize differences quickly.

vector/data seems like a better combination.

What scope do you intend for that statement? Methods with similar usage? All methods?

My intended scope was that for things that are the same, or very similar, if there is no value in doing something different, then they should be consistent.

Another inconsistency is x86 has the order vector, data, index, but this has the order vector, index, data - the index and the data parameters are switched around. Maybe the value in being different there is that the machine instructions are different order between the two chips, and we want the C# API to match the low-level intrinsic. (I'm not saying that we want that - just giving an example where inconsistency may be OK - because there is a reason to be inconsistent.) But in general, when there is no real reason to be different, I think we should be consistent.
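
To make the two orders concrete, here is a side-by-side sketch using the orders as described in this thread (method names and overloads are illustrative, not verbatim from either API surface):

```C#
// x86 order, as described above: value, data, index
Vector128<int> a = Sse41.Insert(value, data, index);

// ARM64 order, as proposed above: vector, index, data
Vector128<int> b = Simd.Insert(vector, index, data);
```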

But in general, when there is no real reason to be different, I think we should be consistent.

This has quite an impact on the learning curve; good and intuitive naming makes it less steep.

Arguments starting with the same letter will slow comprehension slightly.

This is a good point - I have myself played word-recognition games using this kind of cognitive rule.

This has quite an impact on the learning curve; good and intuitive naming makes it less steep.

Agreed. Let's have good, intuitive, and consistent naming. 😄
We can still change the x86 names if there are better names than what currently exist.

Another inconsistency is x86 has the order vector, data, index, but this has the order vector, index, data

The Arm64 order is consistent with left-to-right reading order:

```C#
result = vector[index];
vector[index] = data;
vector[index1] = data[index2];
```

It is also consistent with Arm64 assembly syntax:

```asm
mov X2, V0[3]      // X2 = V0[3]
mov V0[3], X2      // V0[3] = X2
mov V0[3], V2[5]   // V0[3] = V2[5]
```

It is inconsistent with codegen order (which always confuses me):

```
emit_R_R_I(... target, vector, index)
emit_R_R_I(... vector, data, index)
emit_R_R_I_I(... vector, data, index1, index2)
```

My personal preference would be to

  • keep the Arm64 order as proposed
  • change X86 order to match

API is not my dominion so I will accept instruction.

I don't think x86 has equivalents for Insert(vector, valueIndex, data, dataIndex)

I wonder if this API is needed.

It could also be written

```C#
Insert(vector, valueIndex, Extract(data, dataIndex))
```
During lowering the pattern could be recognized and the Extract contained in the Insert.

During import, the tree was going to be created like this anyway, with the Extract marked as contained.

I guess it is in part a philosophical question:

  • should lowering be done on HW intrinsics
  • OR should it be treated like inline assembly and kept intact

I guess I am preferring to drop this API and handle it in containment analysis.

when there is no real reason to be different, I think we should be consistent.

I certainly agree.

The issue is that the x86 API surface is already large enough that I missed the overlap. The unfamiliar x86 terminology and documentation in terms of assembly instructions certainly does not make it easier.

In this case, when the overlap was identified, I made the mistake of just renaming the function instead of copying the API from x86. In fact, until this discussion I didn't realize I had made that mistake.

should it be treated like inline assembly and kept intact

I believe it should be treated as inline assembly (effectively). That is, outside of the helper functions which don't strictly map to a single instruction (e.g. SetAll, StaticCast, etc.), we should be emitting exactly what the user requested (meaning we wouldn't do something like convert Sse.Shuffle(left, right, 68) into Sse2.UnpackLow(left, right) or Sse.MoveLowHigh(left, right), even though they are all functionally equivalent).

If there are cases where a user can get better codegen/perf by changing their code, an analyzer may be better suited to do that analysis and suggest the "fix".

This not only keeps the intrinsics simple, but it means the user gets exactly what they ask for which is predictable.

I believe it should be treated as inline assembly (effectively)

In general I agree

outside of the helper functions which don't strictly map to a single instruction

I believe this vector[index1] = data[index2] case is actually one of the rare exceptions.

If the base type is integer, the vector element accesses represent a vector <-> general-purpose register file transfer, which can be slower. So fusing will eliminate the slow operations.

However, both indices must be immediate constants. If either index is not a constant, a switch table is needed to implement it. In that case I would not want to fuse the instructions.
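
To illustrate the non-constant case: the generated switch table is morally equivalent to this hypothetical helper (sketched in source form for `Vector128<int>`, which has four lanes; `Simd.Insert` as proposed above):

```C#
static Vector128<int> InsertVariable(Vector128<int> vector, byte index, int data)
{
    // Each case supplies a literal index, so the INS immediate encoding is valid.
    switch (index)
    {
        case 0: return Simd.Insert(vector, 0, data);
        case 1: return Simd.Insert(vector, 1, data);
        case 2: return Simd.Insert(vector, 2, data);
        case 3: return Simd.Insert(vector, 3, data);
        default: throw new ArgumentOutOfRangeException(nameof(index));
    }
}
```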

On ARM/ARM64 this case is probably best handled with structured loads and stores. The need for this flexibility seems like a rare corner case.

Finally allowing Insert to contain Extract is equivalent to silently eliminating a mov and dropping a temporary register. So containment is not risky.

My preference is still to remove the composite API and rely on containment.
If containment were prohibited, I would still defer adding the composite API.

I updated the OP to remove the composite API and use vector/data.

I left the Insert arguments in left-to-right order, pending further discussion.

Finally allowing Insert to contain Extract is equivalent to silently eliminating a mov and dropping a temporary register. So containment is not risky.

Fair enough 👍

Sse2.UnpackLow(left, right) or Sse.MoveLowHigh(left, right) even though they are all functionally equivalent

If they are functionally equivalent, why do they use different C# method names?
Doesn't this violate "good, intuitive, and consistent naming" ?

Sse2.UnpackLow operates on double and does:

```
DEST[63:0]   = SRC1[63:0]
DEST[127:64] = SRC2[63:0]
```

Sse.MoveLowToHigh operates on float and does:

```
DEST[31:0]   = SRC1[31:0]
DEST[63:32]  = SRC1[63:32]
DEST[95:64]  = SRC2[31:0]
DEST[127:96] = SRC2[63:32]
```

From a bit-agnostic view, they are functionally the same (and will do the same thing regardless of the data you operate on).

The functionality does differ when you operate on Vector256.

Avx.UnpackLow operates on double and does:

```
DEST[63:0]    = SRC1[63:0]
DEST[127:64]  = SRC2[63:0]
DEST[191:128] = SRC1[191:128]
DEST[255:192] = SRC2[191:128]
```

and MoveLowToHigh doesn't have a Vector256 overload

@tannergooding 👍 X86 assembly is too ... for me to try to figure it out...

@caroleidt @eerhardt The argument order for the Insert<T>() case is the only open issue here.

I think the proposed order above is best, but it is inconsistent with X86. I would prefer X86 order got fixed.

However, if consensus favors the current x86 argument order, I am ready to change.

I think we can mark this as ready for review. We can bring up the ordering during the review - whether they should be consistent and if so, which one should be used.

@eerhardt Any reason why this is marked future? For Extract, the CoreCLR work is complete and ready for 2.1. The remaining Insert part is in review and should be ready for 2.1.

Looks good as proposed. Feedback:

  • No need to match the x86 ordering of parameters; just match the ARM instruction ordering.

@tannergooding Are you currently working on this? If not - can I take this issue?

No, I am not working on it right now and am trying to finish going through the architecture manual to find which APIs still need to be proposed.

@TamarChristinaArm Can you please help me understand whether

```C#
public static Vector64<float>  Insert(Vector64<float>  vector, byte index, float data);
public static Vector128<float> Insert(Vector128<float> vector, byte index, float data);

public static Vector64<double>  Insert(Vector64<double>  vector, byte index, double data);
public static Vector128<double> Insert(Vector128<double> vector, byte index, double data);
```

should be lowered to

```asm
INS Vd.S[lane], Vn.S[0]
INS Vd.S[lane], Vn.S[0]

INS Vd.D[lane], Vn.D[0]
INS Vd.D[lane], Vn.D[0]
```

or something else?

https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vset_lane says that

```c
float32x2_t vset_lane_f32 (float32_t a, float32x2_t v, const int lane);
float32x4_t vsetq_lane_f32 (float32_t a, float32x4_t v, const int lane);

float64x1_t vset_lane_f64 (float64_t a, float64x1_t v, const int lane);
float64x2_t vsetq_lane_f64 (float64_t a, float64x2_t v, const int lane);
```

is lowered to

```asm
INS Vd.S[lane],Rn
INS Vd.S[lane],Rn

INS Vd.D[lane],Rn
INS Vd.D[lane],Rn
```

and Visual Studio 2019 seems to be doing what is prescribed on the website, i.e. it generates INS Vd.S[lane],Wn preceded by FMOV Wn, Sm.

For example,

```c
float r0 = 5.1f;
float r1 = 4.2f;

float32x4_t v0 = { 0 };

v0 = vsetq_lane_f32(r0, v0, 3);
v0 = vsetq_lane_f32(r1, v0, 1);

vst1q_f32(p, v0);
```

compiles to

```asm
00000001400010E8: 910003E8  mov         x8,sp
00000001400010EC: 1C0001F0  ldr         s16,0000000140001128
00000001400010F0: A9007D1F  stp         xzr,xzr,[x8]
00000001400010F4: 3DC003F2  ldr         q18,[sp]
00000001400010F8: 1C0001B1  ldr         s17,000000014000112C
00000001400010FC: 910043E9  add         x9,sp,#0x10
0000000140001100: 1E260208  fmov        w8,s16
0000000140001104: 52800000  mov         w0,#0
0000000140001108: 4E1C1D12  mov         v18.s[3],w8
000000014000110C: 1E260228  fmov        w8,s17
0000000140001110: 4E0C1D12  mov         v18.s[1],w8
0000000140001114: 4C007932  st1         {v18.4s},[x9]
```

Presumably, this could be compiled to?

```asm
mov         x8,sp
ldr         s16,0000000140001128
ldr         s17,000000014000112C
stp         xzr,xzr,[x8]
ldr         q18,[sp]
mov         v18.s[3],v16.s[0]
mov         v18.s[1],v17.s[0]
add         x9,sp,#0x10
st1         {v18.4s},[x9]
```

Am I missing something?

Presumably, this could be compiled to?

```asm
mov         x8,sp
ldr         s16,0000000140001128
ldr         s17,000000014000112C
stp         xzr,xzr,[x8]
ldr         q18,[sp]
mov         v18.s[3],v16.s[0]
mov         v18.s[1],v17.s[0]
add         x9,sp,#0x10
st1         {v18.4s},[x9]
```

Am I missing something?

No, there's no particular reason it can't. In fact, in GCC we do this. We only generate the instruction in the ACLE docs in the case where we have constructed the float value on the integer side to begin with.

e.g. modern GCC for

```c
float r0 = 5.1f;
```

won't generate a literal pool but will instead create the bit pattern on the general-purpose register side:

```asm
        mov     w1, 13107
        movk    w1, 0x40a3, lsl 16
        fmov    s2, w1
```

This is because the mov/movk pair is handled in the same cycle. So it ends up being cheaper to construct the value like this than addressing the literal pool and doing a load, and you just replace the fmov's load with the mov.
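
For what it's worth, the immediates above can be cross-checked from C#: 5.1f has the IEEE-754 single-precision encoding 0x40A33333, whose low half is the mov immediate (13107 == 0x3333) and whose high half is the movk immediate (0x40A3, shifted left 16):

```C#
using System;

int bits = BitConverter.SingleToInt32Bits(5.1f);
Console.WriteLine(bits.ToString("X8")); // prints 40A33333
```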

@TamarChristinaArm Thank you for clarifying this for me!
