@eerhardt @CarolEidt @RussKeldorph
Target Framework netcoreapp2.1
```C#
namespace System.Runtime.Intrinsics.Arm.Arm64
{
///
/// This class provides access to the Arm64 AdvSIMD intrinsics
///
/// Arm64 CPUs indicate support for this feature by setting
/// ID_AA64PFR0_EL1.AdvSIMD == 0 or better.
///
public static class Simd
{
public static bool IsSupported { get { throw null; } }
/// <summary>
/// Vector extract item
///
/// result = vector[index]
///
/// Note: In order to be inlined, index must be a JIT time const expression which can be used to
/// populate the literal immediate field. Use of a non constant will result in generation of a switch table
///
/// Corresponds to vector forms of ARM64 MOV
/// </summary>
public static T Extract<T>(Vector64<T> vector, byte index) where T : struct { throw null; }
public static T Extract<T>(Vector128<T> vector, byte index) where T : struct { throw null; }
/// <summary>
/// Vector insert item
///
/// result = vector;
/// result[index] = data;
///
/// Note: In order to be inlined, index must be a JIT time const expression which can be used to
/// populate the literal immediate field. Use of a non constant will result in generation of a switch table
///
/// Corresponds to vector forms of ARM64 INS
/// </summary>
public static Vector64<T> Insert<T>(Vector64<T> vector, byte index, T data) where T : struct { throw null; }
public static Vector128<T> Insert<T>(Vector128<T> vector, byte index, T data) where T : struct { throw null; }
}
}
```
Argument order and parameter names should be consistent between X86 and ARM64. They are currently not
Same methods are called Insert* and Extract* in Intel intrinsics.
Updated for generics and to match XARCH naming conventions.
public static Vector64<T> Insert<T>(Vector64<T> left, byte index, T right) where T : struct { throw null; }
(nit) Are left and right good names for the parameters here? Maybe vector and value?
public static Vector64<T> Insert<T>(Vector64<T> left, byte index1, Vector64<T> right, byte index2) where T : struct { throw null; }
Should index1 and index2 be renamed to leftIndex and rightIndex?
@eerhardt Done
For reference, x86 is using value and data, rather than vector and value. I don't have a preference either way, but it might be nice for them to be named consistently.
Good call - I don't have a preference either way as well. Consistency is my preference. I just wanted something better than left and right.
I don't have a preference either way as well. Consistency is my preference. I just wanted something better than left and right.
Agreed.
For reference, x86 is using value and data, rather than vector and value
@tannergooding Like this?
C#
Extract(value, valueIndex);
Insert(value, valueIndex, data);
Insert(value, valueIndex, data, dataIndex);
I don't think x86 has equivalents for Insert(value, valueIndex, data, dataIndex) (shuffle and unpack are generally used for these). But yes, there is basically the following signatures:
Vector256<T> Insert<T>(Vector256<T> value, T data, byte index)
Vector256<T> Insert<T>(Vector256<T> value, Vector128<T> data, byte index)
Vector256<T> Insert<T>(Vector256<T> value, T* address, byte index)
Like @eerhardt, I don't have a preferences if we use value/data or vector/value.
CC @fiigii
Like @eerhardt, I don't have a preferences if we use value/data or vector/value.
IMHO vector\value would better describe parameters
Consistency is my preference
What scope do you intend for that statement? Methods with similar usage? All methods?
For simple binary operands left and right is simple and can be consistent.
result = left operator right makes a lot of sense and X86 is already using this naming
IMHO vectorvalue would better describe parameters
Arguments starting with same letter will slow comprehension slightly. People look at first and last letter, word length to recognize differences quickly.
vector/data seems like a better combination.
What scope do you intend for that statement? Methods with similar usage? All methods?
My intended scope was that for things that are the same, or very similar, if there is no value in doing something different, then they should be consistent.
Another inconsistency is x86 has the order vector, data, index, but this has the order vector, index, data - the index and the data parameters are switched around. Maybe the value in being different there is that the machine instructions are different order between the two chips, and we want the C# API to match the low-level intrinsic. (I'm not saying that we want that - just giving an example where inconsistency may be OK - because there is a reason to be inconsistent.) But in general, when there is no real reason to be different, I think we should be consistent.
But in general, when there is no real reason to be different, I think we should be consistent.
This has quite an impact on learning curve, having good and intuitive naming makes it less steep.
Arguments starting with same letter will slow comprehension slightly.
This is good point - I have myself played games on word recognition using this kind cognitive rules.
This has quite an impact on learning curve, having good and intuitive naming makes it less steep.
Agreed. Let's have good, intuitive, and consistent naming. 馃槃
We can still change the x86 names if there are better names than what currently exist.
Another inconsistency is x86 has the order vector, data, index, but this has the order vector, index, data
The Arm64 order is consistent with left to right reading order
result = vector[index];
vector[index] = data;
vector[index1] = data[index2]
It is also consistent with Arm64 assembly syntax.
mov X2, V0[3] // X2 = V0[3]
mov V0[3], X2 // V0[3] = X2
mov V0[3], V2[5] // V0[3] = V2[5]
It is inconsistent with codegen order (which always confuses me)
emit_R_R_I(... target, vector, index)
emit_R_R_I(... vector, data, index)
emit_R_R_I_I(... vector, data, index1, index2)
My personal preference would be to
API is not my dominion so I will accept instruction.
I don't think x86 has equivalents for
Insert(vector, valueIndex, data, dataIndex)
I wonder if this API is needed.
It could also be written
C#
Insert(vector, valueIndex, Extract(data, dataIndex))
During lowering the pattern could be recognized and the Extract contained in the Insert.
During import the tree was going to be created like this anyway marking the Extract contained.
I guess it is in part a philosophical question,
I guess I am prefering dropping this API and handling in containment analysis
when there is no real reason to be different, I think we should be consistent.
I certainly agree.
The issue is that the X86 API surface is large enough already, that I miss overlap. The unfamiliar X86 terminology and documentation in terms of assembly instructions certainly does not make it easier.
In this case, when the overlap was identified, I made the mistake of just renaming the function instead of copying the API from X86. In fact until this discussion I didn't realize I made that mistake.
should it be treated like inline assembly and kept intact
I believe it should be treated as inline assembly (effectively). That is, outside of the helper functions which don't strictly map to a single instruction (e.g SetAll, StaticCast, etc), we should be emitting exactly what the user requested (meaning, we wouldn't do something like convert Sse.Shuffle(left, right, 68) into Sse2.UnpackLow(left, right) or Sse.MoveLowHigh(left, right) even though they are all functionally equivalent).
If there are cases where a user can get better codegen/perf by changing their code, an analyzer may be better suited to do that analysis and suggest the "fix".
This not only keeps the intrinsics simple, but it means the user gets exactly what they ask for which is predictable.
I believe it should be treated as inline assembly (effectively)
In general I agree
outside of the helper functions which don't strictly map to a single instruction
I believe this vector[index1] = data[index2] case is actually one of the rare exceptions.
If the base type is integer the vector element accesses represent a vector <--> general purpose register file which can be slower. So fusing will eliminate the slow operations.
However both indicies must be immediate constants. In the case either index is not a constant, we need a switch to implement. In this case I would not want to fuse the instructions.
On ARM/ARM64 this case is probably best handle with structured loads and stores. The need for this flexibility seems like a rare corner case.
Finally allowing Insert to contain Extract is equivalent to silently eliminating a mov and dropping a temporary register. So containment is not risky.
My preference is still to remove the composite API, and contain.
If containment was prohibited I would still defer adding the composite API.
I updated OP to remove the composite API & use vector/data.
I left the Insert arguments in left to right order, pending further discussion.
Finally allowing Insert to contain Extract is equivalent to silently eliminating a mov and dropping a temporary register. So containment is not risky.
Fair enough 馃憤
Sse2.UnpackLow(left, right) or Sse.MoveLowHigh(left, right) even though they are all functionally equivalent
If they are functionally equivalent, why do they use different C# method names?
Doesn't this violate "good, intuitive, and consistent naming" ?
Sse2.UnpackLow operates on double and does:
DEST[63:0] = SRC1[63:0]
DEST[127:64] = SRC2[63:0]
Sse.MoveLowToHigh operates on float and does:
DEST[31:0] = SRC1[31:0]
DEST[63:32] = SRC1[63:32]
DEST[95:64] = SRC2[31:0]
DEST[127:96] = SRC2[63:32]
From a bit agnostic view, they are functionally the same (and will do the same thing regardless of the data you operate on)
The functionality does differ when you operate on Vector256
Avx.UnpackLow operates on double and does:
DEST[63:0] = SRC1[63:0]
DEST[127:64] = SRC2[63:0]
DEST[191:128] = SRC1[191:128]
DEST[255:192] = SRC2[191:128]
and MoveLowToHigh doesn't have a Vector256 overload
@tannergooding 馃憤 X86 assembly is too ... for me to try to figure it out...
@caroleidt @eerhardt The argument order for the Insert<T>() case is the only open issue here.
I think the proposed order above is best, but it is inconsistent with X86. I would prefer X86 order got fixed.
However, if consensus favors current X86 argument order, I ready to change.
I think we can mark this as ready for review. We can bring up the ordering during the review - whether they should be consistent and if so, which one should be used.
@eerhardt Any reason why this is marked future? For Extract the CoreCLR work is complete and ready for 2.1. The remaining Insert part is in review and should be ready for 2.1
Looks good as proposed. Feedback:
@tannergooding Are you currently working on this? If not - can I take this issue?
No, I am not working on it right now and am trying to finish going through the architecture manual to find which APIs still need to be proposed
@TamarChristinaArm Can you please help me to understand whether
```c#
public static Vector64
public static Vector128
public static Vector64
public static Vector128
should be lowered to
```asm
INS Vd.S[lane], Vn.S[0]
INS Vd.S[lane], Vn.S[0]
INS Vd.D[lane], Vn.D[0]
INS Vd.D[lane], Vn.D[0]
or something else?
https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vset_lane says that
float32x2_t vset_lane_f32 (float32_t a, float32x2_t v, const int lane);
float32x4_t vsetq_lane_f32 (float32_t a, float32x4_t v, const int lane);
float64x1_t vset_lane_f64 (float64_t a, float64x1_t v, const int lane);
float64x2_t vsetq_lane_f64 (float64_t a, float64x2_t v, const int lane);
is lowered to
INS Vd.S[lane],Rn
INS Vd.S[lane],Rn
INS Vd.D[lane],Rn
INS Vd.D[lane],Rn
and Visual Studio 2019 seems to be doing what prescribed on the website, i.e. it generates INS Vd.S[lane],Wn preceded by FMOV Wn, Sm
For example,
float r0 = 5.1f;
float r1 = 4.2f;
float32x4_t v0 = { 0 };
v0 = vsetq_lane_f32(r0, v0, 3);
v0 = vsetq_lane_f32(r1, v0, 1);
vst1q_f32(p, v0);
compiles to
00000001400010E8: 910003E8 mov x8,sp
00000001400010EC: 1C0001F0 ldr s16,0000000140001128
00000001400010F0: A9007D1F stp xzr,xzr,[x8]
00000001400010F4: 3DC003F2 ldr q18,[sp]
00000001400010F8: 1C0001B1 ldr s17,000000014000112C
00000001400010FC: 910043E9 add x9,sp,#0x10
0000000140001100: 1E260208 fmov w8,s16
0000000140001104: 52800000 mov w0,#0
0000000140001108: 4E1C1D12 mov v18.s[3],w8
000000014000110C: 1E260228 fmov w8,s17
0000000140001110: 4E0C1D12 mov v18.s[1],w8
0000000140001114: 4C007932 st1 {v18.4s},[x9]
Presumably, this could be compiled to?
mov x8,sp
ldr s16,0000000140001128
ldr s17,000000014000112C
stp xzr,xzr,[x8]
ldr q18,[sp]
mov v18.s[3],v16.s[0]
mov v18.s[1],v17.s[0]
add x9,sp,#0x10
st1 {v18.4s},[x9]
Am I missing something?
Presumably, this could be compiled to?
mov x8,sp ldr s16,0000000140001128 ldr s17,000000014000112C stp xzr,xzr,[x8] ldr q18,[sp] mov v18.s[3],v16.s[0] mov v18.s[1],v17.s[0] add x9,sp,#0x10 st1 {v18.4s},[x9]Am I missing something?
No there's no particular reason it can't. In fact in GCC we do this. We only generate the instruction in the ACLE docs in the case that we have constructed the float value on the integer side to begin with.
e.g. modern GCC for
float r0 = 5.1f;
won't generate a literal pool but will instead create the bitpattern on the genreg side
mov w1, 13107
movk w1, 0x40a3, lsl 16
fmov s2, w1
This is because the first mov/movk is handled in the same cycle. So it ends up being cheaper to construct like this then addressing the literal pool and doing a load and you just replace the fmov with the mov.
@TamarChristinaArm Thank you for clarifying this for me!
Most helpful comment
My intended scope was that for things that are the same, or very similar, if there is no value in doing something different, then they should be consistent.
Another inconsistency is x86 has the order
vector, data, index, but this has the ordervector, index, data- the index and the data parameters are switched around. Maybe the value in being different there is that the machine instructions are different order between the two chips, and we want the C# API to match the low-level intrinsic. (I'm not saying that we want that - just giving an example where inconsistency may be OK - because there is a reason to be inconsistent.) But in general, when there is no real reason to be different, I think we should be consistent.