UnpackLow() (which is a different operation), or wrap the existing signed upconversion with StaticCast&lt;signed, unsigned&gt;(), which may introduce overhead (see dotnet/runtime#10357). The methods being suggested are:
- Sse41.ConvertToVector128UInt16(Vector128(byte)) for [V]PMOVZXBW xmm, xmm (same insn as for existing ConvertToVector128Int16())
- Sse41.ConvertToVector128UInt32(Vector128(byte/ushort)) for [V]PMOVZXBD/WD xmm, xmm (same insn as for existing ConvertToVector128Int32())
- Sse41.ConvertToVector128UInt64(Vector128(byte/ushort/uint)) for [V]PMOVZXBQ/WQ/DQ xmm, xmm (same insn as for existing ConvertToVector128Int64())
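For reference, the cast-wrapping workaround looks roughly like this (a sketch only; WidenToUInt16 is a made-up name, and the As* reinterpret helpers that eventually replaced StaticCast are used here):

// using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
static Vector128<ushort> WidenToUInt16(Vector128<byte> value) =>
    Sse41.ConvertToVector128Int16(value).AsUInt16(); // one [V]PMOVZXBW plus a bitwise reinterpret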
Similarly, no direct means are exposed in the API for unsigned int -> signed int upconversion when the source is in a 128-bit vector and the destination is in a 256-bit vector. The methods being suggested are:
- Avx2.ConvertToVector256Int16(Vector128(byte)) for VPMOVZXBW ymm, xmm (same insn as for existing ConvertToVector256UInt16())
- Avx2.ConvertToVector256Int32(Vector128(byte/ushort)) for VPMOVZXBD/WD ymm, xmm (same insn as for existing ConvertToVector256UInt32())
- Avx2.ConvertToVector256Int64(Vector128(byte/ushort/uint)) for VPMOVZXBQ/WQ/DQ ymm, xmm (same insn as for existing ConvertToVector256UInt64())

I was not able to find a not-too-verbose method to convert a 32/64-bit scalar value to a 256-bit vector in a YMM reg. It is possible to set a 128-bit vector with Sse2.ConvertScalarToVector128UInt32/64(), which produces (MOV r32/r64, imm + MOVD/MOVQ xmm, r32/r64), but then a MOVDQA XMMd, XMMs is automatically issued when one attempts to use the helper method Avx.ExtendToVector256() to get the 256-bit vector. To my understanding, the helper method was intended to act as a type conversion and produce a no-op in such cases, since MOVD/MOVQ X/YMM, r32/r64 zeroes the upper portion of the dest reg. Below is an example of the issue I'm trying to explain:
var v = Avx.ExtendToVector256(Sse2.ConvertScalarToVector128UInt64(0x12345678UL));
00007FF989272625 mov ecx,12345678h
00007FF98927262A vmovq xmm0,rcx
00007FF98927262F vmovdqa xmm6,xmm0 <======= this is not required
OTOH, the following conversion in the reverse direction produces code that looks fine/optimal:
var v1 = Sse2.ConvertToUInt64(Avx.GetLowerHalf(Avx.SetZeroVector256<ulong>()));
00007FF989282618 vpxor ymm0,ymm0,ymm0
00007FF98928261D vmovq rsi,xmm0
There exists just one overload for Ssse3.AlignRight() that works on sbyte. I believe it makes sense to add overloads for other integer types, the same way as it was implemented for Sse2.ShiftRightLogical128BitLane() which is quite similar in operation. Otherwise developers will have to use type casting.

There exists just one overload for Ssse3.Shuffle() that works on sbyte. I believe it makes sense to add an overload that will also work on byte.

I believe these versions deserve their own overloads since conceptually they can be used on SSExx-only hardware which does not provide anything closer to implement "gather" and "scatter" operations (actually "scatter" only appears in AVX512, and anyway granularity is 32 or 64 bits IIRC); see the sketch after this list:
- PMOVZX/SX... xmm, [m] - these load from [m] and extend at once, a nice fusion. Esp. note the 2x 8-bit version.
- PEXTRB/D/W + EXTRACTPS [m], xmm, i - spill a single element from xmm to [m].
- PINSRB/D/W + INSERTPS xmm, [m], i - merge a single element from [m] into xmm. There is a special issue open on the API for INSERTPS (dotnet/runtime#10383).
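To illustrate the "gather" point - a sketch only (the Gather4 name is made up, SSE4.1 assumed; with the [m] encodings, each insert could take its element straight from memory):

static unsafe Vector128<int> Gather4(int* basePtr, int i0, int i1, int i2, int i3)
{
    var v = Sse2.ConvertScalarToVector128Int32(basePtr[i0]); // MOVD xmm, r32
    v = Sse41.Insert(v, basePtr[i1], 1);                     // PINSRD xmm, r32, 1
    v = Sse41.Insert(v, basePtr[i2], 2);
    v = Sse41.Insert(v, basePtr[i3], 3);
    return v;
}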
@CarolEidt @fiigii @tannergooding @4creators @eerhardt Should this be discussed in corefx first?
@RussKeldorph, some of this (such as the inefficient codegen) should be resolved by a PR I am currently working on.
As for the new APIs, it is probably worth discussion during the next HWIntrinsic design review (I sent an e-mail on this a couple days ago).
Some of the inefficient codegen was cleaned up with https://github.com/dotnet/coreclr/pull/18262
More of the remaining inefficient codegen will be cleaned up with https://github.com/dotnet/coreclr/pull/18297
After dotnet/coreclr#18297, there is a bit more work that involves special handling for various intrinsics, but we are getting closer.
@tannergooding, thanks for the good news!
I wonder whether there will be any way for me to test these changes independently before the first 2.2.0 preview becomes available?
@voinokin you can use the SDK daily builds with the daily Intrinsics packages from myget.
https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/dogfooding.md
Or you can build the runtime from source if you want to be able to do JitDumps and the like
https://github.com/dotnet/coreclr/blob/master/Documentation/building/windows-instructions.md
https://github.com/dotnet/coreclr/blob/master/Documentation/workflow/UsingYourBuild.md
@saucecontrol Thanks! I'll give it a try in some time.
Adding these:
There exists just one overload for Ssse3.AlignRight() that works on sbyte. I believe it makes sense to add overloads for other integer types, the same way as it was implemented for Sse2.ShiftRightLogical128BitLane() which is quite similar in operation. Otherwise developers will have to use type casting.
We will want to be careful with this one, since the instruction explicitly operates on byte values.
There exists just one overload for Ssse3.Shuffle() that works on sbyte. I believe it makes sense to add an overload that will also work on byte.
It definitely makes sense to ensure that both the signed and unsigned versions are exposed here.
@tannergooding
We will want to be careful with this one, since the instruction explicitly operates on byte values.
It is the same for ShiftRightLogical128BitLane() - that's my point. BTW, the method has turned out to be quite useful even without typecasting :-).
I suggest renaming the argument of AlignRight(), which is currently called mask (?), to something like numBytes - again, the same way as it's done for ShiftRightLogical128BitLane().
We will want to be careful with this one, since the instruction explicitly operates on byte values.
Really, all the masked byte-shuffle instructions work on the minimum element size that can be represented by the mask, but they're not necessarily most-often used on that size.
I have an example here that uses AlignRight on ulong values. Something went funny in the codegen with all the casts I had to do, though. Even though I have a managed helper for that here, I couldn't use it in that context without speed taking a dive. It would be nice if we could have managed helper overloads for cases like that, but we'd have to be assured that they boil down to the same instruction during JIT. I don't know whether such a managed helper should be part of the Intrinsics API or whether we should have to roll our own, but it'll be a common use case, no doubt.
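For reference, the cast-through helper was along these lines (a sketch; AlignRightUInt64 is a made-up name, and the As* methods stand in for the StaticCast of that era):

static Vector128<ulong> AlignRightUInt64(Vector128<ulong> left, Vector128<ulong> right, byte count) =>
    Ssse3.AlignRight(left.AsSByte(), right.AsSByte(), count).AsUInt64(); // still a single PALIGNR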
Something went funny in the codegen with all the casts I had to do, though. Even though I have a managed helper for that here, I couldn't use it in that context without speed taking a dive.
Try using the ref modifier on your method's parameters - sometimes that helps when the current JIT version inlines. My advice: check the disasm after that - I've seen a lot of it by now... And I've logged some strange findings here too.
Ah yes, thanks. I do that in a lot of cases but didn't try it on those tiny cast helpers.
A good bit of the "bad codegen" was because we didn't support the ins reg, [mem] encodings for a good number of the intrinsics.
The latest builds out of master should have much better codegen, and the last of the non-load/store intrinsics should support containment with https://github.com/dotnet/coreclr/pull/18349.
There is, of course, still some more work to be done, but hopefully you will see much better results.
(StaticCast itself does still need a change, but I will be working on that next).
Not sure whether this is the proper place to discuss API that is already defined (?)... Anyway, I stumbled across this with the names of the API methods that load and store things.
Some background:
- LoadScalarVectorNNN(), LoadVectorNNN() / LoadAlignedVectorNNN() / LoadDquVectorNNN() / LoadAlignedVectorNNNNonTemporal().
- Store() / StoreAligned() / StoreAlignedNonTemporal().

Here are my points:
- My feeling is that there could be better wording for ...Dqu... overloads.
- There is no word "Vector" in methods that do store entire vectors; it's similar to partial store operations.
- There are StoreNonTemporal() overloads which do not operate on vectors, taking a GP reg as input (MOVNTI [m], reg) - may it happen that these specific ones need some clearer names?
- There are LoadScalarVectorNNN(int/uint/long/ulong/float/double) overloads to load just one element according to the data type and then zero the remaining part of the vector. OTOH, the methods intended for the opposite operation have their names split: StoreScalar(float/double) (MOVSS/MOVSD [m], xmm), StoreLow(long/ulong) (MOVQ [m], xmm, which in fact stores one 64-bit element - call it a scalar), and I found no method exposed to store 32-bit ints as scalars (MOVD [m], xmm).

Adding something mentioned in https://github.com/dotnet/coreclr/issues/18300#issuecomment-394772776:
I believe these versions deserve their own overloads since conceptually they can be used on SSExx-only hardware which does not provide anything closer to implement "gather" and "scatter" operations (actually "scatter" only appears in AVX512, and anyway granularity is 32 or 64 bits IIRC):

- PMOVZX/SX... xmm, [m] - these load from [m] and extend at once, a nice fusion. Esp. note the 2x 8-bit version.
- PEXTRB/D/W + EXTRACTPS [m], xmm, i - spill a single element from xmm to [m].
- PINSRB/D/W + INSERTPS xmm, [m], i - merge a single element from [m] into xmm. There is a special issue open on the API for INSERTPS (dotnet/runtime#10383).

My feeling is that there could be better wording for ...Dqu... overloads.
Ah, actually I tried to find a better name for these guys, but I thought there is no single word that can explain the semantics very well, so I just followed the C++ names... Do you have suggestions?
There is no word "Vector" in methods that do store entire vectors; it's similar to partial store operations.
Store* intrinsics take source as a parameter, so the different vector length can be resolved by the overload system, which makes the API simpler.
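For example (a sketch - the names and pointers are illustrative):

static unsafe void StoreBoth(byte* p, Vector128<byte> v128, Vector256<byte> v256)
{
    Sse2.Store(p, v128);      // MOVDQU [m128], xmm - width resolved from the Vector128 source
    Avx.Store(p + 16, v256);  // VMOVDQU [m256], ymm - width resolved from the Vector256 source
}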
OTOH, the methods intended for opposite operations have names split: StoreScalar(float/double) (MOVSS/MOVSD [m], xmm), StoreLow(long/ulong) (MOVQ [m], xmm which in fact stores one 64-bit element call it scalar),
The Scalar suffix only makes sense with floating point types (float and double) because x86/x64 architectures execute floating point computation via SIMD (SSE2) units.
and I found no method exposed to store 32-bit ints as scalars (MOVD [m], xmm).
I believe these versions deserve their own overloads since conceptually they can be used on SSExx-only
In the current design, we are avoiding exposing "memory-access encoding" as much as possible, and we plan to generate these encodings via containment optimization (i.e., merging ins(load(address)) or store(address, ins(...)) into a single instruction). I think it also works for scalar type containment (i.e., folding a[i] = Sse2.ConvertToInt32(v) to MOVD [m], xmm).
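For example (a sketch, assuming the containment optimization kicks in as described):

static unsafe Vector128<short> AddFromMemory(Vector128<short> v, short* p) =>
    Sse2.Add(v, Sse2.LoadVector128(p)); // may be emitted as PADDW xmm, [mem], with no separate MOVDQU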
My feeling is that there could be better wording for ...Dqu... overloads.
Ah, actually I tried to find a better name for these guys, but I thought there is no single word that can explain the semantics very well, so I just followed the C++ names... Do you have suggestions?
For [V]LDDQU I suggest using LoadUnalignedVectorNNN - explicitly stating that the operation is intended as a special case of unaligned loads. (I believe it behaves just like MOVDQU on current CPUs, though.)
In the current design, we are avoiding exposing "memory-access encoding" as much as possible, and we plan to generate these encodings via containment optimization (i.e., merging ins(load(address)) or store(address, ins(...)) into a single instruction). I think it also works for scalar type containment (i.e., folding a[i] = Sse2.ConvertToInt32(v) to MOVD [m], xmm).
Given this will work, some unclear points still remain in the API:

- [V]PMOVZX/SXBQ xmm, [m16] - to my knowledge there is no single operation other than this to load-and-extend 2x 8-bit values into a vector. All other versions of [V]PMOVZX/SX... take at least 32 bits, which will be achievable through containment support with some typecasting + LoadVectorNNN() or ConvertScalarToVectorXXX(), but this specific version takes only 16 bits of input (see the sketch after this list). (VPBROADCASTW xmm/ymm, [m16] does also exist, but that is a different operation and requires AVX2.)
- MOVD/Q xmm, [m] - LoadScalarVector128(int/uint/long/ulong) already exist, so why not remove them in favor of Sse2.ConvertScalarToVector128[U]Int32/64(indir)? It looks asymmetrical for 32-bit ints for now - there is a direct load operation exposed, but not the corresponding direct store operation.
- INSERTPS xmm, [m32], i - related to dotnet/runtime#10383, supposedly fixed by dotnet/coreclr#17637. I can't see the final version, but if the existing overload taking just one scalar value is replaced with the overload taking a vector, then this instruction encoding will become unavailable (I had no understanding of containment being introduced when I logged dotnet/runtime#10383).
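A sketch of the 32-bit-input path mentioned in the first bullet (LoadWiden2 is a made-up name; note how it has to bounce through a GP reg instead of a single PMOVZXBQ xmm, [m16]):

static unsafe Vector128<long> LoadWiden2(byte* p) =>
    Sse41.ConvertToVector128Int64(
        Sse2.ConvertScalarToVector128UInt32(*(ushort*)p).AsByte()); // roughly MOVZX + MOVD + PMOVZXBQ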
For [V]LDDQU I suggest using LoadUnalignedVectorNNN

LoadUnaligned* is not enough to express the lddqu semantics. In x86 SIMD programming, unaligned is usually related to instructions like movups, movdqu, etc. So that may be confusing...
For [V]LDDQU I suggest using LoadUnalignedVectorNNN

LoadUnaligned* is not enough to express the lddqu semantics. In x86 SIMD programming, unaligned is usually related to instructions like movups, movdqu, etc. So that may be confusing...
Then the last remaining idea from me would be to extend the existing LoadVector128/256(type* ptr) overloads with an optional parameter, so that it would become LoadVector128/256(type* ptr, bool forceUnaligned = false). Looks a bit ugly though...
Perhaps LoadUnalignedSplit*. Though I tend to think that coming up with such fancy names for already established instructions/intrinsics does more harm than good. And LDDQU is kind of useless these days...
I think we've addressed some of this already. Could the original post be updated with anything still relevant or the issue be otherwise closed?
I think we've addressed some of this already
Right, I think we can close this issue and open a new issue for "folding store".
I think we've addressed some of this already. Could the original post be updated with anything still relevant or the issue be otherwise closed?
Tell me which issues remain and I will update the original post. Thanks.
@tannergooding can you help get this sorted out? Hoping there is no work left here.
cc @CarolEidt
No.1 and No.2 haven't been resolved and need a separate proposal addressing them, logged against CoreFX and in the recommended format (https://github.com/dotnet/corefx/issues/35768 tracks some of the issues raised).
The primary issue here is that PMOV* has both sign-extending and zero-extending versions. We need to ensure these take types and are exposed in a mechanism that is familiar to existing .NET users.
No.3 is meant to be covered by the 128-bit conversion and then a widening conversion to 256-bit via the ToVector256 or ToVector256Unsafe method.
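For example (a sketch; the variable names are illustrative, using the Vector128.CreateScalar helper from the 3.0 surface):

Vector256<ulong> zeroedUpper = Vector128.CreateScalar(0x12345678UL).ToVector256();       // upper 128 bits zeroed
Vector256<ulong> undefUpper  = Vector128.CreateScalar(0x12345678UL).ToVector256Unsafe(); // upper 128 bits left undefined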
No.4 and No.5 have been resolved.
For No.6, No.7, and No.8, we aren't currently looking at providing helper methods like these.
- PMOVZX/SX... xmm, [m] - these load from [m] and extend at once, a nice fusion. Esp. note the 2x 8-bit version.
Regarding No.6 - my point is that it's not a helper method, but rather a separate operation which loads values and extends them to 16/32/64 bits. This can currently be replaced with several ops using typecasting, for example:
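(a sketch with today's method names; LoadWiden8 is a made-up name)

static unsafe Vector128<short> LoadWiden8(byte* p) =>
    Sse41.ConvertToVector128Int16(Sse2.LoadScalarVector128((long*)p).AsByte()); // MOVQ + PMOVZXBW instead of one PMOVZXBW xmm, [m64]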
My use cases for the 8-bit version are decoding streams of compressed bytes.
my point is it's not helper method, but rather a separate operation which loads values and extends them to 16/32/64 bits.
Might be misunderstanding, but this isn't a singular hardware instruction; so it would be classified as a helper (it is implemented in terms of the actual intrinsics) rather than being an actual hardware intrinsic itself.
Given that it isn't a singular hardware instruction, and it isn't considered one of the "core" operations (which is basically just creating a vector and accessing individual elements), it likely wouldn't be considered at this point (users should be able to provide their own implementation in the interim).
a separate operation which loads values and extends them to 16/32/64 bits
That should be just the xmm/mem encoded versions of PMOV[ZS]X[BWD][WDQ]. These are already handled by containment, but the correct mem overloads are addressed in https://github.com/dotnet/corefx/issues/35768
Might be misunderstanding, but this isn't a singular hardware instruction; so it would be classified as a helper (it is implemented in terms of the actual intrinsics) rather than being an actual hardware intrinsic itself. Given that it isn't a singular hardware instruction, and it isn't considered one of the "core" operations (which is basically just creating a vector and accessing individual elements), it likely wouldn't be considered at this point (users should be able to provide their own implementation in the interim).
Here you have it (sorry, found no better way for now):
https://gcc.godbolt.org/z/TglHbD
Also, check the 3rd form from the top: https://www.felixcloutier.com/x86/pmovzx
66 0f 38 32 /r --- PMOVZXBQ xmm1, xmm2/m16 --- Zero extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1.
I mean, IT IS a singular hardware instruction.
PMOVZXBQ is covered by ConvertToVector128Int64, the encoding of the memory operand is being tracked by https://github.com/dotnet/corefx/issues/35768.
I confirm - dotnet/corefx#35768 covers my understanding expressed in item No.6.
It's nice we will have these APIs implemented :-)
@tannergooding Can this be closed in favor of other issues? If there is remaining work here, could you open separate issues to make it very clear what work remains for 3.0?
Yes, I think this could be closed as I believe all issues are either resolved or tracked by other existing issues.
@voinokin, feel free to clarify if you don't believe that is the case.
I know this issue is officially closed, and I'm late to the show, but I am a bit confused by the current state of preview5:
All issues seem to be resolved and all PRs merged, yet PMOVZXB{D,Q} and others do not seem to be generated, and the current master branch shows this unwelcoming comment:
Which seems to imply it isn't really supported at this stage...
Is it this part that's confusing: "The native signature does not exist."?
If so, that just means that there's no corresponding native (C++) intrinsic. You'll notice that for other intrinsics the equivalent C++ intrinsic is shown in addition to the target instruction; for example, a little further down we have:
/// <summary>
/// __m128i _mm256_extracti128_si256 (__m256i a, const int imm8)
/// VEXTRACTI128 xmm, ymm, imm8
/// </summary>
public new static Vector128<sbyte> ExtractVector128(Vector256<sbyte> value, byte index) => ExtractVector128(value, index);
The second line is the native (C++) intrinsic.