UnpackLow() (which is a different operation), or wrap the existing signed upconversion with StaticCast&lt;signed, unsigned&gt;(), which may introduce overhead (see dotnet/runtime#10357). The methods being suggested are:
- Sse41.ConvertToVector128UInt16(Vector128(byte)) for [V]PMOVZXBW xmm, xmm (same insn as for existing ConvertToVector128Int16())
- Sse41.ConvertToVector128UInt32(Vector128(byte/ushort)) for [V]PMOVZXBD/WD xmm, xmm (same insn as for existing ConvertToVector128Int32())
- Sse41.ConvertToVector128UInt64(Vector128(byte/ushort/uint)) for [V]PMOVZXBQ/WQ/DQ xmm, xmm (same insn as for existing ConvertToVector128Int64())
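For reference, the cast-wrapping workaround looks roughly like this (a sketch only; WidenToUInt16 is a made-up name, and the As* reinterpret helpers that eventually replaced StaticCast are used here):

// using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
static Vector128<ushort> WidenToUInt16(Vector128<byte> value) =>
    Sse41.ConvertToVector128Int16(value).AsUInt16(); // one [V]PMOVZXBW plus a bitwise reinterpret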
Similarly, no direct means are exposed in the API for unsigned int -> signed int upconversion when the source is in a 128-bit vector and the destination is in a 256-bit vector. The methods being suggested are:
- Avx2.ConvertToVector256Int16(Vector128(byte)) for VPMOVZXBW ymm, xmm (same insn as for existing ConvertToVector256UInt16())
- Avx2.ConvertToVector256Int32(Vector128(byte/ushort)) for VPMOVZXBD/WD ymm, xmm (same insn as for existing ConvertToVector256UInt32())
- Avx2.ConvertToVector256Int64(Vector128(byte/ushort/uint)) for VPMOVZXBQ/WQ/DQ ymm, xmm (same insn as for existing ConvertToVector256UInt64())

I was not able to find a not-too-verbose method to convert a 32/64-bit scalar value to a 256-bit vector in a YMM reg. It is possible to set a 128-bit vector with Sse2.ConvertScalarToVector128UInt32/64(), which produces (MOV r32/r64, imm + MOVD/MOVQ xmm, r32/r64), but then a MOVDQA XMMd, XMMs is automatically issued when one attempts to use the helper method Avx.ExtendToVector256() to get the 256-bit vector. To my understanding, the helper method was intended to act as a type conversion and produce a no-op in such cases, since MOVD/MOVQ X/YMM, r32/r64 zeroes the upper portion of the dest reg. Below is an example of the issue I'm trying to explain:
var v = Avx.ExtendToVector256(Sse2.ConvertScalarToVector128UInt64(0x12345678UL));
00007FF989272625 mov ecx,12345678h
00007FF98927262A vmovq xmm0,rcx
00007FF98927262F vmovdqa xmm6,xmm0 <======= this is not required
OTOH, the following conversion in the reverse direction produces code that looks fine/optimal:
var v1 = Sse2.ConvertToUInt64(Avx.GetLowerHalf(Avx.SetZeroVector256<ulong>()));
00007FF989282618 vpxor ymm0,ymm0,ymm0
00007FF98928261D vmovq rsi,xmm0
There exists just one overload for Ssse3.AlignRight() that works on sbyte. I believe it makes sense to add overloads for other integer types, the same way as it was implemented for Sse2.ShiftRightLogical128BitLane() which is quite similar in operation. Otherwise developers will have to use type casting.

There exists just one overload for Ssse3.Shuffle() that works on sbyte. I believe it makes sense to add an overload that will also work on byte.

I believe these versions deserve their own overloads since conceptually they can be used on SSExx-only hardware which does not provide anything closer to implement "gather" and "scatter" operations (actually "scatter" only appears in AVX512, and anyway granularity is 32 or 64 bits IIRC); see the sketch after this list:
- PMOVZX/SX... xmm, [m] - these load from [m] and extend at once, a nice fusion. Esp. note the 2x 8-bit version.
- PEXTRB/D/W + EXTRACTPS [m], xmm, i - spill a single element from xmm to [m].
- PINSRB/D/W + INSERTPS xmm, [m], i - merge a single element from [m] into xmm. There is a special issue open on the API for INSERTPS (dotnet/runtime#10383).
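To illustrate the "gather" point - a sketch only (the Gather4 name is made up, SSE4.1 assumed; with the [m] encodings, each insert could take its element straight from memory):

static unsafe Vector128<int> Gather4(int* basePtr, int i0, int i1, int i2, int i3)
{
    var v = Sse2.ConvertScalarToVector128Int32(basePtr[i0]); // MOVD xmm, r32
    v = Sse41.Insert(v, basePtr[i1], 1);                     // PINSRD xmm, r32, 1
    v = Sse41.Insert(v, basePtr[i2], 2);
    v = Sse41.Insert(v, basePtr[i3], 3);
    return v;
}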
@CarolEidt @fiigii @tannergooding @4creators @eerhardt Should this be discussed in corefx first?
@RussKeldorph, some of this (such as the inefficient codegen) should be resolved by a PR I am currently working on.
As for the new APIs, it is probably worth discussion during the next HWIntrinsic design review (I sent an e-mail on this a couple days ago).
Some of the inefficient codegen was cleaned up with https://github.com/dotnet/coreclr/pull/18262
More of the remaining inefficient codegen will be cleaned up with https://github.com/dotnet/coreclr/pull/18297
After dotnet/coreclr#18297, there is a bit more work that involves special handling for various intrinsics, but we are getting closer.
@tannergooding, thanks for the good news!
I wonder whether there will be any way for me to test these changes independently before the first 2.2.0 preview becomes available?
@voinokin you can use the SDK daily builds with the daily Intrinsics packages from myget.
https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/dogfooding.md
Or you can build the runtime from source if you want to be able to do JitDumps and the like
https://github.com/dotnet/coreclr/blob/master/Documentation/building/windows-instructions.md
https://github.com/dotnet/coreclr/blob/master/Documentation/workflow/UsingYourBuild.md
@saucecontrol Thanks! I'll give it a try in some time.
Adding these:
There exists just one overload for Ssse3.AlignRight() that works on sbyte. I believe it makes sense to add overloads for other integer types, the same way as it was implemented for Sse2.ShiftRightLogical128BitLane() which is quite similar in operation. Otherwise developers will have to use type casting.
We will want to be careful with this one, since the instruction explicitly operates on byte values.
There exists just one overload for Ssse3.Shuffle() that works on sbyte. I believe it makes sense to add an overload that will also work on byte.
It definitely makes sense to ensure that both the signed and unsigned versions are exposed here.
@tannergooding
We will want to be careful with this one, since the instruction explicitly operates on byte values.
It is the same for ShiftRightLogical128BitLane() - that's my point. BTW, the method has turned out to be quite useful even without typecasting :-).
I suggest renaming the argument of AlignRight(), which is currently called mask (?), to something like numBytes - again, the same way as it's done for ShiftRightLogical128BitLane().
We will want to be careful with this one, since the instruction explicitly operates on byte values.
Really, all the masked byte-shuffle instructions work on the minimum element size that can be represented by the mask, but they're not necessarily most-often used on that size.
I have an example here that uses AlignRight on ulong values. Something went funny in the codegen with all the casts I had to do, though. Even though I have a managed helper for that here, I couldn't use it in that context without speed taking a dive. It would be nice if we could have managed helper overloads for cases like that, but we'd have to be assured that they boil down to the same instruction during JIT. I don't know whether such a managed helper should be part of the Intrinsics API or whether we should have to roll our own, but it'll be a common use case, no doubt.
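For reference, the cast-through helper was along these lines (a sketch; AlignRightUInt64 is a made-up name, and the As* methods stand in for the StaticCast of that era):

static Vector128<ulong> AlignRightUInt64(Vector128<ulong> left, Vector128<ulong> right, byte count) =>
    Ssse3.AlignRight(left.AsSByte(), right.AsSByte(), count).AsUInt64(); // still a single PALIGNR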
Something went funny in the codegen with all the casts I had to do, though. Even though I have a managed helper for that here, I couldn't use it in that context without speed taking a dive.
Try using the ref modifier on your method's parameters - sometimes that helps when the current JIT version inlines. My advice: check the disasm after that - I've seen a lot of it by now... And I've logged some strange findings here too.
Ah yes, thanks. I do that in a lot of cases but didn't try it on those tiny cast helpers.
A good bit of the "bad codegen" was because we didn't support the ins reg, [mem] encodings for a good number of the intrinsics.
The latest builds out of master should have much better codegen, and the last of the non-load/store intrinsics should support containment with https://github.com/dotnet/coreclr/pull/18349.
There is, of course, still some more work to be done, but hopefully you will see much better results.
(StaticCast itself does still need a change, but I will be working on that next).
Not sure whether this is the proper place to discuss API that is already defined (?)... Anyway, I stumbled across this with the names of the API methods that load and store things.
Some background:
- LoadScalarVectorNNN(), LoadVectorNNN() / LoadAlignedVectorNNN() / LoadDquVectorNNN() / LoadAlignedVectorNNNNonTemporal().
- Store() / StoreAligned() / StoreAlignedNonTemporal().

Here are my points:
- My feeling is that there could be better wording for ...Dqu... overloads.
- There is no word "Vector" in methods that do store entire vectors; it's similar to partial store operations.
- There are StoreNonTemporal() overloads which do not operate on vectors, taking a GP reg as input (MOVNTI [m], reg) - may it happen that these specific ones need some clearer names?
- There are LoadScalarVectorNNN(int/uint/long/ulong/float/double) overloads to load just one element according to the data type and then zero the remaining part of the vector. OTOH, the methods intended for the opposite operation have their names split: StoreScalar(float/double) (MOVSS/MOVSD [m], xmm), StoreLow(long/ulong) (MOVQ [m], xmm, which in fact stores one 64-bit element - call it a scalar), and I found no method exposed to store 32-bit ints as scalars (MOVD [m], xmm).

Adding something mentioned in https://github.com/dotnet/coreclr/issues/18300#issuecomment-394772776:
I believe these versions deserve their own overloads since conceptually they can be used on SSExx-only hardware which does not provide anything closer to implement "gather" and "scatter" operations (actually "scatter" only appears in AVX512, and anyway granularity is 32 or 64 bits IIRC):

- PMOVZX/SX... xmm, [m] - these load from [m] and extend at once, a nice fusion. Esp. note the 2x 8-bit version.
- PEXTRB/D/W + EXTRACTPS [m], xmm, i - spill a single element from xmm to [m].
- PINSRB/D/W + INSERTPS xmm, [m], i - merge a single element from [m] into xmm. There is a special issue open on the API for INSERTPS (dotnet/runtime#10383).

My feeling is that there could be better wording for ...Dqu... overloads.
Ah, actually I tried to find a better name for these guys, but I thought there is no single word that can explain the semantics very well, so I just followed the C++ names... Do you have suggestions?
There is no word "Vector" in methods that do store entire vectors; it's similar to partial store operations.
Store* intrinsics take source as a parameter, so the different vector length can be resolved by the overload system, which makes the API simpler.
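For example (a sketch - the names and pointers are illustrative):

static unsafe void StoreBoth(byte* p, Vector128<byte> v128, Vector256<byte> v256)
{
    Sse2.Store(p, v128);      // MOVDQU [m128], xmm - width resolved from the Vector128 source
    Avx.Store(p + 16, v256);  // VMOVDQU [m256], ymm - width resolved from the Vector256 source
}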
OTOH, the methods intended for opposite operations have names split: StoreScalar(float/double) (MOVSS/MOVSD [m], xmm), StoreLow(long/ulong) (MOVQ [m], xmm which in fact stores one 64-bit element call it scalar),
The Scalar suffix only makes sense with floating point types (float and double) because x86/x64 architectures execute floating point computation via SIMD (SSE2) units.
and I found no method exposed to store 32-bit ints as scalars (MOVD [m], xmm).
I believe these versions deserve their own overloads since conceptually they can be used on SSExx-only
In the current design, we are avoiding exposing "memory-access encoding" as much as possible, and we plan to generate these encodings via containment optimization (i.e., merging ins(load(address)) or store(address, ins(...)) into a single instruction). I think it also works for scalar type containment (i.e., folding a[i] = Sse2.ConvertToInt32(v) to MOVD [m], xmm).
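For example (a sketch, assuming the containment optimization kicks in as described):

static unsafe Vector128<short> AddFromMemory(Vector128<short> v, short* p) =>
    Sse2.Add(v, Sse2.LoadVector128(p)); // may be emitted as PADDW xmm, [mem], with no separate MOVDQU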
My feeling is that there could be better wording for ...Dqu... overloads.
Ah, actually I tried to find a better name for these guys, but I thought there is no single word that can explain the semantics very well, so I just followed the C++ names... Do you have suggestions?
For [V]LDDQU I suggest using LoadUnalignedVectorNNN - explicitly stating that the operation is intended as a special case of unaligned loads. (I believe it behaves just like MOVDQU on current CPUs, though.)
In the current design, we are avoiding exposing "memory-access encoding" as much as possible, and we plan to generate these encodings via containment optimization (i.e., merging ins(load(address)) or store(address, ins(...)) into a single instruction). I think it also works for scalar type containment (i.e., folding a[i] = Sse2.ConvertToInt32(v) to MOVD [m], xmm).
Given this will work, some unclear points still remain in the API:

- [V]PMOVZX/SXBQ xmm, [m16] - to my knowledge there is no single operation other than this to load-and-extend 2x 8-bit values into a vector. All other versions of [V]PMOVZX/SX... take at least 32 bits, which will be achievable through containment support with some typecasting + LoadVectorNNN() or ConvertScalarToVectorXXX(), but this specific version takes only 16 bits of input (see the sketch after this list). (VPBROADCASTW xmm/ymm, [m16] does also exist, but that is a different operation and requires AVX2.)
- MOVD/Q xmm, [m] - LoadScalarVector128(int/uint/long/ulong) already exist, so why not remove them in favor of Sse2.ConvertScalarToVector128[U]Int32/64(indir)? It looks asymmetrical for 32-bit ints for now - there is a direct load operation exposed, but not the corresponding direct store operation.
- INSERTPS xmm, [m32], i - related to dotnet/runtime#10383, supposedly fixed by dotnet/coreclr#17637. I can't see the final version, but if the existing overload taking just one scalar value is replaced with the overload taking a vector, then this instruction encoding will become unavailable (I had no understanding of containment being introduced when I logged dotnet/runtime#10383).
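A sketch of the 32-bit-input path mentioned in the first bullet (LoadWiden2 is a made-up name; note how it has to bounce through a GP reg instead of a single PMOVZXBQ xmm, [m16]):

static unsafe Vector128<long> LoadWiden2(byte* p) =>
    Sse41.ConvertToVector128Int64(
        Sse2.ConvertScalarToVector128UInt32(*(ushort*)p).AsByte()); // roughly MOVZX + MOVD + PMOVZXBQ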
For [V]LDDQU I suggest using LoadUnalignedVectorNNN

LoadUnaligned* is not enough to express the lddqu semantics. In x86 SIMD programming, unaligned is usually related to instructions like movups, movdqu, etc. So that may be confusing...
For [V]LDDQU I suggest using LoadUnalignedVectorNNN

LoadUnaligned* is not enough to express the lddqu semantics. In x86 SIMD programming, unaligned is usually related to instructions like movups, movdqu, etc. So that may be confusing...
Then the last remaining idea from me would be to extend the existing LoadVector128/256(type* ptr) overloads with an optional parameter, so that it would become LoadVector128/256(type* ptr, bool forceUnaligned = false). Looks a bit ugly though...
Perhaps LoadUnalignedSplit*. Though I tend to think that coming up with such fancy names for already established instructions/intrinsics does more harm than good. And LDDQU is kind of useless these days...
I think we've addressed some of this already. Could the original post be updated with anything still relevant or the issue be otherwise closed?
I think we've addressed some of this already
Right, I think we can close this issue and open a new issue for "folding store".
I think we've addressed some of this already. Could the original post be updated with anything still relevant or the issue be otherwise closed?
Tell me which issues remain and I will update the original post. Thanks.
@tannergooding can you help get this sorted out? Hoping there is no work left here.
cc @CarolEidt
No.1 and No.2 haven't been resolved and need a separate proposal addressing them, logged against CoreFX and in the recommended format (https://github.com/dotnet/corefx/issues/35768 tracks some of the issues raised).
The primary issue here is that PMOV* has both sign-extending and zero-extending versions. We need to ensure these take types and are exposed in a mechanism that is familiar to existing .NET users.
No.3 is meant to be covered by the 128-bit conversion and then a widening conversion to 256-bit via the ToVector256 or ToVector256Unsafe method.
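For example (a sketch; the variable names are illustrative, using the Vector128.CreateScalar helper from the 3.0 surface):

Vector256<ulong> zeroedUpper = Vector128.CreateScalar(0x12345678UL).ToVector256();       // upper 128 bits zeroed
Vector256<ulong> undefUpper  = Vector128.CreateScalar(0x12345678UL).ToVector256Unsafe(); // upper 128 bits left undefined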
No.4 and No.5 have been resolved.
For No.6, No.7, and No.8, we aren't currently looking at providing helper methods like these.
- PMOVZX/SX... xmm, [m] - these load from [m] and extend at once, a nice fusion. Esp. note the 2x 8-bit version.
Regarding No.6 - my point is that it's not a helper method, but rather a separate operation which loads values and extends them to 16/32/64 bits. This can currently be replaced with several ops using typecasting, for example:
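(a sketch with today's method names; LoadWiden8 is a made-up name)

static unsafe Vector128<short> LoadWiden8(byte* p) =>
    Sse41.ConvertToVector128Int16(Sse2.LoadScalarVector128((long*)p).AsByte()); // MOVQ + PMOVZXBW instead of one PMOVZXBW xmm, [m64]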
My use cases for the 8-bit version are decoding streams of compressed bytes.
my point is it's not helper method, but rather a separate operation which loads values and extends them to 16/32/64 bits.
Might be misunderstanding, but this isn't a singular hardware instruction; so it would be classified as a helper (it is implemented in terms of the actual intrinsics) rather than being an actual hardware intrinsic itself.
Given that it isn't a singular hardware instruction, and it isn't considered one of the "core" operations (which is basically just creating a vector and accessing individual elements), it likely wouldn't be considered at this point (users should be able to provide their own implementation in the interim).
a separate operation which loads values and extends them to 16/32/64 bits
That should be just the xmm/mem encoded versions of PMOV[ZS]X[BWD][WDQ]. These are already handled by containment, but the correct mem overloads are addressed in https://github.com/dotnet/corefx/issues/35768
Might be misunderstanding, but this isn't a singular hardware instruction; so it would be classified as a helper (it is implemented in terms of the actual intrinsics) rather than being an actual hardware intrinsic itself. Given that it isn't a singular hardware instruction, and it isn't considered one of the "core" operations (which is basically just creating a vector and accessing individual elements), it likely wouldn't be considered at this point (users should be able to provide their own implementation in the interim).
Here you have it (sorry, found no better way for now):
https://gcc.godbolt.org/z/TglHbD
Also, check the 3rd form from the top: https://www.felixcloutier.com/x86/pmovzx
66 0f 38 32 /r --- PMOVZXBQ xmm1, xmm2/m16 --- Zero extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1.
I mean, IT IS a singular hardware instruction.
PMOVZXBQ is covered by ConvertToVector128Int64, the encoding of the memory operand is being tracked by https://github.com/dotnet/corefx/issues/35768.
I confirm - dotnet/corefx#35768 covers my understanding expressed in item No.6.
It's nice we will have these APIs implemented :-)
@tannergooding Can this be closed in favor of other issues? If there is remaining work here, could you open separate issues to make it very clear what work remains for 3.0?
Yes, I think this could be closed as I believe all issues are either resolved or tracked by other existing issues.
@voinokin, feel free to clarify if you don't believe that is the case.
I know this issue is officially closed, and I'm late to the show, but I am a bit confused by the current state of preview5:
All issues seem to be resolved and all PRs merged, yet PMOVZXB{D,Q} and others do not seem to be generated, and the current master branch shows this unwelcoming comment:
Which seems to imply it isn't really supported at this stage...
Is it this part that's confusing: "The native signature does not exist."?
If so, that just means that there's no corresponding native (C++) intrinsic. You'll notice that for other intrinsics the equivalent C++ intrinsic is shown in addition to the target instruction; for example, a little further down we have:
/// <summary>
/// __m128i _mm256_extracti128_si256 (__m256i a, const int imm8)
/// VEXTRACTI128 xmm, ymm, imm8
/// </summary>
public new static Vector128<sbyte> ExtractVector128(Vector256<sbyte> value, byte index) => ExtractVector128(value, index);
The second line is the native (C++) intrinsic.