This is a proposal of the ideas discussed in issue dotnet/runtime#20147.
There are a few SIMD operations on System.Numerics.Vector<T>
missing that are required to implement vectorized versions of log
, exp
, sin
, and cos
.
There is a C implementation of these methods using AVX intrinsics here, which I have translated into a C# implemention using Vector<T>
here. In the implementation, I have added fake versions of the missing intrinsics.
@mellinoe suggested creating two separate issues for the missing methods (the other issue is dotnet/runtime#20509), so the purpose of this issue is to propose the addition of the following methods:
public static Vector<short> ShiftLeft(Vector<short> x, int n)
public static Vector<int> ShiftLeft(Vector<int> x, int n)
public static Vector<long> ShiftLeft(Vector<long> x, int n)
public static Vector<short> ShiftRight(Vector<short> x, int n)
public static Vector<int> ShiftRight(Vector<int> x, int n)
public static Vector<long> ShiftRight(Vector<long> x, int n)
which map directly to the following AVX2 intrinsics:
_mm256_slli_epi32
_mm256_slli_epi64
_mm256_srli_epi16
_mm256_srli_epi32
_mm256_srli_epi64
respectively.
@CarolEidt @sivarv
The main conceptual problem with these methods is that, at least if our past investigations are valid, it is not feasible for the JIT to produce good SIMD instructions for these in all cases, because the second parameter is not necessarily a constant. It can do so if the methods are called with a constant parameter, but this cannot be guaranteed in C#. There is a "pit of failure" so to speak, where the performance of the method drops off a cliff if used improperly. For example, consider the following:
```C#
result = ShiftLeft(v, 5); // Constant parameter
result = ShiftLeft(v, this.Settings.ShiftAmount); // Parameter from an indirect value
The second call cannot be properly optimized because `this.Settings.ShiftAmount` is not a compile-time constant. The first call CAN be, because `5` is a compile-time constant. There is no way to enforce or communicate that the second usage is degenerate / invalid and will be many times slower than the first usage.
The most obvious workaround for this: bake specific constants into the method signature, e.g. force them to be a part of the compile-time signature:
```C#
public static Vector<T> ShiftLeftOne(Vector<T> v);
public static Vector<T> ShiftLeftTwo(Vector<T> v);
public static Vector<T> ShiftLeftThree(Vector<T> v);
public static Vector<T> ShiftLeftFour(Vector<T> v);
...
CC @russellhadley
What about use of an enum?
```c#
enum ShiftAmount : byte/int { One = 1, Two, Three, ... };
public static Vector
// called like:
var result = ShiftLeft(v,ShiftAmount.One);
var result = ShiftLeft(v,ShiftAmount.Two);
var result = ShiftLeft(v,ShiftAmount.Three);
```
The problem is that the value cannot be guaranteed to be a compile-time constant. There is no way to guarantee that in C#, except to make the constant a part of the method signature (i.e. the name of the method). Enums don't help in that regard.
Of course, sorry. So the C# compiler seems to have the facility to identify if something is a compile time constant or not, as evidenced by error messages on optional parameters on functions.
Are there any other motivations for semantics to constrain an input parameter to be a compile time constant? Or would that definitely be a no-go?
Could including a Roslyn analyzer in the NuGet package be a solution?
That way, if you write ShiftLeft(v, notAConstant)
in C# or VB, you get a warning in VS.
Caveats:
dotnet build
, but those could probably be improved to support analyzers, if they don't already.Is NEON the only architecture where this needs to be a constant?
Out of interest: Why doesn't the JIT support immediate operands?
@svick Including a Roslyn analyzer is a potential option -- it's the only one we've thought of that could help here, TBH.
@jackmott I'm not very familiar with NEON. The discussion above certainly applies to SSE/AVX instruction families.
@mjmckp It's not that the JIT "doesn't support immediate operands". The problem is that, when writing C# code, there is no way to enforce that you call a function with an immediate operand; that just isn't a concept available in the language or type system. When the JIT encounters such a call, every option at that point is sub-optimal. The JIT could either:
Like I said above, I think the Roslyn analyzer is the best option that we've thought of. We have never shipped such an analyzer with a BCL package, though, so I'm not sure exactly what it would look like.
@mellinoe I guess what I meant was: Why can't the JIT look at each invocation of the method, and either 1) generate a fixed instruction which contains the immediate operand when the argument is a literal value, or, 2) generate a dynamic instruction which contains the runtime value of the argument? I'm sure there's a perfectly good reason why it can't, I'm just curious!
@mjmckp It should be able to, but the "dynamic instruction" version will be many times slower, in theory. If we hit that path, we've essentially "failed" to give you an intrinsic operation. We'd like to design something where it is obvious when you make that mistake, and how to easily fix it.
I think there is a solution to that problem - code has to talk to compiler (Jit):
```C#
[AttributeUsage(AttributeTargets.Parameter)]
internal class JitIntrinsicParamAttribute : Attribute
{
public JitIntrinsicParamAttribute(ParamType type) { Type = type; }
public IntrinsicParamType Type { get; private set; }
}
internal enum IntrinsicParamType
{
Unknown,
Immediate
}
public static Vector
```
but on the other end compiler has to listen and should understand above code - without Jit changes implementation would be very difficult.
Encountering above attribute compiler should implement rewriting transform which would ensure that shift operation receives argument as an immediate value disregard of what we pass to method invocation.
The above pattern can be further generalized to include other difficult to implement corner cases.
@4creators You can still pass a non-constant value to that method. The problem isn't that the JIT can't detect this case (it can), but that there is no way to gracefully recover at that point. Ideally, your code would not compile if you passed a non-constant parameter to the function, but C#/IL does not allow that in any way.
I think using an attribute is a good idea. I would mean that:
Though, what would happen if the compiler and the analyzer produced the same error? Is there some way to avoid producing two errors in that case?
@mellinoe My proposal essentially allows for non-constant values to be passed to intrinsic method which uses processor instruction requiring immediate operand.
Intel definition of one of the shift intrinsics is as follows:
_mm256_slli_epi16/32/64
Logical shift of word/doubleword/quadword elements to left according to specified number.
The corresponding Intel庐 AVX2 instruction is VPSLLW, VPSLLD, or VPSLLQ*.
VPSLLW 128 bit Intel instruction definition is as follows:
VPSLLW xmm1, xmm2, xmm3/m128 -> Shift words in xmm2 left by amount specified in xmm3/m128 while shifting in 0s.
VPSLLW xmm1, xmm2, imm8 -> Shift words in xmm2 left by imm8 while shifting in 0s.
VPSLLW 256 bit instruction definition is as follows:
VPSLLW ymm1, ymm2, xmm3/m128 -> Shift words in ymm2 left by amount specified in xmm3/m128 while shifting in 0s.
VPSLLW ymm1, ymm2, imm8 -> Shift words in ymm2 left by imm8 while shifting in 0s.
Using above information one can notice that shift instructions accept not only imm8 operand
which has to be a compile time constant as it can be a value of the xmm register
which is passed as operand to one of the forms of VPSLLW or m128 memory
location.
Going further the only significant gurantee which Jit compiler has to provide is that register stores value of parameter and not it's address. In my opinion (however I do not know RyuJit internals good enough to be any authority on that) it does allow to pass a non-constant value to that instruction and the compiler (RyuJit) must ensure during all compilation passes and in particular during register allocation, instruction selection and instruction scheduling that passed value is stored in appropriate register.
What is even more important based on available VPSLL, VPSRL we can add additional Vector<T>
shift functions overloads and operators as follows:
C#
public static Vector<T> ShiftLeft(Vector<T> x, Vector<U> n);
public static Vector<T> ShiftRight(Vector<T> x, Vector<U> n);
public static operator Vector<T> <<(Vector<T> x, int n);
public static operator Vector<T> <<(Vector<T> x, Vector<U> n);
public static operator Vector<T> >>(Vector<T> x, int n);
public static operator Vector<T> >>(Vector<T> x, Vector<U> n);
public static operator Vector<T> <<=(Vector<T> x, int n);
public static operator Vector<T> <<=(Vector<T> x, Vector<U> n);
public static operator Vector<T> >>=(Vector<T> x, int n);
public static operator Vector<T> >>=(Vector<T> x, Vector<U> n);
@svick I agree that attributes could be used during Roslyn compilation and by Roslyn analyzers but due to flexibility which we have on processor instruction level at least with x86 ISA this would not be required.
After some experiments I have arrived at the following solution for enforcing passing compile time constant to basic function form, however, this would require a change to Attribute syntax. Snippet which compiles is below, unfortunately current Attribute syntax is a bit problematic here bcs one can still pass a non-constant value to ConstByteAttribute constructor (it's impossible if we use attribute inside square brackets):
```C#
namespace System.Numerics.Experimental
{
[AttributeUsage(AttributeTargets.All, Inherited = false, AllowMultiple = false)]
public class ConstByteAttribute : Attribute
{
public ConstByteAttribute(byte value)
{
Value = value;
}
public byte Value
{
get;
private set;
}
}
public class Vector<T>
{
// API
public static Vector<T> ShiftLeft(Vector<T> x, ConstByteAttribute constShift)
{
throw new NotImplementedException();
}
// API usage
public static Vector<T> ShiftLeftOne(Vector<T> x)
{
return ShiftLeft(x, new ConstByteAttribute(1));
}
}
}
One obvious change to C# which would be helpful for the above code would be an ability to use type parametrized attributes -> generic attributes providing that we could put on them constraints in form of value types. It would be a modification of the proposal https://github.com/dotnet/csharplang/issues/569
```C#
[AttributeUsage(AttributeTargets.All, Inherited = false, AllowMultiple = false)]
public class ConstAttribute<T> : Attribute where T : byte, sbyte
{
public ConstAttribute(T value)
{
Value = value;
}
public T Value
{
get;
private set;
}
}
Other changes could include implicit or explicit conversion operators form T to ConstAttribute
and from
ConstAttribute to T
. Using this pattern we could get the following syntax of calling methods with compile time constants:
```C#
public class Vector
{
// API
public static Vector
{
throw new NotImplementedException();
}
// API usage
public static Vector<T> ShiftLeftOne(Vector<T> x)
{
return ShiftLeft(x, (Const)1);
}
}
```
Finally we have a new very useful language feature -> compile time constants (or perhaps even another class of constants -> scoped constants) which are enforced using type system (very similar to C++).
@4creators I don't really understand what you're proposing, nor how it would solve the issue at hand. Like you have said, there is no way to ensure that someone passes a constant value to the attribute constructor, so it is no better than just passing a value in directly. I may be missing the reason you are using an Attribute
type, though. You're not using it as an attribute at all, just a regular method parameter.
I also believe you have misunderstood the behavior of the shift instructions above, but I am not an expert in that particular instruction. The documentation just seems to indicate that the shift amount is specified by the 1-byte immediate operand.
Realistically, we will probably do the following:
@mellinoe It is misunderstending - in my first post I have used attribute to mark passed value to enable Jit support which as You have said is available:
The problem isn't that the JIT can't detect this case (it can)`
therefore attribute use is not necessary anymore for all cases where we do not have to use constants - I thought at that time of writing that it would mean always for shift intrinsics. I have not stated that clearly in the second post discussing processor instructions. This part was edited and set in bold.
What I wanted to stress in second post discussing processor instructions is that we do not have to use immediate value as an argument to shift instruction. In the x86 ISA there are almost always defined two other VPSLL, VPSRL instructions which use xmm register or directly memory location:
VPSLLW ymm1, ymm2, xmm3/m128 -> encoded as VEX.NDS.256.66.0F.WIG F1 /r
For this instruction variant operand in which one passes shift value is a xmm3
register or m128
memory location. The instruction is encoded at a binary level differently from instruction which uses immediate (compile time constant) value.
VPSLLW ymm1, ymm2, imm8 -> encoded as VEX.NDD.256.66.0F.WIG 71 /6 ib
Altogether AVX2 defines 18 256bit shift instructions on words, double words and quadwords out of which only 6 require immediate operand (constant), 6 can use xmm register and 6 can use memory location. AVX defines 18 128bit shift instructions with exactly same groups as AVX2. Exception to this schema is at the level of SSE2 instructions where there are only 10 128bit two operand instructions and only 4 may use xmm or m128 memory operands.
My third post, however, is an extension of argument from the first post on attribute usage but this time in a completely new context i.e. we can implement 32 Vector shift instrinsics without constants using available AVX2, AVX and SSE instructions but still there are 12 corner cases where we would need compile time constants. Actually I have noticed that we have couple of uncovered shift cases while counting x86 ISA instructions over last hour.
Uncovered cases - no hardware support - comprise processors which do not support AVX/AVX2 ISA and these are pre Sandy Bridge (no AVX) Intel processors plus lower classes of even currently available processors i.e. currently produced Intel Pentium G and lower (but not Intel Pentium D). When we go lower on product ladder than even currently produced low power processor would not support SSE.
You can still pass a non-constant value to that method. The problem isn't that the JIT can't detect this case (it can), but that there is no way to gracefully recover at that point. Ideally, your code would not compile if you passed a non-constant parameter to the function, but C#/IL does not allow that in any way.
Wouldn't the recovery option just be to use the IL implementation rather than the instrinsic?
Wouldn't the recovery option just be to use the IL implementation rather than the instrinsic?
Yes -- that is likely an order of magnitude slower.
Right, but it would still work. I think at some point, we have to acknowledge that Vector
I honestly don't see how this is any worse than the current state following the addition of the Narrow/Widen/Convert methods on Vector<T>
. We used to know whether an operation was intrinsic based on the value of Vector.IsHardwareAccelerated
It started out as:
If the hardware supports it and you're on RyuJit, it's intrinsic.
But now it's:
If the hardware supports it AND the RyuJit version you're on has the implementation, it's intrinsic.
Could be:
If the hardware supports it AND the RyuJit version you're on has the implementation AND you use it correctly, it's intrinsic.
And I think the 'using it correctly' restriction is kind of implicit in Vector operations anyway, because if you use them when they're not supported (for whatever reason), it's always much slower than a scalar implementation would be.
I'm in agreement with you -- the Vector API's are for advanced users, and we can only do so much hand-holding; it is ultimately up to the user to make sure they understand what they are doing.
Like I said above, and after discussing this a lot, we are leaning towards throwing an exception if a non-const value is given. I don't think that using the IL fallback path is a good idea. The JIT is able to detect when an invalid value is passed in and generate an exception instead. We will also look into a Roslyn analyzer and/or a Roslyn compiler feature that lets us specify these parameter constraints so that errors can be caught at compile time rather than runtime.
Oh yeah, that makes perfect sense. Since a non-immediate argument is never correct, throwing an exception is probably the right thing to do. I thought the hold-up was that throwing a runtime exception wasn't acceptable and that catching it at compile time wasn't possible.
What's wrong with non-immediate shift counts? From @4creators comment and Intel documentation it looks like all of these instructions have both int imm8
and __m128i count
variants (excluding the non-packed shifts and byte shifts that are not relevant here).
The discussion above is just about how to handle hardware instructions which require immediate operands. If there's an instruction which doesn't take an immediate operand (and there is in this case), then there is no problem. It's just something to keep in mind when looking at these sorts of things.
Yeah, I got mixed up there. I was thinking about shuffles because that's what I'm more interested in, and they were part of the conversation.
Having thought more about it, though, I'm not sure a runtime exception would be great for those cases where an immediate is needed for the intrinsic and a variable is passed to the method. It would be possible (however unlikely) for someone to write bad code, test it with the legacy JITs and have it work with the IL implementation, only to have it fail later in a RyuJit environment. Seems like the best option is still to fall back to the slow code for consistency.
If we include a Roslyn analyzer with the library (or land a built-in compiler feature), then the user would also have to actively ignore the build-time warning/error about misusing the API. I think I'm okay with throwing a runtime error in that case.
I have created a C# const keyword
extended usage proposal as a one of the possible solutions to enforcing compile time constant usage in some methods overloads. I think it's clean, intuitively easy to understand since it is well known keyword used in the very same way but in new context. Should be easy to implement as well.
I think I'm okay with throwing a runtime error in that case.
I do not think that it is a good idea to throw exceptions as performance warnings when the JIT is not able to perform given optimization. There are number of cases where the slow non-interpreted implementation is perfectly acceptable. For example, profilers and other similar diagnostic tools that instrument IL can modify the IL such that the JIT may not be able to detect the constant.
It is not unusual to see order of magnitude slow down in a performance sensitive methods when the expected optimization is not performed for them, nothing specific to SIMD. For example, MethodImplOptions.AggressiveInlining
may not be honored for various reasons. We do not throw exceptions and crash the program when that happens. Instead, we have ETW events that you can use to find out when the expected optimization did not happen.
@jkotas @mellinoe probably I am not understanding the root of the issue of immediate access instructions. But given that the latency for the immediate is 1 and the other case (register) is 3 in the worst case (on most relevant architectures at least), there is not such a big difference between one or the other. If the JIT can see the constant, fine, you issue the immediate instruction; if not, well you just issue the register one. The difference is for most practical purposes so small (given the alternative of not having it) that it feels like a non-issue for every AVX supported platform. With aggressive inlining set on the function itself, the JIT will constant propagate and catch most of the constant uses anyways. If you really really have an issue there, its because you know what you are doing anyways; and you can definitely fix it if possible.
Even in the cases where it is just not supported, the straightforward (naive) version is instruction translate to an extra table-jump + lots of instructions + a far jump back per call; definitely not the same as having support but its doable on non AVX[1|2] without issuing a call instruction (which we should avoid like the plague). And I bet there are probably better algorithms for it anyways.
I can understand the "pit of success" drive, but in the end the problem is an abstraction one. IMHO the general issue here is that probably we are not going to use this anyways in non AVX platforms (I know I wouldn't use it if I cannot know in advance). The Vector<T>
abstract definition feels just not good enough for a general purpose API given the broad support necessary and the weak ability to figure out if platforms abide to the performance needed constraints. Then it begs the question, why put resources into rubber stamp a weak implementation on an abstraction that is not playing nice with it, instead of tackle the most general issue once and for all? I know that asking the same question again and again gets old, but I still cannot see a path that leads to success that doesn't have that pre-requisite built into it. Needless to say I am here hoping that I find myself being wrong.
馃槅 Seems that our discussion is misplaced because of errors in Intel Architecture
Instruction Set Extensions Programming Reference edition August 2015 319433-023 which is the most recent reference published by Intel comprising SSE/AVX/AVX2 instruction set extensions. When compared to Intel Architecture Instruction Set Extensions Programming Reference edition February 2012 319433-012A all SSE shift instructions have equivalents accepting register/memory operands in 2012 instead of immediate operand. Link to 2012 edition is broken as one has to add .pdf extension to download file so many of us would use 2015 edition as the newest and most authoritative one and skip download of the old one from broken link. After verifying it with Intel 64 and IA-32 Architectures Software Developer鈥檚 Manual Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference, A-Z edition July 2017 325462-063US there are indeed 56 logical shift instructions with all having equivalents accepting besides imm8
operand xmm
or m128
operands (register and memory respectively).
@mellinoe @jkotas Perhaps it would be prudent to ask Intel for official position on that error putting behind the question MSFT weight to settle problem once for all as we have wasted quite a bit of time for non-issue.
Shift Logical Left Instructions
PSLLW xmm1, xmm2
PSLLW xmm1, m128
PSLLW xmm1, imm8
PSLLD xmm1, xmm2
PSLLD xmm1, m128
PSLLD xmm1, imm8
PSLLQ xmm1, xmm2
PSLLQ xmm1, m128
PSLLQ xmm1, imm8
VPSLLW xmm1, xmm2, xmm3
VPSLLW xmm1, xmm2, m128
VPSLLW xmm1, xmm2, imm8
VPSLLD xmm1, xmm2, xmm3
VPSLLD xmm1, xmm2, m128
VPSLLD xmm1, xmm2, imm8
VPSLLQ xmm1, xmm2, xmm3
VPSLLQ xmm1, xmm2, m128
VPSLLQ xmm1, xmm2, imm8
VPSLLW ymm1, ymm2, xmm3
VPSLLW ymm1, ymm2, m128
VPSLLW ymm1, ymm2, imm8
VPSLLD ymm1, ymm2, xmm3
VPSLLD ymm1, ymm2, m128
VPSLLD ymm1, ymm2, imm8
VPSLLQ ymm1, ymm2, xmm3
VPSLLQ ymm1, ymm2, m128
VPSLLQ ymm1, ymm2, imm8
Shift Logical Right Instructions
PSRLW xmm1, xmm2
PSRLW xmm1, m128
PSRLW xmm1, imm8
PSRLD xmm1, xmm2
PSRLD xmm1, m128
PSRLD xmm1, imm8
PSRLQ xmm1, xmm2
PSRLQ xmm1, m128
PSRLQ xmm1, imm8
VPSRLW xmm1, xmm2, xmm3
VPSRLW xmm1, xmm2, m128
VPSRLW xmm1, xmm2, imm8
VPSRLD xmm1, xmm2, xmm3
VPSRLD xmm1, xmm2, m128
VPSRLD xmm1, xmm2, imm8
VPSRLQ xmm1, xmm2, xmm3
VPSRLQ xmm1, xmm2, m128
VPSRLQ xmm1, xmm2, imm8
VPSRLW ymm1, ymm2, xmm3
VPSRLW ymm1, ymm2, m128
VPSRLW ymm1, ymm2, imm8
VPSRLD ymm1, ymm2, xmm3
VPSRLD ymm1, ymm2, m128
VPSRLD ymm1, ymm2, imm8
VPSRLQ ymm1, ymm2, xmm3
VPSRLQ ymm1, ymm2, m128
VPSRLQ ymm1, ymm2, imm8
I believe, for a compiler feature, adding a modreq isliteral
flag to the generated IL is the best route.
The runtime would still likely need to validate the parameter is a literal and throw if it is not (for manually implemented IL, or a compiler that ignores the modreq flag anyways).
There are number of cases where the slow non-interpreted implementation is perfectly acceptable. For example, profilers and other similar diagnostic tools that instrument IL can modify the IL such that the JIT may not be able to detect the constant.
@jkotas Thanks for bringing that up; we hadn't considered such a scenario in our discussions. In general, I don't really have a strong preference for one option or the other, because I am very hopeful that we will be able to rely on a compiler feature to make the difference more-or-less irrelevant.
Intel hardware intrinsic API proposal has been opened at dotnet/corefx#22940
It has been nearly 2 years. Any progress on ShiftRight
and ShiftLeft
? I would really love to this as part of Vector<T>
.
You dont need it anymore. You can implement them directly using Hardware Intrinsics. 3.0 will support the whole SSE, AVX, AVX2 standard. 2.1 already support a bunch.
This isn't much help to those constrained to use .NET framework, though.
Most helpful comment
If we include a Roslyn analyzer with the library (or land a built-in compiler feature), then the user would also have to actively ignore the build-time warning/error about misusing the API. I think I'm okay with throwing a runtime error in that case.