This is a concrete proposal of the ideas discussed in issue dotnet/runtime#14521. Converting data from one form to another is a necessary part of vector algorithms in many areas such as image processing, signal processing, text manipulation, and more. Current SIMD instruction sets expose a lot of functionality for quickly and efficiently converting large chunks of data, but the current interface of Vector<T> does not have any support for it. Allowing access to these intrinsics could greatly speed up common algorithms involving data conversion.
All of the additions are on the static Vector class and operate on parameters of type Vector&lt;T&gt;. The additions are systematic and straightforward: Widen and Narrow are provided for type pairs where one element type is half or double the size of the other, and Convert is provided for same-sized integral/floating-point type pairs.
```C#
public static partial class Vector
{
public static void Widen(Vector<byte> source, out Vector<ushort> dest1, out Vector<ushort> dest2);
public static void Widen(Vector<ushort> source, out Vector<uint> dest1, out Vector<uint> dest2);
public static void Widen(Vector<uint> source, out Vector<ulong> dest1, out Vector<ulong> dest2);
public static void Widen(Vector<sbyte> source, out Vector<short> dest1, out Vector<short> dest2);
public static void Widen(Vector<short> source, out Vector<int> dest1, out Vector<int> dest2);
public static void Widen(Vector<int> source, out Vector<long> dest1, out Vector<long> dest2);
public static void Widen(Vector<float> source, out Vector<double> dest1, out Vector<double> dest2);
public static Vector<byte> Narrow(Vector<ushort> source1, Vector<ushort> source2);
public static Vector<ushort> Narrow(Vector<uint> source1, Vector<uint> source2);
public static Vector<uint> Narrow(Vector<ulong> source1, Vector<ulong> source2);
public static Vector<sbyte> Narrow(Vector<short> source1, Vector<short> source2);
public static Vector<short> Narrow(Vector<int> source1, Vector<int> source2);
public static Vector<int> Narrow(Vector<long> source1, Vector<long> source2);
public static Vector<float> Narrow(Vector<double> source1, Vector<double> source2);
public static Vector<float> ConvertToSingle(Vector<int> value);
public static Vector<float> ConvertToSingle(Vector<uint> value);
public static Vector<double> ConvertToDouble(Vector<long> value);
public static Vector<double> ConvertToDouble(Vector<ulong> value);
public static Vector<int> ConvertToInt32(Vector<float> value);
public static Vector<uint> ConvertToUInt32(Vector<float> value);
public static Vector<long> ConvertToInt64(Vector<double> value);
public static Vector<ulong> ConvertToUInt64(Vector<double> value);
}
```
# Semantics
The semantics are simple and uniform. I will use a representative set of methods to explain.
```C#
Widen(Vector<byte> source, out Vector<ushort> dest1, out Vector<ushort> dest2)
```
An input `Vector<byte>` called `source` is given. When the method completes, `dest1` contains the lower half of the elements in `source`, and `dest2` contains the upper half. The elements are converted as if they were individually cast from `byte` to `ushort`.
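To make the element placement concrete, here is an illustrative scalar sketch of the same semantics. `WidenScalar` is a hypothetical helper, not part of the proposal; the two output arrays stand in for the two output vectors.

```C#
// Scalar equivalent of Widen(Vector<byte>, out Vector<ushort>, out Vector<ushort>):
// dest1 receives the widened lower half of source, dest2 the widened upper half.
static void WidenScalar(byte[] source, ushort[] dest1, ushort[] dest2)
{
    int half = source.Length / 2;
    for (int i = 0; i < half; i++)
        dest1[i] = (ushort)source[i];         // lower half of source
    for (int i = 0; i < half; i++)
        dest2[i] = (ushort)source[half + i];  // upper half of source
}
```

With 16-byte vectors, `source` would have 16 elements and each output 8.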
```C#
Vector<byte> Narrow(Vector<ushort> source1, Vector<ushort> source2)
```
Two input `Vector<ushort>`s are given. The method returns a single `Vector<byte>`. The return value's lower elements are the elements from `source1`, and the upper elements are from `source2`. The elements are converted as if they were individually cast from `ushort` to `byte`.
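Again as an illustrative scalar sketch (the hypothetical `NarrowScalar` helper mirrors the vector semantics; note that the as-if-cast semantics truncate rather than saturate):

```C#
// Scalar equivalent of Narrow(Vector<ushort>, Vector<ushort>) -> Vector<byte>:
// lower result elements come from source1, upper ones from source2, each
// converted as if individually cast from ushort to byte (truncating).
static byte[] NarrowScalar(ushort[] source1, ushort[] source2)
{
    var result = new byte[source1.Length + source2.Length];
    for (int i = 0; i < source1.Length; i++)
        result[i] = (byte)source1[i];                   // lower elements
    for (int i = 0; i < source2.Length; i++)
        result[source1.Length + i] = (byte)source2[i];  // upper elements
    return result;
}
```

For example, `(byte)0x0134` yields `0x34`: out-of-range values wrap rather than clamp.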
```C#
Vector<float> ConvertToSingle(Vector<int> value)
```
A single `Vector<int>` is given. The method returns a single `Vector<float>` containing the input's elements. The elements are converted as if they were individually cast from `int` to `float`.
Currently, there is no way to efficiently convert elements residing in Vector&lt;T&gt; to a different data type. Directly converting the vector's source data (in a serial manner) is currently the most efficient approach available, but it requires your algorithm to break out of "vector mode" just to do simple conversion operations, and it is many times slower than we could achieve with proper vector conversion support. Many common algorithms require conversions from one data type to another, so this is currently a large hole in the API.
@sivarv @CarolEidt
Yes please 😄
Proposal looks like ready for review, we should tackle that tomorrow.
Note from design discussion:
We should rename `source1` and `dest1` to `low`, and `source2` and `dest2` to `high`, as appropriate, in order to better document how the parameters are used.
We should leave `value` as-is; it's consistent with other methods on Vector.
Question... my SIMD cheat sheet says that unpack is available on SSE2; could there be a way to use these functions in coreclr (even if it's hardcoded to x64 and the original AMD64 instructions)?
@benaadams Could you explain more what you mean?
As I understand it, Vector can be used upstream of coreclr (e.g. corefx, et al.), but the types aren't available in coreclr itself. I assume there may be issues with IsHardwareAccelerated branch elimination and Vector sizing with NGen?
However, a place Widen would be very useful is in the ASCII and UTF8 Encodings for conversion to string, which are defined in coreclr rather than corefx. Narrow would be useful for the same conversions to byte[]: https://github.com/dotnet/coreclr/pull/9187
I see; you're asking whether Vector<T> can be used within System.Private.CoreLib for optimizations to lower-level core types. Right now, it cannot. You would need to implement such optimizations in a separate layer.
Thanks, @mellinoe. I had a follow-up question to the one I asked in the PR comments but figured it belongs here instead. I have to apologize for my ignorance as I'm not sure I fully understand how RyuJIT updates and the .NET Core and .NET Framework versioning work with the add-on package dependencies. Will the updated version of System.Numerics.Vectors have a runtime version requirement that keeps it from being used with an older version of the JIT? I'm just concerned about cases where Vector.IsHardwareAccelerated would be true but the installed JIT doesn't support the intrinsics for these new methods. Do we need an additional flag indicating whether these new methods are actually accelerated?
The Widen methods use out parameters to return multiple values. But in C# 7, it will become possible to use tuples (i.e. ValueTuple) for the same purpose.
From the API design perspective, would it be worthwhile to consider changing the API to that? For example, changing from:
```c#
public static void Widen(Vector<byte> source, out Vector<ushort> dest1, out Vector<ushort> dest2);
```
To:
```c#
public static (Vector<ushort> low, Vector<ushort> high) Widen(Vector<byte> source);
```
If the answer is yes, how hard would it be to make CoreCLR compile a call to such a method to the same machine code as the currently proposed out version? Or are there other issues with this approach (maybe the added dependency from System.Numerics.Vectors to System.ValueTuple)?
@saucecontrol You're correct in that we don't have a way to expose _which_ intrinsics are accelerated. It's a problem we've talked about in the past, but haven't come up with a good way to represent it in the API. In this case, in the next version of .NET Framework, RyuJIT will be updated with support for these methods.
@svick As far as I know, we are not intending to use tuples in any of our public APIs. In this case, I don't think it's a good idea. I can't really accurately say how hard it would be to ensure that the JIT generates the best code in this case.
@mellinoe I'm still not clear on the versioning and dependencies. I get that the new JIT will use the intrinsics with these new methods, but will the updated System.Numerics.Vectors have its dependencies configured such that it is _only_ available on newer runtimes that would be guaranteed to have the updated JIT?
Since hardware acceleration is the point of Vector<T> in the first place, I think it's extremely important that we be able to tell whether that acceleration is actually available. If that's done with the runtime versioning requirements so that it's all or nothing, then everything's cool. But if there can be scenarios where some operations are accelerated and some aren't, I can see that causing some serious problems.
Let's say I'm widening a byte[] to int[]. If the intrinsics are available, it may make sense to use a vectored widen from byte to ushort and then ushort to int, but if I end up running the fallback, that's significantly worse than if I wrote a direct conversion myself.
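For illustration, the two-step widening described above could be sketched as follows against the proposed API. This is a hypothetical helper, not code from the proposal; it assumes the byte→ushort and ushort→uint Widen overloads, and omits all loop, alignment, and tail handling.

```C#
using System.Numerics;

// Hypothetical sketch: widen one Vector<byte> to four Vector<uint>s by chaining
// the proposed Widen overloads (byte -> ushort -> uint; byte -> short -> int
// would be the analogous signed chain).
static void WidenByteToUInt32(Vector<byte> v,
    out Vector<uint> a, out Vector<uint> b, out Vector<uint> c, out Vector<uint> d)
{
    Vector.Widen(v, out Vector<ushort> lo, out Vector<ushort> hi);
    Vector.Widen(lo, out a, out b);
    Vector.Widen(hi, out c, out d);
}
```

If the non-accelerated fallback runs, each Widen degenerates to element-wise copies, which is exactly the scenario the comment above worries about.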
The problem will only get worse from here, so it's probably best to solve the problem now, before adding to the Vector<T> footprint.
Again, my apologies if I've simply misunderstood the way the versioning is done.
I agree with what you're saying. We need some sort of capability-querying API for Vector&lt;T&gt;; we just haven't been able to pinpoint a good design for it yet. Even without these new APIs, the feature would be good to have. We have a variety of compilers (x86/x64 RyuJIT, ARM/64 RyuJIT in the future, .NET Native, etc.) with different levels of support for different intrinsics, and it is impossible to understand what will be accelerated unless you know your runtime environment ahead of time. We should try to track it separately from this issue, I think.
These have been implemented in corefx. JIT work is still in progress.
This is a great start, however there are still a lot of SSE2 intrinsics missing that are required to port over the vectorised versions of sin, cos, exp and log written in C found here. The same implementations have also been written in terms of AVX intrinsics here.
As @jackmott mentioned in dotnet/runtime#14521, ideally we would have every SIMD intrinsic that Intel and ARM support added to the API.
There is definitely more we need to add and improve, but it's more complicated and nuanced than simply adding every intrinsic out there. We have quite a few issues filed regarding additional features, but we need help designing them, and we especially need more input from the folks interested in using them. Here are a few:
https://github.com/dotnet/corefx/issues/1168
https://github.com/dotnet/corefx/issues/992
https://github.com/dotnet/corefx/issues/1608
https://github.com/dotnet/corefx/issues/1010
We haven't heard anything regarding trigonometric or other transcendental functions, but we're aware it's a missing piece. It's likely that we could expose such intrinsics fairly directly, e.g. Vector.Sin(Vector<float>), etc. but we need a complete and coherent design before we'd move forward with that.
Just adding the 4 methods I mentioned from here should be relatively straightforward; the implementation is only about 700 lines.
Would you prefer a proposal that just adds the 4 corresponding Vector methods, or also includes exposing just the intrinsics required to implement them?
Do you happen to have an example of what a similar proposal looks like, so that I can see if it is within my capabilities?
The proposal at the top of the page is a good starting point. The main hurdle to get over is designing the public interface, which is distinct from actually implementing the JIT recognition and intrinsic codegen. At a bare minimum the proposal would need to cover that, as well as the other topics touched on in this document.
That said, proof-of-concept prototype code for the JIT is welcome, but not a prerequisite for having the idea itself approved. In the case of the proposal up above, we had already implemented a good chunk of the functionality in the JIT as a prototype several months back, so that part was skimmed over.
I've put an implementation of log, exp, sin, and cos for float using System.Numerics.Vector<T> up here, which is based on avx_mathfun.h from here.
To do this, I've added methods which mimic the functionality of the missing AVX intrinsics, including functions to compute floors, shift left and right, perform bitwise operations, and convert between float and int (these missing methods have been added in the proposal above). The corresponding implementations for double would require a bit more work, as it would be necessary to go back to the cephes source: the algorithms for single and double precision are similar, but different.
In the meantime, may I suggest adding the above intrinsics (plus Ceil for symmetry) to Vector<T>? That is:
```C#
public static Vector<float> Floor(Vector<float> x)
public static Vector<double> Floor(Vector<double> x)
public static Vector<float> Ceil(Vector<float> x)
public static Vector<double> Ceil(Vector<double> x)
public static Vector<T> ShiftLeft(Vector<T> x, int n)
public static Vector<T> ShiftRight(Vector<T> x, int n)
public static Vector<T> BitwiseXOR(Vector<T> x, Vector<T> y)
public static Vector<T> BitwiseAndNot(Vector<T> x, Vector<T> y)
```
BitwiseXOR already exists, though it's just called Xor (I assume because there is no confusion with a logical version).
Right, thanks Ben, I have updated the implementation to reflect this.
It turns out AndNot is also there, although the order of the arguments is flipped around compared to the corresponding AVX intrinsic! The documentation for this method really needs to be updated.
The updated list of proposed methods is:
```C#
public static Vector<float> Floor(Vector<float> x)
public static Vector<double> Floor(Vector<double> x)
public static Vector<float> Ceil(Vector<float> x)
public static Vector<double> Ceil(Vector<double> x)
public static Vector<T> ShiftLeft(Vector<T> x, int n)
public static Vector<T> ShiftRight(Vector<T> x, int n)
```
With the initial proposal here and mjmckp's additions, popular noise functions like Simplex and Perlin will be practical to implement with SIMD in C# and F#, which would be great!
@mjmckp Thanks for the detailed response and implementation. I think that Floor and Ceil are pretty simple and uncontroversial; I would suggest filing a new issue about those two functions so we can discuss them specifically.
ShiftLeft and ShiftRight have a more fundamental problem in that the CPU instruction for them has an immediate operand. Our discussions in the past have led us to believe that it is not feasible to expose a C# function with such a parameter and expect that the JIT would be able to understand and optimize it correctly. You should file a new issue about ShiftLeft and ShiftRight so we can discuss the options.
Question for possibly a future API proposal.
The description below is a hypothetical situation. The actual question is at the bottom.
Hypothetical situation 1
Suppose someone would like to use:
```C#
Array.ConvertWithTruncate(int[] source, int sourceStart, byte[] dest, int destStart, int count)
Array.ConvertWithSaturation(int[] source, int sourceStart, byte[] dest, int destStart, int count)
Array.ConvertAll(byte[] source, int sourceStart, int[] dest, int destStart, int count)
```
The method signatures are modeled after `Array.Copy` and `Array.ConvertAll`, which is the reason why I put these hypothetical methods in the `Array` class. It might be more suitable for the `Vector` class, given that `Vector` is a partial static class, which makes it extensible.
Furthermore it would be nice if the underlying implementation is hardware accelerated, without effort from the programmer. In other words, whenever the count is of sufficient size, at least some part of the array processing would be done in SIMD.
Hypothetical situation 2
An alternative proposal is to piggyback on `Array.ConvertAll`. (This assumes the `ConvertAll` method magically benefits from hardware acceleration. Lambdas won't work here.)

The actual question:
In each hypothetical situation, would it be beneficial to build such methods in C# on top of the Widen and Narrow methods, or would these be better proposed as new API additions?
I posted some simple demonstration code for using this capability, with commentary, on Stack Overflow. For experts who don't need to read that belabored treatise, here is the working and tested code, which implements "raw/dumb" (linguistically blithe) widening of a byte[] array to char[]:
```C#
using System.Numerics;
using System.Runtime.CompilerServices;

/// <summary>
/// 'Widen' each byte in 'bytes' to 16 bits with no consideration for encoding/character mapping
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe char[] WidenByteArray(byte[] bytes, int i, int c)
{
    var rgch = new char[c];
    if (c > 0)
        fixed (char* dst = rgch)
        fixed (byte* src = bytes)
            widen_bytes_simd(dst, src + i, c);
    return rgch;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static unsafe void widen_bytes_simd(char* dst, byte* src, int c)
{
    // Scalar prologue: advance until 'dst' is 16-byte aligned.
    for (; c > 0 && ((long)dst & 0xF) != 0; c--)
        *dst++ = (char)*src++;

    // Vector body: widen 16 bytes to 16 chars per iteration
    // (NB: offsets assume 16-byte vectors, i.e. Vector<byte>.Count == 16).
    for (; (c -= 0x10) >= 0; src += 0x10, dst += 0x10)
        Vector.Widen(Unsafe.AsRef<Vector<byte>>(src),
                     out Unsafe.AsRef<Vector<ushort>>(dst + 0),
                     out Unsafe.AsRef<Vector<ushort>>(dst + 8));

    // Scalar epilogue: handle the remaining tail.
    for (c += 0x10; c > 0; c--)
        *dst++ = (char)*src++;
}
```
Kudos to the x64 JIT team for the following impressive result which, in the realm of .NET, destroys all comers at this particular byte-interleaving task.
```
L_4223  mov rax,rbx
L_4226  movups xmm0,xmmword ptr [rax]    ; fetch 16 bytes
L_4229  mov rax,rdi
L_422C  lea rdx,[rdi+10h]
L_4230  movaps xmm2,xmm0
L_4233  pxor xmm1,xmm1
L_4237  punpcklbw xmm2,xmm1              ; interleave 8-to-16 bits (lo)
L_423B  movups xmmword ptr [rax],xmm2    ; store 8 bytes (lo) to 8 wide chars (16 bytes)
L_423E  pxor xmm1,xmm1
L_4242  punpckhbw xmm0,xmm1              ; interleave 8-to-16 bits (hi)
L_4246  movups xmmword ptr [rdx],xmm0    ; store 8 bytes (hi) to 8 wide chars (16 bytes)
L_4249  add rbx,10h
L_424D  add rdi,20h
L_4251  add esi,0FFFFFFF0h
L_4254  test esi,esi
L_4256  jge L_4223
L_4258  ...
```
Normalized performance results for this code on .NET Framework 4.7.2 are consistently outstanding, vs. several of the single-byte .NET encoders for all of the data sizes I examined (up to 2MB).
- SIMD code (shown above): 100%
- C# naive unsafe bytewise loop: 153.45%
- Encoding.UTF8: 161.62%
- Encoding.ASCII: 221.38%
- Encoding.Default (1252): 358.84%
Narrow isn't quite as exciting (https://github.com/dotnet/coreclr/issues/16474); if you can come up with better asm than is presented in the issue, please do!
@grabyourpitchforks
https://github.com/dotnet/corefx/issues/15957#issuecomment-407696940
_Narrow_ isn't quite as exciting...
@benaadams Didn't you mean to say _Widen_ (which I showed above) is less "exciting" (challenging/problematic) versus Narrow (dotnet/coreclr#16474)?
Widen is exciting for performance; and is used in Kestrel; Narrow isn't so great and isn't currently used in Kestrel (though it has benchmarks for it)