This is a concrete proposal of the ideas discussed in issue dotnet/runtime#14521. Converting data from one form to another is a necessary part of vector algorithms in many areas such as image processing, signal processing, text manipulation, and more. Current SIMD instruction sets expose a lot of functionality for quickly and efficiently converting large chunks of data, but the current interface of Vector<T> does not have any support for it. Allowing access to these intrinsics could greatly speed up common algorithms involving data conversion.
All of the additions are on the static Vector class and operate on parameters of type Vector&lt;T&gt;. The additions are systematic and straightforward: Widen and Narrow are provided for type pairs where one element type is half or double the size of the other, and Convert is provided for same-sized integral/floating-point type pairs.
```C#
public static partial class Vector
{
public static void Widen(Vector<byte> source, out Vector<ushort> dest1, out Vector<ushort> dest2);
public static void Widen(Vector<ushort> source, out Vector<uint> dest1, out Vector<uint> dest2);
public static void Widen(Vector<uint> source, out Vector<ulong> dest1, out Vector<ulong> dest2);
public static void Widen(Vector<sbyte> source, out Vector<short> dest1, out Vector<short> dest2);
public static void Widen(Vector<short> source, out Vector<int> dest1, out Vector<int> dest2);
public static void Widen(Vector<int> source, out Vector<long> dest1, out Vector<long> dest2);
public static void Widen(Vector<float> source, out Vector<double> dest1, out Vector<double> dest2);
public static Vector<byte> Narrow(Vector<ushort> source1, Vector<ushort> source2);
public static Vector<ushort> Narrow(Vector<uint> source1, Vector<uint> source2);
public static Vector<uint> Narrow(Vector<ulong> source1, Vector<ulong> source2);
public static Vector<sbyte> Narrow(Vector<short> source1, Vector<short> source2);
public static Vector<short> Narrow(Vector<int> source1, Vector<int> source2);
public static Vector<int> Narrow(Vector<long> source1, Vector<long> source2);
public static Vector<float> Narrow(Vector<double> source1, Vector<double> source2);
public static Vector<float> ConvertToSingle(Vector<int> value);
public static Vector<float> ConvertToSingle(Vector<uint> value);
public static Vector<double> ConvertToDouble(Vector<long> value);
public static Vector<double> ConvertToDouble(Vector<ulong> value);
public static Vector<int> ConvertToInt32(Vector<float> value);
public static Vector<uint> ConvertToUInt32(Vector<float> value);
public static Vector<long> ConvertToInt64(Vector<double> value);
public static Vector<ulong> ConvertToUInt64(Vector<double> value);
}
```
# Semantics
The semantics are simple and uniform. I will use a representative set of methods to explain.
```C#
Widen(Vector<byte> source, out Vector<ushort> dest1, out Vector<ushort> dest2)
```
An input `Vector<byte>` called `source` is given. When the method completes, `dest1` contains the lower half of the elements in `source`, and `dest2` contains the upper half. The elements are converted as if they were individually cast from `byte` to `ushort`.
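To make the element placement concrete, here is an illustrative scalar sketch of the same semantics. `WidenScalar` is a hypothetical helper, not part of the proposal; the two output arrays stand in for the two output vectors.

```C#
// Scalar equivalent of Widen(Vector<byte>, out Vector<ushort>, out Vector<ushort>):
// dest1 receives the widened lower half of source, dest2 the widened upper half.
static void WidenScalar(byte[] source, ushort[] dest1, ushort[] dest2)
{
    int half = source.Length / 2;
    for (int i = 0; i < half; i++)
        dest1[i] = (ushort)source[i];         // lower half of source
    for (int i = 0; i < half; i++)
        dest2[i] = (ushort)source[half + i];  // upper half of source
}
```

With 16-byte vectors, `source` would have 16 elements and each output 8.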
```C#
Vector<byte> Narrow(Vector<ushort> source1, Vector<ushort> source2)
```
Two input `Vector<ushort>`s are given. The method returns a single `Vector<byte>`. The return value's lower elements are the elements from `source1`, and the upper elements are from `source2`. The elements are converted as if they were individually cast from `ushort` to `byte`.
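Again as an illustrative scalar sketch (the hypothetical `NarrowScalar` helper mirrors the vector semantics; note that the as-if-cast semantics truncate rather than saturate):

```C#
// Scalar equivalent of Narrow(Vector<ushort>, Vector<ushort>) -> Vector<byte>:
// lower result elements come from source1, upper ones from source2, each
// converted as if individually cast from ushort to byte (truncating).
static byte[] NarrowScalar(ushort[] source1, ushort[] source2)
{
    var result = new byte[source1.Length + source2.Length];
    for (int i = 0; i < source1.Length; i++)
        result[i] = (byte)source1[i];                   // lower elements
    for (int i = 0; i < source2.Length; i++)
        result[source1.Length + i] = (byte)source2[i];  // upper elements
    return result;
}
```

For example, `(byte)0x0134` yields `0x34`: out-of-range values wrap rather than clamp.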
```C#
Vector<float> ConvertToSingle(Vector<int> value)
```
A single `Vector<int>` is given. The method returns a single `Vector<float>` containing the input's elements. The elements are converted as if they were individually cast from `int` to `float`.
Currently, there is no way to efficiently convert elements residing in Vector&lt;T&gt; to a different data type. Directly converting the vector's source data (in a serial manner) is currently the most efficient approach available, but it requires your algorithm to break out of "vector mode" just to do simple conversion operations, and it is many times slower than we could achieve with proper vector conversion support. Many common algorithms require conversions from one data type to another, so this is currently a large hole in the API.
@sivarv @CarolEidt
Yes please 😄
Proposal looks like ready for review, we should tackle that tomorrow.
Note from design discussion:
We should rename `source1` and `dest1` to `low`, and `source2` and `dest2` to `high`, as appropriate, in order to better document how the parameters are used.
We should leave `value` as-is; it's consistent with other methods on Vector.
Question... my SIMD cheat sheet says that unpack is available on SSE2; could there be a way to use these functions in coreclr (even if it's hardcoded to x64 and the original AMD64 instructions)?
@benaadams Could you explain more what you mean?
As I understand it, Vector can be used upstream of coreclr (e.g. corefx, et al.), but the types aren't available in coreclr itself. I assume there may be issues with IsHardwareAccelerated branch elimination and Vector sizing with NGen?
However, a place Widen would be very useful is in the ASCII and UTF8 Encodings for conversion to string, which are defined in coreclr rather than corefx. Narrow would be useful for the same conversions to byte[]: https://github.com/dotnet/coreclr/pull/9187
I see; you're asking whether Vector<T> can be used within System.Private.CoreLib for optimizations to lower-level core types. Right now, it cannot. You would need to implement such optimizations in a separate layer.
Thanks, @mellinoe. I had a follow-up question to the one I asked in the PR comments but figured it belongs here instead. I have to apologize for my ignorance as I'm not sure I fully understand how RyuJIT updates and the .NET Core and .NET Framework versioning work with the add-on package dependencies. Will the updated version of System.Numerics.Vectors have a runtime version requirement that keeps it from being used with an older version of the JIT? I'm just concerned about cases where Vector.IsHardwareAccelerated would be true but the installed JIT doesn't support the intrinsics for these new methods. Do we need an additional flag indicating whether these new methods are actually accelerated?
The Widen methods use out parameters to return multiple values. But in C# 7, it will become possible to use tuples (i.e. ValueTuple) for the same purpose.
From the API design perspective, would it be worthwhile to consider changing the API to that? For example, changing from:
```c#
public static void Widen(Vector<byte> source, out Vector<ushort> dest1, out Vector<ushort> dest2);
```
To:
```c#
public static (Vector<ushort> low, Vector<ushort> high) Widen(Vector<byte> source);
```
If the answer is yes, how hard would it be to make CoreCLR compile a call to such a method to the same machine code as the currently proposed out version? Or are there other issues with this approach (maybe the added dependency from System.Numerics.Vectors to System.ValueTuple)?
@saucecontrol You're correct in that we don't have a way to expose _which_ intrinsics are accelerated. It's a problem we've talked about in the past, but haven't come up with a good way to represent it in the API. In this case, in the next version of .NET Framework, RyuJIT will be updated with support for these methods.
@svick As far as I know, we are not intending to use tuples in any of our public APIs. In this case, I don't think it's a good idea. I can't really accurately say how hard it would be to ensure that the JIT generates the best code in this case.
@mellinoe I'm still not clear on the versioning and dependencies. I get that the new JIT will use the intrinsics with these new methods, but will the updated System.Numerics.Vectors have its dependencies configured such that it is _only_ available on newer runtimes that would be guaranteed to have the updated JIT?
Since hardware acceleration is the point of Vector<T> in the first place, I think it's extremely important that we be able to tell whether that acceleration is actually available. If that's done with the runtime versioning requirements so that it's all or nothing, then everything's cool. But if there can be scenarios where some operations are accelerated and some aren't, I can see that causing some serious problems.
Let's say I'm widening a byte[] to int[]. If the intrinsics are available, it may make sense to use a vectored widen from byte to ushort and then ushort to int, but if I end up running the fallback, that's significantly worse than if I wrote a direct conversion myself.
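For illustration, the two-step widening described above could be sketched as follows against the proposed API. This is a hypothetical helper, not code from the proposal; it assumes the byte→ushort and ushort→uint Widen overloads, and omits all loop, alignment, and tail handling.

```C#
using System.Numerics;

// Hypothetical sketch: widen one Vector<byte> to four Vector<uint>s by chaining
// the proposed Widen overloads (byte -> ushort -> uint; byte -> short -> int
// would be the analogous signed chain).
static void WidenByteToUInt32(Vector<byte> v,
    out Vector<uint> a, out Vector<uint> b, out Vector<uint> c, out Vector<uint> d)
{
    Vector.Widen(v, out Vector<ushort> lo, out Vector<ushort> hi);
    Vector.Widen(lo, out a, out b);
    Vector.Widen(hi, out c, out d);
}
```

If the non-accelerated fallback runs, each Widen degenerates to element-wise copies, which is exactly the scenario the comment above worries about.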
The problem will only get worse from here, so it's probably best to solve the problem now, before adding to the Vector<T> footprint.
Again, my apologies if I've simply misunderstood the way the versioning is done.
I agree with what you're saying. We need some sort of capability-querying API for Vector&lt;T&gt;; we just haven't been able to pinpoint a good design for it yet. Even without these new APIs, the feature would be good to have. We have a variety of compilers (x86/x64 RyuJIT, ARM/64 RyuJIT in the future, .NET Native, etc.) with different levels of support for different intrinsics, and it is impossible to understand what will be accelerated unless you know your runtime environment ahead of time. We should try to track it separately from this issue, I think.
These have been implemented in corefx. JIT work is still in progress.
This is a great start, however there are still a lot of SSE2 intrinsics missing that are required to port over the vectorised versions of sin, cos, exp and log written in C found here. The same implementations have also been written in terms of AVX intrinsics here.
As @jackmott mentioned in dotnet/runtime#14521, ideally we would have every SIMD intrinsic that Intel and ARM support added to the API.
There is definitely more we need to add and improve, but it's more complicated and nuanced than simply adding every intrinsic out there. We have quite a few issues filed regarding additional features, but we need help designing them, and we especially need more input from the folks interested in using them. Here are a few:
https://github.com/dotnet/corefx/issues/1168
https://github.com/dotnet/corefx/issues/992
https://github.com/dotnet/corefx/issues/1608
https://github.com/dotnet/corefx/issues/1010
We haven't heard anything regarding trigonometric or other transcendental functions, but we're aware it's a missing piece. It's likely that we could expose such intrinsics fairly directly, e.g. Vector.Sin(Vector<float>), etc. but we need a complete and coherent design before we'd move forward with that.
Just adding the 4 methods I mentioned from here should be relatively straightforward; the implementation is only about 700 lines.
Would you prefer a proposal that just adds the 4 corresponding Vector methods, or also includes exposing just the intrinsics required to implement them?
Do you happen to have an example of what a similar proposal looks like, so that I can see if it is within my capabilities?
The proposal at the top of the page is a good starting point. The main hurdle to get over is designing the public interface, which is distinct from actually implementing the JIT recognition and intrinsic codegen. At a bare minimum the proposal would need to cover that, as well as the other topics touched on in this document.
That said, proof-of-concept prototype code for the JIT is welcome, but not a prerequisite for having the idea itself approved. In the case of the proposal up above, we had already implemented a good chunk of the functionality in the JIT as a prototype several months back, so that part was skimmed over.
I've put an implementation of log, exp, sin, and cos for float using System.Numerics.Vector<T> up here, which is based on avx_mathfun.h from here.
To do this, I've added methods which mimic the functionality of the missing AVX intrinsics, including functions to compute floors, shift left and right, perform bitwise operations, and convert between float and int (these missing methods have been added in the proposal above). The corresponding implementations for double would require a bit more work, as it would be necessary to go back to the cephes source: the algorithms for single and double precision are similar, but different.
In the meantime, may I suggest adding the above intrinsics (plus Ceil for symmetry) to Vector<T>? That is:
```C#
public static Vector<float> Floor(Vector<float> x)
public static Vector<double> Floor(Vector<double> x)
public static Vector<float> Ceil(Vector<float> x)
public static Vector<double> Ceil(Vector<double> x)
public static Vector<T> ShiftLeft(Vector<T> x, int n)
public static Vector<T> ShiftRight(Vector<T> x, int n)
public static Vector<T> BitwiseXOR(Vector<T> x, Vector<T> y)
public static Vector<T> BitwiseAndNot(Vector<T> x, Vector<T> y)
```
BitwiseXOR already exists, though it's just called Xor (I assume because there is no confusion with a logical version).
Right, thanks Ben, I have updated the implementation to reflect this.
It turns out AndNot is also there, although the order of the arguments is flipped around compared to the corresponding AVX intrinsic! The documentation for this method really needs to be updated.
The updated list of proposed methods is:
```C#
public static Vector<float> Floor(Vector<float> x)
public static Vector<double> Floor(Vector<double> x)
public static Vector<float> Ceil(Vector<float> x)
public static Vector<double> Ceil(Vector<double> x)
public static Vector<T> ShiftLeft(Vector<T> x, int n)
public static Vector<T> ShiftRight(Vector<T> x, int n)
```
With the initial proposal here and mjmckp's additions, popular noise functions like Simplex and Perlin will be practical to implement with SIMD in C# and F#, which would be great!
@mjmckp Thanks for the detailed response and implementation. I think that Floor and Ceil are pretty simple and uncontroversial; I would suggest filing a new issue about those two functions so we can discuss them specifically.
ShiftLeft and ShiftRight have a more fundamental problem in that the CPU instruction for them has an immediate operand. Our discussions in the past have led us to believe that it is not feasible to expose a C# function with such a parameter and expect that the JIT would be able to understand and optimize it correctly. You should file a new issue about ShiftLeft and ShiftRight so we can discuss the options.
Question for possibly a future API proposal.
The description below is a hypothetical situation. The actual question is at the bottom.
Hypothetical situation 1
Suppose someone would like to use:
```C#
Array.ConvertWithTruncate(int[] source, int sourceStart, byte[] dest, int destStart, int count)
Array.ConvertWithSaturation(int[] source, int sourceStart, byte[] dest, int destStart, int count)
Array.ConvertAll(byte[] source, int sourceStart, int[] dest, int destStart, int count)
```
The method signatures are modeled after `Array.Copy` and `Array.ConvertAll`, which is the reason why I put these hypothetical methods in the `Array` class. It might be more suitable for the `Vector` class, given that `Vector` is a partial static class, which makes it extensible.
Furthermore it would be nice if the underlying implementation is hardware accelerated, without effort from the programmer. In other words, whenever the count is of sufficient size, at least some part of the array processing would be done in SIMD.
Hypothetical situation 2
An alternative proposal is to piggyback on `Array.ConvertAll`. (This assumes the `ConvertAll` method magically benefits from hardware acceleration. Lambdas won't work here.)

The actual question:
In each hypothetical situation, would it be beneficial to build such methods in C# on top of the Widen and Narrow methods, or would these be better proposed as new API additions?
I posted some simple demonstration code for using this capability, with commentary, on Stack Overflow. For experts who don't need to read that belabored treatise, here is the working and tested code, which implements "raw/dumb" (linguistically blithe) widening of a byte[] array to char[]:
```C#
using System.Numerics;
using System.Runtime.CompilerServices;

/// <summary>
/// 'Widen' each byte in 'bytes' to 16 bits with no consideration for encoding/character mapping
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe char[] WidenByteArray(byte[] bytes, int i, int c)
{
    var rgch = new char[c];
    if (c > 0)
        fixed (char* dst = rgch)
        fixed (byte* src = bytes)
            widen_bytes_simd(dst, src + i, c);
    return rgch;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static unsafe void widen_bytes_simd(char* dst, byte* src, int c)
{
    // Scalar prologue: advance until 'dst' is 16-byte aligned.
    for (; c > 0 && ((long)dst & 0xF) != 0; c--)
        *dst++ = (char)*src++;

    // Vector body: widen 16 bytes to 16 chars per iteration
    // (NB: offsets assume 16-byte vectors, i.e. Vector<byte>.Count == 16).
    for (; (c -= 0x10) >= 0; src += 0x10, dst += 0x10)
        Vector.Widen(Unsafe.AsRef<Vector<byte>>(src),
                     out Unsafe.AsRef<Vector<ushort>>(dst + 0),
                     out Unsafe.AsRef<Vector<ushort>>(dst + 8));

    // Scalar epilogue: handle the remaining tail.
    for (c += 0x10; c > 0; c--)
        *dst++ = (char)*src++;
}
```
Kudos to the x64 JIT team for the following impressive result which, in the realm of .NET, destroys all comers at this particular byte-interleaving task.
```
L_4223  mov rax,rbx
L_4226  movups xmm0,xmmword ptr [rax]    ; fetch 16 bytes
L_4229  mov rax,rdi
L_422C  lea rdx,[rdi+10h]
L_4230  movaps xmm2,xmm0
L_4233  pxor xmm1,xmm1
L_4237  punpcklbw xmm2,xmm1              ; interleave 8-to-16 bits (lo)
L_423B  movups xmmword ptr [rax],xmm2    ; store 8 bytes (lo) to 8 wide chars (16 bytes)
L_423E  pxor xmm1,xmm1
L_4242  punpckhbw xmm0,xmm1              ; interleave 8-to-16 bits (hi)
L_4246  movups xmmword ptr [rdx],xmm0    ; store 8 bytes (hi) to 8 wide chars (16 bytes)
L_4249  add rbx,10h
L_424D  add rdi,20h
L_4251  add esi,0FFFFFFF0h
L_4254  test esi,esi
L_4256  jge L_4223
L_4258  ...
```
Normalized performance results for this code on .NET Framework 4.7.2 are consistently outstanding, vs. several of the single-byte .NET encoders for all of the data sizes I examined (up to 2MB).
- SIMD code (shown above): 100%
- C# naive unsafe bytewise loop: 153.45%
- Encoding.UTF8: 161.62%
- Encoding.ASCII: 221.38%
- Encoding.Default (1252): 358.84%
Narrow isn't quite as exciting (https://github.com/dotnet/coreclr/issues/16474); if you can come up with better asm than is presented in the issue, please do!
@grabyourpitchforks
https://github.com/dotnet/corefx/issues/15957#issuecomment-407696940
_Narrow_ isn't quite as exciting...
@benaadams Didn't you mean to say _Widen_ (which I showed above) is less "exciting" (challenging/problematic) versus Narrow (dotnet/coreclr#16474)?
Widen is exciting for performance; and is used in Kestrel; Narrow isn't so great and isn't currently used in Kestrel (though it has benchmarks for it)