This proposal extends dotnet/runtime#23057 to additionally cover intrinsics for the scalar forms of the x86 hardware instructions.
Currently, CoreFX implicitly emits various SSE/SSE2 instructions when performing operations on scalar floating-point types. However, there is no way to explicitly emit specific optimized code sequences that best take advantage of the underlying hardware.
For example, the System.Math and System.MathF APIs are currently implemented as FCALLs to the underlying C Runtime implementation of the corresponding method. The underlying C Runtime implementations are most frequently implemented as hand-optimized assembly or C/C++ intrinsics to ensure that they provide the best throughput possible and that they take advantage of newer instruction sets when available (cos/cosf, for example, frequently have one code path for SSE and one for FMA3).
Because these methods depend on the underlying C Runtime implementation, their exact behavior and performance characteristics can vary from platform to platform.
By providing scalar intrinsics for the Intel hardware instructions, it becomes much easier to implement these functions in managed code.
Furthermore, with the addition of dotnet/runtime#23057, it becomes more pertinent to also have these scalar intrinsics, to ensure that the codegen remains "optimal" when end-users intermix scalar and vectorized operations in their customized/hand-optimized algorithms.
The current design in dotnet/runtime#23057 creates a class per instruction set and exposes methods such as `Vector128<double> Sse2.Sqrt(Vector128<double>)`, which corresponds to the `__m128d _mm_sqrt_pd(__m128d)` C/C++ intrinsic.
This proposal would additionally extend the surface area to expose the scalar forms of all the instructions, using the same name as the corresponding vector intrinsic but with a Scalar suffix. The suffix is required to differentiate the vector and scalar APIs, since they take the same input types.
For example, we would expose:
```cs
public static class Sse2
{
    // __m128d _mm_add_sd(__m128d a, __m128d b)
    public static Vector128<double> AddScalar(Vector128<double> left, Vector128<double> right);

    // ...

    // No corresponding C/C++ intrinsic; used when the upper bits should be taken from `value`
    public static Vector128<double> SqrtScalar(Vector128<double> value);

    // __m128d _mm_sqrt_sd(__m128d a, __m128d b)
    public static Vector128<double> SqrtScalar(Vector128<double> upper, Vector128<double> value);

    // ...
}
```
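As a rough illustration of how a consumer might use these scalar overloads (a hedged sketch: the Vector128.CreateScalar and ToScalar helpers used to move between double and Vector128<double> are assumptions for illustration, not part of this proposal's surface area):

```cs
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static double SqrtAdd(double x, double y)
{
    if (Sse2.IsSupported)
    {
        // ADDSD followed by SQRTSD; the value stays in an xmm register throughout.
        Vector128<double> sum = Sse2.AddScalar(Vector128.CreateScalar(x),
                                               Vector128.CreateScalar(y));
        return Sse2.SqrtScalar(sum).ToScalar();
    }

    return Math.Sqrt(x + y); // software fallback
}
```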
Most of the remaining sections (Intended Audience, Semantics and Usage, Implementation Roadmap, etc.) are the same as in dotnet/runtime#23057.
cc: @russellhadley @mellinoe @CarolEidt @terrajobst, @fiigii
What does `Vector128<double> SqrtScalar(Vector128<double> value)` correspond to?
SqrtScalar => _mm_sqrt_sd (updated above): https://msdn.microsoft.com/en-us/library/1994h1ay(v=vs.100).aspx
AddScalar => _mm_add_sd (updated above): https://msdn.microsoft.com/en-us/library/we4dxyk3(v=vs.100).aspx
@tannergooding Thank you for opening this issue. Exposing these SSE* scalar floating-point intrinsics has different concerns from dotnet/runtime#23057, so discussing this in a new proposal is a good idea.
Do you want all the SSE* scalar floating-point instructions (most of them are already covered by the current RyuJIT codegen) or just a subset to be exposed as intrinsics?
@fiigii, I think they should all be exposed since that will likely lead to the best codegen and reduce pressure on the JIT for the scenarios where these matter. It will also likely lead to less confusion for the end-user when dealing with scalar types as they won't have to remember to convert to Vector128<float> to perform operations 'A', 'B', and 'C', and then convert back to float to perform operation 'X', 'Y', and 'Z'.
I believe the primary use case will be to ensure better codegen occurs when mixing scalar and vectorized operations (I believe this is the purpose behind certain functions such as _mm_mask_sqrt_sd).
But it will also be important for implementing optimized forms of various mathematical algorithms in managed code (such as the System.Math, System.MathF, and System.Numerics.Complex APIs), as in the sketch below.
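As a concrete sketch of the mixing scenario: a 2D vector length, where the squares and their sum use packed instructions and only the final square root uses the scalar form, all without leaving the xmm registers (Sse3.HorizontalAdd and ToScalar are assumed helpers here, outside this proposal's surface area):

```cs
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static double Length(Vector128<double> v)
{
    Vector128<double> squared = Sse2.Multiply(v, v);              // MULPD: <x*x, y*y>
    Vector128<double> sum = Sse3.HorizontalAdd(squared, squared); // HADDPD: x*x + y*y in the low element
    return Sse2.SqrtScalar(sum).ToScalar();                       // SQRTSD on the low element
}
```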
What is the purpose of:

```cs
// __m128d _mm_sqrt_sd(__m128d a, __m128d b)
public static Vector128<double> SqrtScalar(Vector128<double> upper, Vector128<double> value);
```

I mean, why does it take two parameters? SQRTSD takes only one...
@fanol, SQRTSD takes two operands. The first is the register which receives the result in its lower bits and has its upper bits left untouched. The second is a register or memory address on which the operation is actually performed.
The API needs to expose the ability to dictate what the upper bits of the computed result should be, for the cases where that matters. Taking a second parameter allows this and places the burden of optimization on the compiler (the other option would be to return nothing and take a ref parameter, which forces the user to allocate a temp themselves if they want to preserve a value).
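To make the semantics concrete, here is a sketch of what each overload would produce under the proposed behavior (Vector128.Create is used purely for illustration; the comments assume element 0 is the low element):

```cs
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

Vector128<double> value = Vector128.Create(4.0, 9.0);
Vector128<double> upper = Vector128.Create(1.0, 2.0);

// Single-parameter overload: lower element is sqrt(value[0]); the upper element
// is carried over from value itself.
Vector128<double> r1 = Sse2.SqrtScalar(value);        // <2.0, 9.0>

// Two-parameter overload (SQRTSD): lower element is sqrt(value[0]); the upper
// element comes from `upper`.
Vector128<double> r2 = Sse2.SqrtScalar(upper, value); // <2.0, 2.0>
```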
Going to ping @russellhadley @mellinoe @CarolEidt @terrajobst on this.
The API surface area isn't fully described (since it covers several hundred APIs), but it follows the same convention as the previous proposal so it should be "reviewable".
I'm fine with this in theory, but we need to ensure that we measure throughput as we expand the number of intrinsics that we support.
> we need to ensure that we measure throughput as we expand the number of intrinsics that we support.
I agree.
With the sheer number of APIs we are adding, when an intrinsic is hit in the System.Runtime.Intrinsics namespace there are going to be a number (possibly in the hundreds) of string comparisons to find the matching method (at least with my limited understanding of the NamedIntrinsics feature).
I think we'll end up needing to find an alternative way to match these intrinsics quickly, without doing a bunch of string comparisons. SSE2 has 144 unique instructions (at least if my reference is correct -- this is both scalar and vector forms), but there are additional overloads on top of that as well.
AFAIR the proposal by @fiigii was to use binary search to make it acceptably fast
> AFAIR the proposal by @fiigii was to use binary search to make it acceptably fast
That may be sufficient, but I still think it is going to be critical to measure each step along the way, so that we know where we stand.
@CarolEidt, at one point (before the Intel proposal went up), I had proposed that this be achieved via an [Intrinsic(IntrinsicId id)] attribute, where IntrinsicId is an enum with a strict 1-to-1 mapping to an enum in the runtime (what is currently the NamedIntrinsic enum).
Outside of any cost required to get the value out of the attribute for a given method, it would be a constant time lookup, regardless of how many intrinsics we decide to add now or in the future.
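A rough sketch of what that scheme might look like (all names and values here are illustrative of the idea, not actual CoreLib types):

```cs
using System;

// Managed side: values are assigned explicitly because each one must stay in
// sync with the corresponding enum in the runtime.
internal enum IntrinsicId : ushort
{
    Illegal         = 0,
    Sse2_AddScalar  = 1,
    Sse2_SqrtScalar = 2,
    // ...
}

[AttributeUsage(AttributeTargets.Method, Inherited = false)]
internal sealed class IntrinsicAttribute : Attribute
{
    public IntrinsicAttribute() { }                        // existing name-based matching
    public IntrinsicAttribute(IntrinsicId id) { Id = id; } // proposed constant-time lookup
    public IntrinsicId Id { get; }
}
```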
@tannergooding - that might be a reasonable solution; it has all the earmarks of fragility (the two sides getting out of sync), but given that the name can be verified, certainly at least in a Checked/Debug build, it could be practical. I've always been told that attribute recognition is expensive in the runtime, except for those attributes that have specific metadata representations, so it would be interesting to hear from someone on the VM side (@jkotas?)
> I've always been told that attribute recognition is expensive in the runtime, except for those attributes that have specific metadata representations
I think that [Intrinsic] is already in that category (specific metadata representation), since it is what is being used by the NamedIntrinsics today, just that it doesn't take any parameters.
> it has all the earmarks of fragility (the two sides getting out of sync), but given that the name can be verified, certainly at least in a Checked/Debug build, it could be practical
I think the general case is: if managed code uses an IntrinsicId that the runtime doesn't recognize, we will get a PNSE, but that is already the case today (if the runtime doesn't know to match the namespace/method/class name).
The "worst" case is that someone ends up changing the enum values in one or the other, such that they no longer line up, but that should be detectable in checked/debug mode (as you indicated). We can also help prevent it from happening by explicitly assigning values to the enum entries and adding warnings at the top/bottom of the enum declarations, to help ensure users see the requirement of keeping them in sync.
For reference, the existing check for [Intrinsic] is here: https://github.com/dotnet/coreclr/blob/master/src/vm/methodtablebuilder.cpp#L5086
I just don't know how expensive it is to pull out an enum value from it (although, I would hope it isn't terribly expensive, since something similar is done for MethodImpl).
I believe that matching intrinsics using an attribute that marks candidates (this is cheap to cache: a single bit) and then doing the lookup based on name is the right balance.
The current implementation of this scheme in the JIT has room for improvement. I expect that it will need to be turned into a hashtable once the number of these intrinsics grows; a hashtable turns it into a one-time name comparison. There is a ton of name comparison going on during JITing because metadata references are name-based, so a few more should be just a drop in the bucket.
I can be convinced by data if somebody shows me a profile that this does not work for some reason and a different scheme would be better.
BTW: We have standardized on this scheme in CoreRT/.NET Native as well so re-plumbing it to something else would be a bunch of busy work or forked CoreLib sources.
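In C# terms, the hashtable scheme described above would amount to something like the following sketch (the real lookup lives in the C++ JIT; the NamedIntrinsic values and table entries shown are illustrative only):

```cs
using System.Collections.Generic;

internal enum NamedIntrinsic { Illegal, Sse2_AddScalar, Sse2_SqrtScalar /* ... */ }

internal static class IntrinsicLookup
{
    // Built once; turns the per-candidate linear scan of names into a single
    // hash + comparison per [Intrinsic]-marked method the JIT encounters.
    private static readonly Dictionary<(string ClassName, string MethodName), NamedIntrinsic> s_map =
        new Dictionary<(string, string), NamedIntrinsic>
        {
            [("Sse2", "AddScalar")]  = NamedIntrinsic.Sse2_AddScalar,
            [("Sse2", "SqrtScalar")] = NamedIntrinsic.Sse2_SqrtScalar,
            // ... several hundred more entries
        };

    internal static NamedIntrinsic Lookup(string className, string methodName)
        => s_map.TryGetValue((className, methodName), out NamedIntrinsic id)
            ? id
            : NamedIntrinsic.Illegal;
}
```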
> I think that [Intrinsic] is already in that category (specific metadata representation)
It is not.
> the NamedIntrinsics today
I do not know what this is.
I'd honestly be surprised if the current scheme can't be made to perform acceptably well. Given that we'll only resort to name comparisons when we actually have one of these new intrinsics to process, it is already in the pay-for-play category, and very little jitted code will play. It "radix decomposes" pretty nicely given the spread of the names across namespaces, classes, and methods.
A similar argument could be made for how we do SIMD lookups. These use sequential name matches today and call the "slow" jit interface methods to get the names to match on, but we only pay for this cost if we see SIMD, and the recognition cost is just the tip of the iceberg.
By "those attributes that have specific metadata representations" I meant the "pseudo-custom attributes" as defined in II.21.2.1 of the spec. They are "attributes" that look like regular custom attributes, but which have dedicated bits in the metadata.
> It is not.
@jkotas, the IntrinsicAttribute is checked for here: https://github.com/dotnet/coreclr/blob/master/src/vm/methodtablebuilder.cpp#L5086
I don't know if that means it is in the category of specific metadata representation, just that the VM is doing the check for it regardless.
> I do not know what this is.
The above IntrinsicAttribute check ends up setting some flags which eventually route down to CORINFO_FLG_JIT_INTRINSIC in the JIT layer, which controls whether or not lookupNamedIntrinsic is called (https://github.com/dotnet/coreclr/blob/master/src/jit/importer.cpp#L3890). That method itself does the matching against namespace/class/method name to return a NamedIntrinsic enum value (those are defined here: https://github.com/dotnet/coreclr/blob/master/src/jit/namedintrinsiclist.h)
> We have standardized on this scheme in CoreRT/.NET Native as well so re-plumbing it to something else would be a bunch of busy work or forked CoreLib sources.
I definitely agree that if the standard elsewhere is to just match on names, it doesn't make much sense to go rewriting everything 😄
That being said, if in the future it is determined that the string comparisons do add too much overhead, I don't think the mechanism I proposed means other frameworks need to be replumbed immediately (or possibly ever).
The proposal I had would extend the existing [Intrinsic] attribute to take an optional IntrinsicId parameter. So existing frameworks, which did not know to lookup the parameter, could continue doing the name comparisons (the expense of doing so isn't a concern for CoreRT/.NET Native since they are AOT compiled). While CoreCLR would have a constant time lookup (without even the memory overhead of a hash table), with the most expensive part being the read of the IntrinsicId out of the attribute (still not sure how expensive this is, although I would hope not too much).
> A similar argument could be made for how we do SIMD lookups. These use sequential name matches today and call the "slow" jit interface methods to get the names to match on ...
Right, and I've been concerned that the existing mechanism is unreasonably slow, i.e. the "pay" for the "play" isn't a good balance. But perhaps just switching to something reasonable instead of a linear search would be sufficient. Of course, measurements and data would be nice :-)
> I've been concerned that the existing mechanism is unreasonably slow, i.e. the "pay" for the "play" isn't a good balance.
This is my concern as well.
The hardware intrinsics are expected to be used in very high-performance scenarios in the core libraries (reimplementing the Math, String, and Buffer functions are some primary examples in CoreFX itself).
Additionally, given how they are organized, we are going to have several hundred intrinsics in a single class (and Arm NEON, for example, doesn't have nearly the "ability" to split intrinsics into logical ISA groups that x86 does with SSE, SSE2, SSE3, etc).
Even with non-linear search, that could potentially be a large number of string comparisons we are doing (and that may additionally require further argument matching to determine which overload we are dealing with, etc).
So, I think every comparison we can cut out of this feature is just more optimizations we can do elsewhere.
As always, measurements and data would be nice.
Generally string lookup against a fixed set of target strings should be logarithmic in the number of target strings. And since we have (effectively) a radix 50 or so encoding, the logarithm falls off quite sharply. Going from 1 target to 1000 targets should only increase lookup time by some small integer factor. Hashing can cut this down to something linear, should it matter.
The jit always has to look at the numbers and types of arguments, so overload resolution won't cost much extra on top of this.
And from the jit's perspective, only a tiny fraction of all method calls it sees will be resolved as intrinsics.
I think it's more important to organize the expression of all this in code so it is obviously correct and easy to maintain. Having something that looks like a table would be a nice start. For instance, macroize the current enum thing and put names and properties there too. Then a build-time tool can generate a recognizer off this format.
> I think it's more important to organize the expression of all this in code so it is obviously correct and easy to maintain. Having something that looks like a table would be a nice start.
Let's start with the current implementation of https://github.com/dotnet/coreclr/pull/14020 if it looks good to you all.
After enabling all the Intel hardware intrinsics (or a whole class of them), it may be a good time to measure the performance difference among these identification algorithms (e.g., sequential search, binary search, and hashtable).
I've marked this as api-ready-for-review, as I believe all the relevant discussion has taken place.
This might warrant a special review session, as was done for the original intrinsics proposal, however.
Does this proposal also cover scalar move instructions (e.g., MOVD xmm, r/m32)?
It is meant to cover all of the scalar instructions that are behind the SSE through AVX CPUID flags (that is, the same CPUID flags covered by https://github.com/dotnet/corefx/issues/22940).
`movd xmm, r/m32` and `movd xmm, r/m64` would be covered, since they are behind the SSE2 CPUID flag.
cc @eerhardt @ViktorHofer as owners of numerics area.
FYI: The API review discussion was recorded - see https://youtu.be/9F_NsviGT0Y?t=2950 (7 min duration)
I have the initial PR up (https://github.com/dotnet/coreclr/pull/15341) and left some notes on some potential issues with the existing APIs that have already been added (from https://github.com/dotnet/corefx/issues/22940)
Adding to the 2.1 milestone since @tannergooding is working on adding this now.
The contracts are merged in both CoreCLR and CoreFX.
The basic codegen implementation is here: https://github.com/dotnet/coreclr/pull/15538
Closing this, as the APIs and implementations are essentially done (modulo the Fma ISA, which isn't implemented yet).