Runtime: Runtime should be updated to support the `__vectorcall` calling convention

Created on 7 Jun 2017 · 18Comments · Source: dotnet/runtime

Rationale

Today, the runtime supports the __fastcall calling convention, which not only allows interop with any native code that uses that calling convention but also allows it to take advantage of the additional registers that are available on the underlying architecture.

However, it means that operating with certain data types is still "sub-optimal".

Microsoft Windows provides the __vectorcall calling convention just for this purpose (https://msdn.microsoft.com/en-us/library/dn375768.aspx). It extends the existing __fastcall calling convention to additionally allow SIMD vector types and Homogeneous Vector Aggregate values to be passed via register rather than on the stack.

The System V AMD64 ABI already defines vector sized types (__m128, __m256) and supports passing them in register.

Proposal

The runtime should add support for the __vectorcall calling convention, not only to improve performance, but to also provide better interop with native code that uses it.

The __vectorcall calling convention should be exposed on System.Runtime.InteropServices.CallingConvention as VectorCall.

area-Interop-coreclr

Source

tannergooding

👍10

Most helpful comment

This would significantly improve performance for the System.Numerics.Vector package, where all of the exposed types could be passed in register rather than passed on stack.

tannergooding on 7 Jun 2017

👍7

All 18 comments

FYI. @mellinoe, who may be interested.

tannergooding on 7 Jun 2017

This would significantly improve performance for the System.Numerics.Vector package, where all of the exposed types could be passed in register rather than passed on stack.

tannergooding on 7 Jun 2017

👍7

https://github.com/dotnet/designs/issues/13

tannergooding on 31 Jul 2017

https://docs.microsoft.com/en-us/cpp/cpp/stdcall

On ARM and x64 processors, __stdcall is accepted and ignored by the compiler; on ARM and x64 architectures, by convention, arguments are passed in registers when possible, and subsequent arguments are passed on the stack.

https://docs.microsoft.com/en-us/cpp/cpp/vectorcall

On ARM machines, __vectorcall is accepted and ignored by the compiler

For ARM64 the standard calling convention (ARM64 AAPCS64) passes vectors in registers as Short Vectors or HVA.

A brief glance at the Vector ABI for ARM64 (VPCS) doc looks like it is similar to the ARM64 AAPCS64 except it changes Callee/Caller register save responsibilities to eliminate some of the issues with preserving/restoring high bits of vector registers. __atribute__((aarch64_vector_pcs))

sdmaclea on 8 Feb 2018

Looks like attribute aarch64_vector_pcs is not recognized by gcc 6.3.0. Tested using latest Arm64 gcc on Compliler Explorer https://godbolt.org/

sdmaclea on 8 Feb 2018

With the support for hardware intrinsics in addition to the existing support for things like System.Numerics.Vector, this may be more important.

This currently represents a scenario where the Windows ABI actually loses out on performance as compared to the System V ABI.

This performance difference is readily measurable in native code, and will become more measurable in managed code as the the CoreCLR System V ABI implementation continues getting improvements.

tannergooding on 26 Mar 2018

CC. @CarolEidt

tannergooding on 26 Mar 2018

On a similar note we should explore the custom xmm call convention on x86, at least for invoking our own math helpers, to avoid transitioning in and out of x87 like we do now.

__vectorcall on x86 looks pretty hacky.

AndyAyersMS on 26 Mar 2018

👍1

On a similar note we should explore the custom xmm call convention on x86, at least for invoking our own math helpers, to avoid transitioning in and out of x87 like we do now

@AndyAyersMS, I thought we removed all the x87 FPU code with RyuJIT? At the very least, I remember doing some work to ensure the System.Math helpers were able to call the CRT implementations (which use SSE/SSE2 when that compiler switch is specified), rather than using the hand-coded assembly.

tannergooding on 26 Mar 2018

__vectorcall on x86 looks pretty hacky.

How so? It should just be (roughly speaking) the x86 __fastcall convention plus enabling HVA arguments

tannergooding on 26 Mar 2018

👍1

I thought we removed all the x87 FPU code with RyuJIT?

The standard x86 calling convention returns FP values in ST(0).

mikedn on 26 Mar 2018

👍1

Hmm, maybe I misread the "spec" -- it seems like if we made vectorcall the default for all methods it looks like it would give us XMM pass/return for floats on x86. The description here is not all that easy to parse as it also says the convention for floats is not impacted.

AndyAyersMS on 26 Mar 2018

it seems like if we made vectorcall the default for all methods it looks like it would give us XMM pass/return for floats on x86

Yes, it does that. For example: https://godbolt.org/g/ZsJv5y

mikedn on 26 Mar 2018

Also it interesting to see that __fastcall on x86 has some limited aspects of __vectorcall. I am pretty sure the jit doesn't do this for manged methods with HFA/HVAs.

Maybe interop knows about it?

AndyAyersMS on 26 Mar 2018

The description here is not all that easy to parse as it also says the convention for floats is not impacted.

The documentation page (https://msdn.microsoft.com/en-us/library/dn375768.aspx) does say that vector types include FP types:

A vector type is either a floating-point type—for example, a float or double—or an SIMD vector type—for example, __m128 or __m256.

And then for x86 it says:

Vector type results are returned by value in XMM0 or YMM0, depending on size. HVA results have each data element returned by value in registers XMM0:XMM3 or YMM0:YMM3, depending on element size. Other result types are returned by reference to memory allocated by the caller.

mikedn on 26 Mar 2018

Maybe interop knows about it?

Interop does not support FastCall calling convention. From https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.callingconvention?view=netcore-2.0 : FastCall This calling convention is not supported.

jkotas on 26 Mar 2018

Friendly ping as two years passed and I believe it's an "easy" yet probably very significant performance optimization!

LifeIsStrange on 2 Aug 2020

@LifeIsStrange - this was something that we had hoped to be able to make progress on for the 5.0 release (starting with supporting the correct standard calling conventions for both Linux and Windows, where the former passes vectors in registers, and both conventions call for returning vectors in registers). However, there was enough complexity between the runtime stubs and the JIT handling, that it didn't get completed.