Today, the runtime supports the __fastcall calling convention, which not only allows interop with any native code that uses that calling convention but also allows it to take advantage of the additional registers that are available on the underlying architecture.
However, it means that operating with certain data types is still "sub-optimal".
Microsoft Windows provides the __vectorcall calling convention just for this purpose (https://msdn.microsoft.com/en-us/library/dn375768.aspx). It extends the existing __fastcall calling convention to additionally allow SIMD vector types and Homogeneous Vector Aggregate values to be passed via register rather than on the stack.
The System V AMD64 ABI already defines vector sized types (__m128, __m256) and supports passing them in register.
The runtime should add support for the __vectorcall calling convention, not only to improve performance, but to also provide better interop with native code that uses it.
The __vectorcall calling convention should be exposed on System.Runtime.InteropServices.CallingConvention as VectorCall.
FYI. @mellinoe, who may be interested.
This would significantly improve performance for the System.Numerics.Vector package, where all of the exposed types could be passed in register rather than passed on stack.
https://docs.microsoft.com/en-us/cpp/cpp/stdcall
On ARM and x64 processors, __stdcall is accepted and ignored by the compiler; on ARM and x64 architectures, by convention, arguments are passed in registers when possible, and subsequent arguments are passed on the stack.
https://docs.microsoft.com/en-us/cpp/cpp/vectorcall
On ARM machines, __vectorcall is accepted and ignored by the compiler
For ARM64 the standard calling convention (ARM64 AAPCS64) passes vectors in registers as Short Vectors or HVA.
A brief glance at the Vector ABI for ARM64 (VPCS) doc looks like it is similar to the ARM64 AAPCS64 except it changes Callee/Caller register save responsibilities to eliminate some of the issues with preserving/restoring high bits of vector registers. __atribute__((aarch64_vector_pcs))
Looks like attribute aarch64_vector_pcs is not recognized by gcc 6.3.0. Tested using latest Arm64 gcc on Compliler Explorer https://godbolt.org/
With the support for hardware intrinsics in addition to the existing support for things like System.Numerics.Vector, this may be more important.
This currently represents a scenario where the Windows ABI actually loses out on performance as compared to the System V ABI.
This performance difference is readily measurable in native code, and will become more measurable in managed code as the the CoreCLR System V ABI implementation continues getting improvements.
CC. @CarolEidt
On a similar note we should explore the custom xmm call convention on x86, at least for invoking our own math helpers, to avoid transitioning in and out of x87 like we do now.
__vectorcall on x86 looks pretty hacky.
On a similar note we should explore the custom xmm call convention on x86, at least for invoking our own math helpers, to avoid transitioning in and out of x87 like we do now
@AndyAyersMS, I thought we removed all the x87 FPU code with RyuJIT? At the very least, I remember doing some work to ensure the System.Math helpers were able to call the CRT implementations (which use SSE/SSE2 when that compiler switch is specified), rather than using the hand-coded assembly.
__vectorcall on x86 looks pretty hacky.
How so? It should just be (roughly speaking) the x86 __fastcall convention plus enabling HVA arguments
I thought we removed all the x87 FPU code with RyuJIT?
The standard x86 calling convention returns FP values in ST(0).
Hmm, maybe I misread the "spec" -- it seems like if we made vectorcall the default for all methods it looks like it would give us XMM pass/return for floats on x86. The description here is not all that easy to parse as it also says the convention for floats is not impacted.
it seems like if we made vectorcall the default for all methods it looks like it would give us XMM pass/return for floats on x86
Yes, it does that. For example: https://godbolt.org/g/ZsJv5y
Also it interesting to see that __fastcall on x86 has some limited aspects of __vectorcall. I am pretty sure the jit doesn't do this for manged methods with HFA/HVAs.
Maybe interop knows about it?
The description here is not all that easy to parse as it also says the convention for floats is not impacted.
The documentation page (https://msdn.microsoft.com/en-us/library/dn375768.aspx) does say that vector types include FP types:
A vector type is either a floating-point type鈥攆or example, a float or double鈥攐r an SIMD vector type鈥攆or example, __m128 or __m256.
And then for x86 it says:
Vector type results are returned by value in XMM0 or YMM0, depending on size. HVA results have each data element returned by value in registers XMM0:XMM3 or YMM0:YMM3, depending on element size. Other result types are returned by reference to memory allocated by the caller.
Maybe interop knows about it?
Interop does not support FastCall calling convention. From https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.callingconvention?view=netcore-2.0 : FastCall This calling convention is not supported.
Friendly ping as two years passed and I believe it's an "easy" yet probably very significant performance optimization!
@LifeIsStrange - this was something that we had hoped to be able to make progress on for the 5.0 release (starting with supporting the correct standard calling conventions for both Linux and Windows, where the former passes vectors in registers, and both conventions call for returning vectors in registers). However, there was enough complexity between the runtime stubs and the JIT handling, that it didn't get completed.
Most helpful comment
This would significantly improve performance for the
System.Numerics.Vectorpackage, where all of the exposed types could be passed in register rather than passed on stack.