Runtime: System.Numerics.Matrix4x4.Invert has no SIMD acceleration

Created on 1 Apr 2020 · 19 Comments · Source: dotnet/runtime

area-System.Numerics tenet-performance up-for-grabs

CallumDev

All 19 comments

Area Owners

@tannergooding, @pgovind

msftbot[bot] on 1 Apr 2020

I'm happy for this, or any of the other methods on Matrix3x2, Matrix4x4, Plane, Quaternion, Vector2, Vector3, or Vector4 to be accelerated where they aren't already.

If you provide a pull request, please also provide some performance numbers (from BenchmarkDotNet) and an assembly diff.
Feel free to ping me on here if you need help with any of that or would like to discuss an appropriate implementation.

tannergooding on 1 Apr 2020

A reference implementation for SSE in C can be found here.
(I didn't check if there are faster ways to do it.)

gfoidl on 1 Apr 2020

There is also an implementation as part of the DirectX Math library which is known to fit with the general use-case that the System.Numerics types were designed for (and also has an ARM implementation): https://github.com/microsoft/DirectXMath/blob/master/Inc/DirectXMathMatrix.inl#L772

tannergooding on 1 Apr 2020

> A reference implementation for SSE in C can be found here.
> (I didn't check if there are faster ways to do it.)

A lot of shuffles, and we still don't have an API for _MM_SHUFFLE :-)

EgorBo on 1 Apr 2020

> A lot of shuffles and we still don't have an API for _MM_SHUFFLE :-)

Feel free to open a proposal 😉

tannergooding on 1 Apr 2020
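
For readers unfamiliar with the macro discussed above: the C macro `_MM_SHUFFLE(fp3, fp2, fp1, fp0)` packs four 2-bit lane indices into one control byte. A minimal C# sketch of an equivalent helper might look like the following. The names `ShuffleSketch`, `MmShuffle`, and `Reverse` are hypothetical and are not part of any .NET API; only `Sse.Shuffle` is an existing intrinsic.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ShuffleSketch
{
    // Hypothetical helper mirroring the C macro _MM_SHUFFLE(fp3, fp2, fp1, fp0).
    // Each argument is a lane index in the range 0..3; fp3 lands in the two
    // highest bits of the control byte and fp0 in the two lowest.
    public static byte MmShuffle(byte fp3, byte fp2, byte fp1, byte fp0)
        => (byte)((fp3 << 6) | (fp2 << 4) | (fp1 << 2) | fp0);

    // Example use (requires SSE support at run time): reverse the four lanes
    // of a vector by selecting lanes 3, 2, 1, 0 in that order.
    public static Vector128<float> Reverse(Vector128<float> v)
        => Sse.Shuffle(v, v, MmShuffle(0, 1, 2, 3));
}
```

Whether the JIT folds such a helper call into an immediate operand depends on inlining and constant propagation, which this sketch does not verify; writing the control byte as a literal (for example `0x1B` for the reversal above) avoids the question.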

I'd like to grab and investigate this if there are no objections.

eanova on 8 Apr 2020

👍3

Thanks @eanova, no concerns on my end so I've assigned it out.

Feel free to let me know if you have any questions, comments, concerns!

tannergooding on 9 Apr 2020

Cool. In short, I'm going to use the DirectXMath library as the reference. The only thing that immediately jumps out at me is whether the Matrix4x4 class has the right structure, since it's not using Vector128. I don't think this is going to be a big issue, but I guess it should also be benchmarked?

Also, I've read your blog post and have been looking at Intel's and .NET's docs on intrinsics, and the BenchmarkDotNet site. Are there any other sources I should be aware of?

eanova on 9 Apr 2020

> if the Matrix4x4 class has the right structure since it's not using Vector128

Unfortunately it cannot be changed, since that would be a breaking change. In theory, Vector128-like fields could help avoid unnecessary spills between operations, but that can only be done in some third-party Matrix4x4 🙁

EgorBo on 9 Apr 2020

You can see an example of Matrix4x4 using HWIntrinsics here: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Private.CoreLib/src/System/Numerics/Matrix4x4.cs#L1762

It isn't necessarily "ideal" right now, especially across multiple calls, but it isn't something we can adjust due to back-compat and the existing fields being public.
We might be able to experiment with having a union type with the underlying members being Vector4 and then using the zero-cost .AsVector128 and .AsVector4 APIs to convert between them, but it would likely require more extensive testing and thought.

tannergooding on 9 Apr 2020

👍1
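
As a small illustration of the conversion APIs mentioned above, the sketch below round-trips a `Vector4` through `Vector128<float>` using the existing `AsVector128` and `AsVector4` extension methods. The class and method names (`RowConversionSketch`, `FirstRow`, `RoundTrip`) are hypothetical, and building a `Vector4` from the public fields is only one assumed way to get at a row; it is not how the runtime's own implementation accesses the data.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

public static class RowConversionSketch
{
    // Hypothetical helper: copy the first row of a matrix into a Vector4
    // and reinterpret it as a Vector128<float>.
    public static Vector128<float> FirstRow(in Matrix4x4 m)
        => new Vector4(m.M11, m.M12, m.M13, m.M14).AsVector128();

    // Reinterpret a Vector4 as Vector128<float> and back again.
    // Both conversions are reinterpretations of the same 16 bytes.
    public static Vector4 RoundTrip(Vector4 v)
        => v.AsVector128().AsVector4();
}
```

For example, `RowConversionSketch.RoundTrip(new Vector4(1, 2, 3, 4))` should compare equal to the original vector.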

@tannergooding I wonder if it makes sense for the JIT to optimize away redundant loads after inlining (Matrix methods are mostly not inlineable, but I mean the general case). E.g. after inlining:

Sse.Store(&matrix.M11, vec1);
var vec2 = Sse.LoadVector128(&matrix.M11); // <-- redundant: reloads the value that was just stored

EgorBo on 9 Apr 2020

@EgorBo, there are likely some possible optimizations. I don't know what scenarios the above would be possible/legal under.

tannergooding on 9 Apr 2020

Hi, just wanted to provide an update on progress. I've written up the code in a benchmark project and was able to get a 20% improvement using SIMD. I still have to test for correctness.

eanova on 13 Apr 2020

🎉2

Hi guys, sorry about the delay. I was able to debug and test the new code. All 7 Invert tests from Matrix4x4Tests.cs are now passing. Should I begin the ARM/NEON code before the PR? Or should I submit just the SSE code for now?

eanova on 23 Apr 2020
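
Beyond the existing unit tests, one simple sanity check for any inversion routine is that a matrix multiplied by its inverse gives the identity within a tolerance. The following is a minimal sketch of such a check against the built-in `Matrix4x4.Invert`; the helper name `InverseRoundTrips` and the default tolerance are assumptions made for illustration, and this is not the test code referred to in the thread.

```csharp
using System;
using System.Numerics;

public static class InvertCheckSketch
{
    // Hypothetical helper: returns true when m * inverse(m) is within
    // `tolerance` of the identity matrix in every element.
    // Returns false when Matrix4x4.Invert reports the matrix as singular.
    public static bool InverseRoundTrips(Matrix4x4 m, float tolerance = 1e-5f)
    {
        if (!Matrix4x4.Invert(m, out Matrix4x4 inverse))
        {
            return false;
        }

        // Difference between the product and the identity, element by element.
        Matrix4x4 d = (m * inverse) - Matrix4x4.Identity;
        float[] elements =
        {
            d.M11, d.M12, d.M13, d.M14,
            d.M21, d.M22, d.M23, d.M24,
            d.M31, d.M32, d.M33, d.M34,
            d.M41, d.M42, d.M43, d.M44,
        };

        foreach (float e in elements)
        {
            // The negated comparison also rejects NaN.
            if (!(Math.Abs(e) <= tolerance))
            {
                return false;
            }
        }

        return true;
    }
}
```

A candidate SIMD implementation could be substituted for the `Matrix4x4.Invert` call to compare its behavior on the same inputs.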

We can add the case to https://github.com/dotnet/runtime/issues/33565 for ARM64. There may still be a few additional intrinsics required before it can be completed successfully.
However, if you want to give it a try and report back on any intrinsics that are needed but not yet implemented, that is also fine 😄

tannergooding on 23 Apr 2020

Got it... I can do that. Let me clean up the current code and I'll post it as a gist for now.

Just one quick question: what do you recommend for ARM64 development? Is there a proper emulator, or should I be looking at a Raspberry Pi?

eanova on 23 Apr 2020

I use a combination of a rPI 3/4 or a Windows 10 on ARM laptop (such as the Surface Book X or GalaxyBook 2). I'm not aware of any emulators in use.
I would recommend cross-building on your desktop (Ubuntu directly or WSL on Windows) and then copying bits to your pi rather than attempting to build on your pi directly. It significantly shortens the turnaround time.

tannergooding on 23 Apr 2020

👍1

// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.778 (1909/November2018Update/19H2)
Intel Core i9-9900K CPU 3.60GHz (Coffee Lake), 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=3.1.201
[Host] : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
DefaultJob : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT

| Method | Mean | Error | StdDev | Median |
|------------------------------------ |----------:|---------:|---------:|----------:|
| DotNetInverseCallParameters | 49.02 ns | 0.744 ns | 0.696 ns | 49.11 ns |
| DotNetInverse | 48.49 ns | 0.243 ns | 0.215 ns | 48.53 ns |
| SSEInverseWithShuffleMacro | 178.79 ns | 0.378 ns | 0.316 ns | 178.68 ns |
| SSEInverseDirectStoreCallParameters | 47.20 ns | 0.875 ns | 0.818 ns | 47.13 ns |
| SSEInverseDirectStore | 35.14 ns | 0.193 ns | 0.171 ns | 35.09 ns |
| SSEInverseSSEStoreCallParameters | 47.04 ns | 0.165 ns | 0.154 ns | 47.04 ns |
| SSEInverseSSEStore | 39.78 ns | 0.192 ns | 0.161 ns | 39.72 ns |
| SSEInverseShuffleClassCallParamters | 50.50 ns | 1.028 ns | 1.184 ns | 51.35 ns |
| SSEInverseShuffleClass | 35.10 ns | 0.104 ns | 0.097 ns | 35.12 ns |
| SSEInverseShuffleEnumCallParamters | 47.75 ns | 0.745 ns | 0.660 ns | 47.56 ns |
| SSEInverseShuffleEnum | 35.15 ns | 0.513 ns | 0.480 ns | 34.97 ns |

These are all the benchmarks I've tried so far, both in terms of the code and _MM_SHUFFLE.
The CallParameters variants pass the matrix in as an argument and receive the result through an out parameter, i.e.:

        // Variant that builds the input and passes it (and the out result) as call parameters.
        public void SSEInverseDirectStoreCallParameters()
        {
            Matrix4x4 mtx = new Matrix4x4(
                         1, 0, 0, 0,
                         0, 1, 0, 0,
                         0, 0, 1, 0,
                         0, 0, 0, 1);
            Matrix4x4 result;
            Matrix4x4Inverse.BenchInverseDirectStore(mtx, out result);
        }

        // Variant that takes no parameters.
        public void SSEInverseDirectStore() => Matrix4x4Inverse.BenchInverseDirectStore();

Q: I'm not 100% sure how to interpret these results. Am I benchmarking the wrong use case? It would seem there is a clear winner here, as well as a choice in how _MM_SHUFFLE is implemented/used?

Thanks! @tannergooding

eanova on 7 May 2020
