Runtime: System.Numerics.Matrix4x4.Invert has no SIMD acceleration

Created on 1 Apr 2020 · 19 Comments · Source: dotnet/runtime

area-System.Numerics tenet-performance up-for-grabs

All 19 comments

Area Owners

@tannergooding, @pgovind

I'm happy for this, or any of the other methods on Matrix3x2, Matrix4x4, Plane, Quaternion, Vector2, Vector3, or Vector4 to be accelerated where they aren't already.

If you provide a pull-request, please also provide some performance numbers (from benchmark.net) and an assembly diff.
Feel free to ping me on here if you need help with any of that or would like to discuss an appropriate implementation.

A reference implementation for SSE in C can be found here.
(I didn't check if there are faster ways to do it.)

There is also an implementation as part of the DirectX Math library which is known to fit with the general use-case that the System.Numerics types were designed for (and also has an ARM implementation): https://github.com/microsoft/DirectXMath/blob/master/Inc/DirectXMathMatrix.inl#L772

> A reference implementation for SSE in C can be found here.
> (I didn't check if there are faster ways to do it.)

A lot of shuffles and we still don't have an API for _MM_SHUFFLE :-)

> A lot of shuffles and we still don't have an API for _MM_SHUFFLE :-)

Feel free to open a proposal 😉
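
For readers unfamiliar with the macro: C's `_MM_SHUFFLE(z, y, x, w)` just packs four 2-bit lane indices into one control byte. A minimal C# sketch of the same idea follows. `ShuffleHelper`, `MMShuffle`, and `Reverse` are hypothetical names for illustration, not an existing or proposed .NET API. The sketch also keeps a `const` control byte, on the assumption that the JIT prefers a constant immediate for `Sse.Shuffle`.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ShuffleHelper
{
    // Hypothetical helper: same bit packing as C's _MM_SHUFFLE(z, y, x, w).
    public static byte MMShuffle(int z, int y, int x, int w) =>
        (byte)((z << 6) | (y << 4) | (x << 2) | w);

    // The same packing for (0, 1, 2, 3), written as a compile-time constant.
    private const byte ReverseControl = (0 << 6) | (1 << 4) | (2 << 2) | 3;

    // Reverses the four lanes of v, with a scalar fallback when SSE is unavailable.
    public static Vector128<float> Reverse(Vector128<float> v)
    {
        if (Sse.IsSupported)
        {
            return Sse.Shuffle(v, v, ReverseControl);
        }

        return Vector128.Create(
            v.GetElement(3), v.GetElement(2), v.GetElement(1), v.GetElement(0));
    }
}
```

With this packing, `MMShuffle(0, 1, 2, 3)` is `0x1B` and `MMShuffle(3, 2, 1, 0)` is `0xE4` (the identity shuffle).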

I'd like to grab and investigate this if there are no objections.

Thanks @eanova, no concerns on my end so I've assigned it out.

Feel free to let me know if you have any questions, comments, concerns!

Cool. In short, I'm going to use the DirectXMath library as the reference. The only thing that immediately jumps out at me is if the Matrix4x4 class has the right structure since it's not using Vector128. I don't think this is going to be a big issue, but I guess it should also be benchmarked?

Also, I've read your blog post and have been looking at Intel's and .NET's docs on intrinsics, as well as the BenchmarkDotNet site. Are there any other sources I should be aware of?

> if the Matrix4x4 class has the right structure since it's not using Vector128

Unfortunately it cannot be changed, since that would be a breaking change. In theory, Vector128-like fields could help avoid unnecessary spills between operations, but that can only be done in some third-party Matrix4x4 🙁

You can see an example of Matrix4x4 using HWIntrinsics here: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Private.CoreLib/src/System/Numerics/Matrix4x4.cs#L1762

It isn't necessarily "ideal", especially across multiple calls right now but it isn't something we can adjust due to back-compat and the existing fields being public.
We might be able to experiment with having a union type with the underlying members being Vector4 and then using the zero-cost .AsVector128 and .AsVector4 APIs to convert between them, but it would likely require more extensive testing and thought.
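
As a rough illustration of the conversion APIs mentioned above (not of the union-type idea itself), the sketch below moves a `Vector4` into a `Vector128<float>`, does one SSE operation, and converts back. `RowOps`, `Row1`, and `AddRows` are hypothetical names used only for this example, and copying the row out of the matrix field by field is an assumption made for simplicity.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class RowOps
{
    // Copies the first row of the matrix into a Vector4 (field-by-field, for clarity).
    public static Vector4 Row1(in Matrix4x4 m) =>
        new Vector4(m.M11, m.M12, m.M13, m.M14);

    // Adds two rows using SSE when available, otherwise plain Vector4 addition.
    public static Vector4 AddRows(Vector4 a, Vector4 b)
    {
        if (Sse.IsSupported)
        {
            // AsVector128 / AsVector4 reinterpret the value without changing its bits.
            Vector128<float> sum = Sse.Add(a.AsVector128(), b.AsVector128());
            return sum.AsVector4();
        }

        return a + b;
    }
}
```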

@tannergooding I wonder if it makes sense for the JIT to optimize away redundant loads after inlining (Matrix methods are mostly not inlineable, but I mean the general case). E.g. after inlining:

Sse.Store(&matrix.M11, vec1);
var vec2 = Sse.LoadVector128(&matrix.M11); // <-- redundant load of the value just stored

@EgorBo, there are likely some possible optimizations. I don't know what scenarios the above would be possible/legal under.

Hi, just wanted to provide an update on progress. I've written up the code in a benchmark project and was able to get a 20% improvement using SIMD. I still have to test for correctness.

Hi guys, sorry about the delay. I was able to debug and test the new code. All 7 Invert tests from Matrix4x4Tests.cs are now passing. Should I begin the ARM/NEON code before the PR? Or should I submit just the SSE code for now?
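
One simple extra sanity check for a candidate inverse, alongside the existing unit tests, is to multiply it by the original matrix and compare the product to the identity. The sketch below assumes that approach. `InvertCheck`, `IsInverse`, and the default tolerance are hypothetical and are not taken from Matrix4x4Tests.cs.

```csharp
using System;
using System.Numerics;

public static class InvertCheck
{
    // Returns true when m * candidate is within `tolerance` of the identity matrix.
    public static bool IsInverse(Matrix4x4 m, Matrix4x4 candidate, float tolerance = 1e-5f)
    {
        Matrix4x4 d = (m * candidate) - Matrix4x4.Identity;

        float[] elements =
        {
            d.M11, d.M12, d.M13, d.M14,
            d.M21, d.M22, d.M23, d.M24,
            d.M31, d.M32, d.M33, d.M34,
            d.M41, d.M42, d.M43, d.M44,
        };

        // Track the largest absolute deviation from the identity.
        float maxError = 0f;
        foreach (float e in elements)
        {
            maxError = MathF.Max(maxError, MathF.Abs(e));
        }

        return maxError <= tolerance;
    }
}
```

A candidate implementation could be run through `IsInverse` next to `Matrix4x4.Invert` on the same inputs. The tolerance likely needs tuning for poorly conditioned matrices.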

We can add the case to https://github.com/dotnet/runtime/issues/33565 for ARM64. There may still be a few additional intrinsics required before it can be completed successfully.
However, if you want to give it a try and report back on any intrinsics that are needed but not yet implemented, that is also fine 😄

Got it... I can do that. Let me clean up the current code and I'll post it as a gist for now.

Just one quick question: what do you recommend for ARM64 development? Is there a proper emulator, or should I be looking at a Raspberry Pi?

I use a combination of a Raspberry Pi 3/4 and a Windows 10 on ARM laptop (such as the Surface Pro X or Galaxy Book2). I'm not aware of any emulators in use.
I would recommend cross-building on your desktop (Ubuntu directly or WSL on Windows) and then copying the bits to your Pi rather than attempting to build on the Pi directly. It significantly shortens the turnaround time.

// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.778 (1909/November2018Update/19H2)
Intel Core i9-9900K CPU 3.60GHz (Coffee Lake), 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=3.1.201
[Host] : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
DefaultJob : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT

| Method | Mean | Error | StdDev | Median |
|------------------------------------ |----------:|---------:|---------:|----------:|
| DotNetInverseCallParameters | 49.02 ns | 0.744 ns | 0.696 ns | 49.11 ns |
| DotNetInverse | 48.49 ns | 0.243 ns | 0.215 ns | 48.53 ns |
| SSEInverseWithShuffleMacro | 178.79 ns | 0.378 ns | 0.316 ns | 178.68 ns |
| SSEInverseDirectStoreCallParameters | 47.20 ns | 0.875 ns | 0.818 ns | 47.13 ns |
| SSEInverseDirectStore | 35.14 ns | 0.193 ns | 0.171 ns | 35.09 ns |
| SSEInverseSSEStoreCallParameters | 47.04 ns | 0.165 ns | 0.154 ns | 47.04 ns |
| SSEInverseSSEStore | 39.78 ns | 0.192 ns | 0.161 ns | 39.72 ns |
| SSEInverseShuffleClassCallParamters | 50.50 ns | 1.028 ns | 1.184 ns | 51.35 ns |
| SSEInverseShuffleClass | 35.10 ns | 0.104 ns | 0.097 ns | 35.12 ns |
| SSEInverseShuffleEnumCallParamters | 47.75 ns | 0.745 ns | 0.660 ns | 47.56 ns |
| SSEInverseShuffleEnum | 35.15 ns | 0.513 ns | 0.480 ns | 34.97 ns |

These are all the benchmarks I've tried so far, varying both the inversion code and how _MM_SHUFFLE is implemented.
The CallParameters suffix marks the variants that pass the matrix and the result as parameters, i.e.:

        // Variant that builds its input and passes it (plus an out result) as parameters.
        public void SSEInverseDirectStoreCallParameters()
        {
            Matrix4x4 mtx = new Matrix4x4(
                1, 0, 0, 0,
                0, 1, 0, 0,
                0, 0, 1, 0,
                0, 0, 0, 1);
            Matrix4x4 result;
            Matrix4x4Inverse.BenchInverseDirectStore(mtx, out result);
        }

        // Variant that calls the parameterless overload.
        public void SSEInverseDirectStore() => Matrix4x4Inverse.BenchInverseDirectStore();

Q: I'm not 100% sure how to interpret these results. Am I benchmarking the wrong use case? It would seem there is a clear winner here, as well as a choice in how _MM_SHUFFLE is implemented/used?

Thanks! @tannergooding
