Runtime: Alternative versions of methods for Vectors and Matrices

Created on 25 Nov 2014 · 59 comments · Source: dotnet/runtime

Would be helpful to have alternative versions of methods with an "out" parameter as the result. Example:
this

```C#
public static Matrix CreateFromQuaternion(Quaternion quaternion) { ... }
```

and that

```C#
public static void CreateFromQuaternion(ref Quaternion quaternion, out Matrix result) { ... }
```

What do you think, guys? Do you want to add that?
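At the call site, the two shapes would compare roughly like this (a minimal sketch; `ByValue`/`ByRef` are illustrative wrappers, and the ref/out overload does not exist in System.Numerics today):

```csharp
using System.Numerics;

static class Example
{
    // Existing by-value shape: the 64-byte Matrix4x4 result is copied back
    // to the caller through the return value.
    public static Matrix4x4 ByValue(Quaternion q)
        => Matrix4x4.CreateFromQuaternion(q);

    // Proposed ref/out shape (hypothetical): the caller names the
    // destination, so no large struct travels through the return value.
    public static void ByRef(ref Quaternion q, out Matrix4x4 result)
        => result = Matrix4x4.CreateFromQuaternion(q);
}
```

Both produce identical results; the difference is purely in how the 64 bytes move between caller and callee.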

api-approved area-System.Numerics


All 59 comments

Can you give an example where you think this is helpful? Does it simplify a specific scenario?

ref and out are better for memory in this case. That matters if you are actually building a render engine, etc. :)

Also, in the ref/out scenario, you can reuse an already created and allocated matrix. If you need a lot of quaternion-to-matrix conversions, it can help you save some milliseconds.
However, with code like Matrix.CreateFromQuaternion(...), I don't think memory allocation should be the main concern in an application that uses this method heavily.
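To illustrate the reuse point: with an out-based overload, a single scratch local can serve every iteration of a hot loop (a sketch; the out-taking `CreateFromQuaternion` shown here is hypothetical, forwarding to the real by-value API):

```csharp
using System.Numerics;

static class Reuse
{
    // Hypothetical out-based overload (not in System.Numerics today).
    static void CreateFromQuaternion(ref Quaternion q, out Matrix4x4 result)
        => result = Matrix4x4.CreateFromQuaternion(q);

    public static Matrix4x4 Product(Quaternion[] rotations)
    {
        Matrix4x4 scratch;                      // one local reused every iteration
        Matrix4x4 total = Matrix4x4.Identity;
        foreach (var q in rotations)
        {
            var r = q;                          // ref requires an addressable local
            CreateFromQuaternion(ref r, out scratch); // written in place, no return copy
            total *= scratch;
        }
        return total;
    }
}
```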

ref/out avoid copying the value; for large value types such as Matrix that can be quite an improvement. Not only do you avoid the cost of the copy, but the generated code is much smaller as a result.

That said, using ref/out is kind of ugly and in some cases it simply doesn't work (you can't use ref/out with properties for example). Having to extend the API with such overloads just because the JIT compiler can't handle this case properly is a rather unfortunate situation.
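For example, a property cannot be passed by ref, which forces a copy anyway (a sketch with illustrative names; the ref-taking `Determinant` is hypothetical):

```csharp
using System.Numerics;

class Transform
{
    public Matrix4x4 World { get; set; } = Matrix4x4.Identity;
}

static class PropertyLimitation
{
    // Hypothetical ref-taking overload, for illustration only.
    static float Determinant(ref Matrix4x4 m) => m.GetDeterminant();

    public static float DeterminantOfWorld(Transform t)
    {
        // Determinant(ref t.World);  // does not compile: a property is not a variable
        var copy = t.World;           // forced to copy into a local first...
        return Determinant(ref copy); // ...which defeats the purpose of ref
    }
}
```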

The example of CreateFromQuaternion shouldn't allocate any memory, as it's entirely stack-based?

You would pay a 64-byte copy cost if you were storing the return value rather than just using it.

Reusing a local passed by ref shouldn't gain much, and I think things get weird when you start passing addresses of local stack variables around if you're not careful. (If you are storing the return, it's probably not a hot path, as you are already caching.)

Is the proposal here to do this for all methods on these types or a subset? It would be helpful to understand what exactly is in scope and have some sample code fragments that exhibit bad codegen today so we can compare the present codegen to the codegen we would get if we made changes.

Pretty much what @mikedn said. If you want better performance, ref/out methods are a good candidate and should be implemented where it makes sense, i.e. for big struct types like Matrix/Quaternion etc.

For cases like public static Matrix CreateFromQuaternion(Quaternion quaternion), I can imagine how passing the argument by ref could provide different codegen that is better in some cases (since you don't have to copy the data into the callee).

However, is the rewrite of the return value to an out parameter really that helpful? I would have sort of assumed there would be some return value optimization going on.

For me, it would be helpful to see some representative samples where the codegen is unacceptable so we can understand the best way to address the issue. Is it via additional APIs? Is there something we can do to make the methods more likely to be inlined and hence data need not be copied, are there other optimizations the JIT could be doing?

I worry about creating a bunch of new APIs to address problems we don't have good samples for. How can we know if we are doing better?

This thread and its posts are like déjà vu; it's almost surreal.

https://social.msdn.microsoft.com/Forums/en-US/e715c16b-6b02-454a-b836-d402093bd36f/end-users-shouldnt-have-to-rewrite-xna-math-functions-in-order-to-improve-performance?forum=xnaframework

An almost 9-year-old thread. XNA managed to get that API correct in the end. Has anything changed in the compiler since then for this issue?

If you can change something somewhere so that a bunch of floats isn't copied there and back, then we don't need this change in the API. Whatever solution you find is good. And if you want some good samples, just try your examples for this library with a ref/out implementation (or any solution you pick) and compare the difference.

I don't know what the jitter does, but a byval Quaternion can fit in a single CPU register; if so, a byref version would give you worse performance, since the value could be referenced from multiple places on multiple threads and so always needs to be memory-backed rather than register-backed.

However, is the rewrite of the return value to an out parameter really that helpful? I would have sort of assumed there would be some return value optimization going on.

Good question, struct returns are already converted to out returns by the JIT compiler since there's no other (sane) way of returning a struct that doesn't fit in a register. And it turns out that the JIT compiler does an optimization and eliminates one of 2 copies that are normally required. For code like field = CreateFromQuaternion(...) the JIT compiler generates code that looks identical to CreateFromQuaternion(..., out field) instead of passing a pointer to a temporary and then copying the temporary to the actual destination.

The remaining issue is a copy that's performed inside CreateFromQuaternion. The matrix is normally computed in a local variable and at the end of the method the local variable is copied to the out parameter. It's difficult to avoid this copy, attempting to compute the matrix straight into the out parameter would result in a corrupted return value if an exception gets thrown during computation.
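A sketch of that internal compute-into-a-local pattern (illustrative, not the actual System.Numerics implementation):

```csharp
using System.Numerics;

static class ExceptionSafety
{
    // Sketch of why the implementation computes into a local first: if it
    // wrote fields of `result` one by one and threw midway, the caller's
    // variable could be left partially written.
    public static void Scale(float s, out Matrix4x4 result)
    {
        Matrix4x4 local = Matrix4x4.Identity;   // all the work happens in a local...
        local.M11 = s; local.M22 = s; local.M33 = s;
        // (an exception thrown up to here would leave `result` untouched)
        result = local;                         // ...then one final 64-byte copy
    }
}
```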

I can imagine how passing the argument by ref could provide different codegen that is better in some cases (since you don't have to copy the data into the callee).

This is something where the JIT compiler might be able to do better. I think it could always pass structs by ref and have the callee perform the copy when necessary to preserve the value semantics. In some cases the copy would simply transform into loading the necessary struct fields in registers.

I don't know what the jitter does but a byval Quaternion can be a single cpu register

Yes, it could be but it appears that currently it isn't. It's also debatable if passing a quaternion in a register actually helps. If the callee needs to call Quaternion methods that cannot be inlined then the compiler will have to spill the register to memory anyway so it can take the address for this.

XNA managed to get that API correct in the end.

That's debatable. Using ref and out for performance reasons is a workaround rather than "correct". That's not to say it shouldn't be done, but don't be surprised if people discuss this over and over again.

The way I see it: I do care about performance but I also want the code to be readable. What can be done so that one can use operator overloads (think matrix multiplication for example) and also get good performance?

Well, the compiler could check whether a struct is big (i.e. Matrix) and do that optimization (ref) for us, but I'm not sure if there is any issue with that kind of behavior.

Yeah, you are right. The ref/out solution was always a workaround (even for XNA), but I thought that changing the API (in one small library) would be easier than changing how the compiler works (for everyone).

Could the (relatively new) AggressiveInlining flag be used to force the jitter to inline those methods? If SIMD code is emitted, it shouldn't bloat the generated code too much, should it?

If SIMD code is emitted, it shouldn't bloat the generated code too much, should it?

I don't see SIMD code being emitted for Matrix4x4 and Quaternion.

Could the (relatively new) AggressiveInlining flag be used to force the jitter to inline those methods?

Inlining doesn't remove unnecessary copies by itself, the JIT compiler must do that as a separate optimization. As far as I can tell it doesn't do that currently:

```C#
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static Matrix4x4 Create() {
    return new Matrix4x4(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
}

static int Main() {
    Matrix4x4 m = Create();
    return (int)m.GetDeterminant();
}
```

generates the following code:

```asm
...
movss xmm2, dword ptr [@RWD48]
movss dword ptr [rsp+80H], xmm2
lea rcx, bword ptr [rsp+90H]
...
movss xmm1, dword ptr [@RWD60]
call System.Numerics.Matrix4x4:.ctor(float,float,float,float,float,float,float,float,float,float,float,float,float,float,float,float):this
movdqu xmm0, qword ptr [rsp+90H] ; redundant copying of a Matrix4x4
movdqu qword ptr [rsp+D0H], xmm0
movdqu xmm0, qword ptr [rsp+A0H]
movdqu qword ptr [rsp+E0H], xmm0
movdqu xmm0, qword ptr [rsp+B0H]
movdqu qword ptr [rsp+F0H], xmm0
movdqu xmm0, qword ptr [rsp+C0H]
movdqu qword ptr [rsp+100H], xmm0
lea rcx, bword ptr [rsp+D0H]
call System.Numerics.Matrix4x4:GetDeterminant():float:this
```

@mikedn: I haven't checked, but if matrix multiplication, for instance, isn't using SIMD, then what is?

Lots of comments in here, so sorry if I don't address a couple.

@caroleidt and I had discussed this briefly a while back. A couple of things are problematic and prevent us from just "adding the methods":

  1. It would greatly complicate the JIT's recognition of these intrinsics, as there could be two forms for any given operation.
  2. This can be a strange pattern for new users, and might confuse some people's expectations of what ref and out are doing here. That said, this is a very common pattern for most math libraries, so I'm not too concerned that many people would be confused for long.
  3. Like someone said above, though, this would be a workaround of sorts, and we would prefer to keep the exposed surface area of the library conceptually pure and tackle the root problem itself.

Expanding on (3): We would love to tackle this problem generically, be it through better JIT support, better inlining, better heuristics, etc. Alternatively, given that the JIT already recognizes some of these operations as intrinsics (or can in the future), it may be a more pragmatic approach to simply improve the handling of these specific operations in the JIT. If this is a general problem that needs addressing though, it would be great if we could provide a general solution rather than special-case the methods in this library.

In summary, we don't plan to add explicit ref/out variations of methods at this time, but we do want to improve the performance of these methods, either through a generic solution, or through better JIT recognition on these specific types.

Hi there,
this issue is preventing us (for a project like SharpDX or Xenko) from switching to System.Numerics. For example, when doing matrix multiplications you really can't afford to have 64 bytes × 3 copied back and forth on the stack, so we always do all of our calculations by ref/out. The API should reflect that passing big structs by value is not efficient (just convenient). Ideally we should not count on the JIT to do things like this (and today we basically can't, as we have to deal with many different JIT/codegen implementations out there, from mobile Android, iOS, Windows, etc.)

Also, AFAIR for SharpDX, one reason by-ref methods are not ideal is a VB limitation: VB doesn't accept methods with the same name that differ only by ref/out parameter types, which is very annoying (and one reason why I haven't bothered to support VB much so far).

/cc @CarolEidt

While this is especially true for Matrix, out methods for operators on vectors would also be nice to have.

For operator+(..):

```C#
public static void Add( ref Vector v1, ref Vector v2, out Vector result )  // (or vice versa, out result at start, _shrug_)
{ result.x = v1.x + v2.x; .... }

public void Add( ref Vector v, out Vector result )  // (or vice versa, out result at start, _shrug_)
{ result.x = x + v.x; .... }
```

Add, Subtract, Multiply, Divide (divide implemented as * 1/val).
Matrix should have Multiply (or an Apply method) with similar out vectors.

Scaling can also take advantage of a full vector multiply:

```C#
Vector.Multiply( ref Vector a, ref Vector b, out Matrix result );
```

and

```C#
Matrix.Multiply( ref Matrix a, ref Matrix b, out Matrix result );
```

.... it's just a habit to put the result at the end...


Other useful functions:

```C#
Vector
{
    /* double or whatever the appropriate base type is */
    public void AddScale( Vector v, double s, out Vector result )
    {
        result.x = x + v.x * s;
        result.y = y + v.y * s;
        result.z = z + v.z * s;
        result.w = w + v.w * s;
    }
}
```

.. I should have read ALL of the comments instead of stopping 2 or 3 from the end...

of course - 'we don't plan on adding any such intelligent methods; [because people are dumb and wouldn't take advantage of them if we did.]' Great, love the official stance.
Just give us some attributes that the JIT can check.
Give us another namespace that will also compile to intrinsics, one that can be filled with user code.
Mono's JIT checks for the 'Mono.SIMD' namespace in the 'Mono.SIMD' assembly (still fails); I'd imagine the MS JIT does a similar thing, checking for a specific namespace... and perhaps a specific assembly. If it were just checking the namespace, then we could extend it with high-performance methods.

of course - 'we don't plan on adding any such intelligent methods; [because people are dumb and wouldn't take advantage of them if we did.]' Great love the official stance.

If you are responding to my comment, that is not what I intended to convey at all. My point was more of a statement regarding our design guidelines and the meaning conveyed by language features. Passing a parameter by ref signifies to the caller that the argument will likely be modified, which can be confusing in cases like these. On the other hand, most everyone using these APIs is familiar with the pattern as such, and won't actually think that an "Add" method modifies its operands (that would be a bit silly). I was merely stating this because we will need to weigh the potential confusion and reduced API clarity against the obvious performance merits the change has.

The real, concrete barrier to adding these methods is the fact that the JIT needs to recognize them as intrinsics, as it does the other methods in the library. @CarolEidt could say how much work it would be to modify the code gen to recognize these different forms. Presumably most of the logic would remain the same, but it may need to be refactored significantly to accommodate the multiple calling conventions.

but I'd imagine that the MS JIT does a similar thing, checking for a specific namespace... and perhaps in a specific assembly

It is checking the assembly (simd.cpp). To alleviate the problem of ref in System.Numerics, I would not mind having a different kind of SIMD-compatible type detection, not based on the assembly (e.g. attributes defined at the assembly level would be quite fast to check).

Passing a parameter by ref signifies to the caller that the argument will likely be modified, which can be confusing in cases like these.

A reason why I would love to see ref readonly! :wink:

But as I mentioned earlier, the problem with adding ref methods with the same method names is that VB won't accept them, and that's the annoying blocker (but I would not mind if VB didn't get the ref methods and its compiler just produced a warning instead of an error).

I've not disassembled post-JIT to confirm what actually happens, but how would ref interact with the variables being passed as XMM register values rather than on the stack?

On the other hand, most everyone using these API's is familiar with the pattern as such, and won't actually think that an "Add" method modifies its operands (that would be a bit silly). I was merely stating this because we will need to justify the potential confusion and reduced API clarity against the obvious performance merits the change has.

Isn't 'out' a pretty obvious signal that the operand is modified?
'ref readonly' would be nice, but the issue would then be 'why isn't this getting modified when I think it should'... and there's the danger, as with a cross product, which in typical code is unsafe to do like
v1.cross( ref v2, out v1 ); and there's no protection one can provide against that. Or when multiplying a matrix in place.
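The aliasing hazard is easy to demonstrate with a naive by-ref cross product (illustrative code, not a real System.Numerics API):

```csharp
using System.Numerics;

static class Aliasing
{
    // Naive hypothetical by-ref cross product: it writes into result while
    // still reading its inputs, so it breaks when result aliases an input.
    public static void CrossUnsafe(ref Vector3 a, ref Vector3 b, out Vector3 result)
    {
        result.X = a.Y * b.Z - a.Z * b.Y; // if result aliases a, a.X is clobbered here...
        result.Y = a.Z * b.X - a.X * b.Z; // ...and this line then reads the wrong a.X
        result.Z = a.X * b.Y - a.Y * b.X;
    }
}
```

With distinct arguments this matches Vector3.Cross; calling it as `CrossUnsafe(ref v, ref w, out v)` silently produces garbage, and no API design can prevent that.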

I've not disassembled post-JIT to confirm what actually happens, but how would ref interact with the variables being passed as XMM register values rather than on the stack?

They're passed on the stack; it's not until they hit the internal math classes that the values get loaded into XMM registers (unless the methods are AggressiveInlining'd, in which case it would just load from memory into the XMM registers anyway).

As @benaadams points out, we can improve this for Vector by passing them directly in the SIMD registers (#2959). However, for types that may be used as interop types (Vector2, Vector3, Vector4) this would be somewhat more complex, because we would have to pass them differently for managed and native calls.

Ideally we should not count on the JIT to do things like this (and basically today we can't as we have to deal with many different JIT/codegen implems out there from mobile Android, iOS, Windows...etc.)

:+1: This is soo true.

Besides being optimized with SIMD, System.Numerics.Vectors is also THE ideal communication contract for graphics, games, and image-processing libraries. These libraries want to be both fast and portable to platforms other than CoreCLR (Android and iOS being great examples).

Even if CoreCLR implements these optimizations in the foreseeable future, those other runtimes are not likely to keep pace. Not adding these overloads would also block folks from picking System.Numerics.Vectors from a portability perspective.

AFAIK, OpenTK and SharpDX are pretty much the open-source foundation of the .NET 3D graphics and games ecosystem (besides the wrappers around native libraries; :sob: farewell, XNA). If they are blocked, libraries on top of them are less likely to adopt System.Numerics.Vectors.

FYI. In MonoGame we've been discussing adopting System.Numerics.Vectors (https://github.com/mono/MonoGame/issues/3781), but the huge performance loss in matrix math from unnecessary structure copies has us waiting until those issues are resolved.

We support many different versions of .NET and Mono which won't have any of these new JIT optimizations. We will stick to our versions which mimic XNA's math types for now.

After a bit more discussion, we've decided that it's a good idea to add these overloads. While it would be nice to rely on great code-gen everywhere, the reality is that we are going to be supporting a wide variety of runtimes for the foreseeable future, some of which will be lagging behind others, and won't be able to apply the advanced copy-elimination optimizations that would obviate the need for these overloads. In RyuJIT, we're hoping that we will be able to recognize both "versions" of these methods without a ton of extra logic, as the basic shape of the functions is still the same.

More concretely, I think we should provide one extra overload for each existing overload, with the following differences:

  • Return types are always converted to an out parameter.
  • All parameters whose size is greater than a pointer are converted to a ref parameter. All others are kept as-is. In our case, float parameters would continue to be passed by-value, but others would be passed by-ref.
  • Some methods already have an out parameter. I propose we still do the above, even though some other libraries seem to ignore these cases. For example:

```C#
public static bool Invert(Matrix4x4 m, out Matrix4x4 inverted)
```
becomes
```C#
public static void Invert(ref Matrix4x4 m, out bool succeeded, out Matrix4x4 inverted)
```

Another note: Since this library is implemented by System.Numerics.dll on the .NET Framework, support for the new members will only be available in the next major release of .NET Framework. No way around that, unfortunately. That would place some version support restrictions on libraries and frameworks depending on the library. For example, if Monogame wanted to use the new overloads, you guys might realistically want to consider waiting for a major version bump (or whatever the equivalent might be), since it will be an API breaking change and will require a higher .NET Framework version.

I've got a prototype version implemented in this branch. The proposed additions can be seen in this commit, see the first file changed.

Tagging a few folks interested in the discussion: @EmptyKeys , @tomspilman , @xoofx , @benaadams , @CarolEidt , @terrajobst

I don't know that EVERY function's return should be converted to an out parameter. A return of a simple type (float/double/bool) should just be returned as normal... It's slightly more work to store into a memory address than to just keep the result in a register and return it (and to fetch it back out of memory for a one-shot use).

```C#
public static bool Invert(Matrix4x4 m, out Matrix4x4 inverted)
```
becomes
```C#
public static void Invert(ref Matrix4x4 m, out bool succeeded, out Matrix4x4 inverted)
```

I wish we could avoid this; it's so cumbersome to use... The returned bool is just fine, and easier for the end user of the function. The main problem in this particular case for VB is that this Invert method would conflict with an Invert taking ref; I'm not sure how you're going to deal with that limitation...

Probably an API that would lead to confusion, but I will suggest it anyway :grin:

You could reduce the parameters:

```C#
public static bool Invert(ref Matrix4x4 matrixToInvert)
```

You can always pre-copy for a new matrix:

```C#
var newMatrix = oldMatrix;
Invert(ref newMatrix);
```

The main problem in this particular case for VB is that the Invert method would conflict with a Invert with ref, not sure how you gonna deal with this limitation...

Yeah, unfortunately I think this is a blocker :disappointed:. If one overload or the other were still usable, it would probably be okay. But in this case, no version of Invert would be usable from VB. @terrajobst Have we made any similar decisions elsewhere?

public static bool Invert(ref Matrix4x4 matrixToInvert)

It's simple and clean, but I just think it will be too confusing and error-prone to have this single method modify its ref-parameter when none of the others do...

Another option would be to rename the by-ref overload so it didn't name-clash, but that would probably be equally ugly.

I would think that we would want to treat scalar return values (including bool as well as scalar float) equivalently to the scalar parameters, i.e. leave them as-is:

```C#
public static bool Invert(ref Matrix4x4 matrixToInvert, out Matrix4x4 inverted)
```

Question: you have to declare the out param before use; does that declaration of the struct get jitted out?
E.g. would

```C#
Matrix4x4 inverted;
Invert(ref matrixToInvert, out inverted);
```

be equivalent to:

```C#
Matrix4x4 inverted = new Matrix4x4();
Invert(ref matrixToInvert, out inverted);
```

where Invert itself effectively does:

```C#
{
    inverted = new Matrix4x4();
    ....
}
```

Would it be better for the second param to also be ref, to avoid a double ctor/allocate/copy?

I would think that we would want to treat scalar return values (including bool as well as scalar float) equivalently to the scalar parameters, i.e. leave them as-is:

```C#
public static bool Invert(ref Matrix4x4 matrixToInvert, out Matrix4x4 inverted)
```

The main problem with this (and why I avoided it) is that it's not CLS-compliant to have a method that is the same as another and differs only by a ref or out parameter in its signature. You won't be able to call either of the overloads from VB, for example; you get a compiler error like "Two methods with the name 'Invert' exist."

Same thing with Vector4.Dot. We can't have a method float Vector4.Dot(ref Vector4 v1, ref Vector4 v2) because there's already a function taking (Vector4, Vector4) and returning a float.
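For reference, C# itself happily accepts such a pair; it is only the CLS rule (and hence VB) that rejects it. A sketch with illustrative wrappers around the real Vector4.Dot:

```csharp
using System.Numerics;

static class ClsIssue
{
    // C# allows overloads that differ only by ref on a parameter, but this
    // pair is not CLS-compliant, and VB refuses to bind calls to either one.
    public static float Dot(Vector4 a, Vector4 b) => Vector4.Dot(a, b);
    public static float Dot(ref Vector4 a, ref Vector4 b) => Vector4.Dot(a, b);
}
```

From C#, overload resolution picks whichever form the call site's `ref` keywords select; from VB, the ambiguity is a compile error.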

Ah, I forgot that wrinkle. That is unfortunate, but I agree that it would be a bad idea to have just one or two methods that modify their ref parameter; keeping the ref overloads as pure "in" parameters is much cleaner.

On Mon, Jul 11, 2016 at 4:04 PM, Eric Mellino wrote:

public static bool Invert(ref Matrix4x4 matrixToInvert)

It's simple and clean, but I just think it will be too confusing and
error-prone to have this single method modify its ref-parameter when none
of the others do...

Another option would be to rename the by-ref overload so it didn't
name-clash , but that would probably be equally ugly.

invert on a vector does? normalize? Oh I guess vectors are just math
operators so operator - and ... operator * ( 1/v.length ) then...


Edit: I later rethought this, and my complaint was that the default constructor was called anyway... and being a struct, I guess that's all it can have anyway. So the two lines are essentially equivalent.

On Mon, Jul 11, 2016 at 4:19 PM, Ben Adams wrote:

Question: you have to declare the out param before use, does that
declaration of struct get jitted out?
e.g. would

Matrix4x4 inverted;
Invert(ref matrixToInvert, out inverted)

be equivalent to:

Matrix4x4 inverted = new Matrix4x4();
Invert(ref matrixToInvert, out inverted)

It is not equivalent. The first will allocate space on the stack and not
call the initializer. The 'out' qualifier enforces that the value will be
fully initialized before the method returns, so that's as good as an alternate
constructor.

The second calls the constructor, stuffing in defaults, then overwrites them
with the next call.

(At least this is the behavior demonstrated by the Mono and .NET runtimes when
viewed disassembled. The documentation probably says 'in a way' or 'undefined'
such that it might be changed later.)
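The two caller-side forms being compared can be sketched like this (names are illustrative; `Fill` stands in for any out-returning method):

```csharp
using System.Numerics;

static class OutInit
{
    static void Fill(out Matrix4x4 m) => m = Matrix4x4.Identity;

    public static (Matrix4x4, Matrix4x4) Both()
    {
        Matrix4x4 a;                     // no initialization: the `out` contract
        Fill(out a);                     // guarantees assignment before use

        Matrix4x4 b = new Matrix4x4();   // explicit default init, immediately
        Fill(out b);                     // overwritten (a dead store the JIT may elide)
        return (a, b);
    }
}
```

Both paths observably produce the same value; the question above is only whether the dead default-initialization store survives in the generated code.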


In the interest of getting some more concrete data here, I'm running a small benchmark to try to gauge the perf difference on various runtimes. Below is the code I ran, for 1 million iterations each.

```C#
// By-Value
{
    Matrix4x4 m1 = Matrix4x4.Identity;
    Matrix4x4 m2 = Matrix4x4.CreateScale(5, 6, 7);
    Matrix4x4 m3 = Matrix4x4.CreateTranslation(-10, -12, -14);
    Matrix4x4 m4 = Matrix4x4.CreateFromYawPitchRoll(1.5f, 2.5f, 3.5f);
    result = Matrix4x4.Multiply(m1, m2);
    result = Matrix4x4.Multiply(result, m3);
    result = Matrix4x4.Multiply(result, m4);
}

// By-Ref
{
    Matrix4x4 m1 = Matrix4x4.Identity;
    Matrix4x4 m2; Matrix4x4.CreateScale(5, 6, 7, out m2);
    Matrix4x4 m3; Matrix4x4.CreateTranslation(-10, -12, -14, out m3);
    Matrix4x4 m4; Matrix4x4.CreateFromYawPitchRoll(1.5f, 2.5f, 3.5f, out m4);
    Matrix4x4.Multiply(ref m1, ref m2, out result);
    Matrix4x4.Multiply(ref result, ref m3, out result);
    Matrix4x4.Multiply(ref result, ref m4, out result);
}
```

By-Value time / By-Ref time

32-bit CLR: 1.286 (28.6% slower)
64-bit CLR: 1.168 (16.8% slower)

64-bit CoreCLR (Windows): 1.160 (16% slower)
64-bit CoreCLR (Ubuntu): 1.117 (11.7% slower)

64-bit Mono (Mono JIT Compiler 3.2.8) (Ubuntu): 1.156 (15.6% slower)

32-Bit .NET Native (Win10 UWP): 1.051 (5.1% slower)
64-Bit .NET Native (Win10 UWP): 1.048 (4.8% slower)

Do folks have any particular runtime/environment where they have experienced struct copying to be significantly slower, i.e. around the speed of the 32-bit results above? Other runtimes are consistently worse, but not by a massive amount. Interestingly, .NET Native times are almost completely identical for these cases, and are also very fast in general. I don't have any iOS or Android devices to test on at the moment; I'd be interested in repeating the same there. It may also be interesting to try this out in the Mono runtime Unity uses. I expect we may see something similar to the 32-bit CLR above.
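The harness behind those numbers isn't shown above; a minimal sketch of how such a comparison might be timed (the `Bench.Time` helper is illustrative, not the actual benchmark code):

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

static class Bench
{
    // Run the body once to warm up/JIT, then time `iterations` repetitions.
    public static double Time(Action body, int iterations)
    {
        body();                                   // warm-up pass
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) body();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;      // total, not per-iteration
    }
}
```

Dividing the by-value total by the by-ref total gives ratios like the table above; a real comparison would also pin the process to a core and repeat runs to reduce noise.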

Do folks have any particular runtime/environment where they have experienced struct copying to be significantly slower,

I'm pretty sure my cases were all with the 32bit CLR. Even a 16% hit is a bit concerning for games where you want the best possible performance.

The main problem with this (and why I avoided it) is because it's not CLS compliant to have a method that is the same as another which only differs by a ref or out parameter in its signature.

Sounds like warting the name of the function in some way is the last best option here:

```C#
public static bool InvertRef(ref Matrix4x4 matrixToInvert, out Matrix4x4 inverted)
```

Not pretty, but an acceptable trade for a 15% to 28% performance improvement.

And once the JIT/AOT fixes are in place and have made it out to most runtimes, the overloads could be deprecated and removed.

Maybe this thread can accommodate this as well...

Can we have additional matrix methods, like rotate( axis1, axis2, angle ), where axis1 and axis2 are enumerations for 'forward', 'right' and 'up', or whatever the native directions of your matrices are?


Okay that's too vague, right?


I have a RotateRelative( x, y, z ) method that internally has a tick counter that goes from 0 to 5 for the 6 combinations of roll, pitch and yaw... or rotation about primary axis... This calls a more general function that does rotate( x_axis, /*to*/ y_axis, angle ) to roll left(or right depending on direction of your axis I guess, or -angle which is just a float)
rotate pitch up (pulling back on a airplane yoke)...
rotate( y_axis, z_axis, angle ) /* rotates pitch down in a positive direction. */

These are simply sin(angle) and cos(angle) applied to two rows of the matrix. (Unfortunately, I think your matrices are column-major, so the rows can't be treated as the direction vectors of the matrix... to apply forward/backward translation as scale(forward, x) to move that far in space? Getting too abstract again?)
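The "sin/cos applied to two rows" idea can be sketched like this (a minimal illustration assuming row-major Matrix4x4 rows act as basis vectors; `RelativeRotation` and `RollInPlace` are hypothetical names):

```C#
using System;
using System.Numerics;

static class RelativeRotation
{
    // Hypothetical rotate( x_axis, /*to*/ y_axis, angle ): rotates the X
    // basis row toward the Y basis row by 'angle' (a roll), touching only
    // those two rows of the matrix.
    public static void RollInPlace(ref Matrix4x4 m, float angle)
    {
        float s = MathF.Sin(angle), c = MathF.Cos(angle);
        var x = new Vector4(m.M11, m.M12, m.M13, m.M14);
        var y = new Vector4(m.M21, m.M22, m.M23, m.M24);
        var nx = x * c + y * s;   // new X basis
        var ny = y * c - x * s;   // new Y basis
        m.M11 = nx.X; m.M12 = nx.Y; m.M13 = nx.Z; m.M14 = nx.W;
        m.M21 = ny.X; m.M22 = ny.Y; m.M23 = ny.Z; m.M24 = ny.W;
    }
}
```

Pitch and yaw are the same operation applied to the other two row pairs.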


A third method I think already exists: rotating about an arbitrary axis by an angle?


Experimentation has shown that in small increments (< 90 degrees/sec in 60ths of that interval, i.e. 90 degrees over 60 frames) it's not too inaccurate to just use the same order (roll, pitch, yaw) to get a resulting transformation. But I do prefer to just keep my matrix as the orientation representation, with an offset that works as the origin of that matrix... an inverse application of which can work as a camera, which is a rare computation compared to other relative-motion calculations...

https://docs.google.com/document/d/1_6JdZ0VplMFpBeR3QOcV5vhlOGYyn52T4-fITgI6lbc

I jotted that down a bit ago.

https://d3x0r.org:444/javascript/test3d/ (self-signed certificate, beware; I'm spreading knowledge, please login and save the sources yourself) ... I did update my fork of three.js to have the above-mentioned methods.

There's a few buttons at the top, Mode 1, Mode 2 and Mode 3; and Controls 1 and Controls 2.
Mode 1 and Controls 1 are the default. Controls 1: click and drag moves like a first-person camera. Controls 2: click and move is an orbit camera (not really important; no motion really required).

Mode 1 - the top left cube is the reference relative-rotation cube, stepping through each of the variations... The bottom 6 cubes have their relative rotation always applied in the same order (the order specified by the text over them)... The 3 cubes above the bottom 6 are single-rotation references for directionality... It turns 90 degrees and then pauses.

Mode 1 - the lower 6 are slow motion, as if rotated sequentially 90 degrees in each of its directions, in the various combinations. The three rotations happen, then a pause showing them completed, 90 degrees each way.

Mode 2 - always applies the rotations in the same order but in smaller steps, so the divergences appear sooner; some of these are faster than the ideal (top left) and some are slower... I'm not sure exactly why certain combinations of three come out the same.

Mode 3 - the lower 6 cubes are based on building a quaternion using roll, pitch and yaw, multiplying in order (if the text is yaw, pitch, roll, the multiplication is ((yaw * pitch) * roll)). This is saved as a target, and then a slerp is used between the initial orientation and the computed target... so especially with larger angles applied, the resulting computation is not the same... demonstrating that quaternions have similar problems... and ideally they should also be muxed through an iteration of applied orderings.

It is not equivalent. The first will allocate space on the stack and not call the initializer; the 'out' qualifier enforces that the value is fully initialized before returning, so it's as good as an alternate constructor.

@d3x0r To be clearer, what I mean is: with an out param, is the process

  1. Allocate stack (64 bytes)
  2. Zero out memory (64 bytes)
  3. Call function
  4. Allocate stack (64 bytes)
  5. Zero out memory (64 bytes)
  6. Do work...
  7. Copy from function stack, to caller stack (64 bytes)

Or is it already optimised to be the equivalent of a ref param?

  1. Allocate stack (64 bytes)
  2. Zero out memory (64 bytes)
  3. Call function, passing ref to caller stack
  4. Do work...
  5. No copy as caller struct directly modified
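For reference, the two shapes being compared look roughly like this (illustrative names; whether the copy in the by-value version is actually elided is exactly the JIT question above):

```C#
using System.Numerics;

static class CallShapes
{
    // By-value return: the Matrix4x4 result is built in the callee and then
    // copied back to the caller's storage, unless the JIT elides the copy.
    public static Matrix4x4 ScaleByValue(float s) => Matrix4x4.CreateScale(s);

    // 'out' parameter: the callee writes through a reference into storage the
    // caller already owns. 'out' also guarantees definite assignment, so the
    // struct is fully initialized before the method returns.
    public static void ScaleByOut(float s, out Matrix4x4 result)
        => result = Matrix4x4.CreateScale(s);
}
```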

> Maybe this thread can accommodate this also...

I appreciate the discussion, but the issue here is for a specific set of functionality. Could you open a separate issue for a different feature request/discussion?

I nabbed an ARM Android tablet from the office and ran a few additional tests on it from a blank Xamarin application. I ran the same tests with BenchmarkDotNet on my workstation on CoreCLR and (.NET Framework) CLR.

Code for the benchmarks used is here: https://gist.github.com/mellinoe/5af3ae7db013a6ab740fc8b80910b1b8. It's just a smattering of random small sequences of vector operations.
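For context, the by-value vs by-ref pairs being measured have roughly this shape (a simplified sketch with illustrative names, not the actual gist code):

```C#
using System.Numerics;

static class BenchShapes
{
    // By-value: each call copies two 64-byte Matrix4x4 arguments in and
    // one result out.
    public static Matrix4x4 MultiplyByValue(Matrix4x4 a, Matrix4x4 b) => a * b;

    // By-ref: only references cross the call boundary; the product is
    // written directly into caller-owned storage.
    public static void MultiplyByRef(ref Matrix4x4 a, ref Matrix4x4 b, out Matrix4x4 result)
        => result = a * b;
}
```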

Android

| Method | Duration (10M iterations) |
| --- | --- |
| Matrix4x4ByValue | 164.921 s |
| Matrix4x4ByRef | 32.633 s |
| QuaternionByValue | 36.501 s |
| QuaternionByRef | 34.081 s |
| Vector3ByValue | 17.204 s |
| Vector3ByRef | 16.870 s |
| Vector3SimpleAddByValue | 14.835 s |
| Vector3SimpleAddByRef | 14.874 s |

CoreCLR (win10-x64)

| Method | Median | StdDev |
| --- | --- | --- |
| Matrix4x4ByValue | 379.3958 ns | 6.3489 ns |
| Matrix4x4ByRef | 326.1952 ns | 3.1910 ns |
| QuaternionByValue | 549.0591 ns | 2.3245 ns |
| QuaternionByRef | 527.9179 ns | 4.6265 ns |
| Vector3ByValue | 14.3423 ns | 0.2028 ns |
| Vector3ByRef | 0.8215 ns | 0.1003 ns |
| Vector3SimpleAddByValue | 1.8959 ns | 0.1438 ns |
| Vector3SimpleAddByRef | 1.1114 ns | 0.1023 ns |

CLR (x86 .NET Framework)

| Method | Median | StdDev |
| --- | --- | --- |
| Matrix4x4ByValue | 440.7572 ns | 19.3525 ns |
| Matrix4x4ByRef | 344.5661 ns | 6.2860 ns |
| QuaternionByValue | 571.1591 ns | 16.0463 ns |
| QuaternionByRef | 565.8550 ns | 14.0463 ns |
| Vector3ByValue | 36.9106 ns | 1.1481 ns |
| Vector3ByRef | 4.7637 ns | 0.2553 ns |
| Vector3SimpleAddByValue | 0.8344 ns | 0.1070 ns |
| Vector3SimpleAddByRef | 0.5420 ns | 0.0814 ns |

The above data isn't incredibly robust or revealing, but there are some obvious trends. The Android numbers for by-value and by-ref are pretty competitive, but performance seems to drop off a cliff with the largest struct: passing it by value is over 5x slower than by-ref :frowning: . Other cases show about a 0-5% difference, generally favoring by-ref. Other than for Matrix4x4, the performance numbers are not terribly different according to my tests.

The Windows numbers are pretty similar to the ones I posted yesterday, although there's some weirdness with the Vector3ByValue/ByRef test case. Apparently that one is 17x faster in the by-ref version :smile: .

I'd really only want to say that the Matrix4x4 tests above show anything interesting. Passing matrices around by value does seem to be a genuine problem, especially on certain platforms. Even on x64 RyuJIT, there's still a not-insignificant difference compared to by-ref (~15-18%).

I also suspect that there might be some more degenerate cases that aren't really caught by these small test cases. If you have a bunch of Quaternions, Vectors, Matrices, etc. stored around in different objects and are combining them in different phases in different functions, that might exacerbate the problems from excessive struct-copying more than these test cases reveal.

> I'd really only want to say that the Matrix4x4 tests above show anything interesting.

I agree. I think if matrices bigger than 2x2 were the only ones to get by-ref methods, that could be considered reasonable.

Still, by-ref seems to always be faster than by-value for structures larger than the primitive types. It would be really good to eventually see the various JITs and AOTs optimize for this to get that last bit of performance.

Stepping back, would ref returns change the proposed APIs?

Possibly, if they were hypothetically available today. Are you familiar with how they will be handled in VB? I expect we might hit the same issues with overload resolution, since all of the overloads would be identical except the parameters would be passed by reference.

Yeah just wondering if there is a second overload resolution clash on the horizon...

We should be able to expose these in only .NET Core for the current time, and add them to .NET Framework later.

@mellinoe Could you say when exactly you will provide the new overloads for .NET Framework, and will SIMD be available for matrices? For now it seems that only Vector uses SIMD, doesn't it?

@mellinoe did it go through API review?

@karelz Yes, it did, and I have a branch with the changes. I need to set aside some time to finish it up and push it through a code review. We should do this for .NET Core 1.2.

Thanks for the explanation!

@mellinoe is this still necessary for 2.0? If so, is it still feasible for 2.0 given your other commitments?
If the answer to either is no, please set it to Future.

We really should not postpone this past 2.0, IMO. I should be able to get to it soon.

Unfortunately, this isn't going to make it into 2.0. Moving this out for now.

Just to share my interest in this topic: it's a shame it's moving to 2.1. I'm also one of those people who has written a .NET maths library using by-ref, for performance reasons.

However, I'm not sure where else to bring this up, but the issue of performance as well as API design is important here, and the topic crops up in numerous issues across the CLR and CoreFX.
How we go about getting efficient SIMD instructions for graphics/games/etc. might come down to exposing intrinsics themselves, or to library writers wrapping System.Numerics.Vectors, though, as others have mentioned, only as long as there is no cost involved.

For reference this has been brought up in https://github.com/dotnet/corefx/issues/17214 and https://github.com/dotnet/coreclr/issues/6906 although the latter is more generally exposing intrinsics.

https://github.com/dotnet/coreclr/issues/4356 talks about the issue raised above regarding Matrix/Quaternions not using SIMD intrinsics, but the last comment is interesting in terms of the API design which might influence what is done here.

If I'm missing some discussion somewhere, or this is the wrong place to post, then I apologise; but given that I tracked down a variety of issues which I feel are all related to the same goal of high-performance vector math, I wanted to make sure others could see the discussions and that any API design changes were considered in the broader scope.

In regards to this issue explicitly: if we assume the API will still expose matrix/quat types, it's probably still an improvement to add the ref overloads. Given this doesn't affect any JIT work, is this just a case of adding the overloads? I can probably put up a PR if desired, as it's very similar to what I'm doing already in my own lib.

This change was rejected in PR. See https://github.com/dotnet/corefx/pull/25388 and its discussion for reasoning.

Instead, the plan is to fix the cases where the JIT should generate better code using the existing by value methods.

For example, the first case identified is being tracked by: https://github.com/dotnet/coreclr/issues/15467.

Closing this issue since the change was not approved.
