Runtime: Matrix4x4 Changes - Remove/Move Matrix4x4.CreateWorld, CreateBillboard, etc.

Created on 17 Aug 2016 · 25 comments · Source: dotnet/runtime

There are a few things that we would like to see changed in Matrix4x4 that might make us and others more likely to adopt them in the near future.

There seem to be several methods inside System.Numerics.Vectors.Matrix4x4 that should be removed or placed in a utility library as extension methods. Many Matrix4x4.Create* methods impose conventions on the users that they may not be able to accept. For example, Matrix4x4.CreateWorld() seems to negate the passed in forward axis and store it that way. For those using a right-handed x-right, y-forward, z-up coordinate system, this is very annoying for debugging purposes (and incorrect). Similar issues apply to other Create functions and there are several that would never be used (CreateShadow, etc.).

This API would be much easier for us to adopt if it were trimmed down to a minimal (required) set of functions, with other methods added as extensions that can be _left out_, so that those who do things differently or don't need them can still get some SIMD benefits under C#.

Also in regards to SIMD benefits, Matrix4x4 does not seem to have any. I'm not clear as to why this was not designed to contain 4 Vector4 values, but if it was, it is considerably more likely that _required_ operations such as Vector * Matrix would see SIMD benefits without having to add more/unneeded intrinsics to the JIT compiler.

Labels: Design Discussion, area-System.Numerics, question

All 25 comments

Our compatibility guarantees don't allow us to remove functions that are already part of the public surface area, so unfortunately, we can't even consider that option.

Correct me if I'm wrong, but the main issue here seems to be that the static factory methods all use a right-handed coordinate system with Forward=-Z and Up=+Y. We went back and forth on this assumption, considered having multiple versions for different systems, considered only using left-handed, etc. Eventually we decided to have only right-handed constructors. A key point here is that most of the constructors are the same as the versions used in XNA, and those also assumed a right-handed system.

I'm not clear as to why this was not designed to contain 4 Vector4 values, but if it was, it is considerably more likely that required operations such as Vector * Matrix would see SIMD benefits

I'm not so sure about this, but I would be interested to see some experiments around it. We can't change the surface shape of the type (it has 16 public float fields), but we could perhaps do some trickery with overlapped fields. I could see some operations benefitting, like operator +,-,*, but these aren't interesting operations, really. Vector4.Transform(Vector4, Matrix4x4) isn't likely to get any benefit, I think; it doesn't do row->row operations, so I doubt anything would improve without the JIT intervening. I would also expect that adding in esoteric optimizations in this manner could harm perf on non-SIMD-enabled platforms, and we need to support the library on many of those. What we'd really like is true JIT recognition of the hot-path functions themselves.

We went back on forth on this assumption, considered having multiple versions for different systems, considered only using left-handed, etc.

This is exactly why something so user-centric should _not_ be part of a System library in my opinion. All that really needs to exist in the Vector library is a thin layer over common SIMD operations (mul, div, add, sub, dot, muladd, some component comparison operations, and possibly a cross product).

Trying to be more than that is just asking for people to be unable to use the library. Certainly additional extension methods could have been added on top of this for a separate XNA style library, but that's really up to the engine/etc developer.

Our previous engine used a right-handed, x-right, y-forward, z-up convention for 15 years. This is not something we have interest in being forced to change simply to get SIMD optimizations from a System level library.

I could see some operations benefitting, like operator +,-,*, but these aren't interesting operations, really. Vector4.Transform(Vector4, Matrix4x4) isn't likely to get any benefit, I think; it doesn't do row->row operations.

This may be true given how this library is written, but it is not the case for ours: our previous engine was written to perform Vector3 * Transform operations using _madd_ ops. Again, this is impossible to do with _System.Numerics.Vector_ because the classes are higher level than they should be for most engine developers. Properly written, we have seen a significant improvement using SIMD while still maintaining highly optimized code for non-SIMD platforms.

This is exactly why something so user-centric should not be part of a System library in my opinion.

I can appreciate that perspective. There were a few goals for this library, XNA compatibility being one major priority. We put quite a bit of deliberation into this point, and we knew we couldn't please everyone with any single set of constructors. I do think it is important to have them in the library itself, though, as another main goal of the library was usability and discoverability, both of which are hurt by moving them outside the type.

Do these constructors really prevent you from adopting the library? You can simply provide your own functions which are appropriate for your coordinate system. I don't think this is a very unique problem; most libraries out there make an assumption about handedness, and only a few have multiple versions.

Again, this is impossible to do with System.Numerics.Vector because the classes are higher level than they should be for most engine developers.

Could you be a bit more specific about where you think the design of the types is preventing optimal code from being written? I can understand how you think the point above brings in too high-level of an assumption into the system, but it seems like a separate concern from how individual operations are implemented. We're very interested in making the library as fast as possible, under the constraints I described in my first response, namely:

  • Public surface area can't be removed and/or incompatibly changed (including public fields).
  • Code can't have degenerate performance without SIMD support.
  • Special JIT recognition is possible, but resource-constrained.

Do these constructors really prevent you from adopting the library?

Yes, it would certainly not be clean or correct for us to return a Matrix4x4 that provides functions that could easily be used incorrectly and, as I mentioned, it is more frustrating to use for debugging.

I don't think this is a very unique problem; most libraries out there make an assumption about handedness, and only a few have multiple versions.

The problem I have is this library is trying to be XNA and there is no other lower-level access to perform SIMD operations. I really don't mind what this library does as long as we have the ability to write our _own_ Vector3/Matrix/etc classes the way we've been doing for so long and have a SIMD type that could be used to obtain SIMD performance. I'm certainly not asking for multiple versions of functions on the same type as this clearly confuses people. Just wanting the ability to write our own with SIMD operations.

I don't think that's an unreasonable request. If C++ tried to add "standard" Vector/Matrix types to STL and that was the only way to access SIMD operations, I think more than a few people would have problems with that. Some people would like to use SIMD in ways other than a game-focused library, so it would be nice if that was supported as well. There are also those that are okay with targeting platforms that only support SIMD operations.

So what would be nice to see are simple SIMD types (Simd.Vector4Single, Simd.Vector4Double) that expose commonly required/supported (engine agnostic) operations as intrinsics. It should be a much smaller library as only Vector4 really needs implementing.

@Ziflin I suggested something along that line in https://github.com/dotnet/corefx/issues/1168#issuecomment-212540494 ... It seems that SIMD support should not be built on top of System.Numerics but the other way around: have SIMD support as a baked-in intrinsics library / JIT morphers, and then build System.Numerics on top of it.

Yes, I should have worded my issue better I suppose. I completely agree and was basically hoping that this is what System.Numerics was. Maybe a System.Numerics.Simd or something along those lines. It just seemed really odd to have something in _System_ that dictates things that are clearly different on a variety of existing game engines and 3D products (3DSMax, Maya, etc.).

Would love to see this support added so that we could see similar SIMD benefits in C# as we did in our C++ engines.

/cc @terrajobst @CarolEidt

Having SIMD support as a baked-in intrinsics library / JIT morphers and then building System.Numerics on top of it.

@redknightlois Could you be a bit more specific on how you'd expect that to be built? Currently, the JIT recognizes the SIMD types by name, and then recognizes individual intrinsic operations by matching method names. I think what you are suggesting (and please correct me if I'm wrong) is that this name recognition could be moved to a lower level struct (Simd.Vector4Single or whatever), and then System.Numerics.Vector4 would be built on top of it. And somebody's right-handed types could be built on top of Simd.Vector4Single "in parallel".

Another question: Is building a wrapper type with its own conventions, constructors, etc. on top of System.Numerics.Vector4 a palatable solution? It's roughly equivalent to the above, except the underlying vector has the undesirable policy baked into it rather than being an agnostic low-level type. But that policy could be hidden in the wrapping struct such that there is no usage confusion when dealing with your library.

It's my understanding that single-field wrapper structs are very well-optimized and should be more or less identical to using the wrapped struct directly. If my first paragraph was in-line with what you were thinking, then we would be relying on this property regardless.

@mellinoe Sure, let me try to elaborate (difficult topic, so mistakes will be made).

The problem is that Vector4 already imposes many semantic restrictions. From the point of view of SIMD operations, we are just operating on registers of varying sizes depending on the operation and the underlying data types (as unsafe as it gets). For example, some AVX instructions work on 512-bit registers while others work on 256-bit ones. For those operations Vector4 won't cut it anymore; now you need Vector8 or Vector16. And things can get even messier: there are also operations that understand the register as 16 bytes.

For example on SSE2 the usage of the XMM registers can be:

  • two 64-bit double-precision floating point numbers or
  • two 64-bit integers or
  • four 32-bit integers or
  • eight 16-bit short integers or
  • sixteen 8-bit bytes or characters.

But it can be even worse, because implementing more than a few algorithms out there requires treating those registers as 16x8-bit in one instruction and as 4x32-bit in the very next one. Packing and shuffling instructions are the usual suspects. Incidentally, they have not surfaced in System.Numerics, which I bet is for a good 'design' reason ;)

And things could get nastier: popcnt (the base for many advanced data-structure algorithms) works only on a 1x32-bit register (a single int).

There are a few ways this can be solved. The easiest would be to have a somewhat-difficult-to-access but available set of JIT intrinsic instructions built on top of a Register ("all safeguards are off" kind of type), and then build System.Numerics on top of that. This is how C compilers and their intrinsics libraries, which map to the actual instructions, do it (not pretty, but damn effective).

The other (not easy at all) is to teach the JIT to understand isolated patterns of operations through morphers (with intrinsics only for things that cannot be 'defined' in the language). I built such an example when I wrote:

[Vectorize]
public static Vector4<T> MoveLowerToHigher( Vector4<T> a, Vector4<T> b )
{
    Vector4<T> dest;
    dest[0] = a[0];
    dest[1] = a[1];
    dest[2] = b[0];
    dest[3] = b[1];
    return dest;
}

The problem there is that it is very complex and time-consuming, and probably not worth it unless the JIT becomes an autovectorizing compiler. And even then, it probably won't be able to figure out many patterns, for which you will need the intrinsics anyway.

And as you said:

When we have discussed Shuffle operations internally, this was the general roadblock we hit, and never came up with a great solution.

I feel the reason for that is that it probably doesn't exist (I would be thrilled to find myself wrong on this one). So IMHO the problem is that System.Numerics is a very nice OO abstraction, but for most uses we should be dealing with fundamental types and primitives instead.

The caveat is, if we go that way, System.Numerics is no longer the place to do SIMD but just another consumer of the low-level primitives.

EDIT: Fixed popcnt instruction size.

I think we may have gone off on a tangent, so I'll try to clarify some stuff:

Problem is that Vector4 already imposes many semantic restrictions. From the point of view of SIMD operations we are just operating on registers of varying sizes ...

I don't think this is a problem, nor is Vector4 trying to solve these problems. Vector4 is a fixed-size vector on purpose, and is built for fixed-size spatial computations. The fact that you could potentially fit a Vector16 into the same computations as a Vector4 on an AVX512 processor is not really interesting unless you can somehow modify your spatial algorithms to exploit that. Vector<T> is the type that is aiming to address some of the scenarios you are describing.

This particular issue is mainly about the high-level policy that is present in some of the construction methods, and how that makes it difficult to adopt into a system with different policy. The point I was trying to make above was that, even if we implemented System.Numerics.Vector4 on top of a policy-agnostic, SIMD-only structure, it wouldn't really buy you anything over just implementing your alternate policy LeftHanded.Vector4 (or whatever) on top of System.Numerics.Vector4 itself. This is obviously not as conceptually pure as the alternative I described above, but I'm trying to understand if that is an acceptable tradeoff.

@Ziflin

The fact that you could potentially fit a Vector16 into the same computations as a Vector4 on an AVX512 processor is not really interesting unless you can somehow modify your spatial algorithms to exploit that.

@mellinoe that is exactly the point, and why it is so important and far from tangential. System.Numerics is policy aimed at spatial computations (which is a completely fine design decision) but not at SIMD. Spatial calculations are not policy-agnostic, and they shouldn't be by any means. For any bounding box algorithm System.Numerics provides, I can build 2 or 3 different algorithms for generic primitive collisions that can exploit SIMD (even at the tree-search level) that are not going to be supported on System.Numerics and that would even require different spatial representations; but I am digressing too much on the issue here.

It is fine to build a right-handed numerics library to support basic (and probably not-so-basic) primitives, but tying SIMD support to it is IMHO the root of the issue. The evidence is easy to find. Matrices do not end at 4x4, and even spatial calculations under a different policy become quite different. In C++ land we have been building Vector<T> (in the mathematical/spatial sense of vector) for years on end; every engine has its own because it makes sense to make simplifications or enhancements at that level.

Today any library that expects to achieve SIMD calculations (even for basic operations, nothing fancy: just additions, multiplications, conditional select, etc.) will have to live with lots of nonsense because of that. But then we have the rabbit hole (whether anyone wants to enter it is a different question). Stripping numerics down to just the basics will eventually beg the question: is System.Numerics the right level of abstraction for SIMD (at all)?

@mellinoe that is exactly the point, and why it is so important and far from tangential. System.Numerics is policy aimed at spatial computations (which is a completely fine design decision) but not at SIMD. Spatial calculations are not policy-agnostic, and they shouldn't be by any means. For any bounding box algorithm System.Numerics provides, I can build 2 or 3 different algorithms for generic primitive collisions that can exploit SIMD (even at the tree-search level) that are not going to be supported on System.Numerics and that would even require different spatial representations; but I am digressing too much on the issue here.

My point was not that these algorithms are not interesting or useful, but that Vector4 isn't intended for them. And that's okay; Vector4 is a fixed-size type with limited (but extremely common) use cases. We implemented it because we identified an extremely common type that could benefit from SIMD codegen. The Vector<T> struct is closer to what you are describing. We don't have support for other fixed-size types which exploit SIMD for a few simple reasons.

  • It's difficult to identify which specific types are useful as each needs to be recognized separately.
  • It's a lot of work, and we have a lot of improvements in Vector<T> to make.
  • It's unclear that Vector<T> isn't applicable in their place (at least in theory).

Most of System.Numerics.Vectors is policy-agnostic. There are a few standalone constructor methods (on Matrix4x4/Matrix3x2) which make handedness assumptions, but those do not preclude anyone from using a different handedness with the rest of the functions, and they do not affect anyone who doesn't touch them. My understanding of the original issue was that using Matrix4x4 as an exchange type in your other-coordinate-system engine would be problematic because people could accidentally use those factory methods and produce invalid transformations. That part makes sense to me and is a legitimate concern. Again, I'd like to understand whether creating a wrapper struct is a palatable solution for the original issue; I described more about what I meant above.

This warrants a different design discussion because, in the end, it is not a System.Numerics design problem. I've added a separate issue for discussing it.
https://github.com/dotnet/coreclr/issues/6906

This particular issue is mainly about the high-level policy that is present in some of the construction methods, and how that makes it difficult to adopt into a system with different policy. The point I was trying to make above was that, even if we implemented System.Numerics.Vector4 on top of a policy-agnostic, SIMD-only structure, it wouldn't really buy you anything over just implementing your alternate policy LeftHanded.Vector4 (or whatever) on top of System.Numerics.Vector4 itself. This is obviously not as conceptually pure as the alternative I described above, but I'm trying to understand if that is an acceptable tradeoff.

So if I understand you correctly, you're saying to do something like:

// Our 'Vector3':
struct Vector3
{
   System.Numerics.Vector3 v;
}

And this would compile as optimally as using a Numerics.Vector3? Just curious about any hidden overhead that this would have. For instance, would it break any inlining operations?

There are a few useful functions that do not currently exist in Numerics.Vector3/4 that we use. Some are implemented on _Vector_ but not Vector3/4:

  • LessThanAll, LessThanOrEqualAll, EqualsAll, LessThanAny (basically all of the *All and *Any methods)
  • Sign() - returns a +/-1.0 vector of sign values, similar to Math.Sign (used in several bounding volume tests)
  • A MultiplyAdd( a, b, c ) = a * b + c -- now a single SSE op
  • A way to get the _xyz_ values of a Vector4 into a Vector3. (other variations would help). Very useful.
  • Also a way to build a Vector4 from the 4 x (or y or z or w) components of 4 vector4s. Allows us to write our own _Transpose_ operation (can be done a little faster as one operation though).

That's the majority of what we're using now. The last 2 are variations/combinations of _mm_unpackhi/lo and _mm_shuffle and I think a few people have requested at least the first one before.

So if there isn't any additional overhead in wrapping the Numeric types and it was possible to get some of these capabilities added to the existing Vector3/4, then I think it's fine for us to just wrap them. As long as Vector3/4 stay relatively 'clean', then we'd write our own Quaternion/Matrix classes with those.

@mellinoe

Well since @mellinoe suggested wrapping the Numerics.Vector3/etc types with our own types to address this issue, I finally got a chance to do some tests with .Net Core and unfortunately the results are not pretty.

The test is simple, I wrote a _standard_, non-SIMD Vector3 class with float x,y,z; fields and another version that contained a single System.Numerics.Vector3 v; field. I also implemented operator+ for both and both it and the constructors were tagged with [MethodImpl( MethodImplOptions.AggressiveInlining )]. I ran the tests in Release targeting x64. I'm not aware of any additional hints to improve performance.

The test code starts by generating a random vector (using the same seed each run). It then times the addition of result = result + randomVector; 1,000,000 times. It then prints the result and logs the elapsed time.

I recorded the fastest results from 10 runs:

  • Wrapped Numerics.Vector3 Implementation: 170,750 ticks
  • Simple float Implementation: 61,505 ticks
  • Using plain Numerics.Vector3 (no-wrapper): 23,056 ticks.

Ok, clearly this is bad. So either I'm missing something, or trying to wrap Vector3 is a very bad idea performance-wise - it's almost 3x slower than what we've been using and over 7x slower than the improvement we were hoping to see using a SIMD implementation. Looking through the disassembly it definitely appears to be generating much more code with the wrapper than without.

So what is the solution to this? I know this is a bit of a micro-optimization, but if it's going to be this much slower doing just a series of adds, then I really don't see how it's going to improve...

The test code can be found here: https://drive.google.com/open?id=0ByziG43eLtp6SlVwLXByR2dueWc

Nothing in that code hints at why it could be 7x slower; there is no obvious flaw I can see.

Yes, that seems quite bad and looks decidedly slower than the simple, naive version with three fields.

@CarolEidt I know you have been looking into enabling struct optimizations lately. Does this scenario show up on your radar at all, as something that needs to be improved? I would guess that some bad inlining decisions are being made here, or something like that, because the effective logic should be more or less identical across these scenarios.

@Ziflin Would it be possible for you to post some of the disassembly here? It might give @CarolEidt or some of the other code-gen folks a better idea as to what is happening.

@mellinoe Yah, here are some disassembly snippets. These were taken by placing a Debugger.Break() around var c = a + b; in the test code.

Directly Using Numerics.Vector3

00007FFDFC770C6A  movaps      xmm0,xmmword ptr [rsp+30h]  
00007FFDFC770C6F  movaps      xmm1,xmmword ptr [rsp+20h]  
00007FFDFC770C74  movaps      xmm6,xmm0  
00007FFDFC770C77  addps       xmm6,xmm1 

This is nearly optimal code except that I think technically the last movaps is not needed. It should have been able to addps xmm0, xmm1 as neither xmm0 nor xmm1 are used again. That may be more than the optimizer is currently able to do though?

Wrapped Vector3

00007FFDFC790F71  mov         rax,qword ptr [rsp+0A8h]  
00007FFDFC790F79  mov         qword ptr [rsp+78h],rax  
00007FFDFC790F7E  mov         eax,dword ptr [rsp+0B0h]  
00007FFDFC790F85  mov         dword ptr [rsp+80h],eax  
00007FFDFC790F8C  mov         rax,qword ptr [rsp+98h]  
00007FFDFC790F94  mov         qword ptr [rsp+68h],rax  
00007FFDFC790F99  mov         eax,dword ptr [rsp+0A0h]  
00007FFDFC790FA0  mov         dword ptr [rsp+70h],eax  
00007FFDFC790FA4  movss       xmm1,dword ptr [rsp+80h]  
00007FFDFC790FAD  movsd       xmm0,mmword ptr [rsp+78h]  
00007FFDFC790FB3  shufps      xmm0,xmm1,44h  
00007FFDFC790FB7  movss       xmm2,dword ptr [rsp+70h]  
00007FFDFC790FBD  movsd       xmm1,mmword ptr [rsp+68h]  
00007FFDFC790FC3  shufps      xmm1,xmm2,44h  
00007FFDFC790FC7  addps       xmm0,xmm1  
00007FFDFC790FCA  lea         rax,[rsp+58h]  
00007FFDFC790FCF  movsd       mmword ptr [rax],xmm0  
00007FFDFC790FD3  pshufd      xmm1,xmm0,2  
00007FFDFC790FD8  movss       dword ptr [rax+8],xmm1  
00007FFDFC790FDD  mov         rax,qword ptr [rsp+58h]  
00007FFDFC790FE2  mov         qword ptr [rsp+88h],rax  
00007FFDFC790FEA  mov         eax,dword ptr [rsp+60h]  
00007FFDFC790FEE  mov         dword ptr [rsp+90h],eax

This is disassembly for the same var c = a+b; call above. The difference is this uses my Vector3 that wraps/contains a Numerics.Vector3. But it does seem that the struct wrapper is breaking the ability for the optimizer to see that these are really identical operations. I'm definitely confused by the long series of movs at the beginning. I would have thought that it would be able to simply perform the same two movaps ops that the Numerics version did to load the two Vector3s. It looks like the wrapper is causing the JIT to fall back to loading/storing the wrapped SIMD Vector3s to memory.

Simple Vector3 - float x,y,z

00007FFDFC770B65  lea         rax,[rsp+30h]  
00007FFDFC770B6A  movss       xmm0,dword ptr [rax]  
00007FFDFC770B6E  movss       xmm1,dword ptr [rax+4]  
00007FFDFC770B73  movss       xmm2,dword ptr [rax+8]  
00007FFDFC770B78  lea         rax,[rsp+20h]  
00007FFDFC770B7D  movss       xmm3,dword ptr [rax]  
00007FFDFC770B81  movss       xmm4,dword ptr [rax+4]  
00007FFDFC770B86  movss       xmm5,dword ptr [rax+8]  
00007FFDFC770B8B  movaps      xmm6,xmm0  
00007FFDFC770B8E  addss       xmm6,xmm3  
00007FFDFC770B92  movaps      xmm7,xmm1  
00007FFDFC770B95  addss       xmm7,xmm4  
00007FFDFC770B99  movaps      xmm8,xmm2  
00007FFDFC770B9D  addss       xmm8,xmm5

This is the struct Vector3 { float x,y,z; } variant. It is what I'd expect. 6 floating point loads, and 3 scalar adds.

NoInline Issue

One other thing that I tested was to move the "add" into a NoInline method. Oddly doing this made the simple float version 2x faster than either SIMD version. I'm guessing this is from having to convert the SIMD value back into individual floats/scalars instead of being able to simply pass the full SIMD vector by value/register? This would also be great to improve. I think at that point it would be as fast, if not possibly faster than some of our C++ code.

@CarolEidt, @sivarv We should take a close look here and see if there are any obvious things to be improved, or if some of your other work will help here.

@Ziflin Thanks for continuing to help with this. As I was alluding to earlier, even if we provided a "lower-level" SIMD structure, we would still be relying on "wrapper structs" being highly optimized. Hopefully we can identify what the problem is and improve things here.

No problem! I'd love to see it work well and it looks like it's pretty close. Let me know if there's anything else I can help with.

As a note this looks related to https://github.com/dotnet/coreclr/issues/3539 and code gen around compound value types causing too many movs.

The problem with wrapping of SIMD types is a known issue (https://github.com/dotnet/coreclr/issues/7508). I will try to have a look at it again in the next week or so to assess how much work it entails.
I've made some improvements in the struct copying recently, but it doesn't reduce the copies in this case. I think it's a combination of the fact that the optimization phases don't treat SIMD values as a known scalar type, and the fact that the accessing of the fields (without "promoting" them) makes the struct look like it's had its address taken.

Thanks for checking it out Carol. This would really help out our project (and I'm sure anyone wishing to use the SIMD types) considerably.

Since I promised to look at this in "the next week or so", I thought I would update this thread - I've got a preliminary implementation of promotion (aka "scalar" replacement) of wrapped SIMD types: https://github.com/dotnet/coreclr/pull/8402. I haven't yet run the corefx System.Numerics.Vectors tests, and there are some code size increases that require looking at, but it seems promising.

Awesome. Thanks for the update! I appreciate you looking into this.

