Per discussion in draft PR #14594, we would like to be able to use SIMD instructions for better performance in Drake simulations. This issue is for discussing how best to do that.
We have measured substantial speedups from hand-crafted methods (using Intel AVX2 and FMA intrinsics) for common operations, and would like to use those by default on machines that have those instructions available. (All Intel and AMD processors released since 2014 have them.) Eigen is also capable of taking advantage of these (allegedly).
An issue I saw in #14594 is that many of our CI tests fail with AVX2 & FMA enabled. This appears to be due to tests that are unreasonably sensitive to roundoff error. I have fixed many of them (in that PR), and all tests pass with clang, but there are still several mysterious failures when building with gcc that must be investigated. I _think_ these are still roundoff sensitivity problems but it is conceivable that gcc 7.5 has bugs. Preventing the compilers from using these instructions _except_ in hand-crafted methods would minimize fallout, but that would prevent us from inlining SIMD methods, which can be very small.
BTW the minimal set of flags needed to get the compilers to produce the right instructions from the intrinsics is: -mavx2 -mfma -mtune=haswell. That should produce binaries that will work on any Intel-based machine made in the last 7 years. @jamiesnape pointed out that our current Mac CI machines are older than that!
Thanks in advance for your thoughts on how we can best make progress here.
FYI Some recent AMD Zen processors have issues with FMA, but I do not think that should really inform any discussions. WRT Eigen, I think they support various other processor architecture intrinsics, so there are certain gains to be had (with different flags) even if the world moves away from Intel and Drake supports ARM.
To the issue at hand: just as I would with OpenMP, for SIMD I would create a config in the first instance, so people can opt in to the flags very easily and CI can run them without having to commit raw compiler flags to CI.
The alternative is to define toolchains. That would arguably be the correct approach, but Bazel makes the process a little more complicated than I would like and the API has not historically been amazingly stable.
I'm afraid that if the default Drake is slow and we leave it to users to build a performant version then most users' initial experience will be negative. That could affect Drake's reputation and reduce uptake. I'd like to have the default be the faster build, with users able to opt in to a more generic version if they have old machines. Is that feasible?
I wrote down a plan in the other PR. I'm working on it. More speculation here isn't useful. Let me get requirements on deck first, and then we can plan for the tactics. I will revisit this on Thursday.
> I'm afraid that if the default Drake is slow and we leave it to users to build a performant version then most users' initial experience will be negative. That could affect Drake's reputation and reduce uptake. I'd like to have the default be the faster build, with users able to opt in to a more generic version if they have old machines. Is that feasible?
(It does not need to be opt in forever, that would just be the starting point.)
(@jwnimmer-tri comments copied here from Anzu 6274)
The libpcl binaries from Ubuntu are compiled with one Eigen alignment choice, but our use of AVX2 was triggering different alignment annotations, leading to ABI incompatibility with Eigen's allocator expectations. I'll work on finding the right knob to guarantee that the Eigen ABI still matches.
Trying bazel test //redacted:redacted --copt=-march=broadwell --copt=-g is a good test case. When it fails, I see:
...
#1 0x00007ffff7f84d62 in Eigen::internal::handmade_aligned_free(void*) (ptr=<optimized out>) at external/eigen/include/_usr_include_eigen3/Eigen/src/Core/util/Memory.h:98
#2 0x00007ffff7f84d62 in Eigen::internal::aligned_free(void*) (ptr=<optimized out>) at external/eigen/include/_usr_include_eigen3/Eigen/src/Core/util/Memory.h:179
#3 0x00007ffff7f84d62 in Eigen::aligned_allocator<pcl::PointXYZRGB>::deallocate(pcl::PointXYZRGB*, unsigned long) (this=0x5555556216e8, p=<optimized out>) at external/eigen/include/_usr_include_eigen3/Eigen/src/Core/util/Memory.h:747
#4 0x00007ffff7f84d62 in std::allocator_traits<Eigen::aligned_allocator<pcl::PointXYZRGB> >::deallocate(Eigen::aligned_allocator<pcl::PointXYZRGB>&, pcl::PointXYZRGB*, unsigned long) (__a=..., __n=<optimized out>, __p=<optimized out>) at /usr/include/c++/7/bits/alloc_traits.h:328
#5 0x00007ffff7f84d62 in std::_Vector_base<pcl::PointXYZRGB, Eigen::aligned_allocator<pcl::PointXYZRGB> >::_M_deallocate(pcl::PointXYZRGB*, unsigned long) (this=0x5555556216e8, __n=<optimized out>, __p=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:180
#6 0x00007ffff7f84d62 in std::_Vector_base<pcl::PointXYZRGB, Eigen::aligned_allocator<pcl::PointXYZRGB> >::~_Vector_base() (this=0x5555556216e8, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:162
#7 0x00007ffff7f84d62 in std::vector<pcl::PointXYZRGB, Eigen::aligned_allocator<pcl::PointXYZRGB> >::~vector() (this=0x5555556216e8, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:435
#8 0x00007ffff7f84d62 in pcl::PointCloud<pcl::PointXYZRGB>::~PointCloud() (this=0x5555556216b0, __in_chrg=<optimized out>) at external/pcl/pcl/point_cloud.h:240
...
From my reading so far of Eigen docs, it says ...
EIGEN_MAX_ALIGN_BYTES - ... an upper bound on the memory boundary in bytes on which dynamically and statically allocated data may be aligned by Eigen. ... For instance, one may compile AVX code and enforce ABI compatibility with existing SSE code by defining EIGEN_MAX_ALIGN_BYTES=16 ...
-- https://eigen.tuxfamily.org/dox/TopicPreprocessorDirectives.html#title3
... but that seems like a lie.
Default "ideal" value is 32 bytes with AVX, or 16 bytes otherwise:
https://github.com/eigenteam/eigen-git-mirror/blob/3.3.4/Eigen/src/Core/util/Macros.h#L666-L671
The "ideal" value trumps the user's setting if the ideal value is larger?!:
https://github.com/eigenteam/eigen-git-mirror/blob/3.3.4/Eigen/src/Core/util/Macros.h#L776-L780
There seems to be no way to enable AVX in Eigen, but still use 16-byte memory alignment?
In newer Eigen than 3.3.4 the code moved into a different file, but the logic is unchanged.
Maybe it's time for a StackOverflow question for the Eigen maintainers.
@sherm1 Maybe I can kick this over to you?
We need to keep EIGEN_DEFAULT_ALIGN_BYTES set to 16. Once I turn on -march=broadwell, it seems to bloat to 32 bytes, with no way that I can find to control it (see the above posting for details).
Can you find a solution that keeps the default alignment unchanged, but still allows us to unblock your Drake-related performance improvements?
You can unit test by static_assert(EIGEN_DEFAULT_ALIGN_BYTES == 16, "Not 16!").
Further forwarded discussion:
@sherm1 writes
I will look into this (on the Drake side -- it is hard for me to access GHE).
However, Eigen is right that with AVX enabled 32 byte alignment is required for best performance. How bad of a problem is that for Anzu?
@jwnimmer-tri writes
Switching our minimum Eigen alignment to be 32 bytes would be a non-trivial project. How much work, exactly, I couldn't say without digging further. On the other hand, if we only required 16 bytes, but we annotated RotationMatrix and RigidTransform to internally align to 32 in practice, that probably would be much easier.
cc @rpoyner-tri please see the above discussion. Any rabbits in your hat?
I don't know your performance targets, but -msse4.2 and earlier SSE would be safe (no AVX, just SSE). We use it in other TRI projects, and it does not change Eigen's ABI as best I can tell.
I don't think I can get a significant speedup with SSE double-wide instructions on the operations we're interested in. For example, because of all the non-fp malarkey it takes to get to the good stuff, I have so far only been able to get 3X speedup (on RigidTransform composition) making use of the _8-wide_ FMA instructions. So I don't think it's worth the trouble to make any handcoded SSE methods (though I have no objection to the compiler trying to generate them).
BTW I think SSE is more interesting in single-precision float computations. But we are committed to using double.
> BTW I think SSE is more interesting in single-precision float computations. But we are committed to using double.
That is basically how I read things, or at least how Intel sold AVX for doubles when it was released. Obviously, the dimensionality of the data may give you special cases where everything works out, but that is not usual. Many years ago we worked with Intel engineers to get a speedup with SSE, gave up, and switched attention to TBB on a different part of the problem (we had to use something by Intel as they were sponsors).
> The libpcl binaries from Ubuntu are compiled with one Eigen alignment choice, but our use of AVX2 was triggering different alignment annotations, leading to ABI incompatibility with Eigen's allocator expectations.
That is actually the sort of reason why I do not usually enable anything by default. Sadly that wastes half the features of the expensive piece of hardware that many people purchase and the world does not move on. Not sure who to blame on that, it is not just me who thinks that way.
@sherm1 @rpoyner-tri Do you actually need -mfma or -march=broadwell, across the whole project? Are you trying to auto-vectorize our existing Eigen code across the board, or are you only trying to surgically replace a single RigidTransform method or two with intrinsics?
If only the latter, a quick spike test indicates that GCC options inline in the source code might be good enough: https://godbolt.org/z/1M6joq. That uses #pragma GCC target("fma") around just that one method.
If it appears to suit your needs, we can look into how to safely apply it within Drake.
I was thinking about that this morning. It would allow us to implement localized speedups without solving all the other problems. And we haven't seen any great benefit yet to compiling all of Drake with -march=broadwell (or even -march=native). It would also save me from having to fix a half dozen mysterious unrelated CI failures (though fixing those might be good for code health, it is an unwanted yak shave).
This approach likely precludes inlined AVX-using methods, but I'm not sure any of ours are so small currently that it would matter.
@sherm1 Please keep me posted on the scope you think you'll need, as your thinking evolves. Over the next couple weeks, I'll be working on a requirements roadmap for how our CI, build, & release will aim to support a bunch of new features (AVX, OpenMP, OpenCL, C++20, AArch64, etc.) over the coming year or two. In my initial drafts of a decision matrix, it will make a big difference whether the entire project must turn on AVX on Ubuntu overall, or whether only a very small handful of methods (e.g., RigidTransform::operator*, RotationMatrix::operator*) merely gain a secondary implementation.
Will do. We'll test the "handful of methods" approach and go with that if we can still get the speedups we want.
Just to make this unnecessarily hard, the gcc pragma has a bug in which it appears to ignore targets like "arch=broadwell" or "tune=haswell", although individual flags do work (but don't generate the right instructions, since "tune=" is required). Here's a related StackOverflow question. The analogous clang pragma, OTOH, does work correctly.
So I think we can't use #pragma to set these local compile flags. Instead, we should have Bazel pass "-march=broadwell" just when compiling the AVX-using functions. Not sure how to do that but I presume it's possible.
If we isolate the magic methods in specific .cc files, we can add compiler options to individual bazel rules to build them. Still not great, but possible.
You can add copts = ... to a drake_cc_library to change the compiler flags just for that one cc file.
See https://docs.bazel.build/versions/4.0.0/be/c-cpp.html#cc_library.copts for background.
Note for any cc file where you say -march=haswell or similar in copts, you must not transitively include anything from Eigen (or any other library that changes its ABI based on CPU). So you'll need a primitive-based function declaration as the crossing-over point (e.g., pointers to doubles, or raw structs of same), not an Eigen::Matrix or RigidTransform or etc.
To change the list of source files or copts based on platform (macOS vs Ubuntu), you can grep around in the tree for //tools/cc_toolchain:apple for some examples. I think there are enough examples for @rpoyner-tri to suss out the BUILD.bazel rules you'll need, but if you get stuck you can post a prototype and ping me for a look.
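Putting those two pieces of advice together, a BUILD.bazel fragment for the isolated compilation unit might look like the following (target and file names are illustrative, not Drake's actual spellings):

```python
# Hypothetical BUILD.bazel fragment. The copts apply only to this one
# library, so the rest of the build stays at the default architecture.
drake_cc_library(
    name = "avx_kernels",
    srcs = ["avx_kernels.cc"],
    hdrs = ["avx_kernels.h"],
    copts = ["-march=broadwell"],
)

# Per the ABI caveat above, avx_kernels.h must expose only a
# primitive-based interface (raw pointers to doubles), never
# Eigen::Matrix or RigidTransform, e.g.:
#
#   void ComposeRR(const double* R_AB, const double* R_BC, double* R_AC);
```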
Thank you! I'll test that.
For the record, I just tested -mtune=broadwell in Anzu as global rcfile setting, and all tests passed. So if we ever want to adjust the tuning globally for some performance benefit, that's probably a fine option.
Today's proposal to permit use of SIMD methods with C++ fallback and make sure both get unit tested:
- In mult3x3.cc, provide an unconditionally compiled C++ version, say mult3x3_cpp(); this can always be unit tested.
- In mult3x3.cc, use instruction availability macros like __AVX__ to conditionally compile the SIMD version of mult3x3(); otherwise, define mult3x3() to be mult3x3_cpp().
- Unit test both mult3x3_cpp() and mult3x3().
- Compile with -march=broadwell on suitable platforms.

For now we have to figure out how _not_ to use AVX on our ancient Mac build machines. When those are upgraded, CI will be using only the SIMD versions, but the C++ versions will still be unit tested. Then when we start supporting ARM machines we'll fall back on those C++ versions until we can do better.
Probably @ToffeeAlbina-TRI and/or @jwnimmer-tri and I need to speak to MacStadium, or we move some or all Mac builds to AWS (though Jenkins has not caught up with that yet), rather than jump through hoops to avoid using these instructions on Mac. The Mac Pro machines are not exactly doing amazingly well for us now anyway. Literally, our CI machines are the only supported Macs on which all the instructions will not work.
There is no need to couple this issue with any CI changes. It's not just macOS CI that has older hardware -- TRI's macsim machine also lacks AVX2 and FMA (but does have AVX). As Sherm notes, we need to keep the vanilla methods around as a general principle. Even if we upgrade CI, the code changes and effort required here are the same.
Fair enough, but I would not jump through hoops too much.
(Also maybe try to get a newer Mac locally; time is running out to buy an Intel one, but even a new Mac Mini would be faster.)
FWIW
I agree with Jamie that upgrading to not-ancient machines for both CI and TRI's macsim is something we should do and that we should do that regardless of what happens with the SIMD instructions stuff.
And, I agree with Sherm that we need to keep the vanilla methods around as a general principle.
@sherm1 per https://github.com/RobotLocomotion/drake/issues/14603#issuecomment-776092312, I think that we don't need any big-picture or build-and-release changes to resolve this issue. You are still welcome to invite Jamie or myself into PR reviews (or PR drafts for suggestions) to double-check the exact BUILD spellings on a given patch.
Given that, perhaps we can close out this issue?
The one missing big-picture build piece is that we can't currently use SIMD instructions on a Mac, even if it is Intel-based and supports all the necessary instructions. Our current approach simply disables the fast methods (no -march=broadwell) for any Mac build -- that seems unfortunate. We can't enable SIMD indiscriminately on Mac because of (a) backlevel CI machines (see above), and (b) future desire to support ARM-based Macs. Any ideas how we can check locally to determine whether we can enable SIMD when running on Mac?
I will check what CMake does. I think it has that capability (or else someone wrote a module that a lot of people use, I forget).
/usr/sbin/sysctl -n machdep.cpu.features
Output on my laptop:
FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
Is that command identical on Ubuntu and Mac machines? If so, I think I see what you're suggesting -- execute that command from the local bazel build file to determine whether the currently-building machine supports the needed instructions, and if it does, add the -march=broadwell flag to the AVX-containing compilation unit. (Not that I know how to do the related bazel magic, but I'm sure it's possible!) Is that right?
@sherm1 What is the user story you'd like to enable?
Having the build automatically configure itself based on the host which compiles the code is not a reasonable request -- nothing says that the host performing the build and the host running the compiled library will have compatible SIMD capabilities. In a world of prebuilt images, cloud computing, and virtual machines, that assumption is outdated.
The best you could hope for is for us to offer documentation (e.g., a build flag) for a user to opt in to AVX2 on Intel Macs. We can think about the best way to formulate that, if that has a meaningful payoff in your mind.
However, at the end of the day, my own vote is that Intel Macs are EOL and we should not spend any of our limited engineering-hours worrying about how to get the last percentage of performance from a dead platform. I would much rather spend our time making the ARM-based Mac hardware natively supported by Drake -- that is the future of the platform, and looks extremely promising in initial performance benchmarks of other tools.
> The best you could hope for is for us to offer documentation (e.g., a build flag) for a user to opt in to AVX2 on Intel Macs. We can think about the best way to formulate that, if that has a meaningful payoff in your mind.
I do agree with this. If someone wants to run a config to add flags on macOS, I would not oppose that, but they would be on their own, and I would not build binaries that way. Even with new CI we would still be resource limited, we can barely test enough permutations as it is.
And the Apple Silicon Macs are probably faster without SIMD than the Intel ones are with.
OK, that makes it easy -- we'll build the low-level compilation unit with -march=broadwell unless we're on a Mac.
No need for any systematic change in that case so I'll close this.