Speed for both the double and AutoDiffXd versions of the dynamics is critical to many use cases. I think it's valuable to provide and consider some benchmarks, particularly during the MultibodyPlant upgrade.
As a starting point, here are some benchmarks for the Acrobot and one of the KUKA iiwa models. It is possible that neither benchmark was created optimally. I encourage someone to double check these benchmarks.
My initial takeaways are that both RBP and MBP have serious scaling issues with AutoDiff, but MBP is noticeably worse. Additionally, MultibodyPlant dynamics calculations are overall slightly slower than their RigidBodyPlant counterparts.
Acrobot:
(acrobot_plant) 10000x inertia calculations took 1 millisecond.
(acrobot_plant) 10000x inertia autodiff calculations took 10 milliseconds.
(rigid_body_plant) 10000x inertia calculations took 21 milliseconds.
(rigid_body_plant) 10000x inertia autodiff calculations took 1493 milliseconds.
(multibody_plant) 10000x inertia calculations took 41 milliseconds.
(multibody_plant) 10000x inertia autodiff calculations took 2872 milliseconds.
(rigid_body_plant) 10000x inverse dynamics calculations took 27 milliseconds.
(rigid_body_plant) 10000x autodiff inverse dynamics calculations took 2035 milliseconds.
(multibody_plant) 10000x inverse dynamics calculations took 40 milliseconds.
(multibody_plant) 10000x autodiff inverse dynamics calculations took 2723 milliseconds.
KUKA:
(rigid_body_plant) 10000x inertia calculations took 63 milliseconds.
(rigid_body_plant) 10000x inertia autodiff calculations took 6095 milliseconds.
(multibody_plant) 10000x inertia calculations took 437 milliseconds.
(multibody_plant) 10000x inertia autodiff calculations took 38863 milliseconds.
(rigid_body_plant) 10000x inverse dynamics calculations took 73 milliseconds.
(rigid_body_plant) 10000x autodiff inverse dynamics calculations took 7617 milliseconds.
(multibody_plant) 10000x inverse dynamics calculations took 115 milliseconds.
(multibody_plant) 10000x autodiff inverse dynamics calculations took 47239 milliseconds.
These benchmarks are in addition to the Cassie simulations (mentioned in Slack), where MBP runs ~9x slower than RBP.
BTW, Michael's benchmark result agrees with the one in #8482.
Thanks Hongkai, I forgot about that older issue. AutoDiff performance is one issue, but I think the double versions are also troubling. Looking at the iiwa example, inverse dynamics is 50% slower with MBP than RBT, and the mass matrix takes 7x as long to compute. I know the documentation for MBP's mass matrix calculation indicates that it's O(n^2), but is there a reason the old method couldn't be ported over? A separate calculation of the mass matrix may not be needed for simulation, but it's definitely useful in other scenarios.
cc @sherm1 and @amcastro-tri
@amcastro-tri If you don't feel this should be assigned to you, feel free to pass it along.
This is excellent @mposa, thank you! I just got back from vacation. I'll look into these benchmarks ASAP.
A little update: I created a branch off the source files @mposa provided, adding BUILD files and updating here and there to compile with the latest Drake. The benchmarks are here in this branch.
FTR, besides updating paths and things like that, I made the following changes:
- Switched to std::chrono::steady_clock; it provides more stable results on my platform, and I believe it's the recommended way to measure time within any system anyway.
- Called Context::EnableCaching() (caching is disabled by default).
For the KUKA case the results are:
(rigid_body_plant) 100000x inertia calculations took 971 milliseconds.
(rigid_body_plant) 500x inertia autodiff calculations took 1054 milliseconds.
(multibody_plant) 100000x inertia calculations took 5179 milliseconds.
(multibody_plant) 500x inertia autodiff calculations took 4669 milliseconds.
(rigid_body_plant) 100000x inverse dynamics calculations took 1089 milliseconds.
(rigid_body_plant) 500x autodiff inverse dynamics calculations took 1033 milliseconds.
(multibody_plant) 100000x inverse dynamics calculations took 1099 milliseconds.
(multibody_plant) 500x autodiff inverse dynamics calculations took 952 milliseconds.
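For reference, the benchmarking pattern behind these printouts (a timed loop over repeated calls, measured with a monotonic clock) can be sketched as follows. This is a hedged Python analogue of the C++ harness, not the actual branch code; `run_benchmark` and `toy_plant` are names I made up, and `time.perf_counter` plays the role of `std::chrono::steady_clock` (both are monotonic, so they are immune to wall-clock adjustments mid-run).

```python
import time

def run_benchmark(label, f, reps):
    """Time `reps` calls of `f` with a monotonic clock and report results
    in the same format used in this thread. Returns elapsed milliseconds."""
    start = time.perf_counter()  # monotonic, like std::chrono::steady_clock
    for _ in range(reps):
        f()
    elapsed_ms = (time.perf_counter() - start) * 1e3
    print(f"({label}) {reps}x calculations took {elapsed_ms:.0f} milliseconds. "
          f"{elapsed_ms * 1e3 / reps:.0f} microseconds per.")
    return elapsed_ms

# Stand-in workload; a real benchmark would call the plant's dynamics here.
elapsed = run_benchmark("toy_plant", lambda: sum(i * i for i in range(100)), 1000)
```

Note that caching state (such as `Context::EnableCaching()` being on or off) changes what a loop like this measures: with caching on and an unchanged Context, repeated calls may hit cached results, so a fair benchmark must invalidate or vary the inputs each iteration.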
The main observation here is that inverse dynamics takes about the same time (within statistical variation) in both RBT and MBP.
For the mass matrix, MBP computes it by making n calls (with n the number of velocities) to inverse dynamics. Thus far we are not caching things like spatial inertias, which therefore get recomputed for each column of the M matrix. I believe caching will significantly improve the timing results for M.
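The column-by-column scheme described above can be sketched on a toy system. This is an illustration of the idea, not Drake code: with v = 0, column j of M(q) is ID(q, 0, e_j) - ID(q, 0, 0), so recovering M costs n + 1 inverse-dynamics calls. The 2-DOF dynamics below are invented for the example.

```python
import numpy as np

def toy_inverse_dynamics(q, v, vdot):
    """Toy inverse dynamics tau = M(q) vdot + c(q, v) for a made-up
    2-DOF system with a configuration-dependent mass matrix."""
    M = np.array([[2.0 + np.cos(q[1]), 0.5],
                  [0.5, 1.0]])
    bias = np.array([0.1 * v[0], 0.2 * v[1] + np.sin(q[0])])
    return M @ vdot + bias

def mass_matrix_via_inverse_dynamics(inverse_dynamics, q, n):
    """Recover M(q) one column at a time: with v = 0, column j of M is
    ID(q, 0, e_j) - ID(q, 0, 0). Total cost: n + 1 inverse-dynamics calls."""
    zero = np.zeros(n)
    bias = inverse_dynamics(q, zero, zero)  # c(q, 0), subtracted from each call
    M = np.empty((n, n))
    for j in range(n):
        e_j = np.zeros(n)
        e_j[j] = 1.0  # unit acceleration in the j-th generalized velocity
        M[:, j] = inverse_dynamics(q, zero, e_j) - bias
    return M

q = np.array([0.3, -0.7])
M = mass_matrix_via_inverse_dynamics(toy_inverse_dynamics, q, 2)
```

This also makes the caching point concrete: every one of the n + 1 calls recomputes the same configuration-dependent quantities (spatial inertias, transforms), which is exactly what caching would avoid.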
A note on the complexity of the algorithms used in RBT and MBP for M: both take an O(n^2) approach (a different one, though). MBP just explicitly documents this, warning users to avoid computing M whenever possible (for instance, once we provide forward dynamics, people should prefer that over computing M and inverting it).
I made some minor changes to the printout for readability here: https://github.com/mposa/drake/tree/mbp_vs_rbp_benchmarks
Oddly, running your code, my benchmark results are radically different--I still see a large gap between MBP and RBP.
(rigid_body_plant) 100000x inertia calculations took 576 milliseconds. 5 microseconds per.
(rigid_body_plant) 500x inertia autodiff calculations took 299 milliseconds. 598 microseconds per.
(multibody_plant) 100000x inertia calculations took 4170 milliseconds. 41 microseconds per.
(multibody_plant) 500x inertia autodiff calculations took 1924 milliseconds. 3848 microseconds per.
(rigid_body_plant) 100000x inverse dynamics calculations took 706 milliseconds. 7 microseconds per.
(rigid_body_plant) 500x autodiff inverse dynamics calculations took 463 milliseconds. 926 microseconds per.
(multibody_plant) 100000x inverse dynamics calculations took 898 milliseconds. 8 microseconds per.
(multibody_plant) 500x autodiff inverse dynamics calculations took 413 milliseconds. 826 microseconds per.
Thank you for these results @mposa. Could you comment on what machine and what compiler you used? Not only do these show larger differences, they are also faster!
Run on a laptop (Lenovo X1 Extreme), with an i7-8750H (up to 4.1 GHz), 16 GB of RAM.
Linux or Mac? Did you use the default compiler?
Ubuntu 18.04, default compiler as far as I know.
Plan to PR MBP benchmark reporting to master
Yep, working on it. @sherm1 even showed me a cool example of how to estimate FLOPS as a useful statistic to report :-)
Prioritizing Anzu benchmarking
Any status update on this?
yes. I will start pushing my dev code this week. It includes cache entries to accelerate mass matrix computations plus other goodies.
I maintain my thesis that no performance fix PR should be opened until the "Program that I ran which indicates an improvement" is PR'd concurrently or first.
Will be PR'ing the tests used for this.
Next step is to push cache entry and use gripper example to benchmark
Do we consider the purpose of this issue complete? Since it started off with benchmarks showing specific calculations in MBP being slower than in RBP, should we re-run the attached benchmarking code to see whether the PRs that landed addressed the intent of this issue?
This has been a great place to start; thanks @mposa for this issue. We landed a number of improvements.
What is not clear to me yet is how exactly to measure not only performance, but how important it is for a particular application. I think it'd be best to talk with @mposa about a particular application, make it land, and then profile and beat that problem to death.
For instance, there are different requirements for the multibody computations depending on whether they are used in simulation, control, planning, etc. For our internal applications, simulation already runs faster than other components (for instance, rendering). However, optimizing, say, the mass matrix computation might be important when writing a QP controller for a humanoid. If that's the case, I'd like to first see that application running with the new components (I believe @jwnimmer-tri was thinking about this for Atlas). If we consider that to be a good performance benchmark, then we can work on optimizing it.
@mposa, what do you think?
I'd suggest benchmarking (against double and AutoDiff) some of the core multibody code on a representative system (Atlas is fine, but definitely on the more complex end of the spectrum). The methods that come to mind are:
- Mass matrix
- Bias terms
- Forward kinematics Jacobian
While other methods could also be included, these seem like the most critical.
I agree with you @mposa. However what are we comparing against?
To give you an example, in our simulations the multibody component is actually one of the fastest and is not our bottleneck. That is why we didn't spend more time optimizing it.
However, I believe you do have applications for which the multibody component is a bottleneck. I'd first need to have that application; from there I can profile and measure performance against something.
Does that make any sense?
I'm not sure that an end-to-end example is what we should use to test individual components. Why are the types of benchmarks included earlier in this thread insufficient? For instance, they clearly show that the Autodiff performance is far worse than numerical differentiation--which should definitely NOT be the case.
They also showed a performance gap between MBP and RBP (at least on some machines).
I am not disagreeing with you @mposa. The question I am stating is: "is it worth spending time on improving performance if not an issue in a real application"?
For instance, I do agree with you we can have a faster mass matrix computation. However, we did not prioritize that because for our simulations there are other components that are way slower (camera sim for instance).
In that case, I'd suggest putting together a trajectory optimization example (there is code in Drake for this, just not examples, last I saw). This is one way to force use of AutoDiff and MBP functions.
Using our examples would be pretty cumbersome, as it'd be difficult to divorce from the rest of our code base.
I will add that (1) generating performance benchmarks and (2) a rough evaluation of "good enough?" are doable without such a complex example. These methods have existed in Drake for many years now, since MATLAB, and it's important to understand how core functionality has been affected by these changes.
Good idea. @avalenzu, do we have a good example of trajectory optimization with MBP? We wouldn't need contact, just a large enough system to highlight costs.
Sorry, @amcastro-tri, I'm behind on my github emails. AFAIK we don't have an example for trajectory optimization with MBP in Drake master (correct me if I'm wrong, @RussTedrake). However it seems like converting //examples/acrobot:run_swing_up_traj_optimization to use MbP would be pretty straight-forward. That may or may not be a large enough system.
I'm also running into issues with MbP<AutoDiffXd> performance in some of my research code. Hopefully I can provide a benchmark based on that before too long.
I like the traj opt benchmark. Maybe the acrobot is not the best example but definitely a good one to test the pipeline.
For what it's worth, I re-ran this benchmarking code today (after merging with master) with the following results:
Acrobot:
(acrobot_plant) 500000x inertia calculations took 127 milliseconds.
(acrobot_plant) 50000x inertia autodiff calculations took 62 milliseconds.
(rigid_body_plant) 500000x inertia calculations took 1094 milliseconds.
(rigid_body_plant) 5000x inertia autodiff calculations took 1011 milliseconds.
(multibody_plant) 500000x inertia calculations took 1392 milliseconds.
(multibody_plant) 5000x inertia autodiff calculations took 1070 milliseconds.
(rigid_body_plant) 500000x inverse dynamics calculations took 1309 milliseconds.
(rigid_body_plant) 5000x autodiff inverse dynamics calculations took 1084 milliseconds.
(multibody_plant) 500000x inverse dynamics calculations took 1283 milliseconds.
(multibody_plant) 5000x autodiff inverse dynamics calculations took 984 milliseconds.
KUKA:
(rigid_body_plant) 100000x inertia calculations took 533 milliseconds. 5 microseconds per.
(rigid_body_plant) 500x inertia autodiff calculations took 318 milliseconds. 636 microseconds per.
(multibody_plant) 100000x inertia calculations took 2353 milliseconds. 23 microseconds per.
(multibody_plant) 500x inertia autodiff calculations took 1126 milliseconds. 2252 microseconds per.
(rigid_body_plant) 100000x inverse dynamics calculations took 826 milliseconds. 8 microseconds per.
(rigid_body_plant) 500x autodiff inverse dynamics calculations took 377 milliseconds. 754 microseconds per.
(multibody_plant) 100000x inverse dynamics calculations took 867 milliseconds. 8 microseconds per.
(multibody_plant) 500x autodiff inverse dynamics calculations took 441 milliseconds. 882 microseconds per.
MBP inertia calculations seem noticeably improved, though autodiff for both RBT/MBP remains unusable (100x slower than double for a 7-DOF system).
@mposa what do you mean when you say "inertia calculation"?
BTW thanks very much for doing these measurements and posting them!
I believe the code is calling multibody_plant.CalcMassMatrixViaInverseDynamics. It was a (pretty arbitrarily) picked function to test on.
Thanks @mposa for updating to latest master and making these.
From your measurements, it'd seem the biggest thing is actually how inefficient AutoDiffXd is. (I am not worried about the mass matrix; I know the current algorithm is awful, but it hasn't been a priority to implement something like the RBI method.)
For a 7 dof arm like the Kuka I'd expect the best scenario to be the derivatives computation be about 7 times more expensive than the double evaluation (if I think in terms of using finite differences for instance). Therefore the factor of ~100 is clearly an AutoDiffXd issue.
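The "about 7 times more expensive" intuition comes from finite differencing needing roughly one extra function evaluation per input: a forward-difference gradient of a function of n variables costs n + 1 evaluations. A hedged generic sketch (not Drake code; the function and names here are invented for illustration):

```python
import numpy as np

def forward_difference_jacobian(f, x, h=1e-7):
    """Approximate the Jacobian df/dx with forward differences.
    Costs n + 1 evaluations of f for n inputs, which is the baseline
    against which an AutoDiff evaluation's cost can be judged."""
    f0 = np.asarray(f(x), dtype=float)
    n = len(x)
    J = np.empty((f0.size, n))
    for j in range(n):
        x_pert = x.copy()
        x_pert[j] += h  # perturb one input at a time
        J[:, j] = (np.asarray(f(x_pert), dtype=float) - f0) / h
    return J

# Example with an easy analytic Jacobian to check against:
# f(x) = [x0 * x1, sin(x0)], so J = [[x1, x0], [cos(x0), 0]].
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x = np.array([0.5, 2.0])
J = forward_difference_jacobian(f, x)
```

So if a double evaluation of inverse dynamics takes t, finite differencing the 7-DOF KUKA gradient should take roughly 8t; an AutoDiffXd evaluation costing ~100t is therefore a scalar-type overhead, not an inherent cost of computing derivatives.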
Therefore we probably need a new issue, not on MBP performance but on AutoDiffXd performance?
What are your thoughts on this?
Ah, I wondered why it was 4X slower than the RBP time for Kuka. Currently MBP doesn't have a specialized mass matrix computation but instead fakes it with repeated calls to its O(n) ID with unit accelerations.
I think a direct attack on AutoDiffScalarXd performance is warranted. Currently I suspect its memory management is absurdly bad.
cc'ing @jwnimmer-tri and @avalenzu since I had lots of cool discussions with them about this issue. What do you guys think? I would even love a fast gradient for @antequ's work!
Before calling MBP performance good, it'd probably be worthwhile to benchmark a few other functions. Forward kinematics comes to mind as one common workflow.
> For a 7 dof arm like the Kuka I'd expect the best scenario to be the derivatives computation be about 7 times more expensive than the double evaluation (if I think in terms of using finite differences for instance). Therefore the factor of ~100 is clearly an AutoDiffXd issue.
The promise of AutoDiff is that it's faster and more accurate than numerical differentiation. If "best case" is the same speed, then I think numerical differentiation should be a well-supported baseline anywhere gradients are used.
@amcastro-tri :
> cc'ing @jwnimmer-tri and @avalenzu since I had lots of cool discussions with them about this issue. What do you guys think? I would even love a fast gradient for @antequ's work!
My answer is the same as ever: if you want a computation to get faster, PR to master a reproducible benchmark that exactly reports the latency or throughput that you're interested in, and then we can start profiling and understanding where the speed is going.
Good point @jwnimmer-tri. I think it's best to push a non-MBP benchmark just for AutoDiffXd. I'll see what I can come up with.
I think making a smaller benchmark to focus on some specific facet is probably a good technique to keep on deck, but I would still want to start with the top-level benchmark _first_ so that we can measure which facet(s) actually matter most, before digging in to more details.
I think an MBP-based benchmark would be fine in this case since we can just compare directly the cost of MBP<double>::Something() with MBP<AutoDiffXd>::Something(). That would be an easy way to come up with a large and directly-relevant test.