Drake: mbp: non-repeatability for mbp + contact

Created on 5 Jun 2020  路  43Comments  路  Source: RobotLocomotion/drake

EDIT (2020/07/17): Re-scoped to handle both MBP and SceneGraph
https://github.com/RobotLocomotion/drake/pull/13505

Sub-issue:

  • [x] #13507

\cc @antequ @sherm1

multibody plant dynamics

Most helpful comment

Just want to add that this issue, #13069 , for a previous bugged version of an integrator, is related -- many of our Integrator tests do not call Initialize() between simulation runs: within a test, we use the same simulator, run it, then change a few things (including the time), then run it again without calling Initialize(). We've fixed it by making sure the methods-for-changing-simulator-settings called in the tests actually also reset the cached states upon the change, but it's not as future-proof as actually calling Initialize(). Whatever future design we decide on, the tests mentioned in #13069 may be affected.

All 43 comments

@rpoyner-tri has bravely volunteered to serve as point person on this investigation (and #13507). That's jumping into the deep end! Rico will no doubt be calling on many of us for consultation -- please do what you can to help.

well well well. Finally got the pure-c++ version of Eric's counterexample running under ubsan:

rico@uniblab:~/checkout/drake$ bazel run --config=ubsan --nobuild_tests_only //tmp:repro_contact_nondet
INFO: Analyzed 2 targets (0 packages loaded, 0 targets configured).
INFO: Found 2 targets...
INFO: Elapsed time: 0.231s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: /bin/bash -c '/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tools/dynamic_analysis/ubsan bazel-binINFO: Build completed successfully, 1 total action


[ meta_trial = 0, num_sim_trials = 2, setup = Discrete_NoGeometry ]

kReuse
external/eigen/include/_usr_include_eigen3/Eigen/src/Cholesky/LDLT.h:50:49: runtime error: load of value 3279793312, which is not a valid value for type 'ComputationInfo'

That's UB error, something to chase further.

Some stack traces:

rico@uniblab:~/checkout/drake$ UBSAN_OPTIONS=print_stacktrace=1 bazel-bin/tmp/repro_contact_nondet


[ meta_trial = 0, num_sim_trials = 2, setup = Discrete_NoGeometry ]

kReuse
external/eigen/include/_usr_include_eigen3/Eigen/src/Cholesky/LDLT.h:50:49: runtime error: load of value 536689824, which is not a valid value for type 'ComputationInfo'
    #0 0x5573c44dc608 in Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>::LDLT(Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const&) external/eigen/include/_usr_include_eigen3/Eigen/src/Cholesky/LDLT.h:50
    #1 0x5573c44dc608 in void std::_Construct<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const&>(Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>*, Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const&) /usr/include/c++/7/bits/stl_construct.h:75
    #2 0x5573c44dc608 in Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>* std::__uninitialized_copy<false>::__uninit_copy<__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>*>(__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> > > >, __gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>*) /usr/include/c++/7/bits/stl_uninitialized.h:83
    #3 0x5573c44dc608 in Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>* std::uninitialized_copy<__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>*>(__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> > > >, __gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>*) /usr/include/c++/7/bits/stl_uninitialized.h:134
    #4 0x5573c44dc608 in Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>* std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>*, Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> >(__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> > > >, __gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>*, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> >&) /usr/include/c++/7/bits/stl_uninitialized.h:289
    #5 0x5573c44dc608 in std::vector<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> > >::vector(std::vector<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<double, -1, -1, 0, 6, 6>, 1> > > const&) /usr/include/c++/7/bits/stl_vector.h:331
    #6 0x5573c44dc608 in drake::multibody::internal::ArticulatedBodyInertiaCache<double>::ArticulatedBodyInertiaCache(drake::multibody::internal::ArticulatedBodyInertiaCache<double> const&) bazel-out/k8-opt/bin/multibody/tree/_virtual_includes/multibody_tree_caches/drake/multibody/tree/articulated_body_inertia_cache.h:35
    #7 0x5573c44dc608 in drake::Value<drake::multibody::internal::ArticulatedBodyInertiaCache<double> >::Value(drake::multibody::internal::ArticulatedBodyInertiaCache<double> const&) bazel-out/k8-opt/bin/common/_virtual_includes/value/drake/common/value.h:700
    #8 0x5573c44dd333 in std::unique_ptr<drake::AbstractValue, std::default_delete<drake::AbstractValue> > drake::AbstractValue::Make<drake::multibody::internal::ArticulatedBodyInertiaCache<double> >(drake::multibody::internal::ArticulatedBodyInertiaCache<double> const&) bazel-out/k8-opt/bin/common/_virtual_includes/value/drake/common/value.h:646
    #9 0x5573c44dd333 in drake::multibody::internal::MultibodyTreeSystem<double>::Finalize()::{lambda()#11}::operator()() const multibody/tree/multibody_tree_system.cc:216
    #10 0x5573c44dd333 in std::_Function_handler<std::unique_ptr<drake::AbstractValue, std::default_delete<drake::AbstractValue> > (), drake::multibody::internal::MultibodyTreeSystem<double>::Finalize()::{lambda()#11}>::_M_invoke(std::_Any_data const&) /usr/include/c++/7/bits/std_function.h:302
    #11 0x5573c544c1b0 in std::function<std::unique_ptr<drake::AbstractValue, std::default_delete<drake::AbstractValue> > ()>::operator()() const /usr/include/c++/7/bits/std_function.h:706
    #12 0x5573c544c1b0 in drake::systems::CacheEntry::Allocate() const systems/framework/cache_entry.cc:38
    #13 0x5573c5445400 in drake::systems::SystemBase::InitializeContextBase(drake::systems::ContextBase*) const systems/framework/system_base.cc:95
    #14 0x5573c51fa7e1 in drake::systems::LeafSystem<double>::DoAllocateContext() const systems/framework/leaf_system.cc:96
    #15 0x5573c532bd1e in drake::systems::SystemBase::AllocateContext() const bazel-out/k8-opt/bin/systems/framework/_virtual_includes/system_base/drake/systems/framework/system_base.h:84
    #16 0x5573c532bd1e in drake::systems::System<double>::AllocateContext() const systems/framework/system.cc:18
    #17 0x5573c525ac86 in drake::systems::LeafSystem<double>::AllocateContext() const systems/framework/leaf_system.cc:58
    #18 0x5573c525b53a in drake::systems::LeafSystem<double>::GetDirectFeedthroughs() const systems/framework/leaf_system.cc:253
    #19 0x5573c4f05b98 in drake::systems::internal::DiagramBuilderImpl::ThrowIfAlgebraicLoopsExist(std::vector<drake::systems::SystemBase const*, std::allocator<drake::systems::SystemBase const*> > const&, std::map<std::pair<drake::systems::SystemBase const*, drake::TypeSafeIndex<drake::systems::InputPortTag> >, std::pair<drake::systems::SystemBase const*, drake::TypeSafeIndex<drake::systems::OutputPortTag> >, std::less<std::pair<drake::systems::SystemBase const*, drake::TypeSafeIndex<drake::systems::InputPortTag> > >, std::allocator<std::pair<std::pair<drake::systems::SystemBase const*, drake::TypeSafeIndex<drake::systems::InputPortTag> > const, std::pair<drake::systems::SystemBase const*, drake::TypeSafeIndex<drake::systems::OutputPortTag> > > > > const&) systems/framework/diagram_builder.cc:110
    #20 0x5573c4f09296 in drake::systems::DiagramBuilder<double>::ThrowIfAlgebraicLoopsExist() const bazel-out/k8-opt/bin/systems/framework/_virtual_includes/diagram_builder/drake/systems/framework/diagram_builder.h:391
    #21 0x5573c28fceff in drake::tmp::make_simulator[abi:cxx11](drake::tmp::Setup const&) (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x32e3eff)
    #22 0x5573c29018c8 in drake::tmp::simulate_trials[abi:cxx11](drake::tmp::ResimulateStyle, int, drake::tmp::Setup const&) (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x32e88c8)
    #23 0x5573c2909924 in drake::tmp::run_simulations[abi:cxx11](int, drake::tmp::Setup const&) (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x32f0924)
    #24 0x5573c290a46e in drake::tmp::do_main() (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x32f146e)
    #25 0x5573c290ab95 in main (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x32f1b95)
    #26 0x7f021fc09b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
    #27 0x5573c277f479 in _start (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x3166479)

external/eigen/include/_usr_include_eigen3/Eigen/src/Cholesky/LDLT.h:50:49: runtime error: load of value 538976256, which is not a valid value for type 'ComputationInfo'
    #0 0x5573c44e7ea0 in Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>::LDLT(Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const&) external/eigen/include/_usr_include_eigen3/Eigen/src/Cholesky/LDLT.h:50
    #1 0x5573c44e7ea0 in void std::_Construct<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const&>(Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>*, Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const&) /usr/include/c++/7/bits/stl_construct.h:75
    #2 0x5573c44e7ea0 in Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>* std::__uninitialized_copy<false>::__uninit_copy<__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>*>(__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> > > >, __gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>*) /usr/include/c++/7/bits/stl_uninitialized.h:83
    #3 0x5573c44e7ea0 in Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>* std::uninitialized_copy<__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>*>(__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> > > >, __gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>*) /usr/include/c++/7/bits/stl_uninitialized.h:134
    #4 0x5573c44e7ea0 in Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>* std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>*, Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> >(__gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> > > >, __gnu_cxx::__normal_iterator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> const*, std::vector<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> > > >, Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>*, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> >&) /usr/include/c++/7/bits/stl_uninitialized.h:289
    #5 0x5573c44e7ea0 in std::vector<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> > >::vector(std::vector<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1>, std::allocator<Eigen::LDLT<Eigen::Matrix<drake::symbolic::Expression, -1, -1, 0, 6, 6>, 1> > > const&) /usr/include/c++/7/bits/stl_vector.h:331
    #6 0x5573c44e7ea0 in drake::multibody::internal::ArticulatedBodyInertiaCache<drake::symbolic::Expression>::ArticulatedBodyInertiaCache(drake::multibody::internal::ArticulatedBodyInertiaCache<drake::symbolic::Expression> const&) bazel-out/k8-opt/bin/multibody/tree/_virtual_includes/multibody_tree_caches/drake/multibody/tree/articulated_body_inertia_cache.h:35
    #7 0x5573c44e7ea0 in drake::Value<drake::multibody::internal::ArticulatedBodyInertiaCache<drake::symbolic::Expression> >::Value(drake::multibody::internal::ArticulatedBodyInertiaCache<drake::symbolic::Expression> const&) bazel-out/k8-opt/bin/common/_virtual_includes/value/drake/common/value.h:700
    #8 0x5573c44e92e2 in std::unique_ptr<drake::AbstractValue, std::default_delete<drake::AbstractValue> > drake::AbstractValue::Make<drake::multibody::internal::ArticulatedBodyInertiaCache<drake::symbolic::Expression> >(drake::multibody::internal::ArticulatedBodyInertiaCache<drake::symbolic::Expression> const&) bazel-out/k8-opt/bin/common/_virtual_includes/value/drake/common/value.h:646
    #9 0x5573c44e92e2 in drake::multibody::internal::MultibodyTreeSystem<drake::symbolic::Expression>::Finalize()::{lambda()#11}::operator()() const multibody/tree/multibody_tree_system.cc:216
    #10 0x5573c44e92e2 in std::_Function_handler<std::unique_ptr<drake::AbstractValue, std::default_delete<drake::AbstractValue> > (), drake::multibody::internal::MultibodyTreeSystem<drake::symbolic::Expression>::Finalize()::{lambda()#11}>::_M_invoke(std::_Any_data const&) /usr/include/c++/7/bits/std_function.h:302
    #11 0x5573c544c1b0 in std::function<std::unique_ptr<drake::AbstractValue, std::default_delete<drake::AbstractValue> > ()>::operator()() const /usr/include/c++/7/bits/std_function.h:706
    #12 0x5573c544c1b0 in drake::systems::CacheEntry::Allocate() const systems/framework/cache_entry.cc:38
    #13 0x5573c5445400 in drake::systems::SystemBase::InitializeContextBase(drake::systems::ContextBase*) const systems/framework/system_base.cc:95
    #14 0x5573c51ed299 in drake::systems::LeafSystem<drake::symbolic::Expression>::DoAllocateContext() const systems/framework/leaf_system.cc:96
    #15 0x5573c532c20a in drake::systems::SystemBase::AllocateContext() const bazel-out/k8-opt/bin/systems/framework/_virtual_includes/system_base/drake/systems/framework/system_base.h:84
    #16 0x5573c532c20a in drake::systems::System<drake::symbolic::Expression>::AllocateContext() const systems/framework/system.cc:18
    #17 0x5573c5352956 in drake::systems::System<drake::symbolic::Expression>::CreateDefaultContext() const systems/framework/system.cc:53
    #18 0x5573c52bb883 in drake::systems::SystemSymbolicInspector::SystemSymbolicInspector(drake::systems::System<drake::symbolic::Expression> const&) (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x5ca2883)
    #19 0x5573c525d724 in std::_MakeUniq<drake::systems::SystemSymbolicInspector>::__single_object std::make_unique<drake::systems::SystemSymbolicInspector, drake::systems::System<drake::symbolic::Expression>&>(drake::systems::System<drake::symbolic::Expression>&) /usr/include/c++/7/bits/unique_ptr.h:821
    #20 0x5573c525d724 in MakeSystemSymbolicInspector<double> systems/framework/leaf_system.cc:214
    #21 0x5573c525d724 in drake::systems::LeafSystem<double>::GetDirectFeedthroughs() const systems/framework/leaf_system.cc:293
    #22 0x5573c4f05b98 in drake::systems::internal::DiagramBuilderImpl::ThrowIfAlgebraicLoopsExist(std::vector<drake::systems::SystemBase const*, std::allocator<drake::systems::SystemBase const*> > const&, std::map<std::pair<drake::systems::SystemBase const*, drake::TypeSafeIndex<drake::systems::InputPortTag> >, std::pair<drake::systems::SystemBase const*, drake::TypeSafeIndex<drake::systems::OutputPortTag> >, std::less<std::pair<drake::systems::SystemBase const*, drake::TypeSafeIndex<drake::systems::InputPortTag> > >, std::allocator<std::pair<std::pair<drake::systems::SystemBase const*, drake::TypeSafeIndex<drake::systems::InputPortTag> > const, std::pair<drake::systems::SystemBase const*, drake::TypeSafeIndex<drake::systems::OutputPortTag> > > > > const&) systems/framework/diagram_builder.cc:110
    #23 0x5573c4f09296 in drake::systems::DiagramBuilder<double>::ThrowIfAlgebraicLoopsExist() const bazel-out/k8-opt/bin/systems/framework/_virtual_includes/diagram_builder/drake/systems/framework/diagram_builder.h:391
    #24 0x5573c28fceff in drake::tmp::make_simulator[abi:cxx11](drake::tmp::Setup const&) (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x32e3eff)
    #25 0x5573c29018c8 in drake::tmp::simulate_trials[abi:cxx11](drake::tmp::ResimulateStyle, int, drake::tmp::Setup const&) (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x32e88c8)
    #26 0x5573c2909924 in drake::tmp::run_simulations[abi:cxx11](int, drake::tmp::Setup const&) (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x32f0924)
    #27 0x5573c290a46e in drake::tmp::do_main() (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x32f146e)
    #28 0x5573c290ab95 in main (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x32f1b95)
    #29 0x7f021fc09b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
    #30 0x5573c277f479 in _start (/home/rico/.cache/bazel/_bazel_rico/7f8997f28c9253517a55d673c67a6c74/execroot/drake/bazel-out/k8-opt/bin/tmp/repro_contact_nondet+0x3166479)

  index: 0
    good (first)
  index: 1
    BAD
      diff[0], t: 0.001
        - q: [0.02 0.085 -0.285 1.43 0.284 -1.07 0.09999]
        ?                                          ^^^^^

        - q: [0.02 0.085 -0.285 1.43 0.284 -1.07 0.1]
        ?                                          ^


      diff[-1], t: 1
        - q: [-0.316 -2.269 -3.135 -2.194 -0.6454 0.6284 0.1024]
        ?                ^^      ^      ^      ^      -

        - q: [-0.3165 -2.271 -3.139 -2.193 -0.6474 0.624 0.1024]
        ?           +     ^^      ^      ^      ^

Bah. The above errors appear to be deliberate UB on the part of eigen. I'll see if I can make them defined temporarily, to allow analysis to go further.

Oh rats. I was hoping it was something we did in ArticulatedBodyInertiaCache which appeared in the middle of that stack.

Nope. I patched the eigen UB, and can't find any more problems with UBSAN, at least in its current configuration.

Hm. Here's a curious property. The c++ repro implementation (Reuse style, discrete) I have differs at time 0.001, while the python implementation differs at time 0.002. The nominal simulation run-time was 1 second. When I reduce the run-time to 0.009 in both implementations the divergence goes away. At run-time 0.010 in both implementations the divergence reaapears. In all cases, the time step of the divergence remains the same, if it occurs.

Just to check, "divergence goes away" implies that both have non-unique trajectories, but at the same time?

Easier to think of this in terms of steps rather than time. Step 0 is Simulator::Initialize(). Then if the steps are discrete, we can talk about step 1, step 2, etc. The actual times should be irrelevant.

Trickier with variable-step integration (the default for continuous simulation). However those can be single stepped with some advanced methods (or in the debugger).

@EricCousineau-TRI -- yes, if I only run to end_time 0.009, the two trials match in both rigs. I'm only doing 2 trials and 1 meta-trial, but the divergences have been quite stable over various other changes. Hardware is my local laptop.

So here's a bit more information. If instead of tweaking end_time values, I just comment out two lines in systems/analysis/integrator_base.cc, I can use any value of end_time and have Reuse, Discrete, NoGeometry work.

diff --git a/systems/analysis/integrator_base.cc b/systems/analysis/integrator_base.cc
index d0d856647..59b15cbc6 100644
--- a/systems/analysis/integrator_base.cc
+++ b/systems/analysis/integrator_base.cc
@@ -396,8 +396,8 @@ typename IntegratorBase<T>::StepResult
   // If the stop time (boundary time) is sooner than the candidate, use it
   // instead.
   if (boundary_time < target_time) {
-    candidate_result = IntegratorBase<T>::kReachedBoundaryTime;
-    target_time = boundary_time;
+    // candidate_result = IntegratorBase<T>::kReachedBoundaryTime;
+    // target_time = boundary_time;
   }

   // If there is no continuous state, there will be no need to limit the

I'm pretty sure that breaks other stuff, but it is concrete information. Want to know how I found it? Answer beneath the fold:

============ fold ==================
Having found the behavior difference between end_time=0.009 and end_time=0.010, I ran each of those with --config=kcov, saved the output bazel-kcov/ trees separately, and (somewhat tediously) compared them to find the differences. The only differing branch executions were the one indicated, and some consequences in simulator.cc.

Because of roundoff error, setting an end (boundary) time to .009 does not guarantee that 9 discrete updates of .001 each will occur. In fact in floating point world 9.001 > .009 by a _teeny_ bit. So without Rico's change to the code that ensures we don't go past the boundary time, the 9th update doesn't occur. Commenting that code out allows time to reach 9.001 instead so the final update does occur. Still not clear why that affects the behavior on the next run, but at least it is a substantive difference!

Agreed. It's easy to believe that the running to 0.009 with the code commented/uncommented would produce different answers (because the last step would be slightly smaller than all previous steps). But why that would affect restarting? Weird.

I ran each of those with --config=kcov, saved the output bazel-kcov/ trees separately, and (somewhat tediously) compared them to find the differences.

That's.... just.... genius!

https://github.com/rpoyner-tri/drake/tree/pr13505%2Bfixes is a branch that fixes all the style=Reuse cases by adding a (missing, but who would have known?) simulator.Initialize() call. h/t Sherm for the patch idea.

There are still other discrete cases that diverge. As things go on, I may have to reorganize the GH issues to better represent what's happening.

Calling Initialize is not incumbent upon users -- AdvanceTo shall call it implicitly anytime it's needed.

Calling Initialize is not incumbent upon users -- AdvanceTo shall call it implicitly anytime it's needed.

In this case the same Simulator was being used to run a second simulation. There was no way for the Simulator to know that this wasn't just a continuation of the previous simulation. We could attempt to infer that from seeing time go substantially backwards in a Context. However it doesn't seem unreasonable to me to require simulator.Initialize() explicitly when intentionally re-starting a simulation.

One plausible heuristic would be to say that if AdvanceTo() sees _any_ external time change in its internal Context, it should call Initialize() before proceeding. Maybe safer would be to throw a message saying that an Initialize() is required.

@jwnimmer-tri as usual you have a strong opinion. How do I go about gathering requirements? Who are the relevant stakeholders?

TBH I think Sherm's heuristic would totes work?
e.g. your time should progress a certain way, and if it goes opposite your desired direction, require re-initialize to be called (explicitly via an error, or implicitly)?

Not sure if this would hurt any other workflows, though (or cause high overhead).

Even better is to just trigger a specific flag when calling the uber mega method Context.SetTimeStateAndParametersFrom, or make a specific method, like Context.ResetFrom(Context other).

But yeah, it may be good to weigh some diff. options and the Pros/Cons.

I will reply once the email hiatus concludes (on Monday).

Calling Initialize is not incumbent upon users -- AdvanceTo shall call it
implicitly anytime it's needed.

In this case the same Simulator was being used to run a second
simulation. There was no way for the Simulator to know that this wasn't just
a continuation of the previous simulation.

As I understand the experiment, almost every single value stored in the Context
changed between the final AdvanceTo of the prior simulation, and first
AdvanceTo of the next simulation. I cannot fathom the claim that the Simulator
cannot deduce that its class-member data is no longer accurate after such a
huge Context mutation.

We could attempt to infer that from seeing time go substantially backwards in
a Context. However it doesn't seem unreasonable to me to require
simulator.Initialize() explicitly when intentionally re-starting a
simulation.

A wise man once said:

  1. Accuracy
  2. Breadth
  3. Speed
    In that order!

If an expert user was unable to get accurate results from using the obvious
APIs in the obvious way, that is a defect in the Simulator, not a mistake by
the user.

The first question is _why_ the output trajectories differ if Initialize was
not called. I have not yet seen a write up explaining the root cause; only
that calling Initialize papers over the problem. The essential, first question
is "Where is there _hidden state outside of the Context_ that the user has to
care about?" I assume it was class-member data in the Simulator or Integrator?
The most important thing is to deliver accurate results. If we have hidden
state outside of the Context that is derived from (but not contingent on) the
Context data, then we cannot deliver accurate results.

Then the second question is "What user mutation was the (missed) signal that
the hidden state was now stale?" We have a caching system to ensure that
computed values are marked out of date when their upstream dependencies change.
Here, the user called a method on the Context to reset the initial conditions.
Which particular update of which particular value was causal for computing
incorrect results? We must _not_ use "time moved backwards" as a proxy for "a
specific fact that I actually care about has changed". It might be highly
correlated with the correct dependency, but I can't see how it's sound
heuristic. If the user changed q or v but not t, we should still compute the
correct results, without needing to manually call Initialize.

I'll quote some Simulator class docs:

We recommend calling Initialize() explicitly prior to beginning a simulation
so that error conditions will be discovered early. However, Initialize() will
be called automatically by the first AdvanceTo() call if it hasn't already
been called.

This is great! There is some sanity checking done by Initialize(); if the user
just begins advancing, it'll check it before beginning, but the user can force
it early in case that's better for the error reporting.

If you make a change to the Context or to Simulator options between AdvanceTo
calls you should consider whether to call Initialize before resuming;
AdvanceTo will not do that automatically for you. Whether to do so depends on
whether you want the above initialization operations performed. For example,
if you changed the time you will likely want the time-triggered events to be
recalculated in case one is due at the new starting time.

This is embarrassing and broken. We know exactly which computations depend on
which inputs. To produce accurate results, we must ensure that we invalidate
all intermediate computations that may no longer be correct. We must be sound
first, and performant second -- if we can't rule out that a change left a
derived value unperturbed, then we must recompute the derived value, even if its
slower.

Even things like "Which events have been observed / Which events have fired" is
properly conceived as state in the Context. Storing it in the Simulator
(especially without any accessors for it!) is broken.

In short -- Initialize must only ever be a convenience for users to begin a
simulation, just without actually advancing time. We could think of it as
AdvanceTo(context.get_time). It must not affect accuracy, and it must not be
incumbent on the user to meditate on whether or not they should call it in
order to receive accurate results.

The thing that jumped out at me most in what @jwnimmer-tri wrote is something implied about the semantics of Initialize. It seems that Initialize does both internal (things private to Simulator) and external (events declared by systems in the Diagram). We should always initialize the diagram when a simulation on that diagram "starts". The user should be aware of this fact and should have the expectation that it happens. Whatever internal state Simulator has and what book-keeping/clean-up it needs to do should be invisible to the users. I'm inferring from the observed behavior that the two are conflated.

Along those lines, I also believe Initialization-triggered System events are broken and evil. I see no purpose for them other than to confuse and introduce problems akin to those noted upthread. Things like "tell the visualizer about the geometries" are fundamentally wrong and broken when formulated as initialization events. If the visualization publisher needs to sync its framework's model of the remote endpoint's current status with a remote endpoint's actual status, that is a publisher-specific protocol, not a one-shot initialization when the user deigns to say "begin".

Honestly I don't understand the fuss over this. Of course the Simulator has internal state, like any other interruptable computation. That's what allows us to make repeated AdvanceTo() calls to step through a simulation. I would argue that the real source of the problem is the hidden call to Initialize() that we make on first call to AdvanceTo(). If we hadn't done that we could have a simple rule for every simulation: Initialize(), AdvanceTo() ... AdvanceTo(), Done(). Then it would be obvious that if you start a new simulation you need to call Initialize() again.

Also, the Simulator's job is to advance _time_. Everything else in the Context is free to change due to external actions such as event handlers or modifications between AdvanceTo() calls. That's why time is special and is the only thing we could use to infer that someone wants to start a new simulation.

Along those lines, I also believe Initialization-triggered System events are broken and evil.

This may be true; I'm not sure. To be clear though, misuse of Initialization-triggered events is an application issue. No one is obligated to (mis)use them! The Simulator provides that feature and its exact effects are well documented. Personally I think they are necessary and useful. Is there a missing Simulator feature that would make life better for users?

We're now definitely past that point where written communication is effective; your replies indicate you've entirely missed the point. Please schedule a call if this issue is still relevant / being actively worked on. If Rico takes notes, then we can ensure the communication is captured in writing here for the record.

This issue is about nondeterministic results and is still under active investigation (in Rico's capable hands). Adding a call to Initialize() addressed only a small subset of the reported problems. Larger topics raised above should be discussed elsewhere.

I am content to pause the discussion while further investigation continues, as long as we agree that "Call Initialize" is a _mitigation_ in order to continue exploring other causes -- in other words that Initialize is _not a resolution_ to any kind of non-determinism. Above, when Rico wrote "fixes all the style=Reuse cases" and "missing Initialize call", I interpreted that to mean it was viewed as a resolution.

Sorry, I don't yet agree with your statement above. However, I do agree this isn't the right place to discuss it. The work to uncover the sources of nondeterminism has barely begun; I'd like to understand all the causes before arguing about cures.

I did use the word "fix" above, but only intended the sense of "makes trials match numerically". I have been agnostic on the question of whether an explicit Initialize() call should be an approved or complete solution.

Also, FWIW, the phenomena I have seen have been quite stable. i wonder if "divergence" is a better term than "non-determinism". There is no evidence from dynamic analysis of any memory violations or other UB.

FTR I've incorporated Rico's fix in 75b1ebf8, which you can see here (and you can see the change from r5 to r6):
https://github.com/RobotLocomotion/drake/pull/13505#pullrequestreview-440539223

Just want to add that this issue, #13069 , for a previous bugged version of an integrator, is related -- many of our Integrator tests do not call Initialize() between simulation runs: within a test, we use the same simulator, run it, then change a few things (including the time), then run it again without calling Initialize(). We've fixed it by making sure the methods-for-changing-simulator-settings called in the tests actually also reset the cached states upon the change, but it's not as future-proof as actually calling Initialize(). Whatever future design we decide on, the tests mentioned in #13069 may be affected.

Renaming this to indicate the scope of the investigation; specific phenomena may get split out to new tickets.

In that case, should #13507 be closed? (or noted as a "sub-issue"?)

@EricCousineau-TRI yes, but i'm no github issue wizard. Is there a technique or best practice for linking issues in GH issues? AFAICT, there is no platform support for linking or dependency.

Oh, I was just thinking of adding the issue as a checkbox to this issue, and updating the overview there. Nothin' fancy.
Will do that now.

status update: I have a branch that demonstrates suppression of all symptoms via various hacks: https://github.com/rpoyner-tri/drake/releases/tag/repro-contact-nondet-symptom-hacks .

2 of those hacks are subsumed and merged already in #13736 and #13737. The remaining issue is what to do about the simulator.Initialize() work-around. Since both @jwnimmer-tri and @EricCousineau-TRI feel that this needs more work, I will open a specific issue for that.

@EricCousineau-TRI can you confirm that you now get repeatable results with your original test program?

Was this page helpful?
0 / 5 - 0 ratings