Drake: confusion around when simulator.Initialize() is necessary

Created on 24 Jul 2020 · 15Comments · Source: RobotLocomotion/drake

This is a follow-on from #13506; much relevant discussion is contained there.

In the course of investigating non-repeatable results found by @EricCousineau-TRI 's test program in #13505, we discovered that a carefully placed call to simulator.Initialize() could suppress some symptoms. However, there are other examples of reuse of a simulator object that don't apparently require calling Initialize(); see unit tests associated with #13069. So open questions remain:

[x] what is the root cause of non-repeatability in this case that calling Initialize() treats?
[x] what user-visible change of state/configuration went unnoticed that apparently invalidated some internal state?
[x] can changes to API documentation sufficiently clarify the situation?
[x] are code changes necessary to make this workflow less error-prone?

simulator dynamics

Source

rpoyner-tri

👍2

Most helpful comment

My 2c: Since the main objective of (1) is user awareness of a hazard, it's probably fair to start with (1) being only a logging::Warn once to display a console warning, with the promise (in the warning message) that it will harden into an exception three months from now.

jwnimmer-tri on 5 Aug 2020

👍2

All 15 comments

/cc @jwnimmer-tri @sherm1 as interested parties. /cc @antequ highlighted the possible need to revisit tests from #13069.

rpoyner-tri on 24 Jul 2020

To zoom in, here's Jeremy's comment about it directly:
https://github.com/RobotLocomotion/drake/issues/13506#issuecomment-651808088

And here is Sherm's contrary comment:
https://github.com/RobotLocomotion/drake/issues/13506#issuecomment-651899865

EricCousineau-TRI on 24 Jul 2020

👍2

Observations toward a root cause

One postcondition of Initialize() is that the member field time_or_witness_triggered_ (enum-typed bit field) has either no bits set, or it has kTimeTriggered set. For the discrete cases in the repeatability suite (from #13505 ), Initialize() sets kTimeTriggered.

The member field then evolves by receiving the return value of the integrator. Eventually the integrator will return a no-bits-set value. If the trial is then ended and restarted using the Reuse style coded in the repeatability suite without calling Initialize(), the following run will diverge (presumably by a lapse of merging a timed event?).

The other contributing factor is that the Reuse style avoids the many paths through the simulator that could clear the initialization_done_ flag and hence trigger a full initialization on the next AdvanceTo() call.

rpoyner-tri on 28 Jul 2020

It may also be worth mentioning that initialization events can only be triggered by an Initialize() call. If anything in the Diagram has initialization events, only the first simulation will trigger those (via the implicit Initialize() call at the beginning of AdvanceTo()). That could also cause a divergence between the initialized simulation and the non-initialized Reuse.

sherm1 on 28 Jul 2020

Relates #10015

EricCousineau-TRI on 28 Jul 2020

Having fiddled with the innards of the event system a bit more, I'm less interested in the details of that, and more interested in the asymmetry between doing surgery on a simulation-owned context via a saved reference (Eric's example), vs. calling (say) simulator.reset_context(). it seems to me we have a half-closed abstraction that allows "slicing" (inability to recognize invalidation of internal state). How radical would it be to have the simulator watch owned context changes, and invalidate itself based on what it observed?

rpoyner-tri on 29 Jul 2020

I'm skeptical that we could correctly infer whether the user wants a full re-initialization (with initialization events processed). Many external changes to the Context are allowed during a simulation, from both discrete state update events and application-level changes (e.g. to parameters) done between calls to AdvanceTo(). The only thing that is certainly invalidating is a change to time; that is the one thing that the Simulator is supposed to control always. Yet other Context changes _may_ indicate a need for re-initialization; only the user would know that.

In addition to the Simulator's internal state, some of the integrators also maintain internal state (implicit integrators save Jacobians for reuse because they are so expensive to calculate). Those have the character of "hints" (bad Jacobians are detectable at run time and recomputed) but the saved state can cause numerical differences in computed results. It is also conceivable that LeafSystem discrete updates could have such numerical hints saved.

I do not think there is going to be a lovely solution here where we are able to infer user intent with no hint from the user. A _lot_ of work has gone into the design and documentation of this hybrid simulator to make it behave properly and to provide extensive documentation about how it works, which must satisfy many constraints. Regarding the pending events at the end of AdvanceTo(), that is clearly documented and the method AdvancePendingEvents() exists precisely to tidy those up if the user deems that necessary.

IMO we should not dive any further into this rabbit hole except possibly to improve the documentation if someone can figure out how to make it clearer. Maybe some judiciously-placed warnings? Evan and I spent literally months agonizing over the documentation for the Simulator, for example this. There are only a few hybrid simulators around -- I think ours compares favorably in usability and documentation to Simulink's for example.

sherm1 on 29 Jul 2020

On the subject of documentation -- some tutorials might carry the most weight. In my mind, I imagine a tutorial of "Working with simulators" where we show:

Vanilla usage (i.e., one-shot AdvanceTo).
Interrupted usage where we modify the context in a way that doesn't require re-initialization.
Interrupted usage where we modify the context in a way that does require re-initialization.
Configuring integrators.
Other stuff.

If the examples explain the principle in each case of why re-initialization is/isn't required that might be more informative than burying it in documentation somewhere.

SeanCurtis-TRI on 29 Jul 2020

In #13506 we started talking about Initialize but decided to postpone the discussion while other factors were chased down. But in the thread there about Initialize, I took the time to explain why documentation (or even tutorials) are nowhere near sufficient. Sherm's reply shows that I was unable to explain my thoughts clearly, because that subsequent discussion (including the new tangent on hybrid simulators above) misses the point.

Since I am unable to get my point across in writing, I had asked before to schedule a teleconference so that I can try to do a better job interactively. Now that this topic is unpaused, please take up the initiative to do that.

jwnimmer-tri on 29 Jul 2020

Meeting scheduled for 2020/07/30, 4pm EDT.

rpoyner-tri on 29 Jul 2020

Per f2f with @rpoyner-tri @jwnimmer-tri @EricCousineau-TRI:

We agreed that the following Simulator changes will sufficiently resolve this issue.

During a simulation, between AdvanceTo() calls, if someone changes the _time_ in the Simulator's internally-owned Context, throw an error message indicating that re-initialization is necessary before proceeding. This should be checked at the start of AdvanceTo() and will require an additional Simulator class member like double expected_time_; that should record context.get_time() just prior to exiting AdvanceTo().
Add a flag to the Simulator's Initialize() method:
Initialize(bool should_trigger_initialization_events = true). A user is free either to do a full initialization or to suppress the initialization events.

This does not necessarily resolve all possible issues with repeatability (see discussion above) but solves all the actual problems we know about currently.

sherm1 on 31 Jul 2020

As currently specifiied, (1) above will be a breaking change. Do we need a deprecation strategy? Do we just break things and document it in release notes?

rpoyner-tri on 5 Aug 2020

jwnimmer-tri on 5 Aug 2020

👍2

FWIW I'd also be fine if it failed fast to start; the fix is pretty simple? (call initialize?)

EricCousineau-TRI on 5 Aug 2020

Having merged #13851, I will claim the most important piece of this ticket is satisfied. I will open a separate ticket for sub-item 2 "suppress initialization events"; it needs more investigation. At the very least, we need a defensible use case for the feature.

rpoyner-tri on 11 Aug 2020

🎉1

Was this page helpful?

0 / 5 - 0 ratings