This is a follow-on from #13506; much relevant discussion is contained there.
In the course of investigating non-repeatable results found by @EricCousineau-TRI 's test program in #13505, we discovered that a carefully placed call to simulator.Initialize() could suppress some symptoms. However, there are other examples of reuse of a simulator object that don't apparently require calling Initialize(); see unit tests associated with #13069. So open questions remain:
/cc @jwnimmer-tri @sherm1 as interested parties. /cc @antequ highlighted the possible need to revisit tests from #13069.
To zoom in, here's Jeremy's comment about it directly:
https://github.com/RobotLocomotion/drake/issues/13506#issuecomment-651808088
And here is Sherm's contrary comment:
https://github.com/RobotLocomotion/drake/issues/13506#issuecomment-651899865
Observations toward a root cause
One postcondition of Initialize() is that the member field time_or_witness_triggered_ (enum-typed bit field) has either no bits set, or it has kTimeTriggered set. For the discrete cases in the repeatability suite (from #13505 ), Initialize() sets kTimeTriggered.
The member field then evolves by receiving the return value of the integrator. Eventually the integrator will return a no-bits-set value. If the trial is then ended and restarted using the Reuse style coded in the repeatability suite without calling Initialize(), the following run will diverge (presumably by a lapse of merging a timed event?).
The other contributing factor is that the Reuse style avoids the many paths through the simulator that could clear the initialization_done_ flag and hence trigger a full initialization on the next AdvanceTo() call.
It may also be worth mentioning that initialization events can only be triggered by an Initialize() call. If anything in the Diagram has initialization events, only the first simulation will trigger those (via the implicit Initialize() call at the beginning of AdvanceTo()). That could also cause a divergence between the initialized simulation and the non-initialized Reuse.
Relates #10015
Having fiddled with the innards of the event system a bit more, I'm less interested in the details of that, and more interested in the asymmetry between doing surgery on a simulation-owned context via a saved reference (Eric's example), vs. calling (say) simulator.reset_context(). it seems to me we have a half-closed abstraction that allows "slicing" (inability to recognize invalidation of internal state). How radical would it be to have the simulator watch owned context changes, and invalidate itself based on what it observed?
I'm skeptical that we could correctly infer whether the user wants a full re-initialization (with initialization events processed). Many external changes to the Context are allowed during a simulation, from both discrete state update events and application-level changes (e.g. to parameters) done between calls to AdvanceTo(). The only thing that is certainly invalidating is a change to time; that is the one thing that the Simulator is supposed to control always. Yet other Context changes _may_ indicate a need for re-initialization; only the user would know that.
In addition to the Simulator's internal state, some of the integrators also maintain internal state (implicit integrators save Jacobians for reuse because they are so expensive to calculate). Those have the character of "hints" (bad Jacobians are detectable at run time and recomputed) but the saved state can cause numerical differences in computed results. It is also conceivable that LeafSystem discrete updates could have such numerical hints saved.
I do not think there is going to be a lovely solution here where we are able to infer user intent with no hint from the user. A _lot_ of work has gone into the design and documentation of this hybrid simulator to make it behave properly and to provide extensive documentation about how it works, which must satisfy many constraints. Regarding the pending events at the end of AdvanceTo(), that is clearly documented and the method AdvancePendingEvents() exists precisely to tidy those up if the user deems that necessary.
IMO we should not dive any further into this rabbit hole except possibly to improve the documentation if someone can figure out how to make it clearer. Maybe some judiciously-placed warnings? Evan and I spent literally months agonizing over the documentation for the Simulator, for example this. There are only a few hybrid simulators around -- I think ours compares favorably in usability and documentation to Simulink's for example.
On the subject of documentation -- some tutorials might carry the most weight. In my mind, I imagine a tutorial of "Working with simulators" where we show:
AdvanceTo).If the examples explain the principle in each case of why re-initialization is/isn't required that might be more informative than burying it in documentation somewhere.
In #13506 we started talking about Initialize but decided to postpone the discussion while other factors were chased down. But in the thread there about Initialize, I took the time to explain why documentation (or even tutorials) are nowhere near sufficient. Sherm's reply shows that I was unable to explain my thoughts clearly, because that subsequent discussion (including the new tangent on hybrid simulators above) misses the point.
Since I am unable to get my point across in writing, I had asked before to schedule a teleconference so that I can try to do a better job interactively. Now that this topic is unpaused, please take up the initiative to do that.
Meeting scheduled for 2020/07/30, 4pm EDT.
Per f2f with @rpoyner-tri @jwnimmer-tri @EricCousineau-TRI:
We agreed that the following Simulator changes will sufficiently resolve this issue.
During a simulation, between AdvanceTo() calls, if someone changes the _time_ in the Simulator's internally-owned Context, throw an error message indicating that re-initialization is necessary before proceeding. This should be checked at the start of AdvanceTo() and will require an additional Simulator class member like double expected_time_; that should record context.get_time() just prior to exiting AdvanceTo().
Add a flag to the Simulator's Initialize() method:
Initialize(bool should_trigger_initialization_events = true). A user is free either to do a full initialization or to suppress the initialization events.
This does not necessarily resolve all possible issues with repeatability (see discussion above) but solves all the actual problems we know about currently.
As currently specifiied, (1) above will be a breaking change. Do we need a deprecation strategy? Do we just break things and document it in release notes?
My 2c: Since the main objective of (1) is user awareness of a hazard, it's probably fair to start with (1) being only a logging::Warn once to display a console warning, with the promise (in the warning message) that it will harden into an exception three months from now.
FWIW I'd also be fine if it failed fast to start; the fix is pretty simple? (call initialize?)
Having merged #13851, I will claim the most important piece of this ticket is satisfied. I will open a separate ticket for sub-item 2 "suppress initialization events"; it needs more investigation. At the very least, we need a defensible use case for the feature.
Most helpful comment
My 2c: Since the main objective of (1) is user awareness of a hazard, it's probably fair to start with (1) being only a
logging::Warn onceto display a console warning, with the promise (in the warning message) that it will harden into an exception three months from now.