Drake: Improve Robustness of Drake→DrakeVisualizer LoadRobot Message In Terms of Starting Order of Processes

Created on 12 Sep 2016 · 13 Comments · Source: RobotLocomotion/drake

Problem Definition

Currently, the LoadRobot message is sent once and if it is lost due to incorrect process starting order (i.e., Drake starting and sending the message before DrakeVisualizer starts), DrakeVisualizer has no way to recover. Periodically re-sending LoadRobot messages is not currently an option since it results in an expensive operation within DrakeVisualizer involving deleting all of DrakeVisualizer's prior state and rebuilding it from scratch.

Longer term, we'll want to generalize the communication protocol to allow models to be dynamically added and removed from DrakeVisualizer at run-time throughout the duration of a simulation.

For context, see: https://reviewable.io/reviews/robotlocomotion/drake/3408#-KRU4i-92Vn8ZuBlBGwr

Allowed Assumptions

The solution to this issue may assume reliable LCM communication, ostensibly by requiring that LCM only communicate over the local loopback network interface.

medium feature request

Source

liangfok

All 13 comments

More generally, drake-visualizer could likely benefit from an interface with a detail level somewhere between LoadRobot/DrawRobot and LCMGL. I may have a look at all this during the course of work on the cars + road networks effort over the next week or two.

rpoyner-tri on 12 Sep 2016

👍1

There is a work-around that one can do in launcher scripts to make sure that drake-visualizer is ready when the simulation process starts. I'll comb my notes and try to post more info about that here.

rpoyner-tri on 12 Sep 2016

Ah, yes, the work-around. When drake-visualizer is ready, it publishes a message on LCM channel DRAKE_VIEWER_STATUS, of type drake.lcmt_viewer_command, with command_type STATUS (== 0) and the command_data "loaded". A tolerably correct launcher should wait for this event before launching a program that sends DRAKE_VIEW_LOAD_ROBOT.

Instead, lazy folk like me have written code like https://github.com/RobotLocomotion/drake/blob/master/drake/examples/Cars/run_demo_multi_car.sh#L30
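
The wait-for-status approach described in this comment could be sketched roughly as follows. This is only an illustration, not code from Drake or director: the channel name, command type, and "loaded" payload come from the comment above, while `wait_for_viewer`, `is_viewer_ready`, and the `receive` hook are hypothetical names. The `receive` hook stands in for whatever transport delivers decoded messages, so the sketch stays independent of any particular LCM binding.

```python
import time
from typing import Callable, Optional, Tuple

# Values quoted in the comment above.
STATUS_CHANNEL = "DRAKE_VIEWER_STATUS"
STATUS_COMMAND_TYPE = 0          # STATUS
READY_COMMAND_DATA = "loaded"

# A received message reduced to (channel, command_type, command_data).
ViewerMessage = Tuple[str, int, str]


def is_viewer_ready(msg: ViewerMessage) -> bool:
    """Return True if msg is the viewer's 'loaded' status announcement."""
    channel, command_type, command_data = msg
    return (channel == STATUS_CHANNEL
            and command_type == STATUS_COMMAND_TYPE
            and command_data == READY_COMMAND_DATA)


def wait_for_viewer(receive: Callable[[float], Optional[ViewerMessage]],
                    timeout_s: float = 30.0,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Block until the viewer reports ready, or until timeout_s elapses.

    receive(remaining_s) is a hypothetical hook: it should wait up to
    remaining_s seconds and return one decoded message, or None if
    nothing arrived in that time.
    """
    deadline = clock() + timeout_s
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return False             # viewer never announced itself
        msg = receive(remaining)
        if msg is not None and is_viewer_ready(msg):
            return True              # safe to launch the simulation now
```

A launcher would call `wait_for_viewer(...)` after starting drake-visualizer and start the simulation process only if it returns `True`. Note that this only helps if the launcher is already listening when the viewer publishes its status.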

rpoyner-tri on 12 Sep 2016

👍1

Since LCM is based on UDP, there's no guarantee that the DRAKE_VIEW_LOAD_ROBOT message will get through even after waiting for a DRAKE_VIEWER_STATUS message. This unreliability suggests that the DRAKE_VIEW_LOAD_ROBOT message should be re-sent if Drake can determine that the previous one didn't work, perhaps by analyzing the values within the DRAKE_VIEWER_STATUS messages it continues to periodically receive.
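
As a rough illustration of this idea, the sketch below re-sends the load message only when a status message suggests the viewer has no model, rather than on a fixed timer. It rests on assumptions that the thread does not confirm: the status payload that would indicate "model present" is not specified anywhere above, so `viewer_has_model` is a hypothetical predicate supplied by the caller, and `LoadRobotResender` and `send_load_robot` are likewise made-up names.

```python
from typing import Callable


class LoadRobotResender:
    """Re-send the load message while the viewer reports no model.

    Both callables are hypothetical hooks:
      send_load_robot()       publishes one load message.
      viewer_has_model(data)  inspects a status payload and decides
                              whether the viewer already holds the model.
    """

    def __init__(self, send_load_robot: Callable[[], None],
                 viewer_has_model: Callable[[str], bool],
                 max_attempts: int = 5) -> None:
        self._send = send_load_robot
        self._has_model = viewer_has_model
        self._max_attempts = max_attempts
        self.attempts = 0
        self.loaded = False

    def start(self) -> None:
        """Send the initial load message."""
        self._send_once()

    def on_status(self, command_data: str) -> None:
        """Handle one periodic status message from the viewer."""
        if self._has_model(command_data):
            self.loaded = True
            self.attempts = 0        # re-arm in case the viewer restarts later
            return
        self.loaded = False
        if self.attempts < self._max_attempts:
            self._send_once()        # previous load apparently did not take

    def _send_once(self) -> None:
        self.attempts += 1
        self._send()
```

Bounding the number of attempts keeps a misbehaving viewer from triggering the expensive rebuild indefinitely, which is the cost the issue description warns about.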

liangfok on 12 Sep 2016

In theory the VIEWER_STATUS might be lost. In practice, UDP on localhost will be perfect (enough) to not worry about packet loss. The only hazard to worry about is application launch order and process (re)start, which is solved by choosing different message primitives, not adding more ceremony around the current ones.

jwnimmer-tri on 12 Sep 2016

👍1

Are we assuming / requiring that Drake and DrakeVisualizer always run on the same host? If we require that they run on the same host, should we make them run in the same process to avoid these types of message-loss / IPC complications?

liangfok on 12 Sep 2016

Unless you pass custom parameters to LCM, your UDP messages will have ttl=0 and never leave the local host.
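
For reference, the "custom parameters" here are usually supplied through the LCM provider URL, which LCM reads from the `LCM_DEFAULT_URL` environment variable when no URL is passed explicitly. The helper below is a small sketch of building such a URL with a non-zero TTL; the function names are hypothetical, and the default multicast address shown is the one given in LCM's documentation as I understand it, so it is worth checking against the installed version.

```python
import os
from typing import MutableMapping

# LCM's documented default provider: UDP multicast, ttl=0 (host-local).
DEFAULT_LCM_URL = "udpm://239.255.76.67:7667?ttl=0"


def lcm_url_with_ttl(ttl: int, base_url: str = DEFAULT_LCM_URL) -> str:
    """Return base_url with its query replaced by the given multicast TTL.

    Any other query options on base_url are dropped; this sketch only
    handles the ttl option.
    """
    if not 0 <= ttl <= 255:
        raise ValueError("ttl must be in the range 0..255")
    address = base_url.split("?", 1)[0]
    return f"{address}?ttl={ttl}"


def configure_lcm_env(ttl: int,
                      env: MutableMapping[str, str] = os.environ) -> str:
    """Set LCM_DEFAULT_URL in env so that processes inheriting this
    environment use the chosen TTL. Returns the URL that was set."""
    url = lcm_url_with_ttl(ttl)
    env["LCM_DEFAULT_URL"] = url
    return url
```

With `ttl=0` packets stay on the local host; `ttl=1` lets them reach the local subnet, at which point packet loss becomes a more realistic concern.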

rpoyner-tri on 12 Sep 2016

Obviously, it would be nice to improve the communication protocol to not require that IPC messaging remain local and/or be reliable.

liangfok on 12 Sep 2016

So we have several issues:

1. the current drake visualizer protocol is brittle (which I interpret as the core of #3421)
2. LCM is not a reliable transport (but good enough on local host)
3. LCM's default configuration does not support remote operation

As for 1, we can probably make progress, but it will involve PRs against director.

I do not lose sleep over 2.

If 3 is a problem we need to solve, there are loads of ways to do it, but they require actual thought, because distributed anything is hard. I would argue that it is out of scope for this issue.

rpoyner-tri on 12 Sep 2016

OK. I renamed this issue to focus on brittleness with respect to process starting order, and I created #3422 to handle brittleness due to message loss.

liangfok on 13 Sep 2016

I consider any work on this issue blocked, pending resolution of #3344.

rpoyner-tri on 21 Sep 2016

👍1

Based on @rpoyner-tri's summary above, I'm mildly skeptical there's actual work to do here, but I've reassigned it to @SeanCurtis-TRI to assess and prioritize - the rendering APIs domain space has been lumped into his portfolio along with GeometryWorld, even though this particular issue has little to do with geometry.