Currently, the LoadRobot message is sent once. If it is lost because the processes start in the wrong order (i.e., Drake starts and sends the message before DrakeVisualizer starts), DrakeVisualizer has no way to recover. Periodically re-sending LoadRobot messages is not currently an option, because each one triggers an expensive operation within DrakeVisualizer: deleting all of its prior state and rebuilding it from scratch.
Longer term, we'll want to generalize the communication protocol to allow models to be dynamically added to and removed from DrakeVisualizer at run time during a simulation.
For context, see: https://reviewable.io/reviews/robotlocomotion/drake/3408#-KRU4i-92Vn8ZuBlBGwr
The solution to this issue may assume reliable LCM communication, for example by requiring that LCM communicate only over the local loopback network interface.
More generally, drake-visualizer could likely benefit from an interface with a detail level somewhere between LoadRobot/DrawRobot and LCMGL. I may have a look at all this while working on the cars + road networks effort over the next week or two.
There is a work-around that one can do in launcher scripts to make sure that drake-visualizer is ready when the simulation process starts. I'll comb my notes and try to post more info about that here.
Ah, yes, the work-around. When drake-visualizer is ready, it publishes a message on LCM channel DRAKE_VIEWER_STATUS, of type drake.lcmt_viewer_command, with command_type STATUS (== 0) and the command_data "loaded". A tolerably correct launcher should wait for this event before launching a program that sends DRAKE_VIEW_LOAD_ROBOT.
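A minimal sketch of the wait-for-ready logic described above, in Python. It is not taken from Drake or director. The `poll` callable is a hypothetical stand-in for whatever receives and decodes a message from DRAKE_VIEWER_STATUS (for example, a thin wrapper around the LCM Python bindings); here it simply returns a `(command_type, command_data)` pair, or `None` if nothing arrived in time. The names `is_ready` and `wait_for_viewer` are illustrative only.

```python
import time
from typing import Callable, Optional, Tuple

STATUS = 0             # command_type value quoted in the comment above
READY_DATA = "loaded"  # command_data value quoted in the comment above

# A decoded status message, reduced to the two fields that matter here.
ViewerCommand = Tuple[int, str]


def is_ready(msg: ViewerCommand) -> bool:
    """True if the message is the viewer's 'loaded' status announcement."""
    command_type, command_data = msg
    return command_type == STATUS and command_data == READY_DATA


def wait_for_viewer(poll: Callable[[float], Optional[ViewerCommand]],
                    timeout_s: float = 10.0) -> bool:
    """Block until a ready status is seen or timeout_s elapses.

    poll(max_wait_s) is assumed to return the next decoded status
    message, or None if none arrived within max_wait_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        msg = poll(min(remaining, 0.1))
        if msg is not None and is_ready(msg):
            return True
```

A launcher would call `wait_for_viewer(...)` after starting drake-visualizer and only start the simulation process once it returns `True`.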
Instead, lazy folk like me have written code like https://github.com/RobotLocomotion/drake/blob/master/drake/examples/Cars/run_demo_multi_car.sh#L30
Since LCM is based on UDP, there's no guarantee that the DRAKE_VIEW_LOAD_ROBOT message will get through even after waiting for a DRAKE_VIEWER_STATUS message. This unreliability suggests that the DRAKE_VIEW_LOAD_ROBOT message should be re-sent if Drake can determine that the previous one didn't work, perhaps by analyzing the values within the DRAKE_VIEWER_STATUS messages it continues to receive periodically.
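The re-send idea could look roughly like the sketch below. Both callables are hypothetical: `send_load` stands for publishing the load message, and `viewer_confirms_load` stands for whatever check on subsequent status messages would indicate that the load took effect. This thread does not say which status values would carry that information, so that part is purely an assumption.

```python
from typing import Callable


def load_with_retry(send_load: Callable[[], None],
                    viewer_confirms_load: Callable[[], bool],
                    max_attempts: int = 5) -> int:
    """Send the load message until the viewer confirms it.

    Returns the number of attempts used. Raises TimeoutError if the
    viewer never confirms within max_attempts sends.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        send_load()
        # Hypothetical check: inspect the next status message(s) and
        # decide whether the previous load message was processed.
        if viewer_confirms_load():
            return attempt
    raise TimeoutError(
        f"viewer did not confirm load after {max_attempts} attempts")
```

Bounding the number of attempts matters here because each load that does arrive is expensive for the viewer, so blind periodic re-sending is what this is trying to avoid.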
In theory the VIEWER_STATUS might be lost. In practice, UDP on localhost will be perfect (enough) to not worry about packet loss. The only hazard to worry about is application launch order and process (re)start, which is solved by choosing different message primitives, not adding more ceremony around the current ones.
Are we assuming / requiring that Drake and DrakeVisualizer always run on the same host? If we require that they run on the same host, should we make them run in the same process to avoid these types of message-loss / IPC complications?
Unless you pass custom parameters to LCM, your UDP messages will have ttl=0 and never leave the local host.
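For reference, a small sketch of what "custom parameters" means in practice. As far as I know, the stock LCM provider URL is `udpm://239.255.76.67:7667?ttl=0`, and the `LCM_DEFAULT_URL` environment variable overrides it; treat both as assumptions to check against your LCM version. The helper names are made up for illustration.

```python
import os
from typing import Dict, Mapping, Optional

LCM_GROUP = "239.255.76.67"  # assumed LCM default multicast group
LCM_PORT = 7667              # assumed LCM default port


def lcm_url(ttl: int = 0) -> str:
    """Build a udpm provider URL; ttl=0 keeps packets on the local host."""
    if not 0 <= ttl <= 255:
        raise ValueError("ttl must be in [0, 255]")
    return f"udpm://{LCM_GROUP}:{LCM_PORT}?ttl={ttl}"


def env_with_ttl(ttl: int,
                 base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with LCM_DEFAULT_URL set."""
    env = dict(os.environ if base is None else base)
    env["LCM_DEFAULT_URL"] = lcm_url(ttl)
    return env
```

A launcher could pass `env_with_ttl(1)` as the `env` argument of `subprocess.Popen` for both processes if messages really do need to cross hosts, at which point packet loss stops being a purely theoretical concern.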
Obviously, it would be nice to improve the communication protocol so that it does not require IPC messaging to remain local and/or be reliable.
So we have several issues:
As for 1, we can probably make progress, but it will involve PRs against director.
I do not lose sleep over 2.
If 3 is a problem we need to solve, there are loads of ways to do it, but they require actual thought, because distributed anything is hard. I would argue that it is out of scope for this issue.
OK. I renamed this issue to focus on brittleness with respect to process starting order, and I created #3422 to handle brittleness due to message loss.
I consider any work on this issue blocked, pending resolution of #3344.
Based on @rpoyner-tri's summary above, I'm mildly skeptical there's actual work to do here, but I've reassigned it to @SeanCurtis-TRI to assess and prioritize - the rendering APIs domain space has been lumped into his portfolio along with GeometryWorld, even though this particular issue has little to do with geometry.
I recommend closing this in favor of #4343.