Running ./train.py tuned_examples/pong-apex.yaml --redis-address=localhost:6379 crashes almost instantly on a multi-node x-ray cluster. Reaching throughput high enough to crash reliably requires a GPU cluster (or you can patch rllib to take the GPU out of the critical path).
The actual error you see is local_scheduler_client.cc:306 Check failed: static_cast<ray::protocol::MessageType>(type) == ray::protocol::MessageType::WaitReply, but this just means the raylet crashed beforehand.
@atumanov and I were able to get a core dump and find the root cause:
```
(gdb) where
at /home/ubuntu/ray/src/ray/status.cc:84
connection_info=...) at /home/ubuntu/ray/src/ray/object_manager/object_manager.cc:301
at /home/ubuntu/ray/src/ray/object_manager/object_manager.cc:263
at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/handler_invoke_hook.hpp:69
at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:37
at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/detail/completion_handler.hpp:68
at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/detail/task_io_service_operation.hpp:38
at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/detail/impl/task_io_service.ipp:372
```
Hey Eric, can you check whether something like this fixes the segfault?
https://github.com/ray-project/ray/pull/2548
We found it while looking into a valgrind error with Stephanie: it turns out that the error strings from boost::system are corrupted (although we don't know why).
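To illustrate the general idea (this is only a sketch, not necessarily what #2548 does): one defensive pattern is to copy the error message into an owned std::string right where the failure is detected, so nothing downstream depends on storage owned by the error category. The sketch uses std::error_code as a stand-in for boost::system::error_code, and SketchStatus and StatusFromErrorCode are hypothetical names, not Ray APIs.

```cpp
#include <string>
#include <system_error>

// Hypothetical stand-in for a status type; this is NOT ray::Status.
struct SketchStatus {
  bool ok;
  std::string message;  // owned copy, independent of the error category
};

// Convert an error code into a status at the failure site.
// std::error_code is used here as an assumption-level substitute for
// boost::system::error_code, which exposes the same value()/message() calls.
SketchStatus StatusFromErrorCode(const std::error_code &ec) {
  if (!ec) {
    return SketchStatus{true, ""};
  }
  // Include the category name and numeric value so the error can still be
  // identified even if the human-readable text turns out to be unreliable.
  std::string msg = std::string(ec.category().name()) + ":" +
                    std::to_string(ec.value()) + " " + ec.message();
  return SketchStatus{false, msg};
}
```

For example, calling StatusFromErrorCode(std::make_error_code(std::errc::broken_pipe)) gives a status with ok == false and a message that starts with the category name and numeric code. Logging the numeric code alongside the text would help confirm or rule out the corrupt-string hypothesis.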
If it's what @pcmoritz suspects, the write fails and this line is invoked: https://github.com/ray-project/ray/blob/master/src/ray/object_manager/object_manager.cc#L341
I think it makes sense to see what happens when running the above code with #2548, as this will either reveal the underlying error or rule out the corrupt error string hypothesis.
With that said, I believe the underlying issue may be fixed by #2557, which may be worth trying out first.
Seems some combination of https://github.com/ray-project/ray/pull/2548 and https://github.com/ray-project/ray/pull/2557 fixes this issue.