Ray: [xray] Crash during object transfers

Created on 2 Aug 2018 · 3 Comments · Source: ray-project/ray

Describe the problem

Running ./train.py tuned_examples/pong-apex.yaml --redis-address=localhost:6379 will crash almost instantaneously on a multi-node x-ray cluster. This requires a GPU cluster to reach high enough throughputs to crash reliably (or you can patch rllib to take the GPU out of the critical path).

The actual error you see is local_scheduler_client.cc:306 Check failed: static_cast<ray::protocol::MessageType>(type) == ray::protocol::MessageType::WaitReply, but this just means the raylet crashed beforehand.
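To illustrate why the failed check is only a symptom, here is a minimal sketch of a client-side reply check that separates "the peer went away" from "the peer sent the wrong message". It assumes the read path reports a sentinel type once the socket is closed; that is an assumption, since the report does not show that code. `ReplyType` and `DiagnoseReply` are hypothetical names and are not part of Ray.

```cpp
#include <string>

// Hypothetical message types. Only WaitReply is named in the report;
// Disconnected is an assumed sentinel for "the socket was closed".
enum class ReplyType { Disconnected, WaitReply, Other };

// Describes a reply-type mismatch instead of aborting on it, so that a dead
// peer is reported differently from a genuine protocol error.
std::string DiagnoseReply(ReplyType expected, ReplyType actual) {
  if (actual == expected) {
    return "ok";
  }
  if (actual == ReplyType::Disconnected) {
    return "peer closed the connection; check the peer's logs for an earlier crash";
  }
  return "unexpected reply type";
}
```

With a check like this, the client-side message would point at the raylet's own logs or core dump rather than at the reply type.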

Source code / logs

@atumanov and I were able to get a core dump and find the root cause:

```
(gdb) where
#0  __memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:238
#1  0x00007fd761f9ce48 in std::__cxx11::moneypunct::curr_symbol() const () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007fd75d783c80 in ?? ()
#3  0x00007fd75d783c50 in ?? ()
#4  0x00007fd75d783c40 in ?? ()
#5  0x00007fd75d783c80 in ?? ()
#6  0x00007fd75d783c70 in ?? ()
#7  0x000000000047b9ac in std::string::operator+= (__str=..., this=0x21000) at /usr/include/c++/5/bits/basic_string.h:3348
#8  ray::Status::ToString (this=0x7fd761f9ce48 ::curr_symbol() const+128>, this@entry=0x7fd75d783c40)
    at /home/ubuntu/ray/src/ray/status.cc:84
#9  0x00000000004ba4a1 in ray::ObjectManager::ExecuteSendObject (this=0x7ffed25cdb80, client_id=..., object_id=..., data_size=15326968, metadata_size=0, chunk_index=2,
    connection_info=...) at /home/ubuntu/ray/src/ray/object_manager/object_manager.cc:301
#10 0x00000000004ba8ab in ray::ObjectManager::
    at /home/ubuntu/ray/src/ray/object_manager/object_manager.cc:263
#11 boost::asio::asio_handler_invoke
    at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/handler_invoke_hook.hpp:69
#12 boost_asio_handler_invoke_helpers::invoke
    at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#13 boost::asio::detail::completion_handler)
    at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/detail/completion_handler.hpp:68
#14 0x00000000004b7338 in boost::asio::detail::task_io_service_operation::complete (bytes_transferred=0, ec=..., owner=..., this=)
    at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/detail/task_io_service_operation.hpp:38
#15 boost::asio::detail::task_io_service::do_run_one (ec=..., this_thread=..., lock=..., this=0x11d1b90)
    at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/detail/impl/task_io_service.ipp:372
#16 boost::asio::detail::task_io_service::run (ec=..., this=0x11d1b90) at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/detail/impl/task_io_service.ipp:149
#17 boost::asio::io_service::run (this=) at /home/ubuntu/ray/thirdparty/pkg/boost/include/boost/asio/impl/io_service.ipp:59
#18 ray::ObjectManager::RunSendService (this=) at /home/ubuntu/ray/src/ray/object_manager/object_manager.cc:73
#19 0x00007fd761f85c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x00007fd7622566ba in start_thread (arg=0x7fd75d784700) at pthread_create.c:333
#21 0x00007fd7616eb41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```

bug

ericl

Most helpful comment

Seems some combination of https://github.com/ray-project/ray/pull/2548 and https://github.com/ray-project/ray/pull/2557 fixes this issue.

ericl on 4 Aug 2018

All 3 comments

Hey Eric, can you check whether something like this fixes the segfault?

https://github.com/ray-project/ray/pull/2548

While looking into a valgrind error with Stephanie, we found that the error strings from boost::system are corrupted (although we don't know why).

pcmoritz on 2 Aug 2018
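One way to avoid depending on a message string whose storage may have gone bad is to copy the error text into an owned `std::string` at the point where the error is observed, and to carry the numeric code and category alongside it. The sketch below is a hypothetical illustration only: it uses `std::error_code` from the standard library as a stand-in for `boost::system::error_code`, `SimpleStatus` and `FromErrorCode` are invented names rather than Ray's `ray::Status`, and it is not necessarily what #2548 does.

```cpp
#include <string>
#include <system_error>
#include <utility>

// Hypothetical stand-in for a status object; this is NOT ray::Status.
// It owns copies of everything it later prints.
class SimpleStatus {
 public:
  SimpleStatus() : code_(0) {}
  SimpleStatus(int code, std::string category, std::string message)
      : code_(code),
        category_(std::move(category)),
        message_(std::move(message)) {}

  bool ok() const { return code_ == 0; }
  int code() const { return code_; }

  // Builds the text only from members this object owns.
  std::string ToString() const {
    if (ok()) {
      return "OK";
    }
    return category_ + ":" + std::to_string(code_) + ": " + message_;
  }

 private:
  int code_;
  std::string category_;
  std::string message_;
};

// Copies the category name and message out of the error code immediately,
// so nothing printed later refers back to the original error object.
SimpleStatus FromErrorCode(const std::error_code &ec) {
  if (!ec) {
    return SimpleStatus();
  }
  return SimpleStatus(ec.value(), ec.category().name(), ec.message());
}
```

Because the numeric code and category are kept next to the text, a log line built this way still identifies the failure even if the message itself turns out to be unreadable.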

If it's what @pcmoritz suspects, the write fails and this line is invoked: https://github.com/ray-project/ray/blob/master/src/ray/object_manager/object_manager.cc#L341

I think it makes sense to see what happens when running the above code with #2548, as this will either reveal the underlying error or rule out the corrupt error string hypothesis.

With that said, I believe the underlying issue may be fixed by #2557, which may be worth trying out first.
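A cheap way to probe the corrupted error string hypothesis in logs is to check the message bytes before printing them and fall back to the numeric code when they do not look like text. The following is a minimal sketch under that assumption; `LooksLikeText` and `SafeMessage` are hypothetical helpers, not Ray or Boost functions, and the check assumes plain ASCII messages in the default "C" locale (a localized message would need a looser test).

```cpp
#include <cctype>
#include <string>

// Returns true if every byte is printable or whitespace in the current
// C locale. An empty message is treated as suspicious.
bool LooksLikeText(const std::string &message) {
  if (message.empty()) {
    return false;
  }
  for (unsigned char c : message) {
    if (!std::isprint(c) && !std::isspace(c)) {
      return false;
    }
  }
  return true;
}

// Returns the message unchanged when it looks like text; otherwise returns
// a placeholder that still reports the numeric error code.
std::string SafeMessage(int code, const std::string &message) {
  if (LooksLikeText(message)) {
    return message;
  }
  return "<unprintable error message, code " + std::to_string(code) + ">";
}
```

If the placeholder shows up in the raylet logs, that supports the corruption hypothesis; if the messages always print cleanly, the crash more likely has another cause.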