Ray: [xray] Test failure in stress_tests.py::test_multiple_recursive.

Created on 15 Oct 2018 · 6 Comments · Source: ray-project/ray

The following test fails intermittently:

RAY_USE_XRAY=1 python -m pytest -v -s test/stress_tests.py::test_multiple_recursive[4]
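Since the failure is intermittent, one way to catch it is to rerun the command until it exits with a non-zero status. The following is a minimal sketch, not part of the Ray test suite: the helper name `run_until_failure`, the `TEST_CMD` constant, and the default run limit are assumptions made for illustration.

```python
import os
import subprocess
import sys

# Command from the report above; RAY_USE_XRAY=1 is supplied via the environment.
TEST_CMD = [
    sys.executable, "-m", "pytest", "-v", "-s",
    "test/stress_tests.py::test_multiple_recursive[4]",
]


def run_until_failure(cmd, max_runs=50, extra_env=None):
    """Run cmd repeatedly (hypothetical helper).

    Returns the 1-based index of the first run that exits non-zero,
    or None if all max_runs runs succeed.
    """
    env = dict(os.environ, **(extra_env or {}))
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd, env=env)
        if result.returncode != 0:
            return attempt
    return None
```

Calling `run_until_failure(TEST_CMD, extra_env={"RAY_USE_XRAY": "1"})` from the root of a Ray checkout would rerun the test up to 50 times and report which run failed first, if any.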

It looks like the relevant failure is

/home/ubuntu/ray/build/external/arrow/src/arrow_ep/cpp/src/plasma/client.cc:451:  Check failed: object_entry->second->is_sealed Plasma client called get on an unsealed object that it created
/home/ubuntu/ray/python/ray/core/src/ray/raylet/raylet[0x479e1e]
/home/ubuntu/ray/python/ray/core/src/ray/raylet/raylet(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x56d80d]
/home/ubuntu/ray/python/ray/core/src/ray/raylet/raylet[0x549430]
/home/ubuntu/ray/python/ray/core/src/ray/raylet/raylet(_ZN6plasma12PlasmaClient3GetEPKNS_8UniqueIDEllPNS_12ObjectBufferE+0x44)[0x54adf4]
/home/ubuntu/ray/python/ray/core/src/ray/raylet/raylet(_ZN3ray16ObjectBufferPool8GetChunkERKNS_8UniqueIDEmmm+0x1df)[0x5297ff]
/home/ubuntu/ray/python/ray/core/src/ray/raylet/raylet(_ZN3ray13ObjectManager17SendObjectHeadersERKNS_8UniqueIDEmmmRSt10shared_ptrINS_16SenderConnectionEE+0x58)[0x4dd2c8]
/home/ubuntu/ray/python/ray/core/src/ray/raylet/raylet(_ZN3ray13ObjectManager17ExecuteSendObjectERKNS_8UniqueIDES3_mmmRKNS_20RemoteConnectionInfoE+0x184)[0x4df2f4]
/home/ubuntu/ray/python/ray/core/src/ray/raylet/raylet[0x4df71b]
/home/ubuntu/ray/python/ray/core/src/ray/raylet/raylet(_ZN3ray13ObjectManager14RunSendServiceEv+0x458)[0x4d8f08]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f0892accc80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f0892d9d6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f089223241d]
*** Aborted at 1539557988 (unix time) try "date -d @1539557988" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x3e80000270c) received by PID 9996 (TID 0x7f089212a700) from PID 9996; stack trace: ***
    @     0x7f0892da7390 (unknown)
    @     0x7f0892160428 gsignal
    @     0x7f089216202a abort
    @           0x56d812 arrow::util::ArrowLog::~ArrowLog()
    @           0x549430 plasma::PlasmaClient::Impl::GetBuffers()
    @           0x54adf4 plasma::PlasmaClient::Get()
    @           0x5297ff ray::ObjectBufferPool::GetChunk()
    @           0x4dd2c8 ray::ObjectManager::SendObjectHeaders()
    @           0x4df2f4 ray::ObjectManager::ExecuteSendObject()
    @           0x4df71b _ZN5boost4asio6detail18completion_handlerIZZN3ray13ObjectManager4PushERKNS3_8UniqueIDES7_ENKUlRKNS3_20RemoteConnectionInfoEE0_clESA_EUlvE_E11do_completeEPNS1_15task_io_serviceEPNS1_25task_io_service_operationERKNS_6system10error_codeEm
    @           0x4d8f08 ray::ObjectManager::RunSendService()
    @     0x7f0892accc80 (unknown)
    @     0x7f0892d9d6ba start_thread
    @     0x7f089223241d clone
    @                0x0 (unknown)
bug

All 6 comments

cc @elibol

Some thoughts on this:

I'm assuming everything is running on a single machine/raylet for this test. No objects should be transferred in this setting. However, it appears an object is being pushed by the ObjectManager (presumably to its own instance). A get on the plasma client is being invoked during a push.

If we are in fact running this on a single machine with a single object manager, Pull is being invoked on the client id that corresponds to the current object manager instance. However, this should never happen. See line 204 in ObjectManager.

Another inconsistency: If a Pull on itself (the ObjectManager instance) is somehow invoked, the object being pushed is only partially written. This should never happen, because a push is only invoked on objects that are local. In particular, a notification from the object store is sent to the object manager when a Seal operation completes, so the root error "Check failed: object_entry->second->is_sealed Plasma client called get on an unsealed object that it created" is unexpected in this scenario.

@guoyuhong can you double check the unfulfilled_push_requests_ mechanism to ensure local_objects_ contains the object being read?
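The invariant described above can be sketched as a toy model. This is not Ray's ObjectManager code. The class `ToyObjectManager` and its methods are hypothetical, and raising an error on a push to self is an assumption made only to make that case visible. The sketch shows the expected behavior: a push reads an object only after its Seal notification has arrived, and earlier requests wait in the unfulfilled-push table.

```python
class ToyObjectManager:
    """Toy model of the push/seal bookkeeping (illustrative, not Ray's code)."""

    def __init__(self, client_id):
        self.client_id = client_id
        self.local_objects = set()           # objects whose Seal notification arrived
        self.unfulfilled_push_requests = {}  # object_id -> client ids waiting for it
        self.sent = []                       # (object_id, client_id) pushes performed

    def push(self, object_id, client_id):
        if client_id == self.client_id:
            # Assumption: a push to the local instance is treated as an error.
            raise ValueError("push requested to the local object manager")
        if object_id not in self.local_objects:
            # Not sealed locally yet: defer rather than read a partial object.
            self.unfulfilled_push_requests.setdefault(object_id, []).append(client_id)
            return False
        self.sent.append((object_id, client_id))
        return True

    def notify_sealed(self, object_id):
        # Seal notification from the object store: the object is now local,
        # so any deferred pushes for it can be served.
        self.local_objects.add(object_id)
        for client_id in self.unfulfilled_push_requests.pop(object_id, []):
            self.push(object_id, client_id)
```

In this model a get on an unsealed object cannot happen, because nothing is read before `notify_sealed` runs. The crash in the report suggests that the real code path reached the read while the object was missing from the local set or not yet sealed.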

@elibol we are running four raylets on a single machine in this test.

@elibol Thanks for the ideas. I will take a look when I have time.
@robertnishihara Does this crash reproduce every time you run this test?

@robertnishihara It looks like I cannot reproduce this problem on my local machine.

This may be fixed by #3020.
