A thread-local WorkerThreadContext is maintained for each core worker thread and is used to (1) get the currently executing task, (2) maintain a put index counting the object puts in the current task, used to deterministically generate put object IDs, and (3) maintain a task index counting the tasks submitted from the current task, used to generate task IDs and to set parent counters on task specs. This works fine for the thread-based worker model, where only one task executes on a thread at a time and tasks run to completion: CoreWorker::ExecuteTask sets this state at the beginning of task execution and resets it at the end, and the underlying Python task executes on the same thread, so the current task and the put index are correctly maintained for this core worker thread.
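To make the role of the put index concrete, here is a minimal, purely illustrative sketch of deriving a put object ID from the current task ID plus a per-task put index (hypothetical derivation; Ray's actual object ID layout differs in detail):

```python
import hashlib

def put_object_id(task_id: bytes, put_index: int) -> bytes:
    # Illustrative only: the same (task_id, put_index) pair always yields
    # the same object ID, which is why the put index must be tracked
    # per task rather than per thread.
    suffix = put_index.to_bytes(4, "little")
    return hashlib.sha1(task_id + suffix).digest()
```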
However, when executing tasks on async Python actors, the Python task runs on an event loop whose thread differs from the core worker fiber runner thread: the latter is where CoreWorker::ExecuteTask runs and (re)sets the current task, while the former is where any subtask submissions and the CoreWorker::Put and CoreWorker::Create calls for put objects happen within an executing async task, incrementing the task and put indices. This thread split results in:

- put object IDs being generated with the wrong (stale or unset) task ID, and a put index that keeps incrementing across distinct async tasks; and
- subtasks of async tasks having the wrong parent task ID and parent counter set on their task specs.
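As a minimal, self-contained illustration (hypothetical names, not Ray's actual internals) of why thread-local state breaks here: state set on the "core worker" thread is invisible to the event loop thread that actually runs the coroutine.

```python
import asyncio
import threading

# Stand-in for the thread-local WorkerThreadContext.
worker_context = threading.local()

def execute_task(task_id):
    # Runs on the "core worker" thread, as CoreWorker::ExecuteTask would.
    worker_context.current_task_id = task_id
    worker_context.put_index = 0

async def async_task_body():
    # Runs on the event loop thread; the thread-local set above is invisible.
    print(getattr(worker_context, "current_task_id", None))  # prints None

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

execute_task("task-1")  # the main thread plays the core worker thread
asyncio.run_coroutine_threadsafe(async_task_body(), loop).result()
```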
Ray version and other system information (Python version, TensorFlow version, OS):
Ray version: Current master
```python
import asyncio

import ray

ray.init()


@ray.remote(num_cpus=0)
class SignalActor:
    def __init__(self):
        self.ready_event = asyncio.Event()

    def send(self, clear=False):
        self.ready_event.set()
        if clear:
            self.ready_event.clear()

    async def wait(self, should_wait=True):
        # Each wait task does a put and submits a subtask; compare the
        # generated object IDs and parent counters across the two calls.
        ray.put(1)
        ray.get(_put.remote(2))
        if should_wait:
            await self.ready_event.wait()


@ray.remote
def _put(obj):
    return obj


signal = SignalActor.remote()
result_id = signal.wait.remote(should_wait=False)
result_id = signal.wait.remote()
```
If you view the debug logs showing the put object ID generated by each `signal.wait.remote()` call, you'll see that the underlying task ID is the same (randomly generated) ID for both, despite the put objects coming from two different tasks, and that the put index keeps incrementing across the two wait tasks.
If you add `<< ", parent_task_id=" << ParentTaskId() << ", parent_counter=" << ParentCounter();` to the stream in `TaskSpecification::DebugString()`, you'll also see that the tasks submitted by the separate wait tasks have the same parent task ID (the randomly generated one), and that the parent counter is 2 for the second `_put` task instead of 1, as it should be.
I think I know exactly why this happens, but I'd like to understand the implications of the wrong object IDs in this case. Are there implications beyond it being hard to debug (e.g., functionality issues)?
@rkooo567 This issue was discovered while working on a solution for this issue in this PR, which involves using a task's num_returns as a base for the put index in order to guarantee no object ID conflicts between put objects and return objects. That obviously fails for async Python tasks since the current task spec is not set in the event loop thread. I'll be pushing up a workaround as a stopgap for that PR, if I can find one.
AFAICT, there aren't any hard functionality issues within master, but this definitely breaks the hierarchical object ID model (which _is_ user-facing, IMO) and the semantics behind the put and task indices: async task put object IDs don't have the right task ID set, and task spec fields (parent task ID and parent task counter) are being incorrectly set for subtasks of async tasks. The WorkerThreadContext abstraction appears to offer semantics/invariants that don't hold for async Python tasks, which will be an issue whenever anyone tries to extend it or use it in any new meaningful way (as I'm trying to do); it's super, super fragile right now.
For a longer-term solution, we'd probably want the put and task indices maintained in Python coroutine-local storage (a `contextvar`, perhaps?), with the equivalent done in C++ land via fiber-local storage. I'm still thinking about the best way to do that without creating a large refactor.
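As a minimal sketch of that direction (hypothetical names, not a real Ray API): each `asyncio.Task` runs in its own copy of the current context, so a `ContextVar`-backed put index is isolated per task rather than shared per thread.

```python
import asyncio
import contextvars

# Hypothetical per-task put index held in coroutine-local storage.
_put_index = contextvars.ContextVar("put_index", default=0)

def next_put_index():
    # Read-modify-write on the ContextVar only affects the current context.
    index = _put_index.get() + 1
    _put_index.set(index)
    return index

async def fake_task(name):
    # Each asyncio.Task runs in its own copy of the context, so the
    # counters below don't interleave across the two concurrent tasks.
    print(name, next_put_index())  # always prints 1
    await asyncio.sleep(0)         # yield, forcing the tasks to interleave
    print(name, next_put_index())  # always prints 2

async def main():
    await asyncio.gather(fake_task("a"), fake_task("b"))

asyncio.run(main())
```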
Thanks for the detailed information.
This behavior should be fixed, as I'd like to support a task context (so users can print their task ID inside a method) in the near future. I will set P2 because it isn't a hard requirement for 1.0; relying on object IDs is not something we officially support.
Can you try this

> I'll be pushing up a workaround as a stopgap for that PR, if I can find one.

and let me know if you can find any workaround? If it's a hard blocker, I'll try to spend some time at night on a fix.
@edoakes Is @ijrsvt planning to fix this issue?
There are no immediate plans to fix it.
@edoakes and I chatted a little bit about this offline. We think one possible approach would be:
maintain a mapping of `task_id` -> `worker_context`, so that each task looks up its own context instead of relying on thread-local state. These are just rough thoughts, and hopefully they can guide a more fully fleshed-out design.
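A rough sketch (hypothetical names, not Ray's actual code) of what such a `task_id` -> `worker_context` mapping might look like, with keyed lookups replacing the single thread-local slot:

```python
import threading
from dataclasses import dataclass

@dataclass
class TaskContext:
    task_id: str
    put_index: int = 0
    task_index: int = 0

class WorkerContextMap:
    """Per-task contexts keyed by task ID instead of one thread-local slot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._contexts = {}  # task_id -> TaskContext

    def begin_task(self, task_id):
        with self._lock:
            self._contexts[task_id] = TaskContext(task_id)

    def get(self, task_id):
        with self._lock:
            return self._contexts[task_id]

    def end_task(self, task_id):
        with self._lock:
            self._contexts.pop(task_id, None)
```

With something like this, puts and subtask submissions would take (or infer) the current task ID and bump counters on the right context, regardless of which thread they run on.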
That's actually pretty similar to the idea @clarkzinzow and I discussed before.
Also, I think this is pretty crucial for the Serve use case; it can probably break custom metrics if this issue isn't fixed.