Ray: [ray] Async actor not working in local mode.

Created on 8 May 2020 · 11 Comments · Source: ray-project/ray

What is the problem?

Async actor is not being recognized in local mode. cc: @ijrsvt

Ray version and other system information (Python version, TensorFlow version, OS):

ray version: latest master
python: 3.7.4
OS: macos 10.15.3.

Reproduction

import ray

ray.init(local_mode=True)


@ray.remote
class test_actor:
    async def start(self):
        return 1


actor = test_actor.remote()
ray.get(actor.start.remote())

The error I'm getting

E0507 17:19:21.404151 381615552 core_worker.cc:1082] Pushed Error with JobID: 0100 of type: task with message: ray::test_actor.start() (pid=60331, ip=192.168.0.7)
  File "python/ray/_raylet.pyx", line 464, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 465, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1136, in ray._raylet.CoreWorker.store_task_outputs
  File "/Users/allenyin/Test/ray/python/ray/serialization.py", line 401, in serialize
    return self._serialize_to_msgpack(metadata, value)
  File "/Users/allenyin/Test/ray/python/ray/serialization.py", line 373, in _serialize_to_msgpack
    self._serialize_to_pickle5(metadata, python_objects)
  File "/Users/allenyin/Test/ray/python/ray/serialization.py", line 353, in _serialize_to_pickle5
    raise e
  File "/Users/allenyin/Test/ray/python/ray/serialization.py", line 350, in _serialize_to_pickle5
    value, protocol=5, buffer_callback=writer.buffer_callback)
  File "/Users/allenyin/Test/ray/python/ray/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/Users/allenyin/Test/ray/python/ray/cloudpickle/cloudpickle_fast.py", line 617, in dump
    return Pickler.dump(self, obj)
TypeError: can't pickle coroutine objects at time: 1.5889e+09
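
The final TypeError can be reproduced without Ray. The sketch below is only an illustration of the symptom, assuming the local-mode path calls the async method without awaiting it and then tries to serialize the return value; the helper name `try_pickle_unawaited_coroutine` is hypothetical and the exact error text varies by Python version.

```python
import pickle


async def start():
    return 1


def try_pickle_unawaited_coroutine():
    """Call an async function without awaiting it, then try to pickle the result."""
    coro = start()  # this is a coroutine object, not the value 1
    try:
        pickle.dumps(coro)
        return None
    except TypeError as exc:
        # Coroutine objects are not picklable, so serialization fails here.
        return exc
    finally:
        coro.close()  # avoid a "coroutine was never awaited" warning


error = try_pickle_unawaited_coroutine()
print(type(error).__name__, error)
```

Awaiting the coroutine (or running it on an event loop) before serializing yields the plain value 1, which pickles without trouble.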

[x] I have verified my script runs in a clean environment and reproduces the issue.
[x] I have verified the issue also occurs with the latest wheels.

P3 bug core fix-error-msg good first issue

allenyin55

👍2

All 11 comments

The problem is that when the task is executed locally, it doesn't create a Fiber or mark the current actor as async. The latter is easy to solve, but the former makes the code path pretty messy because it requires core_worker to carry a FiberState class just for local mode...

rkooo567 on 8 May 2020

@ijrsvt I think I can make a fix by tonight, so you don't need to work on this.

rkooo567 on 8 May 2020

👍1

Okay. I have been digging into this, and it is pretty tricky to fix because there's only one core worker in local mode. That means we cannot overwrite the core_worker state with async actor state without corrupting it. Fixing this requires a decent amount of refactoring (which I don't think is worth the time right now). As you cannot use 0.8.5 until the next release anyway, I will postpone the fix to the next sprint and set the priority to P1.

rkooo567 on 8 May 2020

@rkooo567 I can help with it next sprint as well.
@allenyin55 What is your use case for async tasks in local mode?

ijrsvt on 8 May 2020

👍1

@ijrsvt He needs to run his integration test in local mode, and that integration test contains an async actor.

rkooo567 on 8 May 2020

@rkooo567 Is this for a new integration test or an existing one? I don't know if there are a ton of use cases where local_mode and async actors will be used together. I'm not sure it fits in the definition of local_mode as emulating serial python?

ijrsvt on 8 May 2020

I guess @allenyin55 can answer that question better. But I believe it was a new one, and he said he has to use local mode. (btw, it worked when he used 0.8.4, and idk how)

I don't know the purpose of local mode very well, but my impression is that it is most useful when you want to reduce the test load (meaning mostly for unit / integration tests). If so, I believe it should return the same output as non-local mode for every API.

(Also, there could be an easy fix without using Fiber that just came to mind. We can probably talk about this offline if you think we should fix this issue.)

rkooo567 on 8 May 2020

I just did a bisection and this regression was introduced in https://github.com/ray-project/ray/pull/7670.
We need to either fix it or give a better error message that async actors are not supported in local mode.

We use async actors in local mode for dependency injection during testing. Local mode makes sure that the test code runs in a single process, which allows us to mock certain methods in that process (which get called by Ray tasks).

pcmoritz on 10 May 2020

👍1

The fix could actually be pretty simple if we make these 2 assumptions for local mode.

1. All coroutines are scheduled and run synchronously (we don't actually support asynchronous operation in local mode).
2. We don't support low-level asyncio APIs inside async actors (such as asyncio.get_event_loop()).

In this case, we just need to check whether the function is a coroutine function and, if so, run the event loop + coroutine in the main thread until it is done. @pcmoritz @ijrsvt do you guys think it is a valid premise for local mode?
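
For illustration, the idea above can be sketched in plain Python. This is not Ray's actual code: `run_method_serially` and `DemoActor` are hypothetical names, and the sketch assumes the premise stated above (coroutines run to completion one at a time, and actor code does not touch the event loop directly).

```python
import asyncio
import inspect


def run_method_serially(method, *args, **kwargs):
    """Hypothetical helper: run a sync or async callable to completion in the calling thread."""
    if inspect.iscoroutinefunction(method):
        # Async method: drive a private event loop until the coroutine finishes.
        loop = asyncio.new_event_loop()
        try:
            return loop.run_until_complete(method(*args, **kwargs))
        finally:
            loop.close()
    # Regular method: just call it.
    return method(*args, **kwargs)


class DemoActor:
    async def start(self):
        return 1

    def ping(self):
        return "pong"


actor = DemoActor()
result_async = run_method_serially(actor.start)
result_sync = run_method_serially(actor.ping)
print(result_async, result_sync)
```

Because the loop runs until the coroutine is done, the caller gets back a plain value that can be serialized, at the cost of no real concurrency between actor methods.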

rkooo567 on 10 May 2020

@rkooo567 I think that is a great idea. It fits with the logic of local mode being _serial_ python. It may be worth renaming it 'serial' mode to make its intended use case more obvious.

ijrsvt on 10 May 2020

Downgrading to P2 since this is not a common use case.