Ray: `ray.wait` blocks longer than `timeout`

Created on 15 Aug 2020 · 4 Comments · Source: ray-project/ray

What is the problem?

When a timeout parameter is supplied, ray.wait doesn't match the semantics described in the docs (https://docs.ray.io/en/latest/package-ref.html?highlight=wait#ray.wait):

If timeout is set, the function returns either when the requested number of IDs are ready or when the timeout is reached, whichever occurs first.

In practice, when timeout is less than 1 s, the call takes about 2 * timeout; when timeout is greater than 1 s, it takes about timeout + 1 s.

It happens in both 0.8.7 and 0.8.6.

Ray version and other system information (Python version, TensorFlow version, OS):

Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux

ubuntu 18.04

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

import time

import ray

ray.init()
print(ray.__version__)


@ray.remote
def busy(i):
    time.sleep(i)
    return i


ids = [busy.remote(0.5 * i) for i in range(1, 10)]
num_returns = 5

t1 = time.time()
ready, not_ready = ray.wait(ids, num_returns=num_returns, timeout=0.3)
print(f"{ready}, cost time: {time.time() - t1}")


ids = not_ready
t1 = time.time()
ready, not_ready = ray.wait(ids, num_returns=num_returns, timeout=1.3)
print(f"{ready}, cost time: {time.time() - t1}")

The output is:

0.8.7
[ObjectRef(45b95b1c8bd3a9c4ffffffff010000c001000000)], cost time: 0.6006457805633545
[ObjectRef(ef0a6c221819881cffffffff010000c001000000), ObjectRef(f66d17bae2b0e765ffffffff010000c001000000), ObjectRef(44ee453cd1e8e283ffffffff010000c001000000), ObjectRef(7e0a4dfc4c87306fffffffff010000c001000000)], cost time: 2.3008031845092773
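
As a quick sanity check, the two timings above can be compared against the rule of thumb stated earlier (2 * timeout below 1 s, timeout + 1 s above). This is only a sketch: `predicted_elapsed` is a hypothetical helper that encodes the observed pattern, not anything documented or provided by Ray, and the numbers are copied from the 0.8.7 output.

```python
# Timings copied from the 0.8.7 output above:
# (timeout passed to ray.wait, measured wall-clock seconds).
observations = [(0.3, 0.6006457805633545), (1.3, 2.3008031845092773)]


def predicted_elapsed(timeout):
    """Empirical rule from this report (an observation, not documented Ray behavior)."""
    return 2 * timeout if timeout < 1 else timeout + 1


for timeout, elapsed in observations:
    overshoot = elapsed - timeout  # how far past the requested timeout the call ran
    rule_error = abs(elapsed - predicted_elapsed(timeout))
    print(
        f"timeout={timeout}s elapsed={elapsed:.3f}s "
        f"overshoot={overshoot:.3f}s rule_error={rule_error:.3f}s"
    )
```

Both measurements land within a few milliseconds of the rule, and both overshoot the requested timeout, which the documented semantics would not allow.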

If we cannot run your script, we cannot fix your issue.

  • [x] I have verified my script runs in a clean environment and reproduces the issue.
  • [x] I have verified the issue also occurs with the latest wheels.
P1 bug

All 4 comments

This looks bad. @edoakes Do you have any clue why it happens?

@ericl I will take it over next week.

@rkooo567 not sure exactly why it happens but the wait logic in the core worker is pretty ugly. I think we can actually clean it up a lot now that we expect everything to be in the in-memory store - this would be a good chance to do that.

@rkooo567 Got sucked into this over lunch and ended up finding the issue. Assigned you on the PR.
