Ray: [Core/Autoscaler] Connecting to an existing cluster via env var issue breaks autoscaler

Created on 13 Aug 2020 · 5 Comments · Source: ray-project/ray

What is the problem?

Creating a Ray cluster using the RAY_ADDRESS environment variable causes a second copy of monitor.py to launch. Two competing autoscalers lead to some strange issues.

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

Call ray up config.yaml with any config file.

Observe that everything is normal.

Run ray submit config.yaml test.py

test.py:

import ray
ray.init() # Connects to existing cluster because env var is automatically set when using ray submit

while True:
    pass

Now attach to the session:

ps aux | grep monitor.py shows 2 copies of monitor.py running
/tmp/ray/session_latest now points to a different place

Kill the job.

Observe that the raylet and the rest of the cluster are also torn down.
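
The `ps aux | grep monitor.py` check in the steps above can also be scripted. The following is a minimal sketch, not part of the original report: the helper names `count_monitor_processes` and `running_monitor_count` are hypothetical, and the match is a plain substring test on `ps aux` output, so it assumes no unrelated process has `monitor.py` in its command line.

```python
import subprocess


def count_monitor_processes(ps_output: str) -> int:
    """Count lines of `ps aux` output that mention monitor.py.

    Hypothetical helper: a plain substring match, not an official Ray check.
    """
    count = 0
    for line in ps_output.splitlines():
        # Skip any grep process, whose command line also contains "monitor.py".
        if "monitor.py" in line and "grep" not in line:
            count += 1
    return count


def running_monitor_count() -> int:
    """Run `ps aux` on this machine and count monitor.py processes."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return count_monitor_processes(result.stdout)
```

On the head node, a `running_monitor_count()` result greater than 1 would correspond to the duplicate monitor.py described in this report.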

If we cannot run your script, we cannot fix your issue.

[ ] I have verified my script runs in a clean environment and reproduces the issue.
[ ] I have verified the issue also occurs with the latest wheels.

P1 autoscaler bug core

Source

wuisawesome

All 5 comments

What if we made this print a warning (that you should set address="auto" to pick up the address)?

ericl on 8 Sep 2020

why not just make address= work as expected?

wuisawesome on 8 Sep 2020

I would be comfortable with that if we also renamed RAY_ADDRESS to RAY_OVERRIDE_ADDRESS or something like that.

ericl on 8 Sep 2020

so just to be clear, you're advocating for env var taking precedence over argument to ray.init even if auto is specified?

wuisawesome on 8 Sep 2020
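
The two precedence policies debated in the comments above can be compared with a small sketch. This is an illustration only and does not reflect how ray.init actually resolves its address: the function `resolve_address` and its `env_overrides` flag are hypothetical, and a return value of None simply stands for "no address chosen".

```python
from typing import Mapping, Optional


def resolve_address(
    arg: Optional[str],
    env: Mapping[str, str],
    env_overrides: bool,
) -> Optional[str]:
    """Pick an address from an explicit argument and the RAY_ADDRESS env var.

    Hypothetical illustration of the two policies under discussion:
    - env_overrides=True: the env var wins whenever it is set.
    - env_overrides=False: an explicit argument (including "auto") wins,
      and the env var is only a fallback when no argument is given.
    """
    env_address = env.get("RAY_ADDRESS")
    if env_overrides and env_address:
        return env_address
    if arg is not None:
        return arg
    return env_address
```

With RAY_ADDRESS set and address="auto" passed, the first policy returns the env var value while the second returns "auto", which is the distinction the last question in the thread is asking about.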

seems to have fixed itself... closing since it can't be reproduced anymore

wuisawesome on 9 Sep 2020
