Ray: [Core/Autoscaler] Connecting to an existing cluster via env var issue breaks autoscaler

Created on 13 Aug 2020 · 5 Comments · Source: ray-project/ray

What is the problem?

Connecting to an existing Ray cluster through the RAY_ADDRESS environment variable causes a second copy of monitor.py to launch. The two competing autoscalers then lead to some strange issues.

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

Call ray up config.yaml with any config file.

Observe that everything is normal.

Run ray submit config.yaml test.py

test.py:

import ray
ray.init() # Connects to existing cluster because env var is automatically set when using ray submit

while True:
    pass

Now attach to the session:

  • ps aux | grep monitor.py shows 2 copies of monitor.py running
  • /tmp/ray/session_latest now points to a different place
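The first check above can be scripted so it is easy to repeat before and after submitting the job. This is only a minimal sketch and not part of Ray: `count_monitor_processes` and `read_ps_output` are hypothetical helpers, and the sketch assumes the autoscaler monitor shows up in `ps aux` with `monitor.py` in its command line, as in this report.

```python
import subprocess


def count_monitor_processes(ps_output: str) -> int:
    """Count `ps aux` lines whose command mentions monitor.py.

    Hypothetical helper for this report; not part of the Ray API.
    """
    count = 0
    for line in ps_output.splitlines():
        # Ignore the grep process from `ps aux | grep monitor.py`.
        if "monitor.py" in line and "grep" not in line:
            count += 1
    return count


def read_ps_output() -> str:
    """Return the raw output of `ps aux` on the current machine."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the head node, `count_monitor_processes(read_ps_output())` would be expected to report one process after `ray up` and two once the bug described here has been triggered.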

Kill the job.

Observe that the raylet and the rest of the cluster are also torn down.

If we cannot run your script, we cannot fix your issue.

  • [ ] I have verified my script runs in a clean environment and reproduces the issue.
  • [ ] I have verified the issue also occurs with the latest wheels.
Labels: P1, autoscaler, bug, core

All 5 comments

What if we made this print a warning (that you should set address="auto" to pick up the address)?

Why not just make address= work as expected?

I would be comfortable with that if we also renamed RAY_ADDRESS to RAY_OVERRIDE_ADDRESS or something like that.

So, just to be clear, you're advocating for the env var taking precedence over the argument to ray.init, even if auto is specified?
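To make the two options in this exchange concrete, here is a minimal sketch of each precedence rule. It is purely illustrative and is not how ray.init is implemented: `resolve_address` and its `env_overrides` flag are hypothetical names, and the address used in the usage note is made up.

```python
import os
from typing import Mapping, Optional


def resolve_address(
    address: Optional[str],
    env: Mapping[str, str] = os.environ,
    env_overrides: bool = True,
) -> Optional[str]:
    """Pick a cluster address under one of the two policies discussed.

    Hypothetical sketch; not the actual ray.init logic.
    Returns None when neither source supplies an address.
    """
    env_address = env.get("RAY_ADDRESS")
    if env_overrides:
        # Policy A: the environment variable wins, even over address="auto".
        return env_address or address
    # Policy B: an explicit argument wins; the variable is only a fallback.
    return address or env_address
```

With `RAY_ADDRESS` set to `10.0.0.1:6379` and `address="auto"`, policy A returns the environment value and policy B returns `"auto"`. That difference is the behavior the question above is asking about.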

Seems to have fixed itself... Closing since it can't be reproduced anymore.
