Ray: [autoscaler] Changing worker IPs inflexible

Created on 12 Dec 2018 · 2 comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6
  • Python version:
  • Exact command to reproduce:
cluster_name: default
min_workers: 0
max_workers: 0
docker:
    image: ""
    container_name: ""
target_utilization_fraction: 0.8
idle_timeout_minutes: 5
provider:
    type: local
    head_ip: MILLEN_c71
    worker_ips: [MILLEN_c72_IP]  # subsequent changes to this field throw errors
auth:
    ssh_user: USERNAME
    ssh_private_key: ~/.ssh/id_rsa

file_mounts: {}
#      "/tmp/ray_sha": "/YOUR/LOCAL/RAY/REPO/.git/refs/heads/YOUR_BRANCH"
head_setup_commands: []
worker_setup_commands: []
setup_commands:
    - conda activate ray && echo "hello world"
head_start_ray_commands:
    - conda activate ray && ray stop
    - conda activate ray && ulimit -c unlimited && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - conda activate ray && ray stop
    - conda activate ray && ray start --redis-address=$RAY_HEAD_IP:6379

Describe the problem

Changing the worker IPs in the local autoscaler config throws an assertion error that I don't know how to work around.

The cluster state file also isn't cleaned up after ray down, so the problem persists even across different clusters.
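
For a more concrete picture of the reproduction, the sketch below drives ray up twice with an edited worker_ips list; the config path, the yaml round-trip, and the replacement IP are placeholders for illustration, not part of the original report:

import subprocess
import yaml  # pyyaml

CONFIG = "cluster.yaml"  # the config above, saved locally (hypothetical path)

# First launch: the local node provider records the head and worker IPs
# in its on-disk cluster state.
subprocess.run(["ray", "up", CONFIG], check=True)

# Point worker_ips at a different machine and launch again.
with open(CONFIG) as f:
    conf = yaml.safe_load(f)
conf["provider"]["worker_ips"] = ["MILLEN_c73_IP"]  # hypothetical new worker
with open(CONFIG, "w") as f:
    yaml.safe_dump(conf, f)

# The second launch fails with the AssertionError shown in the logs below,
# because the old worker IP is still present in the cached cluster state.
subprocess.run(["ray", "up", CONFIG], check=True)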

Source code / logs

  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/ray/autoscaler/node_provider.py", line 100, in get_node_provider
    return provider_cls(provider_config, cluster_name)
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/ray/autoscaler/local/node_provider.py", line 77, in __init__
    provider_config)
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/ray/autoscaler/local/node_provider.py", line 51, in __init__
    assert len(workers) == len(provider_config["worker_ips"]) + 1
AssertionError
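
For context, this assertion compares the node count recorded in the local provider's cached cluster state against head_ip plus the worker_ips from the current YAML. A rough sketch of that check follows; the state-file path and JSON layout are assumptions for illustration, the real logic lives in ray/autoscaler/local/node_provider.py:

import json
import os

def load_cluster_state(cluster_name, provider_config):
    # Assumed location of the cached state (see the comments below).
    state_path = "/tmp/cluster-{}".format(cluster_name)
    workers = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            workers = json.load(f)  # nodes recorded on a previous ray up

    # IPs from the current YAML that aren't recorded yet get added, but
    # IPs that were removed or renamed in the YAML are never dropped ...
    for ip in provider_config["worker_ips"] + [provider_config["head_ip"]]:
        workers.setdefault(ip, {"tags": {}, "state": "terminated"})

    # ... so after editing worker_ips the counts no longer line up.
    assert len(workers) == len(provider_config["worker_ips"]) + 1
    return workers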

All 2 comments

You can remove the local state file in /tmp/cluster-NAME (or change the cluster name).

The error message should probably make this more clear.

@ericl is right. The assertion error was occurring due to a conflict between the new config file and the old cluster state. Removing the cluster's state and lock files from the /tmp directory resolved the issue.
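
For anyone who wants to script the cleanup, a minimal sketch; the /tmp/cluster-NAME naming comes from the comment above, while the exact suffixes of the state and lock files are assumptions, so the glob is deliberately broad:

import glob
import os

def clear_local_cluster_state(cluster_name):
    # Remove the cached cluster state and lock files so the next ray up
    # starts from the worker_ips in the YAML instead of the stale record.
    for path in glob.glob("/tmp/cluster-{}*".format(cluster_name)):
        print("removing", path)
        os.remove(path)

# For the config above the cluster name is "default":
# clear_local_cluster_state("default")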
