Ray: [autoscaler] KeyError when starting private cluster

Created on 4 Apr 2019  ·  7 Comments  ·  Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Ray installed from (source or binary): pip
  • Ray version: 0.6.5
  • Python version: 3.6.7
  • Exact command to reproduce:

ray create-or-update cluster.yaml

Describe the problem

Source code / logs

I followed the documentation and modified example-full.yaml to fill in username, node IP addresses, and custom setup commands.

Traceback:

ray create-or-update cluster.yaml
/tmp/env/lib/python3.6/site-packages/ray/autoscaler/commands.py:38: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(open(config_file).read())
/tmp/env/lib/python3.6/site-packages/ray/autoscaler/node_provider.py:115: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  defaults = yaml.load(f)
2019-04-04 03:39:25,901 INFO node_provider.py:34 -- ClusterState: Loaded cluster state: {'c79.millennium.berkeley.edu': {'tags': {'ray-node-type': 'worker'}, 'state': 'terminated'}, 'c80.millennium.berkeley.edu': {'tags': {'ray-node-type': 'head', 'ray-launch-config': '6c51b8169c9469f0fa2568e5d238af2585d302a7', 'ray-node-name': 'ray-default-head'}, 'state': 'running'}}
2019-04-04 03:39:25,902 INFO node_provider.py:59 -- ClusterState: Writing cluster state: {'c79.millennium.berkeley.edu': {'tags': {'ray-node-type': 'worker'}, 'state': 'terminated'}, 'c80.millennium.berkeley.edu': {'tags': {'ray-node-type': 'head', 'ray-launch-config': '6c51b8169c9469f0fa2568e5d238af2585d302a7', 'ray-node-name': 'ray-default-head'}, 'state': 'running'}}
This will restart cluster services [y/N]: y
2019-04-04 03:39:29,888 INFO commands.py:202 -- get_or_create_head_node: Updating files on head node...
Traceback (most recent call last):
  File "/tmp/env/bin/ray", line 11, in <module>
    sys.exit(main())
  File "/tmp/env/lib/python3.6/site-packages/ray/scripts/scripts.py", line 766, in main
    return cli()
  File "/tmp/env/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/tmp/env/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/tmp/env/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/env/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmp/env/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/tmp/env/lib/python3.6/site-packages/ray/scripts/scripts.py", line 460, in create_or_update
    no_restart, restart_only, yes, cluster_name)
  File "/tmp/env/lib/python3.6/site-packages/ray/autoscaler/commands.py", line 47, in create_or_update_cluster
    override_cluster_name)
  File "/tmp/env/lib/python3.6/site-packages/ray/autoscaler/commands.py", line 243, in get_or_create_head_node
    initialization_commands=config["initialization_commands"],
KeyError: 'initialization_commands'
bug

All 7 comments

Can you try installing the latest master?

We should probably make all of our configs backward compatible from now on...

I'm getting the same error on the latest master. This line is causing the error.

Looks like the local example is out of date.

Same here, 0.6.6

I'm having the same issue. So, what's the fix?
Fresh install of 0.6.6. I also updated to 0.7 dev2 and got the same error.
In my case the error is:

ray up settings.yaml
/home/jake/py/tf2/lib/python3.6/site-packages/ray/autoscaler/commands.py:38: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(open(config_file).read())
/home/jake/py/tf2/lib/python3.6/site-packages/ray/autoscaler/node_provider.py:115: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  defaults = yaml.load(f)
2019-05-16 03:24:21,128 INFO node_provider.py:34 -- ClusterState: Loaded cluster state: {}
Traceback (most recent call last):
  File "/home/jake/py/tf2/bin/ray", line 10, in <module>
    sys.exit(main())
  File "/home/jake/py/tf2/lib/python3.6/site-packages/ray/scripts/scripts.py", line 766, in main
    return cli()
  File "/home/jake/py/tf2/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/jake/py/tf2/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/jake/py/tf2/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jake/py/tf2/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jake/py/tf2/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/jake/py/tf2/lib/python3.6/site-packages/ray/scripts/scripts.py", line 460, in create_or_update
    no_restart, restart_only, yes, cluster_name)
  File "/home/jake/py/tf2/lib/python3.6/site-packages/ray/autoscaler/commands.py", line 47, in create_or_update_cluster
    override_cluster_name)
  File "/home/jake/py/tf2/lib/python3.6/site-packages/ray/autoscaler/commands.py", line 163, in get_or_create_head_node
    provider = get_node_provider(config["provider"], config["cluster_name"])
  File "/home/jake/py/tf2/lib/python3.6/site-packages/ray/autoscaler/node_provider.py", line 103, in get_node_provider
    return provider_cls(provider_config, cluster_name)
  File "/home/jake/py/tf2/lib/python3.6/site-packages/ray/autoscaler/local/node_provider.py", line 86, in __init__
    provider_config)
  File "/home/jake/py/tf2/lib/python3.6/site-packages/ray/autoscaler/local/node_provider.py", line 55, in __init__
    TAG_RAY_NODE_TYPE] == "head"
AssertionError

The YAML file was taken from the online ray/python/ray/autoscaler/gcp/example-full.yaml. I just added the IPs:

cluster_name: default
min_workers: 0
max_workers: 0
initial_workers: 0
autoscaling_mode: default
docker:
    image: ""
    container_name: ""
target_utilization_fraction: 0.8
idle_timeout_minutes: 5
provider:
    type: local
    head_ip: 10.0.0.2
    worker_ips: [10.0.0.2]
auth:
    ssh_user: YOUR_USERNAME
    ssh_private_key: ~/.ssh/id_rsa
head_node: {}
worker_nodes: {}
file_mounts: {}
setup_commands: []
head_setup_commands: []
worker_setup_commands: []
setup_commands:
    - source activate ray && pip install -U ray
head_start_ray_commands:
    - source activate ray && ray stop
    - source activate ray && ulimit -c unlimited && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - source activate ray && ray stop
    - source activate ray && ray start --redis-address=$RAY_HEAD_IP:6379

I am facing the same issue. A workaround, however, is to manually SSH into the head and worker nodes of the cluster and run the respective commands listed in the example-full.yaml file, or to use the gnu-parallel command to run a script that does the same.

This seems to be a bug. Adding initialization_commands: [] to the YAML file seems to bypass this error (though other errors are generated).
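
The same workaround could be applied in code before the config is used. This is only a minimal sketch, assuming the YAML has already been parsed into a plain dict. fill_missing_defaults and OPTIONAL_LIST_KEYS are hypothetical names for illustration, not part of Ray:

```python
# Hypothetical helper, not part of Ray: give optional list-valued keys an
# empty default so that a lookup such as config["initialization_commands"]
# does not raise KeyError on older cluster YAML files.
OPTIONAL_LIST_KEYS = ("initialization_commands",)


def fill_missing_defaults(config):
    """Return a shallow copy of config with missing optional keys set to []."""
    patched = dict(config)
    for key in OPTIONAL_LIST_KEYS:
        patched.setdefault(key, [])
    return patched


# An old-style config that lacks the newer key.
old_style = {"cluster_name": "default", "setup_commands": []}
patched = fill_missing_defaults(old_style)
print(patched["initialization_commands"])
```

The copy leaves the caller's dict untouched, and keys that are already present keep their values.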

See 4811.

55, in __init__ TAG_RAY_NODE_TYPE] == "head" AssertionError

This particular assertion appears to be a separate problem, as I discovered the hard way. It seems to occur when the address of head_ip is also included in worker_ips. Removing the head_ip from the worker list eliminated the error for me. I also found it necessary to delete the tmp/cluster-\
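
A quick pre-flight check for the head_ip / worker_ips overlap described above might look like the following. It is only a sketch, assuming the provider section of the YAML has been parsed into a dict. head_in_workers is a hypothetical helper, not a Ray function:

```python
# Hypothetical pre-flight check, not part of Ray: report whether the head IP
# also appears in the worker list of a "local" provider section.
def head_in_workers(provider):
    return provider.get("head_ip") in provider.get("worker_ips", [])


# Provider section as posted earlier in this thread.
provider = {"type": "local", "head_ip": "10.0.0.2", "worker_ips": ["10.0.0.2"]}

if head_in_workers(provider):
    print("head_ip is also listed in worker_ips; remove it from the worker list")
```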
