[ray] worker_start_ray_commands are not executed for private cluster

Created on 30 May 2019 · 18 comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
  • Ray installed from (source or binary): pip
  • Ray version: 0.7.0
  • Python version: 3.6.7
  • Exact command to reproduce:

Describe the problem


I am following the private cluster setup instructions, but only the head node starts. A few interesting points follow below.

Source code / logs

cluster_name: tesq_cluster
min_workers: 48
max_workers: 48
initial_workers: 48
provider:
    type: local
    head_ip: ip1
    worker_ips: [ip2, ip3, ip4]
auth:
    ssh_user: tesq
    ssh_private_key: /home/me/.ssh/keys/local_user
file_mounts: {}
setup_commands: []
initialization_commands: []
head_setup_commands: []
worker_setup_commands: []

head_start_ray_commands:
    - source activate py3_prod && ray stop
    - echo 'I am here' >> /home/tesq/new_file.txt
    - source activate py3_prod && ulimit -c unlimited && ray start --head --redis-port=6379
worker_start_ray_commands:
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray stop
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray start --redis-address=ip1:6379

After that, only the head node starts, and only on the head node do I see the created file new_file.txt.
Example output of the command ray.global_state.client_table():

{'ClientID': 'a7ce937ffcbece9b25a779fa126ba47edef27267',
  'IsInsertion': True,
  'NodeManagerAddress': 'ip1',
  'NodeManagerPort': 45759,
  'ObjectManagerPort': 34107,
  'ObjectStoreSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/raylet',
  'Resources': {'GPU': 3.0, 'CPU': 24.0}},

__Update__:
This seems very similar to issue https://github.com/ray-project/ray/issues/3190,
but the files monitor.err and monitor.out are empty.
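
To check whether the autoscaler monitor (which, as far as I understand, is the process on the head node responsible for launching the workers) is running at all, something like this on the head node should work:

    ps aux | grep "[m]onitor"                  # is the autoscaler monitor process alive?
    tail /tmp/ray/session_*/logs/monitor.out   # its stdout (empty in my case)
    tail /tmp/ray/session_*/logs/monitor.err   # its stderr (empty in my case)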

Label: question

All 18 comments

@robertnishihara Sorry for summoning you directly, but this issue seems very similar to your https://github.com/ray-project/ray/issues/3190

Still present on 0.7.1. ray up simply does not execute the worker_start_ray_commands.
For others looking for a solution: the old Ray 0.5.3 approach can still be used (though I prefer using ssh in a loop instead of parallel-ssh). Link: https://ray.readthedocs.io/en/ray-0.5.3/using-ray-on-a-large-cluster.html
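
For concreteness, a minimal sketch of that workaround, assuming the hosts, user, and conda environment from the config in the issue above:

    # start Ray on each worker over plain ssh instead of relying on ray up;
    # ';' instead of '&&' so a failing 'ray stop' doesn't abort the start
    for ip in ip2 ip3 ip4; do
        ssh tesq@"$ip" "source activate py3_prod && ray stop; ray start --redis-address=ip1:6379"
    done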

cc @jovany-wang

Did you see any exceptions under /tmp/ray/session_xxx/logs/raylet.err or redis.err or plasma_store.err?
And what's the output of running ray start --redis-address=ip1:6379 on the worker node?

@raulchen As far as I can see, the /tmp/ray/session_xxx/ directory is only created when tasks are being executed on the Ray cluster. On ray up alone, no such directory is created in my case.

I start the Ray process on the worker nodes myself via ssh, like this: ray start --redis-address=ip1:6379. No errors, and everything works as expected, including during task execution.

Given that no file is created on the worker nodes by the echo 'I am there' >> /home/tesq/new_file.txt command, it seems that the worker_start_ray_commands from the .yaml file are simply not executed.
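
(To double-check, the quickest test is to look for the file directly over ssh; a small sketch using the hosts from my config:)

    for ip in ip2 ip3 ip4; do
        ssh tesq@"$ip" ls -l /home/tesq/new_file.txt   # 'No such file or directory' on every worker
    done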

@neychev were you able to figure this out?

@sp608 I just tried this out, and this should work now; can you try the nightlies/latest master? You should install this on all nodes (put it in setup_commands as pip install -U whl).

https://ray.readthedocs.io/en/latest/installation.html#trying-snapshots-from-master
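
For example, with a nightly manylinux wheel (substitute the wheel matching your Python version; the URL below is the one tried elsewhere in this thread), the setup_commands entry would look roughly like:

    setup_commands:
        - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl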

I have encountered a similar issue. Neither the setup_commands nor the worker_start_ray_commands are executed. Below is the information about my environment.

OS: Ubuntu 18.04
Python: 3.6.8
Ray: 0.8.2

And here is my autoscaler configuration for starting a private cluster:

min_workers: 8
initial_workers: 8
max_workers: 8

provider:
    type: local
    head_ip: 10.148.186.178
    worker_ips: [10.148.186.18]
    use_internal_ips: true

auth:
    ssh_user: USER_NAME
    ssh_private_key: ~/.ssh/id_rsa

# Files or directories to copy to the head and worker nodes. 
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

setup_commands:
    - echo 'Starting Ray ...' > ~/ray.log

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --redis-port=6379

worker_start_ray_commands:
    - ray stop
    - ray start --address=10.148.186.178:6379

On the head node I can see that a file ~/ray.log containing the line Starting Ray ... is created, while it is not created on the worker node.

ray v0.8.4
python 3.6.9
Ubuntu 18.04.4
I'm running into this same thing: none of the commands (setup_commands or worker_start_ray_commands) appear to be executing.

I guess it might not be obvious, but that includes the part about starting up the worker clients. Basically only the head node is launched; none of the workers appear to be executing any commands, ray start or otherwise.

> @sp608 I just tried this out, and this should work now; can you try the nightlies/latest master? You should install this on all nodes (put it in setup_commands as pip install -U whl).
>
> https://ray.readthedocs.io/en/latest/installation.html#trying-snapshots-from-master

Same issue: version 0.8.5 doesn't ssh to the workers' IPs on ray up cluster.yaml.

I tried 0.9.0.dev0, with the same effect.
https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl

@dclong @jaromrax @gimzmoe
Do you have --autoscaling-config=~/ray_bootstrap_config.yaml as a flag for your ray start command on the head node?

* Clarification: this should be specified in the head_start_ray_commands section of your YAML.
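
Concretely, something like this in the YAML (mirroring the default that shows up in ray_bootstrap_config.yaml):

    head_start_ray_commands:
        - ray stop
        - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml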

Sorry that I can't give a more informed answer; my understanding here is limited:

ray up cluster.yaml                
    ... lot of things ...                                
    bash: cannot set terminal process group (-1): Inappropriate ioctl for device
    ... lot of things ...  and connections to head (that is local also)
    ...              

Then I tried:

ray start --autoscaling-config=~/ray_bootstrap_config.yaml

which resulted in:

Traceback (most recent call last):
  File "/home/***/.local/bin//ray", line 11, in <module>
    sys.exit(main())
  File "/home/***/.local/lib/python3.6/site-packages/ray/scripts/scripts.py", line 1028, in main
    return cli()
  File "/home/***/.local/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/***/.local/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/***/.local/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/***/.local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/***/.local/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/***/.local/lib/python3.6/site-packages/ray/scripts/scripts.py", line 397, in start
    raise Exception("If --head is not passed in, --address must "
Exception: If --head is not passed in, --address must be provided.

I don't know how the ~/ray_bootstrap_config.yaml file appeared on my PC. It contains:

{"head_setup_commands": ["pip install -U ray"], "no_restart": false, "head_node": {}, "initial_workers": 0, "idle_timeout_minutes": 5, "worker_nodes": {}, "worker_setup_commands": ["pip install -U ray"], "file_mounts": {}, "initialization_commands": [], "auth": {"ssh_user": "\
 ojr", "ssh_private_key": "~/ray_bootstrap_key.pem"}, "cluster_name": "clusterojr", "max_workers": 0, "worker_start_ray_commands": ["ray stop", "ray start --address=$RAY_HEAD_IP:6379"], "target_utilization_fraction": 0.8, "provider": {"worker_ips": ["ip", ".\
 ip"], "type": "local", "head_ip": "ip"}, "head_start_ray_commands": ["ray stop", "ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml"], "docker": {"image": "", "run_options": [], "pull_before_run": true, "conta\
 iner_name": ""}, "min_workers": 0, "setup_commands": ["pip install -U ray"], "autoscaling_mode": "default"}

To me there are several possibilities:

  • the corresponding functions in the source code are still empty
  • there might be something with local vs. system-wide installation of Python modules
  • there are some other problems, e.g. bash: cannot set terminal process group (-1): Inappropriate ioctl for device

However, I am sure it ssh-connected to the head (the local PC) and ran the following line from the YAML there:

 Running ulimit -c unlimited && ray start --head --redis-port=6379 --redis-shard-ports=59519 --node-manager-port=19580 --object-manager-port=39066  --autoscaling-config=~/RAY/ray_bootstrap_config.yaml

and it didn't try to connect to the other machines when I ran ray up cluster.yaml on one PC.

Thanks for looking into this.

@jaromrax, no worries! Thanks for sending all that over :)

First, I was a bit vague in my last answer: I meant that the ~/ray_bootstrap_config.yaml should be included in the ray start part of head_start_ray_commands in your YAML. From the output of your ~/ray_bootstrap_config.yaml, it looks like you did this.

Second, looking at the output of your ~/ray_bootstrap_config.yaml, it seems all of your IPs are 'ip' instead of actual addresses. Are they all the same address, or are they all different? Could you also drop a link to your YAML (feel free to anonymize anything you don't want to show publicly)?

@ijrsvt I am facing the same issue; tested on both versions 0.8.4 and 0.8.5.
ray up starts only the head node, not the worker nodes.

@solacerace can you share your YAMLs?

@ijrsvt I'm behind my company's firewall; sorry, I will not be able to post the complete YAML.

I got the YAML from here and updated the head_ip, worker_ips, and ssh_user.

When I run the command ray up config.yaml, it brings up Ray on head_ip as the head node and also:

  • prints the command for adding additional nodes to the cluster
  • prints the UI address
  • but does not bring up Ray on the worker nodes

Whereas upon manually running the command ray start --address=head_ip:port on each of the worker machines, the worker nodes do get added to the cluster.

So maybe if you could share a working YAML, one which can bring up Ray on a head node and worker nodes, I could use that as a reference. I appreciate your help.
-- thanks

@solacerace Make sure that min_workers == initial_workers == max_workers, and that they are all equal to the number of worker nodes.
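
Putting the pieces from this thread together, a rough sketch of a minimal local-provider YAML (placeholder IPs and user; adjust to your machines):

    cluster_name: example
    min_workers: 3
    initial_workers: 3
    max_workers: 3            # all three equal to the number of worker_ips below
    provider:
        type: local
        head_ip: ip1
        worker_ips: [ip2, ip3, ip4]
    auth:
        ssh_user: USER_NAME
        ssh_private_key: ~/.ssh/id_rsa
    head_start_ray_commands:
        - ray stop
        # --autoscaling-config is what makes the head node launch the workers
        - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    worker_start_ray_commands:
        - ray stop
        - ray start --address=$RAY_HEAD_IP:6379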

Closing, because this issue was mainly about providing incorrect arguments to ray start.
