Ray: [autoscaler] "AssertionError: Unable to SSH to node", when starting large cluster

Created on 4 Feb 2018  ·  22 Comments  ·  Source: ray-project/ray

Running

```
ray create_or_update ~/Workspace/ray/python/ray/autoscaler/aws/example.yaml
```

where example.yaml is modified to start 300 nodes, I monitor the auto-scaling activity with

```
ssh -i /Users/rkn/.ssh/ray-autoscaler_us-west-2.pem [email protected] 'tail -f /tmp/raylogs/monitor-*'
```

After a while (after around 64 nodes are in the cluster), I see

```
==> /tmp/raylogs/monitor-2018-02-04_07-20-44-04701.err <==
Process NodeUpdaterProcess-53:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 68, in run
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 55, in run
    self.do_update()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 118, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node
```

Inspecting the monitor logs on the head node, I see

```
Process NodeUpdaterProcess-28:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 68, in run
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 55, in run
    self.do_update()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 118, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node
Process NodeUpdaterProcess-53:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 68, in run
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 55, in run
    self.do_update()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 118, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node
```

cc @ericl

All 22 comments

Maybe it would be helpful to get a bigger tail of each monitor file? The failing assertion could be either a timeout or some other exception from the command itself.
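For example, something along these lines would pull a bigger chunk of each monitor log instead of a live tail (same key path, user, and log location as the command above; the head-node IP is a placeholder):

```
# Pull the last 500 lines of every monitor log from the head node.
ssh -i ~/.ssh/ray-autoscaler_us-west-2.pem ubuntu@<head-node-ip> \
    'tail -n 500 /tmp/raylogs/monitor-*'
```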

I no longer have the tail accessible, but I can reproduce this tomorrow (I imagine it's pretty reproducible). However, I don't remember seeing anything else interesting in the log.

Hm was the load on the head node ok? And did this happen with just that node or all of them?
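A quick sketch for checking the head-node load over SSH, reusing the key path from the original report (the IP is a placeholder):

```
# Print load average and CPU count on the head node.
ssh -i ~/.ssh/ray-autoscaler_us-west-2.pem ubuntu@<head-node-ip> 'uptime && nproc'
```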

@richardliaw @robertnishihara

We are having a similar issue here. We are running the command:

```
ray up /Users/kapleesh/cal/cs262a/angela/manifest.yaml
```

This gives us the error:

```
ray up /Users/kapleesh/cal/cs262a/angela/manifest.yaml
This will restart cluster services [y/N]: y
Updating files on head node...
NodeUpdater: Updating i-060669574ffde1697 to 2fe43d49ebaad0c0f0a39bc37474a87c1d072a43, logging to (console)
NodeUpdater: Waiting for IP of i-060669574ffde1697...
NodeUpdater: Waiting for SSH to i-060669574ffde1697...
NodeUpdater: Waiting for SSH to i-060669574ffde1697...
NodeUpdater: Waiting for SSH to i-060669574ffde1697...
[... "Waiting for SSH" repeated, 53 lines in total ...]
NodeUpdater: Error updating Unable to SSH to nodeSee (console) for remote logs.
Process NodeUpdaterProcess-1:
Traceback (most recent call last):
  File "/Users/kapleesh/.pyenv/versions/3.7.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/kapleesh/.pyenv/versions/3.7.0/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 103, in run
    raise e
  File "/Users/kapleesh/.pyenv/versions/3.7.0/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 88, in run
    self.do_update()
  File "/Users/kapleesh/.pyenv/versions/3.7.0/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 154, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node
Updating 54.201.227.119 failed
```

**This is our manifest.yaml file:** 

```
# An unique identifier for the head node and workers of this cluster.
cluster_name: default

# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 9

# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 9

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "" # e.g., tensorflow/tensorflow:1.5.0-py3
    container_name: "" # e.g. ray_docker

# The autoscaler will scale up the cluster to this target fraction of resource
# usage. For example, if a cluster of 10 nodes is 100% busy and
# target_utilization is 0.8, it would resize the cluster to 13. This fraction
# can be decreased to increase the aggressiveness of upscaling.
# This value must be less than 1.0 for scaling to happen.
target_utilization_fraction: 1.0

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-west-2a,us-west-2b, us-west-2c

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # By default Ray creates a new private keypair, but you can also use your own.
    # If you do so, make sure to also set "KeyName" in the head and worker node
    # configurations below.
    ssh_private_key: /path/to/your/key.pem

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
head_node:
    InstanceType: m5.large
    ImageId: ami-a0cfeed8 # US West Oregon, HVM (SSD) EBS-Backed 64-bit

    # You can provision additional disk space with a conf as follows
    # BlockDeviceMappings:
    #     - DeviceName: /dev/sda1
    #       Ebs:
    #           VolumeSize: 50

    # Additional options in the boto docs.

# Provider-specific config for worker nodes, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
worker_nodes:
    InstanceType: m5.large
    ImageId: ami-a0cfeed8 # US West Oregon, HVM (SSD) EBS-Backed 64-bit

    # Run workers on spot by default. Comment this out to use on-demand.
    InstanceMarketOptions:
        MarketType: spot
        # Additional options can be found in the boto docs, e.g.
        #   SpotOptions:
        #       MaxPrice: MAX_HOURLY_PRICE

    # Additional options in the boto docs.

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
    "/": "/Users/kapleesh/cal/cs262a/angela",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# List of shell commands to run to set up nodes.
setup_commands:
    # Note: if you're developing Ray, you probably want to create an AMI that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # - git clone https://github.com/ramjk/angela
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.5.3-cp27-cp27mu-manylinux1_x86_64.whl
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.5.3-cp35-cp35m-manylinux1_x86_64.whl
    # Consider uncommenting these if you also want to run apt-get commands during setup
    # - sudo pkill -9 apt-get || true
    # - sudo pkill -9 dpkg || true
    # - sudo dpkg --configure -a

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install boto3==1.4.8 # 1.4.8 adds InstanceMarketOptions

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --redis-address=$RAY_HEAD_IP:6379 --object-manager-port=8076
```

We are able to SSH into the head node, but we do not see any logs anywhere here:

```
ssh "~/.ssh/ray-autoscaler_us-west-2.pem" [email protected]
```

Do you have any advice here?

Hm usually this works after a couple tries.

Another thing you can do is first start the cluster with a low min number of workers, then slowly increase it to 9.
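A rough sketch of that workflow, assuming the manifest.yaml from the previous comment:

```
# Edit manifest.yaml so min_workers is small (e.g. 1), keeping max_workers at 9, then:
ray up /Users/kapleesh/cal/cs262a/angela/manifest.yaml
# Raise min_workers a few nodes at a time and re-run `ray up` to apply each change.
```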


@richardliaw

We tried to start from a low min number and increased it, but we are still not able to connect to the head node.

Is there anything in terms of VPCs, subnets, or permission groups on the AWS side of things that you think we might have missed?

You can try SSHing to it normally; it's probably some issue with keys or permissions on the security group (though the default should work).
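For example, a manual check along these lines, using the same key and ssh_user the autoscaler is configured with (the head-node IP is the one from the "Updating ... failed" message; note the -i flag, which the ssh command quoted earlier omits):

```
# Try the exact user/key pair the autoscaler would use.
ssh -i ~/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5 ubuntu@54.201.227.119 uptime
```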


I am having the same issue and am able to ssh into the head node without issues. I am only trying to create a very small cluster (2 nodes).

If a very small cluster has this issue, it could be a firewall problem. Check the connection on all TCP ports between the head node and the worker nodes.
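One way to spot-check that, assuming nc is available on the nodes (the IPs are placeholders; the ports are the ones used in the config earlier in this thread):

```
# From the head node: can it reach a worker's SSH port?
nc -zv <worker-internal-ip> 22
# From a worker: can it reach the head node's Redis and object manager ports?
nc -zv <head-internal-ip> 6379
nc -zv <head-internal-ip> 8076
```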

In my setup, I was able to resolve this issue by choosing a shorter cluster_name. I had seen warnings about a path being too long.

I also have this issue. How did you resolve it?

What command are you running, and what is the stack trace?

I ran ray.autoscaler.commands.exec_cluster as shown here (https://github.com/rail-berkeley/softlearning/blob/master/examples/instrument.py#L287) in the softlearning repo. Here's the stack trace:

```
E0920 17:57:28.584939 139628152801024 updater.py:145] NodeUpdater: i-0cd94a9ba7969588e: Error updating Unable to SSH to node
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/scr/kevin/anaconda2/envs/softlearning_official/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/scr/kevin/anaconda2/envs/softlearning_official/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 148, in run
    raise e
  File "/scr/kevin/anaconda2/envs/softlearning_official/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 137, in run
    self.do_update()
  File "/scr/kevin/anaconda2/envs/softlearning_official/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 216, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node

E0920 17:57:28.818691 139629018629888 commands.py:258] get_or_create_head_node: Updating 54.218.214.112 failed
```

OK, can you try manually sshing into the node?


Yes, I'm able to manually ssh into it.

Can you make sure the user in the command you're using for the manual ssh is the same as the ssh_user field?

Also, what Ray version are you using?


Yeah, they are the same. I'm using 0.7.1

OK - hm, presumably you cannot upgrade because you are using softlearning.

Can you provide a bigger stack trace? Also, if you can obtain the ssh command that is being run (via ssh_cmd in the updater), run it, and provide the output, that would be very helpful.

Our error handling has improved a lot since 0.7.1, sorry about that.
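For reference, the generated command is roughly of this shape (a sketch only; the exact flags and paths depend on the Ray version and your config, so treat every value below as a placeholder):

```
# Approximate shape of the updater's SSH check; it ends with `uptime`.
ssh -i <ssh_private_key> \
    -o ConnectTimeout=<timeout> \
    -o StrictHostKeyChecking=no \
    -o ControlMaster=auto -o ControlPath=<per-cluster control socket> \
    ubuntu@<node-ip> uptime
```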


Okay, here's the list of the ssh commands being run:

```
['ray stop', 'ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml']
```

Sorry; this is a command generated by the autoscaler (you may need to step through it). It should probably end with "uptime".

Also, maybe it is a cluster-name issue (try killing the cluster, setting name: "test", and rerunning exec).
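A sketch of that retry with the CLI (cluster.yaml is a placeholder for whatever config softlearning passes to exec_cluster):

```
# Tear down the existing cluster, shorten the name in the config, then relaunch.
ray down cluster.yaml
# edit cluster.yaml: cluster_name: test
ray up cluster.yaml
```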


Thanks! Seems like setting the cluster name to "test" works. Is there anything we should be careful about when naming the cluster?

This should be fixed if you upgrade to Ray 0.7.4. For now, try shortening the cluster name.

Thanks for being patient and responsive!
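The upgrade itself is just a pip install, e.g.:

```
pip install -U ray==0.7.4
```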

