[Ray] Manual Cluster Setup (not bringing CPUs into available resources)

Created on 4 Sep 2020 · 35 comments · Source: ray-project/ray


[RAY CORE]
When following https://docs.ray.io/en/latest/cluster/index.html#manual-ray-cluster-setup
on 0.8.7: once I have connected the remote worker nodes, I see the number of available CPUs go up. However, on the current 0.9.0.dev, CPUs stay at 0.

Ray version and other system information (Python version, TensorFlow version, OS):
0.9.0.dev currently seems to break manual cluster setup.

Reproduction (REQUIRED)

install the latest 0.9.0.dev
on the main machine run: ray start --head --num-cpus=0
on the worker machine run: ray start --address=xxx --redis-password=xxx --num-cpus=24
back on the main machine run: python3 -c "import ray; ray.init(address='auto'); print(ray.available_resources())"
see no CPUs in available resources

In 0.8.7 and 0.8.6, after following the steps above, I see 24 CPUs.
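As an aside, the check in the last step can be flaky right after workers join, since resources register asynchronously. A minimal polling helper might make the comparison between versions more reliable; `get_resources` here is a hypothetical zero-arg callable standing in for `ray.available_resources`, and the names are illustrative, not Ray API:

```python
import time

def wait_for_cpus(get_resources, min_cpus, timeout=30.0, interval=0.5):
    """Poll a resource getter until at least min_cpus CPUs are reported.

    get_resources: zero-arg callable returning a dict shaped like
    ray.available_resources(), e.g. {'CPU': 24.0, ...}.
    Returns True once enough CPUs appear, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_resources().get("CPU", 0.0) >= min_cpus:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

On a working 0.8.7 cluster this would return True quickly; on the broken 0.9.0.dev setup described above it would time out and return False.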

bug needs-repro-script

All 35 comments

I will investigate and mark as P0 once I can reproduce it

Was unable to reproduce this. What I tried:

  1. In EC2 spin up 2 nodes and ensure they're in the same VPC, subnet, security group, etc.

  2. Installed the latest nightly wheel on each machine.

  3. Ran

ubuntu@ip~$ /home/ubuntu/.local/bin/ray start --head --num-cpus=0
Local node IP: 172.31.37.161
Available RAM
  Workers: 18.26 GiB
  Objects: 9.15 GiB

  To adjust these values, use
    ray.init(memory=<bytes>, object_store_memory=<bytes>)
Dashboard URL: http://localhost:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='xxx.xxx.xxx.xxx:6379' --redis-password='5241590000000000'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', redis_password='5241590000000000')

  If connection fails, check your firewall settings and other network configuration.

  To terminate the Ray runtime, run
    ray stop
  4. On the other node:

ubuntu@ip:~$ .local/bin/ray start --address='xxx.xxx.xxx.xxx:6379' --redis-password='5241590000000000'
Local node IP: 172.31.42.67
Available RAM
  Workers: 21.34 GiB
  Objects: 9.15 GiB

  To adjust these values, use
    ray.init(memory=<bytes>, object_store_memory=<bytes>)

--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop
  5. Check the resources from the head node:
ubuntu@ip:~$ python3
Python 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(address="auto")
2020-09-05 05:25:03,902 INFO worker.py:633 -- Connecting to existing Ray cluster at address: xxx.xxx.xxx.xxx:6379
{'node_ip_address': 'xxx.xxx.xxx.xxx', 'raylet_ip_address': 'xxx.xxx.xxx.xxx', 'redis_address': 'xxx.xxx.xxx.xxx:6379', 'object_store_address': '/tmp/ray/session_2020-09-05_05-24-24_528207_10923/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2020-09-05_05-24-24_528207_10923/sockets/raylet', 'webui_url': 'localhost:8265', 'session_dir': '/tmp/ray/session_2020-09-05_05-24-24_528207_10923', 'metrics_export_port': 62757}
>>> ray.available_resources()
{'memory': 811.0, 'node:xxx.xxx.xxx.xxx': 1.0, 'object_store_memory': 258.0, 'node:xxx.xxx.xxx.xxx': 1.0, 'CPU': 8.0}
>>> ray.cluster_resources()
{'CPU': 8.0, 'node:xxx.xxx.xxx.xxx': 1.0, 'memory': 811.0, 'object_store_memory': 258.0, 'node:xxx.xxx.xxx.xxx': 1.0} 

back on main machine run: python3 -c "import ray; print(ray.available_resources())"

@raoul-khour-ts Is that actually enough to reproduce the issue? Don't you need to call ray.init(address=...)? Note that if you call ray.init() then it won't attach to the cluster, so you'll just see one machine's resources.

Note that once I do call ray.init(address="auto") I keep getting this in my logs:
(pid=raylet, ip=xx) service_based_gcs_client.cc:207] Couldn't reconnect to GCS server. The last attempted GCS server address was 127.0.0.1:46515

@raoul-khour-ts can you try the steps I used and verify that works?

I did exactly that but in our environment (not EC2 but a local farm of machines)

on 0.8.7 I can recreate your steps all fine and no logs.

on 0.9.0.dev no CPUs show up in the cluster and the above error message keeps getting logged.

The remote machine (and the local one) shows up on my Ray dashboard... for some reason I just don't have access to its CPUs...

Could you do a fresh run of this, then share /tmp/ray/session_latest/logs? Also, what's your OS and Python version?

Unfortunately I cannot share all my logs. Is there a particular file you are interested in?

One notable entry appears in gcs_server.err and gcs_server.out, giving this line:
network_util.cc:62] Failed to find other valid local IP. Using 127.0.0.1, not possible to go distributed!

Is there a change in how this occurs between 0.8.7 and 0.9.0.dev?

I am running Python 3.6.10 on a Linux machine.

Just checked: downgrading to 0.8.7 again does not produce that gcs_server.err entry.

Interestingly I did notice that when I start python and run ray.init(address='auto') on 0.8.7 I get:

import ray
ray.init(address='auto')
WARNING: Logging before InitGoogleLogging() is written to STDERR
global_state_accessor.cc:25] Redis server address = xx:6379, is test flag = 0
redis_client.cc:146] RedisClient connected.
redis_gcs_client.cc:89] RedisGcsClient Connected.
service_based_gcs_client.cc:193] Reconnected to GCS server: xx:43483
service_based_accessor.cc:92] Reestablishing subscription for job info.
...
Reestablishing subscription for worker failures.
ServiceBasedGcsClient Connected.
ray.cluster_resources()
{'object_store_memory': 1786.0, 'memory': 5849.0, 'node:xxx': 1.0, 'node:xx': 1.0, 'CPU': 24.0}

I don't seem to get that output in 0.9.0.dev.

back on main machine run: python3 -c "import ray; print(ray.available_resources())"

@raoul-khour-ts Is that actually enough to reproduce the issue? Don't you need to call ray.init(address=...)? Note that if you call ray.init() then it won't attach to the cluster, so you'll just see one machine's resources.

Hey @robertnishihara, yeah sorry I forgot to add that to my reproduce guide (but yes I did call ray.init(address='auto')). Updated reproduce guide now.

Another observation:
if I run ray start --address='xxx.xxx.xxx.xxx:6379' --redis-password='5241590000000000' --num-cpus=20 on the head, then I see 20 CPUs in the cluster...

Are there any errors/stack traces in raylet.err or any of the worker logs?

raylet.err is empty
raylet.out looks normal

Not sure where the worker logs are. Are those the python-core-driver-xxx.log files?
If so, nothing differs from the 0.8.7 runs.

There were some changes in finding IP addresses from 0.8.7 => 0.9.0.dev0

network_util.cc:62] Failed to find other valid local IP. Using 127.0.0.1, not possible to go distributed!

I believe this log line is something we added recently. What OS are you currently using?

https://github.com/ray-project/ray/pull/10004

One possibility is that your machines are not using any of the network interfaces listed in the PR description. Can you elaborate on your cluster environment?

The workaround is to manually set the Redis key so that raylets can find the IP address of the GCS. See https://github.com/ray-project/ray/issues/8648#issuecomment-664233815

I'm using Debian 9 and python 3.6.10

The workaround is to manually set the Redis key so that raylets can find the IP address of GCS. look at this link #8648 (comment)

I would be happy to test whether this fixes my issue, but I actually have no idea how to set the GcsServerAddress to 10.251.231.121:40531 in Redis manually.

Thank you @rkooo567 in advance!

In this case the gcs server address will be [your head node private ip address]:[gcs_server port], and you can set this by connecting to the Redis in a head node.

so running python3 -c "import redis; r = redis.Redis(host='xxx', port=yyy, db=0); r.set('GcsServerAddress', 'xxx:zzz')"
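The value stored under that key is a plain host:port string. A couple of hypothetical helpers (names are illustrative, not part of Ray's API) show how such an address can be assembled from the head-node IP and the GCS server port, and split back apart:

```python
def make_gcs_address(host, port):
    """Format a head-node IP and GCS server port the way the
    GcsServerAddress Redis key stores it: 'host:port'."""
    return f"{host}:{port}"

def parse_gcs_address(addr):
    """Split 'host:port' back into (host, port).

    rpartition splits on the last ':', so a host containing colons
    would still keep its port separate.
    """
    host, _, port = addr.rpartition(":")
    return host, int(port)
```

With these, the value passed to r.set above would be make_gcs_address of the head node's private IP and the GCS server port.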

I'll try this tomorrow. Thank you @rkooo567, I'll let you know how that goes.

Yeah, I think that should work. Don't forget to specify the GCS server port with "ray start --head --gcs-server-port=" so that you don't need to worry about finding where the GCS server has been bound.

Btw, it is interesting that the new version fails to resolve some addresses that 0.8.7 could. We should probably find a solution for this.

@rkooo567
Sorry about the delay here but it worked!!!:

ray start --head --gcs-server-port=ddd --num-cpus=0

import redis
r = redis.Redis(host='xxx', port=yyy, password=zzz, db=0)
r.get('GcsServerAddress')
>>> 127.0.0.1:ddd
r.set('GcsServerAddress', 'xxx:ddd')

ssh to the other machine
ray start --address=xxx --redis-password=zzz --num-cpus=24

back on the main machine:
python3 -c "import ray; ray.init(address='auto'); print(ray.available_resources())"

{..., 'CPU': 24.0}

Now the question is how do we make this work without having to do all this...

The issue here is that your manual cluster's network interface is probably not typical, and we have trouble resolving addresses. I think I can create a patch that allows you to specify the GCS address. What do you think about this solution?

@rkooo567 honestly, the GCS address is always going to be the --head ip_address, for me at least. Shouldn't that be easy to resolve?

The port seems to be resolving fine...

Ah, I see. So the issue is that the GCS server address doesn't match the head IP address you specified, right?

Well, the GCS server address seems to fail to connect to head_ip_address for some reason and then falls back to localhost. However, that prevents remote machines from connecting.

But if I go back into Redis and manually set it to head_ip_address, everything is fine...

@rkooo567 and I have isolated the issue to private networks: my machine cannot ping out to 8.8.8.8, which makes the GCS configuration fail.
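For context, a common way for a service to discover its outward-facing IP is to connect a UDP socket toward a public address (no packet is actually sent; connect only selects a route) and read back the socket's local endpoint. On an isolated private network with no route to 8.8.8.8, that lookup fails, which is consistent with the fall-back-to-127.0.0.1 behavior described above. A rough sketch of the technique, not Ray's actual implementation:

```python
import socket

def guess_local_ip(probe_addr=("8.8.8.8", 53)):
    """Best-effort local IP discovery via a routed UDP socket.

    connect() on a UDP socket only picks a route and local address;
    no traffic is sent. On hosts with no route to the probe address
    (e.g. an isolated private network), it raises OSError and we fall
    back to the loopback address, which is exactly what prevents a
    multi-node cluster from forming.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(probe_addr)
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        s.close()
```

On the machines described in this thread, such a probe would hit the OSError branch and report 127.0.0.1, matching the "Failed to find other valid local IP" log line.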

Note: I will make a PR that uses Python code to resolve the GCS server address.

Was able to confirm that https://github.com/ray-project/ray/pull/10946 fixed the issue.

Thank you @rkooo567. This should make it out in the 1.1.0 release, hopefully sometime in November?

@raoul-khour-ts Yes! We have a monthly release cycle, and the 1.0 target date is the end of September, so the next version will be released around Nov! There's a possibility to have a shorter release cycle (though I cannot guarantee). The next version might not be 1.1, but it will be 1.0.1. We might have a new second digit release every quarter.
