[Ray] Manual Cluster Setup (not bringing CPUs into available resources)

Created on 4 Sep 2020 · 35 comments · Source: ray-project/ray


[RAY CORE]
When following https://docs.ray.io/en/latest/cluster/index.html#manual-ray-cluster-setup
on 0.8.7: once I have connected the remote worker nodes, I see the number of available CPUs go up. However, on the current 0.9.0.dev, CPUs stay at 0.

Ray version and other system information (Python version, TensorFlow version, OS):
0.9.0.dev currently seems to break manual cluster setup.

Reproduction (REQUIRED)

install the latest 0.9.0.dev
on the main machine run: ray start --head --num-cpus=0
on the worker machine run: ray start --address=xxx --redis-password=xxx --num-cpus=24
back on the main machine run: python3 -c "import ray; ray.init(address='auto'); print(ray.available_resources())"
see no CPUs in available resources

In 0.8.7 and 0.8.6, after following the steps above, I see 24 CPUs.
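As an aside, the check in the last step can be flaky right after workers join, since resources register asynchronously. A minimal polling helper might make the comparison between versions more reliable; `get_resources` here is a hypothetical zero-arg callable standing in for `ray.available_resources`, and the names are illustrative, not Ray API:

```python
import time

def wait_for_cpus(get_resources, min_cpus, timeout=30.0, interval=0.5):
    """Poll a resource getter until at least min_cpus CPUs are reported.

    get_resources: zero-arg callable returning a dict shaped like
    ray.available_resources(), e.g. {'CPU': 24.0, ...}.
    Returns True once enough CPUs appear, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_resources().get("CPU", 0.0) >= min_cpus:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

On a working 0.8.7 cluster this would return True quickly; on the broken 0.9.0.dev setup described above it would time out and return False.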

bug needs-repro-script

All 35 comments

I will investigate and mark as P0 once I can reproduce it

Was unable to reproduce this. What I tried:

  1. In EC2 spin up 2 nodes and ensure they're in the same VPC, subnet, security group, etc.

  2. Installed the latest nightly wheel on each machine.

  3. Ran

ubuntu@ip~$ /home/ubuntu/.local/bin/ray start --head --num-cpus=0
Local node IP: 172.31.37.161
Available RAM
  Workers: 18.26 GiB
  Objects: 9.15 GiB

  To adjust these values, use
    ray.init(memory=<bytes>, object_store_memory=<bytes>)
Dashboard URL: http://localhost:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='xxx.xxx.xxx.xxx:6379' --redis-password='5241590000000000'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', redis_password='5241590000000000')

  If connection fails, check your firewall settings and other network configuration.

  To terminate the Ray runtime, run
    ray stop
  4. On the other node:

ubuntu@ip:~$ .local/bin/ray start --address='xxx.xxx.xxx.xxx:6379' --redis-password='5241590000000000'
Local node IP: 172.31.42.67
Available RAM
  Workers: 21.34 GiB
  Objects: 9.15 GiB

  To adjust these values, use
    ray.init(memory=<bytes>, object_store_memory=<bytes>)

--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop
  5. Check the resources from the head node:
ubuntu@ip:~$ python3
Python 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(address="auto")
2020-09-05 05:25:03,902 INFO worker.py:633 -- Connecting to existing Ray cluster at address: xxx.xxx.xxx.xxx:6379
{'node_ip_address': 'xxx.xxx.xxx.xxx', 'raylet_ip_address': 'xxx.xxx.xxx.xxx', 'redis_address': 'xxx.xxx.xxx.xxx:6379', 'object_store_address': '/tmp/ray/session_2020-09-05_05-24-24_528207_10923/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2020-09-05_05-24-24_528207_10923/sockets/raylet', 'webui_url': 'localhost:8265', 'session_dir': '/tmp/ray/session_2020-09-05_05-24-24_528207_10923', 'metrics_export_port': 62757}
>>> ray.available_resources()
{'memory': 811.0, 'node:xxx.xxx.xxx.xxx': 1.0, 'object_store_memory': 258.0, 'node:xxx.xxx.xxx.xxx': 1.0, 'CPU': 8.0}
>>> ray.cluster_resources()
{'CPU': 8.0, 'node:xxx.xxx.xxx.xxx': 1.0, 'memory': 811.0, 'object_store_memory': 258.0, 'node:xxx.xxx.xxx.xxx': 1.0} 

back on main machine run: python3 -c "import ray; print(ray.available_resources())"

@raoul-khour-ts Is that actually enough to reproduce the issue? Don't you need to call ray.init(address=...)? Note that if you call ray.init() then it won't attach to the cluster, so you'll just see one machine's resources.

Note that once I do call ray.init(address="auto") I keep getting this in my logs:
(pid=raylet, ip=xx) service_based_gcs_client.cc:207] Couldn't reconnect to GCS server. The last attempted GCS server address was 127.0.0.1:46515

@raoul-khour-ts can you try the steps I used and verify that works?

I did exactly that but in our environment (not EC2 but a local farm of machines)

on 0.8.7 I can recreate your steps all fine and no logs.

on 0.9.0.dev no CPUs show up in the cluster and the above error message keeps getting logged.

The remote machine (and the local one) shows up on my Ray dashboard... for some reason I just don't have access to its CPUs...

Could you do a fresh run of this, then share /tmp/ray/session_latest/logs? Also, what's your OS and Python version?

Unfortunately I cannot share all my logs. Is there a particular file you are interested in?

One notable entry appears in gcs_server.err and gcs_server.out, giving this line:
network_util.cc:62] Failed to find other valid local IP. Using 127.0.0.1, not possible to go distributed!

Is there a change in how this occurs between 0.8.7 and 0.9.0.dev?

I am running Python 3.6.10 on a Linux machine.

Just checked: downgrading to 0.8.7 again does not produce that gcs_server.err entry.

Interestingly I did notice that when I start python and run ray.init(address='auto') on 0.8.7 I get:

import ray
ray.init(address='auto')
WARNING: Logging before InitGoogleLogging() is written to STDERR
global_state_accessor.cc:25] Redis server address = xx:6379, is test flag = 0
redis_client.cc:146] RedisClient connected.
redis_gcs_client.cc:89] RedisGcsClient Connected.
service_based_gcs_client.cc:193] Reconnected to GCS server: xx:43483
service_based_accessor.cc:92] Reestablishing subscription for job info.
...
Reestablishing subscription for worker failures.
ServiceBasedGcsClient Connected.
ray.cluster_resources()
{'object_store_memory': 1786.0, 'memory': 5849.0, 'node:xxx': 1.0, 'node:xx': 1.0, 'CPU': 24.0}

I don't seem to get that output in 0.9.0.dev.

back on main machine run: python3 -c "import ray; print(ray.available_resources())"

@raoul-khour-ts Is that actually enough to reproduce the issue? Don't you need to call ray.init(address=...)? Note that if you call ray.init() then it won't attach to the cluster, so you'll just see one machine's resources.

Hey @robertnishihara, yeah sorry I forgot to add that to my reproduce guide (but yes I did call ray.init(address='auto')). Updated reproduce guide now.

Another observation:
if I run ray start --address='xxx.xxx.xxx.xxx:6379' --redis-password='5241590000000000' --num-cpus=20 on the head, then I see 20 CPUs in the cluster...

Are there any errors/stack traces in raylet.err or any of the worker logs?

raylet.err is empty
raylet.out looks normal

Not sure where the worker logs are. Are those the python-core-driver-xxx.log files?
If so, nothing differs from the 0.8.7 runs.

There were some changes in finding IP addresses from 0.8.7 => 0.9.0.dev0

network_util.cc:62] Failed to find other valid local IP. Using 127.0.0.1, not possible to go distributed!

I believe this log line is something we added recently. What OS are you currently using?

https://github.com/ray-project/ray/pull/10004

One possibility is that your machines are not using any of the network interfaces listed in the PR description. Can you elaborate on your cluster environment?

The workaround is to manually set the Redis key so that raylets can find the IP address of the GCS. See https://github.com/ray-project/ray/issues/8648#issuecomment-664233815

I'm using Debian 9 and python 3.6.10

The workaround is to manually set the Redis key so that raylets can find the IP address of GCS. look at this link #8648 (comment)

I would be happy to test whether this fixes my issue, but I actually have no idea how to set the GcsServerAddress to 10.251.231.121:40531 in Redis manually.

Thank you @rkooo567 in advance!

In this case the gcs server address will be [your head node private ip address]:[gcs_server port], and you can set this by connecting to the Redis in a head node.

so running python3 -c "import redis; r = redis.Redis(host='xxx', port=yyy, db=0); r.set('GcsServerAddress', 'xxx:zzz')"
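The value stored under that key is a plain host:port string. A couple of hypothetical helpers (names are illustrative, not part of Ray's API) show how such an address can be assembled from the head-node IP and the GCS server port, and split back apart:

```python
def make_gcs_address(host, port):
    """Format a head-node IP and GCS server port the way the
    GcsServerAddress Redis key stores it: 'host:port'."""
    return f"{host}:{port}"

def parse_gcs_address(addr):
    """Split 'host:port' back into (host, port).

    rpartition splits on the last ':', so a host containing colons
    would still keep its port separate.
    """
    host, _, port = addr.rpartition(":")
    return host, int(port)
```

With these, the value passed to r.set above would be make_gcs_address of the head node's private IP and the GCS server port.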

I'll try this tomorrow. Thank you @rkooo567, I'll let you know how that goes.

Yeah, I think that should work. Don't forget to specify the GCS server port with "ray start --head --gcs-server-port=" so that you don't need to worry about finding where the GCS server has been bound.

Btw, it is interesting that the new version fails to resolve some addresses that 0.8.7 could. We should probably find a solution for this.

@rkooo567
Sorry about the delay here but it worked!!!:

ray start --head --gcs-server-port=ddd --num-cpus=0

import redis
r = redis.Redis(host='xxx', port=yyy, password=zzz, db=0)
r.get('GcsServerAddress')
>>> 127.0.0.1:ddd
r.set('GcsServerAddress', 'xxx:ddd')

ssh to the other machine
ray start --address=xxx --redis-password=zzz --num-cpus=24

back on the main machine:
python3 -c "import ray; ray.init(address='auto'); print(ray.available_resources())"

{..., 'CPU': 24.0}

Now the question is how do we make this work without having to do all this...

The issue here is that your manual cluster's network interface is probably not typical, and we have trouble resolving addresses. I think I can create a patch that allows you to specify the GCS address. What do you think about this solution?

@rkooo567 honestly, the GCS address is always going to be the --head ip_address, for me at least. Shouldn't that be easy to resolve?

The port seems to be resolving fine...

Ah, I see. So the issue is that the GCS server address doesn't match the head IP address you specified, right?

Well, the GCS server address seems to fail to connect to head_ip_address for some reason and then falls back to localhost. However, that prevents remote machines from connecting.

But if I go back into Redis and manually set it to head_ip_address, everything is fine...

@rkooo567 and I have isolated the issue to private networks: my machine cannot ping out to 8.8.8.8, which makes the GCS configuration fail.
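For context, a common way for a service to discover its outward-facing IP is to connect a UDP socket toward a public address (no packet is actually sent; connect only selects a route) and read back the socket's local endpoint. On an isolated private network with no route to 8.8.8.8, that lookup fails, which is consistent with the fall-back-to-127.0.0.1 behavior described above. A rough sketch of the technique, not Ray's actual implementation:

```python
import socket

def guess_local_ip(probe_addr=("8.8.8.8", 53)):
    """Best-effort local IP discovery via a routed UDP socket.

    connect() on a UDP socket only picks a route and local address;
    no traffic is sent. On hosts with no route to the probe address
    (e.g. an isolated private network), it raises OSError and we fall
    back to the loopback address, which is exactly what prevents a
    multi-node cluster from forming.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(probe_addr)
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        s.close()
```

On the machines described in this thread, such a probe would hit the OSError branch and report 127.0.0.1, matching the "Failed to find other valid local IP" log line.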

Note: I will make a PR that uses Python code to resolve the GCS server address.

Was able to confirm that https://github.com/ray-project/ray/pull/10946 fixed the issue.

Thank you @rkooo567. This should make it out in the 1.1.0 release, hopefully sometime in November?

@raoul-khour-ts Yes! We have a monthly release cycle, and the 1.0 target date is the end of September, so the next version will be released around Nov! There's a possibility to have a shorter release cycle (though I cannot guarantee). The next version might not be 1.1, but it will be 1.0.1. We might have a new second digit release every quarter.
