I tried the "manual cluster setup" on GCP instances, but it always fails.
I ran ray start --head --redis-port=6379 on the head machine, and ran import ray followed by ray.init(redis_address="10.129.0.7:6379") on the node machine.
The logs are attached below.
They show an exception about raylets.
I also tested Ray versions 0.6.3 and 0.7.0 and got the same result.
The machines can communicate with Redis without any problem,
and all ports are open.
So why can't I set up the cluster?
log of head
2019-03-18 01:19:44,763 INFO scripts.py:286 -- Using IP address 10.129.0.7 for this node.
2019-03-18 01:19:44,763 INFO node.py:439 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-03-18_01-19-44_3587/logs.
2019-03-18 01:19:44,866 INFO services.py:364 -- Waiting for redis server at 127.0.0.1:6379 to respond...
2019-03-18 01:19:44,975 INFO services.py:364 -- Waiting for redis server at 127.0.0.1:32675 to respond...
2019-03-18 01:19:44,976 INFO services.py:761 -- Starting Redis shard with 6.32 GB max memory.
2019-03-18 01:19:44,984 INFO services.py:1449 -- Starting the Plasma object store with 9.48 GB memory using /dev/shm.
2019-03-18 01:19:44,991 INFO scripts.py:317 --
Started Ray on this node. You can add additional nodes to the cluster by calling
ray start --redis-address 10.129.0.7:6379
from the node you wish to add. You can connect a driver to the cluster from Python by running
import ray
ray.init(redis_address="10.129.0.7:6379")
If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
ray stop
log of node 1
>>> import ray
>>> ray.init(redis_address="10.129.0.7:6379")
2019-03-18 01:21:15,265 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:16,267 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:17,271 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:18,274 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:19,276 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1499, in init
redis_address, node_ip_address, redis_password=redis_password)
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1242, in get_address_info_from_redis
redis_address, node_ip_address, redis_password=redis_password)
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1222, in get_address_info_from_redis_helper
"Redis has started but no raylets have registered yet.")
Exception: Redis has started but no raylets have registered yet.
Can you try ray.init(redis_address="localhost:6379")? Does that work?
Also, can you do ps aux | grep "raylet/raylet " to see if there are any live raylets?
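The same check can be scripted if it needs to run on several machines. The following is a minimal sketch, assuming a Unix-like system with `ps` available and assuming the raylet command line contains `raylet/raylet ` as in the command above. The helper names `find_raylet_lines` and `live_raylets` are hypothetical and not part of Ray.

```python
import subprocess


def find_raylet_lines(ps_output):
    """Return the lines of `ps aux` output that look like live raylets.

    Lines belonging to a `grep` process are skipped, so output pasted
    from `ps aux | grep ...` can be filtered the same way.
    """
    return [
        line
        for line in ps_output.splitlines()
        if "raylet/raylet " in line and "grep" not in line
    ]


def live_raylets():
    """Run `ps aux` on this machine and return the matching lines."""
    result = subprocess.run(
        ["ps", "aux"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return find_raylet_lines(result.stdout)
```

An empty list from `live_raylets()` corresponds to the case where `grep` only finds itself, meaning no raylet is running on that machine.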
Thank you for answering my question.
ray.init(redis_address="localhost:6379") works on the head machine.
>>> import ray
>>> ray.init(redis_address="localhost:6379")
{'node_ip_address': '10.129.0.7', 'redis_address': '10.129.0.7:6379', 'object_store_address': '/tmp/ray/session_2019-03-20_02-10-36_3239/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2019-03-20_02-10-36_3239/sockets/raylet', 'webui_url': None}
But it does not work on the node machine.
>>> import ray
>>> ray.init(redis_address="localhost:6379")
2019-03-20 02:12:35,118 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-20 02:12:36,119 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-20 02:12:37,121 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-20 02:12:38,123 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-20 02:12:39,125 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/connection.py", line 492, in connect
sock = self._connect()
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/connection.py", line 550, in _connect
raise err
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/connection.py", line 538, in _connect
sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1499, in init
redis_address, node_ip_address, redis_password=redis_password)
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1242, in get_address_info_from_redis
redis_address, node_ip_address, redis_password=redis_password)
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1207, in get_address_info_from_redis_helper
client_table = ray.experimental.state.parse_client_table(redis_client)
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/experimental/state.py", line 32, in parse_client_table
"", NIL_CLIENT_ID)
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/client.py", line 772, in execute_command
connection = pool.get_connection(command_name, **options)
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/connection.py", line 994, in get_connection
connection.connect()
File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/connection.py", line 497, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 10.129.0.10:6379. Connection refused.
>>>
The live raylets on the head machine are:
jason.p+ 3291 0.1 0.0 52792 5728 pts/8 Sl 02:10 0:00 /home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/core/src/ray/raylet/raylet /tmp/ray/session_2019-03-20_02-10-36_3239/sockets/raylet /tmp/ray/session_2019-03-20_02-10-36_3239/sockets/plasma_store 0 0 10.129.0.7 10.129.0.7 6379 8 8 CPU,8,GPU,1 /home/jason.park/.pyenv/versions/3.6.8/bin/python3.6 /home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/workers/default_worker.py --node-ip-address=10.129.0.7 --object-store-name=/tmp/ray/session_2019-03-20_02-10-36_3239/sockets/plasma_store --raylet-name=/tmp/ray/session_2019-03-20_02-10-36_3239/sockets/raylet --redis-address=10.129.0.7:6379 --temp-dir=/tmp/ray/session_2019-03-20_02-10-36_3239 /tmp/ray/session_2019-03-20_02-10-36_3239
jason.p+ 3530 0.0 0.0 12944 940 pts/8 S+ 02:15 0:00 grep --color=auto raylet/raylet
On the node machine:
jason.p+ 3685 0.0 0.0 12944 1012 pts/8 S+ 02:15 0:00 grep --color=auto raylet/raylet
Hi,
I am facing the same issue with the Ray cluster. I am using Python 3.5, Ray 0.7, and the latest Ray repository from GitHub (0.6.4).
from head node:
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
import ray
ray.init(redis_address="localhost:6379")
{'webui_url': None, 'object_store_address': '/tmp/ray/session_2019-03-26_14-26-20_2021/sockets/plasma_store', 'redis_address': '172.31.17.10:6379', 'node_ip_address': '172.31.17.10', 'raylet_socket_name': '/tmp/ray/session_2019-03-26_14-26-20_2021/sockets/raylet'}
from worker node:
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
import ray
ray.init(redis_address="localhost:6379")
2019-03-26 14:41:35,126 WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-26 14:41:36,127 WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-26 14:41:37,129 WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-26 14:41:38,131 WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
^CTraceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.5/site-packages/redis/connection.py", line 492, in connect
sock = self._connect()
File "/home/ubuntu/.local/lib/python3.5/site-packages/redis/connection.py", line 550, in _connect
raise err
File "/home/ubuntu/.local/lib/python3.5/site-packages/redis/connection.py", line 538, in _connect
sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused
@robertnishihara
Do you know why the nodes in the cluster are not communicating?
Kindly help...
I had the same problem when manually setting up the cluster. For me, the problem was that I had not opened enough ports for Ray. According to this comment, multiple ports need to be open.
I solved this problem by opening ports 6379, 6380, 12345, and 12346 on all nodes.
On the head node:
ray start --head --redis-port=6379 --redis-shard-ports=6380 \
--node-manager-port=12345 --object-manager-port=12346
On the other nodes:
ray start --redis-address=<head-node-ip>:6379 \
--node-manager-port=12345 --object-manager-port=12346
Now I can connect a driver to the cluster from both the head node and the other nodes:
ray.init(redis_address="<head-node-ip>:6379")
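To confirm that the ports listed above are actually reachable from another machine, a plain TCP connection test can help. This is a minimal sketch using only the Python standard library; `port_is_open` and `check_ports` are hypothetical helper names, not part of Ray. Note that a port only accepts connections while a process is listening on it, so the check is meaningful only after ray start has run on the target machine.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port) for port in ports}
```

For example, `check_ports("<head-node-ip>", [6379, 6380, 12345, 12346])` run from a worker node shows which of the four ports the head node accepts connections on. A refused connection can mean either a firewall rule or that nothing is listening, so the result narrows the problem down rather than identifying the cause.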
@y-xue
It works!!!!
I could use the cluster from a Python script once the ports were opened on all nodes.
Thank you very much @y-xue
The problem still persists even after opening the ports. I followed exactly the same commands suggested by @y-xue, but the error is the same.
Also, the raylet process starts and then disappears.
Please point out anything I'm missing. Thanks.
I am facing the exact same issue. Any pointers on the root cause and how to fix it?
Starting the head node with
--node-ip-address <head-node-ip>
fixed the problem for me.
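The thread does not say why this flag helped. One plausible reason, stated here as an assumption, is that on a machine with several network interfaces the automatically chosen address is not the one the other nodes can reach. The sketch below shows one way to find the local address the operating system would use to reach a given peer, which can then be passed to --node-ip-address. `outbound_ip` is a hypothetical helper, not part of Ray.

```python
import socket


def outbound_ip(target_host, target_port=80):
    """Return the local IP address the OS would use to reach target_host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket only selects a route and a source
        # address; it does not send any packet to the target.
        sock.connect((target_host, target_port))
        return sock.getsockname()[0]
    finally:
        sock.close()
```

Calling `outbound_ip("<worker-node-ip>")` on the head machine gives the address that worker would see traffic coming from, which is a reasonable candidate for <head-node-ip>.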
So you call 'ray start' from the head node itself?
I am having trouble connecting to my cluster via python from my local machine. I am trying to (1) start the cluster from my local machine with ray up or ray start, which is successful, then (2) ray.init(redis_address=' I am confident the cluster starts because I am able to run the python script with ray submit config.yaml script.py, which I understand copies the python script to the head node. However, I imagine it is possible to connect to your cluster from your local machine and make remote cluster calls? Has anyone else experienced this? Could the above responders kindly provide some more specifics on where they are starting the cluster from, where they are running their python scripts from, etc?
redis_address
I run Ray in a container and have the same problem when I connect from my local machine to the head node in the container.
I have the same problem. I tried the suggestions above and I can only submit jobs from the head node. I am fairly sure it is a firewall issue but do not know how to control all the ports that ray apparently needs open as I of course do not want to open everything. Very frustrating.
Same issue here
2020-04-16 18:45:58,907 INFO scripts.py:374 -- Using IP address 10.0.7.12:6379 for this node.
2020-04-16 18:45:58,922 INFO resource_spec.py:212 -- Starting Ray with 34.57 GiB memory available for workers and up to 46.57 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-16 18:45:59,882 INFO services.py:563 -- Failed to connect to the redis server, retrying.
2020-04-16 18:45:59,915 INFO services.py:563 -- Failed to connect to the redis server, retrying.
2020-04-16 18:45:59,913 INFO services.py:563 -- Failed to connect to the redis server, retrying.
2020-04-16 18:45:59,890 INFO scripts.py:446 -- Using IP address 10.0.7.111 for this node.
2020-04-16 18:45:59,918 INFO scripts.py:446 -- Using IP address 10.0.7.18 for this node.
2020-04-16 18:45:59,915 INFO scripts.py:446 -- Using IP address 10.0.7.133 for this node.
Traceback (most recent call last):
Traceback (most recent call last):
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/bin/ray", line 10, in <module>
Traceback (most recent call last):
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/bin/ray", line 10, in <module>
sys.exit(main())
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/bin/ray", line 10, in <module>
sys.exit(main())
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1045, in main
sys.exit(main())
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1045, in main
return cli()
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1045, in main
return cli()
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return cli()
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 717, in main
return self.main(*args, **kwargs)
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
rv = self.invoke(ctx)
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 555, in invoke
I can confirm that this issue persists in ray 0.8.6, even though I am not seeing any errors in the log files.