Ray: [ray] Clustering issue

Created on 18 Mar 2019 · 14 Comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6.4
  • Python version: 3.6.8
  • Exact command to reproduce:

Describe the problem


I tried the "manual cluster setup" on GCP instances, but it always fails.
I ran ray start --head --redis-port=6379 on the head machine, then import ray and ray.init(redis_address="10.129.0.7:6379") on the node machine.

The logs are attached below.
They show an exception about raylets.

I also tested Ray versions 0.6.3 and 0.7.0 and got the same result.
Each machine can communicate with Redis without problems,
and all ports are open.

So why can't I set up the cluster?

Source code / logs


log of head

2019-03-18 01:19:44,763 INFO scripts.py:286 -- Using IP address 10.129.0.7 for this node.
2019-03-18 01:19:44,763 INFO node.py:439 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-03-18_01-19-44_3587/logs.
2019-03-18 01:19:44,866 INFO services.py:364 -- Waiting for redis server at 127.0.0.1:6379 to respond...
2019-03-18 01:19:44,975 INFO services.py:364 -- Waiting for redis server at 127.0.0.1:32675 to respond...
2019-03-18 01:19:44,976 INFO services.py:761 -- Starting Redis shard with 6.32 GB max memory.
2019-03-18 01:19:44,984 INFO services.py:1449 -- Starting the Plasma object store with 9.48 GB memory using /dev/shm.
2019-03-18 01:19:44,991 INFO scripts.py:317 --
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --redis-address 10.129.0.7:6379

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(redis_address="10.129.0.7:6379")

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop

log of node 1

>>> import ray
>>> ray.init(redis_address="10.129.0.7:6379")
2019-03-18 01:21:15,265 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:16,267 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:17,271 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:18,274 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:19,276 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1499, in init
    redis_address, node_ip_address, redis_password=redis_password)
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1242, in get_address_info_from_redis
    redis_address, node_ip_address, redis_password=redis_password)
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1222, in get_address_info_from_redis_helper
    "Redis has started but no raylets have registered yet.")
Exception: Redis has started but no raylets have registered yet.

All 14 comments

Can you try ray.init(redis_address="localhost:6379"). Does that work?

Also, can you do ps aux | grep "raylet/raylet " to see if there are any live raylets?
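The same check can be scripted. The following is a minimal sketch, not part of Ray: the helper names `find_raylet_lines` and `list_live_raylets` are hypothetical, and it assumes `ps aux` is available on the machine. It keeps the lines of `ps aux` output that mention the raylet binary and drops the `grep` process itself, which otherwise shows up as a false match.

```python
import subprocess
from typing import List


def find_raylet_lines(ps_output: str) -> List[str]:
    """Return the lines of `ps aux` output that look like raylet processes."""
    matches = []
    for line in ps_output.splitlines():
        # The grep process used to search for raylets also contains the
        # search string, so skip any line whose command includes `grep`.
        if "grep" in line.split():
            continue
        if "raylet/raylet" in line:
            matches.append(line)
    return matches


def list_live_raylets() -> List[str]:
    """Run `ps aux` (assumed to exist on this machine) and filter its output."""
    result = subprocess.run(
        ["ps", "aux"], stdout=subprocess.PIPE, universal_newlines=True
    )
    return find_raylet_lines(result.stdout)
```

Calling `list_live_raylets()` on a node returns an empty list when no raylet is running there, which matches the situation where only the `grep` line appears in the `ps` output.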

Thank you for answering my question.
ray.init(redis_address="localhost:6379") works on the head machine.

>>> import ray
>>> ray.init(redis_address="localhost:6379")
{'node_ip_address': '10.129.0.7', 'redis_address': '10.129.0.7:6379', 'object_store_address': '/tmp/ray/session_2019-03-20_02-10-36_3239/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2019-03-20_02-10-36_3239/sockets/raylet', 'webui_url': None}

But it does not work on the node machine.

>>> import ray
>>> ray.init(redis_address="localhost:6379")
2019-03-20 02:12:35,118 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-20 02:12:36,119 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-20 02:12:37,121 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-20 02:12:38,123 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-20 02:12:39,125 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/connection.py", line 492, in connect
    sock = self._connect()
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/connection.py", line 550, in _connect
    raise err
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/connection.py", line 538, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1499, in init
    redis_address, node_ip_address, redis_password=redis_password)
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1242, in get_address_info_from_redis
    redis_address, node_ip_address, redis_password=redis_password)
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1207, in get_address_info_from_redis_helper
    client_table = ray.experimental.state.parse_client_table(redis_client)
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/experimental/state.py", line 32, in parse_client_table
    "", NIL_CLIENT_ID)
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/client.py", line 772, in execute_command
    connection = pool.get_connection(command_name, **options)
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/connection.py", line 994, in get_connection
    connection.connect()
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/redis/connection.py", line 497, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 10.129.0.10:6379. Connection refused.
>>>

The live raylets on the head machine are:

jason.p+  3291  0.1  0.0  52792  5728 pts/8    Sl   02:10   0:00 /home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/core/src/ray/raylet/raylet /tmp/ray/session_2019-03-20_02-10-36_3239/sockets/raylet /tmp/ray/session_2019-03-20_02-10-36_3239/sockets/plasma_store 0 0 10.129.0.7 10.129.0.7 6379 8 8 CPU,8,GPU,1  /home/jason.park/.pyenv/versions/3.6.8/bin/python3.6 /home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/workers/default_worker.py --node-ip-address=10.129.0.7 --object-store-name=/tmp/ray/session_2019-03-20_02-10-36_3239/sockets/plasma_store --raylet-name=/tmp/ray/session_2019-03-20_02-10-36_3239/sockets/raylet --redis-address=10.129.0.7:6379 --temp-dir=/tmp/ray/session_2019-03-20_02-10-36_3239   /tmp/ray/session_2019-03-20_02-10-36_3239
jason.p+  3530  0.0  0.0  12944   940 pts/8    S+   02:15   0:00 grep --color=auto raylet/raylet

On the node machine:

jason.p+  3685  0.0  0.0  12944  1012 pts/8    S+   02:15   0:00 grep --color=auto raylet/raylet

Hi,

I am facing the same issue with the Ray cluster. I am using Python 3.5, Ray 0.7, and the latest Ray repository from GitHub (0.6.4).

from head node:
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

import ray
ray.init(redis_address="localhost:6379")
{'webui_url': None, 'object_store_address': '/tmp/ray/session_2019-03-26_14-26-20_2021/sockets/plasma_store', 'redis_address': '172.31.17.10:6379', 'node_ip_address': '172.31.17.10', 'raylet_socket_name': '/tmp/ray/session_2019-03-26_14-26-20_2021/sockets/raylet'}

from worker node:
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

import ray
ray.init(redis_address="localhost:6379")
2019-03-26 14:41:35,126 WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-26 14:41:36,127 WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-26 14:41:37,129 WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-26 14:41:38,131 WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
^CTraceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.5/site-packages/redis/connection.py", line 492, in connect
    sock = self._connect()
  File "/home/ubuntu/.local/lib/python3.5/site-packages/redis/connection.py", line 550, in _connect
    raise err
  File "/home/ubuntu/.local/lib/python3.5/site-packages/redis/connection.py", line 538, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

@robertnishihara
Do you know why the nodes in the cluster are not communicating with each other?
Kindly help...

I had the same problem when manually setting up the cluster. For me, the problem was that I had not opened enough ports for Ray. According to this comment, multiple ports need to be open.

I solved this problem by opening ports 6379, 6380, 12345, and 12346 on all nodes.

On the head node:

ray start --head --redis-port=6379 --redis-shard-ports=6380 \
--node-manager-port=12345 --object-manager-port=12346

On the other nodes:

ray start --redis-address=<head-node-ip>:6379 \
--node-manager-port=12345 --object-manager-port=12346

Now I can connect a driver to the cluster on both the head node and the other nodes:

ray.init(redis_address="<head-node-ip>:6379")
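Before blaming Ray, it can help to confirm that the pinned ports are actually reachable from the other machine. The following is a rough sketch using only the Python standard library; the function names `check_port` and `check_ports` are hypothetical, and the port list is simply the one used in the commands above. It should be run on a worker node against the head node's IP while Ray is running on the head, because a port with nothing listening on it also reports "refused".

```python
import socket
from typing import Dict, Iterable


def check_port(host: str, port: int, timeout: float = 2.0) -> str:
    """Attempt a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection.
        return "refused"
    except socket.timeout:
        # No answer at all, which is often a firewall silently dropping packets.
        return "timeout"
    except OSError as exc:
        return "error: {}".format(exc)


def check_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, str]:
    """Check several ports on one host and return a port -> status mapping."""
    return {port: check_port(host, port, timeout) for port in ports}


# Example (replace the placeholder with the real head node IP):
# print(check_ports("<head-node-ip>", [6379, 6380, 12345, 12346]))
```

A "timeout" on a port that the head node has open usually points at a firewall rule, while "refused" usually means the process that should own the port is not running or is bound to a different address.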

@y-xue
It works!!!!
I could run the cluster from a Python script after the ports were opened on all nodes.
Thank you very much @y-xue

The problem still persists even after opening the ports. I followed exactly the commands suggested by @y-xue, but the error is the same.
Also, the raylet process starts and then disappears.
Please point out anything that I'm missing. Thanks.

I am facing the exact same issue. Any pointers on the root cause and how to fix it?

Starting the head node with

--node-ip-address <head-node-ip>

fixed the problem for me.
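To keep the head and worker invocations consistent, the flags mentioned in this thread can be assembled in one place. This is only a sketch: `head_command`, `worker_command`, and `to_shell` are hypothetical helpers, the default ports are the ones suggested earlier in the thread, and the flag names are copied from the commands posted here for Ray 0.6 to 0.8, so they may differ in other Ray versions.

```python
import shlex
from typing import List, Optional


def head_command(
    redis_port: int = 6379,
    redis_shard_port: int = 6380,
    node_manager_port: int = 12345,
    object_manager_port: int = 12346,
    node_ip_address: Optional[str] = None,
) -> List[str]:
    """Build the `ray start --head` command with pinned ports."""
    cmd = [
        "ray", "start", "--head",
        "--redis-port={}".format(redis_port),
        "--redis-shard-ports={}".format(redis_shard_port),
        "--node-manager-port={}".format(node_manager_port),
        "--object-manager-port={}".format(object_manager_port),
    ]
    if node_ip_address is not None:
        # Optional: advertise an explicit IP for the head node.
        cmd.append("--node-ip-address={}".format(node_ip_address))
    return cmd


def worker_command(
    head_ip: str,
    redis_port: int = 6379,
    node_manager_port: int = 12345,
    object_manager_port: int = 12346,
) -> List[str]:
    """Build the `ray start` command for a non-head node."""
    return [
        "ray", "start",
        "--redis-address={}:{}".format(head_ip, redis_port),
        "--node-manager-port={}".format(node_manager_port),
        "--object-manager-port={}".format(object_manager_port),
    ]


def to_shell(cmd: List[str]) -> str:
    """Render a command list as a shell-safe string for copy and paste."""
    return " ".join(shlex.quote(part) for part in cmd)
```

For example, `to_shell(head_command(node_ip_address="10.129.0.7"))` yields the head command with the explicit IP appended, and `to_shell(worker_command("10.129.0.7"))` yields the matching command for the other nodes.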

So you call 'ray start' from the head node itself?

I am having trouble connecting to my cluster via python from my local machine. I am trying to (1) start the cluster from my local machine with ray up or ray start, which is successful, then (2) ray.init(redis_address=':

I am confident the cluster starts because I am able to run the python script with ray submit config.yaml script.py, which I understand copies the python script to the head node. However, I imagine it is possible to connect to your cluster from your local machine and make remote cluster calls?

Has anyone else experienced this? Could the above responders kindly provide some more specifics on where they are starting the cluster from, where they are running their python scripts from, etc?

redis_address

I have the same problem when I run Ray in a container and try to connect from my local machine to the head node inside the container.

I have the same problem. I tried the suggestions above and I can only submit jobs from the head node. I am fairly sure it is a firewall issue but do not know how to control all the ports that ray apparently needs open as I of course do not want to open everything. Very frustrating.

Same issue here

2020-04-16 18:45:58,907 INFO scripts.py:374 -- Using IP address 10.0.7.12:6379 for this node.
 2020-04-16 18:45:58,922 INFO resource_spec.py:212 -- Starting Ray with 34.57 GiB memory available for workers and up to 46.57 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>,         object_store_memory=<bytes>).
 2020-04-16 18:45:59,882 INFO services.py:563 -- Failed to connect to the redis server, retrying.
 2020-04-16 18:45:59,915 INFO services.py:563 -- Failed to connect to the redis server, retrying.
 2020-04-16 18:45:59,913 INFO services.py:563 -- Failed to connect to the redis server, retrying.
 2020-04-16 18:45:59,890 INFO scripts.py:446 -- Using IP address 10.0.7.111 for this node.
 2020-04-16 18:45:59,918 INFO scripts.py:446 -- Using IP address 10.0.7.18 for this node.
 2020-04-16 18:45:59,915 INFO scripts.py:446 -- Using IP address 10.0.7.133 for this node.
 Traceback (most recent call last):
 Traceback (most recent call last):
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/bin/ray", line 10, in <module>
 Traceback (most recent call last):
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/bin/ray", line 10, in <module>
     sys.exit(main())
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/bin/ray", line 10, in <module>
     sys.exit(main())
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1045, in main
     sys.exit(main())
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1045, in main
     return cli()
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1045, in main
     return cli()
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 764, in __call__
     return cli()
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 764, in __call__
     return self.main(*args, **kwargs)
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 764, in __call__
     return self.main(*args, **kwargs)
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 717, in main
     return self.main(*args, **kwargs)
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 717, in main
     rv = self.invoke(ctx)
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 717, in main
     rv = self.invoke(ctx)
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
     rv = self.invoke(ctx)
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
     return _process_result(sub_ctx.command.invoke(sub_ctx))
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
     return _process_result(sub_ctx.command.invoke(sub_ctx))
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 956, in invoke
     return _process_result(sub_ctx.command.invoke(sub_ctx))
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 956, in invoke
     return ctx.invoke(self.callback, **ctx.params)
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 956, in invoke
     return ctx.invoke(self.callback, **ctx.params)
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 555, in invoke
     return ctx.invoke(self.callback, **ctx.params)
   File "/global/scratch/jiahaoyao/anaconda3/envs/qrl2/lib/python3.7/site-packages/click/core.py", line 555, in invoke

I can confirm that this issue persists in ray 0.8.6, even though I am not seeing any errors in the log files.
