Ray version and other system information (Python version, TensorFlow version, OS):
OS version: macOS Big Sur
Ray version: likely affects all versions.
Issue:
ray.init() fails while starting Redis, with the following traceback:
Traceback (most recent call last):
  File "debugging.py", line 2, in <module>
    ray.init()
  File "/Users/haochen/code/ant_ray/python/ray/worker.py", line 740, in init
    ray_params=ray_params)
  File "/Users/haochen/code/ant_ray/python/ray/node.py", line 200, in __init__
    self.start_head_processes()
  File "/Users/haochen/code/ant_ray/python/ray/node.py", line 801, in start_head_processes
    self.start_redis()
  File "/Users/haochen/code/ant_ray/python/ray/node.py", line 580, in start_redis
    fate_share=self.kernel_fate_share)
  File "/Users/haochen/code/ant_ray/python/ray/_private/services.py", line 720, in start_redis
    fate_share=fate_share)
  File "/Users/haochen/code/ant_ray/python/ray/_private/services.py", line 902, in _start_redis_instance
    ulimit_n - redis_client_buffer)
  File "/Users/haochen/.pyenv/versions/3.7.6/lib/python3.7/site-packages/redis/client.py", line 1243, in config_set
    return self.execute_command('CONFIG SET', name, value)
  File "/Users/haochen/.pyenv/versions/3.7.6/lib/python3.7/site-packages/redis/client.py", line 901, in execute_command
    return self.parse_response(conn, command_name, **options)
  File "/Users/haochen/.pyenv/versions/3.7.6/lib/python3.7/site-packages/redis/client.py", line 915, in parse_response
    response = connection.read_response()
  File "/Users/haochen/.pyenv/versions/3.7.6/lib/python3.7/site-packages/redis/connection.py", line 747, in read_response
    raise response
redis.exceptions.ResponseError: The operating system is not able to handle the specified number of clients, try with -33
Digging into this, I found the cause: resource.getrlimit(resource.RLIMIT_NOFILE)[0] (see here) now returns 9223372036854775807 on Big Sur, whereas it returned 256 on previous macOS versions.
Removing this line fixes the issue. @ericl @edoakes @rkooo567 Do you know what the purpose of this code is? Is it still needed?
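For anyone who wants to check their own machine, here is a minimal sketch that reads the same limit and shows the value that would end up being passed as maxclients. The helper name inspect_nofile_limit is made up for illustration, and the buffer of 32 is an assumption; the real redis_client_buffer is defined in ray/_private/services.py.

```python
import resource

# Assumed buffer size for illustration; the real redis_client_buffer
# value is defined in ray/_private/services.py.
REDIS_CLIENT_BUFFER = 32


def inspect_nofile_limit():
    """Return the open-file limits and the maxclients value derived from them."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "soft": soft,
        "hard": hard,
        # True when the soft limit is the platform's "unlimited" sentinel.
        "soft_is_unlimited": soft == resource.RLIM_INFINITY,
        # Mirrors the `ulimit_n - redis_client_buffer` expression in the traceback.
        "derived_maxclients": soft - REDIS_CLIENT_BUFFER,
    }


if __name__ == "__main__":
    for key, value in inspect_nofile_limit().items():
        print(f"{key}: {value}")
```

If soft_is_unlimited comes back True, the derived value is far larger than anything a real Redis server could serve, which matches the failure above.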
Reproduction:
Just ray.init().
Oh, interesting. We set it because the default limit of 256 is usually not enough to handle all connections to Redis.
In a cluster setting, you could have hundreds of thousands of workers, so maxclients needs to be at least that large.
I just found that resource.getrlimit(resource.RLIMIT_NOFILE)[0] returning 9223372036854775807 is not related to Big Sur; it's caused by this line.
It looks like you've already figured out the issue, but I'll post my error message here since it's slightly different: the last line suggests what we could try setting the upper limit to (namely, 4294967295). I don't know if that makes sense, though; I'm not familiar with this part of the code.
[...same as above...]
  File "/Users/archit/anaconda3/envs/ray-py36/lib/python3.6/site-packages/redis/connection.py", line 756, in read_response
    raise response
redis.exceptions.ResponseError: Invalid argument '9223372036854775775' for CONFIG SET 'maxclients' - argument must be between 1 and 4294967295 inclusive
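The rejected value 9223372036854775775 is exactly 9223372036854775807 - 32, which fits the ulimit_n - redis_client_buffer expression in the traceback with a buffer of 32. One possible direction, sketched below, is to clamp the computed value to the range Redis reports as valid. The function name clamp_maxclients and the constants are hypothetical, not Ray's actual code, and this is only an illustration of the idea, not a tested fix.

```python
# Upper bound taken from the Redis error message above.
REDIS_MAXCLIENTS_UPPER_BOUND = 4294967295
# Assumed buffer, inferred from 9223372036854775807 - 9223372036854775775.
REDIS_CLIENT_BUFFER = 32


def clamp_maxclients(ulimit_n, client_buffer=REDIS_CLIENT_BUFFER):
    """Derive a maxclients value from an open-file limit, kept within Redis's range."""
    if ulimit_n < 0:
        # Defensive: treat a negative limit as "unlimited" and use the upper bound.
        return REDIS_MAXCLIENTS_UPPER_BOUND
    wanted = ulimit_n - client_buffer
    # Redis accepts values between 1 and 4294967295 inclusive.
    return max(1, min(wanted, REDIS_MAXCLIENTS_UPPER_BOUND))


if __name__ == "__main__":
    print(clamp_maxclients(9223372036854775807))
    print(clamp_maxclients(256))
```

Note that staying inside the accepted range may not be enough on its own: the first error in this thread ("The operating system is not able to handle the specified number of clients") suggests Redis can still refuse a value that the OS cannot back with file descriptors, so a lower cap may be needed in practice.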