Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 0.9dev
Python version: 3.6
TensorFlow version: 1.15
OS: Linux
I ran my task on a machine with 2 TB of RAM and fewer than 1000 CPUs. When running the task, I got:
2020-06-11 09:17:17,308 INFO resource_spec.py:212 -- Starting Ray with 11758.35 GiB memory available for workers and up to 186.26 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
Traceback (most recent call last):
File "run_distmarl_mpe.py", line 52, in <module>
main(args)
File "run_distmarl_mpe.py", line 14, in main
U.init_ray(temp_dir=args.temp_dir)
File "/export/home/wqiu/Projects/NewDistMARL/rllib-marl/utils/utils.py", line 207, in init_ray
ray.init(**kwargs)
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/worker.py", line 769, in init
ray_params=ray_params)
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/node.py", line 184, in __init__
self.start_ray_processes()
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/node.py", line 672, in start_ray_processes
self.start_raylet()
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/node.py", line 601, in start_raylet
fate_share=self.kernel_fate_share)
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/services.py", line 1283, in start_raylet
static_resources = resource_spec.to_resource_dict()
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/resource_spec.py", line 120, in to_resource_dict
resource_label, resources))
ValueError: Resource quantities must be at most 100000. Violated by resource 'memory' in {'node:172.21.151.5': 1.0, 'CPU': 10, 'GPU': 1, 'memory': 240811, 'object_store_memory': 2632}.
So I set the memory to 2 TB in ray.init to avoid the memory limit error (a sketch of the call is below). Then I got:
F0609 11:17:52.115675 387084 387084 core_worker.cc:289] Check failed: assigned_port != -1 Failed to allocate a port for the worker. Please specify a wider port range using the '--min-worker-port' and '--max-worker-port' arguments to 'ray start'.
I could not find options to set --min-worker-port and --max-worker-port in Ray.
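For reference, the memory override mentioned above was passed roughly like this (a minimal sketch; the byte value is illustrative, not the exact number used):

import ray

# Cap the memory Ray reserves for workers to roughly 2 TiB (illustrative value).
# object_store_memory can be capped the same way if needed.
ray.init(memory=2 * 1024**4)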
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
To reproduce this bug, simply set the memory to 2 TB on a large machine (one with 2 TB of RAM) and run any RLlib task; a minimal sketch is below.
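One possible sketch of such a reproduction, assuming an RLlib trainer of that era (PPOTrainer on CartPole; the memory figure and worker count are illustrative):

import ray
from ray.rllib.agents.ppo import PPOTrainer

# Reserve roughly 2 TiB for workers, as described above (illustrative value).
ray.init(memory=2 * 1024**4)

# Any small RLlib workload should do; CartPole is used here as a stand-in.
trainer = PPOTrainer(env="CartPole-v0", config={"num_workers": 8})
trainer.train()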
If we cannot run your script, we cannot fix your issue.
How many workers are you running on your machine? Each worker needs its own port, and this most likely happened because you have too many workers (since you are using a very large machine), which exceeds the number of ports in the default port range.
There were 8 rollout workers, and each had 8 envs. There are 2000 CPUs on the machine.
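For reference, that layout corresponds roughly to an RLlib configuration like this (a sketch, not the exact config used):

config = {
    "num_workers": 8,          # rollout workers
    "num_envs_per_worker": 8,  # environments per rollout worker
}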
cc @edoakes
Can you try ps -aux | grep "ray::" | wc -l and show me the output when you run Ray? Also, can you try running a normal Ray script as well? Something like this.
import ray

ray.init(address='auto')

@ray.remote
class Actor:
    def ping(self):
        return True

actors = [Actor.remote() for _ in range(100)]
ray.get([actor.ping.remote() for actor in actors])
Can you try passing those args to ray.init / start? Btw, how is it possible to have 2000 CPUs on a machine?
Hi, I found that --min-worker-port and --max-worker-port are not available in ray.init, right? After checking the machine, it has fewer than 1000 CPUs and 2000 GB of RAM.
Yeah, I believe those options are only available for ray start. Can you try starting your cluster using that CLI?
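For example (the port numbers are illustrative, not required values):

ray start --head --min-worker-port=10002 --max-worker-port=19999

and then connect to the running cluster from the script with:

import ray
ray.init(address='auto')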
Thanks. I will try it.
I am having a similar issue on nodes with 56 logical cores. However, it only seems to affect _certain_ cluster nodes, while other nodes with identical hardware have no problems. Manually specifying a port range does not resolve the problem.
The issue may have resolved itself. I'm not sure what changed but now everything seems to be working.
I'm hitting this issue as well on a cluster of roughly 15 commodity machines. It seems to be completely random, but it is a consistent issue.
I have this issue as well and would like to add that I run with n_workers=0; in other words, a high number of workers is not the cause of this issue. However, I start multiple scripts (via Univa Grid Engine) on the same node, and they all call ray.init().
I just hit this bug too, on a system with 112 CPUs, when running a small Modin application. Normally it works, but this time I had Ray built from source with debug information via bazel build -c dbg //:ray_pkg. The revision is the 0.8.7 tag (ray-0.8.7). Ray doesn't seem to honor the num_cpus parameter and spawns a lot of workers. This results in an enormous number of connections; my netstat -tapn list is longer than the 10000 lines that fit in my terminal scrollback.
Then I decided to try rebuilding everything with optimizations: I removed ~/.cache/bazel and ran bazel build -c opt //:ray_pkg. Now everything works just fine and the ports are no longer exhausted.
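For reference, the num_cpus cap mentioned above is normally passed at init time, roughly like this (a minimal sketch; the value is illustrative):

import ray

# Ask Ray to treat the node as having only 8 CPUs, regardless of how many
# cores the machine actually reports.
ray.init(num_cpus=8)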
Cool, does it need to be built from source?
@GoingMyWay No, you don't have to build Ray from source. I built it from source while trying to debug some strange behaviour, but the debug version (perhaps because it is slower than the optimized one?) triggered this port exhaustion bug. Maybe this information will help the developers reproduce it.