Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 0.9dev
Python version: 3.6
TensorFlow version: 1.15
OS: Linux
I ran my task on a machine with 2 TB of RAM and fewer than 1000 CPUs. When running the task, I got:
2020-06-11 09:17:17,308 INFO resource_spec.py:212 -- Starting Ray with 11758.35 GiB memory available for workers and up to 186.26 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
Traceback (most recent call last):
File "run_distmarl_mpe.py", line 52, in <module>
main(args)
File "run_distmarl_mpe.py", line 14, in main
U.init_ray(temp_dir=args.temp_dir)
File "/export/home/wqiu/Projects/NewDistMARL/rllib-marl/utils/utils.py", line 207, in init_ray
ray.init(**kwargs)
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/worker.py", line 769, in init
ray_params=ray_params)
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/node.py", line 184, in __init__
self.start_ray_processes()
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/node.py", line 672, in start_ray_processes
self.start_raylet()
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/node.py", line 601, in start_raylet
fate_share=self.kernel_fate_share)
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/services.py", line 1283, in start_raylet
static_resources = resource_spec.to_resource_dict()
File "/export/home/wqiu/anaconda3/envs/2ray/lib/python3.6/site-packages/ray/resource_spec.py", line 120, in to_resource_dict
resource_label, resources))
ValueError: Resource quantities must be at most 100000. Violated by resource 'memory' in {'node:172.21.151.5': 1.0, 'CPU': 10, 'GPU': 1, 'memory': 240811, 'object_store_memory': 2632}.
So I set the memory to 2 TB in ray.init to avoid the memory limit error (a sketch of the call is below). Then I got:
F0609 11:17:52.115675 387084 387084 core_worker.cc:289] Check failed: assigned_port != -1 Failed to allocate a port for the worker. Please specify a wider port range using the '--min-worker-port' and '--max-worker-port' arguments to 'ray start'.
I could not find options to set --min-worker-port and --max-worker-port in Ray.
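For reference, the memory override mentioned above was passed roughly like this (a minimal sketch; the byte value is illustrative, not the exact number used):

import ray

# Cap the memory Ray reserves for workers to roughly 2 TiB (illustrative value).
# object_store_memory can be capped the same way if needed.
ray.init(memory=2 * 1024**4)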
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
To reproduce this bug, simply set the memory to 2 TB on a large machine (one with 2 TB of RAM) and run any RLlib task; a minimal sketch is below.
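One possible sketch of such a reproduction, assuming an RLlib trainer of that era (PPOTrainer on CartPole; the memory figure and worker count are illustrative):

import ray
from ray.rllib.agents.ppo import PPOTrainer

# Reserve roughly 2 TiB for workers, as described above (illustrative value).
ray.init(memory=2 * 1024**4)

# Any small RLlib workload should do; CartPole is used here as a stand-in.
trainer = PPOTrainer(env="CartPole-v0", config={"num_workers": 8})
trainer.train()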
If we cannot run your script, we cannot fix your issue.
How many workers are you running on your machine? Each worker needs its own port, and this most likely happened because you have too many workers (since you are using a very large machine), which exceeds the number of ports in the default port range.
There were 8 rollout workers, and each had 8 envs. There are 2000 CPUs on the machine.
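For reference, that layout corresponds roughly to an RLlib configuration like this (a sketch, not the exact config used):

config = {
    "num_workers": 8,          # rollout workers
    "num_envs_per_worker": 8,  # environments per rollout worker
}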
cc @edoakes
Can you try ps -aux | grep "ray::" | wc -l and show me the output when you run Ray? Also, can you try running a normal Ray script as well? Something like this.
import ray

ray.init(address='auto')

@ray.remote
class Actor:
    def ping(self):
        return True

actors = [Actor.remote() for _ in range(100)]
ray.get([actor.ping.remote() for actor in actors])
Can you try passing those args to ray.init / start? Btw, how is it possible to have 2000 CPUs on a machine?
Hi, I found that --min-worker-port and --max-worker-port are not available in ray.init, right? After checking the machine, it has fewer than 1000 CPUs and 2000 GB of RAM.
Yeah, I believe those options are only available for ray start. Can you try starting your cluster using that CLI?
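For example (the port numbers are illustrative, not required values):

ray start --head --min-worker-port=10002 --max-worker-port=19999

and then connect to the running cluster from the script with:

import ray
ray.init(address='auto')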
Thanks. I will try it.
I am having a similar issue on nodes with 56 logical cores. However, it only seems to affect _certain_ cluster nodes, while other nodes with identical hardware have no problems. Manually specifying a port range does not resolve the problem.
The issue may have resolved itself. I'm not sure what changed but now everything seems to be working.
I'm hitting this issue as well on a cluster of roughly 15 commodity machines. It seems to be completely random, but it is a consistent issue.
I have this issue as well and would like to add that I run with n_workers=0; in other words, a high number of workers is not the cause of this issue. However, I start multiple scripts (via Univa Grid Engine) on the same node, and they all call ray.init().
I just hit this bug too, on a system with 112 CPUs, when running a small Modin application. Normally it works, but this time I had Ray built from source with debug information via bazel build -c dbg //:ray_pkg. The revision is the 0.8.7 tag (ray-0.8.7). Ray doesn't seem to honor the num_cpus parameter and spawns a lot of workers. This results in an enormous number of connections; my netstat -tapn list is longer than the 10000 lines that fit in my terminal scrollback.
Then I decided to try rebuilding everything with optimizations: I removed ~/.cache/bazel and ran bazel build -c opt //:ray_pkg. Now everything works just fine and the ports are no longer exhausted.
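For reference, the num_cpus cap mentioned above is normally passed at init time, roughly like this (a minimal sketch; the value is illustrative):

import ray

# Ask Ray to treat the node as having only 8 CPUs, regardless of how many
# cores the machine actually reports.
ray.init(num_cpus=8)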
Cool, does it need to be built from source?
@GoingMyWay No, you don't have to build Ray from source. I built it from source while trying to debug some strange behaviour, but the debug version (perhaps because it is slower than the optimized one?) triggered this port exhaustion bug. Maybe this information will help the developers reproduce it.