Dan discovered that his application crashes with some unknown issue (usually one of the nodes crashes, which brings the whole script down). To isolate whether it is an application issue or a Ray issue, he created a mock script that emulates his computation pattern (https://github.com/danwallach/arlo-e2e-noop) with the same cluster config.
I've spent some time debugging this, and although we haven't found the root cause yet, we discovered that the script crashes when the cluster reaches a certain (unknown) number of nodes. For example, when we run it with the autoscaler, it always crashes around the 3-minute mark, which is when the number of nodes exceeds that threshold. It did not crash on a 30-node cluster, but it did crash on a 60-node statically sized cluster as well.
cc @ericl @stephanie-wang
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
https://github.com/danwallach/arlo-e2e-noop Run the script from this repo. It doesn't require any dependencies to run.
Interesting. This reminds me of the broadcast crash where lack of rate limiting / duplicate requests causes a crash with high-fanout broadcasts. @rkooo567 are the data sizes here significant or are they trivial?
Outbound data from the head node to the worker nodes is configurable. There's probably 50-ish kB of fixed data going to every node. The input is a list of dictionaries (mapping 100 short string keys to 1 or 0). That list is first "batched" into 10,000 dictionaries at a time, and then "sharded" into 4 inputs for each remote call. Those remote calls then each return one dictionary, which maps those same 100 short strings to 100 4000-ish-bit numbers. Then there's a reduction phase that adds up those 4000-ish-bit numbers.
The crashiness doesn't seem to depend on whether we're in the "mapping" or "reducing" phase, but rather seems related to when the total number of nodes exceeds some threshold.
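For reference, a minimal sketch of that map/shard pattern. The names, sizes, and the no-op tally below are illustrative assumptions based on the description above, not code from the arlo-e2e-noop repo:

```python
import ray

ray.init(address="auto")

BATCH_SIZE = 10_000    # dictionaries per batch, per the description above
SHARDS_PER_BATCH = 4   # remote calls per batch

@ray.remote
def tally_shard(shard, fixed_data):
    # Each input dict maps ~100 short string keys to 0 or 1; the result maps the
    # same keys to (in the real app, ~4000-bit) integers. Here it is just a sum.
    totals = {}
    for row in shard:
        for key, bit in row.items():
            totals[key] = totals.get(key, 0) + bit
    return totals

def shard(batch, n):
    # Split one batch into n roughly equal pieces, one per remote call.
    step = (len(batch) + n - 1) // n
    return [batch[i:i + step] for i in range(0, len(batch), step)]

fixed_ref = ray.put(b"x" * 50_000)  # stands in for the ~50 kB of fixed per-task data
rows = [{f"k{i}": 0 for i in range(100)} for _ in range(20_000)]
futures = []
for start in range(0, len(rows), BATCH_SIZE):
    for piece in shard(rows[start:start + BATCH_SIZE], SHARDS_PER_BATCH):
        futures.append(tally_shard.remote(piece, fixed_ref))
```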
Original bug report: https://github.com/ray-project/ray/issues/11037
When you're playing around with the arlo-e2e-noop repo, a few things worth noting:
- There's a --speedup flag to affect how much time each worker sleeps, to simulate the computation that the real version is doing.
- I use tqdm and my own actor code to keep track of what's going on.
- In compute.py, there are several constants at the top of the file that relate to how the batching & sharding works. Those are also tweakable.

Also, FWIW, I'm doing this on Ubuntu 20.04, starting from the AMI on Amazon, then running ubuntu-setup.sh, which I then saved as my own AMI for faster startup. That's what you'll see referenced in aws-config.yaml. My AMI is probably not reachable by you, but you should at least be able to reproduce it.
I separately tried this with Ubuntu 18.04 and had the same issues.
Also, you'll see that the "reduction" computation is fairly complicated. It calls ray.wait to get any available results and then dispatches the tallying code. I've got a completely unrelated implementation, not part of this repo, that does a tree-structured reduction, and the failures are exactly the same, so I'm tempted not to blame the complexity of my reduction code.
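For anyone skimming, the shape of that reduction is roughly the following. This is a self-contained sketch with a dummy remote task, not the repo's actual code:

```python
import ray

ray.init(address="auto")

@ray.remote
def partial_tally(i):
    # Stand-in for one shard's work; returns a small dict of counts.
    return {"k": i % 2}

futures = [partial_tally.remote(i) for i in range(1000)]

# Fold in whichever results are ready first instead of waiting for all of them.
grand_total = {}
while futures:
    ready, futures = ray.wait(futures, num_returns=1)
    for key, value in ray.get(ready[0]).items():
        grand_total[key] = grand_total.get(key, 0) + value
```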
Sounds good. @rkooo567 probably the next step for us is to simplify this code into a 10-20 line script that can reproduce the core issue.
One last thing worth mentioning: on the real application, from which I derived this "noop" thing, I've currently capped Ray to use at most 12 nodes * 64 vCPUs (= 768 vCPUs total), and everything is running reliably. This has me roughly a factor of 3x away from my target performance goal, but it's close enough that I can live with it for a while. Of course, if/when you've got a patch ready, I'd love to try it out.
I was searching for a similar issue I was having as I tried to install Ray 1.0.0. I have a local cluster that can scale up to 18 nodes / 440 vCPUs, and it works flawlessly on 0.8.6. On 1.0.0, I can't seem to get more than 6-10 nodes up without a trial crashing, and I'm not even using all the cluster resources: the same trial will run with only 3-4 nodes, but when I give the cluster more nodes and run the same trial with the same resourcing, it crashes. And forget running with 18 nodes. I've disabled the dashboard, and it doesn't seem to matter; I end up with core dumps on the head and worker nodes. I haven't verified that I get the same exact behavior in an AWS environment yet (I'll try that today), but I'm on RHEL.
my issue seems to mimic https://github.com/ray-project/ray/issues/11239 as well
I'll take a look at https://github.com/ray-project/ray/issues/11239 in the next week, there's a good chance it's a different issue.
@danwallach What was the threshold number that broke the job? I tried debugging this today, and my suspicion now is that Redis rejects requests from new nodes/workers because it reaches the max number of clients (which is 10,000, and which can be reached if we assume each worker has multiple connections to Redis).
The crashiness seems to be a function of the number of vCPUs. If I keep them around 700, everything works. Somewhere not too far north of that, everything starts crashing. This seems to be independent of the size of each machine. So, more vCPUs per machine, same crashy threshold in terms of the number of vCPUs.
I see. It looks like the Redis max-connection issue starts happening around 1,000 workers. My guess is that when this error happens, your job starts failing, and the error was for some reason not shown. I will keep troubleshooting this.
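For anyone else hitting this, a quick way to check the Redis limit and current connection count from the head node using redis-py. The host, port, and password below are placeholders, not values from this issue; substitute whatever your cluster actually uses:

```python
import redis

# Placeholder connection details; point this at the head node's primary Redis.
r = redis.StrictRedis(host="127.0.0.1", port=6379, password="<redis-password>")

print(r.config_get("maxclients"))              # configured ceiling, e.g. {'maxclients': '10000'}
print(r.info("clients")["connected_clients"])  # clients currently connected
```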
@danwallach When I ran it, I could see these logs from the job console.
(pid=raylet, ip=172.31.37.245) connection.connect()
(pid=raylet, ip=172.31.37.245) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 561, in connect
(pid=raylet, ip=172.31.37.245) self.on_connect()
(pid=raylet, ip=172.31.37.245) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 637, in on_connect
(pid=raylet, ip=172.31.37.245) auth_response = self.read_response()
(pid=raylet, ip=172.31.37.245) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 734, in read_response
(pid=raylet, ip=172.31.37.245) response = self._parser.read_response()
(pid=raylet, ip=172.31.37.245) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 477, in read_response
(pid=raylet, ip=172.31.37.245) raise response
(pid=raylet, ip=172.31.37.245) redis.exceptions.ConnectionError: max number of clients reached
But as far as I remember, you didn't see these logs right?
Also, @danwallach, I saw it start at around 400 workers for 1000-size ballots. Is that normal behavior?
Hmm. I don't recall ever seeing an error message like that, and I'm currently running 700+ workers without any issues.
I just grepped through all the logs from every crash I have, and I've never seen a ConnectionError.
Okay. I successfully ran a 1200+ worker scenario (101 nodes) without any errors using https://github.com/rkooo567/arlo-e2e-noop.
There were some differences in the environment settings because I couldn't run your example without modifying it. Here are the differences:
ubuntu@ip-172-31-51-208:~$ ulimit -n 60500
bash: ulimit: open files: cannot modify limit: Operation not permitted
So, in conclusion, I couldn't reproduce the error. I am not sure what the exact cause is (in my opinion it is unlikely that Python 3.7 or Ubuntu 16.04 affects this). The only possibility in my mind is that your AMI actually doesn't allow you to run ulimit -n 65000, the command failed, and you didn't notice because it was buried in the middle of the logs. But I am not sure. (You can see it in monitor.out.)
Hmm. Can you try Python 3.8 and Ubuntu 20.04? I could probably make my stuff work on older Ubuntu, but I have a bunch of dependencies on Python 3.8.
Also, which exact AMI are you using? I'd like to make sure that I'm on the same base as you.
You can find the AMI I used here: https://github.com/rkooo567/arlo-e2e-noop (look at the aws-*.yaml file).
Also, can you make sure your ulimit ran successfully without bash: ulimit: open files: cannot modify limit: Operation not permitted?
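If it helps, here is a small sketch for checking the effective open-file limit that Ray workers actually see across the cluster (it assumes a running cluster and Linux worker nodes):

```python
import resource
import socket
import ray

ray.init(address="auto")

@ray.remote
def open_file_limit():
    # Soft/hard RLIMIT_NOFILE as seen by whichever worker process runs this task.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return socket.gethostname(), soft, hard

# Fire enough tasks to land on many nodes, then print the distinct limits observed.
print(set(ray.get([open_file_limit.remote() for _ in range(500)])))
```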
Closing this for now, as we cannot reproduce (perhaps related to ulimit).