Dan discovered that his application crashes with some unknown issue (usually one of the nodes crashes, which brings the whole script down). To isolate whether it is an application issue or a Ray issue, he created a mock script that emulates his computation pattern (https://github.com/danwallach/arlo-e2e-noop) with the same cluster config.
I've spent some time debugging this, and although we haven't found the root cause yet, we discovered that the script crashes when the cluster reaches a certain (unknown) number of nodes. For example, when we run it with the autoscaler, it always crashes around the 3-minute mark, which is when the number of nodes exceeds that threshold. It did not crash on a 30-node cluster, but it did crash on a 60-node statically sized cluster as well.
cc @ericl @stephanie-wang
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
https://github.com/danwallach/arlo-e2e-noop Run the script from this repo. It doesn't require any dependencies to run.
Interesting. This reminds me of the broadcast crash where lack of rate limiting / duplicate requests causes a crash with high-fanout broadcasts. @rkooo567 are the data sizes here significant or are they trivial?
Outbound data from the head node to the worker nodes is configurable. There's probably 50-ish kB of fixed data going to every node. The input is a list of dictionaries (mapping 100 short string keys to 1 or 0). That list is first "batched" into 10,000 dictionaries at a time, and then "sharded" into 4 inputs for each remote call. Those remote calls then each return one dictionary, which maps those same 100 short strings to 100 4000-ish-bit numbers. Then there's a reduction phase that adds up those 4000-ish-bit numbers.
The crashiness doesn't seem to depend on whether we're in the "mapping" or "reducing" phase, but rather seems related to when the total number of nodes exceeds some threshold.
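For reference, a minimal sketch of that map/shard pattern. The names, sizes, and the no-op tally below are illustrative assumptions based on the description above, not code from the arlo-e2e-noop repo:

```python
import ray

ray.init(address="auto")

BATCH_SIZE = 10_000    # dictionaries per batch, per the description above
SHARDS_PER_BATCH = 4   # remote calls per batch

@ray.remote
def tally_shard(shard, fixed_data):
    # Each input dict maps ~100 short string keys to 0 or 1; the result maps the
    # same keys to (in the real app, ~4000-bit) integers. Here it is just a sum.
    totals = {}
    for row in shard:
        for key, bit in row.items():
            totals[key] = totals.get(key, 0) + bit
    return totals

def shard(batch, n):
    # Split one batch into n roughly equal pieces, one per remote call.
    step = (len(batch) + n - 1) // n
    return [batch[i:i + step] for i in range(0, len(batch), step)]

fixed_ref = ray.put(b"x" * 50_000)  # stands in for the ~50 kB of fixed per-task data
rows = [{f"k{i}": 0 for i in range(100)} for _ in range(20_000)]
futures = []
for start in range(0, len(rows), BATCH_SIZE):
    for piece in shard(rows[start:start + BATCH_SIZE], SHARDS_PER_BATCH):
        futures.append(tally_shard.remote(piece, fixed_ref))
```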
Original bug report: https://github.com/ray-project/ray/issues/11037
When you're playing around with the arlo-e2e-noop repo, a few things worth noting:
- There's a --speedup flag to affect how much time each worker sleeps, to simulate the computation that the real version is doing.
- I use tqdm and my own actor code to keep track of what's going on.
- In compute.py, there are several constants at the top of the file that relate to how the batching & sharding works. Those are also tweakable.

Also, FWIW, I'm doing this on Ubuntu 20.04, starting from the AMI on Amazon, then running ubuntu-setup.sh, which I then saved as my own AMI for faster startup. That's what you'll see referenced in aws-config.yaml. My AMI is probably not reachable by you, but you should at least be able to reproduce it.
I separately tried this with Ubuntu 18.04 and had the same issues.
Also, you'll see that the "reduction" computation is fairly complicated. It calls ray.wait to get any available results and then dispatches the tallying code. I've got a completely unrelated implementation, not part of this repo, that does a tree-structured reduction, and the failures are exactly the same, so I'm tempted not to blame the complexity of my reduction code.
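For anyone skimming, the shape of that reduction is roughly the following. This is a self-contained sketch with a dummy remote task, not the repo's actual code:

```python
import ray

ray.init(address="auto")

@ray.remote
def partial_tally(i):
    # Stand-in for one shard's work; returns a small dict of counts.
    return {"k": i % 2}

futures = [partial_tally.remote(i) for i in range(1000)]

# Fold in whichever results are ready first instead of waiting for all of them.
grand_total = {}
while futures:
    ready, futures = ray.wait(futures, num_returns=1)
    for key, value in ray.get(ready[0]).items():
        grand_total[key] = grand_total.get(key, 0) + value
```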
Sounds good. @rkooo567 probably the next step for us is to simplify this code into a 10-20 line script that can reproduce the core issue.
One last thing worth mentioning: on the real application, from which I derived this "noop" thing, I've currently capped Ray to use at most 12 nodes * 64 vCPUs (= 768 vCPUs total), and everything is running reliably. This has me roughly a factor of 3x away from my target performance goal, but it's close enough that I can live with it for a while. Of course, if/when you've got a patch ready, I'd love to try it out.
I was searching for a similar issue I was having as I tried to install Ray 1.0.0. I have a local cluster that can scale up to 18 nodes / 440 vCPUs, and it works flawlessly on 0.8.6. On 1.0.0, I can't seem to get more than 6-10 nodes up without a trial crashing, and I'm not even using all the cluster resources: the same trial will run with only 3-4 nodes, but when I give the cluster more nodes and run the same trial with the same resourcing, it crashes. And forget running with 18 nodes. I've disabled the dashboard, and it doesn't seem to matter; I end up with core dumps on the head and worker nodes. I haven't verified that I get the same exact behavior in an AWS environment yet (I'll try that today), but I'm on RHEL.
my issue seems to mimic https://github.com/ray-project/ray/issues/11239 as well
I'll take a look at https://github.com/ray-project/ray/issues/11239 in the next week, there's a good chance it's a different issue.
@danwallach What was the threshold number that broke the job? I tried debugging this today, and my suspicion now is that Redis rejects requests from new nodes/workers because it reaches the max number of clients (which is 10,000, and which can be reached if we assume each worker has multiple connections to Redis).
The crashiness seems to be a function of the number of vCPUs. If I keep them around 700, everything works. Somewhere not too far north of that, everything starts crashing. This seems to be independent of the size of each machine. So, more vCPUs per machine, same crashy threshold in terms of the number of vCPUs.
I see. It looks like the Redis max-connection issue starts happening around 1,000 workers. My guess is that when this error happens, your job starts failing, and the error was for some reason not shown. I will keep troubleshooting this.
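For anyone else hitting this, a quick way to check the Redis limit and current connection count from the head node using redis-py. The host, port, and password below are placeholders, not values from this issue; substitute whatever your cluster actually uses:

```python
import redis

# Placeholder connection details; point this at the head node's primary Redis.
r = redis.StrictRedis(host="127.0.0.1", port=6379, password="<redis-password>")

print(r.config_get("maxclients"))              # configured ceiling, e.g. {'maxclients': '10000'}
print(r.info("clients")["connected_clients"])  # clients currently connected
```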
@danwallach When I ran it, I could see these logs from the job console.
(pid=raylet, ip=172.31.37.245) connection.connect()
(pid=raylet, ip=172.31.37.245) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 561, in connect
(pid=raylet, ip=172.31.37.245) self.on_connect()
(pid=raylet, ip=172.31.37.245) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 637, in on_connect
(pid=raylet, ip=172.31.37.245) auth_response = self.read_response()
(pid=raylet, ip=172.31.37.245) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 734, in read_response
(pid=raylet, ip=172.31.37.245) response = self._parser.read_response()
(pid=raylet, ip=172.31.37.245) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 477, in read_response
(pid=raylet, ip=172.31.37.245) raise response
(pid=raylet, ip=172.31.37.245) redis.exceptions.ConnectionError: max number of clients reached
But as far as I remember, you didn't see these logs right?
Also, @danwallach, I saw it start at around 400 workers for 1000-size ballots. Is that normal behavior?
Hmm. I don't recall ever seeing an error message like that, and I'm currently running 700+ workers without any issues.
I just grepped through all the logs from every crash I have, and I've never seen a ConnectionError.
Okay. I successfully ran a 1200+ worker scenario (101 nodes) without any errors using https://github.com/rkooo567/arlo-e2e-noop.
There were some differences in the environment settings because I couldn't run your example without modifying it. Here are the differences:
ubuntu@ip-172-31-51-208:~$ ulimit -n 60500
bash: ulimit: open files: cannot modify limit: Operation not permitted
So, in conclusion, I couldn't reproduce the error. I am not sure what the exact cause is (in my opinion it is unlikely that Python 3.7 or Ubuntu 16.04 affects this). The only possibility in my mind is that your AMI actually doesn't allow you to run ulimit -n 65000, the command failed, and you didn't notice because it was buried in the middle of the logs. But I am not sure. (You can see it in monitor.out.)
Hmm. Can you try Python 3.8 and Ubuntu 20.04? I could probably make my stuff work on older Ubuntu, but I have a bunch of dependencies on Python 3.8.
Also, which exact AMI are you using? I'd like to make sure that I'm on the same base as you.
You can find the AMI I used here: https://github.com/rkooo567/arlo-e2e-noop (look at the aws-*.yaml file).
Also, can you make sure your ulimit ran successfully without bash: ulimit: open files: cannot modify limit: Operation not permitted?
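If it helps, here is a small sketch for checking the effective open-file limit that Ray workers actually see across the cluster (it assumes a running cluster and Linux worker nodes):

```python
import resource
import socket
import ray

ray.init(address="auto")

@ray.remote
def open_file_limit():
    # Soft/hard RLIMIT_NOFILE as seen by whichever worker process runs this task.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return socket.gethostname(), soft, hard

# Fire enough tasks to land on many nodes, then print the distinct limits observed.
print(set(ray.get([open_file_limit.remote() for _ in range(500)])))
```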
Closing this for now, as we cannot reproduce (perhaps related to ulimit).