We use Presto 0.198 with 14 workers. Sometimes queries suddenly slow down and queue up heavily; even a very simple query takes a long time. What we can see is that code_cache_usage is low, and GC time and counts on all workers are low, so it seems the workers are doing very little work.
First, we restarted the coordinator, which did not help. Then we restarted all the workers, which fixed it.
We can also confirm that all the catalogs are healthy, because we have two Presto clusters using the same catalogs, and the other cluster works without any problems.
I am running Presto 0.212 on AWS EMR 5.19 and I am facing the same issue.
Can someone help me figure out how to debug this situation?
Presto UI:

Some of the running queries

Ganglia cluster overview

You can notice in the Ganglia overview that the cluster performs fine when it starts off, but then something goes wrong. CPU usage is very low, even though 11 queries are running in my case (the oldest one has been running for approx. 38 minutes).
I have had much more complex queries with larger input data that ran fine.
I recently upgraded from Presto 0.203, and I remember this issue existed there too, but it only appeared after the cluster had been running for several days.
In my experience, this is almost always caused by slow storage, and specifically S3 throttling. You can verify this by looking at a jstack for your workers to see where the worker threads are "stuck" (worker threads are named after the query). If they are stuck in S3 code, you are getting throttled.
When S3 throttles your requests, the individual HTTP requests should fail with a 503 error code instead of threads getting stuck (Presto will then retry those requests and log the failures in the server logs). More info can be found in the S3 docs: https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html and https://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html#UsingErrorsSlowDown
If threads are getting stuck in S3 code, it can be due to many different issues (we need to see the stack trace to understand the cause): S3 being slow, the network being slow, no free connections in the HTTP client connection pools, etc.
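A minimal sketch of that jstack check, as a shell helper. The class-name patterns (the AWS SDK's `com.amazonaws.services.s3` and the Hive connector's `PrestoS3FileSystem`) are assumptions; adjust them for your deployment:

```shell
#!/usr/bin/env bash
# count_s3_frames: count stack frames in a jstack dump that sit in
# S3 client code. Many stuck worker threads with S3 frames suggests
# throttling or slow storage, as described above.
count_s3_frames() {
  grep -c -e 'com.amazonaws.services.s3' -e 'PrestoS3FileSystem' "$1"
}

# Typical use (PID discovery via jps is an assumption):
#   jstack "$(jps -l | awk '/presto/ {print $1; exit}')" > /tmp/stacks.txt
#   count_s3_frames /tmp/stacks.txt
```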
Seeing these same issues still with 0.206; has a workaround been found?
@dain The server logs show no S3 failures, so this is not an S3 throttling issue.
Ganglia logs look identical in behavior to @mnoumanshahzad's, with an initially normal CPU usage level and then a drop in CPU usage as memory plateaus at around 50% of cluster capacity.
Also: is there a single command that can restart all Presto workers from the coordinator?
Edit / temporary workaround: restarting presto-server on the worker nodes clears up memory and resolves the slowdown. We have not looked into what is causing this memory build-up yet; the temporary workaround is a cron job that stops and starts presto-server across the worker nodes every couple of hours.
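That cron-driven workaround could be sketched roughly like this. The worker host list file, passwordless ssh access, and the `presto-server` systemd unit name are all assumptions about the environment:

```shell
#!/usr/bin/env bash
# restart_workers: restart presto-server on every host listed (one
# hostname per line) in the given file, reporting any failures.
restart_workers() {
  local workers_file="$1"
  while read -r host; do
    ssh "$host" 'sudo systemctl restart presto-server' \
      || echo "restart failed on $host" >&2
  done < "$workers_file"
}

# Example crontab entry to run it every two hours:
#   0 */2 * * * /usr/local/bin/restart-presto-workers.sh /etc/presto/workers.txt
```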
We are observing the same issues. Over time Presto performance degrades. We don't use S3; we run on EMR with HDFS, version 0.229. I went over Ganglia. We did a restart Thursday evening and some things definitely changed.



The most interesting one is the process count, which seems to have gone down with the restart. It also seems the cluster was able to read from HDFS much faster after the restart.
No idea how to troubleshoot this further to find the root cause. Is there anything we can do or provide to help you figure out what might be happening? The only option I have at this point is doing regular restarts.
@dain could you take a look at these screenshots? Could you provide any tips or clues on what I could look into next time this happens?
@zsaltys for your information, @dain now works on the https://github.com/prestosql/presto/ repo; you can find him there.
You can also reach him on Presto Community slack.
Random thought: JStack might be something interesting to look at just to see what the threads are busy with.
On the 'Performance boost - timeline' slide of https://www.starburstdata.com/wp-content/uploads/2019/06/Lyft-Dynamic-Presto-Scaling.pdf, Lyft mentions daily recycling of nodes.
@zsaltys can you do a heap dump on a fresh cluster and again a week later?
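For reference, a live heap dump for that fresh-vs-week-later comparison could be taken with jmap. This is a sketch that assumes a standard JDK on the worker; PID discovery via jps is also an assumption:

```shell
#!/usr/bin/env bash
# dump_heap: write a binary heap dump of live objects for the given
# JVM pid, suitable for later diffing in a tool such as Eclipse MAT.
dump_heap() {
  local pid="$1" out="$2"
  jmap -dump:live,format=b,file="$out" "$pid"
}

# Example (date-stamped so a fresh dump and a week-old one can be compared):
#   dump_heap "$(jps -l | awk '/presto/ {print $1; exit}')" "/tmp/heap-$(date +%F).hprof"
```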
Did you solve this, @yangjun616 @mnoumanshahzad @zsaltys @dpat? I am also seeing a particular older worker process splits much slower than the other workers on v336.