We use Presto 0.198 with 14 workers. Sometimes queries suddenly slow down and queue up heavily; even a very simple query takes a long time. What we can see is that code_cache_usage is low, and GC time and counts on all workers are low, so it seems the workers are doing very little work.
First, we restarted the coordinator, which did not help. Then we restarted all the workers, which fixed it.
We can also confirm that all the catalogs are healthy, because we have two Presto clusters using the same catalogs, and the other cluster works without any problems.
I am running Presto 0.212 on AWS EMR 5.19 and I am facing the same issue.
Can someone help me figure out how to debug this situation?
Presto UI:

Some of the running queries

Ganglia cluster overview

You can notice in the Ganglia overview that the cluster performs fine when it starts off, but then something goes wrong. CPU usage is very low, even though 11 queries are running in my case (the oldest one has been running for approx. 38 minutes).
I have had much more complex queries with larger input data that ran fine.
I recently upgraded from Presto 0.203, and I remember this issue existed there too, but it only appeared after the cluster had been running for several days.
In my experience, this is almost always caused by slow storage, and specifically S3 throttling. You can verify this by looking at a jstack for your workers to see where the worker threads are "stuck" (worker threads are named after the query). If they are stuck in S3 code, you are getting throttled.
When S3 throttles your requests, the individual HTTP requests should fail with a 503 error code instead of threads getting stuck (Presto will then retry those requests and log the failures in the server logs). More info can be found in the S3 docs: https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html and https://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html#UsingErrorsSlowDown
If threads are getting stuck in S3 code, it can be due to many different issues (we need to see the stack trace to understand the cause): S3 being slow, the network being slow, no free connections in the HTTP client connection pools, etc.
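A minimal sketch of that jstack check, as a shell helper. The class-name patterns (the AWS SDK's `com.amazonaws.services.s3` and the Hive connector's `PrestoS3FileSystem`) are assumptions; adjust them for your deployment:

```shell
#!/usr/bin/env bash
# count_s3_frames: count stack frames in a jstack dump that sit in
# S3 client code. Many stuck worker threads with S3 frames suggests
# throttling or slow storage, as described above.
count_s3_frames() {
  grep -c -e 'com.amazonaws.services.s3' -e 'PrestoS3FileSystem' "$1"
}

# Typical use (PID discovery via jps is an assumption):
#   jstack "$(jps -l | awk '/presto/ {print $1; exit}')" > /tmp/stacks.txt
#   count_s3_frames /tmp/stacks.txt
```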
Seeing these same issues still with 0.206; has a workaround been found?
@dain The server logs show no S3 failures, so this is not an S3 throttling issue.
Ganglia logs look identical in behavior to @mnoumanshahzad's, with an initially normal CPU usage level and then a drop in CPU usage as memory plateaus at around 50% of cluster capacity.
Also: is there a single command that can restart all Presto workers from the coordinator?
Edit / temporary workaround: restarting presto-server on the worker nodes clears up memory and resolves the slowdown. We have not looked into what is causing this memory build-up yet; the temporary workaround is a cron job that stops and starts presto-server across the worker nodes every couple of hours.
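That cron-driven workaround could be sketched roughly like this. The worker host list file, passwordless ssh access, and the `presto-server` systemd unit name are all assumptions about the environment:

```shell
#!/usr/bin/env bash
# restart_workers: restart presto-server on every host listed (one
# hostname per line) in the given file, reporting any failures.
restart_workers() {
  local workers_file="$1"
  while read -r host; do
    ssh "$host" 'sudo systemctl restart presto-server' \
      || echo "restart failed on $host" >&2
  done < "$workers_file"
}

# Example crontab entry to run it every two hours:
#   0 */2 * * * /usr/local/bin/restart-presto-workers.sh /etc/presto/workers.txt
```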
We are observing the same issues. Over time Presto performance degrades. We don't use S3; we run on EMR with HDFS, version 0.229. I went over Ganglia. We did a restart Thursday evening and some things definitely changed.



The most interesting one is the process count, which seems to have gone down with the restart. It also seems the cluster was able to read from HDFS much faster after the restart.
No idea how to troubleshoot this further to find the root cause. Is there anything we can do or provide to help you figure out what might be happening? The only option I have at this point is doing regular restarts.
@dain could you take a look at these screenshots? Could you provide any tips or clues on what I could look into next time this happens?
@zsaltys for your information, @dain now works on the https://github.com/prestosql/presto/ repo; you can find him there.
You can also reach him on Presto Community slack.
Random thought: JStack might be something interesting to look at just to see what the threads are busy with.
On the 'Performance boost - timeline' slide of https://www.starburstdata.com/wp-content/uploads/2019/06/Lyft-Dynamic-Presto-Scaling.pdf, Lyft mentions daily recycling of nodes.
@zsaltys can you do a heap dump on a fresh cluster and again a week later?
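For reference, a live heap dump for that fresh-vs-week-later comparison could be taken with jmap. This is a sketch that assumes a standard JDK on the worker; PID discovery via jps is also an assumption:

```shell
#!/usr/bin/env bash
# dump_heap: write a binary heap dump of live objects for the given
# JVM pid, suitable for later diffing in a tool such as Eclipse MAT.
dump_heap() {
  local pid="$1" out="$2"
  jmap -dump:live,format=b,file="$out" "$pid"
}

# Example (date-stamped so a fresh dump and a week-old one can be compared):
#   dump_heap "$(jps -l | awk '/presto/ {print $1; exit}')" "/tmp/heap-$(date +%F).hprof"
```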
Did you solve this, @yangjun616 @mnoumanshahzad @zsaltys @dpat? I am also seeing a particular older worker process splits much slower than the other workers on v336.