"SELECT 1" - the simplest of presto queries takes over 3 seconds (3.18-3.4 seconds for me) when run from superset but only 0.2 seconds when run using presto JDBC.
Two culprits:
1. The 1-second sleep at this line is too long: https://github.com/apache/incubator-superset/blob/0.36.0/superset/db_engine_specs/presto.py#L760
2. See https://github.com/dropbox/PyHive/issues/337 for another culprit line in PyHive.
I changed both sleeps to 0.1 seconds and now I get 0.47 seconds from Superset!
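For context, both spots boil down to a loop of roughly this shape (a simplified sketch, not the exact Superset/PyHive code; `POLL_INTERVAL` stands in for the hardcoded 1-second sleep):

```python
import time

POLL_INTERVAL = 0.1  # was effectively 1 second in both places

def wait_for_query(cursor):
    """Block until the Presto query behind `cursor` finishes."""
    polled = cursor.poll()
    while polled:  # PyHive's Presto cursor returns None once the query is done
        time.sleep(POLL_INTERVAL)
        polled = cursor.poll()
```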
I can also get `select count(1) from sometable` in 0.86 seconds from Superset after the 2 changes (it was 3.5 seconds before).
@john-bodley @villebro @mistercrunch, is this a viable solution? What are the performance/load implications of polling a cursor 10 times more often? Is there a way where we don't need to poll the cursor at all, but instead use some async magic to await a response?
Can probably adjust the polling interval according to the number of retries, i.e., a longer interval if we've already waited longer:

```python
time.sleep(min(0.2 + 0.5 * num_retries, 5))
```
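For illustration, the sleeps that formula produces grow linearly from 0.2 s and cap at 5 s after ~10 retries, so a fast query pays at most a 0.2-second penalty while a long-running one generates a poll only every 5 seconds:

```python
>>> [round(min(0.2 + 0.5 * n, 5), 1) for n in range(12)]
[0.2, 0.7, 1.2, 1.7, 2.2, 2.7, 3.2, 3.7, 4.2, 4.7, 5, 5]
```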
Good idea to use some form of exponential or other backoff. However, perhaps leverage the `backoff` package, as it's already in the Python deps. If Presto can handle shorter polling intervals, why not? But that discussion needs to be resolved on the PyHive side. @bkyryliuk, thoughts?
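For illustration, a rough sketch with the `backoff` package (assuming, as above, that PyHive's Presto cursor returns None from `poll()` once the query has finished):

```python
import backoff

# Keep polling while poll() returns a status (query still running),
# waiting Fibonacci intervals (1, 1, 2, 3, 5... seconds, jittered by
# default) capped at 5 seconds between attempts.
@backoff.on_predicate(backoff.fibo, lambda polled: polled is not None, max_value=5)
def wait_for_query(cursor):
    return cursor.poll()
```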
Having said all this, while unnecessary waiting should be avoided, I'm personally of the inclination that OLAP-type queries are OK to take a few seconds. As I expect people will tend to be running fairly expensive queries through SQL Lab, waiting a few seconds for something that completed in 0.1 seconds, or conversely waiting 32 seconds for something that took 30.1 seconds, usually isn't that big a deal to me, especially if it can spare a busy db some unnecessary load.
Echoing @villebro and @ktmud, some sort of backoff sounds good, so that a 0.1 second query comes back in 0.2 or 0.4 seconds, but the 30 second query comes back in 32.
I will note that performance on queries in the 1 to 5 second range does appear especially bad on Superset's side, especially because of all the steps required to run a query:

- Submit the query to the async worker and poll the cursor until it completes
- Write the result to the results backend (s3)
- Mark the query as successfully completed
- Poll the queries endpoint from the client every 2 seconds until it reports success
- Download the results from the results backend

This means that a query that takes a second or two on the CLI can take 2-3 times as long in SQL Lab. Any improvements we can make here without overloading services seem worth considering.
In the long term, moving to the approach outlined in https://github.com/apache/incubator-superset/issues/9190 will solve a lot of these issues too, but I think updating the cursor wait time here is a valid place to start.
I reckon there's no harm in polling every 0.1 seconds; the Presto JDBC driver does that anyway (the website says "Presto is an open source distributed SQL query engine for running interactive analytic queries"). There's also the 3-second rule for keeping attention span (https://paceadv.com/the-3-second-rule/): if your query takes 3.5 seconds you get a bit of disengagement, but if it's a snappy < 1.5 seconds users stay hooked. If others are concerned about increasing load on Presto, then perhaps allow these intervals to be set in the data source config?
> Write the result to the results backend (s3)
Is it required for Superset to write the query results cache to s3? Why not Redis? And why does it need to download the cache from s3 again?
@ktmud it is configurable in the Superset config; s3 is more scalable if you have many users and large results.
@villebro it seems the poll interval is configurable: https://github.com/dropbox/PyHive/blob/8eb0aeab8ca300f3024655419b93dad926c1a351/pyhive/presto.py#L93. I don't have capacity to make the change, but I can review the code if needed to expose it as a SQLAlchemy connection string parameter.
SQLAlchemy supports setting this (per here), so no changes to PyHive are necessary.
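For example (a sketch; the DSN below is a placeholder for your Presto connection string):

```python
from sqlalchemy import create_engine

# poll_interval is forwarded through connect_args to PyHive's Presto cursor.
engine = create_engine(
    "presto://localhost:8080/hive/default",
    connect_args={"poll_interval": 0.1},  # seconds between status polls
)
```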
@ktmud Perhaps I was a bit unclear in my description. When running the query on the async worker, once it completes the result gets written to a results backend; we use s3 for the reasons @bkyryliuk gave. We then mark the query as successfully completed and add the cache key to the query record.
Then we need to get the results to the client. This is done by polling the queries endpoint every 2 seconds; once the query is labeled a success in that response, we call Superset again with the cache key on the query object to actually download the results.
Thanks for the detailed explanation, @etr2460 and @bkyryliuk.
I started working on this today (https://github.com/apache/incubator-superset/commit/bf0025f687fef0f75f660d384a5559ad5d3ed7c9), but I don't think it makes a lot of sense to add a backoff within Superset if PyHive only supports constant polling intervals. Therefore, I see a few options, ranked in my order of preference:
1. Expose PyHive's `poll_interval` through the `engine_params` object; then both params will be aligned and fully configurable by the admin.
2. Add backoff support to PyHive's `poll_interval`, then update Superset to default to some variety of fibo backoff for both.

@john-bodley @bkyryliuk @ktmud @villebro @tooptoop4, thoughts?
I like approach #1.
Honestly speaking, getting the query results within 2 seconds vs 0.1 seconds doesn't seem like a big user win; that's why I would prefer the simplest approach here.
@tooptoop4 if you use the master branch, you should be able to set the poll_interval param in the database extra => engine_params => connect_args, and that'll control both PyHive's and Superset's poll interval at the same time. Hope this works for you!
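That is, something like the following in the database's Extra field (the 0.1 here is just an example value):

```json
{
    "engine_params": {
        "connect_args": {
            "poll_interval": 0.1
        }
    }
}
```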