"SELECT 1" - the simplest of presto queries takes over 3 seconds (3.18-3.4 seconds for me) when run from superset but only 0.2 seconds when run using presto JDBC.
Two culprits:
1. The 1-second sleep at this line is too long: https://github.com/apache/incubator-superset/blob/0.36.0/superset/db_engine_specs/presto.py#L760
2. See https://github.com/dropbox/PyHive/issues/337 for another culprit line in PyHive.
I changed both sleeps to 0.1 seconds and now I get 0.47 seconds from Superset!
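For context, both spots boil down to a loop of roughly this shape (a simplified sketch, not the exact Superset/PyHive code; `POLL_INTERVAL` stands in for the hardcoded 1-second sleep):

```python
import time

POLL_INTERVAL = 0.1  # was effectively 1 second in both places

def wait_for_query(cursor):
    """Block until the Presto query behind `cursor` finishes."""
    polled = cursor.poll()
    while polled:  # PyHive's Presto cursor returns None once the query is done
        time.sleep(POLL_INTERVAL)
        polled = cursor.poll()
```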
I can also get `select count(1) from sometable` in 0.86 seconds from Superset after the 2 changes (it was 3.5 seconds before).
@john-bodley @villebro @mistercrunch, is this a viable solution? What are the performance/load implications of polling a cursor 10 times more often? Is there a way where we don't need to poll the cursor at all, but instead use some async magic to await a response?
Can probably adjust the polling interval according to the number of retries, i.e., a longer interval if we've already waited longer:

```python
time.sleep(min(0.2 + 0.5 * num_retries, 5))
```
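For illustration, the sleeps that formula produces grow linearly from 0.2 s and cap at 5 s after ~10 retries, so a fast query pays at most a 0.2-second penalty while a long-running one generates a poll only every 5 seconds:

```python
>>> [round(min(0.2 + 0.5 * n, 5), 1) for n in range(12)]
[0.2, 0.7, 1.2, 1.7, 2.2, 2.7, 3.2, 3.7, 4.2, 4.7, 5, 5]
```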
Good idea to use some form of exponential or other backoff. However, perhaps leverage the `backoff` package, as it's already in the Python deps. If Presto can handle shorter polling intervals, why not? But that discussion needs to be resolved on the PyHive side. @bkyryliuk, thoughts?
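For illustration, a rough sketch with the `backoff` package (assuming, as above, that PyHive's Presto cursor returns None from `poll()` once the query has finished):

```python
import backoff

# Keep polling while poll() returns a status (query still running),
# waiting Fibonacci intervals (1, 1, 2, 3, 5... seconds, jittered by
# default) capped at 5 seconds between attempts.
@backoff.on_predicate(backoff.fibo, lambda polled: polled is not None, max_value=5)
def wait_for_query(cursor):
    return cursor.poll()
```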
Having said all this, while unnecessary waiting should be avoided, I'm personally of the inclination that OLAP-type queries are OK to take a few seconds. As I expect people will tend to be running fairly expensive queries through SQL Lab, waiting a few seconds for something that completed in 0.1 seconds, or conversely waiting 32 seconds for something that took 30.1 seconds, usually isn't that big a deal to me, especially if it can spare a busy db some unnecessary load.
Echoing @villebro and @ktmud, some sort of backoff sounds good, so that a 0.1 second query comes back in 0.2 or 0.4 seconds, but the 30 second query comes back in 32.
I will note that performance on queries in the 1 to 5 second range does appear especially bad on Superset's side, especially because of all the steps required to run a query:

- Submit the query to the async worker and poll the cursor until it completes
- Write the result to the results backend (s3)
- Mark the query as successfully completed
- Poll the queries endpoint from the client every 2 seconds until it reports success
- Download the results from the results backend

This means that a query that takes a second or two on the CLI can take 2-3 times as long in SQL Lab. Any improvements we can make here without overloading services seem worth considering.
In the long term, moving to the approach outlined in https://github.com/apache/incubator-superset/issues/9190 will solve a lot of these issues too, but I think updating the cursor wait time here is a valid place to start.
I reckon there's no harm in polling every 0.1 seconds; the Presto JDBC driver does that anyway (the website says "Presto is an open source distributed SQL query engine for running interactive analytic queries"). There's also the 3-second rule for keeping attention span (https://paceadv.com/the-3-second-rule/): if your query takes 3.5 seconds you get a bit of disengagement, but if it's a snappy < 1.5 seconds users stay hooked. If others are concerned about increasing load on Presto, then perhaps allow these intervals to be set in the data source config?
> Write the result to the results backend (s3)
Is it required for Superset to write the query results cache to s3? Why not Redis? And why does it need to download the cache from s3 again?
@ktmud it is configurable in the Superset config; s3 is more scalable if you have many users and large results.
@villebro it seems the poll interval is configurable: https://github.com/dropbox/PyHive/blob/8eb0aeab8ca300f3024655419b93dad926c1a351/pyhive/presto.py#L93. I don't have capacity to make the change, but I can review the code if needed to expose it as a SQLAlchemy connection string parameter.
SQLAlchemy supports setting this (per here), so no changes to PyHive are necessary.
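For example (a sketch; the DSN below is a placeholder for your Presto connection string):

```python
from sqlalchemy import create_engine

# poll_interval is forwarded through connect_args to PyHive's Presto cursor.
engine = create_engine(
    "presto://localhost:8080/hive/default",
    connect_args={"poll_interval": 0.1},  # seconds between status polls
)
```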
@ktmud Perhaps I was a bit unclear in my description. When running the query on the async worker, once it completes the result gets written to a results backend; we use s3 for the reasons @bkyryliuk gave. We then mark the query as successfully completed and add the cache key to the query record.
Then we need to get the results to the client. This is done by polling the queries endpoint every 2 seconds; once the query is labeled a success in that response, we call Superset again with the cache key on the query object to actually download the results.
Thanks for the detailed explanation, @etr2460 and @bkyryliuk.
I started working on this today (https://github.com/apache/incubator-superset/commit/bf0025f687fef0f75f660d384a5559ad5d3ed7c9), but I don't think it makes a lot of sense to add a backoff within Superset if PyHive only supports constant polling intervals. Therefore, I see a few options, ranked in my order of preference:
1. Expose PyHive's `poll_interval` through the `engine_params` object; then both params will be aligned and fully configurable by the admin.
2. Add backoff support to PyHive's `poll_interval`, then update Superset to default to some variety of fibo backoff for both.

@john-bodley @bkyryliuk @ktmud @villebro @tooptoop4, thoughts?
I like approach #1.
Honestly speaking, getting the query results within 2 seconds vs 0.1 seconds doesn't seem like a big user win; that's why I would prefer the simplest approach here.
@tooptoop4 if you use the master branch, you should be able to set the poll_interval param in the database extra => engine_params => connect_args, and that'll control both PyHive's and Superset's poll interval at the same time. Hope this works for you!
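That is, something like the following in the database's Extra field (the 0.1 here is just an example value):

```json
{
    "engine_params": {
        "connect_args": {
            "poll_interval": 0.1
        }
    }
}
```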