Incubator-superset: Slow query in SQL Lab

Created on 17 Mar 2017 · 9 comments · Source: apache/incubator-superset

From Sophie B.: The following query takes 3 minutes to report a spelling mistake in SQL Lab and only takes 2 seconds in Airpal. Airpal appears to be 40 seconds faster on this query when both are launched at the same time:

Label: inactive

Most helpful comment

@mistercrunch I feel there are two kinds of slowness. One is caused by the fetch-all behaviour when there are too many rows, as described above. The other is about the query itself: for example, when the query contains many joins, even though only a few rows are returned, it is significantly slower than Oracle SQL Developer.

All 9 comments

@bkyryliuk soooo sorry... I am not sure why it shows I unassigned you...I really didn't do it on purpose...

I have the same issue here. If you can share your thoughts on where this could be solved (files, causes), I am willing to contribute.
I have a huge MySQL query; the same query in phpMyAdmin takes around 10 seconds to finish, while in SQL Lab it takes 36 seconds and then times out. In addition, the result is around 300k rows, and Superset crashes on that many rows (while rendering the table, I believe). Maybe we can add pagination, or just show an option to download the table as a CSV file.

Next version (already in master) will have a "virtualized table" which should help. This should fix the rendering in a similar way that client-side pagination would. But bringing 300k rows to the browser probably isn't a great idea; it will take time to move that much data over the network. Your system-wide row limit should be lower than that; I'd advise 50k at most. I think that limit doesn't apply to CSV downloads. My feeling is that no one needs to view more than a few thousand rows: it's "not fit for human consumption".
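The system-wide limit mentioned above lives in the Superset configuration. A minimal `superset_config.py` sketch is below; note that the exact option names have varied across Superset versions (`SQL_MAX_ROW` in particular is an assumption here), so check the `config.py` shipped with your version before copying this.

```python
# superset_config.py -- configuration sketch only; option names vary
# across Superset versions, so verify them against your installed config.py.

# Default row limit for queries; 50k is the upper bound suggested above.
ROW_LIMIT = 50000

# Hard ceiling applied to SQL Lab result sets (name is an assumption).
SQL_MAX_ROW = 50000
```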

If you're thinking of pagination as in a server-side cursor, I'm unclear whether it's supported evenly across the SQLAlchemy engines. Also the client side search feature wouldn't be compatible with that.

@mistercrunch even 100 rows is enough for the table view in SQL Lab! The problem is that the CSV does comply with the LIMIT, so if I put a LIMIT of 100, I only get a CSV with 100 rows.

Unlocking it to download the whole thing without a limit is a very appealing idea, but again, what about the people who just want to download their CSV with the limit included (like the current behaviour)? Maybe we can add a button to download either the FULL query result or the limited table result.

If this is possible, I can start digging into it.

@mistercrunch I feel there are two kinds of slowness. One is caused by the fetch-all behaviour when there are too many rows, as described above. The other is about the query itself: for example, when the query contains many joins, even though only a few rows are returned, it is significantly slower than Oracle SQL Developer.

It doesn't seem like the client would change the speed of the query that is executing on the server. Do you have data on the difference? Can you run the query a few times on both sides and see what the average is?

Confirmed here, this is dead slow and needs fixing; it even triggers a "this site does not respond" warning on IE.
Turning on profiling, I see the following top offenders (displaying 3K rows here):

Function | Count | Inclusive time | Exclusive time
---------|-------|----------------|---------------
appendChild | 33 428 | 10 121,00 | 10 112,00
removeChild | 9 973 | 7 047,00 | 6 749,00
innerHTML | 57 173 | 5 857,00 | 5 856,00
getClientRects | 247 | 3 467,00 | 3 467,00
n | 6 342 | 1 650,00 | 1 071,00

There is definitely a need for server side pagination when supported by the underlying engine.
Another way would be to postpone more things, for example only run those append/remove when the user clicks on the "next" button.
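The server-side pagination idea above can be sketched in a few lines. This is an illustration only, using stdlib `sqlite3` as a stand-in driver and a hypothetical `fetch_page` helper; LIMIT/OFFSET is the lowest-common-denominator approach that most SQLAlchemy dialects can express, though keyset pagination scales better on large offsets.

```python
# Sketch of server-side pagination: ship one page per request instead of
# the whole result set. sqlite3 stands in for the real database driver.
import sqlite3

def fetch_page(conn, page, page_size=100):
    """Return one page of rows; only page_size rows cross the wire."""
    return conn.execute(
        "SELECT x FROM t ORDER BY x LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(250)])

first = fetch_page(conn, 0)  # rows 0..99
last = fetch_page(conn, 2)   # rows 200..249, a short final page
```

With this shape, the expensive appendChild/removeChild work in the table would only happen for the current page when the user clicks "next".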

Looks like this is done in superset/assets/visualizations/table.js.
Digging through datatables.net-bs's API, I'll see if I can help

@THDA it sounds like you're looking at the Table component in the "explore view" (not in SQL Lab). For the "explore view" we should use react-virtualized (as in SQL Lab).

About SQL Lab, there are a few reasons why a CLI will beat Superset. In non-async mode, the CLI may run a fetchmany as opposed to fetchall, and can start showing rows before the query is done executing. In async mode, there is much more going on: the whole query gets executed, returned, and cached in the results backend, and only then, once the client's polling sees that the data is ready, does the server fetch the data from the results backend and return it to the client. Those delays get worse as the size of the dataset grows.
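The fetchmany/fetchall distinction described above can be shown with any DB-API driver; here stdlib `sqlite3` stands in for the real engine. `fetchall()` buffers the entire result before the first row is usable, while `fetchmany()` streams batches that a CLI can print as they arrive.

```python
# Illustration of fetchall vs fetchmany (sqlite3 as a stand-in DB-API driver).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

# fetchall: one big list; nothing is visible until the whole set is in memory.
all_rows = conn.execute("SELECT x FROM t").fetchall()

# fetchmany: stream in batches; a client can render each batch as it lands.
cur = conn.execute("SELECT x FROM t")
batches = []
while True:
    batch = cur.fetchmany(100)
    if not batch:
        break
    batches.append(batch)
```

On a remote engine the difference is latency to first row, not total time: both paths eventually transfer every row, which is why the row limit discussion above still matters.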

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.
