This is a TODO list for listing queries that we expect to run using DistSQL (in auto mode) by 1.0. Feel free to add as needed, cc @andreimatei @RaduBerinde @cuongdo @asubiotto.
- [ ] SELECT COUNT(*) should always run through DistSQL for speed reasons. It is a common operation just after loading a bunch of data, and it is currently exceedingly slow on large tables (without DistSQL). @arjunravinarayan
- [ ] SELECT COUNT(DISTINCT column_name) @arjunravinarayan
- [ ] WHERE clauses. @arjunravinarayan
- [ ] LIMIT queries, particularly limits after JOINs that would cause large amounts of state to stream across machines. @arjunravinarayan

If there are any queries that we want turned on by 1.0, please add them to the list above (concrete examples of these query classes are sketched just below).
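For concreteness, here are hedged examples of the query classes in the checklist, written against the TPC-H tables (lineitem, supplier) that the experiments in this thread use; they mirror statements run later in the thread rather than prescribing an exact benchmark set.

```sql
SELECT COUNT(*) FROM lineitem;                          -- plain COUNT(*)
SELECT COUNT(DISTINCT l_suppkey) FROM lineitem;         -- COUNT(DISTINCT column)
SELECT * FROM lineitem WHERE l_extendedprice < 1000;    -- sparse WHERE clause
SELECT * FROM lineitem                                  -- LIMIT after a JOIN
  JOIN supplier ON lineitem.l_suppkey = supplier.s_suppkey
  LIMIT 10;
```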
Do not check off an item as done without adding a comment/issue tracking the queries actually attempted on a cluster. Try to report running times and EXPLAIN output to show the DistSQL plan.
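As a minimal sketch of how the comparisons below are gathered (this mirrors the transcripts later in the thread, which wrap the statements in time ./cockroach sql -e ...): the same statement is timed with the session setting toggled off and on.

```sql
SET distsql = off;   -- force local (gateway-only) execution
SELECT COUNT(*) FROM lineitem;

SET distsql = on;    -- allow distributed execution
SELECT COUNT(*) FROM lineitem;
```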
Ran
SELECT MIN(l_extendedprice) FROM lineitem
on a 3-node local cluster, using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:
| Without DistSQL | With DistSQL |
| --- | --- |
| 22.028s | 9.520s |
Ran
SELECT MAX(l_extendedprice) FROM lineitem
on a 3-node local cluster, using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:
| Without DistSQL | With DistSQL |
| --- | --- |
| 22.821s | 9.481s |
Hm, that's much slower than I'd have expected. Isn't DistSQL able to reduce network traffic on these queries to O(1)?
Yes, the network traffic is O(1) in these plans. Note that these are local clusters (all nodes on the same machine) so network traffic is not really network traffic.
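For intuition on the O(1) traffic claim: MIN is decomposable, so each node can aggregate its own ranges locally and ship a single partial value to the gateway. A hedged SQL sketch of that equivalence follows; the GROUP BY l_suppkey % 3 split is purely an illustrative stand-in for the per-node partitioning, not the actual plan.

```sql
SELECT MIN(partial_min)
FROM (
  SELECT MIN(l_extendedprice) AS partial_min
  FROM lineitem
  GROUP BY l_suppkey % 3    -- stand-in for "one group per node"
) AS per_node;
```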
Ah, understood.
@danhhz mentioned a SELECT COUNT(*) experiment on a real 3 node cluster (I think it was ~200M rows, SELECT COUNT(*) took 7m33s compared to over an hour).
Makes much more sense, thanks.
Indeed, it was the production lapis cluster
Ran
SELECT COUNT(*) FROM lineitem
on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:
| Without DistSQL | With DistSQL |
| --- | --- |
| 36.707s | 12.350s |
Ran
SELECT AVG(l_extendedprice) FROM lineitem
with f97a5c3 on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:
| Without DistSQL | With DistSQL |
| --- | --- |
| 46.255s | 18.560s |
Can you report the cockroach sha you used for future reference?
@asubiotto The ranges for the lineitem table are only spread across 4 nodes? Or is there another reason the query plan only has 4 TableReaders?
Is there a quick way to find this information out? Last time I asked, there wasn't a clean way to figure out, given a table, what ranges the table was spread across.
Admin UI says there are 74 ranges for table lineitem (but I can't figure out how to find range IDs or node information for those ranges), so it being spread across just 4 nodes seems strange (but possible).
SHOW TESTING_RANGES FROM TABLE <table-name>
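A hedged usage example (the output columns are my recollection of this era of CockroachDB and may differ):

```sql
-- Lists each range of the table with its start/end keys, the nodes holding
-- replicas, and the current lease holder.
SHOW TESTING_RANGES FROM TABLE lineitem;
```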
Thank you @petermattis. It appears there are replicas and lease holders on every node, so let me investigate.
Also, there are 24 ranges which, at a replication factor of 3, gives 72 replicas, so the Admin UI is clearly buggy when it says there are 74 ranges. I will file an issue.
I updated the execution plan. I think the gateway node probably had a cache that hadn't been updated. The query plan now shows only 5 TableReaders which is still weird.
Actually, node 1 doesn't seem to be a lease holder for any range in lineitem.
Sure, but it was showing 4 before, and that was certainly incorrect. cc @andreimatei.
> Sure, but it was showing 4 before, and that was certainly incorrect.
The range-descriptor and leaseholder caches can be empty or stale. This explains it, right?
The state of the caches is supposed to be seen in the ranges_cached internal table, but I think this hasn't been implemented yet.
All these queries were run using 4129fe0 on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.
I started by trying to run
SELECT * FROM lineitem ORDER BY l_extendedprice
on the 6 million row (~2.5GiB) lineitem table, but this failed (see #15332; note that this query no longer causes an OOM crash with 4129fe0).
I then moved down to orders, a 1.5million ~600MiB table. Ran
SELECT * FROM orders ORDER BY o_totalprice
The query runs correctly, the execution plan looks good (note that nodes 1 and 3 aren't leaseholders for any of orders' ranges), and the times for one run are below:
| Without DistSQL | With DistSQL |
| --- | --- |
| 3m51s | 3m34s |
I also ran:
SELECT * FROM lineitem ORDER BY l_extendedprice LIMIT 10
to avoid running out of memory on the gateway node (execution plan here). The times for one run are as follows:
| Without DistSQL | With DistSQL |
| --- | --- |
| 1m45s | 15.389s |
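The large gap makes sense: with the LIMIT, each node only has to produce its local top 10 rows and the gateway merges those small sets, instead of sorting and streaming the whole table. A hedged SQL sketch of that equivalence is below; the modulo split stands in for per-node data, and four of the six branches are elided.

```sql
SELECT *
FROM (
  (SELECT * FROM lineitem WHERE l_suppkey % 6 = 0
     ORDER BY l_extendedprice LIMIT 10)
  UNION ALL
  (SELECT * FROM lineitem WHERE l_suppkey % 6 = 1
     ORDER BY l_extendedprice LIMIT 10)
  -- ... one branch per node's share of the data ...
) AS per_node
ORDER BY l_extendedprice
LIMIT 10;
```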
Queries run with 630757cbc0 on a 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; select count(DISTINCT l_suppkey) FROM tpch.lineitem LIMIT 1;"
+---------------------------+
| count(DISTINCT l_suppkey) |
+---------------------------+
| 10000 |
+---------------------------+
(1 row)
real 0m29.891s
user 0m0.028s
sys 0m0.012s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; select count(DISTINCT l_suppkey) FROM tpch.lineitem LIMIT 1;"
+---------------------------+
| count(DISTINCT l_suppkey) |
+---------------------------+
| 10000 |
+---------------------------+
(1 row)
real 0m9.481s
user 0m0.044s
sys 0m0.008s
I forgot to include the execution plan, which looks good.
Queries run with 630757cbc0 on a 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.
This query was constructed to exercise a sparse WHERE clause and really nothing else.
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem WHERE l_extendedprice < 1000;" > foo_on
real 0m12.068s
user 0m0.072s
sys 0m0.012s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem WHERE l_extendedprice < 1000;" > foo_off
real 0m49.482s
user 0m0.076s
sys 0m0.008s
The execution plan looks good, and the results are correct:
cockroach@cockroach-navy-0006:~$ sort foo_on > foo_on_sorted
cockroach@cockroach-navy-0006:~$ sort foo_off > foo_off_sorted
cockroach@cockroach-navy-0006:~$ diff foo_off_sorted foo_on_sorted
cockroach@cockroach-navy-0006:~$ wc -l foo_off
3082 foo_off
cockroach@cockroach-navy-0006:~$ wc -l foo_on
3082 foo_on
I did not add an ORDER BY or a COUNT, so as to keep the number of moving parts down, but when they are added the runtimes are similar.
Queries run with 630757c on 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.
The query is SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey. This query was constructed to have a JOIN that could have a lot of rows in it, followed by a limit. It is extremely artificial and otherwise useless in any real analytics scenario.
The DistSQL execution plan is intimidating, but ultimately correct.
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey limit 10" > foo_on
real 0m0.124s
user 0m0.028s
sys 0m0.028s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey limit 10" > foo_off
real 0m0.186s
user 0m0.036s
sys 0m0.016s
When run without the limit, both versions fail with a memory budget error:
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey" > foo_off
Error: pq: root: memory budget exceeded: 10240 bytes requested, 3676112896 bytes in budget
Failed running "sql"
real 0m16.898s
user 0m0.036s
sys 0m0.012s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey" > foo_off
Error: pq: root: memory budget exceeded: 136007680 bytes requested, 3676112896 bytes in budget
Failed running "sql"
real 0m35.439s
user 0m0.028s
sys 0m0.028s
While that is unsatisfactory and needs work, this query demonstrates that in both time and memory usage, LIMIT queries in DistSQL (and regular SQL) do the right thing and match expected behavior.
Query run using 630757c on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.
SELECT l_shipmode, AVG(l_extendedprice) FROM lineitem GROUP BY l_shipmode;
The query runs correctly, the execution plan looks good, and the times for one run are below:
| Without DistSQL | With DistSQL |
| --- | --- |
| 44.103s | 7.799s |
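The near-linear speedup is consistent with AVG being computable from per-node SUM and COUNT partials, so only one small row per (node, l_shipmode) pair has to reach the final aggregation stage. A hedged sketch of that algebra, with a modulo split standing in for the per-node partitioning rather than the actual plan:

```sql
SELECT l_shipmode,
       SUM(partial_sum) / SUM(partial_count) AS avg_extendedprice
FROM (
  SELECT l_shipmode,
         SUM(l_extendedprice)   AS partial_sum,
         COUNT(l_extendedprice) AS partial_count
  FROM lineitem
  GROUP BY l_shipmode, l_suppkey % 6   -- illustrative "per node" split
) AS per_node
GROUP BY l_shipmode;
```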
@asubiotto Almost linear speedup. Nice!
Queries run using 630757c on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.
I ran a variety of join queries, but am not documenting all of them, since they all tell the same story: we always plan HashJoins with full bisection flows on all nodes that have a TableReader for that query. Sadly, this means we are very susceptible to running out of memory, which we still do on large datasets (and sometimes kill nodes as well, since the memory accounting guardrails are not in 630757c).
Here is one sample execution plan: as you can see, the planner is planning HashJoins and doing full bisection flows between all the nodes.
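For anyone reproducing this, one way to inspect the distributed plan from the SQL shell is sketched below; whether the EXPLAIN (DISTSQL) syntax is available in this exact build is an assumption on my part.

```sql
-- Returns a URL that renders the DistSQL flow diagram (TableReaders, hash
-- routers, HashJoiner stages) for the statement.
EXPLAIN (DISTSQL)
  SELECT count(*)
  FROM tpch.lineitem, tpch.supplier
  WHERE lineitem.l_suppkey = supplier.s_suppkey;
```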
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT count(*) FROM tpch.lineitem, tpch.supplier where lineitem.l_suppkey = supplier.s_suppkey;"
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
(1 row)
real 0m52.902s
user 0m0.032s
sys 0m0.016s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT count(*) FROM tpch.lineitem, tpch.supplier where lineitem.l_suppkey = supplier.s_suppkey;"
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
(1 row)
real 0m13.531s
user 0m0.028s
sys 0m0.020s
There is a speedup: the HashJoin, while not the best possible plan for this query, is still a hefty improvement over local execution.
Closing this issue, as we have now empirically evaluated the breadth and limits of our DistSQL processors and planning. All the credit to @asubiotto, who shepherded this through all those OOMs!
🎉
Spun up an azworker with the same specs as navy and ran all of these queries against Postgres (TPC-H scalefactor 1). These numbers are from one run only. Note that the single-node and distributed SQL numbers are copy-pasted for convenience from the 6-node cluster runs above (only the first query was run on a 3-node local cluster).
| Query | Postgres | CockroachDB (single-node SQL) | CockroachDB (distributed SQL) |
| --- | --- | --- | --- |
| SELECT MIN(l_extendedprice) FROM lineitem | 1.582s | 22.028s | 9.520s |
| SELECT COUNT(*) FROM lineitem | 0.925s | 36.707s | 12.350s |
| SELECT AVG(l_extendedprice) FROM lineitem | 1.839s | 46.255s | 18.560s |
| SELECT * FROM orders ORDER BY o_totalprice | 6.063s | 3m51s | 3m34s |
| SELECT * FROM lineitem ORDER BY l_extendedprice LIMIT 10 | 1.476s | 1m45s | 15.389s |
| SELECT COUNT(DISTINCT l_suppkey) FROM lineitem LIMIT 1 | 5.234s | 29.891s | 9.481s |
| SELECT * FROM lineitem WHERE l_extendedprice < 1000 | 1.343s | 49.482s | 12.068s |
| SELECT * FROM lineitem JOIN supplier ON lineitem.l_suppkey = supplier.s_suppkey LIMIT 10 | 1.247ms | 1860ms | 1240ms |
| SELECT l_shipmode, AVG(l_extendedprice) FROM lineitem GROUP BY l_shipmode | 3.548s | 44.103s | 7.799s |
| SELECT COUNT(*) FROM lineitem, supplier where lineitem.l_suppkey = supplier.s_suppkey | 2.409s | 52.902s | 13.531s |
cc @petermattis @arjunravinarayan
Thanks, @asubiotto. This will definitely motivate work in 1.1.