Cockroach: distsql: tracking issue for queries we expect to run through DistSQL

Created on 21 Mar 2017  ·  31 comments  ·  Source: cockroachdb/cockroach

This is a TODO list of queries that we expect to run using DistSQL (in auto mode) by 1.0. Feel free to add as needed, cc @andreimatei @RaduBerinde @cuongdo @asubiotto.

  • [x] SELECT COUNT(*) should always run through DistSQL for speed reasons. It is a common operation just after loading a bunch of data. It is currently exceedingly slow on large tables (without DistSQL). @arjunravinarayan
  • [x] SELECT COUNT(DISTINCT column_name) @arjunravinarayan
  • [x] Distributed Aggregations over large datasets: might be implicitly tested by a TPC-H query. Identify which one and document. @asubiotto
  • [x] Large scans with a very sparse WHERE clause. @arjunravinarayan
  • [x] TPC-H queries that we support (see #14295) @asubiotto

    • Queries 9 and 15 don't run through DistSQL, see issue for more info.

  • [x] Basic join queries. For some helpful query ideas, the join order benchmark provides plenty of samples to try with DistSQL, even if we won't be optimally efficient on these for a while. @arjunravinarayan
  • [x] Distributed sorting over large datasets: might be implicitly tested by a TPC-H query. Identify which one and document. @asubiotto
  • [x] LIMIT queries, particularly limits after JOINs that would cause large amounts of state to stream across machines. @arjunravinarayan
  • [x] queries using MIN @asubiotto
  • [x] queries using MAX @asubiotto
  • [x] queries using AVG (see #14351) @asubiotto

If there are any queries that we want turned on by 1.0, please add them to the list above.

Do not check off an item as done without adding a comment/issue tracking the queries actually attempted on a cluster. Try to report running times and EXPLAIN(query) output to show the DistSQL plan.

All 31 comments

Ran

SELECT MIN(l_extendedprice) FROM lineitem

on a 3 node local cluster, using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 22.028s | 9.520s |

Ran

SELECT MAX(l_extendedprice) FROM lineitem

on a 3 node local cluster, using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 22.821s | 9.481s |

Hm, that's much slower than I'd have expected. Isn't DistSQL able to reduce network traffic on these queries to O(1)?

Yes, the network traffic is O(1) in these plans. Note that these are local clusters (all nodes on the same machine) so network traffic is not really network traffic.
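To make the O(1) claim concrete: MIN and MAX are combinable aggregates, so each node can reduce its local rows to a single partial result and ship only that one value to the gateway. A minimal sketch of the idea in Python (the `local_min`/`gateway_min` names and the shard data are illustrative, not CockroachDB code):

```python
# Sketch of distributed MIN: each "node" reduces its locally held rows
# to one partial value, so network traffic is one value per node,
# independent of table size.

def local_min(rows):
    """Per-node partial aggregation over locally held rows."""
    return min(rows)

def gateway_min(partials):
    """Gateway combines one partial per node into the final result."""
    return min(partials)

# Three hypothetical nodes, each holding a shard of l_extendedprice.
shards = [
    [901.0, 77.5, 1500.0],
    [12.25, 4096.0],
    [88.0, 13.0, 2.5],
]

partials = [local_min(shard) for shard in shards]  # one value per node
print(gateway_min(partials))  # 2.5
```

The same shape applies to MAX, SUM, and COUNT; only the combine function changes.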

Ah, understood.


@danhhz mentioned a SELECT COUNT(*) experiment on a real 3 node cluster (I think it was ~200M rows, SELECT COUNT(*) took 7m33s compared to over an hour).

Makes much more sense, thanks.


Indeed, it was the production lapis cluster

Ran

SELECT COUNT(*) FROM lineitem

on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 36.707s | 12.350s |

Ran

SELECT AVG(l_extendedprice) FROM lineitem

with f97a5c3 on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 46.255s | 18.560s |
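Unlike MIN, AVG is not directly combinable from per-node averages; the standard decomposition (which is what distributed plans generally use, per the #14351 work referenced above) is to ship (sum, count) partials and divide at the gateway. A sketch, with illustrative names:

```python
# AVG cannot be merged from per-node averages directly; instead each
# node ships a (sum, count) pair and the gateway divides at the end.

def local_partial(rows):
    """Per-node partial state for AVG: (sum of values, row count)."""
    return (sum(rows), len(rows))

def gateway_avg(partials):
    """Combine per-node (sum, count) pairs into the global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

shards = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
partials = [local_partial(s) for s in shards]
print(gateway_avg(partials))  # 35.0
```

Note that naively averaging the per-node averages (15.0, 30.0, 50.0) would give 31.67 here, which is wrong because the shards have different sizes.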

Can you report the cockroach sha you used for future reference?

@asubiotto The ranges for the lineitem table are only spread across 4 nodes? Or is there another reason the query plan only has 4 TableReaders?

Is there a quick way to find this information out? Last time I asked, there wasn't a clean way to figure out which ranges a given table was spread across.

Admin UI says there are 74 ranges for table lineitem (but I can't figure out how to find range IDs or node information for those ranges), so it being spread across just 4 nodes seems strange (but possible).

SHOW TESTING_RANGES FROM TABLE <table-name>

Thank you @petermattis. It appears there are replicas and lease holders on every node, so let me investigate.
Also, there are 24 ranges, for 72 replicas, so clearly the Admin UI is buggy when it says there are 74 ranges. I will file an issue.

I updated the execution plan. I think the gateway node probably had a cache that hadn't been updated. The query plan now shows only 5 TableReaders which is still weird.

Actually, node 1 doesn't seem to be a leaseholder for any range in lineitem.

Sure, but it was showing 4 before, and that was certainly incorrect. cc @andreimatei.

Sure, but it was showing 4 before, and that was certainly incorrect.

The range-descriptor and leaseholder caches can be empty or stale. This explains it, right?

The state of the caches is supposed to be seen in the ranges_cached internal table, but I think this hasn't been implemented yet.

All these queries were run using 4129fe0 on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.

I started by trying to run

SELECT * FROM lineitem ORDER BY l_extendedprice

on the 6 million row (~2.5GiB) lineitem table, but this failed (see #15332, though note that this query no longer causes an OOM crash with 4129fe0).

I then moved down to orders, a 1.5 million row (~600MiB) table. Ran

SELECT * FROM orders ORDER BY o_totalprice

The query runs correctly, the execution plan looks good (note that nodes 1 and 3 aren't leaseholders for any of orders' ranges), and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 3m51s | 3m34s |

I also ran:

SELECT * FROM lineitem ORDER BY l_extendedprice LIMIT 10

To avoid running out of memory on the gateway node (execution plan here). The times for one run are as follows:

| Without DistSQL | With DistSQL |
| --- | --- |
| 1m45s | 15.389s |
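The reason the LIMIT variant avoids the memory problem is that ORDER BY ... LIMIT K can be executed as a bounded top-K: each node keeps only its K best rows rather than materializing a full sort, and the gateway merges one K-sized candidate list per node. A sketch of that idea (the function names are illustrative, not CockroachDB internals):

```python
import heapq

# Sketch of distributed ORDER BY ... LIMIT K as top-K: each node keeps
# only its K smallest rows, so memory is O(K) per node instead of
# O(rows scanned).

K = 10

def local_topk(rows, k=K):
    """Per-node pass that retains only the k smallest rows."""
    return heapq.nsmallest(k, rows)

def gateway_topk(partials, k=K):
    """Merge the per-node candidate lists and take the global k smallest."""
    return heapq.nsmallest(k, (row for part in partials for row in part))

# Three hypothetical shards that together cover 0..99.
shards = [list(range(0, 100, 3)), list(range(1, 100, 3)), list(range(2, 100, 3))]
print(gateway_topk([local_topk(s) for s in shards]))  # [0, 1, ..., 9]
```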

Queries run with 630757cbc0 on 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; select count(DISTINCT l_suppkey) FROM tpch.lineitem LIMIT 1;"
+---------------------------+
| count(DISTINCT l_suppkey) |
+---------------------------+
|                     10000 |
+---------------------------+
(1 row)

real    0m29.891s
user    0m0.028s
sys 0m0.012s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; select count(DISTINCT l_suppkey) FROM tpch.lineitem LIMIT 1;"
+---------------------------+
| count(DISTINCT l_suppkey) |
+---------------------------+
|                     10000 |
+---------------------------+
(1 row)

real    0m9.481s
user    0m0.044s
sys 0m0.008s

I forgot to include the execution plan, which looks good.
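For context on why COUNT(DISTINCT) distributes well: each node can deduplicate its local values first, so only per-node distinct sets cross the network, and the gateway unions them and counts. A sketch with illustrative names:

```python
# Sketch of distributed COUNT(DISTINCT col): each node deduplicates its
# local values; the gateway unions the per-node sets and counts once.

def local_distinct(values):
    """Per-node pass: collapse local duplicates to a set."""
    return set(values)

def gateway_count_distinct(partials):
    """Union the per-node distinct sets and count the result."""
    result = set()
    for part in partials:
        result |= part
    return len(result)

# Three hypothetical shards of l_suppkey with overlapping values.
shards = [[1, 2, 2, 3], [3, 4], [4, 5, 1]]
partials = [local_distinct(s) for s in shards]
print(gateway_count_distinct(partials))  # 5
```

Local deduplication bounds what each node ships at the number of distinct values (10,000 suppliers here) rather than the number of rows (6 million).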

Queries run with 630757cbc0 on a 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.

This query was constructed to have a sparse WHERE clause, and really nothing else.
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem WHERE l_extendedprice < 1000;" > foo_on

real    0m12.068s
user    0m0.072s
sys 0m0.012s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem WHERE l_extendedprice < 1000;" > foo_off

real    0m49.482s
user    0m0.076s
sys 0m0.008s


The execution plan looks good

The results are correct:

cockroach@cockroach-navy-0006:~$ sort foo_on > foo_on_sorted
cockroach@cockroach-navy-0006:~$ sort foo_off > foo_off_sorted
cockroach@cockroach-navy-0006:~$ diff foo_off_sorted foo_on_sorted
cockroach@cockroach-navy-0006:~$ wc -l foo_off
3082 foo_off
cockroach@cockroach-navy-0006:~$ wc -l foo_on
3082 foo_on

I did not add an ORDER BY or a COUNT operation, to keep the number of moving parts down; when they are added, the runtimes are similar.
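The win on a sparse predicate comes from evaluating the filter next to the data: each TableReader applies the WHERE clause locally, so only the matching rows (3082 of 6 million here) ever cross the wire to the gateway. A sketch of that pushdown, with illustrative names and data:

```python
# Sketch of a pushed-down filter: the predicate runs where the data
# lives, so only matching rows are shipped to the gateway.

def table_reader(rows, predicate):
    """Per-node scan that applies the WHERE clause before sending rows."""
    return [row for row in rows if predicate(row)]

# Three hypothetical shards of l_extendedprice values.
shards = [[999.5, 5000.0], [120.0, 3100.0, 80.0], [999.99]]
sparse = lambda price: price < 1000  # analogous to l_extendedprice < 1000

# Only the filtered rows are "sent" to the gateway.
shipped = [row for shard in shards for row in table_reader(shard, sparse)]
print(shipped)  # [999.5, 120.0, 80.0, 999.99]
```

Without the pushdown, all rows would stream to the gateway and be filtered there, which is roughly the non-DistSQL behavior this experiment compares against.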

Queries run with 630757c on 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.

The query is SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey. This query was constructed to have a JOIN that could have a lot of rows in it, followed by a limit. It is extremely artificial and otherwise useless in any real analytics scenario.

The DistSQL execution plan is intimidating, but ultimately correct.

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey limit 10" > foo_on

real    0m0.124s
user    0m0.028s
sys 0m0.028s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey limit 10" > foo_off

real    0m0.186s
user    0m0.036s
sys 0m0.016s

When run without the limit, both versions OOM:

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey" > foo_off
Error: pq: root: memory budget exceeded: 10240 bytes requested, 3676112896 bytes in budget
Failed running "sql"

real    0m16.898s
user    0m0.036s
sys 0m0.012s

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey" > foo_off
Error: pq: root: memory budget exceeded: 136007680 bytes requested, 3676112896 bytes in budget
Failed running "sql"

real    0m35.439s
user    0m0.028s
sys 0m0.028s

While that is unsatisfactory and needs work, this query demonstrates that, in both time and memory usage, LIMIT queries in DistSQL (and regular SQL) do the right thing and match expected behavior.

Query run using 630757c on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.

SELECT l_shipmode, AVG(l_extendedprice) FROM lineitem GROUP BY l_shipmode;

The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 44.103s | 7.799s |

@asubiotto Almost linear speedup. Nice!

Queries run using 630757c on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.

I ran a variety of join queries but am not documenting all of them, since they all tell the same story: we always plan HashJoins with full bisection flows on all nodes that have a TableReader for that query. Sadly, this means we are very susceptible to running out of memory, which we still do on large datasets (and we sometimes kill nodes as well, since the memory accounting guardrails are not in 630757c).

Here is one sample execution plan: as you can see, the planner is planning HashJoins and doing full bisection flows between all the nodes.

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT count(*) FROM tpch.lineitem, tpch.supplier where lineitem.l_suppkey = supplier.s_suppkey;"
+----------+
| count(*) |
+----------+
|  6001215 |
+----------+
(1 row)

real    0m52.902s
user    0m0.032s
sys 0m0.016s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT count(*) FROM tpch.lineitem, tpch.supplier where lineitem.l_suppkey = supplier.s_suppkey;"
+----------+
| count(*) |
+----------+
|  6001215 |
+----------+
(1 row)

real    0m13.531s
user    0m0.028s
sys 0m0.020s

The HashJoin, while not the best possible plan for this query, still delivers a hefty speedup over local execution.
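The "full bisection flow" above means every producer routes each row to a peer chosen by hashing the join key, so rows with equal keys always meet on the same node, where an ordinary local hash join runs. A sketch of that routing scheme (node count, row shapes, and function names are illustrative):

```python
# Sketch of a hash-partitioned join: both inputs route each row to the
# node chosen by hash(join_key) % n_nodes, so matching keys meet on the
# same node, where a local hash join runs.

N_NODES = 3

def route(rows, key):
    """Partition rows into one bucket per destination node."""
    buckets = [[] for _ in range(N_NODES)]
    for row in rows:
        buckets[hash(key(row)) % N_NODES].append(row)
    return buckets

def local_hash_join(left, right, lkey, rkey):
    """Build a hash table on one side, probe with the other."""
    table = {}
    for row in right:
        table.setdefault(rkey(row), []).append(row)
    return [(l, r) for l in left for r in table.get(lkey(l), [])]

# Toy stand-ins for lineitem (item, l_suppkey) and supplier (s_suppkey, name).
lineitem = [("item", suppkey) for suppkey in [1, 2, 2, 3]]
supplier = [(suppkey, "supp") for suppkey in [1, 2, 3]]

l_buckets = route(lineitem, key=lambda r: r[1])
r_buckets = route(supplier, key=lambda r: r[0])

matches = []
for node in range(N_NODES):
    matches += local_hash_join(l_buckets[node], r_buckets[node],
                               lkey=lambda r: r[1], rkey=lambda r: r[0])
print(len(matches))  # 4
```

The memory pressure described above corresponds to the per-node hash tables: each node must hold its partition of the build side in memory, which is why large build sides blow the budget.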

Closing this issue, as we have now empirically evaluated the breadth and limits of our DistSQL processors and planning. All the credit to @asubiotto, who shepherded this through all those OOMs!

🎉

Spun up an azworker with the same specs as navy and ran all of these queries against postgres (TPC-H scalefactor 1). These numbers are from one run only. Note that the single-node and distributed SQL numbers are from the runs above (copy-pasted for convenience) from 6 node clusters (only the first query was run on a 3 node local cluster).

| Query | Postgres | CockroachDB (single-node SQL) | CockroachDB (distributed SQL) |
| --- | --- | --- | --- |
| SELECT MIN(l_extendedprice) FROM lineitem | 1.582s | 22.028s | 9.520s |
| SELECT COUNT(*) FROM lineitem | 0.925s | 36.707s | 12.350s |
| SELECT AVG(l_extendedprice) FROM lineitem | 1.839s | 46.255s | 18.560s |
| SELECT * FROM orders ORDER BY o_totalprice | 6.063s | 3m51s | 3m34s |
| SELECT * FROM lineitem ORDER BY l_extendedprice LIMIT 10 | 1.476s | 1m45s | 15.389s |
| SELECT COUNT(DISTINCT l_suppkey) FROM lineitem LIMIT 1 | 5.234s | 29.891s | 9.481s |
| SELECT * FROM lineitem WHERE l_extendedprice < 1000 | 1.343s | 49.482s | 12.068s |
| SELECT * FROM lineitem JOIN supplier ON lineitem.l_suppkey = supplier.s_suppkey LIMIT 10 | 1.247ms | 1860ms | 1240ms |
| SELECT l_shipmode, AVG(l_extendedprice) FROM lineitem GROUP BY l_shipmode | 3.548s | 44.103s | 7.799s |
| SELECT COUNT(*) FROM lineitem, supplier where lineitem.l_suppkey = supplier.s_suppkey | 2.409s | 52.902s | 13.531s |

cc @petermattis @arjunravinarayan

Thanks, @asubiotto. This will definitely motivate work in 1.1.
