Consider the following TPC-C query:
SELECT COUNT(DISTINCT(s_i_id)) FROM order_line JOIN stock ON s_i_id=ol_i_id
WHERE ol_w_id = $1 AND ol_d_id = $2 AND ol_o_id BETWEEN $3 - 20 AND $3 - 1
AND s_w_id = $1 AND s_quantity < $4
This issue is mostly concerned with the JOIN between the two tables. We have some filters which we can push down to their respective tables before the JOIN. Now the order_line table happens to be way sparser than the stock table after all the filters are applied. The ideal way to execute this query is to do a full table scan on the order_line table, and then take the resulting tuples and do lookups in the stock table for those values (our primary index definition in the stock is such that these lookups would be very efficient). Basically, merge or hash join, you really don't want to do the work associated with a full scan over the stock table, as the number of items is very high compared to the selectivity of the filters and the join.
This is exactly how an index join is executed when we need to look up values in the primary index based on matches in a secondary index, so we already have the algorithmic processors working and extensively tested, we just need to extend it to perform joins across different tables. I understand that recognizing when this would be a superior option requires extensive table statistics, but given that we want to start thinking about hinting explicit query execution strategies, this is something we should support with specialized syntax.
cc @RaduBerinde @jordanlewis @knz.
Two items in your issue here:
1) the join algorithm is called a lookup join and Jordan was asking for this last week too. I have put it as an item in #18977. It depends on a few other things to be done first.
2) the ability to hint in a query which join algorithm to use is also another item. This one doesn't have prerequisites.
I will keep in mind that this improvement has significant impact, this will help us prioritize.
We need to reprioritize this as it has come up as a customer issue.
This has come up from users on the forum as well: https://forum.cockroachlabs.com/t/subquery-evaluation-on-simple-table-structure/1275/11.
See also #21301:
Note that distsql has the joinReader for index joins. That is effectively a nested-loop join and could be extended to support any table on the right side.
And @RaduBerinde says:
The only thing that's missing from joinReader to do general lookup-joins is routing some columns from the input to the output - currently it can only return things from the second table. For example for SELECT a, b, c FROM ab JOIN bc USING b we need to plumb column a through the joinReader.
Anyone interested in picking this up?
My hands are full, but I would really like to see this make it into the 2.0 release.
assigning to paul for now since he's mucking around in joins. If it gets to be too hard we can assign someone else.
A quick summary of a meeting with @knz and @andreimatei I had yesterday regarding hinting:
_Hint_: Suggestions for the database that it is free to ignore.
_Constraint_: Forced "hints". If non-sensical would throw error.
Need to distinguish whether or not this syntax of forcing a lookup join is a hint or constraint. It looks like a constraint.
Long term note: Might be nice to consider having a system where we can have hints that can be ignored and then a flag which can force the hints, turning them into constraints. This would be useful for testing to have an easy way to force a certain execution path.
In the end we decided that something like:
SELECT * FROM ABC JOIN@{STRATEGY=LOOKUP} DEFG ON A = D AND B = E;
specific indexes can be forced for using the join using the current forced index syntax. For example to specify the join should be on the DEF index, you can write as below:
SELECT * FROM ABC JOIN@{STRATEGY=LOOKUP} DEFG@DEF ON A = D AND B = E;
The hint will always follow the JOIN word: e.g.: SELECT * FROM ABC NATURAL JOIN@{STRATEGY=LOOKUP} DEFG;.
Benefits of using this syntax:
I believe Postgres doesn't have inline hints/constraints like this. Is there syntax from other systems that we should be adopting? I'm slightly anxious about deciding on this syntax in the next week or so. Perhaps in the short term we should have a session variable like set experimental.force_lookup_join = true. This would clearly be more limited as the hint would apply to every join in a query. Mainly I want to raise an alternative and make sure it was considered.
The session variable is probably simpler to implement/use too.
(nit: experimental_force_lookup_join as session variables cannot contain periods)
The question though is whether it would satisfy the use case(s).
I'm in favor if the set experimental_force_lookup_join = true; syntax since it's less of a commitment and doesn't change the inline syntax for select queries.
It appears as though it would satisfy at least the use case for the TPC-C query, the queries in the forum linked above, and the linked issue. If we are okay with the limitations in the short term, it seems like a lightweight way to introduce the functionality. Also it would give us more time to think through how we may want to introduce inline hinting if at all.
use
SET experimental_force_lookup_join = true;
to enable, and
SET experimental_force_lookup_join = false;
to disable
Great stuff paul!
On Thu, Feb 15, 2018 at 7:24 PM Paul Bardea notifications@github.com
wrote:
Closed #19038 https://github.com/cockroachdb/cockroach/issues/19038 via
22674 https://github.com/cockroachdb/cockroach/pull/22674.
—
You are receiving this because you commented.Reply to this email directly, view it on GitHub
https://github.com/cockroachdb/cockroach/issues/19038#event-1477060799,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALOpBMkTgg80nSd_0njFtWkMTBh0ma3nks5tVMrOgaJpZM4PuMwr
.