Presto: Fix USING clause for JOINs

Created on 2 Feb 2015 · 6Comments · Source: prestodb/presto

Columns that appear in a USING clause should not require qualification. This is particularly useful for outer joins, because the non-qualified reference always references the non-null value. It is not clear if qualifying reference is allowed: PostgreSQL allows it but Oracle does not. Allowing it is useful in order to do IS NULL checks for outer joins, especially when no other columns are available.

presto> select x from (values (1)) a (x) join (values (1)) b (x) using (x);
Query 20150202_160415_12625_wu6ji failed: Column 'x' is ambiguous

Additionally, because USING makes them the same column, the column should only appear once in the output descriptor:

presto> select * from (values (1)) a (x) join (values (1)) b (x) using (x);
 x | x 
---+---
 1 | 1 
(1 row)

Correct behavior for the above:

postgresql=# select x from (values (1)) a (x) join (values (1)) b (x) using (x);
 x 

---
 1
(1 row)

postgresql=# select * from (values (1)) a (x) join (values (1)) b (x) using (x);
 x 

---
 1
(1 row)

Additional test cases that should return the value from the outside of the join:

SELECT x FROM (VALUES (1), (2)) a (x) LEFT JOIN (VALUES (1)) b (x) USING (x);
SELECT x FROM (VALUES (1)) a (x) RIGHT JOIN (VALUES (1), (2)) b (x) USING (x);

bug

Source

electrum

All 6 comments

I will try to give it a shot.

kokosing on 9 Dec 2016

👍1

@martint Postgres changes the order of selection list when USING is used. See:

postgres=# select * from t join t a using (b);
 b | a | a 
---+---+---
 4 | 2 | 2
(1 row)

postgres=# select * from t join t a using (a); 
 a | b | b 
---+---+---
 1 |   |  
 2 | 4 | 4
(2 rows)

is that expected according to the ANSI SQL syntax?

kokosing on 9 Dec 2016

From the spec:

If a <named columns join> is specified, then every <column name> in the <join column list> shall be equivalent to the <field name> of exactly one field of RTA and the <field name> of exactly one field of RTB. Let _common column name_ be the name of such a column. Let _corresponding join columns_ refer to the columns identified in the <join column list>.

If there is at least one corresponding join column, then:
i) Let N be the number of corresponding join columns.
ii) For each _i_, 1 (one) ≤ i ≤ N:
1) Let CJCN_i be the i-th common column name, taken in order of their ordinal positions in RTA.

iii) Let SLCC be a <select list> of <derived column>s of the form:
COALESCE(RVA_i.CJCN_i, RVB_i.CJCN_i) AS CJCN_i
for every i, 1 (one) ≤ i N, in ascending order.

If RTA contains at least one field that is not a corresponding join column, then:
i) Let NCA be the number of fields of RTA that are not corresponding join columns.
ii) For each j, 1 (one) ≤ j ≤ NCA:
1) Let CA_j be the name of the j-th field that is not a corresponding join column, taken in order of their ordinal positions in RTA.

iii) Let SLTA be a <select list> of <derived column>s of the form RVCA_j.CA_j
for every j, 1 (one) ≤ j ≤ NCA, in ascending order.

If RTB contains at least one field that is not a corresponding join column, then:
i) Let NCB be the number of fields of RTB that are not corresponding join columns.
ii) For each k, 1 (one) ≤ k ≤ NCB:
1) Let CB_k be the name of the k-th field that is not a corresponding join column, taken in order of their ordinal positions in RTB.

iii) Let SLTB be a <select list> of <derived column>s of the form RVCB_k . CB_k
for every k, 1 (one) ≤ k ≤ NCB, in ascending order.

h) Let the <select list> SL be defined as
Case:
i) If all of the fields of RTA and RTB are corresponding join columns, then let SL be “SLCC”.
ii) If RTA contains no corresponding join columns and RTB contains no corresponding join columns, then let SL be “SLTA, SLTB”.
iii) If RTA contains no fields other than corresponding join columns, then let SL be “SLCC, SLTB”.
iv) If RTB contains no fields other than corresponding join columns, then let SL be “SLCC, SLTA”.
v) Otherwise, let SL be “SLCC, SLTA, SLTB”.

The descriptors of the columns of the result of the , with the possible exception of the nullability characteristics of the columns, are the same as the descriptors of the columns of the result of
SELECT SL FROM TRA, TRB

So, to summarize:

The descriptor of the result of the join is composed of the join columns, followed by the columns from the left relation (minus the join columns), followed by the columns of the right relation (minus the join columns)
The value of each join column is equivalent to coalesce(l.x, r.x)

martint on 11 Dec 2016

Thanks for explanation. I must admit that without your summary at the end I would not understand a word ;) Have you considered publishing a book like ANSI SQL explained?

I think I need to re implement this completely. I am going to write below rewrite. From

SELECT * from l JOIN r USING(x)

to:

SELECT * from (
  SELECT coalesce(l.x, r.x) as x, l.* (without x), r.* (without x) from l JOIN r ON lx = r.x

What do you think?

kokosing on 11 Dec 2016

👍1

What do you think?

Yes, you have to plan it as if you were seeing a query of that shape, but there's not mechanism to rewrite from one syntactic structure to another.

martint on 12 Dec 2016

Yes. I find more and more reasons that such mechanisms (iterative semantic analyzer) is needed. See https://github.com/prestodb/presto/issues/6625.

kokosing on 12 Dec 2016

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Running presto sql script from cli - not able to see time taken by the query.

KishorBaindoor · 3Comments

Hive connector - DROP TABLE does not work

tomz · 3Comments

SIGTERM doesn't wait GracefulShutdownHandler completion

xerial · 3Comments

java.lang.UnsupportedOperationException: com.facebook.presto.spi.type.LongDecimalType

dterror-zz · 4Comments

sequence() -> array<date> does not behave consistently with last day of the month

cps333 · 3Comments