Support queries like:
presto:sf1> select count(*) from lineitem where (orderkey, linenumber) IN (SELECT orderkey, linenumber from lineitem);
Query 20161018_062422_00016_uqzsf failed: line 1:60: Multiple columns returned by subquery are not yet supported. Found 2
It should be easy to implement once https://github.com/prestodb/presto/issues/6384 got implemented.
There's a PR in progress to support this: https://github.com/prestodb/presto/pull/5323
I'm starting working on this
@arhimondr do you have plans how tuples with NULLs should be distributed?
IN currently employs "replicate any and null" distribution (unless it's broadcasted of course). In tuple case, at first it seems you need to replicate tuples that contain any NULL value. Is it possible to do better?
Also, how does it play with structural types? What should be the result of map(2 => null) IN ( map(2 => 3) ) ?
Basically any structural type that contains a null value (no matter how deeply it is enclosed) is considered indeterminete. And IN for indeterminete should behave exactly the same as for NULL
replicate any and null
I'm going to change that to be replicate any and indeterminete
Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?
Yes
Just note that today:
presto> select array[1, 2, 3] = array[42, null];
_col0
-------
false
and IN is defined in terms of equality, so = and IN should remain consistent. (AFAIU... @martint can you comment?)
Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?
No, they have different size, so the comparison should return false.
Basically any structural type that contains a null value (no matter how deeply it is enclosed) is considered indeterminete. And IN for indeterminete should behave exactly the same as for NULL
With respect to how they are distributed, yes. But not for the purpose of comparison for the IN test - that still behaves according to equality semantics
Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?
No, they have different size, so the comparison should return false.
Oh, sure. Would the answer be the same for array[1, 2] and array[42, null]? (array = is left-to-right, so this comparison returns false)
What about array[1, 2] and array[null, 42]? This = comparison fails today, but multi-column IN can result in equivalent case ((1,2) IN ( select null, 42 ))
IN should have the same semantics as =
IN should have the same semantics as =
Right, so
array[1, 2, 3] IN ( array[42, null] ) is equal to array[1, 2, 3] = array[42, null]
And yeah,
No, they have different size, so the comparison should return false.
that's how the array comparison is implemented now. It works only because we check the size first. If it is different - it is false.
But when it comes to actually compare the values, we do not support nulls
presto> SELECT ARRAY[1, null] = ARRAY[1, null];
Query 20180502_154031_00004_dwbpc failed: ARRAY comparison not supported for arrays with null elements
Postgree would just answer true in that case.
postgres=# SELECT ARRAY[1, null] = ARRAY[1, null];
?column?
----------
t
(1 row)
I guess since we didn't implement this - it is not covered by the standard.
Null comparisons for null in row are also not supported https://github.com/prestodb/presto/blob/master/presto-spi/src/main/java/com/facebook/presto/spi/type/RowType.java#L263
And it seems that semantics for null in row and semantics for null in multicolumn in are different
postgres=# SELECT (1, cast(null as int)) IN (SELECT 1, cast(null as int));
?column?
----------
(1 row)
is equivalent to
postgres=# SELECT 1 = 1 AND null = null;
?column?
----------
(1 row)
postgres=# SELECT array[1, cast(null as int)] IN (SELECT array[1, cast(null as int)]);
?column?
----------
t
(1 row)
is equivalent to
postgres=# SELECT array[1, cast(null as int)] = array[1, cast(null as int)];
?column?
----------
t
(1 row)
I didn't check the SQL standard yet (i will), but so far the idea is to change the INDETERMINATE operator for ROW/MAP/ARRAY to return true only if there is a null on the first level. And than to just use it instead of isNull checks in SemiJoin related code.
* *Update: this comment is incorrect. See follow-up comments below. **
~Ugh... I just checked the spec. For the purpose of array equality, values are compared according to a different set of rules than regular equality:~
Let A1 and A2 be arrays of EDT. A1 and A2 are identical if and only if A1 and A2
have the same cardinality n and if, for all i in the range 1 (one) ≤ i ≤ n, the element
at ordinal position i in A1 is identical to the element at ordinal position i in A2.
and
9.10 Determination of identical values
...
2) Case:
a) If V1 and V2 are both the null value, then V1 is identical to V2.
Ah, I take that back. That first block talks about arrays being "identical", not "equal", so that's not the operation to be considered for equality.
Here are the equality rules for arrays:
ii) If the declared types of XV and YV are array types or distinct types whose source types
are array types and the cardinalities of XV and YV are N1 and N2, respectively, then
let Xi, 1 (one) ≤ i ≤ N1, denote a <value expression> whose value and declared type is
that of the i-th element of XV and let Yi, 1 (one) ≤ i ≤ N2, denote a <value expression>
whose value and declared type is that of the i-th element of YV. The result of
X <comp op> Y
is determined as follows:
1) X = Y is True if N1 = 0 (zero) and N2 = 0 (zero).
2) X = Y is True if N1 = N2 and, for all i, Xi = Yi is True.
3) X = Y is False if and only if one of the following is true:
A) N1 ≠ N2.
B) N1 = N2 and for some i, 1 (one) ≤ i ≤ N1, NOT (Xi = Yi) is True.
4) X <comp op> Y is Unknown if X <comp op> Y is neither True nor False.
So, according to the standard, array [1, null] = array[1, null] => null
And for rows:
i) If the declared types of XV and YV are row types with degree N, then let Xi, 1 (one) ≤ i ≤ N,
denote a <value expression> whose value and declared type is that of the i-th field of XV and let Yi
denote a <value expression> whose value and declared type is that of the i-th field of YV.
The result of
X <comp op> Y
is determined as follows:
1) X = Y is True if and only if Xi = Yi is True for all i.
2) X < Y is True if and only if Xi = Yi is True for all i < n and Xn < Yn for some n.
3) X = Y is False if and only if NOT (Xi = Yi) is True for some i.
4) X < Y is False if and only if X = Y is True or Y < X is True.
5) X <comp op> Y is Unknown if X <comp op> Y is neither True nor False.
Yeah, thanks for digging into the standard. I was just about to do so.
So, yeah, PostgreSQL implements array equals not by the standard.
the idea is to change the INDETERMINATE operator for ROW/MAP/ARRAY to return true only if there is a null on the first level
We should leave INDETERMINATE operator implementation as is than.
And yeah, we cannot rely on INDETERMINATE when comparing, because due to the standard array [1, null] = array[1, null] => null, but array [1, null] = array[2, null] => FALSE
Little bit of a pseudo code =)
Boolean in(List<Object> row, Set<List<Object>> searchRowSet)
{
if (searchRowSet.isEmpty()) {
return false;
}
if (searchRowSet.containsExact(row)) {
return true;
}
for (List<Object> setElement : searchRowSet) {
if (nullableEquals(setElement, row) == null) {
return null;
}
}
return false;
}
Boolean nullableEquals(List<Object> setElement, List<Object> values)
{
boolean match = true;
for (int i = 0; i < values.size(); i++) {
Boolean result = nullableEquals(setElement.get(i), values.get(i));
if (result == Boolean.FALSE) {
return false;
}
if (result == null) {
match = false;
}
}
if (!match) {
return null;
}
return true;
}
Boolean nullableEquals(Object first, Object second)
{
if (first == null || second == null) {
return null;
}
return first.equals(second);
}
This algorithm is already implemented in the HashSemiJoinOperator.
On the probe side we need no modification. We always operate on a single row.
On the build side we need to broadcast all the values with nulls at any level (we call them INDETERMINATE). This should be an easy change, since we already doing this for null's.
In that PR we also blocking null values on the probe side. That change was introduced mainly because we didn't broadcast an any row. But @findepi already added it. So, i guess i can just drop the very last commit.
The only unresolved issue there is the nullableEquals. When for basic types it is easy to take care about nulls outside of the equals method, for complex types it is not that easy. Currently equals for complex typed doesn't support null at all.
After implementing a nullableEquals for complex types we can add a rewrite:
SELECT (value1, value2) IN (SELECT column1, column2 FROM table) -> SELECT row(value1, value2) IN (SELECT row(column1, column2) FROM table)
@kokosing @martint @findepi
presto> SELECT ARRAY[1, null] = ARRAY[1, null]; Query 20180502_154031_00004_dwbpc failed: ARRAY comparison not supported for arrays with null elements
@arhimondr -- this is correct in a sense this doesn't produce wrong result. It's not (yet) supported.
In that PR we also blocking null values on the probe side. That change was introduced mainly because we didn't broadcast an any row. But @findepi already added it. So, i guess i can just drop the very last commit.
Which PR you refer to?
The only unresolved issue there is the nullableEquals. When for basic types it is easy to take care about nulls outside of the equals method, for complex types it is not that easy. Currently equals for complex typed doesn't support null at all.
Doesn't it mean we need to implement missing support for null elements in equality functions for complex types? I.e. replace "ARRAY comparison not supported for arrays with null elements" with something that follows the standard more closely?
Which PR you refer to?
https://github.com/prestodb/presto/pull/7792
Doesn't it mean we need to implement missing support for null elements in equality functions for complex types? I.e. replace "ARRAY comparison not supported for arrays with null elements" with something that follows the standard more closely?
Yes, exactly.
On the probe side we need no modification. We always operate on a single row.
This isn't true unfortunately. Consider probe row: array[null, 1] and build row: array[2, 1]. Now you need to make sure that both rows land on the same machine to deduce that the semi join result for array[null, 1] should be null instead of false. In order to do so, you most likely need to broadcast array[null, 1] row to each node. But then you need to reduce such broadcasted probe rows after semi join operator.
So it seems that probe rows need to have unique row id columns before broadcasting. After broadcast there should be an aggregation that will group by unique row id with OR aggregation on semi join symbol.
E.g probe should be similar to:
Aggregation[UniqueIdSymbol][aggregateSemiJoinSymbol:=OR(semiJoinSymbol)]
|
SemiJoin (broadcasts probe rows with nulls)
|
AssignUniqueID
We already distribute an "any" row. Initially i though we introduced that
to cover the "empty in" case, but than i realized that it should cover the
probe nulls case as well.
For your particular example it would return the correct result. Because the
"array[2, 1]" value is gonna be broadcasted as the "any" value.
Although It still feels that there's a case that may not be covered, so far
i couldn't come up with one.
Let me try to follow you through my thoughts.
No matter what the build side set is, for an indeterminate value on a probe
side only "false" or "null" can be returned.
In the SQL "null or false" expression evaluates to null.
It is enough a single expression in the IN comparison chain to return
"null" to evaluate the entire IN as null, e.g.: value1 IN (null, value2,
value3 ... ) is always null. No matter what the rest of the values are.
You can get a "null" only when matching an indeterminate with an
indeterminate (a value that contains a null somewhere).
Since we replicate all the indeterminate values, we shouldn't miss a null.
Let me know if I'm missing something.
On Mon, May 7, 2018 at 10:51 AM Karol Sobczak notifications@github.com
wrote:
E.g probe should be similar to:
Aggregation[UniqueIdSymbol][aggregateSemiJoinSymbol:=OR(semiJoinSymbol)]
|
SemiJoin (broadcasts probe rows with nulls)
|
AssignUniqueID—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/prestodb/presto/issues/6385#issuecomment-387089981,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AFUBrEgaUwO_xoS3gLaebmamEb_pICHuks5twF9dgaJpZM4KZa0o
.
UPD:
I'm slow. You get a "null" also for an "array[1,null] in (array[1,2],
array[2,3])". In case array[2,3] is replicated we get a "false", what is
incorrect.
Yeah, you are totally right, we need a unique row id approach here.
On Mon, May 7, 2018 at 5:20 PM Andriy Rosa andriyrosa@gmail.com wrote:
We already distribute an "any" row. Initially i though we introduced that
to cover the "empty in" case, but than i realized that it should cover the
probe nulls case as well.For your particular example it would return the correct result. Because
the "array[2, 1]" value is gonna be broadcasted as the "any" value.Although It still feels that there's a case that may not be covered, so
far i couldn't come up with one.Let me try to follow you through my thoughts.
No matter what the build side set is, for an indeterminate value on a
probe side only "false" or "null" can be returned.In the SQL "null or false" expression evaluates to null.
It is enough a single expression in the IN comparison chain to return
"null" to evaluate the entire IN as null, e.g.: value1 IN (null, value2,
value3 ... ) is always null. No matter what the rest of the values are.You can get a "null" only when matching an indeterminate with an
indeterminate (a value that contains a null somewhere).Since we replicate all the indeterminate values, we shouldn't miss a null.
Let me know if I'm missing something.
On Mon, May 7, 2018 at 10:51 AM Karol Sobczak notifications@github.com
wrote:E.g probe should be similar to:
Aggregation[UniqueIdSymbol][aggregateSemiJoinSymbol:=OR(semiJoinSymbol)]
|
SemiJoin (broadcasts probe rows with nulls)
|
AssignUniqueID—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/prestodb/presto/issues/6385#issuecomment-387089981,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AFUBrEgaUwO_xoS3gLaebmamEb_pICHuks5twF9dgaJpZM4KZa0o
.
It is enough a single expression in the IN comparison chain to return
"null" to evaluate the entire IN as null, e.g.: value1 IN (null, value2,
value3 ... ) is always null. No matter what the rest of the values are.
unless eg value1 = value3: null or false or true is true
Yeah, you are totally right, we need a unique row id approach here.
Yes, but it is adds overhead for cases where there are no nulls at all. Maybe instead for now we could just support semi joins which are used by EXISTS (e.g: FilterNode above SemiJoin node)? That is much easier since we don't have to distinguish between null/false semi join result. For cases where semi join is not used filter we can simply use current semantics (e.g: fail when any row is indeterminate [e.g: via filter below semi join]).
On the other hand semi join with majority of build rows being indeterminate might have performance similar to cross join.
Do you want to complete this issue because you want to fix complex type equality and this is prerequisite?
It is the other way. I want to fix complex type equation as a prerequisite
to this.
On Tue, May 8, 2018 at 4:38 AM Karol Sobczak notifications@github.com
wrote:
Yeah, you are totally right, we need a unique row id approach here.
Yes, but it is adds overhead for cases where there are no nulls at all.
Maybe instead for now we could just support semi joins which are used for
row filtering (e.g: FilterNode above SemiJoin node)? That is much easier
issue since we don't have to distinguish between null/false semi join
result. For cases where semi join is not used filter we can simply use
current semantics (e.g: fail when any row is indeterminate (e.g: via filter
below semi join)).On the other hand semi join with majority of build rows being
indeterminate might have performance similar to cross join.Do you want to complete this issue because you want to fix complex type
equality and this is prerequisite?—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/prestodb/presto/issues/6385#issuecomment-387345060,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AFUBrPI4cNpro52g7VAjjqrAc_Pcx8Vtks5twWd8gaJpZM4KZa0o
.
That is much easier since we don't have to distinguish between null/false semi join result.
That's a good point. Actually in many cases you don't have to distinguish between NULL and FALSE for IN query as well.
For example
SELECT *
FROM nation
WHERE nationkey in (SELECT nationkey FROM othertable)
I'm not sure, but it seems that whenever we use an IN subquery as a filter predicate - we shouldn't care about null vs false in many cases, since in predicate we only care if the entire expression is evaluated to true (there are tricky cases with negation though?).
Our optimizer should be smart enough to detect whether we really care about nulls.
IN in projection is uncommon. However, in WHERE the IN might be negated.
Then we need to distinguish between false result and null
>
Hi there, just upping this thread in case anyone has given some more thoughts into it. Multiple columns in IN predicate sounds like an interesting feature to have as it can avoid a lot of complexity, especially when dealing with template-generated queries.
Update: there is a way to use multiple-columns with an WHERE (a, b) IN (SELECT ...), by wrapping the selected rows in parenthesis so as they're effectively turned into one single anonymous row.
Example:
SELECT a, b
FROM table
WHERE (a, b)
IN (
SELECT (a, b) --- wrap a, b in parenthesis here
FROM table
)
The problem is that currently Presto doesn't support IN expression for
complex types. There were complications mostly related to null vs false
result for semi join.
On Wed, Sep 11, 2019 at 2:45 AM Jivan Roquet notifications@github.com
wrote:
Update: there is a way to use multiple-columns with an WHERE (a, b) IN
(SELECT ...), by wrapping the selected rows in parenthesis so as they're
effectively turned into one single anonymous row.Example:
SELECT a, bFROM tableWHERE (a, b)IN (
SELECT (a, b) --- wrap a, b in parenthesis here
FROM table
)—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/prestodb/presto/issues/6385?email_source=notifications&email_token=ABKQDLD7TR3EAORGCAVA7FLQJC427A5CNFSM4CTFVUUKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6N5LFY#issuecomment-530306455,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABKQDLDL62J2DXVVEAJMGXDQJC427ANCNFSM4CTFVUUA
.
The problem is that currently Presto doesn't support IN expression for complex types. There were complications mostly related to
nullvsfalseresult for semi join.
How comes it works by simply wrapping multiple columns in parenthesis, then?
Most helpful comment
I'm starting working on this