presto 🚀 - Support multiple columns in IN predicate

There's a PR in progress to support this: https://github.com/prestodb/presto/pull/5323

martint on 18 Oct 2016

I'm starting working on this

arhimondr on 30 Apr 2018

👍4 ❤2

@arhimondr do you have plans how tuples with NULLs should be distributed?
IN currently employs "replicate any and null" distribution (unless it's broadcasted of course). In tuple case, at first it seems you need to replicate tuples that contain any NULL value. Is it possible to do better?

Also, how does it play with structural types? What should be the result of map(2 => null) IN ( map(2 => 3) ) ?

findepi on 30 Apr 2018

Basically any structural type that contains a null value (no matter how deeply it is enclosed) is considered indeterminete. And IN for indeterminete should behave exactly the same as for NULL

arhimondr on 30 Apr 2018

replicate any and null

I'm going to change that to be replicate any and indeterminete

arhimondr on 30 Apr 2018

Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?

findepi on 1 May 2018

Yes

arhimondr on 1 May 2018

Just note that today:

presto> select array[1, 2, 3] = array[42, null];
 _col0
-------
 false

and IN is defined in terms of equality, so = and IN should remain consistent. (AFAIU... @martint can you comment?)

findepi on 1 May 2018

Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?

No, they have different size, so the comparison should return false.

martint on 1 May 2018

Basically any structural type that contains a null value (no matter how deeply it is enclosed) is considered indeterminete. And IN for indeterminete should behave exactly the same as for NULL

With respect to how they are distributed, yes. But not for the purpose of comparison for the IN test - that still behaves according to equality semantics

martint on 1 May 2018

Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?

No, they have different size, so the comparison should return false.

Oh, sure. Would the answer be the same for array[1, 2] and array[42, null]? (array = is left-to-right, so this comparison returns false)
What about array[1, 2] and array[null, 42]? This = comparison fails today, but multi-column IN can result in equivalent case ((1,2) IN ( select null, 42 ))

findepi on 1 May 2018

IN should have the same semantics as =

martint on 1 May 2018

IN should have the same semantics as =

Right, so

array[1, 2, 3] IN ( array[42, null] ) is equal to array[1, 2, 3] = array[42, null]

And yeah,

No, they have different size, so the comparison should return false.

that's how the array comparison is implemented now. It works only because we check the size first. If it is different - it is false.

But when it comes to actually compare the values, we do not support nulls

presto> SELECT ARRAY[1, null] = ARRAY[1, null];
Query 20180502_154031_00004_dwbpc failed: ARRAY comparison not supported for arrays with null elements

Postgree would just answer true in that case.

postgres=# SELECT ARRAY[1, null] = ARRAY[1, null];
 ?column?
----------
 t
(1 row)

I guess since we didn't implement this - it is not covered by the standard.

Null comparisons for null in row are also not supported https://github.com/prestodb/presto/blob/master/presto-spi/src/main/java/com/facebook/presto/spi/type/RowType.java#L263

And it seems that semantics for null in row and semantics for null in multicolumn in are different

postgres=# SELECT (1, cast(null as int)) IN (SELECT 1, cast(null as int));
 ?column?
----------

(1 row)

is equivalent to

postgres=# SELECT 1 = 1 AND null = null;
 ?column?
----------

(1 row)

postgres=# SELECT array[1, cast(null as int)] IN (SELECT array[1, cast(null as int)]);
 ?column?
----------
 t
(1 row)

is equivalent to

postgres=# SELECT array[1, cast(null as int)] = array[1, cast(null as int)];
 ?column?
----------
 t
(1 row)

I didn't check the SQL standard yet (i will), but so far the idea is to change the INDETERMINATE operator for ROW/MAP/ARRAY to return true only if there is a null on the first level. And than to just use it instead of isNull checks in SemiJoin related code.

arhimondr on 2 May 2018

* *Update: this comment is incorrect. See follow-up comments below. **

~~~Ugh... I just checked the spec. For the purpose of array equality, values are compared according to a different set of rules than regular equality:~~~

Let A1 and A2 be arrays of EDT. A1 and A2 are identical if and only if A1 and A2
have the same cardinality n and if, for all i in the range 1 (one) ≤ i ≤ n, the element 
at ordinal position i in A1 is identical to the element at ordinal position i in A2.

and

9.10 Determination of identical values
...

2) Case:
    a) If V1 and V2 are both the null value, then V1 is identical to V2.

martint on 2 May 2018

Ah, I take that back. That first block talks about arrays being "identical", not "equal", so that's not the operation to be considered for equality.

martint on 2 May 2018

Here are the equality rules for arrays:

ii) If the declared types of XV and YV are array types or distinct types whose source types 
    are array types and the cardinalities of XV and YV are N1 and N2, respectively, then 
    let Xi, 1 (one) ≤ i ≤ N1, denote a <value expression> whose value and declared type is 
    that of the i-th element of XV and let Yi, 1 (one) ≤ i ≤ N2, denote a <value expression> 
    whose value and declared type is that of the i-th element of YV. The result of

        X <comp op> Y

    is determined as follows:
    1) X = Y is True if N1 = 0 (zero) and N2 = 0 (zero).
    2) X = Y is True if N1 = N2 and, for all i, Xi = Yi is True.
    3) X = Y is False if and only if one of the following is true:
            A) N1 ≠ N2.
            B) N1 = N2 and for some i, 1 (one) ≤ i ≤ N1, NOT (Xi = Yi) is True.
    4) X <comp op> Y is Unknown if X <comp op> Y is neither True nor False.

So, according to the standard, array [1, null] = array[1, null] => null

martint on 2 May 2018

And for rows:

i) If the declared types of XV and YV are row types with degree N, then let Xi, 1 (one) ≤ i ≤ N,
    denote a <value expression> whose value and declared type is that of the i-th  field of XV and let Yi
    denote a <value expression> whose value and declared type is that of the i-th  field of YV.

    The result of

        X <comp op> Y

    is determined as follows:
    1) X = Y is True if and only if Xi = Yi is True for all i.
    2) X < Y is True if and only if Xi = Yi is True for all i < n and Xn < Yn for some n.
    3) X = Y is False if and only if NOT (Xi = Yi) is True for some i.
    4) X < Y is False if and only if X = Y is True or Y < X is True.
    5) X <comp op> Y is Unknown if X <comp op> Y is neither True nor False.

martint on 2 May 2018

Yeah, thanks for digging into the standard. I was just about to do so.

So, yeah, PostgreSQL implements array equals not by the standard.

the idea is to change the INDETERMINATE operator for ROW/MAP/ARRAY to return true only if there is a null on the first level

We should leave INDETERMINATE operator implementation as is than.

arhimondr on 2 May 2018

And yeah, we cannot rely on INDETERMINATE when comparing, because due to the standard array [1, null] = array[1, null] => null, but array [1, null] = array[2, null] => FALSE

arhimondr on 2 May 2018

Little bit of a pseudo code =)

    Boolean in(List<Object> row, Set<List<Object>> searchRowSet)
    {
        if (searchRowSet.isEmpty()) {
            return false;
        }

        if (searchRowSet.containsExact(row)) {
            return true;
        }

        for (List<Object> setElement : searchRowSet) {
            if (nullableEquals(setElement, row) == null) {
                return null;
            }
        }

        return false;
    }

    Boolean nullableEquals(List<Object> setElement, List<Object> values)
    {
        boolean match = true;
        for (int i = 0; i < values.size(); i++) {
            Boolean result = nullableEquals(setElement.get(i), values.get(i));
            if (result == Boolean.FALSE) {
                return false;
            }
            if (result == null) {
                match = false;
            }
        }
        if (!match) {
            return null;
        }
        return true;
    }

    Boolean nullableEquals(Object first, Object second)
    {
        if (first == null || second == null) {
            return null;
        }
        return first.equals(second);
    }

arhimondr on 3 May 2018

This algorithm is already implemented in the HashSemiJoinOperator.

On the probe side we need no modification. We always operate on a single row.

On the build side we need to broadcast all the values with nulls at any level (we call them INDETERMINATE). This should be an easy change, since we already doing this for null's.

In that PR we also blocking null values on the probe side. That change was introduced mainly because we didn't broadcast an any row. But @findepi already added it. So, i guess i can just drop the very last commit.

The only unresolved issue there is the nullableEquals. When for basic types it is easy to take care about nulls outside of the equals method, for complex types it is not that easy. Currently equals for complex typed doesn't support null at all.

After implementing a nullableEquals for complex types we can add a rewrite:

SELECT (value1, value2) IN (SELECT column1, column2 FROM table) -> SELECT row(value1, value2) IN (SELECT row(column1, column2) FROM table)

@kokosing @martint @findepi

arhimondr on 3 May 2018

presto> SELECT ARRAY[1, null] = ARRAY[1, null];
Query 20180502_154031_00004_dwbpc failed: ARRAY comparison not supported for arrays with null elements

@arhimondr -- this is correct in a sense this doesn't produce wrong result. It's not (yet) supported.

In that PR we also blocking null values on the probe side. That change was introduced mainly because we didn't broadcast an any row. But @findepi already added it. So, i guess i can just drop the very last commit.

Which PR you refer to?

The only unresolved issue there is the nullableEquals. When for basic types it is easy to take care about nulls outside of the equals method, for complex types it is not that easy. Currently equals for complex typed doesn't support null at all.

Doesn't it mean we need to implement missing support for null elements in equality functions for complex types? I.e. replace "ARRAY comparison not supported for arrays with null elements" with something that follows the standard more closely?

findepi on 3 May 2018

Which PR you refer to?

https://github.com/prestodb/presto/pull/7792

Doesn't it mean we need to implement missing support for null elements in equality functions for complex types? I.e. replace "ARRAY comparison not supported for arrays with null elements" with something that follows the standard more closely?

Yes, exactly.

arhimondr on 3 May 2018

On the probe side we need no modification. We always operate on a single row.

This isn't true unfortunately. Consider probe row: array[null, 1] and build row: array[2, 1]. Now you need to make sure that both rows land on the same machine to deduce that the semi join result for array[null, 1] should be null instead of false. In order to do so, you most likely need to broadcast array[null, 1] row to each node. But then you need to reduce such broadcasted probe rows after semi join operator.

So it seems that probe rows need to have unique row id columns before broadcasting. After broadcast there should be an aggregation that will group by unique row id with OR aggregation on semi join symbol.

sopel39 on 7 May 2018

E.g probe should be similar to:

Aggregation[UniqueIdSymbol][aggregateSemiJoinSymbol:=OR(semiJoinSymbol)]
   |
SemiJoin (broadcasts probe rows with nulls)
   |
AssignUniqueID

sopel39 on 7 May 2018

We already distribute an "any" row. Initially i though we introduced that
to cover the "empty in" case, but than i realized that it should cover the
probe nulls case as well.

For your particular example it would return the correct result. Because the
"array[2, 1]" value is gonna be broadcasted as the "any" value.

Although It still feels that there's a case that may not be covered, so far
i couldn't come up with one.

Let me try to follow you through my thoughts.

No matter what the build side set is, for an indeterminate value on a probe
side only "false" or "null" can be returned.

In the SQL "null or false" expression evaluates to null.

It is enough a single expression in the IN comparison chain to return
"null" to evaluate the entire IN as null, e.g.: value1 IN (null, value2,
value3 ... ) is always null. No matter what the rest of the values are.

You can get a "null" only when matching an indeterminate with an
indeterminate (a value that contains a null somewhere).

Since we replicate all the indeterminate values, we shouldn't miss a null.

Let me know if I'm missing something.

On Mon, May 7, 2018 at 10:51 AM Karol Sobczak notifications@github.com
wrote:

E.g probe should be similar to:

Aggregation[UniqueIdSymbol][aggregateSemiJoinSymbol:=OR(semiJoinSymbol)]
|
SemiJoin (broadcasts probe rows with nulls)
|
AssignUniqueID

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/prestodb/presto/issues/6385#issuecomment-387089981,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AFUBrEgaUwO_xoS3gLaebmamEb_pICHuks5twF9dgaJpZM4KZa0o
.

arhimondr on 8 May 2018

UPD:

I'm slow. You get a "null" also for an "array[1,null] in (array[1,2],
array[2,3])". In case array[2,3] is replicated we get a "false", what is
incorrect.

Yeah, you are totally right, we need a unique row id approach here.
On Mon, May 7, 2018 at 5:20 PM Andriy Rosa andriyrosa@gmail.com wrote:

We already distribute an "any" row. Initially i though we introduced that
to cover the "empty in" case, but than i realized that it should cover the
probe nulls case as well.

For your particular example it would return the correct result. Because
the "array[2, 1]" value is gonna be broadcasted as the "any" value.

Although It still feels that there's a case that may not be covered, so
far i couldn't come up with one.

Let me try to follow you through my thoughts.

No matter what the build side set is, for an indeterminate value on a
probe side only "false" or "null" can be returned.

In the SQL "null or false" expression evaluates to null.

It is enough a single expression in the IN comparison chain to return
"null" to evaluate the entire IN as null, e.g.: value1 IN (null, value2,
value3 ... ) is always null. No matter what the rest of the values are.

You can get a "null" only when matching an indeterminate with an
indeterminate (a value that contains a null somewhere).

Since we replicate all the indeterminate values, we shouldn't miss a null.

Let me know if I'm missing something.

On Mon, May 7, 2018 at 10:51 AM Karol Sobczak notifications@github.com
wrote:

E.g probe should be similar to:

Aggregation[UniqueIdSymbol][aggregateSemiJoinSymbol:=OR(semiJoinSymbol)]
|
SemiJoin (broadcasts probe rows with nulls)
|
AssignUniqueID

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/prestodb/presto/issues/6385#issuecomment-387089981,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AFUBrEgaUwO_xoS3gLaebmamEb_pICHuks5twF9dgaJpZM4KZa0o
.

arhimondr on 8 May 2018

It is enough a single expression in the IN comparison chain to return
"null" to evaluate the entire IN as null, e.g.: value1 IN (null, value2,
value3 ... ) is always null. No matter what the rest of the values are.

unless eg value1 = value3: null or false or true is true

findepi on 8 May 2018

Yeah, you are totally right, we need a unique row id approach here.

Yes, but it is adds overhead for cases where there are no nulls at all. Maybe instead for now we could just support semi joins which are used by EXISTS (e.g: FilterNode above SemiJoin node)? That is much easier since we don't have to distinguish between null/false semi join result. For cases where semi join is not used filter we can simply use current semantics (e.g: fail when any row is indeterminate [e.g: via filter below semi join]).

On the other hand semi join with majority of build rows being indeterminate might have performance similar to cross join.

Do you want to complete this issue because you want to fix complex type equality and this is prerequisite?

sopel39 on 8 May 2018

It is the other way. I want to fix complex type equation as a prerequisite
to this.
On Tue, May 8, 2018 at 4:38 AM Karol Sobczak notifications@github.com
wrote:

Yeah, you are totally right, we need a unique row id approach here.

Yes, but it is adds overhead for cases where there are no nulls at all.
Maybe instead for now we could just support semi joins which are used for
row filtering (e.g: FilterNode above SemiJoin node)? That is much easier
issue since we don't have to distinguish between null/false semi join
result. For cases where semi join is not used filter we can simply use
current semantics (e.g: fail when any row is indeterminate (e.g: via filter
below semi join)).

On the other hand semi join with majority of build rows being
indeterminate might have performance similar to cross join.

Do you want to complete this issue because you want to fix complex type
equality and this is prerequisite?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/prestodb/presto/issues/6385#issuecomment-387345060,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AFUBrPI4cNpro52g7VAjjqrAc_Pcx8Vtks5twWd8gaJpZM4KZa0o
.

arhimondr on 8 May 2018

That is much easier since we don't have to distinguish between null/false semi join result.

That's a good point. Actually in many cases you don't have to distinguish between NULL and FALSE for IN query as well.

For example

SELECT *
FROM nation
WHERE nationkey in (SELECT nationkey FROM othertable)

I'm not sure, but it seems that whenever we use an IN subquery as a filter predicate - we shouldn't care about null vs false in many cases, since in predicate we only care if the entire expression is evaluated to true (there are tricky cases with negation though?).

Our optimizer should be smart enough to detect whether we really care about nulls.

arhimondr on 9 May 2018

IN in projection is uncommon. However, in WHERE the IN might be negated.
Then we need to distinguish between false result and null

>

findepi on 9 May 2018

👍1

Hi there, just upping this thread in case anyone has given some more thoughts into it. Multiple columns in IN predicate sounds like an interesting feature to have as it can avoid a lot of complexity, especially when dealing with template-generated queries.

JivanRoquet on 10 Sep 2019

Update: there is a way to use multiple-columns with an WHERE (a, b) IN (SELECT ...), by wrapping the selected rows in parenthesis so as they're effectively turned into one single anonymous row.

Example:

SELECT a, b
FROM table
WHERE (a, b)
IN (
  SELECT (a, b) --- wrap a, b in parenthesis here
  FROM table
)

Credits: https://stackoverflow.com/a/57876281/2091169

JivanRoquet on 11 Sep 2019

The problem is that currently Presto doesn't support IN expression for
complex types. There were complications mostly related to null vs false
result for semi join.

On Wed, Sep 11, 2019 at 2:45 AM Jivan Roquet notifications@github.com
wrote:

Update: there is a way to use multiple-columns with an WHERE (a, b) IN
(SELECT ...), by wrapping the selected rows in parenthesis so as they're
effectively turned into one single anonymous row.

Example:

SELECT a, bFROM tableWHERE (a, b)IN (
SELECT (a, b) --- wrap a, b in parenthesis here
FROM table
)

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/prestodb/presto/issues/6385?email_source=notifications&email_token=ABKQDLD7TR3EAORGCAVA7FLQJC427A5CNFSM4CTFVUUKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6N5LFY#issuecomment-530306455,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABKQDLDL62J2DXVVEAJMGXDQJC427ANCNFSM4CTFVUUA
.

arhimondr on 11 Sep 2019

The problem is that currently Presto doesn't support IN expression for complex types. There were complications mostly related to null vs false result for semi join.

How comes it works by simply wrapping multiple columns in parenthesis, then?

JivanRoquet on 12 Sep 2019

Presto: Support multiple columns in IN predicate

Most helpful comment

All 36 comments

Related issues