I connected Hive with Presto and queried a table containing decimal values. With the != operator, the query returns all the rows of the table.
Below are the create table and load queries for hive:
create table decimalOperatorCheck(name String, ids decimal(10,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Load data:
load data local inpath '/home/hduser/Files/decimaldata.csv' into table decimalOperatorCheck;
Query in Presto CLI:
select * from decimaloperatorcheck where ids !=12345.56;
Output:
presto:default> select * from decimaloperatorcheck where ids !=12345.56;
name | ids
--------+----------
Justin | 11.90
Josh | 233.34
Alex | 123.45
Ryan | 12345.56
(4 rows)
Here is the csv used for loading data: data.txt
It seems like a bug; thanks for reporting.
@geetikagupta16 What version of Presto are you running?
@KBP-TDC I am using presto-server-0.170
Actually this is not strictly decimal related. It is an implication of three facts:
1. Numeric literals like 12345.56 are interpreted in Presto as literals of type DOUBLE.
2. When comparing DECIMAL and DOUBLE types, the comparison happens in the DOUBLE domain (the decimal value is first cast to DOUBLE).
3. Not every decimal value has an exact representation in the DOUBLE domain.

So the query
select * from decimaloperatorcheck where ids !=12345.56;
is effectively executing as:
select * from decimaloperatorcheck where cast(ids as DOUBLE) != 12345.56;
If you explicitly rewrite the query so that the comparison happens in the DECIMAL domain, using the appropriate literal constructor, the query will hopefully pass:
select * from decimaloperatorcheck where ids != DECIMAL '12345.56';
We have plans to straighten that up, but it is a somewhat larger effort, so far pushed out of the priority queue.
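As an aside, the difference between comparing in the DECIMAL domain and in the DOUBLE domain can be reproduced in plain Java (a standalone illustration, not Presto code):
```
import java.math.BigDecimal;

public class ComparisonDomains {
    public static void main(String[] args) {
        // In the DOUBLE domain, 0.1 + 0.2 picks up binary rounding error...
        System.out.println(0.1 + 0.2 == 0.3); // -> false

        // ...while the same comparison in the DECIMAL domain is exact.
        BigDecimal sum = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        System.out.println(sum.compareTo(new BigDecimal("0.3")) == 0); // -> true
    }
}
```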
@losipiuk, there's something else going on here. That number should be representable as an exact double and the comparison should succeed. For instance, this works as expected:
```
presto> select cast('12345.56' as decimal(10, 2)) != 12345.56;
false
```
It might be an issue in the file format reader in the Hive connector.
The conversion is the other way around, i.e. the decimal column is converted to double, not double to decimal.
I know. That was just a way to force a decimal(10,2) that, when compared to a double, should implicitly cast to a double.
The same result happens with the more explicit shape:
presto> select cast(cast('12345.56' as decimal(10, 2)) as double) != 12345.56;
_col0
-------
false
Yeah. You are right @martint.
It seems like this query triggers a bug in DomainTranslator. I guess it is somewhere around determining the TupleDomain for comparisons of types involving coercions using saturated_floor_cast.
The original constraint from the query is translated to ids IS NOT NULL.
Here is the line from explain output:
- ScanFilter[table = hive:hive:default:decimaloperatorcheck, originalConstraint = (CAST("ids" AS double) <> 12345.56), filterPredicate = (NOT ("ids" IS NULL))] => [name:varchar, ids:decimal(10,2)]
I will look into that tomorrow.
```
Slice decimal = doubleToLongDecimal(12345.56, 10, 2);
System.out.println(Decimals.toString(decimal, 2));
```
This snippet, exercising the conversion used in the double-to-decimal saturated floor cast, prints 12345.55. It should print 12345.56.
And the problem is that new BigDecimal(12345.56) doesn't produce a BigDecimal that represents that value exactly. It's 12345.559999999999490682967007160186767578125.
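For illustration, this is reproducible in plain Java, outside of Presto:
```
import java.math.BigDecimal;

public class DoubleToDecimal {
    public static void main(String[] args) {
        // The double literal 12345.56 is really the nearest binary64 value;
        // the BigDecimal(double) constructor exposes its exact expansion.
        System.out.println(new BigDecimal(12345.56));
        // -> 12345.559999999999490682967007160186767578125

        // Going through the decimal string representation keeps the intended value.
        System.out.println(new BigDecimal("12345.56"));   // -> 12345.56
        System.out.println(BigDecimal.valueOf(12345.56)); // -> 12345.56
    }
}
```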
Inside DomainTranslator.rewriteComparisonExpression, it turns the original NOT_EQUAL expression into ids != 12345.55 OR ids = 12345.55 (using the value returned by the saturated floor cast operator). This is probably later eliminated or turned into the equivalent expression ids IS NOT NULL.
```
case NOT_EQUAL: {
    ...
    // Return something that is true for all non-null values
    return or(new ComparisonExpression(EQUAL, symbolExpression, coercedLiteral),
            new ComparisonExpression(NOT_EQUAL, symbolExpression, coercedLiteral));
}
```
Ultimately, the problem is that 12345.56 doesn't have an exact representation in binary (it is displayed that way only because the formatter for double decides where to cut off and round). So any conversion between a binary and a decimal representation will introduce errors. I can think of three options:
While looking at this with @sopel39 we spotted yet another issue with saturated floor cast.
If you look at SFC(DOUBLE -> BIGINT) it may look OK, as integer values have exact representations in the DOUBLE domain.
Yet there is a problem with large values, where multiple BIGINT values are merged into a single DOUBLE value.
Example:
CREATE TABLE bigint_test(a bigint);
INSERT INTO bigint_test VALUES (9223372036854775807),(9223372036854775806);
With the TupleDomain built using SFC DOUBLE -> BIGINT we get:
presto:default> select * from bigint_test where a = 9223372036854775807.0;
a
---------------------
9223372036854775807
(1 row)
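For illustration, the merging of adjacent BIGINT values into a single DOUBLE is reproducible in plain Java:
```
public class BigintMerge {
    public static void main(String[] args) {
        // Doubles near 2^63 are spaced 1024 or more apart, so distinct
        // longs collapse onto the same double value.
        double max = (double) 9223372036854775807L;      // Long.MAX_VALUE
        double maxMinusOne = (double) 9223372036854775806L;
        System.out.println(max == maxMinusOne);          // -> true
        System.out.println(max);                         // -> 9.223372036854776E18
    }
}
```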
Even though:
SELECT CAST(9223372036854775807 as DOUBLE) = 9223372036854775807.0; is true
SELECT CAST(9223372036854775806 as DOUBLE) = 9223372036854775807.0; is true

Here is the PR, which addresses both issues by removing the relevant SFCs: https://github.com/prestodb/presto/pull/8375.
But it comes at a cost: some conditions which were previously pushed down to the connector will no longer be.
Maybe instead of doing that we should just go for solution 3 and document that comparisons on doubles are imprecise.
We would then need to accept some incoherent Presto behaviour.
Hmm. Maybe saturated floor cast and the current logic for pushing down these kinds of predicates are not the right abstractions.
All we need is to find suitable lower and upper bounds on a, such that any value that makes the filter evaluate to true after coercions is in that range. The predicate doesn't need to be absorbed by the connector; it just needs to be able to access that information for internal pruning.
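A rough sketch of what I mean, for a predicate like CAST(a AS DOUBLE) = d on a BIGINT column a (hypothetical names, not actual Presto code):
```
public class ConservativeBounds {
    // Hypothetical sketch: compute a conservative [low, high] range of longs
    // guaranteed to contain every x with (double) x == d. The connector can
    // prune with the range; the engine still re-applies the exact predicate.
    public static long[] boundsForDoubleEquality(double d) {
        // Any long that rounds to d lies within half an ulp of d, so
        // widening by a full ulp on each side is a safe superset.
        double ulp = Math.ulp(d);
        long low = (long) Math.floor(d - ulp);  // (long) casts saturate at Long.MIN/MAX_VALUE
        long high = (long) Math.ceil(d + ulp);
        return new long[] {low, high};
    }

    public static void main(String[] args) {
        long[] range = boundsForDoubleEquality(9223372036854775807.0);
        // Both 9223372036854775806 and 9223372036854775807 fall inside the
        // range, so neither row is incorrectly pruned.
        System.out.println(range[0] + " .. " + range[1]);
    }
}
```
The range may be wider than necessary, but that only costs extra rows read, never correctness, since the engine re-applies the exact predicate.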
I agree. It would be much easier to reason about what is happening if we left the original predicate, coming from the filters in the query, intact, and passed something weaker to the connector, while still allowing the connector to filter out most of the data on its end.
The original predicate would then be used to do the rest of the filtering on the execution engine side.
Part of the problem we have right now is that we transform the original constraint (part of it, actually) into a TupleDomain, forget the original, and then perform the reverse transformation on the unenforcedPredicate.
The proposed approach saves us the second transformation (TupleDomain -> Expression), as we are leaving the original constraint in the Filter anyway; we would just do the first transformation (Expression -> TupleDomain) but leave the original constraint unchanged, which would be simpler.
PS.
After selecting a layout we would still need to transform ConnectorTableLayout.predicate to an Expression and use it, but that is a different story and should not cause any problems. We would just need to add it as an extra conjunct to the filtering condition applied to the data being read from the table.
@martint I captured the proposed improvement in https://github.com/prestodb/presto/issues/8409
@losipiuk with https://github.com/prestodb/presto/pull/8375 merged, can I close this?
I think so. The follow-up in #8409 is still valid, but it is not related to this issue.