I connected Hive with Presto and queried a table containing decimal values. With the != operator, the query returns all the rows of the table.
Below are the create table and load queries for hive:
create table decimalOperatorCheck(name String, ids decimal(10,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Load data:
load data local inpath '/home/hduser/Files/decimaldata.csv' into table decimalOperatorCheck;
Query in Presto CLI:
select * from decimaloperatorcheck where ids !=12345.56;
Output:
presto:default> select * from decimaloperatorcheck where ids !=12345.56;
name | ids
--------+----------
Justin | 11.90
Josh | 233.34
Alex | 123.45
Ryan | 12345.56
(4 rows)
Here is the csv used for loading data: data.txt
It seems like a bug; thanks for reporting.
@geetikagupta16 What version of Presto are you running?
@KBP-TDC I am using presto-server-0.170
Actually this is not strictly decimal related. It is an implication of three facts:
1. Numeric literals like 12345.56 are interpreted in Presto as literals of type DOUBLE.
2. When comparing DECIMAL and DOUBLE types, the comparison happens in the DOUBLE domain (the decimal value is first cast to DOUBLE).
3. Not every decimal value has an exact representation in the DOUBLE domain.

So the query
select * from decimaloperatorcheck where ids !=12345.56;
is effectively executing as:
select * from decimaloperatorcheck where cast(ids as DOUBLE) != 12345.56;
If you explicitly rewrite the query so that the comparison happens in the DECIMAL domain, using the appropriate literal constructor, the query will hopefully pass:
select * from decimaloperatorcheck where ids != DECIMAL '12345.56';
We have plans to straighten that up, but it is a somewhat larger effort, so far pushed out of the priority queue.
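As an aside, the difference between comparing in the DECIMAL domain and in the DOUBLE domain can be reproduced in plain Java (a standalone illustration, not Presto code):
```
import java.math.BigDecimal;

public class ComparisonDomains {
    public static void main(String[] args) {
        // In the DOUBLE domain, 0.1 + 0.2 picks up binary rounding error...
        System.out.println(0.1 + 0.2 == 0.3); // -> false

        // ...while the same comparison in the DECIMAL domain is exact.
        BigDecimal sum = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        System.out.println(sum.compareTo(new BigDecimal("0.3")) == 0); // -> true
    }
}
```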
@losipiuk, there's something else going on here. That number should be representable as an exact double and the comparison should succeed. For instance, this works as expected:
```
presto> select cast('12345.56' as decimal(10, 2)) != 12345.56;
false
```
It might be an issue in the file format reader in the Hive connector.
The conversion is the other way around, i.e. the decimal column is converted to double, not double to decimal.
I know. That was just a way to force a decimal(10,2) that, when compared to a double, should implicitly cast to a double.
The same result happens with the more explicit shape:
presto> select cast(cast('12345.56' as decimal(10, 2)) as double) != 12345.56;
_col0
-------
false
Yeah. You are right @martint.
It seems like this query triggers a bug in DomainTranslator. I guess it is somewhere around determining the TupleDomain for comparisons of types involving coercions using saturated_floor_cast.
The original constraint from the query is translated to ids IS NOT NULL.
Here is the line from explain output:
- ScanFilter[table = hive:hive:default:decimaloperatorcheck, originalConstraint = (CAST("ids" AS double) <> 12345.56), filterPredicate = (NOT ("ids" IS NULL))] => [name:varchar, ids:decimal(10,2)]
I will look into that tomorrow.
```
Slice decimal = doubleToLongDecimal(12345.56, 10, 2);
System.out.println(Decimals.toString(decimal, 2));
```
This snippet, exercising the conversion used in the double-to-decimal saturated floor cast, prints 12345.55. It should print 12345.56.
And the problem is that new BigDecimal(12345.56) doesn't produce a BigDecimal that represents that value exactly. It's 12345.559999999999490682967007160186767578125.
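For illustration, this is reproducible in plain Java, outside of Presto:
```
import java.math.BigDecimal;

public class DoubleToDecimal {
    public static void main(String[] args) {
        // The double literal 12345.56 is really the nearest binary64 value;
        // the BigDecimal(double) constructor exposes its exact expansion.
        System.out.println(new BigDecimal(12345.56));
        // -> 12345.559999999999490682967007160186767578125

        // Going through the decimal string representation keeps the intended value.
        System.out.println(new BigDecimal("12345.56"));   // -> 12345.56
        System.out.println(BigDecimal.valueOf(12345.56)); // -> 12345.56
    }
}
```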
Inside DomainTranslator.rewriteComparisonExpression, it turns the original NOT_EQUAL expression into ids != 12345.55 OR ids = 12345.55 (using the value returned by the saturated floor cast operator). This is probably later eliminated or turned into the equivalent expression ids IS NOT NULL.
```
case NOT_EQUAL: {
    ...
    // Return something that is true for all non-null values
    return or(new ComparisonExpression(EQUAL, symbolExpression, coercedLiteral),
            new ComparisonExpression(NOT_EQUAL, symbolExpression, coercedLiteral));
}
```
Ultimately, the problem is that 12345.56 doesn't have an exact representation in binary (it is displayed that way only because the formatter for double decides where to cut off and round). So any conversion between a binary and a decimal representation will introduce errors. I can think of three options:
While looking at this with @sopel39 we spotted yet another issue with saturated floor cast.
If you look at SFC(DOUBLE -> BIGINT) it may look OK, as integer values have exact representations in the DOUBLE domain.
Yet there is a problem with large values, where multiple BIGINT values are merged into a single DOUBLE value.
Example:
CREATE TABLE bigint_test(a bigint);
INSERT INTO bigint_test VALUES (9223372036854775807),(9223372036854775806);
With the TupleDomain built using SFC DOUBLE -> BIGINT we get:
presto:default> select * from bigint_test where a = 9223372036854775807.0;
a
---------------------
9223372036854775807
(1 row)
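For illustration, the merging of adjacent BIGINT values into a single DOUBLE is reproducible in plain Java:
```
public class BigintMerge {
    public static void main(String[] args) {
        // Doubles near 2^63 are spaced 1024 or more apart, so distinct
        // longs collapse onto the same double value.
        double max = (double) 9223372036854775807L;      // Long.MAX_VALUE
        double maxMinusOne = (double) 9223372036854775806L;
        System.out.println(max == maxMinusOne);          // -> true
        System.out.println(max);                         // -> 9.223372036854776E18
    }
}
```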
Even though:
SELECT CAST(9223372036854775807 as DOUBLE) = 9223372036854775807.0; is true
SELECT CAST(9223372036854775806 as DOUBLE) = 9223372036854775807.0; is true

Here is the PR, which addresses both issues by removing the relevant SFCs: https://github.com/prestodb/presto/pull/8375.
But it comes at a cost: some conditions which were previously pushed down to the connector will no longer be.
Maybe instead of doing that we should just go for solution 3 and document that comparisons on doubles are imprecise.
We would then need to accept some incoherent Presto behaviour.
Hmm. Maybe saturated floor cast and the current logic for pushing down these kinds of predicates are not the right abstractions.
All we need is to find suitable lower and upper bounds on a, such that any value that makes the filter evaluate to true after coercions is in that range. The predicate doesn't need to be absorbed by the connector; it just needs to be able to access that information for internal pruning.
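A rough sketch of what I mean, for a predicate like CAST(a AS DOUBLE) = d on a BIGINT column a (hypothetical names, not actual Presto code):
```
public class ConservativeBounds {
    // Hypothetical sketch: compute a conservative [low, high] range of longs
    // guaranteed to contain every x with (double) x == d. The connector can
    // prune with the range; the engine still re-applies the exact predicate.
    public static long[] boundsForDoubleEquality(double d) {
        // Any long that rounds to d lies within half an ulp of d, so
        // widening by a full ulp on each side is a safe superset.
        double ulp = Math.ulp(d);
        long low = (long) Math.floor(d - ulp);  // (long) casts saturate at Long.MIN/MAX_VALUE
        long high = (long) Math.ceil(d + ulp);
        return new long[] {low, high};
    }

    public static void main(String[] args) {
        long[] range = boundsForDoubleEquality(9223372036854775807.0);
        // Both 9223372036854775806 and 9223372036854775807 fall inside the
        // range, so neither row is incorrectly pruned.
        System.out.println(range[0] + " .. " + range[1]);
    }
}
```
The range may be wider than necessary, but that only costs extra rows read, never correctness, since the engine re-applies the exact predicate.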
I agree. It would be much easier to reason about what is happening if we left the original predicate, coming from the filters in the query, intact, and passed something weaker to the connector, while still allowing the connector to filter out most of the data on its end.
The original predicate would then be used to do the rest of the filtering on the execution engine side.
Part of the problem we have right now is that we transform the original constraint (part of it, actually) into a TupleDomain, forget the original, and then perform the reverse transformation on the unenforcedPredicate.
The proposed approach saves us the second transformation (TupleDomain -> Expression), as we are leaving the original constraint in the Filter anyway; we would just do the first transformation (Expression -> TupleDomain) but leave the original constraint unchanged, which would be simpler.
PS.
After selecting a layout we would still need to transform ConnectorTableLayout.predicate to an Expression and use it, but that is a different story and should not cause any problems. We would just need to add it as an extra conjunct to the filtering condition applied to the data being read from the table.
@martint I captured the proposed improvement in https://github.com/prestodb/presto/issues/8409
@losipiuk with https://github.com/prestodb/presto/pull/8375 merged, can I close this?
I think so. The follow-up in #8409 is still valid, but it is not related to this issue.