When query to hive catalog and parquet_predicate_pushdown_enabled=true( older version),Some blocks may be filtered,and the query will back unexpected result.
Trace the source code,when column is HiveColumnType.String, it may cause by Unequal Semantics between parquet BinaryStatistics and Slice.
I just made a test case.
@Test
public void minMaxBinaryStatistics(){
List<String> testStrings = Arrays.asList("","管易云","e店宝3秘密秘密");
BinaryStatistics binaryStatistics = new BinaryStatistics();
for (String testString : testStrings) {
binaryStatistics.updateStats(Binary.fromString(testString));
}
Slice minSlice = Slices.wrappedBuffer(binaryStatistics.getMin().getBytes());
Slice maxSlice = Slices.wrappedBuffer(binaryStatistics.getMax().getBytes());
Slice slice = Slices.utf8Slice("管易云");
Assert.assertEquals(1,slice.compareTo(minSlice));
Assert.assertEquals(-1,slice.compareTo(maxSlice));
}
It's so magical
I seems a bug of odler version parquet.
https://github.com/apache/parquet-mr/commit/c6764c4a0848abf1d581e22df8b33e28ee9f2ced#diff-a7b135229f233fb724e70eae501f97c3L192
@sczhaoqi I don't see parquet_predicate_pushdown_enabled property in the code base. Where can I find it? Also, I don't understand the comment about a bug in apache/parquet-mr. Would you clarify how it affects Presto?
When pushdown predicates working for string type in hive which stored as parquet, the older version parquet'statistics for string is incorrect.If there has empty string,there will be only ("" and maxString) bettwen the BinaryStatistics' min and max.
Which means that some block will filtered.
Now I just commented code from line 216 to 226 in TupleDomainParquetPredicate.
The test case not passed, when parquet spi is older version
import parquet.column.statistics.BinaryStatistics;
import parquet.io.api.Binary;
And new version is
import org.apache.parquet.column.statistics.BinaryStatistics;
import org.apache.parquet.io.api.Binary;
@sczhaoqi I'm still confused. Do you think there is a bug in latest version of Presto? Would you submit a patch that includes a test that reproduces the bug and a fix?
@mbasmanova This problem occurs because the statistics in the old version of parquet file are incorrect, but the data has been wrong from the source.Even if the lasted version of presto, read those data, will back incorrect results. I think additional logic needed to deal with this situation.
The older version parquet's statistics for string -> binary
expected : (min, max)
when has ""
actually: ("", min)
when not
actually:(max,min)
so even the lasted version
if (isVarcharType(type) && statistics instanceof BinaryStatistics) {
BinaryStatistics binaryStatistics = (BinaryStatistics) statistics;
Slice minSlice = Slices.wrappedBuffer(binaryStatistics.genericGetMin().getBytes());
Slice maxSlice = Slices.wrappedBuffer(binaryStatistics.genericGetMax().getBytes());
/*** If the older version parquet file:
contains empty-string in block:
minSlice is empty-string-slice, maxSlice is min-string-slice
will filter other values
not contains empty-string:
minSlice.compareTo(maxSlice) always true
***/
if (minSlice.compareTo(maxSlice) > 0) {
failWithCorruptionException(failOnCorruptedParquetStatistics, column, id, binaryStatistics);
return Domain.create(ValueSet.all(type), hasNullValue);
}
ParquetStringStatistics parquetStringStatistics = new ParquetStringStatistics(minSlice, maxSlice);
return createDomain(type, hasNullValue, parquetStringStatistics);
}
@sczhaoqi Thanks for explaining. My understanding is that the statistics in the Parquet files could be wrong and that may cause Presto to return incorrect results. How were these wrong stats created? By an earlier version of Presto or some other tool?
We produced it by hive-2.1.1. It may Occurs anywhere if the parquet version below parquet-1025.
In presto. before e313922f234948bfba27f9334905a750580d824f not upgrading parquet version may also can produce the error file
@sczhaoqi Is there a way to tell the Parquet version from the file and ignore the stats if the version is < 1025?
@mbasmanova At the footer of the parquet file,we may find producer version info. Like Iparquet-mr version 1.8.1(build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf). But we cannot find out whether this version newer than 1025 at once
In method,ParquetPageSourceFactory.createParquetPageSource
ParquetMetadata parquetMetadata = ParquetMetadataReader.readFooter(inputStream, path, fileSize);
FileMetaData fileMetaData = parquetMetadata.getFileMetaData();
// this string info contains parquet version/build-tag
// like: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
String createdBy = fileMetaData.getCreatedBy();
@sczhaoqi
But we cannot find out whether this version newer than 1025
Would you explain why? What is the relationship between 1025 and 1.8.1 above? Is there any?
Would it make sense to ignore stats if min = ""?
@mbasmanova
I just find another tool class to deal with it, if users used the major parquet version.
org.apache.parquet.CorruptStatistics
Still, I think an additional properties is necessary to turn on/off the pushdown predicates when reading parquet file
@sczhaoqi
I think an additional properties is necessary to turn on/off the pushdown predicates when reading parquet file
Makes sense. Would you like to introduce one?
My pleasure
Most helpful comment
We produced it by hive-2.1.1. It may Occurs anywhere if the parquet version below parquet-1025.
In presto. before e313922f234948bfba27f9334905a750580d824f not upgrading parquet version may also can produce the error file