Presto: TupleDomainParquetPredicate.matches() Mismatch when filter string type in hive

Created on 18 Jun 2019  ·  18Comments  ·  Source: prestodb/presto

When query to hive catalog and parquet_predicate_pushdown_enabled=true( older version),Some blocks may be filtered,and the query will back unexpected result.
Trace the source code,when column is HiveColumnType.String, it may cause by Unequal Semantics between parquet BinaryStatistics and Slice.

Most helpful comment

We produced it by hive-2.1.1. It may Occurs anywhere if the parquet version below parquet-1025.
In presto. before e313922f234948bfba27f9334905a750580d824f not upgrading parquet version may also can produce the error file

All 18 comments

I just made a test case.

@Test
    public void minMaxBinaryStatistics(){
        List<String> testStrings = Arrays.asList("","管易云","e店宝3秘密秘密");

        BinaryStatistics binaryStatistics = new BinaryStatistics();
        for (String testString : testStrings) {
            binaryStatistics.updateStats(Binary.fromString(testString));
        }

        Slice minSlice = Slices.wrappedBuffer(binaryStatistics.getMin().getBytes());
        Slice maxSlice = Slices.wrappedBuffer(binaryStatistics.getMax().getBytes());
        Slice slice = Slices.utf8Slice("管易云");
        Assert.assertEquals(1,slice.compareTo(minSlice));
        Assert.assertEquals(-1,slice.compareTo(maxSlice));
    }

It's so magical

I seems a bug of odler version parquet.
https://github.com/apache/parquet-mr/commit/c6764c4a0848abf1d581e22df8b33e28ee9f2ced#diff-a7b135229f233fb724e70eae501f97c3L192

@sczhaoqi I don't see parquet_predicate_pushdown_enabled property in the code base. Where can I find it? Also, I don't understand the comment about a bug in apache/parquet-mr. Would you clarify how it affects Presto?

.parquet_predicate_pushdown_enabled has removed from release-0.213, and higher-version allways enable.
When pushdown predicates working for string type in hive which stored as parquet, the older version parquet'statistics for string is incorrect.If there has empty string,there will be only ("" and maxString) bettwen the BinaryStatistics' min and max.
Which means that some block will filtered.
Now I just commented code from line 216 to 226 in TupleDomainParquetPredicate.

The test case not passed, when parquet spi is older version

import parquet.column.statistics.BinaryStatistics;
import parquet.io.api.Binary;

And new version is

import org.apache.parquet.column.statistics.BinaryStatistics;
import org.apache.parquet.io.api.Binary;

@sczhaoqi I'm still confused. Do you think there is a bug in latest version of Presto? Would you submit a patch that includes a test that reproduces the bug and a fix?

@mbasmanova This problem occurs because the statistics in the old version of parquet file are incorrect, but the data has been wrong from the source.Even if the lasted version of presto, read those data, will back incorrect results. I think additional logic needed to deal with this situation.

The older version parquet's statistics for string -> binary

expected : (min, max)  
when has ""
actually: ("", min)
when not
actually:(max,min)

so even the lasted version

        if (isVarcharType(type) && statistics instanceof BinaryStatistics) {
            BinaryStatistics binaryStatistics = (BinaryStatistics) statistics;
            Slice minSlice = Slices.wrappedBuffer(binaryStatistics.genericGetMin().getBytes());
            Slice maxSlice = Slices.wrappedBuffer(binaryStatistics.genericGetMax().getBytes());
            /*** If the older version parquet file:
                   contains empty-string in block:
                        minSlice is empty-string-slice, maxSlice is min-string-slice
                        will filter other values 
                   not contains empty-string:
                        minSlice.compareTo(maxSlice) always true
           ***/
            if (minSlice.compareTo(maxSlice) > 0) {
                failWithCorruptionException(failOnCorruptedParquetStatistics, column, id, binaryStatistics);
                return Domain.create(ValueSet.all(type), hasNullValue);
            }
            ParquetStringStatistics parquetStringStatistics = new ParquetStringStatistics(minSlice, maxSlice);
            return createDomain(type, hasNullValue, parquetStringStatistics);
        }

@sczhaoqi Thanks for explaining. My understanding is that the statistics in the Parquet files could be wrong and that may cause Presto to return incorrect results. How were these wrong stats created? By an earlier version of Presto or some other tool?

We produced it by hive-2.1.1. It may Occurs anywhere if the parquet version below parquet-1025.
In presto. before e313922f234948bfba27f9334905a750580d824f not upgrading parquet version may also can produce the error file

@sczhaoqi Is there a way to tell the Parquet version from the file and ignore the stats if the version is < 1025?

@mbasmanova At the footer of the parquet file,we may find producer version info. Like Iparquet-mr version 1.8.1(build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf). But we cannot find out whether this version newer than 1025 at once
In method,ParquetPageSourceFactory.createParquetPageSource

ParquetMetadata parquetMetadata = ParquetMetadataReader.readFooter(inputStream, path, fileSize);
FileMetaData fileMetaData = parquetMetadata.getFileMetaData();

// this string info contains parquet version/build-tag
// like: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
String createdBy = fileMetaData.getCreatedBy();

@sczhaoqi

But we cannot find out whether this version newer than 1025

Would you explain why? What is the relationship between 1025 and 1.8.1 above? Is there any?

Would it make sense to ignore stats if min = ""?

@mbasmanova
I just find another tool class to deal with it, if users used the major parquet version.
org.apache.parquet.CorruptStatistics
Still, I think an additional properties is necessary to turn on/off the pushdown predicates when reading parquet file

@sczhaoqi

I think an additional properties is necessary to turn on/off the pushdown predicates when reading parquet file

Makes sense. Would you like to introduce one?

My pleasure

12998 Try to cover this condition

Was this page helpful?
0 / 5 - 0 ratings

Related issues

tesseract2048 picture tesseract2048  ·  3Comments

aminalaee picture aminalaee  ·  3Comments

sopel39 picture sopel39  ·  3Comments

synhershko picture synhershko  ·  4Comments

dterror-zz picture dterror-zz  ·  4Comments