Dask: read_parquet with fastparquet does not support DNF filters

Created on 8 Nov 2020  路  5Comments  路  Source: dask/dask

What happened:
Calling read_parquet with filters in DNF format (i.e. List of Lists of Tuples), using the fastparquet backend, does not filter partitions and returns all partitions instead.

What you expected to happen:
I expected filters to be applied as stated in the documentation of read_parquet.

Minimal Complete Verifiable Example:

I expect the following to return parts that have dataset_name == 'option1' OR dataset_name == 'option2'. Instead, all the partitions are returned regardless of dataset_name.

from dask import dataframe as dd
dd.read_parquet(
    '/some/path/to.parquet', infer_divisions=True, 
    filters=[
        [
            ('dataset_name', '==', 'option1'), 
            ('dataset_name', '==', 'option2')
        ]
    ])

Anything else we need to know?:
From a shallow analysis, seems fastparquet.api.filter_out_cats does not handle the "list of lists of tuples" input, and expects only list of tuples. The former does not crash the code but results in a nonsensical comparison in the fastparquet code where the tuple is compared against the column name, so that a filter is never applied.

Environment:

  • Dask version: 2.21.0
  • fastparquet version: 0.4.1
  • Python version: 3.8.4
  • Operating System: Win10
  • Install method (conda, pip, source): pip
documentation good first issue

All 5 comments

This seems like a fastparquet bug to me. You might be better off opening it over there.

Thanks @jsignell for the reply. I don't claim to be an expert but the mentioned fastparquet API does not seem to support the input it is being given by dask.

See this function: https://github.com/dask/fastparquet/blob/2527bca33b0137fca6cc800fc4a264c37f63f02c/fastparquet/api.py#L803

It claims to support only list of tuples, and from reading the code it joins those filters with an AND.

On the other hand, dask documentation for read_parquet claims to support full DNF annotations regardless of backend, and it passes this input as-is to fp. So this might be semantics, but I don't see it as a "bug" on fp side.

Also, I think this bug only occurs for filters on columns that are "partitioned by" on disk.

Ah ok sorry I misunderstood. So it sounds like the solution is better docs on what is actually supported on the different parquet engines. Would you be interested in improving the docs with a note about this? If so you can go ahead and open a pull request.

My vote would be to upgrade the fastparquet logic to match the doc as filtering is kind of useless for multi-partitioned data otherwise.

My vote would be to upgrade the fastparquet logic to match the doc as filtering is kind of useless for multi-partitioned data otherwise.

You are welcome to open an issue or PR over on fastparquet to request this feature, but in the meantime I think the docs should be made to reflect reality :)

Was this page helpful?
0 / 5 - 0 ratings