What happened:
Calling read_parquet with filters in DNF format (i.e. List of Lists of Tuples), using the fastparquet backend, does not filter partitions and returns all partitions instead.
What you expected to happen:
I expected filters to be applied as stated in the documentation of read_parquet.
Minimal Complete Verifiable Example:
I expect the following to return parts that have dataset_name == 'option1' OR dataset_name == 'option2'. Instead, all the partitions are returned regardless of dataset_name.
from dask import dataframe as dd
dd.read_parquet(
'/some/path/to.parquet', infer_divisions=True,
filters=[
[
('dataset_name', '==', 'option1'),
('dataset_name', '==', 'option2')
]
])
Anything else we need to know?:
From a shallow analysis, seems fastparquet.api.filter_out_cats does not handle the "list of lists of tuples" input, and expects only list of tuples. The former does not crash the code but results in a nonsensical comparison in the fastparquet code where the tuple is compared against the column name, so that a filter is never applied.
Environment:
This seems like a fastparquet bug to me. You might be better off opening it over there.
Thanks @jsignell for the reply. I don't claim to be an expert but the mentioned fastparquet API does not seem to support the input it is being given by dask.
See this function: https://github.com/dask/fastparquet/blob/2527bca33b0137fca6cc800fc4a264c37f63f02c/fastparquet/api.py#L803
It claims to support only list of tuples, and from reading the code it joins those filters with an AND.
On the other hand, dask documentation for read_parquet claims to support full DNF annotations regardless of backend, and it passes this input as-is to fp. So this might be semantics, but I don't see it as a "bug" on fp side.
Also, I think this bug only occurs for filters on columns that are "partitioned by" on disk.
Ah ok sorry I misunderstood. So it sounds like the solution is better docs on what is actually supported on the different parquet engines. Would you be interested in improving the docs with a note about this? If so you can go ahead and open a pull request.
My vote would be to upgrade the fastparquet logic to match the doc as filtering is kind of useless for multi-partitioned data otherwise.
My vote would be to upgrade the fastparquet logic to match the doc as filtering is kind of useless for multi-partitioned data otherwise.
You are welcome to open an issue or PR over on fastparquet to request this feature, but in the meantime I think the docs should be made to reflect reality :)