**Describe the bug**
There seems to be a performance regression in `dask_cudf.read_parquet`.
The call below used to be almost instantaneous, but it has recently slowed down significantly.
**Steps/Code to reproduce bug**
```python
%%timeit
ss_cols = ["ss_item_sk", "ss_store_sk", "ss_ticket_number"]
df = dask_cudf.read_parquet('/raid/tpcx-bb/sf10000/parquet_1gb/store_sales',
                            columns=ss_cols,
                            gather_statistics=False,
                            split_row_groups=False)
```
```
1.56 s ± 4.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
**Expected behavior**
I would expect it to be almost instantaneous.
**Environment overview (please complete the following information)**
- Method of cuDF install: [conda]
```
dask 2.23.0+7.gbdf0e2df pypi_0 pypi
dask-cuda 0.15.0a200818 py37_120 rapidsai-nightly
dask-cudf 0.15.0a200818 py37_gf7fbc1160_4749 rapidsai-nightly
cudf 0.15.0a200818 cuda_10.2_py37_gf7fbc1160_4749 rapidsai-nightly
libcudf 0.15.0a200818 cuda10.2_gf7fbc1160_4749 rapidsai-nightly
```
**Additional context**
I am trying to find the last package version where this was working correctly and will update the issue as I triage this.
CC: @rjzamora.
CC: @randerzander / @beckernick / @ayushdg .
Can you run the call through snakeviz or some other profiler and share the results here?
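For reference, a minimal sketch of one way to capture such a profile with `cProfile` and render it in snakeviz. The path and columns are copied from the repro above; the output filename `read_parquet.prof` is just an illustrative choice:

```python
# Sketch: profile the slow read_parquet call and dump the stats to a file
# that snakeviz (`pip install snakeviz`) can render.
# Assumes the dataset path from the repro above.
import cProfile

import dask_cudf

ss_cols = ["ss_item_sk", "ss_store_sk", "ss_ticket_number"]

cProfile.run(
    "dask_cudf.read_parquet("
    "'/raid/tpcx-bb/sf10000/parquet_1gb/store_sales', "
    "columns=ss_cols, gather_statistics=False, split_row_groups=False)",
    filename="read_parquet.prof",  # stats dump to open with snakeviz
)
# Then, from a shell: snakeviz read_parquet.prof
```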
It seems that the upstream Dask code allows pyarrow to validate the schema of every file in the dataset (by default). You can avoid this behavior by passing the `"validate_schema"` pyarrow option through the `dataset` argument:
```python
df = dask_cudf.read_parquet(
    '/raid/tpcx-bb/sf10000/parquet_1gb/store_sales',
    columns=ss_cols,
    gather_statistics=False,
    split_row_groups=False,
    dataset={"validate_schema": False},
)
```
I will submit a PR to Dask to make `validate_schema=False` the default :)
Thanks @rjzamora for the quick triage.
Just confirming that doing this brought it back down to 85 ms.
```python
%%timeit
ss_cols = ["ss_item_sk", "ss_store_sk", "ss_ticket_number"]
df = dask_cudf.read_parquet('/raid/tpcx-bb/sf10000/parquet_1gb/store_sales',
                            columns=ss_cols,
                            gather_statistics=False,
                            split_row_groups=False,
                            dataset={"validate_schema": False})
```
```
85 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Closing this now that the upstream fix is merged :)