cuDF: [BUG] dask_cudf.read_parquet slowdown

Created on 20 Aug 2020 · 4 comments · Source: rapidsai/cudf

Describe the bug

There seems to be a performance regression with dask_cudf.read_parquet.

The call below used to be almost instantaneous, but it has recently slowed down.

Steps/Code to reproduce bug

```python
%%timeit
ss_cols = ["ss_item_sk", "ss_store_sk", "ss_ticket_number"]

df = dask_cudf.read_parquet('/raid/tpcx-bb/sf10000/parquet_1gb/store_sales',
                            columns=ss_cols,
                            gather_statistics=False,
                            split_row_groups=False)
```

```
1.56 s ± 4.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

**Expected behavior**
I would expect it to be almost instantaneous. 

**Environment overview (please complete the following information)**
 - Method of cuDF install: [conda]
```
dask                      2.23.0+7.gbdf0e2df          pypi_0    pypi
dask-cuda                 0.15.0a200818          py37_120    rapidsai-nightly
dask-cudf                 0.15.0a200818   py37_gf7fbc1160_4749    rapidsai-nightly
cudf                      0.15.0a200818   cuda_10.2_py37_gf7fbc1160_4749    rapidsai-nightly
libcudf                   0.15.0a200818   cuda10.2_gf7fbc1160_4749    rapidsai-nightly
```

**Additional context**

I am trying to find the last package where this was working correctly, will update the issue as I triage this.

CC: @rjzamora .
CC: @randerzander / @beckernick / @ayushdg .

bug dask


All 4 comments

Can you run the call through snakeviz or some other profiler and share the results here?
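For reference, a minimal sketch of how such a profile could be captured with Python's built-in `cProfile` and saved in a form snakeviz can display. The `read_call` function below is a hypothetical stand-in; when profiling locally, one would substitute the actual `dask_cudf.read_parquet` call and data path from this issue:

```python
import cProfile
import io
import pstats

def read_call():
    # Hypothetical stand-in for the slow call from this issue, e.g.:
    #   dask_cudf.read_parquet('/raid/tpcx-bb/sf10000/parquet_1gb/store_sales', ...)
    sum(range(100_000))

profiler = cProfile.Profile()
profiler.enable()
read_call()
profiler.disable()

# Write stats to a file that snakeviz can visualize: `snakeviz read_parquet.prof`
profiler.dump_stats("read_parquet.prof")

# Or print the top entries by cumulative time inline
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Sorting by cumulative time makes it easy to spot where the metadata-handling overhead sits in the call tree.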

It seems that the upstream Dask code is allowing pyarrow to validate the schema of every file in the dataset (by default). You can avoid this behavior by specifying the "validate_schema" pyarrow option to the dataset argument:

```python
df = dask_cudf.read_parquet(
    '/raid/tpcx-bb/sf10000/parquet_1gb/store_sales',
    columns=ss_cols,
    gather_statistics=False,
    split_row_groups=False,
    dataset={"validate_schema": False},
)
```

I will submit a PR to Dask to make validate_schema=False the default :)

Thanks @rjzamora for the quick triage.

Just confirming: this change brought the call back down to 85 ms.

```python
%%timeit
ss_cols = ["ss_item_sk", "ss_store_sk", "ss_ticket_number"]

df = dask_cudf.read_parquet('/raid/tpcx-bb/sf10000/parquet_1gb/store_sales',
                            columns=ss_cols,
                            gather_statistics=False,
                            split_row_groups=False,
                            dataset={"validate_schema": False})
```

```
85 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Closing this now that the upstream fix is merged :)
