**Describe the bug**
There seems to be a performance regression in `dask_cudf.read_parquet`.
The call below used to be almost instantaneous, but it has recently slowed down significantly.
**Steps/Code to reproduce bug**
```python
%%timeit
ss_cols = ["ss_item_sk", "ss_store_sk", "ss_ticket_number"]
df = dask_cudf.read_parquet('/raid/tpcx-bb/sf10000/parquet_1gb/store_sales',
                            columns=ss_cols,
                            gather_statistics=False,
                            split_row_groups=False)
```
```
1.56 s ± 4.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
**Expected behavior**
I would expect it to be almost instantaneous.
**Environment overview (please complete the following information)**
- Method of cuDF install: [conda]
```
dask 2.23.0+7.gbdf0e2df pypi_0 pypi
dask-cuda 0.15.0a200818 py37_120 rapidsai-nightly
dask-cudf 0.15.0a200818 py37_gf7fbc1160_4749 rapidsai-nightly
cudf 0.15.0a200818 cuda_10.2_py37_gf7fbc1160_4749 rapidsai-nightly
libcudf 0.15.0a200818 cuda10.2_gf7fbc1160_4749 rapidsai-nightly
```
**Additional context**
I am trying to find the last package version where this was working correctly and will update the issue as I triage this.
CC: @rjzamora.
CC: @randerzander / @beckernick / @ayushdg .
Can you run the call through snakeviz or some other profiler and share the results here?
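For reference, a minimal sketch of one way to capture such a profile with `cProfile` and render it in snakeviz. The path and columns are copied from the repro above; the output filename `read_parquet.prof` is just an illustrative choice:

```python
# Sketch: profile the slow read_parquet call and dump the stats to a file
# that snakeviz (`pip install snakeviz`) can render.
# Assumes the dataset path from the repro above.
import cProfile

import dask_cudf

ss_cols = ["ss_item_sk", "ss_store_sk", "ss_ticket_number"]

cProfile.run(
    "dask_cudf.read_parquet("
    "'/raid/tpcx-bb/sf10000/parquet_1gb/store_sales', "
    "columns=ss_cols, gather_statistics=False, split_row_groups=False)",
    filename="read_parquet.prof",  # stats dump to open with snakeviz
)
# Then, from a shell: snakeviz read_parquet.prof
```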
It seems that the upstream Dask code allows pyarrow to validate the schema of every file in the dataset (by default). You can avoid this behavior by passing the `"validate_schema"` pyarrow option through the `dataset` argument:
```python
df = dask_cudf.read_parquet(
    '/raid/tpcx-bb/sf10000/parquet_1gb/store_sales',
    columns=ss_cols,
    gather_statistics=False,
    split_row_groups=False,
    dataset={"validate_schema": False},
)
```
I will submit a PR to Dask to make `validate_schema=False` the default :)
Thanks @rjzamora for the quick triage.
Just confirming that doing this brought it back down to 85 ms.
```python
%%timeit
ss_cols = ["ss_item_sk", "ss_store_sk", "ss_ticket_number"]
df = dask_cudf.read_parquet('/raid/tpcx-bb/sf10000/parquet_1gb/store_sales',
                            columns=ss_cols,
                            gather_statistics=False,
                            split_row_groups=False,
                            dataset={"validate_schema": False})
```
```
85 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Closing this now that the upstream fix is merged :)