In version 0.15, the engine used to save Parquet files could be passed as an argument, because ParquetLocalDataSet saved files via the pandas function:
https://github.com/quantumblacklabs/kedro/blob/98a6c8fbecc16d3d0f0a9d810b02e123c48e8285/kedro/io/parquet_local.py#L132
Starting with 0.16.0, ParquetDataSet saves files by explicitly calling the pyarrow backend engine:
https://github.com/quantumblacklabs/kedro/blob/d291a21bee56fdd7da4426e817fab43c9ece2302/kedro/extras/datasets/pandas/parquet_dataset.py#L172
As a direct consequence, datasets defined in v0.15 with engine: fastparquet now raise this error:
https://github.com/quantumblacklabs/kedro/blob/ec3a325aab3e4b1a7f62ca2c3fafed1860c1b531/kedro/io/core.py#L181
I don't see this mentioned in the release notes, so I assume it is an unintentional regression.
Note that the implications may be broader, since whether pyarrow is installed at all depends on the extra requirements chosen at install time.
Hi @philippegr, thank you for raising this, it's a good catch!
I _think_ there was a reason why we migrated to using pyarrow directly, but I can't quite remember. Someone else from the team should be able to give the reasoning there.
A workaround for now would be to copy the source code of parquet_dataset and create a fastparquet_dataset from it, which should be a very minor change.
Hey @philippegr, I did a bit of digging and it turns out to_parquet and fastparquet didn't work with remote paths (at least at the time of development). Since we were aiming for the new dataset classes to be filesystem-agnostic, we switched to the pyarrow backend instead.
I'm not sure if dask.ParquetDataSet would prove to be more useful here, or indeed, as Zain suggested, simply creating a custom dataset to work with the fastparquet engine.
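For anyone landing here, a minimal sketch of what such a custom dataset could look like. This is not the Kedro source: the class name `EngineParquetDataSet` and its `engine` parameter are hypothetical, it only handles local paths, and it assumes pandas plus the chosen engine are installed. It relies only on the `engine` argument of `pandas.read_parquet` and `DataFrame.to_parquet`.

```python
from pathlib import Path
from typing import TYPE_CHECKING, Any, Dict, Optional

if TYPE_CHECKING:  # pandas is only needed here for the type hints
    import pandas as pd


class EngineParquetDataSet:
    """Hypothetical local-only Parquet dataset with a selectable pandas engine."""

    def __init__(
        self,
        filepath: str,
        engine: str = "auto",
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._filepath = Path(filepath)
        self._engine = engine  # e.g. "fastparquet" or "pyarrow"
        self._load_args = dict(load_args or {})
        self._save_args = dict(save_args or {})

    def _load(self) -> "pd.DataFrame":
        # Imported lazily so the class can be defined without pandas present.
        import pandas as pd

        return pd.read_parquet(self._filepath, engine=self._engine, **self._load_args)

    def _save(self, data: "pd.DataFrame") -> None:
        # Create the parent directory, then let pandas dispatch to the engine.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        data.to_parquet(self._filepath, engine=self._engine, **self._save_args)

    def _describe(self) -> Dict[str, Any]:
        return {
            "filepath": str(self._filepath),
            "engine": self._engine,
            "load_args": self._load_args,
            "save_args": self._save_args,
        }
```

In a real project the class would presumably subclass `kedro.io.AbstractDataSet`, which expects the same `_load`, `_save` and `_describe` methods, and be referenced from the catalog through its import path as `type`.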
Thank you both for your suggestions.
I ended up going with @mzjp2's suggestion and copy-pasted the code from 0.15 into our project repo. Thanks for the Dask idea @lorenabalan. It would have been quite involved here, as this is a many-node DS pipeline where each intermediary file is saved as parquet, so we would have had to convert back and forth to pandas in each node.
As for why my team is going with fastparquet: first, we aim to save intermediary files as parquet, and second, we have had several problems with pyarrow over the past year (support for Categorical variables about a year ago, and currently the inability to save timedeltas):
https://issues.apache.org/jira/browse/ARROW-6780
Because of that, support for different engines is an important feature to us. Not sure how representative this is though.
Still, I am wondering whether you ran into a pandas bug, @lorenabalan, since fastparquet is the default dask parquet engine and you do not overwrite this when saving ParquetDataSet with Dask:
https://github.com/quantumblacklabs/kedro/blob/3faa0d454f3584f39285843f1ae28bec18cc3fee/kedro/extras/datasets/dask/parquet_dataset.py#L136
This issue can be closed as is.
> Still, I am wondering whether you ran into a pandas bug, @lorenabalan, since fastparquet is the default dask parquet engine and you do not overwrite this when saving ParquetDataSet with Dask
That's a good point. We haven't had issues reported about it but that doesn't necessarily mean it's all perfect. Fortunately, it looks like in the meantime they have changed their default to "auto", so if only pyarrow is installed it'll go with that one.
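To illustrate what an "auto" engine setting generally means, here is a rough sketch of that kind of resolution. It is not dask's or pandas' actual code: the helper name `resolve_parquet_engine` and the preference order are assumptions made for the example.

```python
from importlib.util import find_spec
from typing import Sequence


def resolve_parquet_engine(
    engine: str = "auto",
    preference: Sequence[str] = ("pyarrow", "fastparquet"),  # assumed order
) -> str:
    """Return an explicit engine unchanged, or the first installed candidate."""
    if engine != "auto":
        return engine
    for candidate in preference:
        # find_spec returns None when a top-level package is not installed.
        if find_spec(candidate) is not None:
            return candidate
    raise ImportError(f"None of the Parquet engines {tuple(preference)} is installed")
```

With only pyarrow installed, `resolve_parquet_engine()` would pick pyarrow; passing `engine="fastparquet"` explicitly bypasses the lookup entirely.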
Closing this issue, but we welcome contributions for a fastparquet dataset if that's of interest. 😄