Dask: Persist/checkpoint to disk

Created on 30 Apr 2019 · 6Comments · Source: dask/dask

Do we have some function to persist data into disk when not using a cluster? It would just be a small function that calls compute, writes the data to disk and then loads it back again. I'm currently writing my own wrapper function.

ddf = ddf.checkpoint(to_parquet, filename...)

#do more work with ddf, but computation is faster since ddf is persisted
a = ddf[...]

Source

hoangthienan95

Most helpful comment

This is sidelong related to the concept of persist in Intake, which specifically has exactly one file format to output for each type of data source. In dask, we could say that the canonical storage for dataframes is parquet and for arrays zarr... or something like that, but this is not a simple problem at all. It may be that Intake or some other pipeline-like system would be a good layer over dask to handle intermediate persistence.

martindurant on 30 Apr 2019

👍2

All 6 comments

Since different data types/situations would want to use different file formats, a standard method for doing so would be tricky. As you've done above, writing to parquet and then reading back out is one valid method.

jcrist on 30 Apr 2019

martindurant on 30 Apr 2019

👍2

Wow @martindurant Intake is very cool, thanks for letting me know!! Sorry I haven't had time to go through the docs, but have you guys thought of integrating with Great Expectations and DVC for data validation (not just schema, actual data, but statistics and distribution of features). Again sorry if Intake already has these features

hoangthienan95 on 30 Apr 2019

Also yes, using Intake to do this in Dask would be a great idea! User can choose to output the data in any way, the container type will be stored in the metadata, and whoever loads the data back in need not worry about what format it is in or how it was saved.

hoangthienan95 on 30 Apr 2019

using Intake to do this in Dask would be a great idea

It may be a bit premature ...

have you guys thought of integrating with Great Expectations and DVC for data validation

Honestly I don't know what these are, but please raise on the Intake tracker with some description, and we'd be happy to consider it.

From the point of view of this issue, I don't think we have a solution at the moment beyond specific "store" for arrays and "to_parquet" for dataframes (i.e., the user needs to know which and how); so I suspect we should close this for now.

martindurant on 1 May 2019