Do we have some function to persist data into disk when not using a cluster? It would just be a small function that calls compute, writes the data to disk and then loads it back again. I'm currently writing my own wrapper function.
ddf = ddf.checkpoint(to_parquet, filename...)
#do more work with ddf, but computation is faster since ddf is persisted
a = ddf[...]
Since different data types/situations would want to use different file formats, a standard method for doing so would be tricky. As you've done above, writing to parquet and then reading back out is one valid method.
This is sidelong related to the concept of persist in Intake, which specifically has exactly one file format to output for each type of data source. In dask, we could say that the canonical storage for dataframes is parquet and for arrays zarr... or something like that, but this is not a simple problem at all. It may be that Intake or some other pipeline-like system would be a good layer over dask to handle intermediate persistence.
Wow @martindurant Intake is very cool, thanks for letting me know!! Sorry I haven't had time to go through the docs, but have you guys thought of integrating with Great Expectations and DVC for data validation (not just schema, actual data, but statistics and distribution of features). Again sorry if Intake already has these features
Also yes, using Intake to do this in Dask would be a great idea! User can choose to output the data in any way, the container type will be stored in the metadata, and whoever loads the data back in need not worry about what format it is in or how it was saved.
using Intake to do this in Dask would be a great idea
It may be a bit premature ...
have you guys thought of integrating with Great Expectations and DVC for data validation
Honestly I don't know what these are, but please raise on the Intake tracker with some description, and we'd be happy to consider it.
From the point of view of this issue, I don't think we have a solution at the moment beyond specific "store" for arrays and "to_parquet" for dataframes (i.e., the user needs to know which and how); so I suspect we should close this for now.
Most helpful comment
This is sidelong related to the concept of
persistin Intake, which specifically has exactly one file format to output for each type of data source. In dask, we could say that the canonical storage for dataframes is parquet and for arrays zarr... or something like that, but this is not a simple problem at all. It may be that Intake or some other pipeline-like system would be a good layer over dask to handle intermediate persistence.