Dask: Coercing a Pandas Dataframe from a Dask Dataframe

Created on 26 Apr 2016 · 3Comments · Source: dask/dask

Is it possible to coerce a Pandas dataframe from an existing dask dataframe? e.g.

ddf.to_pandasdataframe(df, etc..)

I'm dumping to a csv and reading it back in later now, and that's slow and silly.

The use case here is for a single node, many core machine, with data that fits in memory, and, a CPU-intensive process that is embarrassingly parallel -- so using ddf.groupby(ddf.index).apply(func) to speed up the work. This turns out to be an order of magnitude faster than multiprocessing, btw. The result of the groupby.apply is a dask dataframe, but I need to do work on it using a variety of pandas functions not currently available in dask.

Source

RedPandaHat

Most helpful comment

The .compute() method produces a Pandas DataFrame from a Dask DataFrame.

mrocklin on 26 Apr 2016

👍3

All 3 comments

The .compute() method produces a Pandas DataFrame from a Dask DataFrame.

mrocklin on 26 Apr 2016

👍3

Hi Matthew (@mrocklin ), I have been following and trying to use dask for last few months and great work on that. I tried to use above suggestion and it works. however it tries to re-calculate entire graph when i loop through every partition.compute. I have persisted the data frame and can see execution happened in UI. but when i loop through each partition to call other function and pass this partition as data frame, it executes same graph again and again the loop.
my other function to apply on each partition is based on pandas. when map partition is passing the df, its passing as dask and i am unable to change indexes etc.

any thoughts, please? thanks for your help.