Hi,
The Dask API provides a method to compute quantiles of Series:
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.quantile
This is already very useful, since computing quantiles on distributed data is known to be a real challenge.
Unfortunately, quantile is not available through the aggregate API:
https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate
Would it be possible to add quantile computation to the Dask aggregate syntax?
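To illustrate why quantiles are harder to distribute than something like a mean, here is a small pandas-only sketch. The two "partitions" are just hypothetical slices of one Series, not real Dask partitions, and averaging the per-partition quantiles is a deliberately naive combination step chosen for illustration.

```python
import pandas as pd

speeds = pd.Series([1, 2, 3, 4, 5, 6, 100])

# Hypothetical partitions: two slices standing in for distributed chunks.
parts = [speeds.iloc[:2], speeds.iloc[2:]]

# A mean decomposes cleanly: per-partition sums and counts can be combined.
total = sum(p.sum() for p in parts)
count = sum(p.count() for p in parts)
mean_from_parts = total / count

# A quantile does not: combining per-partition quantiles (here, naively
# averaging them) generally differs from the quantile of the full data.
q_exact = speeds.quantile(0.25)
q_naive = sum(p.quantile(0.25) for p in parts) / len(parts)

print(mean_from_parts, speeds.mean())
print(q_exact, q_naive)
```

The mean recovered from the partitions matches the global mean, while the naively combined quantile does not match the exact one.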
Can you provide a snippet with the input data and expected output? Is this in a groupby context, i.e. the Dask version of the following?
In [12]: df = pd.DataFrame({"A": ['a', 'a', 'b'], "B": [1, 2, 3], "C": [4, 5, 4]})
In [13]: df.groupby("A").quantile()
Out[13]:
     B    C
A
a  1.5  4.5
b  3.0  4.0
One slight complication is that quantile isn't always an aggregation: given a list of quantiles, it returns several rows per group.
In [14]: df.groupby("A").quantile([0.5, 0.75])
Out[14]:
           B     C
A
a 0.50  1.50  4.50
  0.75  1.75  4.75
b 0.50  3.00  4.00
  0.75  3.00  4.00
It may still be doable though.
Thank you for your answer.
Correct, this is in groupby context.
Here is a snippet of what I would do using pure Pandas:
In [3]: import pandas as pd
   ...: def q25(s):
   ...:     return s.quantile(0.25)
   ...: df_pandas = pd.DataFrame({"car": [1, 1, 2, 4, 4, 4], "speed": [1, 2, 3, 4, 5, 6]})
   ...: df_pandas.groupby("car").speed.agg(["mean", q25])
Out[3]:
     mean   q25
car
1     1.5  1.25
2     3.0  3.00
4     5.0  4.50