Pandas: ENH: GroupBy Grouping Sets / Cube / Rollup

Created on 5 Nov 2019 · 10 comments · Source: pandas-dev/pandas

Problem description

These are standard SQL tools for analyzing data across multiple dimensions, and they are missing from pandas out of the box. Some of them can be approximated with a pivot table plus melt/stack, but as core analysis tools they deserve first-class support, which would also reduce the amount of code users have to write.

GROUP BY GROUPING SETS lets you rewrite a query that combines multiple GROUP BY clauses via UNION into a single query. CUBE is shorthand for the grouping sets covering all combinations of the fields listed in the cube clause.

```sql
SELECT
    column1,
    column2,
    aggregate_function(column3)
FROM
    table_name
GROUP BY
    GROUPING SETS (
        (column1, column2),
        (column1),
        (column2),
        ()
    );

SELECT
    column1,
    column2,
    column3,
    column4,
    aggregate_function(column5)
FROM table_name
GROUP BY column1, column2, CUBE (column3, column4);
```

Current way

```pseudo code
# one groupby per grouping, then union (concat) the results
a = <pandas dataframe>
a1 = a.groupby([column1])[column5].sum()
a2 = a.groupby([column1, column2])[column5].sum()
...
an = a.groupby([column1, ..., columnn])[column5].sum()
result = pd.concat([a1, a2, ..., an])
```

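To make the "current way" concrete, here is a minimal runnable sketch of the groupby-plus-concat pattern; the helper name `grouping_sets_concat` is hypothetical, purely for illustration:

```python
import pandas as pd

def grouping_sets_concat(df, sets, value_cols):
    """Emulate SQL GROUPING SETS: one groupby per grouping set, then concat.

    Hypothetical helper, not a pandas API. Dimensions absent from a
    grouping set come back as NaN, mirroring SQL's NULLs.
    """
    pieces = []
    for keys in sets:
        if keys:
            part = df.groupby(list(keys), as_index=False)[list(value_cols)].sum()
        else:  # the empty grouping set () -> grand total
            part = df[list(value_cols)].sum().to_frame().T
        pieces.append(part)
    return pd.concat(pieces, ignore_index=True, sort=False)
```

Calling it with the four grouping sets from the first SQL query above reproduces that query's result.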
Expected way

```pseudo code
a = <pandas dataframe>

groupby_cube1 = a.groupby([column1, column2]).cube([column3, ..., columnn]).sum(column5)
groupby_cube2 = a.groupby.cube([column1, column2, ..., columnn]).sum(column5)

groupby_sets1 = a.groupby.sets({column1, column2}, {column1, column2, column3}, {}).sum(column5)
groupby_sets2 = a.groupby([column1, column2]).sets({column1, column2, column3}, {}).sum(column5)

groupby_rollup1 = a.groupby.rollup({column1, column2, column3}).sum(column5)
groupby_rollup2 = a.groupby([column1, column2]).rollup({column3}).sum(column5)
```

Labels: API Design, Enhancement, Groupby

All 10 comments

Can you provide a minimal example, with data, and a proposed API so that we can evaluate the proposal?

@TomAugspurger, let me know if anything in the example below is unclear.

Database: Snowflake
The scripts below should also work with PostgreSQL.

Sample data

```sql
create or replace table {db}.{schema}.test (
    date date, region varchar, product varchar, sales int, revenue int
);

insert into {db}.{schema}.test (date, region, product, sales, revenue)
values ('2019-10-12', 'EU', 'a1', 200, 200),
       ('2019-10-12', 'EU', 'a2', 400, 250),
       ('2019-10-12', 'NA', 'a1', 200, 200),
       ('2019-10-12', 'NA', 'a2', 10, 35),
       ('2019-10-13', 'EU', 'a2', 3, 1),
       ('2019-10-13', 'EU', 'b1', 400, 250),
       ('2019-10-13', 'NA', 'b2', 200, 200),
       ('2019-10-13', 'NA', 'b1', 10, 35);
```
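For reference, an equivalent pandas DataFrame (mirroring the insert statement above) could be built like this:

```python
import pandas as pd

# Sample data, mirroring the SQL insert above.
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-10-12"] * 4 + ["2019-10-13"] * 4),
    "region": ["EU", "EU", "NA", "NA", "EU", "EU", "NA", "NA"],
    "product": ["a1", "a2", "a1", "a2", "a2", "b1", "b2", "b1"],
    "sales": [200, 400, 200, 10, 3, 400, 200, 10],
    "revenue": [200, 250, 200, 35, 1, 250, 200, 35],
})
```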

```
DATE        REGION  PRODUCT  SALES  REVENUE
2019-10-12  EU      a1       200    200
2019-10-12  EU      a2       400    250
2019-10-12  NA      a1       200    200
2019-10-12  NA      a2       10     35
2019-10-13  EU      a2       3      1
2019-10-13  EU      b1       400    250
2019-10-13  NA      b2       200    200
2019-10-13  NA      b1       10     35
```

After Transformation

```
DATE        YEAR_WEEK  YEAR_MONTH  PRODUCT  SALES  REVENUE
2019-10-12  201941     201910      a1       200    200
2019-10-12  201941     201910      a2       400    250
2019-10-12  201941     201910      a1       200    200
2019-10-12  201941     201910      a2       10     35
2019-10-13  201941     201910      a2       3      1
2019-10-13  201941     201910      b1       400    250
2019-10-13  201941     201910      b2       200    200
2019-10-13  201941     201910      b1       10     35
```

Group by Cube

This groups by every combination of the dimensions listed in the cube clause.

Because the result contains all combinations, any dimension excluded from a given grouping comes back as NULL, so the output below can look confusing at first; you can load the data and run the SQL to verify it.

Sales and revenue for all combinations of date, week, month, and product:

```sql
select date,
       (year(date) || week(date))::int  as year_week,
       (year(date) || month(date))::int as year_month,
       product,
       sum(sales)   as total_sales,
       sum(revenue) as total_revenue
from {db}.{schema}.test
group by cube(1, 2, 3, 4);
```


```
DATE        YEAR_WEEK  YEAR_MONTH  PRODUCT  TOTAL_SALES  TOTAL_REVENUE
2019-10-12  201941     201910      a1       400          400
2019-10-12  201941     201910      a2       410          285
2019-10-13  201941     201910      a2       3            1
2019-10-13  201941     201910      b1       410          285
2019-10-13  201941     201910      b2       200          200
2019-10-12  201941     201910      NULL     810          685
2019-10-13  201941     201910      NULL     613          486
2019-10-12  201941     NULL        NULL     810          685
2019-10-13  201941     NULL        NULL     613          486
2019-10-12  NULL       NULL        NULL     810          685
2019-10-13  NULL       NULL        NULL     613          486
NULL        NULL       NULL        NULL     1423         1171
NULL        201941     NULL        NULL     1423         1171
2019-10-12  NULL       201910      NULL     810          685
2019-10-13  NULL       201910      NULL     613          486
NULL        NULL       201910      NULL     1423         1171
NULL        201941     201910      NULL     1423         1171
2019-10-12  201941     NULL        a1       400          400
2019-10-12  201941     NULL        a2       410          285
2019-10-13  201941     NULL        b1       410          285
2019-10-13  201941     NULL        b2       200          200
2019-10-13  201941     NULL        a2       3            1
2019-10-12  NULL       NULL        a1       400          400
2019-10-12  NULL       NULL        a2       410          285
2019-10-13  NULL       NULL        b1       410          285
2019-10-13  NULL       NULL        b2       200          200
2019-10-13  NULL       NULL        a2       3            1
NULL        NULL       NULL        a1       400          400
NULL        NULL       NULL        a2       413          286
NULL        NULL       NULL        b1       410          285
NULL        NULL       NULL        b2       200          200
NULL        201941     NULL        a1       400          400
NULL        201941     NULL        a2       413          286
NULL        201941     NULL        b1       410          285
NULL        201941     NULL        b2       200          200
2019-10-12  NULL       201910      a1       400          400
2019-10-12  NULL       201910      a2       410          285
2019-10-13  NULL       201910      a2       3            1
2019-10-13  NULL       201910      b1       410          285
2019-10-13  NULL       201910      b2       200          200
NULL        NULL       201910      a1       400          400
NULL        NULL       201910      a2       413          286
NULL        NULL       201910      b1       410          285
NULL        NULL       201910      b2       200          200
NULL        201941     201910      a1       400          400
NULL        201941     201910      a2       413          286
NULL        201941     201910      b1       410          285
NULL        201941     201910      b2       200          200
```

You can also keep a few fields as plain group-by columns and cube only the rest, to reduce the number of combinations, e.g. `group by 1, 2, cube(3, 4)`.
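In current pandas this cube has to be assembled by hand. Here is a sketch using the hypothetical `grouping_sets_concat` helper and the sample frame `df` from above; it assumes a recent pandas, and the ISO week numbering may differ from Snowflake's `WEEK()`:

```python
from itertools import combinations

# Derived dimensions, approximating the SQL expressions above.
df["year_week"] = (df["date"].dt.year * 100
                   + df["date"].dt.isocalendar().week).astype(int)
df["year_month"] = df["date"].dt.year * 100 + df["date"].dt.month

# CUBE(dims) is GROUPING SETS over every subset of dims (16 sets here).
dims = ["date", "year_week", "year_month", "product"]
all_subsets = [keys for r in range(len(dims) + 1)
               for keys in combinations(dims, r)]
cube = grouping_sets_concat(df, all_subsets, ["sales", "revenue"])
```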

Group by Grouping Sets

This is a lighter version of cube that computes only the explicitly listed combinations of fields.

Sales and revenue per week, broken down by region and, separately, by product:

```sql
select (year(date) || week(date))::int as year_week,
       region,
       product,
       sum(sales)   as total_sales,
       sum(revenue) as total_revenue
from {db}.{schema}.test
group by grouping sets ((1, 2), (1, 3));
```

```
YEAR_WEEK  REGION  PRODUCT  TOTAL_SALES  TOTAL_REVENUE
201941     EU      NULL     1003         701
201941     NA      NULL     420          470
201941     NULL    a1       400          400
201941     NULL    a2       413          286
201941     NULL    b1       410          285
201941     NULL    b2       200          200
```

As with cube, you can mix plain group-by fields with grouping sets, e.g. `group by 1, grouping sets((2, 3), (4, 5))`. A sketch of this query in current pandas follows below.
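With the same hypothetical `grouping_sets_concat` helper and sample frame, this grouping-sets query could be approximated as:

```python
# (year_week, region) and (year_week, product), as in the SQL above.
result = grouping_sets_concat(
    df,
    sets=[("year_week", "region"), ("year_week", "product")],
    value_cols=["sales", "revenue"],
)
```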

Rollup

If you have dimensions that form a hierarchy and need to be rolled up, this tool does it.

```sql
select date,
       week(date) as week,
       region,
       product,
       sum(sales)   as total_sales,
       sum(revenue) as total_revenue
from {db}.{schema}.test
group by date, region, product, rollup(week(date));
```

```
DATE        WEEK  REGION  PRODUCT  TOTAL_SALES  TOTAL_REVENUE
2019-10-12  41    EU      a1       200          200
2019-10-12  41    EU      a2       400          250
2019-10-12  41    NA      a1       200          200
2019-10-12  41    NA      a2       10           35
2019-10-13  41    EU      a2       3            1
2019-10-13  41    EU      b1       400          250
2019-10-13  41    NA      b2       200          200
2019-10-13  41    NA      b1       10           35
2019-10-12  NULL  EU      a1       200          200
2019-10-12  NULL  EU      a2       400          250
2019-10-12  NULL  NA      a1       200          200
2019-10-12  NULL  NA      a2       10           35
2019-10-13  NULL  EU      a2       3            1
2019-10-13  NULL  EU      b1       400          250
2019-10-13  NULL  NA      b2       200          200
2019-10-13  NULL  NA      b1       10           35
```
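A rollup is just the chain of prefixes of the rolled-up dimensions, with any fixed group-by keys kept in every grouping set, so the same concat trick applies. A sketch, again with a hypothetical helper name:

```python
def rollup_concat(df, by, rollup_dims, value_cols):
    """GROUP BY `by`, ROLLUP(rollup_dims): keep `by` in every grouping
    set and roll up the prefixes of `rollup_dims`. Hypothetical helper."""
    sets = [tuple(by) + tuple(rollup_dims[:i])
            for i in range(len(rollup_dims), -1, -1)]
    return grouping_sets_concat(df, sets, value_cols)

# Equivalent of the SQL above: week is rolled up, the other keys are fixed.
df["week"] = df["date"].dt.isocalendar().week
result = rollup_concat(df, by=["date", "region", "product"],
                       rollup_dims=["week"], value_cols=["sales", "revenue"])
```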

Equivalent pandas API expectation:

Let `df` be our DataFrame, and let's write the equivalents of a few of the cases above with the API that needs to be implemented.

groupby.rollup
SQL function: group by rollup
Pandas example: `df.groupby([date, product, week]).rollup([week]).sum(sales)`
`df.groupby().rollup([date, week]).sum(sales)`

grouping sets
SQL function: group by grouping sets
Pandas example: `df.groupby().sets([(date, product), (week,)]).sum(sales)`

Cube
SQL function: group by cube
Pandas example: `df.groupby().cube([year, week, region, product]).sum([sales, revenue])`
`df.groupby(week(date)).cube([region, product]).sum([sales, revenue])`

These are just examples; the new API should also support all other groupby functionality as before, for example using the `agg` clause for multiple aggregates. It should support not only column names but also expressions in the group by, or column-number references if possible.
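For what it's worth, a typed signature for the groupby-style variant might look roughly like this (purely illustrative stubs, not an agreed design or the real pandas class):

```python
from typing import Hashable, Iterable, Sequence

class DataFrameGroupBy:  # illustrative stub, not the real pandas class
    def cube(self, columns: Sequence[Hashable]) -> "DataFrameGroupBy":
        """Group by every combination of `columns` (plus the existing keys)."""
        ...

    def rollup(self, columns: Sequence[Hashable]) -> "DataFrameGroupBy":
        """Group by the hierarchical prefixes of `columns`."""
        ...

    def sets(self, sets: Iterable[Sequence[Hashable]]) -> "DataFrameGroupBy":
        """Group by the explicitly listed grouping sets."""
        ...
```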

Alternative Way
This API shape would also be fine; it's up to you:

```pseudo code
df.group({'by': (), 'sets': [(date, product), (week,)]}).sum([sales, revenue])
df.group({'by': (date, product), 'rollup': (week,)}).sum([sales, revenue])
df.group({'by': (), 'cube': (date, product, week)}).sum([sales, revenue])
```

Apologies if I am missing anything. My intention in both approaches is to use a set-style input for rollup and cube, and a list of sets for grouping sets.

rollup looks like a resample

cube is way too terse and completely non-pythonic; i think you can replace this with a groupby and concat (as you indicated above). this operation is just doing too much.

i am not sure you get a whole lot by adding these kinds of operations except some api bloat. pandas is all about simple, practical, and composable operations.

i would fear an implementation of these would be a source of bugs and be overly complicated

@jreback, I'm not sure why it would be complicated. This has already been implemented in open-source databases like PostgreSQL, so a similar approach could be taken; there's no need to reinvent the wheel. As I said above, these functions are implemented in a way that favors performance over the concat approach.
pandas is known as an analytics tool, so I strongly believe these functions should be available out of the box. Implementing this API would also help other libraries built on the pandas API, for example Dask. Every piece of software has bugs, and they get fixed in later iterations.
In my example above you can see two different API styles. The first follows the existing pandas conventions.
The second is a totally new API, so even if it has bugs it will not affect other APIs, and once it is fully implemented the existing approach could be deprecated.

bump

bump

bump

bump

@rsdpyenugula if you really want to do this, you should edit the top post with
detailed examples (quality, not quantity, matters here), doc-strings, and typed function signatures.

a POC implementation PR would also be nice to have

pandas is all volunteer-driven and folks have limited time with 3000+ issues, so contributions are the only way to move this forward.

Love the idea!
