These are SQL tools for analyzing data along multiple dimensions at once, and they are missing from pandas out of the box. Some of them can be approximated with a pivot table and melt/stack, but for an analytics library these operations should be first-class: they express the intent directly and cut down the number of lines of code.

GROUP BY GROUPING SETS lets you rewrite a query that combines several GROUP BY clauses with UNION statements into a single query. CUBE is shorthand for the grouping sets covering every combination of the fields listed in the CUBE clause.
```sql
SELECT
    column1,
    column2,
    aggregate_function(column3)
FROM
    table_name
GROUP BY
    GROUPING SETS (
        (column1, column2),
        (column1),
        (column2),
        ()
    );

SELECT
    column1,
    column2,
    column3,
    column4,
    aggregate_function(column5)
FROM table_name
GROUP BY column1, column2, CUBE (column3, column4);
```
Current way

```pseudo code
a = <pandas dataframe>
a1 = a.groupby([column1]).sum(column5)
a2 = a.groupby([column1, column2]).sum(column5)
...
an = a.groupby([column1, ..., columnn]).sum(column5)
result = union(a1, a2, ..., an)
```
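As a concrete illustration of this workaround, here is a minimal runnable sketch that emulates two grouping sets with one groupby per set plus a concat (the frame and column names are invented for the example):

```python
import pandas as pd

# Toy frame standing in for `a` above (names are illustrative only).
a = pd.DataFrame({
    "region": ["EU", "EU", "NA", "NA"],
    "product": ["a1", "a2", "a1", "a2"],
    "sales": [200, 400, 200, 10],
})

# One groupby per grouping set, then a "union" via concat.
a1 = a.groupby(["region"], as_index=False)["sales"].sum()
a2 = a.groupby(["region", "product"], as_index=False)["sales"].sum()
result = pd.concat([a1, a2], ignore_index=True, sort=False)

# Columns absent from a grouping set come back as NaN,
# playing the role of SQL's NULLs.
print(result)
```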
Expected way

```pseudo code
a = <pandas dataframe>
groupby_cube1 = a.groupby([column1, column2]).cube([column3, ..., columnn]).sum(column5)
groupby_cube2 = a.groupby.cube([column1, column2, ..., columnn]).sum(column5)
groupby_sets1 = a.groupby.sets({column1, column2}, {column1, column2, column3}, {}).sum(column5)
groupby_sets2 = a.groupby([column1, column2]).sets({column1, column2, column3}, {}).sum(column5)
groupby_rollup1 = a.groupby.rollup({column1, column2, column3}).sum(column5)
groupby_rollup2 = a.groupby([column1, column2]).rollup({column3}).sum(column5)
```
Can you provide a minimal example, with data, and a proposed API so that we can evaluate the proposal?
@TomAugspurger, let me know if the example below is unclear.
Database: Snowflake
The scripts below should also work in PostgreSQL.

Sample data
```sql
CREATE OR REPLACE TABLE {db}.{schema}.test (date DATE, region VARCHAR, product VARCHAR, sales INT, revenue INT);

INSERT INTO {db}.{schema}.test (date, region, product, sales, revenue)
VALUES ('2019-10-12', 'EU', 'a1', 200, 200),
       ('2019-10-12', 'EU', 'a2', 400, 250),
       ('2019-10-12', 'NA', 'a1', 200, 200),
       ('2019-10-12', 'NA', 'a2', 10, 35),
       ('2019-10-13', 'EU', 'a2', 3, 1),
       ('2019-10-13', 'EU', 'b1', 400, 250),
       ('2019-10-13', 'NA', 'b2', 200, 200),
       ('2019-10-13', 'NA', 'b1', 10, 35);
```
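For anyone following along in pandas rather than a database, here is a sketch of the same sample data as a DataFrame, mirroring the INSERT above (`df` is a name chosen here for the later examples):

```python
import pandas as pd

# Same rows as the INSERT statement above.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-10-12", "2019-10-12", "2019-10-12", "2019-10-12",
        "2019-10-13", "2019-10-13", "2019-10-13", "2019-10-13",
    ]),
    "region": ["EU", "EU", "NA", "NA", "EU", "EU", "NA", "NA"],
    "product": ["a1", "a2", "a1", "a2", "a2", "b1", "b2", "b1"],
    "sales": [200, 400, 200, 10, 3, 400, 200, 10],
    "revenue": [200, 250, 200, 35, 1, 250, 200, 35],
})
```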
| DATE | REGION | PRODUCT | SALES | REVENUE |
| --- | --- | --- | --- | --- |
| 2019-10-12 | EU | a1 | 200 | 200 |
| 2019-10-12 | EU | a2 | 400 | 250 |
| 2019-10-12 | NA | a1 | 200 | 200 |
| 2019-10-12 | NA | a2 | 10 | 35 |
| 2019-10-13 | EU | a2 | 3 | 1 |
| 2019-10-13 | EU | b1 | 400 | 250 |
| 2019-10-13 | NA | b2 | 200 | 200 |
| 2019-10-13 | NA | b1 | 10 | 35 |
After transformation:

| DATE | YEAR_WEEK | YEAR_MONTH | PRODUCT | SALES | REVENUE |
| --- | --- | --- | --- | --- | --- |
| 2019-10-12 | 201941 | 201910 | a1 | 200 | 200 |
| 2019-10-12 | 201941 | 201910 | a2 | 400 | 250 |
| 2019-10-12 | 201941 | 201910 | a1 | 200 | 200 |
| 2019-10-12 | 201941 | 201910 | a2 | 10 | 35 |
| 2019-10-13 | 201941 | 201910 | a2 | 3 | 1 |
| 2019-10-13 | 201941 | 201910 | b1 | 400 | 250 |
| 2019-10-13 | 201941 | 201910 | b2 | 200 | 200 |
| 2019-10-13 | 201941 | 201910 | b1 | 10 | 35 |
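A sketch of that transformation in pandas, assuming ISO week numbering and the `df` built above:

```python
# Derive YEAR_WEEK and YEAR_MONTH from the date column.
iso = df["date"].dt.isocalendar()  # year / week / day columns
df["year_week"] = (iso["year"] * 100 + iso["week"]).astype(int)
df["year_month"] = df["date"].dt.year * 100 + df["date"].dt.month
```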
Group by Cube

This groups by every combination of the dimensions listed in the CUBE clause. Because each combination omits some dimensions, the omitted dimensions come back as NULLs, so the output below can look confusing at first glance; you can load the data and run the SQL yourself to follow along.

Sales and revenue for all combinations of product, date, week, and month:
```sql
SELECT date,
       (YEAR(date) || WEEK(date))::INT AS year_week,
       (YEAR(date) || MONTH(date))::INT AS year_month,
       product,
       SUM(sales) AS total_sales,
       SUM(revenue) AS total_revenue
FROM {db}.{schema}.test
GROUP BY CUBE (1, 2, 3, 4);
```
| DATE | YEAR_WEEK | YEAR_MONTH | PRODUCT | TOTAL_SALES | TOTAL_REVENUE |
| --- | --- | --- | --- | --- | --- |
| 2019-10-12 | 201941 | 201910 | a1 | 400 | 400 |
| 2019-10-12 | 201941 | 201910 | a2 | 410 | 285 |
| 2019-10-13 | 201941 | 201910 | a2 | 3 | 1 |
| 2019-10-13 | 201941 | 201910 | b1 | 410 | 285 |
| 2019-10-13 | 201941 | 201910 | b2 | 200 | 200 |
| 2019-10-12 | 201941 | 201910 |  | 810 | 685 |
| 2019-10-13 | 201941 | 201910 |  | 613 | 486 |
| 2019-10-12 | 201941 |  |  | 810 | 685 |
| 2019-10-13 | 201941 |  |  | 613 | 486 |
| 2019-10-12 |  |  |  | 810 | 685 |
| 2019-10-13 |  |  |  | 613 | 486 |
|  |  |  |  | 1423 | 1171 |
|  | 201941 |  |  | 1423 | 1171 |
| 2019-10-12 |  | 201910 |  | 810 | 685 |
| 2019-10-13 |  | 201910 |  | 613 | 486 |
|  |  | 201910 |  | 1423 | 1171 |
|  | 201941 | 201910 |  | 1423 | 1171 |
| 2019-10-12 | 201941 |  | a1 | 400 | 400 |
| 2019-10-12 | 201941 |  | a2 | 410 | 285 |
| 2019-10-13 | 201941 |  | b1 | 410 | 285 |
| 2019-10-13 | 201941 |  | b2 | 200 | 200 |
| 2019-10-13 | 201941 |  | a2 | 3 | 1 |
| 2019-10-12 |  |  | a1 | 400 | 400 |
| 2019-10-12 |  |  | a2 | 410 | 285 |
| 2019-10-13 |  |  | b1 | 410 | 285 |
| 2019-10-13 |  |  | b2 | 200 | 200 |
| 2019-10-13 |  |  | a2 | 3 | 1 |
|  |  |  | a1 | 400 | 400 |
|  |  |  | a2 | 413 | 286 |
|  |  |  | b1 | 410 | 285 |
|  |  |  | b2 | 200 | 200 |
|  | 201941 |  | a1 | 400 | 400 |
|  | 201941 |  | a2 | 413 | 286 |
|  | 201941 |  | b1 | 410 | 285 |
|  | 201941 |  | b2 | 200 | 200 |
| 2019-10-12 |  | 201910 | a1 | 400 | 400 |
| 2019-10-12 |  | 201910 | a2 | 410 | 285 |
| 2019-10-13 |  | 201910 | a2 | 3 | 1 |
| 2019-10-13 |  | 201910 | b1 | 410 | 285 |
| 2019-10-13 |  | 201910 | b2 | 200 | 200 |
|  |  | 201910 | a1 | 400 | 400 |
|  |  | 201910 | a2 | 413 | 286 |
|  |  | 201910 | b1 | 410 | 285 |
|  |  | 201910 | b2 | 200 | 200 |
|  | 201941 | 201910 | a1 | 400 | 400 |
|  | 201941 | 201910 | a2 | 413 | 286 |
|  | 201941 | 201910 | b1 | 410 | 285 |
|  | 201941 | 201910 | b2 | 200 | 200 |

(Blank cells are the NULLs produced by CUBE.)
You can also put only some of the fields inside the cube to reduce the number of combinations, e.g. `GROUP BY 1, 2, CUBE (3, 4)`.
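Today the cube has to be emulated by hand. Here is a minimal sketch of what the proposed method would need to compute, using a hypothetical `groupby_cube` helper and the `df` with `year_week`/`year_month` built above:

```python
from itertools import combinations

import pandas as pd

def groupby_cube(df, dims, values):
    """Emulate GROUP BY CUBE(dims): aggregate over every subset of dims."""
    pieces = []
    for r in range(len(dims), -1, -1):
        for subset in combinations(dims, r):
            if subset:
                g = df.groupby(list(subset), as_index=False)[values].sum()
            else:
                # The empty subset is the grand-total row.
                g = df[values].sum().to_frame().T
            pieces.append(g)
    # Dimensions absent from a subset come back as NaN, like SQL NULLs.
    return pd.concat(pieces, ignore_index=True, sort=False)

cube = groupby_cube(df, ["date", "year_week", "year_month", "product"],
                    ["sales", "revenue"])
```

Note that four dimensions already mean 2^4 = 16 separate groupby passes plus a concat, which is the performance point made above.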
Group by Grouping Sets
This is a lighter version of CUBE that computes only the listed combinations of fields.

Sales and revenue per week by region, and per week by product:
```sql
SELECT (YEAR(date) || WEEK(date))::INT AS year_week,
       region,
       product,
       SUM(sales) AS total_sales,
       SUM(revenue) AS total_revenue
FROM {db}.{schema}.test
GROUP BY GROUPING SETS ((1, 2), (1, 3));
```
| YEAR_WEEK | REGION | PRODUCT | TOTAL_SALES | TOTAL_REVENUE |
| --- | --- | --- | --- | --- |
| 201941 | EU |  | 1003 | 701 |
| 201941 | NA |  | 420 | 470 |
| 201941 |  | a1 | 400 | 400 |
| 201941 |  | a2 | 413 | 286 |
| 201941 |  | b1 | 410 | 285 |
| 201941 |  | b2 | 200 | 200 |
As with CUBE, you can put only some fields inside the grouping sets and keep the rest as plain GROUP BY fields, e.g. `GROUP BY 1, GROUPING SETS ((2, 3), (4, 5))`.
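The same emulation idea works here. A sketch with a hypothetical `groupby_sets` helper (again using the `df` from above), which visits only the listed sets instead of all subsets:

```python
import pandas as pd

def groupby_sets(df, sets, values):
    """Emulate GROUP BY GROUPING SETS: one groupby per listed set."""
    pieces = []
    for cols in sets:
        if cols:
            pieces.append(df.groupby(list(cols), as_index=False)[values].sum())
        else:
            # The empty set () is the grand total.
            pieces.append(df[values].sum().to_frame().T)
    return pd.concat(pieces, ignore_index=True, sort=False)

# Mirrors GROUPING SETS ((year_week, region), (year_week, product)).
out = groupby_sets(df,
                   [("year_week", "region"), ("year_week", "product")],
                   ["sales", "revenue"])
```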
Rollup

If your dimensions form a hierarchy that needs to be rolled up level by level, this tool does it:
```sql
SELECT date,
       WEEK(date) AS week,
       region,
       product,
       SUM(sales) AS total_sales,
       SUM(revenue) AS total_revenue
FROM {db}.{schema}.test
GROUP BY date, region, product, ROLLUP (WEEK(date));
```
| DATE | WEEK | REGION | PRODUCT | TOTAL_SALES | TOTAL_REVENUE |
| --- | --- | --- | --- | --- | --- |
| 2019-10-12 | 41 | EU | a1 | 200 | 200 |
| 2019-10-12 | 41 | EU | a2 | 400 | 250 |
| 2019-10-12 | 41 | NA | a1 | 200 | 200 |
| 2019-10-12 | 41 | NA | a2 | 10 | 35 |
| 2019-10-13 | 41 | EU | a2 | 3 | 1 |
| 2019-10-13 | 41 | EU | b1 | 400 | 250 |
| 2019-10-13 | 41 | NA | b2 | 200 | 200 |
| 2019-10-13 | 41 | NA | b1 | 10 | 35 |
| 2019-10-12 |  | EU | a1 | 200 | 200 |
| 2019-10-12 |  | EU | a2 | 400 | 250 |
| 2019-10-12 |  | NA | a1 | 200 | 200 |
| 2019-10-12 |  | NA | a2 | 10 | 35 |
| 2019-10-13 |  | EU | a2 | 3 | 1 |
| 2019-10-13 |  | EU | b1 | 400 | 250 |
| 2019-10-13 |  | NA | b2 | 200 | 200 |
| 2019-10-13 |  | NA | b1 | 10 | 35 |
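A rollup can likewise be emulated by aggregating over each prefix of the hierarchy, on top of any fixed keys. A sketch with a hypothetical `groupby_rollup` helper mirroring the SQL above:

```python
import pandas as pd

def groupby_rollup(df, by, hierarchy, values):
    """Emulate GROUP BY by, ROLLUP(hierarchy): one pass per prefix."""
    pieces = []
    for i in range(len(hierarchy), -1, -1):
        keys = list(by) + list(hierarchy[:i])
        if keys:
            g = df.groupby(keys, as_index=False)[values].sum()
        else:
            g = df[values].sum().to_frame().T  # grand total
        pieces.append(g)
    return pd.concat(pieces, ignore_index=True, sort=False)

# Fixed keys date/region/product plus a one-level rollup over the week,
# as in the SQL example above.
out = groupby_rollup(df.assign(week=df["date"].dt.isocalendar()["week"]),
                     by=["date", "region", "product"],
                     hierarchy=["week"],
                     values=["sales", "revenue"])
```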
Taking `df` as our DataFrame, here are equivalents for a few of the cases above, written with the API that would need to be implemented (pseudocode):

groupby.rollup

SQL function: GROUP BY ROLLUP

Pandas example:

```pseudo code
df.groupby([date, product, week]).rollup([week]).sum(sales)
df.groupby().rollup([date, week]).sum(sales)
```

grouping sets

SQL function: GROUP BY GROUPING SETS

Pandas example:

```pseudo code
df.groupby().sets([(date, product), (week,)]).sum(sales)
```

Cube

SQL function: GROUP BY CUBE

Pandas example:

```pseudo code
df.groupby().cube([year, week, region, product]).sum([sales, revenue])
df.groupby(week(date)).cube([region, product]).sum([sales, revenue])
```
These are just examples; the new methods should also support everything groupby supports today, for example the `agg` clause for multiple aggregates. They should accept not only field names but also expressions in the group by, and column-number references if possible.
Alternative Way

This API style would also be fine; it's up to you:

```pseudo code
df.group({'by': (), 'sets': [(date, product), (week,)]}).sum([sales, revenue])
df.group({'by': (date, product), 'rollup': (week(date),)}).sum([sales, revenue])
df.group({'by': (), 'cube': (date, product, week)}).sum([sales, revenue])
```

Apologies if I am missing anything, or for the unbalanced brackets. The intention in both approaches is to use set-like inputs for rollup and cube, and a list of sets for grouping sets.
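To make the first style more concrete, here is a purely hypothetical sketch of typed signatures for the proposed methods (none of this exists in pandas; the names and shapes are only illustrative):

```python
from typing import Hashable, Sequence

class DataFrameGroupBy:  # hypothetical additions, not the real pandas class
    def cube(self, dims: Sequence[Hashable]) -> "DataFrameGroupBy":
        """Extend the grouping to every subset of `dims`, combined with
        any keys already passed to groupby()."""
        ...

    def sets(self, sets: Sequence[Sequence[Hashable]]) -> "DataFrameGroupBy":
        """Group once per listed set (GROUPING SETS semantics)."""
        ...

    def rollup(self, hierarchy: Sequence[Hashable]) -> "DataFrameGroupBy":
        """Group over each prefix of `hierarchy` (ROLLUP semantics)."""
        ...
```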
rollup looks like a resample

cube is way too terse and completely non-pythonic; I think you can replace this with a groupby and concat (as you indicated above). This operation is just doing too much.

I am not sure you gain a whole lot by adding these kinds of operations except some API bloat. pandas is all about simple, practical, and composable operations.

I would fear that an implementation of these would be a source of bugs and overly complicated.
@jreback, I'm not sure why it would be complicated. These operations are already implemented in open-source databases such as PostgreSQL, so a similar approach could be taken here; there is no need to reinvent the wheel. As I said above, these functions are implemented in a way that helps performance, unlike the concat approach.

pandas is known as an analytics tool, so I strongly believe these functions should be available out of the box. Implementing this API would also help other libraries built on the pandas API, for example Dask. All software has bugs, and they get fixed in later iterations.

In my example above you can see two different API styles. The first follows the existing pandas style. If you choose the second, it is an entirely new API, so even if it has bugs they will not affect other APIs, and once it is fully implemented and stable the existing API could be deprecated.
@rsdpyenugula if you really want to do this then you should edit the top with:

detailed examples (quality, not quantity, matters here), doc-strings, and typed function signatures

a POC implementation PR would also be nice to have

pandas is all volunteer and folks have limited time with 3000+ issues, so contributions are the only thing that will move this forward.
Love the idea!