Currently combine/select on GroupedDataFrame with 0 groups just executes no operation and returns a result with grouping columns only (or empty). This is a bit inconsistent as the output shape of the operation depends on the number of groups.
We have several options here:
length(::GroupedDataFrame) is 0 (this is a safest option and probably is not a problem in practice as probably almost no real use cases combine empty groups)Point 3. is related to a more general issue of allowing empty groups (we currently disallow them). They naturally happen if e.g. we would like to do aggregation over all levels of a categorical column, where not all levels are present.
Option 2. (throw an error) is a safe choice for 1.0, as later we would have an option to switch to any behaviour we like.
Option 1
Option 2
Option 3
Is this similar to #1837?
Is this similar to #1837?
No it is orthogonal. The question is what should happen when you work on a GroupedDataFrame like groupby(DataFrame(a=[], b=[], c=[]), [:a, :b]).
Option 3 is probably the best in the long term, but Option 2 is good in the short term as it allows doing anything after 1.0.
Actually we have to go for option 3 to fix a bug in combine called on a DataFrame that has 0-rows.
OK, if we go for option 3 (which I agree) then the rules would be:
combine/select/transform on an empty gdf return a data frame with 0 rowsselect/transform on an empty df return a data frame with 0 rowscombine on an empty df return a data frame with as many rows as would be implied by the returned values (this case breaks symmetry in the code, so I want to confirm that we agree that this is the desired behaviour before I start implementing it)2 is actually consistent with 3 AFAICT, it's just that select/transform additionally require the functions to return the same number of values as their inputs. But the functions would still be called on the empty inputs to find out the return type, right?
For 1, it's tricky since functions have to be called on an empty pseudo group to find out return types. Is that what you have in mind?
You have summarized it correctly - I just wanted to be explicit.
So now regarding implementation:
_combineTo be clear where the challenge is if we accept these rules:
combine(x -> DataFrame(a=1), DataFrame())
will produce DataFrame(a=1) in result, but:
combine(x -> DataFrame(a=1), groupby(DataFrame(), []))
will produce DataFrame(a=Int[]) in result.
I am spelling this example out to double check that we all agree this is how things should work.
combine(::Function, ::GroupedDataFrame) makes sense but didn't know combine(::Function, ::DataFrame) would work.
Is a DataFrame considered a GroupedDataFrame with 0 or 1 group?
DataFrame is not considered to be a GroupedDataFrame.
combine means "combine rows". You can combine rows in a GroupedDataFrame (then you combine them per group) or you can combine rows in a DataFrame (then you combine all rows in a data frame).
The key thing in what we discuss is a difference between 0 groups and one or more groups with zero rows per group.
The latter (zero rows per group) is currently disallowed (there were requests for this functionality though to allowing filling-in missing levels in CategoricalVector).
So technically you could say that DataFrame is considered to be a GroupedDataFrame with 1 group and no grouping columns, except that we have a corner case that 0-row data frame has to have 0 groups as we disallow 0 row groups.
So technically you could say that
DataFrameis considered to be aGroupedDataFramewith 1 group and no grouping columns, except that we have a corner case that0-row data frame has to have0groups as we disallow0row groups.
Maybe we'd better consider a 0-row DataFrame as having 1 empty group. GroupedDataFrame doesn't allow that, but that's not necessarily a problem: that way there's no inconsistency, just something that works for DataFrame but not for GroupedDataFrame. And it could be supported later.
In doubt, we could throw an error when returning a non-zero number of rows from combine when the input has no rows. Not sure it's worth it.
Maybe we'd better consider a 0-row
DataFrameas having 1 empty group.GroupedDataFramedoesn't allow that, but that's not necessarily a problem
This is the point of this issue to do exactly what you write 馃槃.
In doubt, we could throw an error when returning a non-zero number of rows from combine when the input has no rows.
It would not reduce the complexity of the code (I anyway have to "hardcode" a case "0-row DataFrame as having 1 empty group" for the purpose of combine) and would be inconsistent I think.
Most helpful comment
Option 3