Dataframes.jl: combine/select/transform on empty `GroupedDataFrame`

Created on 19 Jun 2020  路  14Comments  路  Source: JuliaData/DataFrames.jl

Currently combine/select on GroupedDataFrame with 0 groups just executes no operation and returns a result with grouping columns only (or empty). This is a bit inconsistent as the output shape of the operation depends on the number of groups.

We have several options here:

  1. leave things as they are
  2. throw an error if length(::GroupedDataFrame) is 0 (this is a safest option and probably is not a problem in practice as probably almost no real use cases combine empty groups)
  3. Throw a dummy input (i.e. with correct shape and element types but with zero rows) to grouping functions and just capture the shape and eltype of the output

Point 3. is related to a more general issue of allowing empty groups (we currently disallow them). They naturally happen if e.g. we would like to do aggregation over all levels of a categorical column, where not all levels are present.

Option 2. (throw an error) is a safe choice for 1.0, as later we would have an option to switch to any behaviour we like.

breaking bug decision grouping priority

Most helpful comment

Option 3

All 14 comments

Option 1

Option 2

Option 3

Is this similar to #1837?

Is this similar to #1837?

No it is orthogonal. The question is what should happen when you work on a GroupedDataFrame like groupby(DataFrame(a=[], b=[], c=[]), [:a, :b]).

Option 3 is probably the best in the long term, but Option 2 is good in the short term as it allows doing anything after 1.0.

Actually we have to go for option 3 to fix a bug in combine called on a DataFrame that has 0-rows.

OK, if we go for option 3 (which I agree) then the rules would be:

  1. for combine/select/transform on an empty gdf return a data frame with 0 rows
  2. for select/transform on an empty df return a data frame with 0 rows
  3. for combine on an empty df return a data frame with as many rows as would be implied by the returned values (this case breaks symmetry in the code, so I want to confirm that we agree that this is the desired behaviour before I start implementing it)

2 is actually consistent with 3 AFAICT, it's just that select/transform additionally require the functions to return the same number of values as their inputs. But the functions would still be called on the empty inputs to find out the return type, right?

For 1, it's tricky since functions have to be called on an empty pseudo group to find out return types. Is that what you have in mind?

You have summarized it correctly - I just wanted to be explicit.

So now regarding implementation:

  1. requires an implementation of a new code path in _combine
  2. this is how things work now, so we do not have to change anything
  3. this is actually tricky to make things work as in the code path implemented for 1. we need to make another branch signaling that we want to retain the produced rows and I have to think how to do it in the most clean way.

To be clear where the challenge is if we accept these rules:

combine(x -> DataFrame(a=1), DataFrame())

will produce DataFrame(a=1) in result, but:

combine(x -> DataFrame(a=1), groupby(DataFrame(), []))

will produce DataFrame(a=Int[]) in result.

I am spelling this example out to double check that we all agree this is how things should work.

combine(::Function, ::GroupedDataFrame) makes sense but didn't know combine(::Function, ::DataFrame) would work.
Is a DataFrame considered a GroupedDataFrame with 0 or 1 group?

DataFrame is not considered to be a GroupedDataFrame.

combine means "combine rows". You can combine rows in a GroupedDataFrame (then you combine them per group) or you can combine rows in a DataFrame (then you combine all rows in a data frame).


The key thing in what we discuss is a difference between 0 groups and one or more groups with zero rows per group.
The latter (zero rows per group) is currently disallowed (there were requests for this functionality though to allowing filling-in missing levels in CategoricalVector).

So technically you could say that DataFrame is considered to be a GroupedDataFrame with 1 group and no grouping columns, except that we have a corner case that 0-row data frame has to have 0 groups as we disallow 0 row groups.

So technically you could say that DataFrame is considered to be a GroupedDataFrame with 1 group and no grouping columns, except that we have a corner case that 0-row data frame has to have 0 groups as we disallow 0 row groups.

Maybe we'd better consider a 0-row DataFrame as having 1 empty group. GroupedDataFrame doesn't allow that, but that's not necessarily a problem: that way there's no inconsistency, just something that works for DataFrame but not for GroupedDataFrame. And it could be supported later.

In doubt, we could throw an error when returning a non-zero number of rows from combine when the input has no rows. Not sure it's worth it.

Maybe we'd better consider a 0-row DataFrame as having 1 empty group. GroupedDataFrame doesn't allow that, but that's not necessarily a problem

This is the point of this issue to do exactly what you write 馃槃.

In doubt, we could throw an error when returning a non-zero number of rows from combine when the input has no rows.

It would not reduce the complexity of the code (I anyway have to "hardcode" a case "0-row DataFrame as having 1 empty group" for the purpose of combine) and would be inconsistent I think.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

bkamins picture bkamins  路  8Comments

blackeneth picture blackeneth  路  5Comments

xiaodaigh picture xiaodaigh  路  7Comments

garborg picture garborg  路  8Comments

bkamins picture bkamins  路  8Comments