It would be great to have a function that returns a vector of unique identifier for each group of a GroupedDataframe (as a PooledDataArray or as a CategoricalVector). Similar functions exist in other packages: group_indices in dplyr, .GRP in data.table, group in Stata. In particular, it would make it easier for external packages to benefit from the all recent work done in group_rows.
You can get it like this:
getindex.(gd, 1, (gd.cols,))
or like this
copy.(getindex.(gd, 1, (gd.cols,)))
But I agree that it would be good to define a function that does this in the package (recently we have discussed this with @nalimilan).
What name for such a function would you see and should it return DataFrameRows or rather NamedTuples?
@matthieugomez Can you detail what kind of output you would expect? A single vector giving groups? Then it's present as integer indices in gd.groups. Or do you need labels too?
Yes - now I see that dplyr function group_indices is something we have as groups field.
Maybe it would be good to expose it also (probably wrapping in ReadOnlyArray).
Oh yes you're right, an integer vector would be enough. The only thing is that groups with missing values should return a certain integer (0 or missing or some other value, as long as it is documented).
OK. We already use 0 for missing values when skipmissing=true. I guess we could add a group_indices or groupindices function (unless we can override an already existing and more standard function for that).
Thinking about it, maybe it would be safer to return a missing value for groups where one variable is missing when skipmissing = true.
We could do that, but that would require allocating a new vector. What do other implementations do?
By default, Stata returns missing values in case one variable has a missing value. With the option missing, missing values are considered like distinct values (so, for the case of two variables missing, 1 is one integer, 1, missing is another integer).
It seems like dplyr and datatable always consider missing values as distinct values.
By the way, I actually think it is a bit weird that groups with a missing value are included by default in groupby by etc no? Would not it be less error-prone to remove them by default? I think Stata and Panda remove them, but R dplyr and datatable keep them.
It think it is very good that you raise these issues before DataFrames.jl 1.0, as now we have time to change the API if we want.
From my side I am happy with the current functionality, that is:
missing (actually I fint it better, because you do not omit something by accident)skipmissing=true if at least one value is missing, in this case all "missing" observations have the same code - 0 - as opposed to the case above when they would get different codes i.e. in case 1, missing and missing, 1missing as a distinct value when group-by is by multiple columns, e.g.julia> df = DataFrame(a=[1, missing, 3, missing], b=[missing, 2, 3, missing], c = [1, 2, 3, 4])
4ร3 DataFrame
โ Row โ a โ b โ c โ
โ โ Int64โฐ โ Int64โฐ โ Int64 โ
โโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโค
โ 1 โ 1 โ missing โ 1 โ
โ 2 โ missing โ 2 โ 2 โ
โ 3 โ 3 โ 3 โ 3 โ
โ 4 โ missing โ missing โ 4 โ
julia> gd = collect(groupby(df, [1,2]))
4-element Array{Any,1}:
1ร3 SubDataFrame
โ Row โ a โ b โ c โ
โ โ Int64โฐ โ Int64โฐ โ Int64 โ
โโโโโโโผโโโโโโโโโผโโโโโโโโโโผโโโโโโโโค
โ 1 โ 1 โ missing โ 1 โ
1ร3 SubDataFrame
โ Row โ a โ b โ c โ
โ โ Int64โฐ โ Int64โฐ โ Int64 โ
โโโโโโโผโโโโโโโโโโผโโโโโโโโโผโโโโโโโโค
โ 1 โ missing โ 2 โ 2 โ
1ร3 SubDataFrame
โ Row โ a โ b โ c โ
โ โ Int64โฐ โ Int64โฐ โ Int64 โ
โโโโโโโผโโโโโโโโโผโโโโโโโโโผโโโโโโโโค
โ 1 โ 3 โ 3 โ 3 โ
1ร3 SubDataFrame
โ Row โ a โ b โ c โ
โ โ Int64โฐ โ Int64โฐ โ Int64 โ
โโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโค
โ 1 โ missing โ missing โ 4 โ
All these options are "safe" IMHO. What would you change here?
However, what I would add is (here a feedback from you would be valuable if you also find them useful):
groups variable (as accessing internal undocumented field is not a proper way to go)Yes, in Julia missing values are never skipped by default, so doing that with groupby is consistent (and safer).
a function exposing
groupsvariable (as accessing internal undocumented field is not a proper way to go)
+1
a function allowing to get the values of grouping variables in each group (the thing I initially thought you wanted)
Is there prior art for this? I'm not sure how to call it nor what it should return (a vector of tuples? a data frame?).
Regarding the first function - can we call it groups? If we did not want to wrap it we could have:
groups(gd) = gd.groups
Then in dplyr you can get information about grouping variables, we could have something like:
groupvars(gd) = gd.cols
Regarding the second - initially I thought about returning a vector of DataFrameRows or NamedTuples as proposed above:
getindex.(gd, 1, (gd.cols,))
or
copy.(getindex.(gd, 1, (gd.cols,)))
I could not find a similar function in other packages, but now when I am thinking about it actually combine(gd) could return a DataFrame holding only grouping variables - one row per group. This would be something like combine(gd, x->nothing) but without creating this extra column with nothing values.
yes, exactly, I would like a function that exposes the groups variable.
About the skipmissing stuff, I just checked and all other softwares (except panda I think) do like you do by default, so I guess this is the right default.
The only remaining question is whether the function should return missing values when skipmissing = true. I think it would be better to do so but I'm fine either way.
groups and groupvars sounds fine, sticking to the dplyr terminology is a good idea when we don't have a reason to diverge (though for some obscure reason groups also returns the names of variables). groups should probably replace 0 with missing when skipmissing=true (otherwise 0 isn't used).
Using combine(df) like you suggest is an interesting idea. IIUC that would be equivalent to combine(df -> df[1, gd.cols], gd), right? However it's hard to decide what's the most logical behavior for combine(df). Having it equivalent to DataFrame(combine) (like ungroup in dplyr) would also be appealing: the sequence df |> groupby(:key) |> map(:x => mean) |> combine() is natural ("split-apply-combine"), even if using map is less efficient than combine(df, :x => mean).
The issue is that combine(df -> df[1, gd.cols], gd) would not be right as it duplicates the columns. The ways I have proposed above do what is intended. Using combine or map it is not possible as they always expect to add some columns.
groups should probably replace
0withmissingwhen skipmissing=true (otherwise0isn't used).
I agree that having missing is more consistent logically, but having 0 will be faster.
it's hard to decide what's the most logical behavior for
combine(df)
For me the logical is what I propose (i.e. we computed no columns so only grouping columns are retained as we always retain them - note that e.g. this is the same what you do if you have zero rows of data). But I agree that this is not necessarily most useful feature, so I will not stick to it very hard if there are other opinions.
just to clarify, group in dplyr returns the name of the groups,group_indices is the function returning an integer vector.
Yes, but do we want to keep these names. group_indices is intuitive and we could keep it (probably groupindices), but groups is unclear, so I have proposed groupvars.
OK, groupindices and groupvars sounds good (actually in dply groups and group_vars do almost the same thing, but have a different return type).
Regarding the treatment of missing values, we can still have an argument to use 0, but by default it would be more consistent to use missing. The performance impact should be negligible compared to the cost of grouping anyway.
The issue is that
combine(df -> df[1, gd.cols], gd)would not be right as it duplicates the columns. The ways I have proposed above do what is intended. Usingcombineormapit is not possible as they always expect to add some columns.
Ah, right. We really need to fix this.
For me the logical is what I propose (i.e. we computed no columns so only grouping columns are retained as we always retain them - note that e.g. this is the same what you do if you have zero rows of data). But I agree that this is not necessarily most useful feature, so I will not stick to it very hard if there are other opinions.
I agree about the columns, but the question is what rows to return. Returning one row per group assumes a summary function is the most natural operation, but combine supports any kind of operation, so we could also return as many rows as in the input data frame (or 2 rows per group, or anything).
relevant discussion also at #1693.
groupindicesandgroupvars
OK - I will open a PR adding them. For groupindices I will transform 0 to missing internally and create a new vector.
We really need to fix this.
I will not touch it (and leave you to decide how to implement it when you do other improvements to grouping code)
but the question is what rows to return
good point. So for now we can simply keep it deprecated and decide later.
relevant discussion also at #1693.
Agreed - that is why I propose to leave it for later to decide by @nalimilan. What is simple to decide to add groupindices and groupvars functions and that is why I propose to push a PR adding them (I shall open this PR today).
It needs to be sorted out if all issues raised here are resolved (some of them are, I am not sure if all), so I mark it for 1.0.
Most helpful comment
Yes, but do we want to keep these names.
group_indicesis intuitive and we could keep it (probablygroupindices), butgroupsis unclear, so I have proposedgroupvars.