On Slack it came up that the legacy combine(::Callable, ::GroupedDataFrame) signature is the opposite of the new style combine(::GroupedDataFrame, transforms...) (as well as select and transform). It exists only for "advanced users", since it's slower because the callable operates on the entire SubDataFrame.
One possibility would be to use another function name for this to keep the combine/select/transform API consistent, since it's only combine that has this signature. @bkamins suggested that map may be a good name for something like this since it operates on the things that the GroupedDataFrame iterates, but he also still has some reservations about that. So I'm opening this issue as a place for discussing whether renaming this is a good idea, and if so what a good name for it would be.
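To make the two argument orders concrete, here is a minimal sketch. The data frame and the column names g and v are made up for illustration; only the two combine call shapes come from the discussion above.

```julia
using DataFrames, Statistics

# Hypothetical toy data, not taken from the thread.
df = DataFrame(g = [1, 1, 2, 2], v = [1.0, 2.0, 3.0, 4.0])
gdf = groupby(df, :g)

# Legacy form: the callable comes first and receives a whole SubDataFrame.
legacy = combine(sdf -> mean(sdf.v), gdf)

# New-style form: the grouped data frame comes first, then the transformations.
newstyle = combine(gdf, :v => mean)
```

Both calls compute the same per-group means; they differ in argument order and in what the function receives (a SubDataFrame versus a single column).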
Thank you for opening this.
The major considerations are:
map: We deprecated map for GroupedDataFrame because it was not clear what should be the return type of this operation: it could be a vector, a GroupedDataFrame or even a DataFrame; actually if we decided to "revive" map I would tend to add ungroup kwarg to it similarly to what select etc. have to decide on the return type; however this would have to be made consistent with broadcasting behavior I feel (see https://github.com/JuliaData/DataFrames.jl/issues/2194)
In the long run two extra functionalities of select/transform/combine are considered to be added:
- ::Callable and Pair{Callable => Union{Symbol, AbstractString}}, in which case ::Callable would get a SubDataFrame (as it gets now if ::Callable is the first argument); another idea was to alternatively allow DF() => fun, in which case fun would get a SubDataFrame
- allowing selector => ::Callable and ::Callable to return multiple columns if they are not the first argument
If we added this then essentially the form combine(fun, gdf) would be combine(gdf, fun), and the same with a pair passed as the first argument. In this case maybe the form with ::Callable as the first argument will not be needed at all. But we will be able to remove it only after we allow these other forms. Unfortunately this change is quite complex to pull off, so I am not sure we can ship it for the 1.0 release.
Just to give some more details: currently when you call e.g. select we immediately know the list of target columns, whether it is valid, and what computations need to be performed. If we allow returning tables (and not only single values or vectors), then we can handle the possibility that such tables have conflicting column names only after the functions are executed, and we would have to unwrap these tables.
To give you the simplest case where this is visible consider the following:
julia> df = DataFrame(rand(3,3))
3×3 DataFrame
│ Row │ x1        │ x2       │ x3       │
│     │ Float64   │ Float64  │ Float64  │
├─────┼───────────┼──────────┼──────────┤
│ 1   │ 0.557086  │ 0.358442 │ 0.415375 │
│ 2   │ 0.0455213 │ 0.98734  │ 0.406856 │
│ 3   │ 0.409151  │ 0.859466 │ 0.313986 │
julia> transform!(df, :x1 => ByRow(x -> x^2) => :x2)
3×3 DataFrame
│ Row │ x1        │ x2         │ x3       │
│     │ Float64   │ Float64    │ Float64  │
├─────┼───────────┼────────────┼──────────┤
│ 1   │ 0.557086  │ 0.310345   │ 0.415375 │
│ 2   │ 0.0455213 │ 0.00207219 │ 0.406856 │
│ 3   │ 0.409151  │ 0.167405   │ 0.313986 │
Note that in this case we know after running x -> x^2 function that we will replace column :x2 by some new column, but still we place this new column in the correct place (where :x2 was originally located - and not e.g. at the end of the data frame).
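A deterministic variant of the same point, as a small sketch with made-up values: overwriting an existing target column keeps its position, while a target name that does not exist yet (the hypothetical :x4 here) is appended at the end.

```julia
using DataFrames

# Hypothetical fixed values so the result is reproducible.
df = DataFrame(x1 = [1.0, 2.0, 3.0], x2 = [10.0, 20.0, 30.0], x3 = [7.0, 8.0, 9.0])

# The target :x2 already exists, so it is replaced in place (same position).
transform!(df, :x1 => ByRow(x -> x^2) => :x2)
names_after_overwrite = names(df)

# The target :x4 is new, so it is added as the last column.
transform!(df, :x1 => ByRow(x -> x^2) => :x4)
names_after_append = names(df)
```

Because the target name is given in the pair itself, the column position can be decided before the function is ever called.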
So these are the major considerations. Let us keep this issue open and hear what people think about it.
For 0.21, allowing combine(fun, gdf) was decided to be the safest choice, as it at least was non-breaking.
I'm not convinced using another name would be better. This sounds like a perfectly natural use of multiple dispatch.
Also it's not necessarily a bad method: if you have a small number of groups or don't care about performance, it's very practical as you can do anything you want on the passed SubDataFrame.
To chime in as an end-user with personal biases, I think the combine inconsistency adds unnecessary complexity. For example, calculating the weighted and unweighted means are basic operations that require quite a different syntax:
combine(gdf, :var => mean)
versus:
combine(x -> mean(x.var, weights(x.w)), gdf)
combine(x -> mean(x.var, weights(x.w)), gdf)
can also be written as:
combine(gdf, [:var, :w] => (var, w) -> mean(var, weights(w)))
or e.g.
combine(gdf, AsTable(:) => x -> mean(x.var, weights(x.w)))
The issue here is orthogonal to your comment (largely it is about the type you pass to your function).
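As a runnable sketch of the three spellings of the weighted mean, assuming weights comes from StatsBase (the thread does not say so explicitly) and using made-up data; the output name :wmean is also just an illustrative choice:

```julia
using DataFrames, Statistics, StatsBase

# Hypothetical toy data; the column names var and w follow the snippets above.
df = DataFrame(g = [1, 1, 2, 2], var = [1.0, 3.0, 2.0, 4.0], w = [1.0, 3.0, 1.0, 1.0])
gdf = groupby(df, :g)

# Callable first: the function receives a SubDataFrame.
a = combine(sdf -> mean(sdf.var, weights(sdf.w)), gdf)

# Columns passed as positional arguments.
b = combine(gdf, [:var, :w] => ((var, w) -> mean(var, weights(w))) => :wmean)

# Columns passed as a NamedTuple via AsTable.
c = combine(gdf, AsTable(:) => (t -> mean(t.var, weights(t.w))) => :wmean)
```

The anonymous functions are wrapped in parentheses here because a trailing `=> :wmean` would otherwise be parsed as part of the function body.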
I've found the following useful in piping
function modify(arg, x)
    arg(x)
end
julia> @pipe df |>
           transform(_, "income" => (t -> t .+ 100)) |>
           modify(_) do t
               t[:, :x] .= "some stuff"
               return t
           end |>
           select(_, :x)
Sometimes what you want to write is easiest using normal getindex and setindex syntax, but you still want to use piping. This kind of anonymous function syntax can help with that. So maybe it's worth having a modify-type function that can take in a data frame or grouped data frame.
This is the first thing I hit on starting something new with 0.21. From an outsider's viewpoint it was a bit of a wtf.
My use-case was to return multiple columns from my combine function.
If a differing signature to (gd, args...) is a necessity for reasons of dispatch, could a kwarg be used instead? (combine(gd, users_function_returning_multiple_columns=(gd) -> ...))
In the future we plan to allow functions in args of combine(gd, args...) to return multiple columns from a single call. This is simply not implemented yet, and in general it is not a simple change (you can have multiple args that potentially return columns with the same names, which you do not know before calling them). In combine(arg, gd) we know that we have exactly one function to call, so there is no risk of column name clashes and the implementation is easier (there is nothing to check).
But even if this is added the form combine(arg, gd) will be retained anyway I think to allow for do-block notation.
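For reference, the do-block notation mentioned here looks like the following sketch. The data and the output names total and n are hypothetical; the point is that the single function may return several columns as a NamedTuple.

```julia
using DataFrames

# Hypothetical toy data.
df = DataFrame(g = [1, 1, 2], v = [1, 2, 3])
gd = groupby(df, :g)

# The do-block becomes the first argument of combine, i.e. combine(fun, gd).
res = combine(gd) do sdf
    (total = sum(sdf.v), n = nrow(sdf))
end
```

Each group contributes one row, with one column per field of the returned NamedTuple.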
I renamed the issue to make it easier to keep track of (every time I read it I had to check what the real issue was).
However, essentially what it asks for is more flexibility in combine (the same with select and transform). These requests are scattered in various issues so let me summarize them here. In the form combine(gdf, args) allow in args:
- Not, Between etc. (medium)
- Cols as a better version of All (easy)
- make Not more flexible (allowing unseen columns) (medium)
(if I have forgotten something please comment, especially @pdeffebach as you have spent a lot of time thinking about it)
I have classified them by difficulty level (using two factors: how hard it is to implement and whether we can do it just in DataFrames.jl or need some coordination with external packages)
- make Not more flexible (allowing unseen columns) (medium)
This is a very low priority. It should go into inverted indices directly and tests are already written for the behavior.
- allow more flexible column selectors (#2328) (medium)
I would love to get more input from the community on this one. Currently my motivation is a very specific problem in DataFramesMeta, but overall adding this to the meta language might also be very useful!
:+1: to
allow the function to return multiple columns (hard)
would you mind expanding why it is hard to implement? What would be the problem with a MultiCol function that indicates that the output should be split into columns?
select(df, :x => x -> MultiCol(ByRow((x2 = x^2, x3 = x^3))))
We do not need MultiCol - we already have a syntax reserved for it, just like in combine(fun, df) e.g.:
julia> df = DataFrame(reshape(1:12, 3, 4))
3×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 4     │ 7     │ 10    │
│ 2   │ 2     │ 5     │ 8     │ 11    │
│ 3   │ 3     │ 6     │ 9     │ 12    │
julia> combine(:x2 => x -> (a=x, b=2x), df)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 4     │ 8     │
│ 2   │ 5     │ 10    │
│ 3   │ 6     │ 12    │
so essentially - if you do not use select but combine you already have what you want now.
What is hard is:
select(df, some_cols1, some_cols2 => fun2 => col2, some_cols3 => multicolfun3, some_cols4 => multicolfun4)
and you have to handle cases such as:
- cols2 => multicolfun2 => col2 (this should error, but this can be caught only after the function gets evaluated, and currently we do name parsing statically)
- some_cols3 => multicolfun3 and some_cols4 => multicolfun4 might generate conflicting column names (and again, you are not able to catch this case statically)
And it is hard because select is currently designed to resolve target column names statically, as that is easier than doing it dynamically after evaluating the functions. This is particularly challenging when you need to handle duplicate column names (and they can happen: sometimes it is an error, but sometimes it is allowed, as e.g. in select(df, :x, :), where we allow the duplicate column name :x without throwing an error, because it is a convenient way to move column :x to the front).
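The allowed kind of duplicate mentioned here can be seen in a short sketch (the data frame is made up): selecting a column explicitly and then everything with `:` does not error, it just moves that column to the front.

```julia
using DataFrames

# Hypothetical toy data.
df = DataFrame(x1 = [1, 2], x2 = [3, 4], x3 = [5, 6])

# :x3 is named twice (once explicitly, once through `:`), which is accepted.
reordered = select(df, :x3, :)
reordered_names = names(reordered)
```

All target names here are known from the arguments alone, before any user function runs, which is what makes the static check possible.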
So in summary:
combine(fun, df) already does what you want, and I assume that this should cover 90% of use cases in practice. The only thing combine in this form does not cover is pseudo-broadcasting:
julia> combine(:x2 => x -> (a=x, b=1), df)
ERROR: ArgumentError: mixing single values and vectors in a named tuple is not allowed
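One way around this, sketched here as an assumption rather than something proposed in the thread, is to expand the scalar by hand with fill so that every field of the named tuple is a vector. The sketch uses the callable-first form on a made-up data frame.

```julia
using DataFrames

# Hypothetical toy data with the same x2 values as the example above.
df = DataFrame(x1 = 1:3, x2 = 4:6)

# Every field is a vector of the same length, so nothing needs broadcasting.
res = combine(d -> (a = d.x2, b = fill(1, nrow(d))), df)
```

This is more verbose than pseudo-broadcasting would be, but it keeps the return value unambiguous.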