DataFrames.jl: Make combine(gdf, args...) more flexible

Created on 17 May 2020 · 11 Comments · Source: JuliaData/DataFrames.jl

On Slack it came up that the legacy combine(::Callable, ::GroupedDataFrame) signature is the opposite of the new style combine(::GroupedDataFrame, transforms...) (as well as select and transform), and exists only for "advanced users" since it's slower because the callable operates on the entire SubDataFrame.

One possibility would be to use another function name for this to keep the combine/select/transform API consistent, since it's only combine that has this signature. @bkamins suggested that map may be a good name for something like this, since it operates on the things that the GroupedDataFrame iterates, though he still has some reservations about it. So I'm opening this issue as a place for discussing whether renaming this is a good idea, and if so what a good name for it would be.

decision grouping

kleinschmidt

👍2

All 11 comments

Thank you for opening this.

The major considerations are:

regarding `map`

We deprecated map for GroupedDataFrame because it was not clear what the return type of this operation should be: it could be a vector, a GroupedDataFrame or even a DataFrame. If we decided to "revive" map, I would tend to add an ungroup kwarg to it, as select etc. have, to decide on the return type. However, I feel this would have to be made consistent with broadcasting behavior (see https://github.com/JuliaData/DataFrames.jl/issues/2194).

regarding other functions

In the long run two extra functionalities of select/transform/combine are considered to be added:

allow forms ::Callable and ::Callable => ::Union{Symbol, AbstractString}, in which case ::Callable would get a SubDataFrame (as it gets now if ::Callable is the first argument); another idea was to alternatively allow DF() => fun, in which case fun would get a SubDataFrame
allow forms selector => ::Callable and ::Callable to return multiple columns if they are not the first argument

If we added this then essentially the form combine(fun, gdf) would be combine(gdf, fun), and the same with a pair passed as the first argument. In this case maybe the form with ::Callable as the first argument will not be needed at all. But we will be able to remove it only after we allow for these other forms. Unfortunately this change is quite complex to pull off, so I am not sure we can ship it for the 1.0 release. Just to give some more details: currently when you call e.g. select we immediately know what the list of target columns is, whether it is valid, and what computations need to be performed. If we allow returning tables (and not only single values or vectors), then we can detect that such tables have conflicting column names only after the functions are executed, and we would have to unwrap these tables.

To give you the simplest case where this is visible consider the following:

julia> df = DataFrame(rand(3,3))
3×3 DataFrame
│ Row │ x1        │ x2       │ x3       │
│     │ Float64   │ Float64  │ Float64  │
├─────┼───────────┼──────────┼──────────┤
│ 1   │ 0.557086  │ 0.358442 │ 0.415375 │
│ 2   │ 0.0455213 │ 0.98734  │ 0.406856 │
│ 3   │ 0.409151  │ 0.859466 │ 0.313986 │

julia> transform!(df, :x1 => ByRow(x -> x^2) => :x2)
3×3 DataFrame
│ Row │ x1        │ x2         │ x3       │
│     │ Float64   │ Float64    │ Float64  │
├─────┼───────────┼────────────┼──────────┤
│ 1   │ 0.557086  │ 0.310345   │ 0.415375 │
│ 2   │ 0.0455213 │ 0.00207219 │ 0.406856 │
│ 3   │ 0.409151  │ 0.167405   │ 0.313986 │

Note that in this case we know after running x -> x^2 function that we will replace column :x2 by some new column, but still we place this new column in the correct place (where :x2 was originally located - and not e.g. at the end of the data frame).

So these are the major considerations. Let us keep this issue open and hear what people think about it.

For 0.21, allowing combine(fun, gdf) was decided to be the safest choice, as it at least was non-breaking.

bkamins on 17 May 2020

I'm not convinced using another name would be better. This sounds like a perfectly natural use of multiple dispatch.

Also it's not necessarily a bad method: if you have a small number of groups or don't care about performance, it's very practical as you can do anything you want on the passed SubDataFrame.

nalimilan on 18 May 2020

To chime in as an end-user with personal biases, I think the combine inconsistency adds unnecessary complexity. For example, calculating the weighted and unweighted means are basic operations that require quite a different syntax:

combine(gdf, :var => mean)

versus:

combine(x -> mean(x.var, weights(x.w)), gdf)

pmarg on 19 May 2020

👍1

combine(x -> mean(x.var, weights(x.w)), gdf)

can be also written as:

combine(gdf, [:var, :w] => (var, w) -> (mean(var,weights(w)))

or e.g.

combine(gdf, AsTable(:) => x -> (mean(x.var,weights(x.w)))

The issue here is orthogonal to your comment (largely it is about the type you pass to your function).
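
A minimal sketch of the three spellings discussed above, run on made-up data. The column names g, var and w, the output name wmean, and the use of StatsBase for weights are assumptions for illustration; the anonymous functions are wrapped in parentheses so that the trailing => :wmean binds as intended.

```julia
using DataFrames, Statistics, StatsBase

# Toy data (hypothetical): two groups with values and weights.
df = DataFrame(g = [1, 1, 2, 2],
               var = [1.0, 3.0, 2.0, 6.0],
               w = [1.0, 3.0, 1.0, 1.0])
gdf = groupby(df, :g)

# Function-first form: the callable receives the whole SubDataFrame.
r1 = combine(sdf -> (wmean = mean(sdf.var, weights(sdf.w)),), gdf)

# Pair form: the selected columns are passed as positional arguments.
r2 = combine(gdf, [:var, :w] => ((var, w) -> mean(var, weights(w))) => :wmean)

# AsTable form: the selected columns are passed as one NamedTuple of vectors.
r3 = combine(gdf, AsTable([:var, :w]) => (t -> mean(t.var, weights(t.w))) => :wmean)
```

All three calls should produce one row per group with the same weighted means; they differ only in what the function receives.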

bkamins on 19 May 2020

I've found the following useful in piping

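using DataFrames, Pipe  # @pipe comes from Pipe.jl

# Apply an arbitrary function to x; handy as one step of a pipeline.
function modify(arg, x)
    arg(x)
end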

julia> @pipe df |>
       transform(_, "income" => (t -> t .+ 100)) |>
       modify(_) do t
           t[:, :x] .= "some stuff"
           return t
       end |>
       select(_, :x)

Sometimes what you want to write is easiest using normal getindex and setindex syntax, but you still want to use piping. This kind of anonymous function syntax can help with that. So maybe it's worth having a modify-type function that can take in a data frame or grouped data frame.

pdeffebach on 25 May 2020

This is the first thing I hit on starting something new with 0.21. From an outsider's viewpoint it was a bit of a wtf.
My use-case was to return multiple columns from my combine function.
If a differing signature to (gd, args...) is a necessity for reasons of dispatch, could a kwarg be used instead? (combine(gd, users_function_returning_multiple_columns=(gd) -> ...))

akdor1154 on 16 Jun 2020

In the future we plan to allow in combine(gd, args...) for functions in args to return multiple columns from a single call. Simply this is not implemented yet and in general this is not a simple change (as you can have multiple args that potentially return columns that have the same names, which you do not know before calling them). In combine(arg, gd) we know that we have exactly one function to call so there is no risk of column name clashes so the implementation is easier (as there is nothing to be checked).

But even if this is added the form combine(arg, gd) will be retained anyway I think to allow for do-block notation.
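
As a rough illustration of the do-block notation mentioned above, the following sketch passes the function as the first argument via do and returns several columns at once from a named tuple. The data and the names g, v, lo and hi are made up for the example.

```julia
using DataFrames

# Toy data (hypothetical).
df = DataFrame(g = ["a", "a", "b"], v = [1, 5, 3])
gdf = groupby(df, :g)

# The do-block becomes the first argument of combine, i.e. combine(fun, gdf).
# It receives each group as a SubDataFrame and may return a named tuple,
# whose fields become the output columns.
res = combine(gdf) do sdf
    (lo = minimum(sdf.v), hi = maximum(sdf.v))
end
```

Since exactly one function is called here, its output column names cannot clash with those of another transformation, which is the point made above.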

bkamins on 16 Jun 2020

👍1

I renamed the issue to keep track of it easier (every time I read it I had to check what was the real issue).

However, essentially what it asks for is more flexibility in combine (the same with select and transform). These requests are scattered in various issues so let me summarize them here. In the form combine(gdf, args) allow in args:

to pass a whole data frame to the function (easy)
allow the function to return multiple columns (hard)
allow more flexible column selectors (https://github.com/JuliaData/DataFrames.jl/issues/2328) (medium)
allow broadcasting of Not, Between etc. (medium)
add Cols as a better version of All (easy)
make Not more flexible (allowing unseen columns) (medium)

(if I have forgotten something please comment, especially @pdeffebach as you have spent a lot of time thinking about it)

I have classified them on difficulty level (using two factors: how hard it is to implement and if we can do it just in DataFrames.jl or need some coordination with external packages)

bkamins on 2 Aug 2020

make Not more flexible (allowing unseen columns) (medium)

This is a very low priority. It should go into inverted indices directly and tests are already written for the behavior.

allow more flexible column selectors (#2328) (medium)

I would love to get more input from the community on this one. Currently my motivation is a very specific problem in DataFramesMeta, but overall adding this to the meta language might also be very useful!

pdeffebach on 2 Aug 2020

👍 to

allow the function to return multiple columns (hard)

would you mind expanding why it is hard to implement? What would be the problem with a MultiCol function that indicates that the output should be split into columns?

select(df, :x => x -> MultiCol(ByRow((x2 = x^2, x3 = x^3))))

greimel on 6 Aug 2020

We do not need MultiCol - we already have a syntax reserved for it, just like in combine(fun, df) e.g.:

julia> df = DataFrame(reshape(1:12, 3, 4))
3×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 4     │ 7     │ 10    │
│ 2   │ 2     │ 5     │ 8     │ 11    │
│ 3   │ 3     │ 6     │ 9     │ 12    │

julia> combine(:x2 => x -> (a=x, b=2x), df)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 4     │ 8     │
│ 2   │ 5     │ 10    │
│ 3   │ 6     │ 12    │

so essentially - if you do not use select but combine you already have what you want now.

What is hard is:

select(df, some_cols1, some_cols2 => fun2 => col2, some_cols3 => multicolfun3, some_cols4 => multicolfun4)

and you have to handle two cases:

if your function returns multiple columns, you are not allowed to write cols2 => multicolfun2 => col2 (this should error, but it can be caught only after the function gets evaluated, and currently we do name parsing statically)
some_cols3 => multicolfun3 and some_cols4 => multicolfun4 might generate conflicting column names (and again, you are not able to catch this case statically)

And it is hard because the design of select is that it currently resolves column names in the target statically, as that is easier than doing it dynamically after evaluating the functions. This is particularly challenging when you need to handle duplicate column names. They can happen: sometimes it is an error, but sometimes it is allowed, as e.g. in select(df, :x, :), where the column name :x is duplicated but we do not throw an error, because this is what we want (it is a convenient way to move column :x to the front).
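
The two sides of duplicate-name handling described above can be sketched as follows, on a made-up data frame. The column names a, x, b and the target name z are hypothetical; the second call is wrapped in try/catch only so that the snippet runs to the end.

```julia
using DataFrames

# Toy data (hypothetical).
df = DataFrame(a = 1:3, x = 4:6, b = 7:9)

# Allowed duplicate: :x is selected explicitly and again through `:`,
# which simply moves column x to the front.
moved = select(df, :x, :)

# Disallowed duplicate: two transformations write to the same target name.
# Both target names are known before any function runs, so this can be
# rejected up front.
clash_detected = try
    select(df, :a => identity => :z, :b => identity => :z)
    false
catch
    true
end
```

With functions that return multiple columns, the second kind of clash could only be found after the functions have been evaluated, which is the difficulty discussed here.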

So in summary:

this is doable, but this is hard to get it right in all cases
for the simple case combine(fun, df) already does what you want, and I assume that this should cover 90% of use cases in practice. The only thing combine in this form does not cover is pseudo-broadcasting:

julia> combine(:x2 => x -> (a=x, b=1), df)
ERROR: ArgumentError: mixing single values and vectors in a named tuple is not allowed

bkamins on 6 Aug 2020

❤1
