Dataframes.jl: API for groupwise column transformation

Created on 20 Feb 2019 · 26Comments · Source: JuliaData/DataFrames.jl

This Follows what @nalimilan identified in a discussion on Discourse.

A common operation when transforming a column is groupwise standardization of data.

Assume you have a data frame with grouping column :g and data column :x you want to mutate :x by subtracting groupwise mean of :x and divide by groupwise standard deviation of :x.

There are several options for this API. One of the ideas is the following change to by and combine functions:

add a groupwise::Bool keyword argument which defaults to false and then we have a current behavior;
if groupwise is true the following changes to the behavior happen:
- the output data frame contains all columns from the original data frame (here we should decide if the order of columns should be unchanged OR we put grouping columns first as is currently done)
- if sort=false then the existing rows are returned in the order present in the original data frame; if sort=true then they are sorted by groups
- all functions f are assumed to return a single column with the length equal to the group length; this can be simply checked because what is proposed here behaves as-if user manually added the following entries to transformation list col = :col => identity for all non-grouping columns :col in the parent data frame

Then the standardization operation described above would be written e.g. as:

by(df, :g, s_x = :x => x -> (x .- mean(x) ./ std(x)))
````

as opposed to what we have to do now:

julia> using Statistics

julia> df = DataFrame(g = rand(Bool, 10), a=1:10, b=1:10, x=1:10)
10×4 DataFrame
│ Row │ g │ a │ b │ x │
│ │ Bool │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ false │ 1 │ 1 │ 1 │
│ 2 │ true │ 2 │ 2 │ 2 │
│ 3 │ false │ 3 │ 3 │ 3 │
│ 4 │ false │ 4 │ 4 │ 4 │
│ 5 │ false │ 5 │ 5 │ 5 │
│ 6 │ true │ 6 │ 6 │ 6 │
│ 7 │ true │ 7 │ 7 │ 7 │
│ 8 │ false │ 8 │ 8 │ 8 │
│ 9 │ false │ 9 │ 9 │ 9 │
│ 10 │ true │ 10 │ 10 │ 10 │

julia> by(df, :g, a=:a=>identity, b=:b=>identity, x=:x=>identity, s_x = :x => x -> (x .- mean(x)) ./ std(x))
10×5 DataFrame
│ Row │ g │ a │ b │ x │ s_x │
│ │ Bool │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼───────┼───────────┤
│ 1 │ false │ 1 │ 1 │ 1 │ -1.31876 │
│ 2 │ false │ 3 │ 3 │ 3 │ -0.65938 │
│ 3 │ false │ 4 │ 4 │ 4 │ -0.32969 │
│ 4 │ false │ 5 │ 5 │ 5 │ 0.0 │
│ 5 │ false │ 8 │ 8 │ 8 │ 0.989071 │
│ 6 │ false │ 9 │ 9 │ 9 │ 1.31876 │
│ 7 │ true │ 2 │ 2 │ 2 │ -1.2863 │
│ 8 │ true │ 6 │ 6 │ 6 │ -0.075665 │
│ 9 │ true │ 7 │ 7 │ 7 │ 0.226995 │
│ 10 │ true │ 10 │ 10 │ 10 │ 1.13497 │

(note that the order of the result is currently destroyed in comparison to the parent data frame and in the proposed implementation it would be retained if we set `sort` to `false`)

This proposal is intended to be minimally disruptive. Please comment as you might have other ideas what would be a good way to allow for this kind of operation. Currently users need to use `@transform` from DataFramesMeta.jl:

julia> @transform(groupby(df, :g), s_x = (:x .- mean(:x)) ./ std(:x))
10×5 DataFrame
│ Row │ g │ a │ b │ x │ s_x │
│ │ Bool │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼───────┼───────────┤
│ 1 │ false │ 1 │ 1 │ 1 │ -1.31876 │
│ 2 │ false │ 3 │ 3 │ 3 │ -0.65938 │
│ 3 │ false │ 4 │ 4 │ 4 │ -0.32969 │
│ 4 │ false │ 5 │ 5 │ 5 │ 0.0 │
│ 5 │ false │ 8 │ 8 │ 8 │ 0.989071 │
│ 6 │ false │ 9 │ 9 │ 9 │ 1.31876 │
│ 7 │ true │ 2 │ 2 │ 2 │ -1.2863 │
│ 8 │ true │ 6 │ 6 │ 6 │ -0.075665 │
│ 9 │ true │ 7 │ 7 │ 7 │ 0.226995 │
│ 10 │ true │ 10 │ 10 │ 10 │ 1.13497 │
```

its drawback is that it always destroys the order of the parent (and I think it is nice to be able to guarantee that we can preserve the order of the parent - but also please comment if you would find it useful).

breaking

Source

bkamins

All 26 comments

It would be great to be able to do this kind of things. I have three comments:

It may be better to have two different functions, rather than using by with a different option, similar to other languages summarize/mutate (dplyr) or egen/collapse (Stata).
I agree it should not reorder the dataframe.
I hope it can also work with functions that return only one statistics, e.g. mean(:x), not just :x .- mean(:x).

matthieugomez on 6 Apr 2019

I think it would make sense to support mutate/transform similar to dplyr and DataFramesMeta. I agree there should be a way to preserve the order of rows in the input (maybe even by default, like dplyr does).

But the question is that there are two orthogonal dimensions for this kind of function:

whether to keep all columns (mutate/transform) or just keep new ones (transmute/map)
whether to keep all rows (mutate/transform) or only one per group (summarize/map)

So regarding your point 1, we don't really need to difference between transmute and summarize since our map automatically adjusts its behavior depending on whether a single value is returned per group. Regarding your point 3, what we wouldn't support in this scenario is something equivalent to transmute on groups, as our map would return a single row per group if the user-provided function returns a single value per group (like summarize). We could add transmute, but I'm not sure that use case is really needed since in general you want to keep other columns if you care about rows.

nalimilan on 7 Apr 2019

If we introduce select that allows transformations we will be able to do column adding "by group" (while keeping old columns):

julia> df = DataFrame(g = rand(Bool, 10), a=1:10, b=1:10, x=1:10);

julia> standardize(v) = (v .- mean(v)) ./ std(v);

julia> by(df, :g) do sdf
           select(sdf, :, :x => standardize => :x_s)
       end

it will not be super fast, but I think it will be reasonably convenient.

One just has to remember that the order of rows in the result will be changed.

bkamins on 15 Dec 2019

That doesn't sound ideal: aside from being a bit long to type, it kills type stability and requires making copies.

Maybe we could allow something like by(df, :g, :, :x => standardize => :x_s) to keep existing columns? We would just have to check that all functions return a vector of the same length as the input.

nalimilan on 16 Dec 2019

So this is what I feel today we will be converging to :).

I would even say that this might ideally be:

select(df, :, :x => Col(standardize) => :x_s, by=:g)

i.e. we would allow group-by operations in select.

Then, if we settled for select to work row-wise by default, we could leave by as is with its additional semantics for passing for SubDataFrame and whole column by default approach.

Then we would have a really SQL-like select. (if we added where keyword argument to it also we would have a full combo 😄)

bkamins on 16 Dec 2019

I like the idea of Col being around the arguments more, i.e.

select(df, :, Col(:x) => standardize => :x_s)

But love the inclusion of by = :g. This grouped transform has been implemented DataFramesMeta but has been missing in DataFrames.

pdeffebach on 18 Dec 2019

After discussing this again with @bkamins, there are four dimensions to take into account:

element-wise or whole-column
keep other columns or not
combine or keep grouping
as many rows in output as in input, one row per group, or any number of rows (this dimension is tied to 1.)

The principle I think we should respect is that each of these options should be enabled (either via separate functions or view keyword arguments) in the same way for DataFrame and GroupedDataFrame. Otherwise the design would be inconsistent. Of course point 3. only makes sense for GroupedDataFrame. Point 4 is tied to 1: for element-wise operations, we only allow returning as many rows as the input; whole column-operations, can return as many rows as they want (not API needed, this is determined by the return value).

Below are two possible solutions which offer all of these combinations in a consistent way.

A. Use select for transformations and combine to combine:

Element-wise : default for select -- whole column : default for combine, Col(:a) for select
Keep all columns : via : to select/combine, or via transform.
Combine : combine -- do not combine : select/transform

| | Combine | Do not combine
--|-------------|--------------------|
Element-wise | Not supported* | select(x, :a => f => :b) |
Whole column | combine(x, :a => f => :b) | select(x, Col(:a) => f => :b) |
Keep all columns | combine(x, :, ...) | select(x, :, ...) or transform(x, ...) |

The advantage of this approach is that select operates element-wise by default, and combine on whole columns, which are the most common needs. The drawback is that there's no equivalent of transform that combines, which is a common pattern (e.g. add column with group means). Of course it can be achieved by passing : to combine, but the regularity of the API is broken. Another drawback is that there are two ways to enable whole-column operation (Col(:a) in select, or using combine), so the design is a bit complex.

B. Use select for both DF and GDF, replacing combine with a keyword argument. This requires making select operate on whole columns by default, as this is the only way combining can have any effect.

Element-wise : enabled via Broadcasted(f) to select -- whole-column : default
Keep all columns : via : to select or via transform
Combine : via combine=true (the default) -- do not combine : via combine=false

| | Combine | Do not combine
--|-------------|--------------------|
Element-wise | select(x, :a => Broadcasted(f) => :b; combine=true)* | select(x, :a => Broadcasted(f) => :b) |
Whole column | select(x, :a => f => :b) | select(x, :a => f => :b; combine=false) |
Keep all columns | select(x, :, ...) or transform(x, ...) | select(x, :, ...; combine=false) or transform(x, ...; combine=false) |

The advantage of this approach is that it's very systematic: all operations work on whole columns by default, and there's a general way to switch to element-wise. Its drawback is that it's less convenient to apply element-wise operations with select.

* Combining with element-wise operation is equivalent to ungrouping. Supporting this case isn't very useful, except that it allows using the same code on a DataFrame or GroupedDataFrame and being sure you get a DataFrame.

nalimilan on 18 Jan 2020

Thank you for an excellent write-up. Some comments:

we have map for GDF. which operates "whole-column/not-combine/drop columns". Given these tables it is pretty clear what it should mean for DF - essentially the same, so agreeing on anything we decide here solves https://github.com/JuliaData/DataFrames.jl/issues/2053
users might want to mix whole-row and whole operations so having Col or Broadcasted (or some shorter alternative) is probably needed anyway to opt-out from the default
there is a typo, in bottom left cell, you probably meant transform(x, ...)
we need to remember about supporting in place operations which were intended to be done using select! and transpose!.
combine is mostly not used by users I think; the majority uses by, so having a clean syntax for DataFrame case is more important than a clean syntax for GroupedDataFrame.

Also comments for "design 1":

The drawback is that there's no equivalent of transform that combines, which is a common pattern (e.g. add column with group means).

I think we have to solve this issue within by anyway, so having it conveniently done for GroupedDataFrame does not solve all the issues, and having something like by(df, :col, :, ...) is a pattern that anyway needs to be added. Actually there is a decision then how do we handle the source "row ordering" issue in by.

Another drawback is that there are two ways to enable whole-column operation (Col(:a) in select, or using combine), so the design is a bit complex.

But if we went this way combine would be for GroupedDataFrame only. There is no need to define it for DataFrame. Also as noted above Col is needed anyway for mixing whole-column and element-wise operations in a single call.

In summary - personally I am in favor of option 1. It would mean the following work to do:

implement map for AbstractDataFrame to work on whole columns
implement select and transform for GroupedDataFrame that work per group and produce a GroupedDataFrame
add support for : and other column selectors in combine and by

But, as this is opinionated other feedback is welcome (CC @pdeffebach).

bkamins on 18 Jan 2020

Good catch for the typo in the bottom-left cell. I agree with most of your points... except on the conclusion. :-)

implement map for AbstractDataFrame to work on whole columns

Well that would go against the definition of data frames as collections of rows, so I don't think that would be a good idea. Why would you want to do that?

implement select and transform for GroupedDataFrame that work per group and produce a GroupedDataFrame

I would find it really weird to have select and transform take whole columns when applied to a GroupedDataFrame, but operate element-wise when applied to an AbstractDataFrame. That's one of the reasons I'm more and more inclined to think that select and transform should operate on whole columns by default... That would make everything simpler: select, transform, combine and by would all behave the same on all types by default.

add support for : and other column selectors in combine and by

This is uncontroversial I think.

nalimilan on 19 Jan 2020

Well that would go against the definition of data frames as collections of rows, so I don't think that would be a good idea. Why would you want to do that?

If we say that functions on DF and GDF should behave the same then this is a natural consequence. I agree it goes against the rule. So maybe simply we should not define map for DF for now, this is not urgent.

EDIT actually if we have select/transform for GDF, we can drop map for GDF as it would be an easy deprecation (then a defalut map definition for iterators would be used after deprecation period which kind of makes sense).

I would find it really weird to have select and transform take whole columns when applied to a GroupedDataFrame, but operate element-wise when applied to an AbstractDataFrame.

I meant them to operate consistently in both scenarios (i.e. element-wise). However, I would not strongly oppose for then working on whole columns, as, as you have noted, you can always broadcast in whole-column approach to get an element-wise code (with multiple columns it is a bit more tricky, but it is also essentially a broadcast).

So - in short - I am OK to go this way. We have to decide two things:

would @piever oppose strongly 😄? (we would diverge here with JuliaDB.jl and it would be a "hard" difference that would be likely to stay)
what name for a wrapper type for functions do we want to give so that it is short. Do you have some good name? Here are some possibilities: Row, Vec, Each.

bkamins on 19 Jan 2020

would @piever oppose strongly :smile:? (we would diverge here with JuliaDB.jl and it would be a "hard" difference that would be likely to stay)

I don't have strong opinions here, and definitely do not oppose strongly :)

Overall, it feels like "element-based" is more natural if the data is ungrouped, and "column-based" is more natural if the data is grouped. JuliaDB works like this, i.e. groupby (analogous to DataFrames.by) is column-based, everything else element-based, which seems to me very similar to design 1.

However, element-based transform is much more relevant for JuliaDB for two reasons:

if the columns are distributed, most "column-based" functions would probably fail anyways,
the object is "immutable", so it's hard to do this operations without transform.

To exemplify point 2, in DataFrames one could do (for example) df.logx = log.(df.x) and be done with it, bypassing "element-based" select. The same thing is not possible in JuliaDB, as one cannot add columns to a same table (and we don't do getproperty overloading to select a table easily).

In the end, the case where this API is most essential (that is to say "less easily replaced by just writing the obvious thing") for DataFrames IMO is the grouped case, so I definitely see the advantage of creating these functions with the grouped case in mind. It is a nice consequence that by becomes simply a "groupselect".

The only concern that I have is that transform would only work with vector-valued functions that preserve the length of the vector. Is the plan to auto-broadcast scalars? I.e., transform(df, :xm => mean => :x) would create a column :xm with all entries equal to mean(df.x)?

piever on 19 Jan 2020

If we say that functions on DF and GDF should behave the same then this is a natural consequence. I agree it goes against the rule. So maybe simply we should not define map for DF for now, this is not urgent.

EDIT actually if we have select/transform for GDF, we can drop map for GDF as it would be an easy deprecation (then a defalut map definition for iterators would be used after deprecation period which kind of makes sense).

@bkamins The main reason to define map on DF and GDF is to implement the collections API. But for DataFrame it's not very compelling since it's not really a collection of rows, it's more like a matrix, as getindex, iteration and broadcast show. So yes, maybe we'd better not define map for DataFrame (which will avoid the divergence with broadcast), and only use the fallback definition for GroupedDataFrame.

@piever Yes, I agree JuliaDB is less suited to whole-column operations than DataFrames. But I'm not sure I agree with the idea that row-based is more natural for ungrouped data: in some cases it's quite natural to apply x -> replace(x, 1 => 0, 2 => 0), lag or normalize. It really depends on operations.

The only concern that I have is that transform would only work with vector-valued functions that preserve the length of the vector. Is the plan to auto-broadcast scalars? I.e., transform(df, :xm => mean => :x) would create a column :xm with all entries equal to mean(df.x)?

I think broadcasting scalars would indeed make sense. That would be especially useful when applying transform to a GroupedDataFrame, in order to fill a column with group means.

nalimilan on 19 Jan 2020

👍1

I think broadcasting scalars would indeed make sense.

It would make sense, but we currently disallow it in combine, but we probably should start to allow it as we should allow things like combine(gdf, :, :c => mean) which in consequence means that we should also start allowing things like combine(gdf, :, :c => mean, :d => reverse), but bare combine(gdf, :c => mean) will produce one row per group.

The rule here should be the same we have in DataFrames.jl now:

AbstractVectors are left as-is except AbstractRange which gets materialized to a Vector
other AbstarctArrays are throwing an error
all else is treated as a scalar and expanded without copying (this is relevant for mutable values returned).

@nalimilan - so could I ask you to write down the summary what you think is best given this discussion (and hopefully we all will be in agreement)? I can then fill-in the gaps in the rules if there are any.

bkamins on 19 Jan 2020

@bkamins #1727
Then we would have a really SQL-like select

I'm not sure how completely you're aiming to be able to cover the SQL select spec, but beyond GROUPBY SELECT, SQL also specifies window functions which can operate on multiple partitions (groups) in the same selection statement.

_from postgresql docs_
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
"A query can contain multiple window functions that slice up the data in different ways by means of different OVER clauses"

Depending on the objective here this would really require a reconsideration for the select syntax to be able to apply to multiple, different partitions (groups) in one pass.

dgkf on 19 Jan 2020

@dgkf - I think that the reason why @nalimilan wants to keep the whole-column approach as the default is to make this kind of operation easy if I understand what you raise correctly (at least for a single partition).

Actually the key challenge we face is the following - if we allowed grouping in select from a DataFrame then we have decide how to order the rows produced (note that combine(group(df, :col), identity) will reorder the rows of df). However, adding more semantics from SQL select is probably a longer term plan. First we have to settle more basic things that @nalimilan raised in this issue.

bkamins on 19 Jan 2020

I see. The reason I raised this particular example is that SQL's window functions can apply to multiple groupings at once, so the OVER (PARTITION BY ...) equivalent might also have to exist in the column select spec. Something like this (I've found it tough to keep up with all the select syntactic proposals, so hopefully this is close):

select(df, :, 
    Col(:a, by = :subgrouping1) => standardize => :a_s,
    Col(:x, :y, by = :subgrouping2) => + => :z, 
    by = :grouping)

The point being that the groupings might have to be able to be defined for each "select => transform => as" input individually (similar to SQL's OVER (PARITION BY ...)), while also being able to apply group-wise (similar to SQL's GROUP BY).

A _very_ general point that I'm trying to touch on is that the SQL spec is quite involved and if the target is to have good coverage of that space, then I really think that a lot more of the spec needs to be considered when crafting a Julian API. That objective is ambitious and exciting, especially because it could very easily be used as a front end to database queries instead of the common practice of writing SQL in a string and passing it to a connection object, unifying syntax between operating on local DataFrames and with a database connection.

Is the goal to eventually have this sort of coverage or is it to strike a balance between syntactic simplicity and implementing some SQL basics?

dgkf on 19 Jan 2020

Thank you for this comment. My thinking was that it would be currently doable by:

select(df, :, 
    Col(:a, :subgrouping1) => fun1 => :a_s,
    Col(:x, :y, :subgrouping2) => fun2 => :z, 
    by = :grouping)

i.e. essentially you could pass the sub-grouping column to the function and rely on it to perform proper transformations (however, this would not be convenient and your proposal looks much nicer as you do not have to define custom functions).

Is the goal to eventually have this sort of coverage or is it to strike a balance between syntactic simplicity and implementing some SQL basics?

Here I can tell you what my goal is (as people might differ in opinions): try to keep syntactic simplicity but have an internal design that allows us to gradually add more features in the future (possibly having a full coverage of some SQL flavor is a nice long term target).

Even I think we are not yet trying to add by to select for 1.0 release. The reason is that current groupby ecosystem is not design to retain row ordering of the source, so even adding this feature and ensuring efficiency would be a significant PR.

However, thank you for raising this point as currently the design @nalimilan proposes relies on the assumption that by default select does not combine (so if you pass GroupedDataFrame to it you get a GroupedDataFrame back from it) then the question would be how by kwarg in it should be treated.

Having written all this:

for 1.0 release I think that we will go really simple with select allowing only transformations and renaming
post 1.0 we can think of some major changes (e.g. a major design question I always had was why do we need GroupedDataFrame when we just could keep information if a PK columns are defined for a data frame inside its metadata - something that JuliaDB.jl does currently)

bkamins on 20 Jan 2020

@nalimilan - after some time when I look at your original tables actually I think that we can drop combine and just have select and transform as you have proposed. The crucial missing piece is the name for the "broadcasting" option wrapper for a transformation. I will try to propose something.

bkamins on 31 Jan 2020

👍1

Given what I learned when implementing https://github.com/JuliaData/DataFrames.jl/pull/2080#issuecomment-581076570 maybe we can go for option 2 (i.e. select and transform are allowed do combine or not), but still retain combine and by as a legacy. The reason is that combine has a much more reach legacy API that I am not sure that would make sense for select (like passing a function as a first argument).

bkamins on 1 Feb 2020

👍1

See https://github.com/tidyverse/dplyr/issues/3279 about row ordering.

nalimilan on 11 Feb 2020

What we decide here is potentially breaking so adding 1.0 milestone to it.

bkamins on 12 Feb 2020

I am closing this issue as currently we settled on https://github.com/JuliaData/DataFrames.jl/issues/2172 (the rest has already landed).

The only thing not covered in current design is:

Assume you have a data frame with grouping column :g and data column :x you want to mutate :x by subtracting groupwise mean of :x and divide by groupwise standard deviation of :x.

for each row in the source data frame. This pattern does not fit in the current design well and will have to be done in two steps:

create a helper aggregated GroupedDataFrame
perform a select on an original data frame and query the helper (which is now fast as we have quick indexing of GroupedDataFrames)

However, if someone feels that we need a special functionality for this please open a new issue with a proposal of the API.

bkamins on 6 Apr 2020

I think this kind of pattern will be possible with groupby(df, :g) |> transform!(:x => (x -> x ./ mean(x)) => :x). Maybe let's reopen this until select and transform are implemented on GroupedDataFrame?

nalimilan on 8 Apr 2020

|> does not support currying.
I have opened https://github.com/JuliaData/DataFrames.jl/issues/2172 a few days ago to keep discussion about this functionality separated (here it got lost in general discussions) - can we move this topic there?

In general this will be problematic because the moment you do groupby(df, :g) you "logically" loose track of the original position of row in df (technically you can access it via parent).

The problem is that assume you run:

gdf = groupby(df, :g)

and now you can write

transform!(:x => (x -> x ./ mean(x)) => :x)

transform!(:x => (x -> shuffle!(x ./ mean(x))) => :x)

transform! is not able to distinguish these two function calls (there is no technical way to distinguish them). This means that transform! will not be able to recover a correct ordering of rows from the original data frame.

What you have to do is (there are more options for this, this is the simplest I think):

df.id = axes(df, 1)
df2 = sort!(by([:x, :id] => (x, id) -> (x_std = x ./ mean(x), id = id), df, :g), :id)
sort!(df2, :id)

which I think is not that bad.

Essentially the problem is that our current API allows "one row" or "multiple rows" returned by the function in apply step (and I actually like this flexibility). But this means that in the case of "multiple rows" you do not have control if apply function has not reordered the rows.

bkamins on 8 Apr 2020

Well it's up to the caller to ensure the function returns rows in the same order as the input. We do that all the time: even transform(df, :x => f) returns something meaningless if f(x) = shuffle(x).

nalimilan on 8 Apr 2020

I am moving the discussion to #2172.

bkamins on 8 Apr 2020

Was this page helpful?

0 / 5 - 0 ratings