Suppose I want to compute the correlation between two variable within each group, how do I do that e.g. I want to compute the correlation between x and y within each group of grp?
See MWE
using DataFrames
df = DataFrame(grp = rand(["a","b"], 100), x= rand(100), y = rand(100))
by(df, :grp, (:x, :y)=>cor)
I though the above would be ok but only the below works. The below also doesn't work.
by(df, :grp, (:x, :y)=>((x,y))->cor(x,y))
by(df, :grp, (:x, :y)=>(x,y)->cor(x,y))
by(df, :grp, (:x, :y)=>(xy)->cor(xy[1],xy[2))
y(df,:grp,(:x,:y)=>xy->cor(xy...))
the above works but auto-splatting is a nicer interface IMO
by(df, :grp, (:x, :y)=>cor)
I think the auto-splat version is more intuitive because
f = (x,y)->x+y
f(1,2)
It's not call using f((1,2)), so (:x,:y) = > cor invokes the feeling that cor is a two argument function. So auto-splatting makes more sense.
@nalimilan was designing this API so probably he can comment best.
The problem with auto-splatting is that it loses some information (you are not able to pass column names then to the called function).
However, I agree that auto-splatting is something I typically expect and the pattern I use very often is x -> f(x...) as you have noted. Actually I almost never need to know column names.
Also if we went for this change this would be breaking.
Yes, the problem is that with the current system you can both apply splatting manually, or access columns by name (or do anything with the named tuple). As @bkamins noted, if we splat arguments automatically, there's no way to retrieve the column names. Also, in some cases you don't want splatting, e.g. to call sum on several columns.
Something we could do is use All(:x, :y) to enable automatic splatting. Or conversely deprecate the current behavior and require people to use All(:x, :y) to get a named tuple. Not sure which one is best, but note that our behavior was inspired by JuliaDB, and consistency is somewhat nice.
A good point. We even could add Splat (although probably reusing All is good enough).
This is more popular than Juliadb. And also which will lead to less typing? And generally more convenient to use? It will be painful, if we switch syntax now as it may break old code, but long term gain will be worth it.
Therefore for backward compatibility reasons and clarity I would support Splat(:x, :y) for splatting, but using All is also OK (it is not immediately clear if All would mean splatting or not, so it is better to keep backward compatibility).
One consideration to have in mind is that this should be composable with Not, Between and regexes.
Good point - so maybe the rule should be:
Tuple but any column selector (actually Tuple is not allowed in indexing, but we could start allowing it which would not be bad in general)Splat that splats anything that is passed into it - this would be groupby specific (and it would work as All so, All could be omitted if we wanted to).How does it sound?
Actually the need for something like Splat is more general than data frames (as I realized when reviewing your PRs). Maybe we should defineSplat(f) = x -> f(x...) instead? Then you'd write by(df, :key, selector => Splat(cor)), which would work with any selector. We may file an issue in Base about this.
A related issue that was raised previously (and which I somewhat confused with the present one) is that one may want to apply a function separately to each provided column. But maybe that can be fixed independently from this one, by using All(selector) .=> function.
selector .=> function "just works" right now, see e.g.:
julia> df = DataFrame(rand(3,4))
3ร4 DataFrame
โ Row โ x1 โ x2 โ x3 โ x4 โ
โ โ Float64 โ Float64 โ Float64 โ Float64 โ
โโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโค
โ 1 โ 0.68207 โ 0.464943 โ 0.381817 โ 0.971913 โ
โ 2 โ 0.954418 โ 0.0947486 โ 0.500607 โ 0.0992424 โ
โ 3 โ 0.306492 โ 0.824458 โ 0.778946 โ 0.563767 โ
julia> by(df, :x1, (:x2,:x3).=>sum)
3ร3 DataFrame
โ Row โ x1 โ x2_sum โ x3_sum โ
โ โ Float64 โ Float64 โ Float64 โ
โโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโค
โ 1 โ 0.68207 โ 0.464943 โ 0.381817 โ
โ 2 โ 0.954418 โ 0.0947486 โ 0.500607 โ
โ 3 โ 0.306492 โ 0.824458 โ 0.778946 โ
but I have not thought about it earlier - so this is a nice idea.
Splat(f) = x -> f(x...) is also excellent - but where should it live (probably some more general place than DataFrames.jl - but we could add it here and later move to a more general place). A minor thing is that it should probably be splat then, as it is just a function (not a type).
Actually given your comments the only thing that needs to be added is this splat function.
selector .=> function"just works" right now, see e.g.:
Only for simple cases, but not (as we have discussed somewhere else) for All, Not or regexes.
Splat/splat could indeed live in Base or in a fundamental package. @xiaodaigh Would you file an issue in Julia to discuss this?
Only for simple cases, but not (as we have discussed somewhere else) for
All,Notor regexes.
True, but I guess this is not a crucial functionality (still - we can add it if you feel it would be valuable).
Well it would be nice e.g. to be able to get the mean of all variables matching a pattern or all variables in a range. Not essential, but something that it would be nice to be able to do at some point.
Agreed (the only thing to think of is column name generation pattern)
@nalimilan - is there anything left to decide here (except for Splat which is a nice idea, but I understand you would prefer to have it in base?)
It would be nice to raise the Splat idea on Julia's GitHub or Discourse. If it's rejected we could add it here (or in another package).
How does this extend to situations where you want to pass, for example, the first two arguments, but also some fixed arguments?
using DataFrames
df = DataFrame(grp = rand(["a","b"], 100), x= rand(100), y = rand(100))
by(df, :grp, (:x, :y) => (x, y) -> cov(x, y; corrected = false))
My personal preference would be to use .=> for application across many columns simultaneously, or even multiple sets of columns.
# applying a unary function over multiple columns
by(df, :grp, (:x, :y) .=> maximum)
# applying a binary function over a single pair of columns
by(df, :grp, (:x, :y) => (x, y) -> cov(x, y; corrected = false))
# applying a binary function over multiple pairs of columns
by(df, :grp, ((:x, :y), (:a, :b)) .=> (x, y) -> cov(x, y; corrected = false))
So currently you would have to write (listing those of your examples that would need a change):
by(df, :grp, (:x, :y) => ((x, y),) -> cov(x, y; corrected = false))
by(df, :grp, ((:x, :y), (:a, :b)) .=> ((x, y),) -> cov(x, y; corrected = false))
and with splat you could write
by(df, :grp, (:x, :y) => splat((x, y) -> cov(x, y; corrected = false)))
by(df, :grp, ((:x, :y), (:a, :b)) .=> splat((x, y) -> cov(x, y; corrected = false)))
if you did not want to pass a kwarg it would be even simpler (and this is the key intended use):
by(df, :grp, (:x, :y) => splat(cov))
by(df, :grp, ((:x, :y), (:a, :b)) .=> splat(cov))
so this is only a syntactic sugar
@nalimilan as a separate comment, having the src => fun => target syntax would be super nice here, as then you could write something like:
by(df, :grp, ((:x, :y), (:a, :b)) .=> splat(cov) .=> (:res1, :res2))
to give names for the results. Currently, with kwarg only syntax I think it is not possible.
I see - I'm starting to put these pieces together and I think I'd prefer the "auto-splatting". Conceptually it seems easier to me. If you ever need the opposite of the default behavior, it's also much cleaner to pattern match on (args...) -> than ((x, y),) ->.
The splat function seems weird to me in general. For a language with such rich native splatting syntax, I have trouble seeing the need for an obscure and inflexible higher order function. If you're set on this style, you could easily add a bit more flexibility by defining splat as
splat(f, args...; kwargs...) = x -> f(x..., args...; kwargs...)
so that you can effectively partially evaluate f with fixed arguments before receiving x. Then you could do:
# "manual-splat" (having to splat yourself)... this term is killing me >.<
by(df, :grp, ((:x, :y), (:a, :b)) .=> splat(cov; corrected = false) .=> (:res1, :res2))
# "auto-splat"
# cons: more characters, pros: more obvious/readable
by(df, :grp, ((:x, :y), (:a, :b)) .=> (x, y) -> cov(x, y, corrected = false) .=> (:res1, :res2))
I think I'd prefer the "auto-splatting"
I agree that this is more intuitive, however, this approach has two problems:
(args...) what yo have proposed drops column names. A typical intended pattern of usage of the current syntax is:by(df, :grp, [:a, :b] => x -> x.a + x.b)
Still I personally agree that auto-splatting is more intuitive and retaining column names is not that crucial, especially that when we pass a single column we drop its name.
@nalimilan - what do you think?
Given this proposal I am marking this with 1.0 and "breaking" as we need to decide it soon.
the reverse operation is not loseless as
(args...)[...] drops column names
by(df, :grp, [:a, :b] => x -> x.a + x.b)
Thanks, I didn't realize this was also a target behavior. That makes sense.
My inclination is that this should be a special case - perhaps only treated this way if a Function (or Pair beginning with a Function) is passed, as the column selection seems syntactically redundant in this case (perhaps it's to avoid large tuple allocations?).
# `Function` or `Pair{Function,}`
by(df, :grp, (x -> x.a + x.b) => :z)
# `Pair{<:Any,Function}` or `Pair{<:Any,Pair{Function,}}`
by(df, :grp, (:a, :b) => ((xa, xb) -> xa + xb) => :z)
Then whenever a column selection is provided, always use the auto-splatting. Would something like that break other desired behaviors? Another concern is that this style would also make it difficult to ever support Functions as selectors without wrapping in some specific types.
A few comments.
Your both statements are invalid - you have to wrap a function in ( ... ) as otherwise the operator precedence makes it wrong. So you have to write e.g. by(df, :grp, (x -> x.a + x.b) => :z)
The form (x -> x.a + x.b) => :z is something we intend to add (currently it does not work); but then what is passed to a function will be a SubDataFrame not a NamedTuple (but this is a minimal difference). I think we would leave this behavior (for wide DataFrames we do not want to run into compilation overhead)
Then whenever a column selection is provided, always use the auto-splatting.
As I have said - I agree that auto-splatting is more natural. We just have to wait for @nalimilan to comment as he might have some strong arguments in favor of NamedTuples (he has spent much more time on designing this than all other people taken together ๐)
Would something like that break other desired behaviors?
It would be breaking, but I would say that (:a, :b) => x -> x.a + x.b is less convenient/intuitive than (:a, :b) => (a, b) -> a + b also when we consider broadcasting of .=> then auto-splatting is much more natural as in [(:a, :b), (:c, :d)] .=> do_something the column names would change between calls to do_something so passing the column names in a NamedTuple is not that much useful.
In summary:
SubDataFrame is passed, this is intended, and I would leave it as isYour both statements are invalid.
Thanks. Updated for clarity. I hope the intent was still clear as they were just for illustrative purposes anyways.
Sure - I just wanted to make this comment as this is a typical error (and the problem is that the unintended semantics is silently accepted, so you can learn about the problem much later in the code execution than you make it) I was not sure you are aware of.
And just to add. Maybe we will reject the proposal to auto-splat just because it is breaking. Then your proposal of enhanced splat is very nice. I would just discuss whether it should be:
splat(f, args...; kwargs...) = x -> f(x..., args...; kwargs...)
or
splat(f, args...; kwargs...) = x -> f(args..., x...; kwargs...)
as the latter (making args... first arguments of f) is more similar to standard currying.
See https://stackoverflow.com/questions/60237725/julia-dataframe-same-column-in-a-multiple-input-function-with-by/ for a related question from today.
Actually it is an argument for auto-splatting, as in the question asked there it clearly makes sense in general to run a code like (I am using current syntax):
by(df, :grp, result= (col1, col2) => ((x, y),) -> cov(x, y))
where col1 and col2 are variables holding column names. Such generic code should work even if col1==col2.
With auto-splatting it would become:
by(df, :grp, result= (col1, col2) => cov)
and it should not be a problem that col1 and col2 are the same names (internally we would not care about column names - only about the vectors stored in them).
That's a really interesting use case identified in that stackoverflow thread. For me it makes it clear that what I've been referring to as a selector - the first part of the Pair(s) - is serving two distinct purposes.
All, Not, or a Regex. Tuples (or SubDataFrames) of arguments for transformations.It's somewhat obvious in retrospect, but drawing a line between these has helped me conceptualize it. I suppose this is where the Col is needed to differentiate tuples that are meant to collapse multiple selectors (Col(Not([:a]), Not([:b]))) from tuples that are meant to be arguments to a multivariate transformation ((Not([:a]), Not([:b])))
More generally, this type of dual-purpose starts to make me a bit worried about the robustness of such a syntax. A lot of issues with relational columns like this, in my experience, are best solved by first reshaping to a more normalized data model and regrouping the data. At that point you don't need extremely flexible column selection syntax - although this might by my dplyr preconceptions trickling in.
So yes as @bkamins said the main reason for passing a named tuple rather than splatting is that it preserves the names: so you can splat a named tuple, but conversely you cannot construct a named tuple if we pass splatted arguments. @bkamins and @piever also liked named tuples better because they are a kind of small table which can easily be manipulated. Finally, that's inspired by JuliaDB.
But indeed using named tuples may not be that common and we could change the default, introducing another syntax to request a named tuple (e.g. something like by(df, :key, Named(:x, y) => f)).
I'd say that which approach is the most convenient depends on the use case. In particular, splatting only makes sense when you write column names explicitly by hand: if you use a more complex selector, you're unlikely to know the exact number and order of variables, so you wouldn't pass them as positional argument. As examples of conflicting goals, if we splatted by default, by(df, key, (:x, :y) => cov) would indeed be very nice, but we would lose things like select(df, (:x, :y) => ByRow(sum)) and even more select(df, r("x") => ByRow(sum)) (possible with the current design, with the new select improvements).
At some point I had played with the idea of retrieving the arguments names in the Function object, so that e.g. by(df, :key, (x, y) -> cov) would be equivalent to by(df, :key, (:x, :y) => r -> cov(r.x, r.y)). But people weren't too enthusiastic about that.
Overall, I think one conclusion is that DataFrames should provide a powerful and flexible syntax, but not necessarily super convenient, and that DataFramesMeta should build on top of this. And DataFramesMeta doesn't have this problem, as you can specify to which argument each column corresponds without thinking about splatting.
More generally, this type of dual-purpose starts to make me a bit worried about the robustness of such a syntax. A lot of issues with relational columns like this, in my experience, are best solved by first reshaping to a more normalized data model and regrouping the data. At that point you don't need extremely flexible column selection syntax - although this might by my
dplyrpreconceptions trickling in.
What do you mean exactly? Can you develop?
Good points.
So maybe we should change nothing in what we have, but add the following syntax:
by(df, :grp, Splat(:x, :y) => cov)
Then Splat would be applied to column names or column indices and as opposed to standard selectors it would allow passing duplicates. Then if Splat is used on LHS of => then on RHS we pass data to the function using splatting. Having Splat on LHS is more flexible as it allows for duplicates (which is OK when we splat). What do you think?
It would be interesting to collect examples of by, select and filter used with multiple input columns. That would help deciding which pattern is the more common to choose the default. Maybe we could ask on #data.
OK - I have asked. @nalimilan, @xiaodaigh, @dgkf - can you please comment there so that we have a single place to discuss?
can you please comment there
Where exactly do you want to move the conversation? To #2080?
To Slack as @nalimilan suggested. See the post https://julialang.slack.com/archives/C674VR0HH/p1581873612348200
Commenting here because I don't want to derail the slack conversation further than I already have
@nalimilan #1935
More generally, this type of dual-purpose starts to make me a bit worried about the robustness of such a syntax. A lot of issues with relational columns like this, in my experience, are best solved by first reshaping to a more normalized data model and regrouping the data. At that point you don't need extremely flexible column selection syntax - although this might by my dplyr preconceptions trickling in.
What do you mean exactly? Can you develop?
Expanding on both points
It seems like these two behaviors (column selection and argument building) are difficult to build simple syntactic rules around, which starts to make them very difficult to reason about. Initially I thought a simple rule might be "if a symbol is paired to a function, apply the function to the column data. If a tuple of columns is paired with a function auto-splat the tuple as arguments," but this logic breaks down with the additional selectors (All, Not, Regex, etc). Just trying to build up a few use cases:
# simplest case, perform a function on a specific column
:x => f
# apply a function over multiple columns
(:x, :y) .=> f
# apply a function over multiple columns by programmatic criteria
# (?) when a regular expression is passed is it assumed to behave
# as a collection of columns or a single column - should multiple
# columns be splatted?
r"_id$" .=> f
# applying a function over pairwise columns by programmatic criteria
# (?) do tuples of meta-selectors collapse into a Set (equivalent to a
# single `All()`), or do they form a tuple of `All()` results, or do
# they form pairwise permutations of tuples, or zipped tuples?
(All(), All()) .=> f
I don't think there's any way around this ambiguity without introducing some new Types that indicate how the column names translate into the various meanings of these columns as you won't have access to the DataFrame to collapse down into a set of columns until you're into the body of the by function. Because of all of the ways these can be interpreted, it might be necessary to have something like Splat(:x, :y), BroadcastSplatCols(:col_id, Not([:col_id])) PermuteSplatCols(All(), All())), ZipSplatCols(r"_valueA$", r"_valueB$")). Maybe there's another solution, but I can't think of one given the way that these selectors are implemented. In the absence of something better, this starts to feel like there must be a fundamental design issue when it's necessary to embed so many data operations as Types.
When I have problems that require operation across multiple columns at once, I usually do these steps. Granted, I think their may be performance concerns of constantly reshaping very large data, but usually I'm not working in that performance-limited domain. I'll use an example where you have average various medical measures for males and females in different countries.
|country|height_m|height_f|weight_m|weight_f|
|---|---|---|---|---|
|The Netherlands|182.5|167.5|82.7|67.8|
|Germany|179.9|162.8|82.4|67.5|
|...|...|...|...|...|
Typically I would first reshape the data to be
|country|measure|value_m|value_f|
|---|---|---|---|
|The Netherlands|height|182.5|167.5|
|The Netherlands|weight|82.7|67.8|
|...|...|...|...|
Then I can group by [:country, :measure] and perform my operation across value_m and value_f without having to worry about specifying pairwise columns. At least this is how I would make things work in the dplyr-style of doing things.
I think the Splat type would exactly disambiguate these two approaches and that is why I proposed to stay non-breaking and add Splat (as in Splat you can have duplicate columns etc. as you do not care about their names).
Thanks for developing. So yeah, a combination of what we have now and splatting would probably cover these cases. Things like Splat(:x, r"y") calling f(::AbstractVector, ys::NamedTuple{<:Any, <:AbstractVector}) could also be allowed at some point if we wanted, and it could even be combined with ByRow.
@nalimilan So the decision is:
NamedTuple as defaultSplat in the futurePlease confirm so that we can move forward with implementation.
That's fine with me.
OK - so I am closing this in favor of https://github.com/JuliaData/DataFrames.jl/issues/2121.
Please reopen if you feel this needs to be further discussed.