I'd like to argue for the addition of a few pipe-oriented methods before it's too late, from the perspective of teaching a data analysis class for new Julia users.
I mean methods such as the following:
DataFrames.groupby(cols; sort=false, skipmissing=false) = df -> groupby(df, cols; sort, skipmissing)
DataFrames.combine(args...; kwargs...) = x -> combine(x, args...; kwargs...)
This allows code such as
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,2,2])
df |>
groupby(:children) |>
combine(:age => mean)
This piping style results in clear and elegant code for some common data processing tasks. Unfortunately, it is a bit cumbersome to use directly with the API from DataFrames.jl:
df |>
x->groupby(x, :children) |>
x->combine(x, :age => mean)
Much of the clarity and elegance is lost (visual noise, meaningless name x).
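For anyone who wants to try the curried style without touching DataFrames.jl, here is a minimal sketch. The names `pgroupby` and `pcombine` are hypothetical helpers made up for this example (they are not part of any package); they are used instead of adding methods to `DataFrames.groupby` and `DataFrames.combine` from user code.

```julia
using DataFrames
using Statistics: mean

# Hypothetical helpers (not part of DataFrames.jl): each returns a
# one-argument function that forwards to the regular two-argument call.
pgroupby(cols; kwargs...) = df -> groupby(df, cols; kwargs...)
pcombine(args...; kwargs...) = x -> combine(x, args...; kwargs...)

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23.0, 42.0, 59.0], children=[3, 2, 2])

# Piped form, using only standard Julia syntax.
piped = df |> pgroupby(:children) |> pcombine(:age => mean)

# Equivalent nested form with the existing API.
nested = combine(groupby(df, :children), :age => mean)
```

Both forms should give the same table; the only difference is the order in which the operations are read.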
Users can use macros from third-party packages (DataFramesMeta.jl, Query.jl...), but this adds a lot of complexity especially for new users: they have to use and learn yet another package, and more importantly, they have to learn macros.
Macros
Macros are not great for new users, and I think they should only be exposed in the API when necessary because:
@mymacro(f(x+2))
Who knows if `f()` is a function call here, `+` a binary operator, and (more insidiously) if `x+2` is evaluated before `f`. Or if it's a (possibly buggy) unhygienic macro that messes with user variables...
And even paying this price of learning another package and using macros, we end up with code that is not as clear as the above. With Query.jl:
df |>
@groupby(_.children) |>
@map({children=key(_), Mean=mean(_.age)}) |>
DataFrame
Between this and the first version above, I know which one I'd like to teach to my students :-)
Conclusion
Having these new methods in DataFrames.jl would allow users to have an elegant piping solution that only requires knowledge of standard Julia syntax.
Of course this could be done in a third-party package without macros, but
Some notes:
combine: the problem is that GroupedDataFrame can go first or second.
@pipe df |>
groupby(_, :children) |>
combine(_, :age => mean)
I am convinced (but we can discuss this) that having the _ explicitly in the pipeline actually makes it easier to understand, once you learn that _ is the value returned by the previous expression.
Also note that this is actually not only more readable, but also more powerful, as you can write things like:
@pipe df |>
groupby(abs.(_), :children) |>
combine(_[Not(end)], :age => mean)
(I do not say what I have written is super useful, but I just want to highlight that this style gives you such an option)
But in general: thank you for the comment - let us discuss pros and cons of different approaches, as "easy entry" to the package is an important thing for us.
Also note that @pipe is fully generic: it is just an add-on to Julia Base, and once learned it can be used for any piping. In my experience, using |> even in Julia Base is very inconvenient without something like @pipe, so when you start to use |> you will probably learn @pipe or a similar alternative. We recommend @pipe as it is very easy for newcomers to grasp.
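To illustrate the "fully generic" point, here is a small sketch that uses @pipe on plain Base functions, with no DataFrames involved. It assumes the third-party Pipe.jl package is installed; the data and the steps are made up for the example.

```julia
using Pipe: @pipe  # third-party package Pipe.jl, assumed to be installed

# Each `_` is replaced by the value produced by the previous stage.
total = @pipe [3, 1, 2] |> sort(_) |> map(x -> x^2, _) |> sum(_)

# The same computation written as nested calls.
nested = sum(map(x -> x^2, sort([3, 1, 2])))
```

The `_` can sit in any argument position, which is what lets the same macro work with functions that take the piped value second (as `map` does here).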
Great points to discuss, thanks.
In fact I think it can be confusing that the semantics of f in combine(gdf, f) changes when the parameters are exchanged... It might be hard for readers of the code to remember which is which (or that there is a difference) if they are not regular users of DataFrames.jl. Would it make sense to use a different function name for combine(f, gdf)? I think it would make it easier to document and teach.
In any case, we could start by defining the fallback
DataFrames.combine(args...; kwargs...) = x -> combine(x, args...; kwargs...)
which covers only the combine(gdf, f) case. I think the plan is to extend this case at some point, to support returning multiple columns? This would then cover the functionality of combine(f, gdf). We could also add a keyword argument to combine(gdf, f), for example `passgroups`, to have explicit control over the semantics of `f`.
Higher-order functions are a difficulty, but it's an important and generally useful concept in programming (also beyond Julia) so I don't mind teaching it.
Ah you're right, better to compare with @pipe than Query.jl. I agree it's a good solution (apart from being a macro).
> I think it would make it easier to document and teach.
We have thought about it, but what would be a good name? The issue is that this syntax is needed for do blocks (independently of the multiple columns issue).
@nalimilan - what do you think here?
In general in Julia this type of design was avoided, in favour of requiring x -> ..., except for Base.Fix1 and Base.Fix2 cases which are rare.
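For readers who have not met them, here is a minimal sketch of what Base.Fix1 and Base.Fix2 do. The variable names are made up for illustration.

```julia
# Base.Fix2(f, y) is a callable that behaves like x -> f(x, y).
is_positive = Base.Fix2(>, 0)

# Base.Fix1(f, x) is a callable that behaves like y -> f(x, y).
greet = Base.Fix1(*, "Hello, ")

# The one-argument method of isequal returns such a partial application.
eq2 = isequal(2)

checks = (is_positive(3), is_positive(-1), greet("world"), eq2 isa Base.Fix2)
```

So the Base cases are thin wrappers that fix one argument of a two-argument function, rather than general-purpose currying of functions with many arguments and keywords.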
If we went this way we should probably consider making a list of methods that should have such special cases (and this poses a problem in some cases where we extend methods from Base). On the other hand, maybe @pipe is good enough?
Yes I don't think we should do this. There's nothing wrong with macros, actually I think adding a @ in front of a name is less difficult to explain than piping with currying. For convenient use, people should use DataFramesMeta anyway, which can provide a syntax which is even simpler than |> (note it's currently being redesigned so things will likely improve).
It's not that rare in Julia, I count at least isapprox, isequal, > (and many other comparison operators), in, startswith, endswith and contains (some like contains were recently added). There are discussions to add map and filter.
But yes, @pipe is arguably good enough.
> It's not that rare in Julia, I count at least isapprox, isequal, > (and many other comparison operators), in, startswith, endswith and contains (some like contains were recently added).
Yes but these are all predicates returning a Boolean, and (generally) taking only two arguments.
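As a quick illustration of the one-argument predicate methods discussed here, a sketch using only Base (assuming Julia 1.5 or later, where the one-argument startswith and endswith are available). The sample data is made up.

```julia
words = ["apple", "avocado", "banana", "cherry"]

# startswith("a") returns a function equivalent to s -> startswith(s, "a").
a_words = filter(startswith("a"), words)

# >(2) returns a function equivalent to x -> x > 2.
big = filter(>(2), [1, 2, 3, 4])

# in(collection) returns a function equivalent to x -> x in collection.
n_hits = count(in([1, 3]), [1, 2, 3, 3])

# These also compose with |>, since each call yields a one-argument function.
is_jl = "DataFrames.jl" |> endswith(".jl")
```

Each of these fixes the second argument of a two-argument Boolean function, which is the pattern the reply above points out.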
@knuesel another possibility is to use Lazy.jl but Lazy is dangerous as it exports groupby. So what I have done for my needs is to export @_ and @__ and @as from Lazy and use it in DataConvenience.jl so it becomes
using DataFrames
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,2,2])
using DataConvenience
using Statistics: mean
@> df begin
groupby(:children)
combine(:age => mean)
end
which is the cleanest I find.
I'd like to put in a vote for the curried versions. Although this problem can be addressed using other packages, I don't think that is a good solution. The package should advocate a standard way of manipulating data (e.g. in the tutorials) that doesn't require other packages, and this way should be as user-friendly as possible without sacrificing flexibility or performance.
I really think that adding curried versions of all the main functions (and adding piping examples to the tutorials) moves us in that direction. Consider the following simple example
combine(groupby(data, :treatment), :score=>mean)
data |> groupby(:treatment) |> combine(:score=>mean)
In the first, both the order of operations and also the assignment of arguments to operators is obscured. Compare to the two most used data-manipulation frameworks:
dplyr
data %>% group_by(treatment) %>% summarize(mean_score=mean(score, na.rm=TRUE))
pandas
data.groupby('treatment').score.mean()
Maybe I'm being simplistic, but I really think the fact that these frameworks allow/force you to put operations in their logical order is part of the reason they have been so successful.
Let me get a vote on #data for this.
> data %>% group_by(treatment) %>% summarize(mean_score=mean(score, na.rm=TRUE))
The issue is that there are many functions you would want to curry, and they may not live inside DataFrames.jl, so unless those functions also offer a curried version you end up writing things like
df |>
groupby(...) |>
df->fn(df, ...) |>
df->fn2(df, ...) |>
But if you use a package like DataConvenience.jl (or Lazy.jl) then this becomes
@> df begin
groupby(...)
fn(...)
fn2(...)
end
In R the pipe is also a separate package. I think for good reasons, because piping is a more generally applicable concept. I would prefer to keep DataFrames.jl lean.
Also, there are multiple styles of piping and why should DataFrames.jl choose one over another? E.g. this is another valid style
@pipe df |>
groupby(_, ...) |>
fn(_, ...) |>
fn2(_, ...) |>
I would summarize the tension as:
vs
@pipe) and then with every function have a consistent syntax.
On #data people seem to prefer option 1, but let us wait a bit more here. Fortunately this proposal is non-breaking, and for the time being option 2 is available and it is already convenient enough I would say.
> On #data people seem to prefer option 1
I think it's 3 to 7. I agree that if we do 1, people can still do 2.
using HypothesisTests
x = vcat([-1, -1, -1], repeat([1], 7))
OneSampleTTest(x)
pvalue(OneSampleTTest(x)) # 0.22
So not quite conclusive 🤣
The tide has changed - it is 8 to 7 in favor of not having it.