DataFrames.jl: Add some methods for elegant piping

Created on 6 Sep 2020 · 15 Comments · Source: JuliaData/DataFrames.jl

I'd like to argue for the addition of a few pipe-oriented methods before it's too late, from the perspective of teaching a data analysis class for new Julia users.

I mean methods such as the following:

DataFrames.groupby(cols; sort=false, skipmissing=false) = df -> groupby(df, cols; sort, skipmissing)

DataFrames.combine(args...; kwargs...) = x -> combine(x, args...; kwargs...)

This allows code such as

using DataFrames, Statistics  # mean comes from Statistics

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,2,2])

df |>
groupby(:children) |>
combine(:age => mean)

This piping style results in clear and elegant code for some common data processing tasks. Unfortunately, it is a bit cumbersome to use directly with the API from DataFrames.jl:

df |>
x->groupby(x, :children) |>
x->combine(x, :age => mean)

Much of the clarity and elegance is lost (visual noise, meaningless name x).
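The idea can be tried without DataFrames.jl. The following is a minimal sketch on plain vectors; the helpers keep_above and scale are hypothetical names made up for this illustration and are not part of any package.

```julia
# Hypothetical two-argument helpers that take the data as first argument.
keep_above(v, threshold) = filter(x -> x > threshold, v)
scale(v, factor) = factor .* v

# Curried one-argument methods: each returns a function that waits for the data.
keep_above(threshold) = v -> keep_above(v, threshold)
scale(factor) = v -> scale(v, factor)

ages = [23.0, 42.0, 59.0]

# Pipeline using the curried methods.
piped = ages |> keep_above(30) |> scale(0.5)

# The same pipeline written with explicit anonymous functions.
with_lambdas = ages |> (v -> keep_above(v, 30)) |> (v -> scale(v, 0.5))
```

Both pipelines compute the same result; only the amount of syntax the reader has to parse differs.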

Users can use macros from third-party packages (DataFramesMeta.jl, Query.jl...), but this adds a lot of complexity especially for new users: they have to use and learn yet another package, and more importantly, they have to learn macros.

Macros

Macros are not great for new users, and I think they should only be exposed in the API when necessary because:

They are a relatively advanced part of the language
Users have to learn a new domain-specific language for each package using macros
Most importantly: what users have learnt of Julia syntax, what they can normally rely on when trying to understand a piece of code, goes out of the window. Consider:
@mymacro(f(x+2))

Who knows if f() is a function call here, if + is a binary operator, and (more insidiously) if x+2 is evaluated before f? Or if it's a (possibly buggy) unhygienic macro that messes with user variables...
This makes reasoning and debugging code using macros from other developers significantly more difficult.
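To make the concern about syntax concrete, here is a small sketch of a macro that silently changes what an expression means. The macro @swap_plus and its helper swap_plus_expr are hypothetical and written only for this illustration.

```julia
# Hypothetical helper: walk an expression and replace every `+` with `-`.
function swap_plus_expr(e)
    e === :(+) && return :(-)
    e isa Expr && return Expr(e.head, map(swap_plus_expr, e.args)...)
    return e
end

# Hypothetical macro: it receives the unevaluated syntax, not a value,
# so it is free to rewrite the expression before anything runs.
macro swap_plus(ex)
    return esc(swap_plus_expr(ex))
end

# Reads like abs(1 + 3), but the code that actually runs is abs(1 - 3).
result = @swap_plus(abs(1 + 3))
```

Nothing at the call site tells the reader that + no longer means addition, which is the kind of surprise described above.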

And even paying this price of learning another package and using macros, we end up with code that is not as clear as the above. With Query.jl:

df |>
@groupby(_.children) |>
@map({children=key(_), Mean=mean(_.age)}) |>
DataFrame

Between this and the first version above, I know which one I'd like to teach to my students :-)

Conclusion

Having these new methods in DataFrames.jl would allow users to have an elegant piping solution that only requires knowledge of standard Julia syntax.

Of course this could be done in a third-party package without macros, but

That would mean committing type piracy.
I'm afraid that DataFrames.jl would then evolve in ways that make it more and more difficult (or less nice). Already now, some functions have many methods, which makes it not trivial (but hopefully possible) to implement this properly. So I fear if this is not included in DataFrames.jl, it will quickly become impossible to do without macros.

decision

knuesel

👎2 👍2

All 15 comments

Some notes:

with combine the problem is that GroupedDataFrame can go first or second.
I am not sure here (and I might be wrong), but I fear that a typical data science student will have trouble understanding higher-order functions.
Currently we recommend using Pipe.jl, which is admittedly another package and introduces a macro (which you recommend against), but I believe it is really simple to teach:

@pipe df |>
groupby(_, :children) |>
combine(_, :age => mean)

I am convinced (but we can discuss this) that having the _ explicitly in the pipeline actually makes it easier to understand, once you learn that _ is the value returned by the previous expression.

Also note that this is actually not only more readable, but also more powerful, as you can write things like:

@pipe df |>
groupby(abs.(_), :children) |>
combine(_[Not(end)], :age => mean)

(I do not say what I have written is super useful, but I just want to highlight that this style gives you such an option)
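For readers new to the placeholder style, this sketch shows in plain Julia what a _ pipeline stands for. The data is made up, the @pipe line appears only as a comment, and the exact code Pipe.jl generates may differ from the hand-written version.

```julia
using Statistics: mean

scores = [1.0, -2.0, 3.0]

# With Pipe.jl one would write something like:
#   @pipe scores |> filter(x -> x > 0, _) |> mean(_)
# Each `_` stands for the value produced by the previous step.

# Hand-written equivalent using only Base and Statistics:
positive = filter(x -> x > 0, scores)
result = mean(positive)

# The same thing as a pipeline of anonymous functions:
result_piped = scores |> (v -> filter(x -> x > 0, v)) |> mean
```

Because _ can appear anywhere in a step, the piped value is not restricted to the first argument position, as in the filter call here.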

bkamins on 6 Sep 2020

👍1

But in general: thank you for the comment - let us discuss pros and cons of different approaches, as "easy entry" to the package is an important thing for us.

Also note that @pipe is fully generic: it is just an add-on to Julia Base, and once learned it can be used for any piping. In my experience, using |> even in Julia Base is very inconvenient without something like @pipe, so when you learn and start to use |> you probably learn @pipe or some similar alternative. We recommend @pipe as it is super easy for newcomers to grasp.

bkamins on 6 Sep 2020

Great points to discuss, thanks.

In fact I think it can be confusing that the semantics of f in combine(gdf, f) changes when the parameters are exchanged... It might be hard for readers of the code to remember which is which (or that there is a difference) if they are not regular users of DataFrames.jl. Would it make sense to use a different function name for combine(f, gdf)? I think it would make it easier to document and teach.

In any case, we could start by defining the fallback
```
DataFrames.combine(args...; kwargs...) = x -> combine(x, args...; kwargs...)
```
which covers only the combine(gdf, f) case. I think the plan is to extend this case at some point, to support returning multiple columns? This would then cover the functionality of combine(f, gdf). We could also add a keyword argument to combine(gdf, f), for example passgroups, to have explicit control over the semantics of f.
Higher-order functions are a difficulty, but it's an important and generally useful concept in programming (also beyond Julia) so I don't mind teaching it.
Ah you're right, better to compare with @pipe than Query.jl. I agree it's a good solution (apart from being a macro).

knuesel on 6 Sep 2020

> I think it would make it easier to document and teach.

We have thought about it, but what would be a good name? The issue is that this syntax is needed for do blocks (independently of the multiple columns issue).

bkamins on 6 Sep 2020

@nalimilan - what do you think here?

In general in Julia this type of design was avoided, in favour of requiring x -> ..., except for Base.Fix1 and Base.Fix2 cases which are rare.

If we went this way we should probably consider doing a list of methods that should have such special cases (and this poses a problem in some cases where we extend methods from Base). On the other hand maybe @pipe is good enough?

bkamins on 6 Sep 2020

Yes I don't think we should do this. There's nothing wrong with macros, actually I think adding a @ in front of a name is less difficult to explain than piping with currying. For convenient use, people should use DataFramesMeta anyway, which can provide a syntax which is even simpler than |> (note it's currently being redesigned so things will likely improve).

nalimilan on 6 Sep 2020

👍1

It's not that rare in Julia, I count at least isapprox, isequal, > (and many other comparison operators), in, startswith, endswith and contains (some like contains were recently added). There are discussions to add map and filter.
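A quick sketch of those existing one-argument forms, using only Base (the single-argument startswith needs Julia 1.5 or later; the sample data is made up):

```julia
names = ["John", "Sally", "Kirk"]
ages = [23.0, 42.0, 59.0]

# One-argument forms return a function that waits for the missing argument.
over_30 = filter(>(30), ages)
j_names = filter(startswith("J"), names)
has_kirk = any(isequal("Kirk"), names)

# >(30) is shorthand for Base.Fix2(>, 30), which fixes the second argument.
same = filter(Base.Fix2(>, 30), ages) == over_30
```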

But yes, @pipe is arguably good enough.

knuesel on 6 Sep 2020

> It's not that rare in Julia, I count at least isapprox, isequal, > (and many other comparison operators), in, startswith, endswith and contains (some like contains were recently added).

Yes but these are all predicates returning a Boolean, and (generally) taking only two arguments.

nalimilan on 7 Sep 2020

@knuesel another possibility is to use Lazy.jl but Lazy is dangerous as it exports groupby. So what I have done for my needs is to export @_ and @__ and @as from Lazy and use it in DataConvenience.jl so it becomes

using DataFrames
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,2,2])

using DataConvenience
using Statistics: mean

@> df begin
  groupby(:children)
  combine(:age => mean)
end

which is the cleanest I find.

xiaodaigh on 14 Sep 2020

I'd like to put in a vote for the curried versions. Although this problem can be addressed using other packages, I don't think that is a good solution. The package should advocate a standard way of manipulating data (e.g. in the tutorials) that doesn't require other packages, and this way should be as user-friendly as possible without sacrificing flexibility or performance.

I really think that adding curried versions of all the main functions (and adding piping examples to the tutorials) moves us in that direction. Consider the following simple example

combine(groupby(data, :treatment), :score=>mean)
data |>  groupby(:treatment) |> combine(:score=>mean)

In the first, both the order of operations and also the assignment of arguments to operators is obscured. Compare to the two most used data-manipulation frameworks:

dplyr

data %>% group_by(treatment) %>% summarize(mean_score=mean(score, na.rm=TRUE))

pandas

data.groupby('treatment').score.mean()

Maybe I'm being simplistic, but I really think the fact that these frameworks allow/force you to put operations in their logical order is part of the reason they have been so successful.

fredcallaway on 19 Sep 2020

👍1

Let me get a vote on #data for this.

bkamins on 19 Sep 2020

> data %>% group_by(treatment) %>% summarize(mean_score=mean(score, na.rm=TRUE))

The issue is that there are many functions you would want to curry, and they may not sit inside DataFrames.jl, so unless those functions also offer a curried version you end up writing things like

df |>
   groupby(...) |>
   df->fn(df, ...) |>
   df->fn2(df, ...)

But if you use a package like DataConvenience.jl (or Lazy.jl) then this becomes

@> df begin
  groupby(...) 
  fn(...) 
  fn2(...)
end

In R the pipe is also a separate package. I think for good reasons, because piping is a more generally applicable concept. I would prefer to keep DataFrames.jl lean.

Also, there are multiple styles of piping and why should DataFrames.jl choose one over another? E.g. this is another valid style

@pipe df |>
  groupby(_, ...)  |>
  fn(_, ...) |>
  fn2(_, ...)

xiaodaigh on 19 Sep 2020

I would summarize the tension as:

1. Have curried versions of SOME functions, and then use a different style for the unsupported ones.

2. Always require some piping signal (like @pipe), and then have a consistent syntax with every function.

On #data people seem to prefer option 1., but let us wait a bit more here. Fortunately this proposal is non-breaking, and for the time being option 2. is available and it is already convenient enough I would say.

bkamins on 19 Sep 2020

👍1

> On #data people seem to prefer option 1

I think it's 3 to 7. I agree that if we do 1, people can still do 2.

using HypothesisTests

x = vcat([-1, -1, -1], repeat([1], 7))
OneSampleTTest(x)
pvalue(OneSampleTTest(x)) # 0.22

So not quite conclusive 🤣
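For anyone curious where that number comes from, here is a sketch of the test statistic computed by hand with only the Statistics standard library. It assumes the null hypothesis is a mean of zero; turning t into a p-value requires the t-distribution CDF, which this sketch leaves to HypothesisTests.

```julia
using Statistics: mean, std

# Same votes as above: three against (-1) and seven in favor (+1).
x = vcat([-1, -1, -1], repeat([1], 7))

n = length(x)
standard_error = std(x) / sqrt(n)   # std uses the n - 1 correction by default
t = mean(x) / standard_error        # one-sample t statistic against a mean of 0
dof = n - 1                         # degrees of freedom for the t distribution
```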

xiaodaigh on 19 Sep 2020

😄1

The tide has changed - it is 8 to 7 in favor of not having it.

bkamins on 20 Sep 2020

🎉1
