I've just realized that describe takes name => function pairs, which is totally inconsistent with select, transform and combine. I think we should deprecate the current syntax and move to function => name.
Well - we were aware that we left this case.
describe is consistent with DataFrame and insertcols! syntax. Also see https://bkamins.github.io/julialang/2020/07/17/pair.html for a review of all uses of => we have right now.
Having said that I would not oppose to change it prior to 1.0 release as this is a non-core functionality.
OK, I must have forgotten that then. I'd be inclined to change this, as it's quite similar to select/transform/combine (function associated with output column name). insertcols! is a bit different as there's no function. Also, the fewer exceptions, the better.
As I have said - I would not oppose it.
@oxinabox, @quinnj, @ararslan - do you have any opinion here? (I am mostly asking about the approach to "breaking things" in cases like here - we have a case of a relatively unimportant feature that is a bit inconsistent with the rest of the design; so - it would be better to change it but it is breaking).
In general this is a last call to break it in 0.22 release, as in general I will oppose any breaking changes to ensure the stability of the API of DataFrames.jl as I believe this is what we need to start ensuring if we want to have a wider adoption.
If we want to avoid breakage, we could also allow both orders (since there's no ambiguity), but recommend function => name.
Actually I would prefer to be breaking, but want to confirm the policy (i.e. allow small breakages for 0.22 like this one to clean things up; do not allow breaking changes after 1.0 release at all 馃槃)
My preference is to change the syntax to make it consistent with select/transform/combine.
If you鈥檙e going to be breaking, would not it be nicer to have cols as the default arg, and stats as the kwarg? That would be nice to be able to do describe(df, :x) similarly to groupby(df, :x) or select(df, :x)
cols kwarg is a standard approach to restrict columns to which a function is applied in DataFrames.jl.
Is that really the case? We have lots of cases where it's a positional argument (allowmissing, completecases, unique...).
Ah - sorry - you are right. I was thinking about these functions and I knew we called the argument cols, but mixed up the fact that we use cols as a positional argument.
So @matthieugomez is what you want this:?
describe(df, cols; stats)
but it would greatly complicate the syntax of stats. Would you want to pass the arguments as a value (Symbol or Pair) or a vector of Union{Symbol, Pair}?
Yeah. I was hoping we could make both cols and stats positional using dispatch, but there's a conflict when passing a symbol...
@matthieugomez - what do you think given these considerations (or in general - what would be your proposal exactly)? Thank you!
I like what @matthieugomez suggested. It's much closer to stata where summarize is used very heavily in that context.
I think describe(df, cols::MultiColumnIndex) would be good. I agree that this would make the typing of describe.
My ideal behavior is probably,
describe(df, cols, fun => :myfuname, fun2 => myfunname2)
describe(df, cols, fun => :myfuname, fun2 => myfunname2; allstats = true)
Where the first one just returns the results of fun1 and fun2. The second returns mean, std, etc. in addition to fun1 and fun2
The problem with this is that describe(df, :min) is problematic. I know you proposed a MultiColumnIndex, but still users could be confused by it.
Also note that currently you can write:
describe(df, cols=:x, :min, :sum => sum)
do you both feel that having to write cols= is that bad?
That's fine. cols is not that bad. Also point 1 can just be taken care of by adding :all as a positional argument.
We should change to fun => name though.
I've said this before but we should also change insertcols! to insertcols!(df, x => :name) for consistency everywhere.
Yes - I will make a PR changing this order.
What we could do is require wrapping symbols in a vector or All, e.g. describe(df, [:x]). Not clearly better or worse than writing cols=:x (the plural is weird when you pass a single column too). Do we have any other place where we use a cols keyword argument in the end?
@bkamins As you suggest, I was thinking of a kwarg stats that would be either a Union{Symbol, Pair}, or a vector of Union{Symbol, Pair}, as you suggest.
Beyond consistency, my impression is that 99% of the time, people will use describe to output all the default function for one or two columns (let me know if you disagree). So it makes sense to make it I easier to specify columns rather than specifying stats.
people will use
describeto output all the default function for one or two columns (let me know if you disagree)
Interesting - I do exactly the opposite - rather collecting data for all columns but custom functions. My major use case of describe is to quickly check if I read-in a DataFrame correctly. If I wanted to investigate one or two columns I would use combine probably.
Do we have any other place where we use a
colskeyword argument in the end?
No
OK. That could be a reason to change that argument then. OTOH we made stats positional for consistency with select/transform/combine (funny fact: before https://github.com/JuliaData/DataFrames.jl/pull/1664 it was a keyword argument!). It seems we can't be consistent with everything here...
This is indeed a coincidence :smile:. Let us wait a bit as maybe other people will voice their opinion (or @matthieugomez changes his opinion given this issue :smile:)
I would like to revive this, I think describe(df, :x1, :x2; stats = ...) is a much more useful syntax and think we should get it in before 1.0.
The only issue is that you can write describe(df[!, [:x1, :x2]], ...stats here...) if you really do not like to use cols. Is that that problematic?
(I do not have a strong opinion here, but just want to understand the rationale)
I think mostly, those are a lot of brackets and a lot of characters to type. If you use stata you are using
summ x y z
all the time interacitvely, it's second nature. So making it super easy to type is important.