We recently allowed
filter(:x => fun, df)
as an optimized version of anonymous function syntax for data frames.
I made the same argument for combine on grouped data frames, but I can't find the reference.
We should allow
filter(df, :x => fun1, :y => fun2)
where fun1 and fun2 are combined via AND, as in DataFramesMeta. This will facilitate piping.
I think it is not a problem to add it, but I would postpone it until after the 0.21 release since it is non-breaking.
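For reference, a minimal sketch of what is possible today with the pair-first form, assuming a DataFrames.jl version that already supports filter(:x => fun, df). The data frame and the predicates here are made up for illustration; the AND of two conditions is emulated by applying filter twice in a pipe.

```julia
using DataFrames

# Illustrative data, not from the discussion.
df = DataFrame(x = [1, 2, 3, 4], y = [10, 20, 30, 40])

# Row-wise anonymous function: each `row` is a DataFrameRow.
by_row = filter(row -> row.x > 1, df)

# Pair form: the predicate receives the values of :x directly.
by_pair = filter(:x => >(1), df)

# AND of two conditions with the current API: filter twice.
# The wrapper lambdas are needed because the data frame is the last argument.
both = df |> (f -> filter(:x => >(1), f)) |> (f -> filter(:y => <(40), f))
```

The wrapper lambdas in the last line are the piping friction that a df-first method would remove.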
Thinking about the API for the multiple-function case:
filter(df, :a => isequal(10), :b => isequal(20))
I think the basic intuition is that this would be the AND of those two conditions, as @pdeffebach suggested, but maybe it would be good to disambiguate for the OR case?
filter(df, :a => isequal(10), :b => isequal(20); all=true)
filter(df, :a => isequal(10), :b => isequal(20); any=true)
Or maybe separate functions altogether rather than the kwarg?
Would this really be better than filter(df, (:a, :b) => (a, b) -> isequal(a, 10) && isequal(b, 20))? If that's not convenient enough, better to use DataFramesMeta, which will always be more concise.
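To make the AND/OR question concrete, here is a rough sketch of a helper that takes several column => predicate pairs and a reducer. Both the name filterpairs and the reducer keyword are hypothetical and are not part of DataFrames.jl; the sketch only handles single-column, row-wise predicates.

```julia
using DataFrames

# Hypothetical helper for illustration only; not a DataFrames.jl function.
# `reducer = all` gives AND semantics, `reducer = any` gives OR semantics.
function filterpairs(df::AbstractDataFrame, conds::Pair...; reducer = all)
    # One Boolean mask per `column => predicate` pair, evaluated element-wise.
    masks = [map(fun, df[!, col]) for (col, fun) in conds]
    # Combine the masks row by row with the chosen reducer.
    keep = [reducer(m[i] for m in masks) for i in 1:nrow(df)]
    return df[keep, :]
end

df = DataFrame(a = [10, 10, 5], b = [20, 0, 20])

and_rows = filterpairs(df, :a => isequal(10), :b => isequal(20))
or_rows = filterpairs(df, :a => isequal(10), :b => isequal(20); reducer = any)
```

Passing the reducer as a function sidesteps the question of what all=true together with any=true would mean, which is one possible argument against two separate Boolean keywords.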
As a side comment, the current syntax is:
filter([:a, :b] => (a, b) -> isequal(a, 10) && isequal(b, 20), df)
or e.g.
filter([:a, :b] => (x...) -> all(isequal.(x, (10, 20))), df)
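A small runnable check of the two forms above, on made-up data: with a vector of columns, the predicate receives one positional argument per column for each row.

```julia
using DataFrames

# Illustrative data, not from the discussion.
df = DataFrame(a = [10, 10, 5], b = [20, 0, 20])

# Explicit arguments, one per selected column.
explicit = filter([:a, :b] => (a, b) -> isequal(a, 10) && isequal(b, 20), df)

# Splatted arguments compared against a tuple of targets.
splatted = filter([:a, :b] => (x...) -> all(isequal.(x, (10, 20))), df)
```

Both calls keep only the rows where a is 10 and b is 20.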
Would this really be better than
filter(df, (:a, :b) => (a, b) -> isequal(a, 10) && isequal(b, 20))? If that's not convenient enough, better to use DataFramesMeta, which will always be more concise.
This is fine. I still think multiple arguments would be nice, but I would be okay with having it live in DataFramesMeta. The form with df first is currently disallowed, though: the pair needs to be the first argument.
I would close this. Let us try to go for more convenient APIs in DataFramesMeta.jl. OK?
I second @pdeffebach.
Given that transform(df, :col => fn) is a valid method, I think we should have filter(df, :col => fn) for consistency, in the sense that these methods take df followed by :col => fn.
I just got bitten by the fact that filter(:col => fn, df) is possible but filter(df, :col => fn) is not.
I'd be OK with allowing this for consistency with select/transform/combine and to allow piping -- unless we add where or subset with column-wise semantics closer to these functions.
In general, I think where makes more sense than filter.
I also think it is a good idea to implement this change for consistency reasons. I find it odd that, in a pipeline, the data frame argument has to go last for filter, unlike for the other functions.
I don’t care whether it’s filter or where. Given that combine also allows both ways, I think it’s fine to do the same with filter.
The argument for where is that it would act on columns, whereas filter acts on rows.
where and filter by themselves don't give me the connotation that they are for either columns or rows. I would prefer column-first semantics, and if users want row-wise behavior, let them do something like
filter(eachrow(df), fn)
or
rowwisefilter(df, fn)
I would leave filter as is: we do not want to be breaking, and it is a legacy function consistent with Base in that it works row-wise.
I am OK with adding where(df, Pair{cols, predicate}...) that would do || and by default would work on whole columns, while with ByRow it would work on rows (so everything would be consistent with select semantics). Is someone willing to add it? It should be relatively easy; just please make proper use of the mechanics of parsing Pair{cols, predicate} so that the code is correct and type-stable. If not, I can add it.
Note that then probably we should also add where! and I would propose that it would also support view kwarg, as proposed in https://github.com/JuliaData/DataFrames.jl/pull/2386.
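A rough sketch of how such a where could reuse the select machinery, so that each pair sees whole columns unless the predicate is wrapped in ByRow. The name where_sketch and the data are hypothetical. The masks are combined with element-wise OR only because the comment above writes ||; the sketch requires at least one pair and ignores type stability and the view keyword.

```julia
using DataFrames
using Statistics: mean

# Hypothetical sketch only; `where_sketch` is not a DataFrames.jl function.
function where_sketch(df::AbstractDataFrame, conds::Pair...)
    # Rename each `cols => predicate` to `cols => predicate => :condN` and let
    # `select` evaluate it, so whole-column and ByRow predicates both work.
    named = [c.first => c.second => Symbol("cond", i) for (i, c) in enumerate(conds)]
    masks = select(df, named...)
    # Combine the Boolean columns with element-wise OR.
    keep = reduce((x, y) -> x .| y, eachcol(masks))
    return df[keep, :]
end

df = DataFrame(a = [1, 2, 3, 4], b = [10, 20, 10, 20])

# Mixes a whole-column predicate (uses the column mean) with a row-wise one.
kept = where_sketch(df, :a => (col -> col .> mean(col)), :b => ByRow(isequal(10)))
```

Here the first predicate needs the whole column to compute the mean, while the second is applied element by element, which is the kind of mixing a purely row-wise filter cannot express.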
filter(eachrow(df), fn)
or
rowwisefilter(df, fn)
This would disallow mixing row-wise and whole-column predicates in general.