Dataframes.jl: `filter` on an `AbstractDataFrame` should take in columns

Created on 21 Jul 2020  ·  40Comments  ·  Source: JuliaData/DataFrames.jl

I think it's a bit frustrating that select, transform, etc. operate on columns as the default, but filter operates on rows. I would be find with broadcasting. Plus, it makes it easier to filter with operations like x -> x .> mean(x)

breaking decision

Most helpful comment

I think the reason why filter is row-wise is that we decided at https://github.com/JuliaData/DataFrames.jl/issues/1514 that data frames are collections of rows. So all Base functions operating on collections that we implement have to operate by row. select, transform and combine are custom DataFrames function so there's no inconsistency strictly speaking (though for clarity they could have been added the "cols" suffix...).

Another reason is that operating on whole columns is essential for grouped operations like transform(gd, :x => mean). Since filter doesn't allow selecting rows in a GroupedDataFrame, it's not absolutely required: you can always do m = mean(df.x); filter(:x => x -> x > m, df) as the mean is always the same for all rows.

That said, the idea of always having the pair syntax pass the whole column vector (even in filter) is interesting. Its downside would be that it's quite practical not to have to put . everywhere for broadcasting, especially for things like filter(:x => in(df.y), df), or filter([:x, :y] => (x, y) -> x > 0 && y > 0, df) as precedence requires explicit parentheses for (x .> 0) .& (y .> 0). Maybe that's OK if you can write filter(:x => ByRow(in(df.y)), df) and filter([:x, :y] => ByRow((x, y) -> x > 0 && y > 0), df) instead.

It could also be a problem for filter(row -> ..., df) to be the most concise way to perform row-wise operations, which are the most common and intuitive, since that syntax is also by far the slowest due to type instability.

All 40 comments

Let us discuss it, but I think it is not likely to change. It is actually select, transform and combine that break the default behavior that data frame is a collection of rows. Also what you write is achievable by getindex:

df[df.x .> mean(df.x), :]

Thanks for considering this.

I like piping, and when I (eventually) run a lab I will try to make piping part of my style guide because it "reads well". That is to say, there are benefits to allowing this kind of behavior in filter even though it is achievable with getindex.

We can add ByRow wrappers to filter that would allow individuals to opt in to ByRow behaviors. We can also provide easier syntax for ByRow behavior in DataFramesMeta.

I guess maybe I would argue consistency within piping is more important than consistency with the general concept of a data frame as a collection of rows.

I see the benefit in having the same grammar for select/transform and filter (whether it's about taking rows or columns by default).
I cannot find the prior discussions. Why was select/transform decided to be column-wise? Wouldn't the same arguments apply to filter? (or conversely, if there is a strong reason for filter to be row-wise, should not select and transform be row-wise by default too?)

I confess I'm the one who made the biggest stink about transform acting on columns as noted in #1952. In my mind this was the best "one-to-one" mapping for working with df.x etc. It also makes it easier to do x .- mean(x).

After working with the new syntax I do think that adding . everywhere gets kind of annoying and having ByRow be the default and a ByCols wrapper would also have been a good decision.

It makes sense that select and transform use columns by default — esp. now that they are used with GroupedDataFrame.

Could filter be extended to accept the same syntax as select/transform when a DataFrame is given in the first argument?

As for the annoyance about using . everywhere, maybe map could be used as a short-hand for row-wise transform (e.g.
map(:x => x -> x + 1, df) would give the same thing as transform(df, :x => x -> x .+ 1))

I think ByRow is a powerful and convenient construct. However I think the inconsistency is a bit annoying.

Also, with regards to "data frame as a collection of rows", I would argue that

filter(:var => x -> x .> mean(x), df)

Is still consistent with "data frame as a collection of rows" since the filter function still only returns a Vector of length nrow(df).

Could filter be extended to accept the same syntax as select/transform when a DataFrame is given in the first argument?

See #2187 for this discussion

I have thought a bit more about it and I am thinking that allowing a Pair as a first argument is just too confusing.

It would be simpler to only have these two syntaxes:

filter(row -> row.x >= 1, df)
filter(df, :x => x -> x .>= 1)

Would that make it harder to express certain things?

No - this would be acceptable. The only problem is that filter in Base always takes a predicate as a first argument.

I think the reason why filter is row-wise is that we decided at https://github.com/JuliaData/DataFrames.jl/issues/1514 that data frames are collections of rows. So all Base functions operating on collections that we implement have to operate by row. select, transform and combine are custom DataFrames function so there's no inconsistency strictly speaking (though for clarity they could have been added the "cols" suffix...).

Another reason is that operating on whole columns is essential for grouped operations like transform(gd, :x => mean). Since filter doesn't allow selecting rows in a GroupedDataFrame, it's not absolutely required: you can always do m = mean(df.x); filter(:x => x -> x > m, df) as the mean is always the same for all rows.

That said, the idea of always having the pair syntax pass the whole column vector (even in filter) is interesting. Its downside would be that it's quite practical not to have to put . everywhere for broadcasting, especially for things like filter(:x => in(df.y), df), or filter([:x, :y] => (x, y) -> x > 0 && y > 0, df) as precedence requires explicit parentheses for (x .> 0) .& (y .> 0). Maybe that's OK if you can write filter(:x => ByRow(in(df.y)), df) and filter([:x, :y] => ByRow((x, y) -> x > 0 && y > 0), df) instead.

It could also be a problem for filter(row -> ..., df) to be the most concise way to perform row-wise operations, which are the most common and intuitive, since that syntax is also by far the slowest due to type instability.

Also as I have commented above we have df[df.x .> mean(df.x), :] so it is easy to filter using vectors already.

Should we reconsider the default column behavior for transform, then? It feels like its too late for that.

But it would be nice for everything in piping functions to work the same way. And if the rest of the functions really are about rows then it could make sense to bite the bullet and make transform row-wise before 1.0. Then for grouped operations would would add ByCol around the function.

OK - let us see what people say. We have had many long discussions around this and the conclusion was:

  • select, transform and combine work for both data frame and grouped data frame and should have a consistent behavior between them, thus they operate on whole columns
  • other functions should be row-wise if possible

Should not filter work with GroupedDataFrames too? Selecting the first row per group, or selecting groups of a given length are common operations.

The problem is that GroupedDataFrame behaves as a collection of SubDataFrame groups (via indexing notably). So for consistency filter should select groups, not rows within groups. There is a tension here with the fact that transform, select and combine work similarly on both DataFrame and GroupedDataFrame.

Right. I mean it would be great to keep these cases in mind when thinking of the best solution.

So there are two issues. First, filter is inconsistent with transform and select. Second, it is a bit clunky to do row-wise transform currently (and it's not clear how DataFramesMeta can make it simpler).

One potential way forward may be to have two syntaxes for select/transform/filter. If the DataFrame is given as a first argument, the function is applied column-wise (as in select/transform currently). If the DataFrame is given as second argument, the function is applied row-wise (as in filter currently)

The advantage of this solution is that row-wise operations become as easy as col-wise operations (no extra typing etc). It also makes filter consistent both with base and with existing functions in DataFrames. The disadvantage is that it is not very explicit.

Yeah that would be a really subtle difference likely to confuse a lot of people.

I would close this. Changing this will be breaking and inconsistent with Base. If someone strongly objects please comment.

Just to explain things. The syntax :src => fun should be understood in the following way:

  1. originally some g(fun, df) exists, it has some rule how fun works in a higher order function g
  2. then we add g(:src => fun, df) syntax to make things type stable, and in some cases also easier to express

Now the crucial point is 1.:

  • in most of the functions g we have that fun works per-row; this is a general design pattern we have in DataFrames.jl - it was heavily debated and was a hard decision to make, but we now have it and in particular filter follows it (and this is how Julia Base works)
  • we accepted an exception for select, transform and combine to this rule where we pass whole columns. Maybe it was a mistake, and we should have created ByCol rather than ByRow, and make these functions work per-row as all other functions? but again - it was heavily debated and we settled for this (mostly the reason was that combine worked this way for years and it would be massively breaking).

This is a hard case of having a mature package, where some design decisions have been made and tones of code written based on this design. If the change we want to do is mildly breaking - then we did it in 0.21. But combine and filter have been in DataFrames.jl for many, many years - we do not want to break users' code by such changes. Especially that 0.21 promises not to break things unless it is essential to do it.

So in short, user's have, regretfully, to learn that:

  1. filter works row-wise (like in Julia Base, so there is nothing to learn)
  2. select, transform and combine work on whole rows (which is an exception to learn)

I would still argue to remove the pair syntax as a first argument of filter, and allow it as a second argument, using columns by default.

This would make filter fully consistent with both Base and the existing verbs accepting the pair syntax.

For me, the pair syntax is a mini-language that should mean the same thing wherever it is used.

I see your point, but we do not want to be breaking now unless there is a strong reason to be breaking.

I will keep this open as maybe in 2.0 release we can discuss changing this behaviour to what you propose, but for 1.0 release I am really hesitant to introduce breaking changes at this point.

Something that came up on slack today was

@linq df |>
    groupby(:a)
    @where(:x .> mean(:x))

This would make separate means for each group and filter each sub-data frame separately, then combine the results. This currently works in dplyr and I think would be useful in DataFrames.

This is clearly achievable, but we do not have a syntax for it in DataFrames.jl as filter on GroupedDataFrame filters groups.

This is the current way to do it:

combine(groupby(df, :a)) do sdf
    filter(:x => >(mean(sdf.x)), sdf)
end

Yup! Obviously this would be harder to do in DataFramesMeta so I think DataFramesMeta should provide this. But it should be on our radar for DataFrames as well.

It could make sense to add where to DataFrames to get this behavior. (Though "where" isn't a verb, so maybe subset would be a better name?)

you mean where to work on GroupedDataFrame and filter groups?

Note that we also discussed adding where for AbstractDataFrame (that would drop missings by default and allow Pair syntax after the data frame which would be a first positional argument).

you mean where to work on GroupedDataFrame and filter groups?

I would imagine that filter would filter groups: you choose to keep or drop whole sub-data frames.

where: filter within a subdataframe, exactly as you describe above.

The reasoning for the over-loading is the same as transform. Since we are working with whole columns behavior will differ when acting on sub-data frame or the whole data frame.

I meant "filter WITHIN groups" so we are in agreement. filter selects groups as you say.

yes, I think we are in agreement.

So summarize the design for where:

  • it would take AbstractDataFrame or GroupedDataFrame as a first argument
  • it would drop rows where missing is returned
  • it would do within-group filtering for GroupedDataFrame
  • if would allow many predicates that would be combined with &&

The only question is if we would pass whole columns to where and expect it to return AbstractVector{<:Union{Bool, Missing}} OR we would work row-wise (like now in filter). Probably the former (i.e. passing whole vectors) is more useful in GroupedDataFrame context, and we can support ByRow anyway to have consistency with other functions.

Yes for consistency it would be better to pass whole vectors.

Regarding missing values, we should be very careful about the consistency with other functions. Skipping missing values here (i.e. treating them as false) but nowhere else could be weird.

Skipping missing values here (i.e. treating them as false) but nowhere else could be weird.

But I understand this is the key thing that @matthieugomez wanted.

There is no rush to add it, so let us think and discuss it well before deciding what to do.

I think where should exist, irrespective of how it treats missing values.

I also think the pair syntax should be removed for filter, at least for 1.0. For the sake of consistency, it would be nice to restrict the pair syntax to columns everywhere. Moreover it makes sense to restrict the signature of filter to Functions, as in Base.

Now I also think it would be great for where to convert missing to false by default. I understand it clashes with Base filter and getindex, but I’m not sure Base took the right decision there. Is there a GitHub issue where this was discussed? Is it something that could eventually change (since it now returns errors anyway)?

I also think the pair syntax should be removed for filter

I would be OK to drop Pair from filter if we introduce it in where. This means that we should settle on where design before 1.0 release.

Is there a GitHub issue where this was discussed?

I am not aware of such issue, but surely it could be opened if you want to discuss it in Julia Base.

Why does suppressing Pair in filter imply that where should exist before 1.0? It is fine if not everything is super performant for 1.0, no?

It is not a strict restriction but filter(predicate, df) is in practice so slow that it is unusable. So if we want to drop Pair I would prefer to have a replacement that would be efficient. Also I think that the where API design is not that complex. There are only a few things to decide and it does not interact with the rest of the ecosystem so it can be done relatively quickly.

In particular it would be good to be able to write a nice deprecation message.

In https://github.com/JuliaData/DataFrames.jl/issues/2460 @xiaodaigh proposes to consider droprows or keeprows as function names (in particular what I think we can discuss if where should not be named keeprows).

droprows or keeprows

These are very clear names to me. No question what they do. droprows(f, df) is easier to read then where(.!(f, df))

keeprows, I am ok. But if we add droprows then I think should make keeprows an alias for where.

We have almost no functions mentioning "row" in our API currently, as we decided that by default operations were row-wise (the exceptions are eachrow which comes from Base, nrow and and rownumber). So I'm not sure it would be very consistent to add one.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ahalwright picture ahalwright  ·  3Comments

abieler picture abieler  ·  7Comments

bbrunaud picture bbrunaud  ·  3Comments

cormullion picture cormullion  ·  6Comments

bkamins picture bkamins  ·  8Comments