DataFrames.jl: [BREAKING] Handling of ByRow

Created on 4 Sep 2020 · 21 Comments · Source: JuliaData/DataFrames.jl

In preparation for allowing select and transform to produce multiple columns, we have the following decision to make, as pointed out to me by @nalimilan (I also list the options we can consider).

Preamble: currently ByRow produces a result "just as passed" (no pseudo-broadcasting happens). We disallowed "tabular types" like NamedTuple for future use (envisioning that some day we might want to allow multiple columns to be returned).

Now the problem is how to distinguish if we want to pass a NamedTuple as a single column vs. turn it into several columns. In pseudo-broadcasting (which does not apply here) this is achieved by e.g. Ref but this is not something that ByRow supports.

We need to be breaking to fill this gap and possible options are:

- if Ref or 0-dimensional array is returned unwrap it (just like in pseudo-broadcasting)
- by default allow everything and produce a single column out of it always, but add e.g. :cols => ByRow(fun) => AsTable form (or something similar - to be discussed) - that would signal that the returned value is to be unwrapped into several columns
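
To make the second option concrete, here is a minimal sketch. It assumes the `=> AsTable` destination form as it later shipped in DataFrames.jl 1.x (at the time of this post it was only a proposal); the function `f` and the column names are hypothetical.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])

# Hypothetical row-wise function that returns a NamedTuple.
f(x) = (a = x, b = 2x)

# Single destination name: the NamedTuples are stored as is, in one column.
one_col = select(df, :x => ByRow(f) => :nt)

# AsTable destination: each NamedTuple is unwrapped into columns :a and :b.
many_cols = select(df, :x => ByRow(f) => AsTable)
```

With this form the caller, not the return type, decides whether a NamedTuple is kept as a cell value or spread over several columns.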

What do you think would be better?

CC @nalimilan, @pdeffebach, @matthieugomez

breaking question

bkamins

All 21 comments

> by default allow everything and produce a single column out of it always, but add e.g. :cols => ByRow(fun) => AsTable form (or something similar - to be discussed) - that would signal that the returned value is to be unwrapped into several columns

I had this exact idea yesterday! So I think it's a good solution.

Something similar is coming up in DataFramesMeta with @based_on.

In combine you can do combine([:x, :y] => fun).

To get similar behavior with a @combine macro you would write @combine(fun(:x, :y), df).

This poses a major problem for inspecting the expression since it doesn't have a set structure the way y = :x + :z does. I was thinking of a @astable flag like

@combine(@astable myfunction(:a, :b))

So the user says "this function returns a table object". I think an @astable flag that would transform expressions to [:x, :y] => fun => AsTable would be a great parallel between DataFrames and DataFramesMeta.

I guess in this world we would also deprecate the table-as-return value in combine. This isn't strictly necessary for DataFrames, but might be good for consistency.

pdeffebach on 5 Sep 2020

Whatever solution we choose, I think we should be consistent with and without ByRow:

- either always require AsTable to return multiple columns, or never
- either always unwrap Ref and 0-dimensional arrays, or never

Regarding how AsTable would work, my original idea was that the function would return an AsTable object, e.g. :cols => ByRow(x -> AsTable(...)). But yeah we could also use :cols => ByRow(fun) => AsTable, since if you return a table output names are irrelevant. The symmetry with AsTable(:cols) => ByRow(fun) => :outcol is appealing.
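
The symmetry mentioned above can be sketched as follows. This assumes the `AsTable` source and destination forms as available in DataFrames.jl 1.x; the data and column names are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [10, 20])

# AsTable on the source side: each row reaches the function as a NamedTuple,
# and the result is stored in the single column :s.
row_sum = transform(df, AsTable([:a, :b]) => ByRow(nt -> nt.a + nt.b) => :s)

# AsTable on the destination side: the NamedTuple returned for each row
# is spread over the columns :half and :dbl.
expanded = transform(df, :a => ByRow(x -> (half = x / 2, dbl = 2x)) => AsTable)
```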

I think the choice of the best option depends on whether we expect it to be relatively common for code to be written generically, without knowing in advance whether the data frame columns will contain named tuples or other table objects. If that's the case, then the current behavior could be problematic, as code supposed to return a single column would suddenly start returning multiple columns just because the input column happened to contain e.g. named tuple entries. This scenario sounds quite unlikely, though, given that very few operations probably apply either to a scalar or to a tuple.

So maybe creating multiple columns when a named tuple is returned as we do currently is OK, and we should simply provide a way to avoid it when you know you want to store a named tuple in a single column, e.g. using Ref. Then we wouldn't need to use AsTable for this.

@pdeffebach I don't understand how @combine differs from combine regarding this issue. Can't we just require the user-provided function to return a Ref or an AsTable object (depending on the choice we make and on the intention of the user)?

nalimilan on 7 Sep 2020

> @pdeffebach I don't understand how @combine differs from combine regarding this issue. Can't we just require the user-provided function to return a Ref or an AsTable object (depending on the choice we make and on the intention of the user)?

The point I was trying to make above was that it's looking like DataFramesMeta will require the user to tell Julia that they want to return a table. So the parallel here is that DataFramesMeta will require an explicit Table output flag, which mirrors requiring DataFrames to have an explicit Table output flag.

pdeffebach on 7 Sep 2020

Yes so that's the same situation in DataFrames and DataFramesMeta -- nothing specific to the latter.

nalimilan on 7 Sep 2020

My current thinking is the following:

- ByRow(fun) is just a shorthand for x -> fun.(x) (or similar - the point is that it is a shorthand for broadcasting) and we should leave it as it is; what matters is how the vector produced by it is processed later
- therefore "unwrap Ref and 0-dimensional arrays" does not apply to it at all - we unwrap these two only if they get returned from a function (and they obviously cannot be returned by ByRow as it returns vectors)
- currently we define what is considered to hold multiple columns as const MULTI_COLS_TYPE = Union{AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix} and we have two options:
  - either remove this rule - and never consider anything to hold multiple columns by default (and require e.g. AsTable) - this will simplify things, but will make combine(fun, gdf/df) useless (and I think this is a form that is used quite often)
  - or extend this rule to allow AbstractVector{T} with a carefully selected list of types T to be also considered as multiple columns; the form fun => multiple_cols would then be useful in its own right, as when e.g. fun returns an AbstractMatrix there is no way to pass column names to it, but AsTable is not needed; if we go this way and have this list of types T, we need to add a rule that AbstractVector{Ref} and AbstractVector{0-dimensional array} get unwrapped, in order to allow functions to produce an AbstractVector{T} result that would not be treated as multiple columns.
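
The first point, that ByRow is only a broadcasting shorthand, can be checked directly, since a `ByRow` object is callable on vectors. A small sketch with made-up inputs:

```julia
using DataFrames

v = [1, 4, 9]

# Calling the ByRow wrapper on a vector broadcasts the wrapped function.
by_row = ByRow(sqrt)(v)
broadcasted = sqrt.(v)

# Several source columns arrive as separate positional arguments.
sums = ByRow(+)([1, 2], [10, 20])
```

Because the result is always a vector, any unwrapping or multi-column logic has to act on that vector afterwards, not inside ByRow itself.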

Both options are breaking:

- the first one - is significantly breaking and will limit combine(fun, df/gdf) usability
- the second one - is minimally breaking (will change the behavior in corner cases that probably never happened in practice), but will be more complex to implement/explain/maintain (the list of exceptions will get longer)

I am not sure which is better.

bkamins on 7 Sep 2020

This is a top priority decision to be made (and implemented) now for DataFrames.jl. Can you please comment on what you think here (either one of the options or maybe something else if you feel you have a better proposal). Apart from "simple" PRs I would concentrate our efforts on deciding this issue now (after it we will tackle the skip-missing thing, but this should be done first as most likely it will lead to a significant internal redesign). Thank you!

bkamins on 13 Sep 2020

(related issue https://github.com/JuliaData/DataFrames.jl/issues/2220 - which will follow when we settle this)

Given no more comments this is what I propose:

1. All functions: select, select!, transform, transform!, and combine should support the same syntax; I will use SELECT further to mean any of them
2. When I write FUN this means a ::Callable that is a transformation function
3. When I write SRC it means any column selector following current rules (extending these rules has been requested but it is orthogonal)
4. When I write DST it means a Union{Type{AsTable}, Symbol, AbstractVector{Symbol}, AbstractString, AbstractVector{<:AbstractString}} specifying the output column names
5. When I write DF it means Union{AbstractDataFrame, GroupedDataFrame} (to shorten the notation)
6. A form SELECT(FUN, DF) is allowed and it is the only allowed form where DF does not come first
7. A form SELECT(DF, (FUN, FUN => DST, SRC => FUN, SRC => FUN => DST, AbstractVecOrMat{SRC => FUN, SRC => FUN => DST})...) is allowed (AbstractVecOrMat extends the currently allowed AbstractVector to allow easier programmatic generation of multiple aggregations)
8. The meanings of the specific transformation specifications are:
   - FUN: function takes an AbstractDataFrame and may return a single or multiple columns; automatic naming
   - ~FUN => DST: function takes an AbstractDataFrame and must return as many columns as specified by DST~ (not supported; at least for now)
   - SRC => FUN: function takes an input specified by SRC and may return a single ~or multiple columns~; automatic naming
   - SRC => FUN => DST: function takes an input specified by SRC and must return columns as specified by DST
   - SRC: just reuse columns
   - SRC => DST: rename columns (this is an extension WRT the current rules for consistency)
9. The source of columns for FUN will always be only a DF (so no chaining of transformations - sorry for this, but it is problematic especially in GroupedDataFrame, and it would also be breaking; users will need to chain SELECT calls to get this behavior; in particular chaining would make it problematic to use multithreading in the future, as with multithreading we will be able to process many FUN at the same time)
10. Passed transformations will be now processed eagerly; i.e. if you pass several transformations they will sequentially construct a data frame (and I hope to make it in a non-breaking way)
11. ByRow will have no special cases - it will be just a shorthand for a broadcast operation
12. The pseudo-broadcasting rules are the following:
   - Ref or AbstractArray{<:Any, 0} are unwrapped and recycled to a vector
   - AbstractArrays of dimension higher than two throw an error
   - Legacy behavior for multiple columns is retained in a non-breaking way, and in particular allows for do-block notation
     - Anything that is an AbstractDataFrame, a NamedTuple of vectors, or an AbstractMatrix: expanded to multiple columns and treated as is
     - DataFrameRow: expanded to multiple columns and recycled
     - NamedTuple of non-vectors (mixing vectors and scalars is disallowed as currently): expanded to multiple columns and recycled (like DataFrameRow)
   - other AbstractVectors are treated as is
   - All else is treated as a scalar and recycled to a vector
   - if DST is AsTable or a vector of column names then:
     - if FUN returns a vector: each wrapped row is expanded to multiple columns with the keys and values functions (so in particular we require a wrapped object to support these two functions) to get column names and cell entries (this will be type unstable, so it is not the most efficient and is provided only for convenience); if keys are integers, then for convenience x is prepended to the column name (like for AbstractMatrix currently)
     - if FUN returns something else - then Tables.columntable is applied to this value and later processing is done as for NamedTuple of vectors (so in particular for AbstractDataFrame, NamedTuple, DataFrameRow, and AbstractMatrix this is a no-op as they will be pre-processed to multiple columns anyway)
   - if DST is specified and these are multiple column names then they OVERRIDE column names generated by FUN without an error if the names do not match (but column count must match)
   - it is disallowed to mix: scalar, vector, multiple columns scalar, and multiple columns vector types if DF is a GroupedDataFrame
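
A few of the pseudo-broadcasting cases listed above can be sketched as follows. This assumes the rules behave as in the DataFrames.jl 1.x releases that followed this discussion; the data and output column names are hypothetical.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])

# A scalar result is recycled to the number of rows in select.
scalar = select(df, :x => sum => :total)

# Ref is unwrapped, so the wrapped vector is treated as one value and recycled.
wrapped = select(df, :x => (x -> Ref([1, 2])) => :v)

# A plain vector result is used as is.
vector = select(df, :x => (x -> 2 .* x) => :y)

# With AsTable as DST, a NamedTuple of scalars is expanded to columns and recycled.
table = select(df, :x => (x -> (lo = minimum(x), hi = maximum(x))) => AsTable)
```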

The key difficulty is in Rule 12 (and it probably requires some thought before commenting on - but it is the key point that is meant to address the raised issues).

A general rationale:

- all functions will behave in exactly the same way (except for the rules for how rows are handled)
- we retain legacy behavior for multiple columns for convenience and to be non-breaking, but in general it is assumed that you should signal with AsTable or "multiple column names" as DST that you want to make many columns out of a return value
- processing requests sequentially means that we will not do a static validation of whether the transformations are "non overlapping" in their resulting column names (as we currently do), but it will be done lazily (this is required as we do not know what AsTable as DST will produce)

It would be great to get feedback on this relatively quickly (so that we can move forward with the implementation, which will not be easy).

bkamins on 15 Sep 2020

👍3

I approve of all of these rules. One comment,

SELECT(DF, FUN => DEST)

where FUN acts on an AbstractDataFrame is currently disallowed and seems orthogonal to the issue of having select etc. produce multiple columns. Do we really need it before 1.0?

pdeffebach on 20 Sep 2020

We do not strictly need it, but:

FUN is currently allowed, so it seemed consistent to allow FUN => DEST

having said that - it will be simpler for me not to allow it if we do not feel it is needed now (indeed if it is just FUN it is probably then easy enough to ensure that what is returned conforms to one of the supported "automatic" conversion to multiple column types).

So do you vote to drop it?

bkamins on 20 Sep 2020

I vote to drop it.

- It's less performant. We should be working to make performant things syntactically easy, rather than just give an easier syntax to do a transformation
- It is not a very frequently requested feature (though it is available in dplyr)
- It's a "hard to see" syntax and breaks the pattern of explicit source declarations, i.e. AsTable. I would rather see an AsDataFrame input for clarity.

pdeffebach on 20 Sep 2020

I like the simplicity of this :cols => ByRow(fun) => AsTable where you can decide to make more features for the output or just a single feature (from ML perspective of feature extraction/transformation).

ppalmes on 20 Sep 2020

👍2

Another thought:

> 10. Passed transformations will be now processed eagerly; i.e. if you pass several transformations they will sequentially construct a data frame (and I hope to make it in a non-breaking way)

If this messes with multi-threading then we don't need to implement this. I don't know how multi-threading works but if somehow we get a big performance improvement by not doing eager transformations, i.e. transform(df, x = f(y), z = f(x)), then I think it's fine to require those two transformations to be in separate transform calls.

pdeffebach on 21 Sep 2020

> If this messes with multi-threading then we don't need to implement this.

We have to implement it anyway because we want to allow returning multiple columns from functions. So if you have:

select(df, fun1, fun2)

and fun1 creates columns :a and :b and fun2 creates columns :a and :c then the resulting data frame will have columns :a, :b, :c (in this sequence), where column :a will be taken from fun2, column :b from fun1 and column :c from fun2.

And in the process of construction :a will initially be from fun1 but later it will be overwritten by fun2.

This leaves the following corner case, for which we need to decide what to do. The current behavior is:

julia> using DataFrames

julia> df = DataFrame(rand(3,4))
3×4 DataFrame
│ Row │ x1        │ x2       │ x3       │ x4       │
│     │ Float64   │ Float64  │ Float64  │ Float64  │
├─────┼───────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.992606  │ 0.279633 │ 0.85221  │ 0.991694 │
│ 2   │ 0.103514  │ 0.799042 │ 0.427002 │ 0.569371 │
│ 3   │ 0.0134351 │ 0.720908 │ 0.243076 │ 0.487313 │

julia> select(df, :x1 => (x -> 1) => :x1, :)
3×4 DataFrame
│ Row │ x1    │ x2       │ x3       │ x4       │
│     │ Int64 │ Float64  │ Float64  │ Float64  │
├─────┼───────┼──────────┼──────────┼──────────┤
│ 1   │ 1     │ 0.279633 │ 0.85221  │ 0.991694 │
│ 2   │ 1     │ 0.799042 │ 0.427002 │ 0.569371 │
│ 3   │ 1     │ 0.720908 │ 0.243076 │ 0.487313 │

julia> select(df, :x2 => :x1, :)
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.279633 │ 0.279633 │ 0.85221  │ 0.991694 │
│ 2   │ 0.799042 │ 0.799042 │ 0.427002 │ 0.569371 │
│ 3   │ 0.720908 │ 0.720908 │ 0.243076 │ 0.487313 │

julia> select(df, :x2 => :x1, :x1)
ERROR: ArgumentError: duplicate target column name x1 passed

julia> select(df, :x2 => :x1, [:x1, :x2])
3×2 DataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.279633 │ 0.279633 │
│ 2   │ 0.799042 │ 0.799042 │
│ 3   │ 0.720908 │ 0.720908 │

This is because we had a static rule for resolving transformations. Now, as we have to resolve column names dynamically, the question is which rules we find natural to use. The possible rules are:

- always throw an error on duplicate (I think it is too rigorous)
- always accept the last value produced (this is not what we have currently so it would be breaking)
- always throw an error on duplicate column name except when there was a multi-column selector passed without a transformation, which is silently overwritten if it went first or silently ignored if it went second (THIS IS WHAT I THINK WE SHOULD USE)
- always accept the last value produced except when there was a multi-column selector passed without a transformation, which is silently overwritten if it went first or silently ignored if it went second (again - this is not what we do now)

bkamins on 21 Sep 2020

OK - given no more comments I am starting the implementation of the rules described here. In particular, I am following the "always throw an error on duplicate column name except when there was a multi-column selector passed without a transformation, which is silently overwritten if it went first or silently ignored if it went second" rule for handling duplicates.

CC @nalimilan

bkamins on 24 Sep 2020

👍1

A small update: in SRC => FUN I recommend that we assume a single column.

Only FUN will allow multiple columns implicitly.

You will have to write SRC => FUN => AsTable or SRC => FUN => column names to get multiple columns when passing SRC. If you return AbstractDataFrame, NamedTuple, DataFrameRow, or AbstractMatrix from SRC => FUN then you will get an error. I think it is safer as returning multiple columns is not a typical operation so it is better to require user to be explicit. We can always change it in the future (it will error now, so adding the support for this will be non-breaking).
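
A minimal sketch of this restriction, assuming it behaves as in the DataFrames.jl 1.x releases; the helper `stats` and the output names are hypothetical. The exact exception type and message are not asserted here, only that an error is raised.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])

# Hypothetical helper returning a NamedTuple of scalars.
stats(x) = (lo = minimum(x), hi = maximum(x))

# SRC => FUN alone assumes a single column, so a NamedTuple result is rejected.
threw = try
    combine(df, :x => stats)
    false
catch
    true
end

# Being explicit about multiple output columns works in both forms.
ok_table = combine(df, :x => stats => AsTable)
ok_named = combine(df, :x => stats => [:smallest, :largest])
```

In the last call the names given as DST replace the `lo` and `hi` names produced by the function.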

(in general - I am moving forward and have a working prototype for AbstractDataFrame source).

bkamins on 26 Sep 2020

One more restriction. If you pass a vector or matrix of transformations it must be vector or matrix of Pair. FUN is not allowed as then most likely the vector would have eltype equal to Any and it would lead to potential ambiguities when passing vectors as column selectors. Again - we can work on a more flexible approach later, but I think it is not too restrictive to disallow it.
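
For illustration, vectors and matrices of Pair are most easily produced by broadcasting `=>`. The sketch below assumes DataFrames.jl 1.x behavior and uses made-up data.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# Broadcasting => over a vector of columns gives a Vector of Pairs.
by_vector = combine(df, [:a, :b] .=> sum)

# Broadcasting against a row of functions gives a Matrix of Pairs,
# one Pair for every (column, function) combination.
by_matrix = combine(df, [:a, :b] .=> [sum maximum])
```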

bkamins on 27 Sep 2020

Another "progress report".

We are making the rules increasingly complex.

This means that compilation time of e.g. select for AbstractDataFrame went up from 0.2 sec to 0.3 sec. I special case simple select in a separate method to avoid this compilation, but it essentially means that this "constant cost" goes up and will have to be paid any time a new transformation function is passed to select (most common case is when an anonymous function is used).

bkamins on 30 Sep 2020

Is this something we can use SnoopCompile to try and fix? Do you know what's causing the large compilation time?

pdeffebach on 30 Sep 2020

> Do you know what's causing the large compilation time?

Complexity of the logic we want to support causes it (it is over 250 LOC). And every time a new anonymous function is passed it has to be recompiled. I will try to split out subfunctions so that only the "core" that changes has to be recompiled.

bkamins on 30 Sep 2020

Normally the compiler won't specialize a method on the type of a function that is passed as an argument unless that function is called inside the body of the method (as opposed to passed to another method). So if you move the caller to a separate function, the main one shouldn't have to be recompiled, right?
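
The pattern being suggested could look roughly like the sketch below. The function names are hypothetical and are not the actual DataFrames.jl internals; the point is only that the large method forwards `f` without calling it, while a small kernel does the call.

```julia
# Entry point: `f` is only passed through, so Julia's specialization
# heuristics normally do not compile a new copy of this method per function.
function apply_transformation(f, col::AbstractVector)
    prepared = collect(col)   # stands in for the large, function-independent logic
    return _call_kernel(f, prepared)
end

# Kernel: `f` is called here, so only this small method is specialized
# for each new (e.g. anonymous) function.
_call_kernel(f, col) = map(f, col)

result = apply_transformation(x -> x + 1, [1, 2, 3])
```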

nalimilan on 30 Sep 2020