In preparation for allowing select and transform to produce multiple columns, we have the following decision to make, as pointed out to me by @nalimilan (I also list the options we can consider).
Preamble: currently ByRow produces a result "just as passed" (no pseudo-broadcasting happens). We disallowed "tabular types" like NamedTuple for future use (envisioning that some day we might want to allow multiple columns to be returned).
Now the problem is how to distinguish if we want to pass a NamedTuple as a single column vs. turn it into several columns. In pseudo-broadcasting (which does not apply here) this is achieved by e.g. Ref but this is not something that ByRow supports.
We need to be breaking to fill this gap and possible options are:
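- by default expand "tabular types" into multiple columns, but if a Ref or 0-dimensional array is returned, unwrap it (just like in pseudo-broadcasting)
- by default allow everything and always produce a single column out of it, but add e.g. a :cols => ByRow(fun) => AsTable form (or something similar - to be discussed) that would signal that the returned value is to be unwrapped into several columns
What do you think would be better?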
CC @nalimilan, @pdeffebach, @matthieugomez
- by default allow everything and produce a single column out of it always, but add e.g. :cols => ByRow(fun) => AsTable form (or something similar - to be discussed) - that would signal that the returned value is to be unwrapped into several columns
I had this exact idea yesterday! So I think it's a good solution.
Something similar is coming up in DataFramesMeta with @based_on.
In combine you can do combine([:x, :y] => fun).
To get similar behavior in a @combine macro you would write @combine(fun(:x, :y), df).
This poses a major problem for inspecting the expression since it doesn't have a set structure the way y = :x + :z does. I was thinking of a @astable flag like
@combine(@astable myfunction(:a, :b))
So the user says "this function returns a table object". I think an @astable flag that would transform expressions to [:x, :y] => fun => AsTable would be a great parallel between DataFrames and DataFramesMeta.
I guess in this world we would also deprecate the table-as-return value in combine. This isn't strictly necessary for DataFrames, but might be good for consistency.
Whatever solution we choose, I think we should be consistent with and without ByRow:
- always require AsTable to return multiple columns, or never
- always unwrap Ref and 0-dimensional arrays, or never
Regarding how AsTable would work, my original idea was that the function would return an AsTable object, e.g. :cols => ByRow(x -> AsTable(...)). But yeah, we could also use :cols => ByRow(fun) => AsTable, since if you return a table, output names are irrelevant. The symmetry with AsTable(:cols) => ByRow(fun) => :outcol is appealing.
I think the choice of the best option depends on whether we expect it to be relatively common for code to be written in a generic way without knowing in advance whether columns will contain or not named tuples and other table objects in their data frame columns. If that's the case, then the current behavior could be problematic as code supposed to return a single column would suddenly start returning multiple columns just because the input column happened to contain e.g. named tuple entries. Though this scenario sounds quite unlikely given that very few operations probably apply either to a scalar or to a tuple.
So maybe creating multiple columns when a named tuple is returned as we do currently is OK, and we should simply provide a way to avoid it when you know you want to store a named tuple in a single column, e.g. using Ref. Then we wouldn't need to use AsTable for this.
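The idea of expanding named tuples by default while letting Ref protect a value can be sketched in plain Julia. This is only an illustration under stated assumptions: `unwrap` and `store_mode` are hypothetical helper names, not DataFrames.jl functions, and only NamedTuple is treated as a "tabular type" to keep the sketch dependency-free.

```julia
# Hypothetical helpers (not part of DataFrames.jl) illustrating the proposal.

# A Ref or a 0-dimensional array is treated as a protective wrapper.
const Wrapped = Union{Ref, AbstractArray{<:Any, 0}}

# Unwrapping returns the protected value; anything else passes through.
unwrap(x::Wrapped) = x[]
unwrap(x) = x

# Decide how a returned value would be stored under this proposal.
store_mode(x::Wrapped) = :single_column       # wrapper says "keep me as one cell"
store_mode(x::NamedTuple) = :multiple_columns # tabular value gets expanded
store_mode(x) = :single_column

nt = (a = 1, b = 2)
store_mode(nt)        # expanded into columns :a and :b
store_mode(Ref(nt))   # kept as a single column holding the named tuple
unwrap(Ref(nt))       # the original named tuple
```

The design choice shown here is that the caller opts out of expansion explicitly, which is the same convention Base broadcasting uses when Ref marks a value as a scalar.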
@pdeffebach I don't understand how @combine differs from combine regarding this issue. Can't we just require the user-provided function to return a Ref or an AsTable object (depending on the choice we make and on the intention of the user)?
The point I was trying to make above was that it's looking like DataFramesMeta will require the user to tell Julia that they want to return a table. So the parallel here is that DataFramesMeta will require an explicit Table output flag, which mirrors requiring DataFrames to have an explicit Table output flag.
Yes so that's the same situation in DataFrames and DataFramesMeta -- nothing specific to the latter.
My current thinking is the following:
- ByRow(fun) is just a shorthand for x -> fun.(x) (or similar - the point is that it is a shorthand for broadcasting) and we should leave it as it is; what matters is how the vector produced by it is processed later
- [...] ByRow as it returns vectors)
- [...] const MULTI_COLS_TYPE = Union{AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix} and we have two options:
  - [...] AsTable) - this will simplify things, but will make combine(fun, gdf/df) useless (and I think this is a form that is used quite often)
  - [...] AbstractVector{T} with a carefully selected list of types T to be also considered as multiple columns; the form fun => multiple_cols would then be useful in its own right, as when e.g. fun returns an AbstractMatrix there is no way to pass column names to it, but AsTable is not needed; if we go this way and have this list of types T, we need to add a rule that AbstractVector{Ref} and AbstractVector{0-dimensional array} get unwrapped, in order to allow functions to produce an AbstractVector{T} result that would not be treated as multiple columns
Both options are breaking:
- [...] combine(fun, df/gdf) usability
I am not sure which is better.
This is a top priority decision to be made (and implemented) now for DataFrames.jl. Can you please comment on what you think here (either one of the options or maybe something else if you feel you have a better proposal). Apart from "simple" PRs I would concentrate our efforts on deciding this issue now (after it we will tackle the skip-missing thing, but this should be done first as most likely it will lead to a significant internal redesign). Thank you!
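The description of ByRow as "just a shorthand for broadcasting" can be illustrated without DataFrames.jl at all. In the sketch below `byrow` is a hypothetical stand-in written for this example, not the library's ByRow type; it only shows that the wrapper itself does no unwrapping, so a function returning a named tuple simply yields a vector of named tuples.

```julia
# Hypothetical stand-in for ByRow: wrap `fun` so that it is broadcast
# over whatever column vectors it receives.
byrow(fun) = (cols...) -> fun.(cols...)

double = byrow(x -> 2x)
double([1, 2, 3])               # element-wise application to one column

add = byrow(+)
add([1, 2], [10, 20])           # element-wise application to two columns

# No special cases: a named tuple per row gives a vector of named tuples.
# How that vector is later stored (one column or several) is a separate step.
rows = byrow(x -> (a = x, b = 2x))([1, 2])
```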
(related issue https://github.com/JuliaData/DataFrames.jl/issues/2220 - which will follow when we settle this)
Given no more comments this is what I propose:
- select, select!, transform, transform!, and combine should support the same syntax; I will use SELECT further to mean any of them
- FUN means a ::Callable that is a transformation function
- SRC means any column selector following current rules (extending these rules has been requested but it is orthogonal)
- DST means a Union{Type{AsTable}, Symbol, AbstractVector{Symbol}, AbstractString, AbstractVector{<:AbstractString}} specifying the output column names
- DF means Union{AbstractDataFrame, GroupedDataFrame} (to shorten the notation)
- SELECT(FUN, DF) is allowed and it is the only allowed form where DF does not come first
- SELECT(DF, (FUN, FUN => DST, SRC => FUN, SRC => FUN => DST, AbstractVecOrMat{SRC => FUN, SRC => FUN => DST})...) is allowed (AbstractVecOrMat extends the currently allowed AbstractVector to allow easier programmatic generation of multiple aggregations)
  - FUN: function takes an AbstractDataFrame and may return a single or multiple columns; automatic naming
  - ~FUN => DST: function takes an AbstractDataFrame and must return as many columns as specified by DST~ (not supported; at least for now)
  - SRC => FUN: function takes an input specified by SRC and may return a single ~or multiple columns~; automatic naming
  - SRC => FUN => DST: function takes an input specified by SRC and must return columns as specified by DST
  - SRC: just reuse columns
  - SRC => DST: rename columns (this is an extension w.r.t. the current rules, for consistency)
- [...] FUN will always be only a DF (so no chaining of transformations - sorry for this, but it is problematic especially in GroupedDataFrame, and it would also be breaking; users will need to chain SELECT calls to get this behavior; in particular chaining would make it problematic to use multithreading in the future, as with multithreading we will be able to process many FUN at the same time)
- ByRow will have no special cases - it will be just a shorthand for a broadcast operation
- Ref or AbstractArray{<:Any, 0} are unwrapped and recycled to a vector
- AbstractArrays of dimension higher than two throw an error
- [...] do-block notation
- AbstractDataFrame, NamedTuple of vectors and AbstractMatrix: are expanded to multiple columns and treated as is
- DataFrameRow: is expanded to multiple columns and recycled
- NamedTuple of non-vectors (mixing vectors and scalars is disallowed as currently): expanded to multiple columns and recycled (like DataFrameRow)
- AbstractVectors are treated as is
- if DST is AsTable or a vector of column names then:
  - FUN returns a vector: each wrapped row is expanded to multiple columns with the keys and values functions (so in particular we require a wrapped object to support these two functions) to get column names and cell entries (this will be type unstable, so it is not the most efficient and is provided only for convenience); if keys are integers, then for convenience x is prepended to the column name (like for AbstractMatrix currently)
  - FUN returns something else: Tables.columntable is applied to this value and later processing is done as for NamedTuple of vectors (so in particular for AbstractDataFrame, NamedTuple, DataFrameRow, and AbstractMatrix this is a no-op, as they will be pre-processed to multiple columns anyway)
- [...] DST is specified and these are multiple column names then they OVERRIDE column names generated by FUN without an error if the names do not match (but column count must match)
- [...] DF is a GroupedDataFrame
The key difficulty is in Rule 12 (and it probably requires some thought before commenting on - but it is the key point that is meant to address the raised issues).
A general rationale:
- [...] AsTable or "multiple column names" as DST that you want to make many columns out of a return value
- [...] AsTable as DST will produce)
It would be great to get feedback on this relatively quickly (so that we can move forward with the implementation, which will not be easy).
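The rule for a vector return value combined with AsTable (or column names) as DST can be sketched in plain Julia. This is a rough illustration, not the DataFrames.jl implementation: `expand_rows` is a hypothetical name, the pair of functions used is assumed to be keys and values, and the result is returned as a NamedTuple of vectors rather than as data frame columns.

```julia
# Hypothetical helper: expand a vector of row-like objects into named columns,
# using `keys` for column names and `values` for cell entries.
function expand_rows(rows::AbstractVector)
    isempty(rows) && throw(ArgumentError("no rows to expand"))
    ks = collect(keys(first(rows)))
    # Integer keys (e.g. from tuples or vectors) get an "x" prefix.
    colnames = [k isa Integer ? Symbol("x", k) : Symbol(k) for k in ks]
    cols = [Any[] for _ in ks]
    for row in rows
        collect(keys(row)) == ks ||
            throw(ArgumentError("rows have inconsistent keys"))
        for (col, v) in zip(cols, values(row))
            push!(col, v)
        end
    end
    # Narrow the element types after collection; this loop is type unstable,
    # which matches the caveat in the rule about efficiency.
    narrowed = map(c -> identity.(c), cols)
    return NamedTuple{Tuple(colnames)}(Tuple(narrowed))
end

from_named = expand_rows([(a = 1, b = "x"), (a = 2, b = "y")])
from_tuples = expand_rows([(1, 2.0), (3, 4.0)])
```

With named tuples the column names come from the field names; with plain tuples the integer keys 1 and 2 become x1 and x2.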
I approve of all of these rules. One comment,
SELECT(DF, FUN => DEST)
where FUN acts on an AbstractDataFrame is currently disallowed and seems orthogonal to the issue of having select etc. produce multiple columns. Do we really need it before 1.0?
We do not strictly need it, but:
FUN is currently allowed, so it seemed consistent to allow FUN => DEST
having said that - it will be simpler for me not to allow it if we do not feel it is needed now (indeed if it is just FUN it is probably then easy enough to ensure that what is returned conforms to one of the supported "automatic" conversion to multiple column types).
So do you vote to drop it?
I vote to drop it.
- [...] dplyr)
- [...] source declarations, i.e. AsTable. I would rather see an AsDataFrame input for clarity.
- I like the simplicity of this :cols => ByRow(fun) => AsTable, where you can decide to make more features for the output or just a single feature (from the ML perspective of feature extraction/transformation).
Another thought:
10. Passed transformations will be now processed eagerly; i.e. if you pass several transformations they will sequentially construct a data frame (and I hope to make it in a non-breaking way)
If this messes with multi-threading then we don't need to implement this. I don't know how multi-threading works but if somehow we get a big performance improvement by not doing eager transformations, i.e. transform(df, x = f(y), z = f(x)), then I think it's fine to require those two transformations to be in separate transform calls.
If this messes with multi-threading then we don't need to implement this.
We have to implement it anyway because we want to allow returning multiple columns from functions. So if you have:
select(df, fun1, fun2)
and fun1 creates columns :a and :b and fun2 creates columns :a and :c then the resulting data frame will have columns :a , :b, :c (in this sequence), where column :a will be taken from fun2, column :b from fun1 and column :c from fun2.
And in the process of construction :a will initially be from fun1 but later it will be overwritten by fun2.
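The ordering and overwriting described above happens to match what Base's merge does for named tuples, so it can be previewed without DataFrames.jl. In this sketch `fun1_out` and `fun2_out` are made-up stand-ins for the values returned by fun1 and fun2.

```julia
# Stand-ins for what fun1 and fun2 would return (made-up data).
fun1_out = (a = [1, 2], b = [3, 4])
fun2_out = (a = [10, 20], c = [5, 6])

# merge keeps the position of the first occurrence of each name,
# but takes the value from the rightmost named tuple that has it.
combined = merge(fun1_out, fun2_out)

keys(combined)   # column order: a, b, c
combined.a       # taken from fun2_out, overwriting fun1_out's column
```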
This has the following corner case, and we should decide what to do in it. The current behavior is:
julia> using DataFrames
julia> df = DataFrame(rand(3,4))
3×4 DataFrame
│ Row │ x1        │ x2       │ x3       │ x4       │
│     │ Float64   │ Float64  │ Float64  │ Float64  │
├─────┼───────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.992606  │ 0.279633 │ 0.85221  │ 0.991694 │
│ 2   │ 0.103514  │ 0.799042 │ 0.427002 │ 0.569371 │
│ 3   │ 0.0134351 │ 0.720908 │ 0.243076 │ 0.487313 │
julia> select(df, :x1 => (x -> 1) => :x1, :)
3×4 DataFrame
│ Row │ x1    │ x2       │ x3       │ x4       │
│     │ Int64 │ Float64  │ Float64  │ Float64  │
├─────┼───────┼──────────┼──────────┼──────────┤
│ 1   │ 1     │ 0.279633 │ 0.85221  │ 0.991694 │
│ 2   │ 1     │ 0.799042 │ 0.427002 │ 0.569371 │
│ 3   │ 1     │ 0.720908 │ 0.243076 │ 0.487313 │
julia> select(df, :x2 => :x1, :)
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.279633 │ 0.279633 │ 0.85221  │ 0.991694 │
│ 2   │ 0.799042 │ 0.799042 │ 0.427002 │ 0.569371 │
│ 3   │ 0.720908 │ 0.720908 │ 0.243076 │ 0.487313 │
julia> select(df, :x2 => :x1, :x1)
ERROR: ArgumentError: duplicate target column name x1 passed
julia> select(df, :x2 => :x1, [:x1, :x2])
3×2 DataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.279633 │ 0.279633 │
│ 2   │ 0.799042 │ 0.799042 │
│ 3   │ 0.720908 │ 0.720908 │
as we had a static rule of resolving transformations. Now - as we have to do a dynamic resolution of column names the question is what rules do we find natural to use. The possible rules are:
OK - given no more comments I am starting the implementation of the rules described here. In particular following the "always throw an error on duplicate column name except when there was a multi-column selector passed without a transformation, which is silently overwritten if it went first or silently ignored if it went second" rules of handling duplicates.
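The duplicate-handling rule quoted above can be sketched as a small name-resolution pass. This is a simplified model, not the DataFrames.jl code: `resolve_duplicates` is a hypothetical name, and each requested output column is represented as a tuple of its name and a flag saying whether it came from a multi-column selector passed without a transformation.

```julia
# Hypothetical model of the rule. Each entry is (name, from_multi_selector).
# Returns `name => index of the entry that supplies the column`, in output order.
function resolve_duplicates(entries)
    order = Symbol[]                    # output column order
    winner = Dict{Symbol, Int}()        # which entry supplies each column
    overwritable = Dict{Symbol, Bool}() # true if current winner is a bare selector
    for (i, (name, from_multi_selector)) in enumerate(entries)
        if !haskey(winner, name)
            push!(order, name)
            winner[name] = i
            overwritable[name] = from_multi_selector
        elseif from_multi_selector
            continue                    # selector went second: silently ignored
        elseif overwritable[name]
            winner[name] = i            # selector went first: silently overwritten
            overwritable[name] = false
        else
            throw(ArgumentError("duplicate target column name $name passed"))
        end
    end
    return [name => winner[name] for name in order]
end

# Models select(df, :x2 => :x1, :) on columns x1..x4:
# the rename supplies x1, and the `:` selector's own x1 is ignored.
renamed_then_all = resolve_duplicates(
    [(:x1, false), (:x1, true), (:x2, true), (:x3, true), (:x4, true)])

# Models a multi-column selector first, then a transformation writing x1.
all_then_transform = resolve_duplicates([(:x1, true), (:x2, true), (:x1, false)])
```

Two explicit targets with the same name, such as the `:x2 => :x1, :x1` call shown earlier, fall into the final branch and raise the ArgumentError.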
CC @nalimilan
A small update: in SRC => FUN I recommend that we assume a single column.
Only FUN will allow multiple columns implicitly.
You will have to write SRC => FUN => AsTable or SRC => FUN => column names to get multiple columns when passing SRC. If you return AbstractDataFrame, NamedTuple, DataFrameRow, or AbstractMatrix from SRC => FUN then you will get an error. I think it is safer as returning multiple columns is not a typical operation so it is better to require user to be explicit. We can always change it in the future (it will error now, so adding the support for this will be non-breaking).
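The "be explicit or get an error" behavior for SRC => FUN can be sketched as a simple check on the returned value. This is an assumption-laden illustration: `check_return` and `MULTI_COLS` are hypothetical names, and AbstractDataFrame and DataFrameRow are left out of the union so the example runs without DataFrames.jl.

```julia
# Hypothetical check. AbstractDataFrame and DataFrameRow would also belong
# in this union, but are omitted to keep the sketch dependency-free.
const MULTI_COLS = Union{NamedTuple, AbstractMatrix}

# `has_dst` is true when the user wrote SRC => FUN => AsTable
# or SRC => FUN => column names.
function check_return(res, has_dst::Bool)
    if res isa MULTI_COLS && !has_dst
        throw(ArgumentError(
            "multiple columns returned; pass AsTable or column names as destination"))
    end
    return res
end

check_return([1, 2, 3], false)        # a plain vector is fine without DST
check_return((a = [1], b = [2]), true) # multi-column value with explicit DST
```

Because the implicit case throws instead of guessing, allowing it later would not break code that works under this rule.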
(in general - I am moving forward and have a working prototype for AbstractDataFrame source).
One more restriction. If you pass a vector or matrix of transformations it must be vector or matrix of Pair. FUN is not allowed as then most likely the vector would have eltype equal to Any and it would lead to potential ambiguities when passing vectors as column selectors. Again - we can work on a more flexible approach later, but I think it is not too restrictive to disallow it.
Another "progress report".
We are making the rules increasingly complex.
This means that compilation time of e.g. select for AbstractDataFrame went up from 0.2 sec to 0.3 sec. I special case simple select in a separate method to avoid this compilation, but it essentially means that this "constant cost" goes up and will have to be paid any time a new transformation function is passed to select (most common case is when an anonymous function is used).
Is this something we can use SnoopCompile to try and fix? Do you know what's causing the large compilation time?
Do you know what's causing the large compilation time?
Complexity of the logic we want to support causes it (it is over 250 LOC). And every time a new anonymous function is passed it has to be recompiled. I will try to split out subfunctions so that only the "core" that changes has to be recompiled.
Normally the compiler won't specialize a method on the type of a function that is passed as an argument unless that function is called inside the body of the method (as opposed to passed to another method). So if you move the caller to a separate function, the main one shouldn't have to be recompiled, right?
So if you move the caller to a separate function, the main one shouldn't have to be recompiled, right?
It did not help. What helps is adding @nospecialize as we have in split-apply-combine code.
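For readers unfamiliar with @nospecialize, here is a minimal sketch of how it is applied to a function-valued argument. `run_transformation` is a hypothetical name and the body is deliberately tiny; the macro is only a hint to the compiler, so this shows the syntax rather than guaranteeing any particular compile-time saving.

```julia
# Hypothetical driver: the bulky, function-independent logic would live here.
# @nospecialize asks the compiler not to compile a fresh copy of this method
# for every distinct `fun` type (each anonymous function has its own type).
function run_transformation(@nospecialize(fun), col::AbstractVector)
    res = fun(col)   # resolved by dynamic dispatch at run time
    # Recycle a scalar result to the column length; keep vectors as they are.
    return res isa AbstractVector ? res : fill(res, length(col))
end

run_transformation(sum, [1, 2, 3])          # scalar result, recycled
run_transformation(x -> 2 .* x, [1, 2])     # vector result, kept as is
```

The trade-off is that the call to `fun` inside the method is no longer statically resolved, which is usually acceptable when the surrounding logic dominates compilation cost.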