I have been playing around with the new select merged into master. And inspired by this discourse thread I have been trying to emulate Stata's rowtotal command.
Stata's rowtotal from egen looks like
egen y = rowtotal(x1 x2 x3)
Where it skips missing values and if all values are missing, sets the result to missing.
In julia, we almost have this behavior with skipmissing except the sum would be 0 instead of missing, which is fine. It's more consistent anyways and can be achieved with some keyword arguments in egen.
In the new select function, I tried to write the following and got an error.
julia> select(df, :, [:x1, :x2] => ByRow(x -> sum(x)) => :y)
ERROR: MethodError: no method matching (::var"#46#47")(::Float64, ::Float64)
The correct syntax is
julia> df = DataFrame(rand(10, 5))
julia> select(df, :, [:x1, :x2] => ByRow((x...,) -> sum(skipmissing(x))) => :y)
The error comes from the fact that the anonymous function written above only accepts one input argument.
We can't write
julia> select(df, :, [:x1, :x2] => ByRow(+) => :y)
because there is no way to pass skipmissings to this.
The splatting is a bit awkward and there could be a performance penalty to making a 100-length Tuple. But I don't know how the compiler is handling these cases.
cc @juliohm
cc @jmboehm
EDIT: Any discussion has to also consider the merits of a more complicated transform over something like
julia> df.y = map(eachrow(df[!,[:x1, :x2, :x3]])) do row
sum(skipmissing(row))
end
EDIT: There are more concerns when we have an empty collection. Note that
sum(skipmissing(Union{Int, Missing}[missing, missing]))
works because Base.mapreduce_empty has a method for Int.
However
sum(skipmissing((missing, missing)))
will fail because the collection is empty and there is no "backup" eltype. The latter is what will be called with the working select method above. Something like
julia> select(df, :, [:x1, :x2] => ByRow((x...,) -> sum(skipmissing(collect(x)))) => :y)
will also fail because collect(x) will return Missing[missing, missing] which has no "backup" eltype.
I can't think of any obvious solutions that don't rely on a lot of machinery to make a Vector{Union{T, Missing}} of a row and yet somehow also know what T is.
That's actually one situation where passing a named tuple is more useful since it stores the information about the column eltype:
julia> sum(skipmissing(NamedTuple{(:a,:b), Tuple{Union{Int,Missing}, Union{Int,Missing}}}((missing, missing))))
0
The plan is to allow this via select(df, NT([:x1, :x2]) => ByRow(x -> sum(skipmissing(x))) => newcol). But we need to find a better name for NT.
It's not very intuitive that you'd like to request names in order to be able to compute the sum -- maybe the name could be more general, to indicate that you get an object representing a row. Also it has the drawback that the function needs to be recompiled for each set of names even if you don't care about them.
That's an interesting solution. It's a shame this doesn't work for normal Tuples.
julia> t = Tuple{Union{Int, Missing}, Union{Int, Missing}}([1,2])
(1, 2)
julia> typeof(t)
Tuple{Int64,Int64}
I do feel like this is going to come up a lot and making someone make a named tuple is a bit esoteric for new data-oriented users.
I will also note that the behavior you describe doesn't work for a DataFrameRow, which should parallel NamedTuple.
You can't even get around it by collecting a DataFrameRow. You have to do convert(Vector, dfr) to get the Union{T, Missing} that is needed for reducing over an empty collection.
splatting large number of arguments should not be a problem for a compiler (at least when they have homogeneous type, which I expect in this case)
I think that rowtotal like function could be considered to be added (in this issue we can discuss the idea) - in general reduce and/or mapreduce with and option of handling as the request so somehow "aggregate" over rows of a data frame pops up quite often (although I would say that the canonical way to do it should be to do stack + by - I do not say it will be faster, but it feels to me more in the spirit of tabular data processing).
The way I was assuming one would write it (disregarding NT for the moment as we plan to add it, and ... is not that bad I think, given what else you have to write):
select(df, :, [:x1, :x2] => ByRow((x...) -> reduce(+, skipmissing(x), init=0.0)) => :y)
or (but this will be slower)
df.y = map(eachrow(df[!,[:x1, :x2, :x3]])) do row
reduce(+, skipmissing(row), init=0.0)
end
in both cases I assume it is relatively clear what value init kwarg should take.
Thanks for this response. I was not aware of the init value.
Is the fact that Tuples can't have Union types something that can be changed in Base? I presume it is breaking.
I will file a separate issue for DataFrameRow, though.
Is the fact that Tuples can't have
Uniontypes something that can be changed in Base? I presume it is breaking.
That's due to covariance of tuple types, and that would be pretty hard to change. I think it's been considered, but not before 2.0 at the very least, and that would probably introduce lots of problems.
What would a hypothetical rowtotal function do? Apart from skipping missing values, it doesn't seem it could fix the problem of finding out the right zero type.
I guess rowtotal would virtually (probably not explicitly as it should be possible to do it more efficiently) do:
rowtotal(fun, df) = fun.(skipmissing.(eachrow(Matrix(df)))
where conversion to Matrix ensures that promote_type is properly applied.
or maybe even just:
rowtotal(fun, df) = fun.(skipmissing.(Tables.namedtupleiterator(df)))
would be good enough?
What would a hypothetical rowtotal function do? Apart from skipping missing values, it doesn't seem it could fix the problem of finding out the right zero type.
I'm not proposing to write a function called rowtotal, though I'm open to the idea. Rather, I'm hoping that we can make an easy way for skipmissing to work. The init keyword in reduce is a good idea, but I'm still wondering if there is more we can do behind the scenes to help the user.
Here is a good idea that might solve this: passmissing, but which coalesces missing values to a default value.
select(df, :, [:x1, :x2, :x3] => coalescefun(+, 0) => :y)
EDIT: This would only work for rowtotal but would not be able to emulate rowmean.
I think @nalimilan's solution of wrapping it in a NamedTuple is the easiest solution for now...
Was there any discussion of having it be the default? And then having the user splat manually?
To which @bkamins replied
No, I mean that named tuple (a=1, b=[1,2,3]) would get expanded to (a=[1,1,1], b=[1,2,3]) before further processing.
solution of wrapping it in a NamedTuple is the easiest solution for now...
This was heavily discussed and the conclusion was that splatting by default is better because:
Still, we want to provide that option. We just need a good name for the wrapper.
I wonder if the naming should be done as an alternative to ByRow. Something like RowNT. That would cut down on the number of combinations that could overwhelm the user.
or maybe even just:
rowtotal(fun, df) = fun.(skipmissing.(Tables.namedtupleiterator(df)))would be good enough?
I think that would a good solution, yes. I'm assuming that this would iterate over rows of df.
Something like
RowNT
NT would wrap the source not a function, so you would write e.g. NT(cols) => ByRow(fun), as the form of passing the values is independent form whether we process whole columns or single rows.
fun.(skipmissing.(Tables.namedtupleiterator(df))) is something you already can do now:
julia> df = DataFrame([1 2 missing
missing missing missing
missing 2 3])
3ร3 DataFrame
โ Row โ x1 โ x2 โ x3 โ
โ โ Int64? โ Int64? โ Int64? โ
โโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโค
โ 1 โ 1 โ 2 โ missing โ
โ 2 โ missing โ missing โ missing โ
โ 3 โ missing โ 2 โ 3 โ
julia> sum.(skipmissing.(Tables.namedtupleiterator(df)))
3-element Array{Int64,1}:
3
0
5
it is just a question if we want to have a special function providing this functionality (or just assume that users would use the long form)
NTwould wrap the source not a function, so you would write e.g.NT(cols) => ByRow(fun), as the form of passing the values is independent form whether we process whole columns or single rows.
Yes I see. I forgot that one still might want to pass a named tuple of columns. That syntax wouldn't just work for rows.
I think that would a good solution, yes. I'm assuming that this would iterate over rows of
df.
I agree this would be easy to do. However I would really like to get a solution that works elegantly in the new select, transform workflow.
I will try and think of a better name for NT, something that is more intuitive for the user I think.
I had no idea that this works. this is the functionality of rowtotal. sure if we can make it shorter, and work with transform and select, even better.
julia> df.y = sum.(skipmissing.(Tables.namedtupleiterator(select(df,[:x2, :x3])))
)
3-element Array{Int64,1}:
2
0
5
julia> df
3ร4 DataFrame
โ Row โ x1 โ x2 โ x3 โ y โ
โ โ Int64? โ Int64? โ Int64? โ Int64 โ
โโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโค
โ 1 โ 1 โ 2 โ missing โ 2 โ
โ 2 โ missing โ missing โ missing โ 0 โ
โ 3 โ missing โ 2 โ 3 โ 5 โ
I'm seeing the value in NT better. People aren't realistically going to use x -> (x...,) in their function for ByRow. At the very least, it's as easy to type as NT.
Additionally, the only function that you would both apply skipmissing to would be sum, mean, var, etc. There is no equivelent of +(args...) for mean.
I still think that maybe the following could work
select(df, :, [:x1, :x2] => fun)
select(df, :, [:x1, :x2] => NT(fun))
select(df, :, [:x1, :x2] => ByRow(fun))
select(df, :, [:x1, :x2] => ByRowNT(fun))
Just to be clear - the conclusion is that if we add an option to pass NamedTuple to function this issue is resolved?
PS. my current proposal for the name of the wrapper is AsTable which is not very long and I hope explicit in what it promises to pass conceptually.
clused with the new AsTable syntax #2183