```julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.5 (2016-03-18 00:58 UTC)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |  x86_64-apple-darwin15.4.0
julia> using DataFrames
julia> x = DataFrame(rand(1:10, (10, 10)))
10x10 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │ x7 │ x8 │ x9 │ x10 │
┝━━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━━┥
│ 1   │ 10 │ 1  │ 3  │ 1  │ 3  │ 8  │ 2  │ 9  │ 2  │ 1   │
│ 2   │ 9  │ 5  │ 5  │ 3  │ 8  │ 6  │ 6  │ 3  │ 2  │ 1   │
│ 3   │ 3  │ 3  │ 10 │ 7  │ 5  │ 1  │ 3  │ 6  │ 3  │ 2   │
│ 4   │ 5  │ 1  │ 5  │ 7  │ 8  │ 8  │ 9  │ 9  │ 5  │ 6   │
│ 5   │ 10 │ 2  │ 4  │ 2  │ 2  │ 7  │ 3  │ 5  │ 3  │ 6   │
│ 6   │ 7  │ 5  │ 8  │ 2  │ 9  │ 5  │ 9  │ 6  │ 8  │ 6   │
│ 7   │ 1  │ 7  │ 6  │ 7  │ 6  │ 1  │ 2  │ 2  │ 2  │ 5   │
│ 8   │ 7  │ 1  │ 8  │ 7  │ 6  │ 7  │ 9  │ 2  │ 6  │ 1   │
│ 9   │ 7  │ 5  │ 5  │ 7  │ 5  │ 1  │ 8  │ 4  │ 3  │ 5   │
│ 10  │ 5  │ 3  │ 10 │ 4  │ 6  │ 10 │ 5  │ 2  │ 3  │ 7   │
julia> sum(x, 1)
ERROR: MethodError: sum has no method matching sum(::DataFrames.DataFrame, ::Int64)
Closest candidates are:
sum(::Union{Base.Func{1},DataType,Function}, ::Any)
sum(::BitArray{N}, ::Any)
sum(::DataArrays.DataArray{T,N}, ::Any)
...
```
I see that #391 is related.
You can apply functions like sum to individual columns of a DataFrame. Applying to all columns may not make sense because the columns can be heterogeneous.
julia> x = DataFrame(rand(1:10, (10, 10)))
10x10 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │ x7 │ x8 │ x9 │ x10 │
┝━━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━━┥
│ 1   │ 1  │ 7  │ 5  │ 4  │ 1  │ 3  │ 9  │ 5  │ 9  │ 6   │
│ 2   │ 2  │ 3  │ 2  │ 7  │ 2  │ 9  │ 10 │ 8  │ 6  │ 1   │
│ 3   │ 8  │ 8  │ 6  │ 2  │ 1  │ 9  │ 4  │ 9  │ 9  │ 10  │
│ 4   │ 3  │ 4  │ 2  │ 8  │ 8  │ 9  │ 9  │ 5  │ 9  │ 10  │
│ 5   │ 3  │ 4  │ 10 │ 3  │ 3  │ 2  │ 2  │ 10 │ 5  │ 3   │
│ 6   │ 5  │ 8  │ 9  │ 3  │ 6  │ 5  │ 6  │ 7  │ 4  │ 1   │
│ 7   │ 2  │ 3  │ 7  │ 6  │ 10 │ 3  │ 1  │ 10 │ 8  │ 8   │
│ 8   │ 10 │ 3  │ 7  │ 10 │ 8  │ 10 │ 1  │ 1  │ 3  │ 8   │
│ 9   │ 1  │ 5  │ 3  │ 1  │ 5  │ 9  │ 6  │ 5  │ 7  │ 8   │
│ 10  │ 1  │ 7  │ 5  │ 3  │ 9  │ 1  │ 4  │ 10 │ 10 │ 9   │
julia> sum(x[1])
36
julia> [sum(x[i]) for i in 1 : size(x,2)]
10-element Array{Any,1}:
36
52
56
47
53
60
52
70
70
64
@tlnagy Any further comments or can I close this?
Your solution works, but I think it would still be nice to have something like http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html
I'm pretty strongly opposed to those kinds of "give me convenience or give me death" methods. You should never use an API for tabular data that you couldn't easily translate into SQL or you're giving in to vendor lock-in.
@tlnagy, FWIW, I agree that it would be nice to have such functionality. There has been some discussion about this in the past, however, and as suggested by @johnmyleswhite's response, the decision was made to keep this package "pure" and very SQL-like.
That said, it would be possible to have these types of functions available in a separate package, similar to how DataFramesMeta.jl maintains some convenience macros for interacting with DataFrames. All it would need is for someone to volunteer to create and maintain it... this usually ends up being someone who would really like to see this functionality exist (like you or me, except that I don't have the time right now, sorry...).
There is also Pandas.jl, which is a way to interact with Pandas directly. I haven't used it (although I still use Pandas itself at work quite frequently).
Cheers!
Note: I believe one should be able to do something like row sums via
sum(convert(Array, df), dims = 1)
where df is your DataFrame. Of course, this will only make sense when the entries of df are all numeric, but presumably that's what you want anyways.
You can do it either by:
sum.(eachrow(df))
or
sum(Matrix(df), dims=2)
(note that dims should be 2 for row sums)
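To make the dims point concrete, here is a minimal sketch on a plain Matrix, standing in for Matrix(df) so that it needs nothing beyond Base. The 2x3 values are made up for illustration.

```julia
# Illustrative data: 2 rows, 3 columns (stands in for Matrix(df)).
M = [1 2 3;
     4 5 6]

col_sums = sum(M, dims=1)   # reduces over rows: one sum per column, a 1x3 Matrix
row_sums = sum(M, dims=2)   # reduces over columns: one sum per row, a 2x1 Matrix

# vec drops the singleton dimension if a plain Vector is more convenient.
row_sums_vec = vec(row_sums)
```

So dims names the dimension that gets collapsed, which is why row sums need dims=2.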
sum.(skipmissing.(eachrow(df))) handles missing values, but it will fail on an empty row.
You can define:
sm(x) = isempty(x) ? missing : sum(x)
sm.(skipmissing.(eachrow(df)))
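A small self-contained sketch of that idea, run on a plain Matrix with made-up values instead of a data frame so it only needs Base (eachrow on a Matrix assumes Julia 1.1 or later). The helper name sum_or_missing is hypothetical and not part of any package.

```julia
# Hypothetical helper: missing for an all-missing (empty after skipping) row.
sum_or_missing(itr) = isempty(itr) ? missing : sum(itr)

# Illustrative data: the last row is entirely missing.
M = Union{Missing, Int}[1       2       3;
                        missing 5       6;
                        missing missing missing]

# Skip missings row by row, then sum what is left.
row_sums = [sum_or_missing(skipmissing(row)) for row in eachrow(M)]
```

Here the first two rows give 6 and 11, and the all-missing row gives missing rather than 0.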
You would need to do skipmissing., and still the proper way to be 100% safe is sum.(skipmissing.(Tables.namedtupleiterator(df))).
since DataFrames has coltypes Union{Missing,Float64}, maybe we can have colsum/colmean rowsum/rowmean operations that deals with missing automatically?
the pipe makes it easier to understand:
Tables.namedtupleiterator(df) .|> skipmissing .|> sum
> maybe we can have colsum/colmean rowsum/rowmean operations that deals with missing automatically?
This would be against the general design of handling missings in Base.
If your data frame is not huge and you are not in a performance-critical part of the code, it is enough to use a conversion to Matrix. So you can write:
eachrow(Matrix(df)) .|> skipmissing .|> sum
which may be easier to understand, as this is almost the same as what you would do in Base for a Matrix.
The problem with Tables.namedtupleiterator(df) .|> skipmissing .|> mean is that it returns NaN instead of missing if a row is entirely missing. The same issue applies to eachrow(Matrix(df)) .|> skipmissing .|> mean. I think you still need to define a function that checks for an empty row and returns missing. I also think the sum of an all-missing row should be missing instead of 0.
using DataFrames, Statistics
df = DataFrame(a=[1,2,missing,3], b=[1,2,missing,3], c=[1,2,missing,3])
eachrow(Matrix(df)) .|> skipmissing .|> mean
should not be NaN but missing. the idea is that rows with some missing is ok to compute their means but if all data in a row is missing, mean should return missing so that you can interpolate or impute it.
> should not be NaN but missing.
Well NaN is produced by R in this case, just like in Julia AFAICT.
In what libraries having a proper notion of missing have you encountered the behaviour you describe?
In general, this issue should probably be discussed on Slack in the #statistics channel first, and then an issue opened in Statistics.jl.
to be consistent sum and mean should return NaN. mean(empty)=NaN but sum(empty)=0 which is not consistent i think. i hope we can have column and row summary stats for dataframes as an API like colSums/colMeans in R.
statistics of data quality is important to assess model quality too. so maybe if we have stat summary for dataframes including missing data statistics, it can be a very useful API.
> mean(empty)=NaN but sum(empty)=0 which is not consistent i think.
It is consistent: mean(x) = sum(x) / length(x), so if x is empty, that's 0 / 0, which is NaN.
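A minimal sketch of that behaviour, using only Base and the Statistics standard library. The variable names and values are illustrative.

```julia
using Statistics

# An empty numeric vector: the sum is the additive identity, the mean is 0 / 0.
empty_vals = Float64[]
s = sum(empty_vals)      # 0.0
m = mean(empty_vals)     # NaN

# The same happens when skipmissing leaves nothing to aggregate.
all_missing = Union{Missing, Float64}[missing, missing]
s_skip = sum(skipmissing(all_missing))    # 0.0
m_skip = mean(skipmissing(all_missing))   # NaN
```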
hmm, sum(skipmissing.([missing,missing,missing])) = 0 is problematic in real applications although R also returns similar thing. the entire row is missing so aggregation of all missing data should be not zero. the idea of skipmissing is that you still have some data left which you can operate from the practical aspect in real applications. if there is nothing to skip, any aggregation operation should not return a numeric value. imagine a row of huge numbers and then one row of all missings, the sum returns a zero which is not reflective of the row of numbers. for a vector of data, it's not a huge problem but in the context of dataframes where each row represents an observation, the operations on rows of all missing data returning a number can drastically change the distribution.
> i hope we can have column and row summary stats for dataframes as an API like colSums/colMeans in R.

- rowMeans in R with na.rm=T produces NaN just like Julia if a row contains only NA, so here there is no difference
- if you want to aggregate columns use mapcols with any aggregation function you like
- if you want to aggregate rows then use one of the row-iteration approaches we have discussed
- In DataFrames.jl we do not define specific functions like the ones you mention. It provides generic functions allowing you to work with data-frame-like objects. I do not expect that we will have statistics-related functionality (and related packages) as dependencies. However, it is easy enough to define the functions you propose either in a separate package or by adding the definitions to your ~/.julia/config/startup.jl (all these functions should probably be one-liners)
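As a rough illustration of how short such definitions can be, here is a sketch built on mapcols and eachrow. It assumes a recent DataFrames.jl is installed. The names colsums and rowsums are hypothetical (DataFrames.jl does not export them), and this version does no special handling of missing values.

```julia
using DataFrames

# Hypothetical helper names; DataFrames.jl does not export these.
colsums(df) = mapcols(sum, df)     # one-row DataFrame holding each column's sum
rowsums(df) = sum.(eachrow(df))    # Vector with one sum per row

# Illustrative data.
df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])
cs = colsums(df)
rs = rowsums(df)
```

Swapping sum for another aggregation (or wrapping it to skip missings) follows the same pattern.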
not straightforward if you include blocks of missing stats where blocks means contiguous missing data scattered over rows and columns.