Dataframes.jl: Basic operations like sum() not supported for DataFrames?

Created on 11 May 2016 · 21 Comments · Source: JuliaData/DataFrames.jl

```julia
# Julia Version 0.4.5 (2016-03-18 00:58 UTC), x86_64-apple-darwin15.4.0

julia> using DataFrames

julia> x = DataFrame(rand(1:10, (10, 10)))
10x10 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │ x7 │ x8 │ x9 │ x10 │
┝━━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━━┥
│ 1   │ 10 │ 1  │ 3  │ 1  │ 3  │ 8  │ 2  │ 9  │ 2  │ 1   │
│ 2   │ 9  │ 5  │ 5  │ 3  │ 8  │ 6  │ 6  │ 3  │ 2  │ 1   │
│ 3   │ 3  │ 3  │ 10 │ 7  │ 5  │ 1  │ 3  │ 6  │ 3  │ 2   │
│ 4   │ 5  │ 1  │ 5  │ 7  │ 8  │ 8  │ 9  │ 9  │ 5  │ 6   │
│ 5   │ 10 │ 2  │ 4  │ 2  │ 2  │ 7  │ 3  │ 5  │ 3  │ 6   │
│ 6   │ 7  │ 5  │ 8  │ 2  │ 9  │ 5  │ 9  │ 6  │ 8  │ 6   │
│ 7   │ 1  │ 7  │ 6  │ 7  │ 6  │ 1  │ 2  │ 2  │ 2  │ 5   │
│ 8   │ 7  │ 1  │ 8  │ 7  │ 6  │ 7  │ 9  │ 2  │ 6  │ 1   │
│ 9   │ 7  │ 5  │ 5  │ 7  │ 5  │ 1  │ 8  │ 4  │ 3  │ 5   │
│ 10  │ 5  │ 3  │ 10 │ 4  │ 6  │ 10 │ 5  │ 2  │ 3  │ 7   │

julia> sum(x, 1)
ERROR: MethodError: sum has no method matching sum(::DataFrames.DataFrame, ::Int64)
Closest candidates are:
sum(::Union{Base.Func{1},DataType,Function}, ::Any)
sum(::BitArray{N}, ::Any)
sum(::DataArrays.DataArray{T,N}, ::Any)
...
```

I see that #391 is related.

tlnagy

All 21 comments

You can apply functions like sum to individual columns of a DataFrame. Applying to all columns may not make sense because the columns can be heterogeneous.

julia> x = DataFrame(rand(1:10, (10, 10)))
10x10 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │ x7 │ x8 │ x9 │ x10 │
┝━━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━┿━━━━━┥
│ 1   │ 1  │ 7  │ 5  │ 4  │ 1  │ 3  │ 9  │ 5  │ 9  │ 6   │
│ 2   │ 2  │ 3  │ 2  │ 7  │ 2  │ 9  │ 10 │ 8  │ 6  │ 1   │
│ 3   │ 8  │ 8  │ 6  │ 2  │ 1  │ 9  │ 4  │ 9  │ 9  │ 10  │
│ 4   │ 3  │ 4  │ 2  │ 8  │ 8  │ 9  │ 9  │ 5  │ 9  │ 10  │
│ 5   │ 3  │ 4  │ 10 │ 3  │ 3  │ 2  │ 2  │ 10 │ 5  │ 3   │
│ 6   │ 5  │ 8  │ 9  │ 3  │ 6  │ 5  │ 6  │ 7  │ 4  │ 1   │
│ 7   │ 2  │ 3  │ 7  │ 6  │ 10 │ 3  │ 1  │ 10 │ 8  │ 8   │
│ 8   │ 10 │ 3  │ 7  │ 10 │ 8  │ 10 │ 1  │ 1  │ 3  │ 8   │
│ 9   │ 1  │ 5  │ 3  │ 1  │ 5  │ 9  │ 6  │ 5  │ 7  │ 8   │
│ 10  │ 1  │ 7  │ 5  │ 3  │ 9  │ 1  │ 4  │ 10 │ 10 │ 9   │

julia> sum(x[1])
36

julia> [sum(x[i]) for i in 1 : size(x,2)]
10-element Array{Any,1}:
 36
 52
 56
 47
 53
 60
 52
 70
 70
 64

dmbates on 11 May 2016

👍1
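
The transcript above reflects the 2016 API. As a minimal sketch of the same per-column sums, assuming a DataFrames.jl 1.x release (where a matrix constructor needs `:auto` for generated column names and columns are selected with `df[!, i]` or iterated with `eachcol`):

```julia
using DataFrames

# `:auto` asks DataFrames.jl to generate the column names x1, x2, ...
df = DataFrame(rand(1:10, 10, 10), :auto)

# Sum of the first column; `df[!, 1]` returns the column without copying it.
first_col_sum = sum(df[!, 1])

# One sum per column, iterating over the columns directly.
col_sums = [sum(col) for col in eachcol(df)]
```

The comprehension over `eachcol(df)` avoids indexing by position and yields a concretely typed vector when all columns are integers.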

@tlnagy Any further comments or can I close this?

dmbates on 12 May 2016

Your solution works, but I think it would still be nice to have something like http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html

tlnagy on 12 May 2016

I'm pretty strongly opposed to those kinds of "give me convenience or give me death" methods. You should never use an API for tabular data that you couldn't easily translate into SQL or you're giving in to vendor lock-in.

johnmyleswhite on 12 May 2016

@tlnagy, FWIW, I agree that it would be nice to have such functionality. There has been some discussion about this in the past, however, and as suggested by @johnmyleswhite's response, the decision was made to keep this package "pure" and very SQL-like.

That said, it would be possible to have these types of functions available in a separate package, similar to how DataFramesMeta.jl maintains some convenience macros for interacting with DataFrames. All it would need is for someone to volunteer to create and maintain it... this usually ends up being someone who would really like to see this functionality exist (like you or me, except that I don't have the time right now, sorry...).

There is also Pandas.jl, which is a way to interact with Pandas directly. I haven't used it (although I still use Pandas itself at work quite frequently).

Cheers!

kmsquire on 13 May 2016

Note: I believe one should be able to do something like row sums via

sum(convert(Array, df), dims = 1)

where df is your DataFrame. Of course, this will only make sense when the entries of df are all numeric, but presumably that's what you want anyways.

pistacliffcho on 4 May 2019

You can do it either by:

sum.(eachrow(df))

sum(Matrix(df), dims=2)

(note that dims should be 2 for row sums)

bkamins on 4 May 2019

👍2
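
A small sketch of the two approaches on a toy data frame (the column names `a` and `b` are made up for illustration), also showing why `dims` matters: `dims = 2` collapses the columns and gives one value per row, while `dims = 1` gives one value per column.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

# Broadcast `sum` over the rows: a Vector with one entry per row.
row_sums = sum.(eachrow(df))

# Convert to a Matrix first: `dims = 2` sums across columns (row sums, 3×1).
row_sums_matrix = sum(Matrix(df), dims = 2)

# `dims = 1` sums down the rows instead (column sums, 1×2).
col_sums_matrix = sum(Matrix(df), dims = 1)
```

The `Matrix` route only makes sense when all columns share a numeric type; the `eachrow` route returns a plain vector rather than a one-column matrix.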

Use sum.(skipmissing.(eachrow(df))) to skip missing values, but it will fail if a row is empty after skipping.

You can define:
sm(x) = isempty(x) ? missing : sum(x)
sm.(skipmissing.(eachrow(df)))

ppalmes on 19 May 2020
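
As a sketch of how such a guard behaves, assuming a toy data frame whose second row is entirely missing (the data and column names are invented; `sm` is the commenter's helper, written with the spaces that the ternary operator requires in Julia 1.x):

```julia
using DataFrames

df = DataFrame(a = [1, missing, 3], b = [10, missing, missing])

# Return `missing` when nothing is left after skipping, otherwise the sum.
sm(x) = isempty(x) ? missing : sum(x)

# Row 1 has two values, row 2 has none, row 3 has one.
row_sums = sm.(skipmissing.(eachrow(df)))
```

The `isempty` check runs before `sum`, so the reduction is never attempted on a row with no remaining values.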

You would need to do skipmissing., and still the proper way to be 100% safe is sum.(skipmissing.(Tables.namedtupleiterator(df))).

bkamins on 19 May 2020

Since DataFrames columns can have types like Union{Missing,Float64}, maybe we can have colsum/colmean rowsum/rowmean operations that deal with missing automatically?

ppalmes on 19 May 2020

the pipe makes it easier to understand:
Tables.namedtupleiterator(df) .|> skipmissing .|> sum

ppalmes on 19 May 2020

> maybe we can have colsum/colmean rowsum/rowmean operations that deal with missing automatically?

This would be against the general design of handling missings in Base.

If your data frame is not huge and you are not in a performance-critical part of the code, it is enough to convert it to a Matrix. So you can write:

eachrow(Matrix(df)) .|> skipmissing .|> sum

which may be easier to understand, as it is almost the same as what you would do in Base for a Matrix.

bkamins on 19 May 2020

The problem with Tables.namedtupleiterator(df) .|> skipmissing .|> mean is that it returns NaN instead of missing if a row is all missing. The same issue occurs with eachrow(Matrix(df)) .|> skipmissing .|> mean. I think you still need to define a function that checks for an empty row and returns missing. I also think the sum of an all-missing row should be missing instead of 0.

ppalmes on 20 May 2020

df=DataFrame(a=[1,2,missing,3],b=[1,2,missing,3],c=[1,2,missing,3])

eachrow(Matrix(df)) .|> skipmissing .|> mean

should not be NaN but missing. The idea is that it is fine to compute the mean of a row with some missing values, but if all data in a row is missing, mean should return missing so that you can interpolate or impute it.

ppalmes on 20 May 2020
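
A sketch of the behaviour being described, using the data frame from this comment. The helper name `rowmean_or_missing` is hypothetical and is not part of DataFrames.jl or Statistics; it only illustrates one way to get `missing` for an all-missing row.

```julia
using DataFrames, Statistics

df = DataFrame(a = [1, 2, missing, 3], b = [1, 2, missing, 3], c = [1, 2, missing, 3])

# Plain pipeline: the all-missing third row yields NaN (0 / 0).
plain = eachrow(Matrix(df)) .|> skipmissing .|> mean

# Hypothetical helper: report `missing` when a row has no observed values.
rowmean_or_missing(row) = all(ismissing, row) ? missing : mean(skipmissing(row))

guarded = rowmean_or_missing.(eachrow(Matrix(df)))
```

Checking `all(ismissing, row)` up front keeps a genuine NaN in the data distinguishable from a row that has no data at all.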

> should not be NaN but missing.

Well NaN is produced by R in this case, just like in Julia AFAICT.
In what libraries having a proper notion of missing have you encountered the behaviour you describe?

In general, this issue should probably be discussed on Slack in the #statistics channel first, and then an issue opened in Statistics.jl.

bkamins on 20 May 2020

To be consistent, sum and mean should both return NaN. mean(empty)=NaN but sum(empty)=0, which is not consistent, I think. I hope we can have column and row summary stats for dataframes as an API like colSums/colMeans in R.

ppalmes on 20 May 2020

Statistics on data quality are important for assessing model quality too. So if we had a stat summary for dataframes that includes missing-data statistics, it could be a very useful API.

ppalmes on 20 May 2020

> mean(empty)=NaN but sum(empty)=0, which is not consistent, I think.

It is consistent: mean(x) = sum(x) / length(x), so if x is empty, that's 0 / 0, which is NaN.

ararslan on 20 May 2020
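
The identity can be checked directly. This sketch uses only Base and the Statistics standard library, with an empty `Float64` vector and an all-missing vector standing in for "empty":

```julia
using Statistics

x = Float64[]

total = sum(x)       # additive identity for Float64
n = length(x)        # zero elements
ratio = total / n    # 0.0 / 0 in floating point
avg = mean(x)

# The same holds after skipping every value of an all-missing vector.
y = skipmissing(Union{Missing, Float64}[missing, missing])
total_y = sum(y)
avg_y = mean(y)
```

Here `total` and `total_y` are `0.0`, while `ratio`, `avg`, and `avg_y` are all `NaN`, which is what `sum(x) / length(x)` predicts.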

> I hope we can have column and row summary stats for dataframes as an API like colSums/colMeans in R.

rowMeans in R with na.rm=T produces NaN, just like Julia, if a row contains only NA, so there is no difference here.

If you want to aggregate columns, use mapcols with any aggregation function you like.

If you want to aggregate rows, use one of the row-iteration approaches we have discussed.

DataFrames.jl does not define specific functions like the ones you mention. It provides generic functions that let you work with data-frame-like objects. I do not expect that we will take on statistics-related functionality (and related packages) as dependencies. However, it is easy enough to define the functions you propose, either in a separate package or by adding the definitions to your ~/.julia/config/startup.jl (all of these functions should probably be one-liners).

bkamins on 20 May 2020

👍1
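
A minimal sketch of what such one-liners might look like, assuming a DataFrames.jl 1.x release. The names `colsum` and `rowsum` are hypothetical (borrowed from the proposal earlier in the thread) and are not defined by DataFrames.jl; the toy data is invented.

```julia
using DataFrames, Statistics

df = DataFrame(a = [1, 2, 3], b = [10.0, 20.0, 30.0])

# `mapcols` applies a function to every column; a scalar result per column
# gives a one-row DataFrame.
col_totals = mapcols(sum, df)
col_means = mapcols(mean, df)

# Hypothetical user-defined helpers of the kind that could live in startup.jl.
colsum(d) = mapcols(sum ∘ skipmissing, d)
rowsum(d) = sum.(skipmissing.(eachrow(d)))
```

As discussed above, `rowsum` written this way can still fail on a row with no non-missing values, so a guard for that case would have to be added by whoever defines it.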

> mean(empty)=NaN but sum(empty)=0, which is not consistent, I think.
>
> It is consistent: mean(x) = sum(x) / length(x), so if x is empty, that's 0 / 0, which is NaN.

Hmm, sum(skipmissing(Union{Missing,Int}[missing, missing, missing])) = 0 is problematic in real applications, although R returns a similar thing. The entire row is missing, so the aggregation of all-missing data should not be zero. The idea of skipmissing is that you still have some data left to operate on, which is the practical case in real applications. If nothing is left after skipping, an aggregation should not return a numeric value. Imagine rows of huge numbers and then one row of all missings: the sum for that row is zero, which is not representative of the other rows. For a single vector of data it's not a huge problem, but in the context of dataframes, where each row represents an observation, operations on all-missing rows that return a number can drastically change the distribution.

ppalmes on 20 May 2020
