DataFrames.jl: using nrow as a named keyword argument when grouping

Created on 9 Dec 2018 · 7 Comments · Source: JuliaData/DataFrames.jl

# this works
ANS = by(x, [:id1, :id2], nrow);
# this does not work
ANS = by(x, [:id1, :id2], count = nrow);
# what I am actually looking for is 
ANS = by(x, [:id1, :id2], v3 = :v3=>sum, count = nrow);
# I can currently emulate this by
ANS = by(x, [:id1, :id2], v3 = :v3=>sum, count = :v3=>length);

DataFrames v0.15.1

jangorecki

All 7 comments

It's not supposed to work: when passing a keyword argument you need to provide the name of the columns to operate on, and in that case the function is passed either the column vector (single column) or a named tuple of column vectors (multiple columns). The simplest way of getting the number of rows reliably is indeed the last command (:v3 can be replaced with 1 if you don't want to bother about finding a column name).
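A minimal sketch of what the function receives in each case, written in plain Julia rather than against DataFrames.jl itself. The group `g` is a hypothetical stand-in (a NamedTuple of column vectors), and the column `v1` is an assumption added for illustration:

```julia
# Hypothetical stand-in for one group: a NamedTuple of column vectors.
# This does not call DataFrames.jl; it only mimics what the function is handed.
g = (id1 = [1, 1, 1], v1 = [1, 2, 3], v3 = [0.5, 1.5, 2.0])

# Single source column: the function receives the column vector itself.
single = g.v3
# Several source columns: the function receives a NamedTuple of column vectors.
multi = (v1 = g.v1, v3 = g.v3)

total = sum(single)                        # analogous to v3 = :v3=>sum
n = length(single)                         # analogous to count = :v3=>length
wsum = (nt -> sum(nt.v1 .* nt.v3))(multi)  # a function over two columns
```

Because every column in a group has the same length, `length` of any one column gives the group's row count, which is why an arbitrary column works.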

I agree it's a bit annoying that you need to choose an arbitrary column, but I don't see a good solution. We could special-case nrow, but that would be a bit weird.

nalimilan on 9 Dec 2018

@nalimilan Actually, I think we could handle something like count = nrow without a problem. We could check whether the RHS of a keyword argument is a pair (this is what we assume now); otherwise we would assume it is a callable, pass it a SubDataFrame, and expect it to return a scalar or a vector.

This would be convenient, but of course it would be slow, so we should not use it in @jangorecki's benchmarks anyway. However, maybe we should allow it, since for small data frames it should have acceptable speed while being intuitive?

bkamins on 9 Dec 2018

If we could refer to the current group of the data frame that we are passing to by, then the API could be count = :df=>nrow?

jangorecki on 9 Dec 2018

count = :df=>nrow would mean:

take the current group
locate a column named df in it
pass this column as a vector to the nrow function (which in this case would fail, as nrow is defined only for data frames)

However, count = nrow could mean:

take the current group
pass this group as a SubDataFrame to the nrow function and expect a scalar or a vector; take the result of the function and make it a column named count in the result

Actually, this is what you observe when you call by(x, [:id1, :id2], nrow), the only difference being that the column name is then auto-generated; the change would be to allow naming the resulting column. Currently you can achieve this with by(df, :x1, x->(count=nrow(x),)), but it is a bit cumbersome.
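The two readings above can be sketched in plain Julia. This is only an illustration of the proposed rule, not DataFrames.jl internals: `apply_spec` and `group_nrow` are hypothetical names, and a NamedTuple of column vectors stands in for the SubDataFrame of one group.

```julia
# Hypothetical stand-in for nrow on a group modelled as a NamedTuple of columns.
group_nrow(g::NamedTuple) = length(first(g))

# A Pair means: locate the source column in the group, then call the function on it.
apply_spec(g::NamedTuple, spec::Pair{Symbol}) = last(spec)(getproperty(g, first(spec)))

# Anything else is assumed to be a callable that receives the whole group.
apply_spec(g::NamedTuple, f) = f(g)

g = (id1 = [1, 1, 1], v3 = [0.5, 1.5, 2.0])

# Mirrors the requested call: v3 = :v3=>sum, count = nrow
row = (v3 = apply_spec(g, :v3 => sum), count = apply_spec(g, group_nrow))

# Mirrors count = :df=>nrow: there is no column named df, so the lookup fails.
failed = try
    apply_spec(g, :df => group_nrow)
    false
catch
    true
end
```

In this sketch the lookup of `df` fails before the function is even called; with a real column the call would still fail, as noted above, because nrow is not defined for vectors.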

bkamins on 9 Dec 2018

Yes, that makes sense, but I'm not sure we want to make this operation easy since it will necessarily be slow. The only situation where it would be useful is when you don't need a specific column, which AFAICT is only the case for nrow. But then it could easily be a trap for users if the fast variant is a bit more convoluted.

Or maybe it would be fast, given that the compiler knows we always pass a SubDataFrame to nrow and that it returns an Int? Some benchmarking would be interesting.

nalimilan on 9 Dec 2018

Or maybe it would be fast, given that the compiler knows we always pass a SubDataFrame to nrow and that it returns an Int?

This would be fast, but essentially only in this case. That is why I said we could consider it only for convenience (I think we have enough explanations in the docs that not passing source column names slows things down).

bkamins on 9 Dec 2018

@nalimilan @jangorecki: can this be closed now, or should something else be added here?

bkamins on 15 Jan 2019
