# this works
ANS = by(x, [:id1, :id2], nrow);
# this does not work
ANS = by(x, [:id1, :id2], count = nrow);
# what I am actually looking for is
ANS = by(x, [:id1, :id2], v3 = :v3=>sum, count = nrow);
# I can currently emulate this by
ANS = by(x, [:id1, :id2], v3 = :v3=>sum, count = :v3=>length);
DataFrames v0.15.1
It's not supposed to work: when passing a keyword argument you need to provide the name of the columns to operate on, and in that case the function is passed either the column vector (single column) or a named tuple of column vectors (multiple columns). The simplest way of getting the number of rows reliably is indeed the last command (:v3 can be replaced with 1 if you don't want to bother about finding a column name).
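For reference, a minimal sketch of the keyword-argument form described above. It assumes DataFrames v0.15.x (the version reported in this issue), where `by` accepts `name = column => function` keywords; the sample data frame `x` is made up for illustration.

```julia
using DataFrames  # assumes DataFrames v0.15.x, where `by` takes keyword arguments

# Hypothetical sample data; the column names follow the issue.
x = DataFrame(id1 = [1, 1, 2], id2 = [1, 1, 1], v3 = [10, 20, 30])

# Each keyword names an output column; each value is `source_column => function`.
# The function receives the group's column vector, so `length` gives the row count.
ANS = by(x, [:id1, :id2], v3 = :v3 => sum, count = :v3 => length)

# The source column can also be given by position, as mentioned above.
ANS2 = by(x, [:id1, :id2], count = 1 => length)
```

Both calls count rows by taking the length of one column of the group, which is why some source column has to be named even though its values are never used.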
I agree it's a bit annoying that you need to choose an arbitrary column, but I don't see a good solution. We could special-case nrow, but that would be a bit weird.
@nalimilan Actually, I think we could handle something like count = nrow without a problem. We could check whether the RHS of a keyword argument is a pair (this is what we assume now); otherwise we would assume it is a callable, pass it a SubDataFrame, and expect it to return a scalar or a vector.
This would be convenient, but of course it would be slow, so we should not use it in @jangorecki's benchmarks anyway. However, maybe we should allow it, since for small data frames it should have acceptable speed while being intuitive?
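To make the proposal concrete, here is a rough sketch of that dispatch in plain Julia. Everything in it is hypothetical and is not the DataFrames implementation: the group is a NamedTuple of column vectors standing in for a SubDataFrame, and `toy_nrow`, `apply_spec`, and `summarize` are names invented for this illustration.

```julia
# Toy stand-in for one group: a NamedTuple of column vectors (not a real SubDataFrame).
group = (id1 = [1, 1], id2 = [1, 1], v3 = [10, 20])

# Hypothetical row counter for the toy group, playing the role of `nrow`.
toy_nrow(g) = length(first(g))

# A Pair means `column => function`: the function gets only that column vector.
apply_spec(g, spec::Pair) = last(spec)(getproperty(g, first(spec)))
# Anything else is assumed to be a callable that receives the whole group.
apply_spec(g, f) = f(g)

# Evaluate keyword-style specs for one group; the keyword names become result names.
summarize(g; specs...) = map(spec -> apply_spec(g, spec), values(specs))

result = summarize(group, v3 = :v3 => sum, count = toy_nrow)
```

The pair method touches a single column, while the fallback method hands over the entire group, which is where the extra cost discussed in this thread would come from.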
If we could refer to the current group of the data frame that we are passing to by, could the API then be count = :df=>nrow?
count = :df=>nrow would mean:
take the df column of the data frame; pass it to the nrow function (which in this case would fail, as nrow is defined only for data frames).
However, count = nrow could mean:
pass a SubDataFrame to the nrow function and expect a scalar or a vector; take the result of the function and make it a column named count in the result.
Actually, this is what you observe when you call by(x, [:id1, :id2], nrow); the only difference is that the column name is then auto-generated, and the change would be to allow naming the resulting column. Currently you can achieve this with by(df, :x1, x->(count=nrow(x),)), but it is a bit cumbersome.
Yes, that makes sense, but I'm not sure we want to make this operation easy since it will necessarily be slow. The only situation where it would be useful is when you don't need a specific column, which AFAICT is only the case for nrow. But then it could easily be a trap for users if the fast variant is a bit more convoluted.
Or maybe it would be fast, given that the compiler knows we always pass a SubDataFrame to nrow and that it returns an Int? Some benchmarking would be interesting.
> Or maybe it would be fast, given that the compiler knows we always pass a SubDataFrame to nrow and that it returns an Int?
This would be fast, but essentially only in this case. That is why I said that we could consider it only for convenience (I think we have enough explanations in the docs that not passing source column names slows things down).
@nalimilan @jangorecki: can this be closed now, or should something else be added here?