It seems that group-by operations are not optimised when the by-variable is of factor type. Essentially, a group-by can be performed faster for a factor because it is represented by integers from 1 to n, where n is the number of groups.
See the Python implementation discussion http://wesmckinney.com/blog/mastering-high-performance-data-algorithms-i-group-by/
and the Julia implementation discussion https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell
Update - Benchmarks
Here are some benchmarks showing that group-by on a factor does not seem to receive extra optimisation.
I ran two benchmarks, one for a group-by variable with 2.5 million groups and one with 100 million groups. For each, I ran one version where the group-by variable is an integer and one where it is a factor. In neither case was the run time faster when the group-by variable was a factor. One can probably confirm by looking into the source code that sum grouped by a factor is not given a special fast path.
Could you add a benchmark suggesting this is the case? It's no secret that factors are integers.
Seems like group by are not optimised where the by-variable is of factor type.
Please explain what you mean by this, and why, with a minimal example. As Michael mentions, factors are internally represented as integers, and data.table makes use of that while finding groups.
Apologies for bad grammar.
I can't comment on the general group-by case, where you can apply arbitrary code to .SD, but I am thinking about the case where I am just applying a simple function, e.g. DT[, sum(v1), by = some_factor]. In this case I can write:
result <- vector("numeric", length = nlevels(DT$some_factor))
for (r in 1:nrow(DT)) {
  g <- as.integer(DT$some_factor[r])  # the factor's underlying integer code
  result[g] <- result[g] + DT$v1[r]
}
result # the group-by sums, one per level
Imagine if the above were in an efficient C/C++ implementation.
Because the number of groups, n, is already known up front, one can simply create an array of size n and use the factor's integer codes to index into it when doing a sum (or any other reduction).
If this were implemented in C/C++ as a fast path in data.table, it would be faster than the traditional group-by.
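To make the contrast concrete, here is a minimal C++ sketch (hypothetical helper names, not data.table's actual internals): a generic grouped sum must hash each key, whereas factor codes in 1..n allow direct array indexing.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Generic case: arbitrary integer keys require a hash lookup per row.
std::unordered_map<int, double> sum_by_hash(const std::vector<int>& keys,
                                            const std::vector<double>& vals) {
    std::unordered_map<int, double> acc;
    for (std::size_t i = 0; i < keys.size(); ++i)
        acc[keys[i]] += vals[i];          // hash lookup per row
    return acc;
}

// Factor fast path: codes are known to lie in 1..ngroups, so a plain
// array of accumulators replaces the hash table entirely.
std::vector<double> sum_by_code(const std::vector<int>& codes,
                                const std::vector<double>& vals,
                                int ngroups) {
    std::vector<double> acc(ngroups, 0.0);
    for (std::size_t i = 0; i < codes.size(); ++i)
        acc[codes[i] - 1] += vals[i];     // 1-based codes, like R factors
    return acc;
}
```

Both produce the same totals; the second avoids hashing because the group count and code range are known up front.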
This section of the post talks about the speed-ups achieved in Julia, which could be replicated in data.table.
I reiterate -- could you provide some benchmarks indicating the code is, for example, much slower for factors than for integer columns?
@MichaelChirico My point is not that factors are slower than integers; the point is that factors can be made even faster than they are now in special cases such as sum, mean, etc.
Benchmarks in other languages showing that optimising for factors is faster are already available, e.g. in the Julia post above (I also believe pandas optimises this case, but I'm not sure).
Hope this clears things up.
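For instance, a grouped mean can reuse the same trick: accumulate per-group sums and counts in one pass over the factor codes, then divide. A minimal C++ sketch (illustrative only; the function name is hypothetical, not from data.table):

```cpp
#include <cstddef>
#include <vector>

// Grouped mean over 1-based factor codes: one pass to accumulate
// sums and counts, then a pass over the (small) group array to divide.
std::vector<double> mean_by_code(const std::vector<int>& codes,
                                 const std::vector<double>& vals,
                                 int ngroups) {
    std::vector<double> sums(ngroups, 0.0);
    std::vector<int> counts(ngroups, 0);
    for (std::size_t i = 0; i < codes.size(); ++i) {
        sums[codes[i] - 1] += vals[i];    // direct index via factor code
        counts[codes[i] - 1] += 1;
    }
    for (int g = 0; g < ngroups; ++g)
        if (counts[g] > 0) sums[g] /= counts[g];  // empty levels stay 0
    return sums;
}
```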
@xiaodaigh any reason why you assume it is not already optimised in data.table? I can imagine it is optimised even better than in Julia or pandas, as many other algorithms are. A benchmark in R, in data.table, is very important here, so please provide one. Without a benchmark this issue should not have been created in the first place, and definitely not with the current title "group by operation not optimised for factor types".
Here are some benchmarks showing that group-by on a factor does not seem to receive extra optimisation.
Also updated original issue post with link to benchmarks.
I decided to write a simple benchmark in R showing that a faster sumby is possible, using Rcpp, when the group-by vector is a factor.
The data.table case takes about 2.2 seconds vs 0.16 seconds for optimised version on my computer.
library(dplyr)
library(data.table)
library(dtplyr)
N = 100000000
K = 100
a <- data.table(grps = sample(as.factor(1:K), N, replace = TRUE), val = runif(N))
library(Rcpp)
cppFunction('NumericVector sumbycpp(IntegerVector grps, NumericVector vals, int ngrps) {
  NumericVector out(ngrps);        // one accumulator per group
  int n = grps.size();
  for (int i = 0; i < n; ++i) {
    out[grps[i] - 1] += vals[i];   // factor codes are 1-based
  }
  return out;
}')
fastersumby <- function(grps, vals) {
ngrps <- nlevels(grps)
data.table(levels(grps), sumbycpp(grps, vals, ngrps))
}
system.time(print(a[,sum(val), grps])) # 2 seconds
system.time(print(a[,fastersumby(grps, val)])) # 0.16 seconds
system.time({
a %>%
group_by(grps) %>%
summarise(sum(val))
}) # 2.4
Thanks @xiaodaigh, this is exactly what we needed in the report.