It seems that group-by operations are not optimised when the by-variable is of factor type. Essentially, a group-by can be performed faster for a factor because it is represented by integers from 1 to n, where n is the number of groups.
See the Python implementation discussion http://wesmckinney.com/blog/mastering-high-performance-data-algorithms-i-group-by/
and the Julia implementation discussion https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell
Update - Benchmarks
Here are some benchmarks showing that group-by on a factor does not seem to receive extra optimisation.
I ran two benchmarks, one for a group-by variable with 2.5 million groups and one with 100 million groups. For each, I ran one version where the group-by variable is an integer and one where it is a factor. In neither case was the run time faster when the group-by variable was a factor. One can probably confirm by looking into the source code that sum grouped by a factor is not given a special fast path.
Could you add a benchmark suggesting this is the case? It's no secret that factors are integers.
Seems like group by are not optimised where the by-variable is of factor type.
Please explain what you mean by this, and why, with a minimal example. As Michael mentions, factors are internally represented as integers, and data.table makes use of that while finding groups.
Apologies for bad grammar.
I can't comment on the general group-by case, where you can apply arbitrary code to .SD, but I am thinking about the case where I am just applying a simple function, e.g. DT[, sum(v1), by = some_factor]. In this case I can write:
result <- vector("numeric", length = nlevels(DT$some_factor))
for (r in 1:nrow(DT)) {
  g <- as.integer(DT$some_factor[r])  # the factor's underlying integer code
  result[g] <- result[g] + DT$v1[r]
}
result # the group-by sums, one per level
Imagine if the above were in an efficient C/C++ implementation.
Because the number of groups, n, is already known up front, one can simply create an array of size n and use the factor's integer codes to index into it when doing a sum (or any other reduction).
If this were implemented in C/C++ as a fast path in data.table, it would be faster than the traditional group-by.
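To make the contrast concrete, here is a minimal C++ sketch (hypothetical helper names, not data.table's actual internals): a generic grouped sum must hash each key, whereas factor codes in 1..n allow direct array indexing.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Generic case: arbitrary integer keys require a hash lookup per row.
std::unordered_map<int, double> sum_by_hash(const std::vector<int>& keys,
                                            const std::vector<double>& vals) {
    std::unordered_map<int, double> acc;
    for (std::size_t i = 0; i < keys.size(); ++i)
        acc[keys[i]] += vals[i];          // hash lookup per row
    return acc;
}

// Factor fast path: codes are known to lie in 1..ngroups, so a plain
// array of accumulators replaces the hash table entirely.
std::vector<double> sum_by_code(const std::vector<int>& codes,
                                const std::vector<double>& vals,
                                int ngroups) {
    std::vector<double> acc(ngroups, 0.0);
    for (std::size_t i = 0; i < codes.size(); ++i)
        acc[codes[i] - 1] += vals[i];     // 1-based codes, like R factors
    return acc;
}
```

Both produce the same totals; the second avoids hashing because the group count and code range are known up front.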
This section of the post talks about the speed-ups achieved in Julia, which could be replicated in data.table.
I reiterate -- could you provide some benchmarks indicating the code is, for example, much slower for factors than for integer columns?
@MichaelChirico My point is not that factors are slower than integers; the point is that factors can be made even faster than they are now in special cases such as sum, mean, etc.
Benchmarks in other languages showing that optimising for factors is faster are already available, e.g. in the Julia post above (I also believe pandas optimises this case, but I'm not sure).
Hope this clears things up.
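For instance, a grouped mean can reuse the same trick: accumulate per-group sums and counts in one pass over the factor codes, then divide. A minimal C++ sketch (illustrative only; the function name is hypothetical, not from data.table):

```cpp
#include <cstddef>
#include <vector>

// Grouped mean over 1-based factor codes: one pass to accumulate
// sums and counts, then a pass over the (small) group array to divide.
std::vector<double> mean_by_code(const std::vector<int>& codes,
                                 const std::vector<double>& vals,
                                 int ngroups) {
    std::vector<double> sums(ngroups, 0.0);
    std::vector<int> counts(ngroups, 0);
    for (std::size_t i = 0; i < codes.size(); ++i) {
        sums[codes[i] - 1] += vals[i];    // direct index via factor code
        counts[codes[i] - 1] += 1;
    }
    for (int g = 0; g < ngroups; ++g)
        if (counts[g] > 0) sums[g] /= counts[g];  // empty levels stay 0
    return sums;
}
```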
@xiaodaigh any reason why you assume it is not already optimised in data.table? I can imagine it is optimised even better than in Julia or pandas, as many other algorithms are. A benchmark in R, in data.table, is very important here, so please provide one. Without a benchmark this issue should not have been created in the first place, and definitely not with the current title "group by operation not optimised for factor types".
Here are some benchmarks showing that group-by on a factor does not seem to receive extra optimisation.
Also updated original issue post with link to benchmarks.
I decided to write a simple benchmark in R showing that a faster sumby is possible, using Rcpp, when the group-by vector is a factor.
The data.table case takes about 2.2 seconds vs 0.16 seconds for optimised version on my computer.
library(dplyr)
library(data.table)
library(dtplyr)
N = 100000000
K = 100
a <- data.table(grps = sample(as.factor(1:K), N, replace = TRUE), val = runif(N))
library(Rcpp)
cppFunction('NumericVector sumbycpp(IntegerVector grps, NumericVector vals, int ngrps) {
  NumericVector out(ngrps);        // one accumulator per group
  int n = grps.size();
  for (int i = 0; i < n; ++i) {
    out[grps[i] - 1] += vals[i];   // factor codes are 1-based
  }
  return out;
}')
fastersumby <- function(grps, vals) {
ngrps <- nlevels(grps)
data.table(levels(grps), sumbycpp(grps, vals, ngrps))
}
system.time(print(a[,sum(val), grps])) # 2 seconds
system.time(print(a[,fastersumby(grps, val)])) # 0.16 seconds
system.time({
a %>%
group_by(grps) %>%
summarise(sum(val))
}) # 2.4
Thanks @xiaodaigh, this is exactly what we needed in the report.