Dplyr: benchmark results filtering within and outside grouping

Created on 30 Aug 2018  Â·  11Comments  Â·  Source: tidyverse/dplyr

After discussion on twitter on how to keep the first n records for each group, I concluded that filtering outside grouping is way faster. This to the surprise of @hadley, who liked to see this issue filed. Results on my Macbook Pro are included.

x <- data_frame(val = 1:10^7, group = rep(1:10^6, 10))

original <- function(x, n, grouping_col) {
   x %>%
     group_by(!!enquo(grouping_col)) %>%
     filter(row_number() %in% n) %>%
     ungroup()
}


# suggested by Inaki Ucar @Enchufa2
filter_outside <- function(x, n, grouping_col) {
  x %>%
    group_by(!!enquo(grouping_col)) %>%
    mutate(gr_nr = row_number()) %>%
    ungroup() %>%
    filter(gr_nr <= n) %>%
    select(-gr_nr)
}

# suggested by Hadly
smaller_than_n <- function(x, n, grouping_col) {
  x %>%
    group_by(!!enquo(grouping_col)) %>%
    filter(row_number() <= n) %>%
    ungroup()
}

system.time(output_original <- original(x, 1, group))

   user  system elapsed 
 21.068   0.915  22.646 

system.time(output_outside  <- filter_outside(x, 1, group))

   user  system elapsed 
  2.498   0.123   2.788 

system.time(output_smaller  <- smaller_than_n(x, 1, group))

   user  system elapsed 
 17.307   0.259  17.984 

all.equal(output_original, output_outside)

[1] TRUE

all.equal(output_original, output_smaller)

[1] TRUE

All 11 comments

Related: #1953, #3294.

Just to confirm, I get 17.5 // 1.4 // 15.4 seconds with latest CRAN version (0.7.6).

With latest GitHub version, first one has been running for more than 5 minutes already.

With latest GitHub version, first one has been running for more than 5 minutes already.

!!!

Minimal reprex with smaller data set and using bench::mark():

library(dplyr, warn.conflicts = FALSE)

f1 <- function(x, n, grouping_col) {
  x %>%
    group_by(!!enquo(grouping_col)) %>%
    filter(row_number() <= n) %>%
    ungroup()
}

# suggested by Inaki Ucar
f2 <- function(x, n, grouping_col) {
  x %>%
    group_by(!!enquo(grouping_col)) %>%
    mutate(gr_nr = row_number()) %>%
    ungroup() %>%
    filter(gr_nr <= n) %>%
    select(-gr_nr)
}

x <- tibble(val = 1:10^4, group = rep(1:10^3, 10))
bench::mark(
  f1(x, 1, group), 
  f2(x, 1, group)
)
#> # A tibble: 2 x 10
#>   expression     min    mean  median     max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl>
#> 1 f1(x, 1, … 15.89ms 17.02ms 16.69ms 21.97ms      58.8    4.91MB    11
#> 2 f2(x, 1, …  3.36ms  4.28ms  4.12ms  9.08ms     234.   746.41KB     7
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>

There's a large difference in memory allocations. That's interesting.

Don't think this is the problem.
With the newest version, the allocated memory is lower for the first solution.

For the github version of dplyr as of 8-31-2018 knitting the following gives me

processing file: Untitled.Rmd
Error: segfault from C stack overflow
# from https://github.com/tidyverse/dplyr/issues/3788
# devtools::install_github("tidyverse/dplyr")
library("dplyr")
packageVersion("dplyr")

x <- data_frame(val = 1:10^7, group = rep(1:10^6, 10))

original <- function(x, n, grouping_col) {
   x %>%
     group_by(!!enquo(grouping_col)) %>%
     filter(row_number() %in% n) %>%
     ungroup()
}


# suggested by Inaki Ucar @Enchufa2
filter_outside <- function(x, n, grouping_col) {
  x %>%
    group_by(!!enquo(grouping_col)) %>%
    mutate(gr_nr = row_number()) %>%
    ungroup() %>%
    filter(gr_nr <= n) %>%
    select(-gr_nr)
}

# suggested by Hadly
smaller_than_n <- function(x, n, grouping_col) {
  x %>%
    group_by(!!enquo(grouping_col)) %>%
    filter(row_number() <= n) %>%
    ungroup()
}

system.time(output_original <- original(x, 1, group))


system.time(output_outside  <- filter_outside(x, 1, group))



system.time(output_smaller  <- smaller_than_n(x, 1, group))



all.equal(output_original, output_outside)



all.equal(output_original, output_smaller)

The implementation of filter has changed, I'll have another look with this in mind.

One thing overlooked is that f2 benefits from a hybrid row_number and f1 does not, the expression row_number() <= 1 has to go through standard R evaluation for each group.

Consider f3 that is like f2 but also goes through standard evaluation, because identity(row_number()) is not recognized by hybrid. This looks even more expensive than f1

library(dplyr, warn.conflicts = FALSE)

f1 <- function(x, n, grouping_col) {
  x %>%
    group_by(!!enquo(grouping_col)) %>%
    filter(row_number() <= n) %>%
    ungroup()
}

# suggested by Inaki Ucar
f2 <- function(x, n, grouping_col) {
  x %>%
    group_by(!!enquo(grouping_col)) %>%
    mutate(gr_nr = row_number()) %>%
    ungroup() %>%
    filter(gr_nr <= n) %>%
    select(-gr_nr)
}

f3 <- function(x, n, grouping_col) {
  x %>%
    group_by(!!enquo(grouping_col)) %>%
    mutate(gr_nr = identity(row_number())) %>%
    ungroup() %>%
    filter(gr_nr <= n) %>%
    select(-gr_nr)
}

x <- tibble(val = 1:10^4, group = rep(1:10^3, 10))
bench::mark(
  f1(x, 1, group),
  f2(x, 1, group),
  f3(x, 1, group)
)
#> # A tibble: 3 x 10
#>   expression     min   mean  median     max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:t> <bch:> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>
#> 1 f1(x, 1, … 15.28ms 17.8ms 16.42ms  35.6ms      56.1  842.02KB    11    15
#> 2 f2(x, 1, …  4.66ms  5.1ms  4.93ms  9.62ms     196.     2.07MB     6    88
#> 3 f3(x, 1, … 18.25ms 18.8ms  18.8ms 19.49ms      53.2  141.52KB    12    12
#> # ... with 1 more variable: total_time <bch:tm>

Created on 2018-10-15 by the reprex package (v0.2.1.9000)

I'm closing this now as this is part of #3919

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

Was this page helpful?
0 / 5 - 0 ratings