After discussion on twitter on how to keep the first n records for each group, I concluded that filtering outside grouping is way faster. This to the surprise of @hadley, who liked to see this issue filed. Results on my Macbook Pro are included.
x <- data_frame(val = 1:10^7, group = rep(1:10^6, 10))
original <- function(x, n, grouping_col) {
x %>%
group_by(!!enquo(grouping_col)) %>%
filter(row_number() %in% n) %>%
ungroup()
}
# suggested by Inaki Ucar @Enchufa2
filter_outside <- function(x, n, grouping_col) {
x %>%
group_by(!!enquo(grouping_col)) %>%
mutate(gr_nr = row_number()) %>%
ungroup() %>%
filter(gr_nr <= n) %>%
select(-gr_nr)
}
# suggested by Hadly
smaller_than_n <- function(x, n, grouping_col) {
x %>%
group_by(!!enquo(grouping_col)) %>%
filter(row_number() <= n) %>%
ungroup()
}
system.time(output_original <- original(x, 1, group))
user system elapsed
21.068 0.915 22.646
system.time(output_outside <- filter_outside(x, 1, group))
user system elapsed
2.498 0.123 2.788
system.time(output_smaller <- smaller_than_n(x, 1, group))
user system elapsed
17.307 0.259 17.984
all.equal(output_original, output_outside)
[1] TRUE
all.equal(output_original, output_smaller)
[1] TRUE
Related: #1953, #3294.
Just to confirm, I get 17.5 // 1.4 // 15.4 seconds with latest CRAN version (0.7.6).
With latest GitHub version, first one has been running for more than 5 minutes already.
With latest GitHub version, first one has been running for more than 5 minutes already.
!!!
Minimal reprex with smaller data set and using bench::mark():
library(dplyr, warn.conflicts = FALSE)
f1 <- function(x, n, grouping_col) {
x %>%
group_by(!!enquo(grouping_col)) %>%
filter(row_number() <= n) %>%
ungroup()
}
# suggested by Inaki Ucar
f2 <- function(x, n, grouping_col) {
x %>%
group_by(!!enquo(grouping_col)) %>%
mutate(gr_nr = row_number()) %>%
ungroup() %>%
filter(gr_nr <= n) %>%
select(-gr_nr)
}
x <- tibble(val = 1:10^4, group = rep(1:10^3, 10))
bench::mark(
f1(x, 1, group),
f2(x, 1, group)
)
#> # A tibble: 2 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc
#> <chr> <bch:t> <bch:t> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 f1(x, 1, … 15.89ms 17.02ms 16.69ms 21.97ms 58.8 4.91MB 11
#> 2 f2(x, 1, … 3.36ms 4.28ms 4.12ms 9.08ms 234. 746.41KB 7
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>
There's a large difference in memory allocations. That's interesting.
Don't think this is the problem.
With the newest version, the allocated memory is lower for the first solution.
For the github version of dplyr as of 8-31-2018 knitting the following gives me
processing file: Untitled.Rmd
Error: segfault from C stack overflow
# from https://github.com/tidyverse/dplyr/issues/3788
# devtools::install_github("tidyverse/dplyr")
library("dplyr")
packageVersion("dplyr")
x <- data_frame(val = 1:10^7, group = rep(1:10^6, 10))
original <- function(x, n, grouping_col) {
x %>%
group_by(!!enquo(grouping_col)) %>%
filter(row_number() %in% n) %>%
ungroup()
}
# suggested by Inaki Ucar @Enchufa2
filter_outside <- function(x, n, grouping_col) {
x %>%
group_by(!!enquo(grouping_col)) %>%
mutate(gr_nr = row_number()) %>%
ungroup() %>%
filter(gr_nr <= n) %>%
select(-gr_nr)
}
# suggested by Hadly
smaller_than_n <- function(x, n, grouping_col) {
x %>%
group_by(!!enquo(grouping_col)) %>%
filter(row_number() <= n) %>%
ungroup()
}
system.time(output_original <- original(x, 1, group))
system.time(output_outside <- filter_outside(x, 1, group))
system.time(output_smaller <- smaller_than_n(x, 1, group))
all.equal(output_original, output_outside)
all.equal(output_original, output_smaller)
The implementation of filter has changed, I'll have another look with this in mind.
One thing overlooked is that f2 benefits from a hybrid row_number and f1 does not, the expression row_number() <= 1 has to go through standard R evaluation for each group.
Consider f3 that is like f2 but also goes through standard evaluation, because identity(row_number()) is not recognized by hybrid. This looks even more expensive than f1
library(dplyr, warn.conflicts = FALSE)
f1 <- function(x, n, grouping_col) {
x %>%
group_by(!!enquo(grouping_col)) %>%
filter(row_number() <= n) %>%
ungroup()
}
# suggested by Inaki Ucar
f2 <- function(x, n, grouping_col) {
x %>%
group_by(!!enquo(grouping_col)) %>%
mutate(gr_nr = row_number()) %>%
ungroup() %>%
filter(gr_nr <= n) %>%
select(-gr_nr)
}
f3 <- function(x, n, grouping_col) {
x %>%
group_by(!!enquo(grouping_col)) %>%
mutate(gr_nr = identity(row_number())) %>%
ungroup() %>%
filter(gr_nr <= n) %>%
select(-gr_nr)
}
x <- tibble(val = 1:10^4, group = rep(1:10^3, 10))
bench::mark(
f1(x, 1, group),
f2(x, 1, group),
f3(x, 1, group)
)
#> # A tibble: 3 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc n_itr
#> <chr> <bch:t> <bch:> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int>
#> 1 f1(x, 1, … 15.28ms 17.8ms 16.42ms 35.6ms 56.1 842.02KB 11 15
#> 2 f2(x, 1, … 4.66ms 5.1ms 4.93ms 9.62ms 196. 2.07MB 6 88
#> 3 f3(x, 1, … 18.25ms 18.8ms 18.8ms 19.49ms 53.2 141.52KB 12 12
#> # ... with 1 more variable: total_time <bch:tm>
Created on 2018-10-15 by the reprex package (v0.2.1.9000)
I'm closing this now as this is part of #3919
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/