It appears that the new implementation of nest() and unnest() is dramatically slower than the one in the previous version of tidyr.
Perhaps the problem is related to size preallocation?
In that case, num_rows needs to be large to observe the slowdown. See code snippet below.
library(magrittr)  # provides the %>% pipe used below
num_rows <- 100000
x <- dplyr::tibble(first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1))
before <- Sys.time()
y <- dplyr::tibble(first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1)) %>%
tidyr::nest(second_and_third = c(second, third)) %>%
tidyr::unnest(second_and_third)
after <- Sys.time()
if(length(which(x != y)) != 0){
stop("nest() and unnest() procedure results in corrupted data!")
}
cat(paste("Execution Time:",difftime(after,before,units="secs"),"seconds"))
On my system:
Execution Time: 61.2449209690094 seconds
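One way to probe the preallocation hypothesis is to time the same round trip at several sizes and see whether the elapsed time grows faster than linearly. The sketch below is only an illustration: `time_round_trip()` is a hypothetical helper, not part of tidyr, and the sizes are arbitrary and kept small so that it finishes quickly.

```r
library(tidyr)  # also re-exports tibble() and %>%

# Hypothetical helper (not part of tidyr): time one nest()/unnest() round trip.
time_round_trip <- function(num_rows) {
  df <- tibble(
    first  = seq_len(num_rows),
    second = seq_len(num_rows) + 4L,  # same values as 5:(num_rows+5-1)
    third  = seq_len(num_rows) + 6L   # same values as 7:(num_rows+7-1)
  )
  timing <- system.time({
    df %>%
      nest(second_and_third = c(second, third)) %>%
      unnest(second_and_third)
  })
  timing[["elapsed"]]
}

# Doubling sizes: roughly doubling times suggest linear scaling,
# roughly quadrupling times suggest quadratic behaviour.
sizes <- c(1000, 2000, 4000, 8000)
elapsed <- vapply(sizes, time_round_trip, numeric(1))
print(data.frame(num_rows = sizes, seconds = elapsed))
```

Single runs are noisy, so a tool such as bench::mark() would give more reliable numbers than one system.time() call per size.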
I tested this with tidyr 0.8.3:
library(magrittr)  # provides the %>% pipe used below
num_rows <- 100000
x <- dplyr::tibble(first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1))
before <- Sys.time()
y <- dplyr::tibble(first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1)) %>%
tidyr::nest(second_and_third = c(second, third)) %>%
tidyr::unnest()
after <- Sys.time()
if(length(which(x != y)) != 0){
stop("nest() and unnest() procedure results in corrupted data!")
}
cat(paste("Execution Time:",difftime(after,before,units="secs"),"seconds"))
and there I saw:
Execution Time: 5.78480935096741 seconds
That is a slowdown of about 10.6x.
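The slowdown factor is simply the ratio of the two timings reported above. The variable names in this small check are made up for illustration:

```r
# Timings reported above, in seconds.
time_new <- 61.2449209690094  # nest()/unnest() with the new implementation
time_old <- 5.78480935096741  # same pipeline under tidyr 0.8.3

slowdown <- time_new / time_old
cat(sprintf("Slowdown: %.3fx\n", slowdown))
```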
It's worth mentioning that nest_legacy() and unnest_legacy() are still fast:
library(magrittr)  # provides the %>% pipe used below
num_rows <- 100000
x <- dplyr::tibble(first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1))
before <- Sys.time()
y <- dplyr::tibble(first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1)) %>%
tidyr::nest_legacy(second_and_third = c(second, third)) %>%
tidyr::unnest_legacy()
after <- Sys.time()
if(length(which(x != y)) != 0){
stop("nest() and unnest() procedure results in corrupted data!")
}
cat(paste("Execution Time:",difftime(after,before,units="secs"),"seconds"))
On my machine:
Execution Time: 3.19051384925842 seconds
So, for performant code, one can always use the *_legacy() functions. I would suggest fixing this for the next major release and, in the meantime, directing people to the *_legacy() functions to avoid unexpected slowdowns.
"Me too" comments are not very useful. Please just click the thumbs-up button instead.
Most of the problems here have been resolved in the development version of vctrs. The issue was mainly with unnest() (really, unchop()). See r-lib/vctrs#530 for more information.
With your exact example:
# devtools::install_github("r-lib/vctrs")
library(tidyr)
num_rows <- 100000
x <- dplyr::tibble(
first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1)
)
before <- Sys.time()
y <- dplyr::tibble(
first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1)
) %>%
tidyr::nest(second_and_third = c(second, third)) %>%
tidyr::unnest(second_and_third)
after <- Sys.time()
if(length(which(x != y)) != 0){
stop("nest() and unnest() procedure results in corrupted data!")
}
cat(paste("Execution Time:",difftime(after,before,units="secs"),"seconds"))
#> Execution Time: 9.04665207862854 seconds
Created on 2019-09-24 by the reprex package (v0.2.1)
I think we could get 1-2 seconds faster with r-lib/vctrs#592.
And probably a little bit more from a native C implementation of vec_recycle_common().
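For readers unfamiliar with it, vctrs::vec_recycle_common() brings its inputs to a common size, repeating size-1 inputs as needed. A small illustration of what the function does, with made-up inputs:

```r
library(vctrs)

# A size-3 input and a size-1 input are recycled to a common size of 3.
recycled <- vec_recycle_common(1:3, "a")
str(recycled)
```

The result is a list with one element per input, each of the common size.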
Reprex comparing new and legacy functions directly:
library(tidyr)
num_rows <- 10000
df <- tibble(
first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1)
)
bench::mark(
new = df %>%
nest(second_and_third = c(second, third)) %>%
unnest(second_and_third),
old = df %>%
nest_legacy(second, third) %>%
unnest_legacy(data)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 new 906ms 906ms 1.10 3.6MB 27.6
#> 2 old 384ms 404ms 2.48 2.98MB 24.8
Created on 2019-11-13 by the reprex package (v0.3.0)
So the worst of the performance gap is now resolved. Obviously it would be nicer to do better than the previous version, but the new version is much more general, so it's not too surprising that it's currently a bit slower. Profiling shows ~15% of run time coming from drop_null() and ~70% from vec_rbind(), so, as @DavisVaughan suggests, vctrs is the obvious place to pursue performance improvements.
In theory, here are the benefits from r-lib/vctrs#592:
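A breakdown like the one above can be obtained with a sampling profiler. The thread does not say which tool was used, so the following is only a sketch using base R's Rprof() and summaryRprof(). The iteration count is an arbitrary choice to collect enough samples, and the functions that show up will depend on the installed tidyr and vctrs versions.

```r
library(tidyr)

n <- 10000
df <- tibble(
  g = 1:n,
  y = rep(list(tibble(x = 1:5)), n)
)

# Sample the call stack every 10 ms while unnest() runs repeatedly.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
for (i in 1:50) {
  result <- unnest(df, y)
}
Rprof(NULL)  # stop profiling

# by.total attributes time to each function including its callees.
prof <- summaryRprof(prof_file)
print(head(prof$by.total, 10))
```

The total.pct column of by.total gives the share of sampled time spent in each function and everything it calls, which is the kind of figure quoted above.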
library(tidyr)
num_rows <- 10000
tbl <- tibble(
first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1)
)
tbl_nest <- tbl %>%
nest(second_and_third = c(second, third))
df_nest <- as.data.frame(tbl_nest)
df_nest$second_and_third <- lapply(df_nest$second_and_third, as.data.frame)
bench::mark(
tibble_new = unnest(tbl_nest, second_and_third),
tibble_old = unnest_legacy(tbl_nest, second_and_third),
dataframe_new = unnest(df_nest, second_and_third),
dataframe_old = unnest_legacy(df_nest, second_and_third),
iterations = 30
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tibble_new 622.2ms 698ms 1.37 1.43MB 22.7
#> 2 tibble_old 92.4ms 98.9ms 10.1 808.02KB 23.2
#> 3 dataframe_new 345.6ms 431.9ms 2.25 680.05KB 19.4
#> 4 dataframe_old 62.3ms 67.2ms 14.7 575.67KB 28.3
Created on 2019-11-13 by the reprex package (v0.3.0.9000)
Slightly simpler reprex:
library(tidyr)
n <- 10000
df <- tibble(
g = 1:n,
y = rep(list(tibble(x = 1:5)), n)
)
bench::mark(
unnest(df, y),
unnest_legacy(df, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 unnest(df, y) 488.4ms 512.8ms 1.95 2.82MB 22.4
#> 2 unnest_legacy(df, y) 88.6ms 91.9ms 10.7 1.44MB 24.9
Created on 2019-11-28 by the reprex package (v0.3.0)
Incremental update, using the vctrs master branch, which now includes the big benefits from r-lib/vctrs#825 and the small benefits from r-lib/vctrs#824:
library(tidyr)
n <- 10000
df <- tibble(
g = 1:n,
y = rep(list(tibble(x = 1:5)), n)
)
bench::mark(
unnest(df, y),
unnest_legacy(df, y),
iterations = 50
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 unnest(df, y) 204ms 297.5ms 3.39 3.38MB 14.0
#> 2 unnest_legacy(df, y) 48ms 73.6ms 13.4 1.25MB 11.6
Created on 2020-02-17 by the reprex package (v0.3.0)
I've got another idea to cut this down further by moving some expensive tidyr::unchop() implementation details to C.
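For context, unchop() is the step an earlier comment identified as the real cost inside unnest(): it expands list-columns so that each element gets its own row. A minimal illustration with made-up data:

```r
library(tidyr)

# Each row holds a vector in the list-column y.
df <- tibble(g = 1:2, y = list(1:2, 3:5))

# unchop() expands y so every element gets its own row,
# repeating the matching value of g.
flat <- unchop(df, y)
print(flat)
```

Here the two input rows become five output rows, one per element of y.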
With dev dplyr, I'm now seeing:
library(tidyr)
n <- 10000
df <- tibble(
g = 1:n,
y = rep(list(tibble(x = 1:5)), n)
)
bench::mark(
unnest(df, y),
unnest_legacy(df, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 unnest(df, y) 87.2ms 90.8ms 11.0 3.62MB 20.1
#> 2 unnest_legacy(df, y) 123.7ms 129.6ms 7.73 4.11MB 23.2
Created on 2020-04-22 by the reprex package (v0.3.0)
So unnest_legacy() has slowed down a bit, but unnest() is now faster, and only slightly slower than before.
I think this is a good place to leave it. We can certainly come back to improve performance in the future, but I think the original pressing motivation is now resolved.