Tidyr: nest() / unnest() in 1.0.0 significantly slower

Created on 18 Sep 2019 · 9 Comments · Source: tidyverse/tidyr

It appears that the new implementation of nest() and unnest() has resulted in a dramatic slowdown compared to the previous version of tidyr.

Perhaps the problem is related to size preallocation?

In that case, num_rows needs to be large to observe the slowdown. See code snippet below.

  library(tidyr)  # re-exports %>%, which the pipeline below needs

  num_rows <- 100000

  x <- dplyr::tibble(first = 1:num_rows,
                     second = 5:(num_rows+5-1),
                     third = 7:(num_rows+7-1))

  before <- Sys.time()

  y <- dplyr::tibble(first = 1:num_rows,
                     second = 5:(num_rows+5-1),
                     third = 7:(num_rows+7-1)) %>%
      tidyr::nest(second_and_third = c(second, third)) %>%
      tidyr::unnest(second_and_third)

  after <- Sys.time()

  if(length(which(x != y)) != 0){
    stop("nest() and unnest() procedure results in corrupted data!")
  }

  cat(paste("Execution Time:",difftime(after,before,units="secs"),"seconds"))

On my system:

Execution Time: 61.2449209690094 seconds

Labels: feature, performance, rectangling, vctrs

All 9 comments

I tested this with tidyr 0.8.3:

num_rows <- 100000

x <- dplyr::tibble(first = 1:num_rows,
                   second = 5:(num_rows+5-1),
                   third = 7:(num_rows+7-1))

before <- Sys.time()

y <- dplyr::tibble(first = 1:num_rows,
                   second = 5:(num_rows+5-1),
                   third = 7:(num_rows+7-1)) %>%
  tidyr::nest(second_and_third = c(second, third)) %>%
  tidyr::unnest()

after <- Sys.time()

if(length(which(x != y)) != 0){
  stop("nest() and unnest() procedure results in corrupted data!")
}

cat(paste("Execution Time:",difftime(after,before,units="secs"),"seconds"))

and there I saw

Execution Time: 5.78480935096741 seconds

That is a slowdown of 10.587x compared with the 61.24 seconds reported above for 1.0.0.

It's worth mentioning that nest_legacy() and unnest_legacy() are still fast:

  num_rows <- 100000

  x <- dplyr::tibble(first = 1:num_rows,
                     second = 5:(num_rows+5-1),
                     third = 7:(num_rows+7-1))

  before <- Sys.time()

  y <- dplyr::tibble(first = 1:num_rows,
                     second = 5:(num_rows+5-1),
                     third = 7:(num_rows+7-1)) %>%
    tidyr::nest_legacy(second_and_third = c(second, third)) %>%
    tidyr::unnest_legacy()

  after <- Sys.time()

  if(length(which(x != y)) != 0){
    stop("nest() and unnest() procedure results in corrupted data!")
  }

  cat(paste("Execution Time:",difftime(after,before,units="secs"),"seconds"))

On my machine:

Execution Time: 3.19051384925842 seconds

So, for performant code, one can always use the *_legacy() functions. I would suggest fixing this for the next major release and, in the meantime, directing people to the *_legacy() functions to avoid unexpected slowdowns.

"Me too" comments are not very useful. Please just click the thumbs-up button instead.

Most of the problems here have been resolved in the development version of vctrs. The issue was mainly with unnest() (really, unchop()). See r-lib/vctrs#530 for more information.

With your exact example:

# devtools::install_github("r-lib/vctrs")

library(tidyr)

num_rows <- 100000

x <- dplyr::tibble(
  first = 1:num_rows,
  second = 5:(num_rows+5-1),
  third = 7:(num_rows+7-1)
)

before <- Sys.time()

y <- dplyr::tibble(
  first = 1:num_rows,
  second = 5:(num_rows+5-1),
  third = 7:(num_rows+7-1)
) %>%
  tidyr::nest(second_and_third = c(second, third)) %>%
  tidyr::unnest(second_and_third)

after <- Sys.time()

if(length(which(x != y)) != 0){
  stop("nest() and unnest() procedure results in corrupted data!")
}

cat(paste("Execution Time:",difftime(after,before,units="secs"),"seconds"))
#> Execution Time: 9.04665207862854 seconds

Created on 2019-09-24 by the reprex package (v0.2.1)

I think we could get 1-2 seconds faster with r-lib/vctrs#592.

And probably a little bit more from a native C implementation of vec_recycle_common().
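For readers unfamiliar with recycling, here is a minimal base R sketch of the size-1 recycling rule that vec_recycle_common() applies. The helper name recycle_common_sketch() is hypothetical, it only handles plain atomic vectors, and it is not the vctrs implementation.

```r
# Hypothetical helper (not part of vctrs): recycle plain vectors to a common size.
# Only size-1 inputs are recycled; any other size mismatch is an error.
recycle_common_sketch <- function(...) {
  args <- list(...)
  sizes <- lengths(args)

  # The common size is the single size that differs from 1, if there is one.
  candidates <- unique(sizes[sizes != 1L])
  if (length(candidates) > 1L) {
    stop("Can't recycle inputs of sizes ", paste(sizes, collapse = ", "), ".")
  }
  target <- if (length(candidates) == 0L) 1L else candidates

  # Repeat size-1 inputs up to the common size; leave the others untouched.
  lapply(args, function(x) if (length(x) == 1L) rep_len(x, target) else x)
}

out <- recycle_common_sketch(1:3, "a")
str(out)
```

Even a simple rule like this is called many times during unnesting, which is presumably why a native implementation could help.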

Reprex comparing new and legacy functions directly:

library(tidyr)

num_rows <- 10000
df <- tibble(
  first = 1:num_rows,
  second = 5:(num_rows+5-1),
  third = 7:(num_rows+7-1)
)

bench::mark(
  new = df %>%
    nest(second_and_third = c(second, third)) %>%
    unnest(second_and_third),
  old = df %>% 
    nest_legacy(second, third) %>% 
    unnest_legacy(data)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new           906ms    906ms      1.10     3.6MB     27.6
#> 2 old           384ms    404ms      2.48    2.98MB     24.8

Created on 2019-11-13 by the reprex package (v0.3.0)

So the worst of the performance gap is now resolved. It would obviously be nicer to beat the previous version, but the new version is much more general, so it's not too surprising that it's currently a bit slower. Profiling shows ~15% of run time in drop_null() and 70% in vec_rbind(), so, as @DavisVaughan suggests, vctrs is the obvious place to pursue performance improvements.
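To illustrate why row-binding can dominate a profile like this, the sketch below uses base R only (not vctrs or tidyr) to combine many tiny data frames in two ways: a single rbind() over all pieces, and a column-wise unlist(). All names are illustrative and timings vary by machine. The point is only that per-piece overhead adds up when there are thousands of small data frames.

```r
# Many tiny data frames, similar in shape to the elements of a nested list-column.
n <- 2000
pieces <- lapply(seq_len(n), function(i) data.frame(x = 1:5 + i))

# Approach 1: bind all pieces row-wise in a single call.
t_rbind <- system.time(
  bound <- do.call(rbind, pieces)
)

# Approach 2: pull the column out of every piece and concatenate once.
t_cols <- system.time(
  combined <- data.frame(x = unlist(lapply(pieces, `[[`, "x")))
)

# Both approaches produce the same values.
stopifnot(identical(bound$x, combined$x))

cat("rbind:", t_rbind[["elapsed"]], "s; column-wise:", t_cols[["elapsed"]], "s\n")
```

The column-wise version skips the per-data-frame checks, which is roughly the kind of saving a lower-level implementation can aim for.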

In theory, here are the benefits from r-lib/vctrs#592:

library(tidyr)

num_rows <- 10000

tbl <- tibble(
  first = 1:num_rows,
  second = 5:(num_rows+5-1),
  third = 7:(num_rows+7-1)
)

tbl_nest <- tbl %>%
  nest(second_and_third = c(second, third))

df_nest <- as.data.frame(tbl_nest)
df_nest$second_and_third <- lapply(df_nest$second_and_third, as.data.frame)

bench::mark(
  tibble_new = unnest(tbl_nest, second_and_third),
  tibble_old = unnest_legacy(tbl_nest, second_and_third),
  dataframe_new = unnest(df_nest, second_and_third),
  dataframe_old = unnest_legacy(df_nest, second_and_third),
  iterations = 30
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tibble_new     622.2ms    698ms      1.37    1.43MB     22.7
#> 2 tibble_old      92.4ms   98.9ms     10.1   808.02KB     23.2
#> 3 dataframe_new  345.6ms  431.9ms      2.25  680.05KB     19.4
#> 4 dataframe_old   62.3ms   67.2ms     14.7   575.67KB     28.3

Created on 2019-11-13 by the reprex package (v0.3.0.9000)

Slightly simpler reprex:

library(tidyr)
n <- 10000

df <- tibble(
  g = 1:n,
  y = rep(list(tibble(x = 1:5)), n)
)

bench::mark(
  unnest(df, y),
  unnest_legacy(df, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest(df, y)         488.4ms  512.8ms      1.95    2.82MB     22.4
#> 2 unnest_legacy(df, y)   88.6ms   91.9ms     10.7     1.44MB     24.9

Created on 2019-11-28 by the reprex package (v0.3.0)

Incremental update, using the vctrs master branch after it gained big benefits from r-lib/vctrs#825 and small benefits from r-lib/vctrs#824:

library(tidyr)
n <- 10000

df <- tibble(
  g = 1:n,
  y = rep(list(tibble(x = 1:5)), n)
)

bench::mark(
  unnest(df, y),
  unnest_legacy(df, y),
  iterations = 50
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest(df, y)           204ms  297.5ms      3.39    3.38MB     14.0
#> 2 unnest_legacy(df, y)     48ms   73.6ms     13.4     1.25MB     11.6

Created on 2020-02-17 by the reprex package (v0.3.0)

I've got another idea to cut this down further by moving some expensive tidyr::unchop() implementation details to C.

With dev dplyr, I'm now seeing:

library(tidyr)
n <- 10000

df <- tibble(
  g = 1:n,
  y = rep(list(tibble(x = 1:5)), n)
)

bench::mark(
  unnest(df, y),
  unnest_legacy(df, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest(df, y)          87.2ms   90.8ms     11.0     3.62MB     20.1
#> 2 unnest_legacy(df, y)  123.7ms  129.6ms      7.73    4.11MB     23.2

Created on 2020-04-22 by the reprex package (v0.3.0)

So unnest_legacy() has slowed down a bit, but unnest() is now faster than unnest_legacy(), and only slightly slower than unnest_legacy() was before.
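As a rough summary, the median timings from the three "simpler reprex" runs in this thread can be put side by side. The numbers are copied from the bench::mark() outputs above; the data frame and column names are just for illustration.

```r
# Median timings (ms) copied from the bench::mark() outputs in this thread.
medians <- data.frame(
  date          = as.Date(c("2019-11-28", "2020-02-17", "2020-04-22")),
  unnest        = c(512.8, 297.5, 90.8),
  unnest_legacy = c(91.9, 73.6, 129.6)
)

# A ratio above 1 means unnest() was slower than unnest_legacy() in that run.
medians$ratio <- round(medians$unnest / medians$unnest_legacy, 2)
print(medians)
```

The runs were made on different dates with different package versions, so the ratio within a run is more meaningful than a comparison of absolute timings across runs.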

I think this is a good place to leave it. We can certainly come back to improve performance in the future, but I think the original pressing motivation is now resolved.
