Tidyr: Slow unnest() performance

Created on 29 Jul 2019 · 9 Comments · Source: tidyverse/tidyr

The development version of tidyr (commit a0ecd23) exhibits very slow unnest() performance (at least on Linux). The behaviour can even be reproduced when unnesting a tibble that isn't nested, as in the following reprex.

The CRAN version (0.8.3) performs well:

library(tidyverse)

system.time({ tibble(x = seq(1E5)) %>%  unnest(x) })
#>    user  system elapsed 
#>   0.173   0.000   0.174

Created on 2019-07-29 by the reprex package (v0.3.0)

The latest development version (a0ecd23) is very slow:

library(tidyverse)

system.time({ tibble(x = seq(1E5)) %>%  unnest(x) })
#>    user  system elapsed 
#> 105.364   0.328 105.712

Created on 2019-07-29 by the reprex package (v0.3.0)

All 9 comments

Could you please include the output from using profvis on a smaller example (uploading to RPubs is probably the easiest way to share)? And what version of vctrs do you have?
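A minimal sketch of the kind of profile requested above, assuming the profvis package is installed. The input size of 1e4 rows and the output file name are illustrative assumptions, not values from the thread. If the code runs too quickly, profvis may not collect enough samples, in which case a larger n would be needed.

```r
library(tidyr)
library(tibble)
library(profvis)

# Smaller input than the 1e5-row reprex so the profile finishes quickly.
# The size is an arbitrary choice for illustration.
n <- 1e4
small <- tibble(x = as.list(seq_len(n)))

# Profile a single unnest() call; the result is an interactive HTML widget.
p <- profvis({
  unnest(small, x)
})

# Save a standalone HTML file that can be uploaded or attached
# (the file name is hypothetical).
htmlwidgets::saveWidget(p, "unnest-profile.html")
```

The flame graph in the saved file shows which calls account for most of the elapsed time.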

The profvis output can be found here. I also included my sessioninfo(): I'm using vctrs 0.2.0 from CRAN. I tried the latest dev version (fadbe6b) and the drop in performance is still there.

I don't know if it's useful, but I tried to improve the profvis document on RPubs to directly compare both tidyr versions (v0.8.3 vs commit a0ecd23). I wasn't able to demonstrate the issue using a very small tibble, but the time seems to be lost while checking the names during the list-to-tibble conversion (in the tibble package)...

Hmmm, the vast majority of time seems to be spent in as_tibble.list(), so I suspect this is really a problem in tibble. Looks like I can just use vctrs::new_data_frame() instead.
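To illustrate the difference between the two constructors mentioned above, here is a small sketch. It is not the actual change made in tidyr; the column names and values are made up for the example.

```r
library(vctrs)
library(tibble)

cols <- list(x = 1:3, y = c("a", "b", "c"))

# as_tibble() validates column names and lengths on the way in.
checked <- as_tibble(cols)

# new_data_frame() is a low-level constructor: it assumes the input is
# already a well-formed list of equal-length columns and skips those checks.
raw <- new_data_frame(cols)

# Passing the tibble classes makes the result behave like a tibble.
as_tbl <- new_data_frame(cols, class = c("tbl_df", "tbl"))
```

Skipping validation is only safe when the caller already guarantees well-formed input, which is why such a constructor suits internal package code rather than user-facing entry points.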

Ok, the primary cause is what looks like a bug in vctrs which is causing the time to increase non-linearly.
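One rough way to check for non-linear growth is to time unnest() at doubling input sizes. This is only a sketch: the sizes are arbitrary, time_unnest() is a hypothetical helper, and single system.time() measurements are noisy.

```r
library(tidyr)
library(tibble)

# Hypothetical helper: elapsed seconds for unnesting an n-row list-column.
time_unnest <- function(n) {
  df <- tibble(x = as.list(seq_len(n)))
  unname(system.time(unnest(df, x))["elapsed"])
}

sizes <- c(1000, 2000, 4000, 8000)
elapsed <- vapply(sizes, time_unnest, numeric(1))

data.frame(n = sizes, elapsed = elapsed)

# With linear cost, doubling n roughly doubles the elapsed time, so ratios
# well above 2 suggest super-linear growth. On a fast version the timings
# may be near zero, which makes the ratios meaningless.
elapsed[-1] / elapsed[-length(elapsed)]
```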

I have updated tidyr to version 1.0.0 and I am also seeing slow performance from unnest(), just as described by @koncina. For now I am going to use unnest <- unnest_legacy, as described in the release NEWS, to keep execution times down. In one example, I need to unnest a data frame with 230 000 rows (1 million rows once unnested, plus 4 added columns); it takes 11 seconds using unnest_legacy() and around 24 minutes with unnest(). Is there any estimate of if and when the underlying vctrs issue might be solved? Thanks a lot!

It will be fixed in the next vctrs release. We provided unnest_legacy() for exactly this sort of situation.
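The workaround mentioned above could look roughly like this, assuming tidyr 1.0.0 or later (where unnest_legacy() is available). The example data is made up.

```r
library(tidyr)
library(tibble)

df <- tibble(id = 1:2, x = list(1:3, 4:5))

# Option 1: call the legacy implementation explicitly.
res_legacy <- unnest_legacy(df, x)

# Option 2: mask unnest() for the current session so existing code
# keeps working unchanged.
unnest <- unnest_legacy
res_masked <- unnest(df, x)
```

Calling unnest_legacy() explicitly is easier to find and remove later, while masking avoids touching existing call sites but silently changes the behaviour of every later unnest() call in the session.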

Without wanting to pile on here, I'm seeing the same issue. unnest_legacy() does not provide a keep_empty argument (which resolves a big issue I was facing), so I'm stuck between two sub-optimal options.
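For context, a small sketch of what keep_empty does in unnest() in tidyr 1.0.0 or later. The example data is made up.

```r
library(tidyr)
library(tibble)

# The second row has an empty (NULL) list element.
df <- tibble(id = 1:3, x = list(1:2, NULL, 3L))

# Default: rows whose list element is empty are dropped.
dropped <- unnest(df, x)

# keep_empty = TRUE keeps such rows and fills the unnested column with NA.
kept <- unnest(df, x, keep_empty = TRUE)
```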

Could you please link the relevant issue in vctrs here so we can see when it's resolved?

In my code unnest_legacy() takes about 0.18 seconds and unnest() takes 119 seconds. Just updated to vctrs 0.2.4.

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] simsAzure_0.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        compiler_3.6.1    pillar_1.4.3      prettyunits_1.0.2 bitops_1.0-6      remotes_2.1.0    
 [7] tools_3.6.1       testthat_2.2.1    digest_0.6.23     pkgbuild_1.0.6    pkgload_1.0.2     jsonlite_1.6     
[13] lifecycle_0.1.0   lubridate_1.7.4   memoise_1.1.0     tibble_2.1.3      pkgconfig_2.0.3   rlang_0.4.5      
[19] cli_2.0.1         rstudioapi_0.10   curl_4.3          stringr_1.4.0     withr_2.1.2       httr_1.4.1       
[25] dplyr_0.8.3       tictoc_1.0        vctrs_0.2.4       desc_1.2.0        fs_1.3.1          devtools_2.2.1   
[31] rprojroot_1.3-2   tidyselect_1.0.0  glue_1.3.1        R6_2.4.1          processx_3.4.1    fansi_0.4.1      
[37] sessioninfo_1.1.1 tidyr_1.0.2       callr_3.3.2       purrr_0.3.3       magrittr_1.5      backports_1.1.5  
[43] ps_1.3.0          ellipsis_0.3.0    usethis_1.5.1     assertthat_0.2.1  keyring_1.1.0     utf8_1.1.4       
[49] stringi_1.4.5     RCurl_1.98-1.1    crayon_1.3.4    