I have a script that takes only 4 seconds to run in v1.12.8, and more than 24 hours when using v1.13.0. All other variables are the same (computer, R and RStudio versions, dataset, etc.).
All relevant and detailed information about this issue is available on Stack Overflow:
https://stackoverflow.com/questions/63105711/why-data-table-unnesting-time-grows-with-the-square-of-number-of-rows-for-a-spec
I am not a programmer, so I am not able to go through the code. Below you will find the sessionInfo() output describing the environment AFTER changing back to v1.12.8.
Please let me know whether I can be of any further help.
BTW, congratulations on this amazing game-changer package.
Regards,
Fabio.
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-7 data.table_1.12.8 lubridate_1.7.9 stringi_1.4.6 runner_0.3.7 e1071_1.7-3
[7] ggplot2_3.3.2 stringr_1.4.0 magrittr_1.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2 class_7.3-17 tools_4.0.2 digest_0.6.25 packrat_0.5.0 evaluate_0.14
[9] lifecycle_0.2.0 tibble_3.0.3 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.7 rstudioapi_0.11 yaml_2.2.1 xfun_0.16
[17] withr_2.2.0 dplyr_1.0.0 knitr_1.29 generics_0.0.2 vctrs_0.3.2 grid_4.0.2 tidyselect_1.1.0 glue_1.4.1
[25] R6_2.4.1 rmarkdown_2.3 purrr_0.3.4 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0 colorspace_1.4-1 tinytex_0.25
[33] munsell_0.5.0 crayon_1.3.4
Many thanks for the great report and I've downloaded your RDS from S.O.
There was a change to [[ by group in this release, so we'll have to look into that; see bug fix 5 in NEWS:
GForce is deactivated for [[ on non-atomic input, part of #4159. Thanks @hongyuanjia and @ColeMiller1 for helping debug an issue in dev with the original fix before release, #4612.
The verbose output in the S.O. post (thanks) shows the following:
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
Which is more of a clue for us to fix in this case than something for the user to act on.
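To make that verbose message concrete, here is a minimal sketch (toy data, not the reporter's dataset) contrasting a list() j, which gets the names-stripping speedup, with transform(), which rebuilds the same names inside every group:

```r
library(data.table)

dt <- data.table(g = rep(1:3, each = 2L), x = 1:6)

# list() j: names are detected once, removed, and put back after grouping
fast <- dt[, list(x2 = x * 2L), by = g]

# transform() re-creates the same names inside every group,
# which is what the verbose message warns about
slow <- dt[, transform(.SD, x2 = x * 2L), by = g]
```

Both return the same x2 values; only the per-group naming work differs.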
I had the same initial thought for the cause, but SO also reported the same slowdown for an unlist command.
Anyway there's repro data now which makes all the difference. Thanks @fabiocs8 !
Glad to help you.
I would like to deactivate the link for the dataset. Have you already made a copy of it?
Regards, Fabio.
--
Tecknowledge Consultoria e Serviços LTDA
Fábio Corrêa da Silva
Glad to help.
Because you have downloaded the dataset, I will deactivate the link.
Please let me know whether I can be of further help.
Regards,
Fabio.
Thanks for doing all the work of re-installing 1.12.8 to provide additional information. This type of info is great!
The most interesting thing in 1.13.0 is that there is a major performance difference between dt[, v[[1L]], by = grps] versus dt[seq_len(.N), v[[1L]], by = grps] for your RDS. I could only reproduce the performance degradation using the RDS and not the simulated data. Here is some of the verbose output in 1.13.0:
##dt[seq_len(.N), ...]
collecting discontiguous groups took 0.000s for 5000 groups
##dt[, ...]
memcpy contiguous groups took 50.502s for 5000 groups
It does appear to be a regression from 1.12.8. If I deactivate the GForce optimization using parentheses (e.g., (val[[1L]])), I get very different results:
##dt[seq_len(.N), (...), ...]
collecting discontiguous groups took 0.004s for 5000 groups
##dt[, (...), ...]
memcpy contiguous groups took 0.000s for 5000 groups
So it appears to be related to memcpy. For now, a workaround in 1.13.0 appears to be item_res[seq_len(.N), lance[[1]], by = item_id].
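As a small sketch of the three j forms discussed (toy list column; the slowdown itself only reproduces on the RDS), all of which return the same unnested result:

```r
library(data.table)

# toy list column standing in for the real 'lance'/'value' columns
dt <- data.table(id = 1:3, value = list(1:2, 3:4, 5:6))

gforce  <- dt[, value[[1L]], by = id]              # GForce g[[ path
with_i  <- dt[seq_len(.N), value[[1L]], by = id]   # supplying i avoids GForce
wrapped <- dt[, (value[[1L]]), by = id]            # parentheses also disable GForce
```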
In the meantime, here's a way to reproduce the issue with simulated data:
library(data.table) #1.13.0
set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(id = seq_len(n),
value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))
bench::mark(
dt[seq_len(.N), value[[1L]], by = id]
, dt[, value[[1L]], by = id]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 dt[seq_len(.N), value[[1L]], by = id] 2.56ms 3.05ms 287. 2.24MB
#> 2 dt[, value[[1L]], by = id] 219.41ms 250.98ms 3.98 51.91MB
#> # ... with 1 more variable: `gc/sec` <dbl>
Using profmem, our seq_len(.N) approach has 29 allocations. Our NULL-i approach has around 250,000 allocations, many of which are 208 bytes each.
library(profmem)
profmem( dt[seq_len(.N), value[[1L]], by = id])
profmem( dt[, value[[1L]], by = id]) ## SLOW!
I noticed other related weird behaviors. All the commands below were run under v1.12.8 on the full database:
1) Problems when using item[, lance[[1]], by = unnest_names, verbose = T]; see below:
> microbenchmark(
item_int <- item[, lance[[1]], by = unnest_names,verbose = T]
, times = 1L)
Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.170s elapsed (0.200s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'lance[[1]]'
GForce optimized j to 'g[[(lance, 1)'
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.002
gforce assign high and low took 0.002
gforce eval took 0.035
0.130s elapsed (0.140s cpu)
Unit: seconds
expr min lq mean median uq max neval
item_int <- item[, lance[[1]], by = unnest_names, verbose = T] 1.175517 1.175517 1.175517 1.175517 1.175517 1.175517 1
> item_int
Error in `[.data.table`(x, i, , ) :
  Column 12 ['V1'] is a data.frame or data.table; malformed data.table.
Note that 1) data.table did not issue any error or warning, and 2) the number of output rows is the same as the input (872,581), which does not make sense (the output should have 16,070,070 rows).
2) item_int <- item[, unlist(lance, recursive = F), by = unnest_names] runs smoothly and correctly:
> microbenchmark(
item_int <- item[, unlist(lance, recursive = F), by = unnest_names]
, times = 5L)
Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.140s elapsed (0.120s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'unlist(lance, recursive = F)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... dogroups: growing from 872581 to 18513352 rows
Wrote less rows (16070070) than allocated (18513352).
memcpy contiguous groups took 0.074s for 872581 groups
eval(j) took 1.400s for 872581 calls
8.200s elapsed (6.940s cpu)
Unit: seconds
                                                                                expr      min       lq     mean median       uq      max neval
 item_int <- item[, unlist(lance, recursive = F), by = unnest_names, verbose = TRUE] 5.777739 8.352493 8.204699 8.3721 8.698475 9.822689     5
Unit: seconds
expr min lq mean median uq max neval
item_int <- item[, unlist(lance, recursive = F), by = unnest_names] 8.022987 8.245593 8.93737 8.914378 8.982508 10.52138 5
3) item_int <- item[, rbindlist(lance), by = unnest_names] runs correctly, but takes a long time... does that make sense?
> microbenchmark(
item_int <- item[, rbindlist(lance), by = unnest_names, verbose = TRUE]
, times = 1L)
Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.300s elapsed (0.280s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'rbindlist(lance)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
dogroups: growing from 872581 to 18513352 rows
Wrote less rows (16070070) than allocated (18513352).
memcpy contiguous groups took 0.113s for 872581 groups
eval(j) took 98.270s for 872581 calls
00:01:43 elapsed (00:01:39 cpu)
Unit: seconds
expr min lq mean median uq max neval
item_int <- item[, rbindlist(lance), by = unnest_names, verbose = TRUE] 103.8269 103.8269 103.8269 103.8269 103.8269 103.8269 1
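For reference, the two unnesting forms can be checked for agreement on a tiny simulated table (hypothetical column names standing in for the real data):

```r
library(data.table)

# tiny stand-in for the real table (hypothetical values)
item <- data.table(item_id = 1:3,
                   lance   = list(data.table(a = 1:2),
                                  data.table(a = 3:4),
                                  data.table(a = 5:6)))

# unlist one level per group: j returns a named list of columns
via_unlist    <- item[, unlist(lance, recursive = FALSE), by = item_id]

# rbindlist per group: j returns a data.table
via_rbindlist <- item[, rbindlist(lance), by = item_id]
```

Both produce the same unnested column, so the two forms differ only in per-group overhead.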
Regards,
Fabio.
Additional research: #4164 is the commit where this started.
## merged commit before #4164
##remotes::install_github("https://github.com/Rdatatable/data.table/tree/793f8545c363d222de18ac892bc7abb80154e724") # good
## expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
##1 dt[seq_len(.N), value[[1L]], by = id] 3.68ms 4.06ms 232. 2.26MB 5.11 91 2 392ms <data.table[,2] [20,000 ~ <Rprofmem[,3] [260 ~ <bch:tm [93~ <tibble [93 x ~
##2 dt[, value[[1L]], by = id] 2.06ms 2.29ms 430. 401.24KB 4.16 207 2 481ms <data.table[,2] [20,000 ~ <Rprofmem[,3] [17 x~ <bch:tm [20~ <tibble [209 x~
## final commit with merged #4164
# remotes::install_github("https://github.com/Rdatatable/data.table/tree/4aadde8f5a51cd7c8f3889964e7280432ec65bbc") #bad - merged commit
## expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
##1 dt[seq_len(.N), value[[1L]], by = id] 1.25ms 1.39ms 572. 2.26MB 12.0 286 6 500ms <data.table[,2] [20,000~ <Rprofmem[,3] [260 x 3~ <bch:tm [2~ <tibble [286 ~
##2 dt[, value[[1L]], by = id] 173.01ms 188.94ms 5.29 52.03MB 5.29 3 3 567ms <data.table[,2] [20,000~ <Rprofmem[,3] [251,669~ <bch:tm [3~ <tibble [3 x ~
The last thing I looked into is that if byval is ordered (i.e., length(o__) == 0L), then memrecycle is triggered. So while ideally memrecycle would be more performant on lists, another way would be to only do contiguous gathering when (length(o__) && !any_typeof_cols_is_list) (i.e., when there is a list column, we'd use subsetVectorRaw instead). Alternatively we could look into an unnest function more (#4156).
Performance difference between the discontiguous and contiguous code paths:
library(data.table) #1.13.0
set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(ordered_id = seq_len(n),
unordered_id = sample(n),
value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))
bench::mark(
dt[, value[[1L]], by = ordered_id]
, dt[, value[[1L]], by = unordered_id]
, check = FALSE
, time_unit = "ms"
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <dbl> <dbl> <dbl> <bch:byt>
#> 1 dt[, value[[1L]], by = ordered_id] 93.5 124. 8.25 53.7MB
#> 2 dt[, value[[1L]], by = unordered_id] 0.634 1.69 629. 409.2KB
#> # ... with 1 more variable: `gc/sec` <dbl>
I will work on a PR for !any_type_of_col_is_list to do subsetVectorRaw this weekend unless I hear there's a better route to fix the root cause.