Data.table: [[ by group takes forever (24 hours +) with v1.13.0 vs 4 seconds with v1.12.8

Created on 27 Jul 2020 · 7 comments · Source: Rdatatable/data.table

I have a script that takes only 4 seconds to run with v1.12.8 and more than 24 hours with v1.13.0. All other variables are the same (computer, R and RStudio versions, dataset, etc.).

All relevant and detailed information about this issue is contained in the Stack Overflow question below.

https://stackoverflow.com/questions/63105711/why-data-table-unnesting-time-grows-with-the-square-of-number-of-rows-for-a-spec
I am not a programmer, so I am not able to go through the code. Below you will find the sessionInfo() regarding the environment AFTER changing back to v1.12.8.

Please let me know whether I can be of any further help.

BTW, congratulations on this amazing game-changer package.

Regards,

Fabio.

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] microbenchmark_1.4-7 data.table_1.12.8 lubridate_1.7.9 stringi_1.4.6 runner_0.3.7 e1071_1.7-3
[7] ggplot2_3.3.2 stringr_1.4.0 magrittr_1.5

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2 class_7.3-17 tools_4.0.2 digest_0.6.25 packrat_0.5.0 evaluate_0.14
[9] lifecycle_0.2.0 tibble_3.0.3 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.7 rstudioapi_0.11 yaml_2.2.1 xfun_0.16
[17] withr_2.2.0 dplyr_1.0.0 knitr_1.29 generics_0.0.2 vctrs_0.3.2 grid_4.0.2 tidyselect_1.1.0 glue_1.4.1
[25] R6_2.4.1 rmarkdown_2.3 purrr_0.3.4 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0 colorspace_1.4-1 tinytex_0.25
[33] munsell_0.5.0 crayon_1.3.4

performance

All 7 comments

Many thanks for the great report and I've downloaded your RDS from S.O.

There was a change to [[ by group in this release, so we'll have to look into that; bug fix 5 in NEWS:

GForce is deactivated for [[ on non-atomic input, part of #4159. Thanks @hongyuanjia and @ColeMiller1 for helping debug an issue in dev with the original fix before release, #4612.

The verbose output in the S.O. (thanks) shows the following:

Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

That is more of a clue for us to fix in this case than for the user.

I had the same initial thought for the cause, but SO also reported the same slowdown for an unlist command.

Anyway there's repro data now which makes all the difference. Thanks @fabiocs8 !

Glad to help you.

I would like to deactivate the link for the dataset. Have you already made a copy of it?

Regards, Fabio.


Glad to help.

Because you have downloaded the dataset, I will deactivate the link.

Please let me know whether I can be of further help.

Regards,

Fabio.


Thanks for doing all the work of re-installing 1.12.8 to provide additional information. This type of info is great!

The most interesting thing in 1.13.0 is that there is a major performance difference between dt[, v[[1L]], by = grps] and dt[seq_len(.N), v[[1L]], by = grps] for your RDS. At first I could only reproduce the performance degradation using the RDS and not simulated data. Here is some of the verbose output in 1.13.0:

##dt[seq_len(.N), ...]
collecting discontiguous groups took 0.000s for 5000 groups 

##dt[, ...]
memcpy contiguous groups took 50.502s for 5000 groups 

It does appear to be a regression from 1.12.8. If I deactivate the GForce optimization using parentheses (e.g., (val[[1L]])), I get very different results:

##dt[seq_len(.N), (...), ...]
collecting discontiguous groups took 0.004s for 5000 groups 

##dt[, (...), ...]
memcpy contiguous groups took 0.000s for 5000 groups 

So it appears to be related to memcpy. For now, a workaround in 1.13.0 appears to be item_res[seq_len(.N), lance[[1]], by = item_id].

In the meantime, here's a way to reproduce the issue with simulated data:

library(data.table) #1.13.0

set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(id = seq_len(n),
                value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))

bench::mark(
    dt[seq_len(.N), value[[1L]], by = id]
    , dt[, value[[1L]], by = id]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                                 min   median `itr/sec` mem_alloc
#>   <bch:expr>                            <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 dt[seq_len(.N), value[[1L]], by = id]   2.56ms   3.05ms    287.      2.24MB
#> 2 dt[, value[[1L]], by = id]            219.41ms 250.98ms      3.98   51.91MB
#> # ... with 1 more variable: `gc/sec` <dbl>
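
Not part of the original benchmark, but the parentheses trick described above for the RDS can also be tried on this simulated dt; wrapping j in parentheses prevents data.table from optimizing it, and verbose = TRUE shows which grouping path is taken (a sketch only, no timings reproduced here):

dt[, value[[1L]], by = id, verbose = TRUE]             ## NULL i, optimized j (the slow case above)
dt[, (value[[1L]]), by = id, verbose = TRUE]            ## parentheses deactivate the optimization of j
dt[seq_len(.N), value[[1L]], by = id, verbose = TRUE]   ## non-NULL i (the fast workaround)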

Using profmem, our seq_len(.N) approach has 29 allocations. Our _NULL_ i approach has around 250,000 allocations, many of which are 208-byte allocations.

library(profmem)
profmem( dt[seq_len(.N), value[[1L]], by = id])
profmem( dt[, value[[1L]], by = id]) ## SLOW!
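
Those profmem() results can be summarized directly, since profmem() returns a data frame with a bytes column (a usage sketch, not from the original comment):

p_fast <- profmem(dt[seq_len(.N), value[[1L]], by = id])
p_slow <- profmem(dt[, value[[1L]], by = id])          ## SLOW!
nrow(p_fast)                      ## ~29 allocation records
nrow(p_slow)                      ## ~250,000 allocation records
sum(p_slow$bytes, na.rm = TRUE)   ## total bytes allocated on the slow path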

I noticed other related weird behaviors. All the commands below were run under v1.12.8 on the full dataset:

1) Problems when using _item[, lance[[1]], by = unnest_names, verbose = T]_; see below:

> microbenchmark(
    item_int <- item[, lance[[1]], by = unnest_names,verbose = T]
    , times = 1L)

Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.170s elapsed (0.200s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'lance[[1]]'
GForce optimized j to 'g[[(lance, 1)'
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.002
gforce assign high and low took 0.002
gforce eval took 0.035
0.130s elapsed (0.140s cpu)
Unit: seconds
expr min lq mean median uq max neval
item_int <- item[, lance[[1]], by = unnest_names, verbose = T] 1.175517 1.175517 1.175517 1.175517 1.175517 1.175517 1

> item_int
Error in [.data.table(x, i, , ) :
Column 12 ['V1'] is a data.frame or data.table; _malformed data.table_.

Note that 1) data.table has not issued any error or warning, and 2) the output number of rows is the same as the input data (872,581), which does not make sense (the output should have 16,070,070 rows).
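
A hypothetical way to inspect the malformed result without printing it (printing is what triggers the error above); not part of the original report:

sapply(item_int, class)   ## shows which column(s) are themselves data.table/data.frame
nrow(item_int)            ## matches the input row count rather than the expected unnested count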

2) _item_int <- item[, unlist(lance, recursive = F), by = item_id]_ runs smoothly and correctly:

> microbenchmark(
    item_int <- item[, unlist(lance, recursive = F), by = unnest_names]
    , times = 5L)

Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.140s elapsed (0.120s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'unlist(lance, recursive = F)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... dogroups: growing from 872581 to 18513352 rows
Wrote less rows (16070070) than allocated (18513352).

memcpy contiguous groups took 0.074s for 872581 groups
eval(j) took 1.400s for 872581 calls
8.200s elapsed (6.940s cpu)
Unit: seconds
expr min lq mean median uq max neval
item_int <- item[, unlist(lance, recursive = F), by = unnest_names, verbose = TRUE] 5.777739 8.352493 8.204699 8.3721 8.698475 9.822689 5

Unit: seconds
expr min lq mean median uq max neval
item_int <- item[, unlist(lance, recursive = F), by = unnest_names] 8.022987 8.245593 8.93737 8.914378 8.982508 10.52138 5

3) _item_int <- item[, rbindlist(lance), by = unnest_names]_ runs correctly, but takes a long time... does that make sense?

> microbenchmark(
    item_int <- item[, rbindlist(lance), by = unnest_names, verbose = TRUE]
    , times = 1L)

Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.300s elapsed (0.280s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'rbindlist(lance)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
dogroups: growing from 872581 to 18513352 rows
Wrote less rows (16070070) than allocated (18513352).

memcpy contiguous groups took 0.113s for 872581 groups
eval(j) took 98.270s for 872581 calls
00:01:43 elapsed (00:01:39 cpu)
Unit: seconds
expr min lq mean median uq max neval
item_int <- item[, rbindlist(lance), by = unnest_names, verbose = TRUE] 103.8269 103.8269 103.8269 103.8269 103.8269 103.8269 1
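
Not from the original thread, but the verbose output above (eval(j) called 872,581 times, ~98s) suggests why the per-group rbindlist is slow. A hedged sketch of unnesting by calling rbindlist() once over the whole list column instead, assuming lance holds one data.table per row and unnest_names is a character vector of the grouping column names:

## sketch only: rbindlist once, then re-attach the grouping columns by originating row
unnested <- rbindlist(item$lance, idcol = "item_row")
item_int2 <- cbind(item[unnested$item_row, ..unnest_names], unnested[, !"item_row"])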

Regards,

Fabio.

Additional research: #4164 is the commit where this started.

## merged commit before #4164
##remotes::install_github("https://github.com/Rdatatable/data.table/tree/793f8545c363d222de18ac892bc7abb80154e724") # good

##  expression                                 min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                    memory               time         gc             
##  <bch:expr>                            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                    <list>               <list>       <list>         
##1 dt[seq_len(.N), value[[1L]], by = id]   3.68ms   4.06ms      232.    2.26MB     5.11    91     2      392ms <data.table[,2] [20,000 ~ <Rprofmem[,3] [260 ~ <bch:tm [93~ <tibble [93 x ~
##2 dt[, value[[1L]], by = id]              2.06ms   2.29ms      430.  401.24KB     4.16   207     2      481ms <data.table[,2] [20,000 ~ <Rprofmem[,3] [17 x~ <bch:tm [20~ <tibble [209 x~

## final commit with merged #4164 
# remotes::install_github("https://github.com/Rdatatable/data.table/tree/4aadde8f5a51cd7c8f3889964e7280432ec65bbc") #bad - merged commit
##  expression                                 min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                   memory                  time        gc            
##  <bch:expr>                            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                   <list>                  <list>      <list>        
##1 dt[seq_len(.N), value[[1L]], by = id]   1.25ms   1.39ms    572.      2.26MB    12.0    286     6      500ms <data.table[,2] [20,000~ <Rprofmem[,3] [260 x 3~ <bch:tm [2~ <tibble [286 ~
##2 dt[, value[[1L]], by = id]            173.01ms 188.94ms      5.29   52.03MB     5.29     3     3      567ms <data.table[,2] [20,000~ <Rprofmem[,3] [251,669~ <bch:tm [3~ <tibble [3 x ~

The last thing I looked into is that if byval is ordered (i.e., length(o__) == 0L), then memrecycle is triggered. So while ideally memrecycle would be more performant on lists, another way would be to only do contiguous gathering when length(o__) == 0L && !any_typeof_cols_is_list (i.e., when there is a list column, we'd use subsetVectorRaw instead). Alternatively we could look into an unnest function more (#4156).

Code for the discontiguous vs. contiguous paths:

https://github.com/Rdatatable/data.table/blob/db618444a4d16b91cccf35d80fa3f306739dd571/src/dogroups.c#L161-L179

Performance difference:

library(data.table) #1.13.0

set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(ordered_id = seq_len(n),
                unordered_id = sample(n),
                value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))

bench::mark(
    dt[, value[[1L]], by = ordered_id]
    , dt[, value[[1L]], by = unordered_id]
    , check = FALSE
    , time_unit = "ms" 
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                              min median `itr/sec` mem_alloc
#>   <bch:expr>                            <dbl>  <dbl>     <dbl> <bch:byt>
#> 1 dt[, value[[1L]], by = ordered_id]   93.5   124.        8.25    53.7MB
#> 2 dt[, value[[1L]], by = unordered_id]  0.634   1.69    629.     409.2KB
#> # ... with 1 more variable: `gc/sec` <dbl>

I will work on a PR for !any_type_of_col_is_list to do subsetVectorRaw this weekend unless I hear there's a better route to fix the root cause.
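
At the R level, the list-column check described here would amount to something like the sketch below (illustrative only; the real change would live in the C code of dogroups.c, and the helper name is hypothetical):

any_typeof_cols_is_list <- function(DT, cols) {
  any(vapply(cols, function(nm) is.list(DT[[nm]]), logical(1L)))
}
any_typeof_cols_is_list(dt, "value")   ## TRUE -> take the discontiguous/subsetVectorRaw path instead of memcpy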
