Dplyr: R crashes when many variables in group_by in dplyr 0.8.0

Created on 8 Jan 2019  ·  6 Comments  ·  Source: tidyverse/dplyr

The following code crashes my R session in dplyr 0.8.0. It works fine with 0.7.8.

library(dplyr)

n <- 400000

a <- data_frame(
  x = 1:n,
  a = sample(1:10, size = n, replace = TRUE),
  b = sample(1:10, size = n, replace = TRUE),
  c = sample(1:10, size = n, replace = TRUE),
  d = sample(1:10, size = n, replace = TRUE),
  e = sample(1:10, size = n, replace = TRUE),
  f = sample(1:10, size = n, replace = TRUE),
  g = sample(1:10, size = n, replace = TRUE),
  h = sample(1:10, size = n, replace = TRUE),
  i = sample(1:10, size = n, replace = TRUE),
  y = runif(n)
)

g_1 <- a %>% group_by(x)
g_2 <- a %>% group_by_at(vars(-y))

g_1 %>% summarise(y = sum(y))
g_2 %>% summarise(y = sum(y))

The last line hangs the R session, which eventually crashes. My dplyr version is 0.8.0.9000 (2019-01-03), installed from GitHub (tidyverse/dplyr@f0993bb).

bug

All 6 comments

Thanks. I can reproduce this. It looks related to how the grouping structure is built after summarise().
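
For context, summarise() peels off the last grouping variable, so its result is still grouped by the remaining ones, and a grouping structure has to be built for that result. The sketch below only illustrates that behaviour on a toy data frame; the names toy and res are made up for the example, and it does not claim to show where the crash itself occurs.

```r
library(dplyr)

# Toy data: two grouping columns and one value column (illustrative only).
toy <- data.frame(
  a = c(1L, 1L, 2L, 2L),
  b = c(1L, 2L, 1L, 1L),
  y = c(0.1, 0.2, 0.3, 0.4)
)

# Grouped by a and b before summarising.
res <- toy %>%
  group_by(a, b) %>%
  summarise(y = sum(y))

# After summarise(), the last grouping variable (b) has been dropped,
# so the result is still a grouped data frame, now grouped by a only.
group_vars(res)
```

In the original report, g_2 is grouped by ten columns, so the summarised result is still grouped by nine of them.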

While 0a1b62d resolved the problem in the code provided by @dfalbel, I think there is more to fix here.
Using the code below, I am trying to aggregate a ~400 KB dataset (1e4 rows), and the process consumes up to 1.2 GB of memory, around 3000 times the input size. If I scale the input up to 4 MB (1e5 rows), the R process is killed by the OS. When using master, I was getting a C stack overflow error.

N=1e4L
K=100L
set.seed(108)
DF = data.frame(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE),                          # large groups (int)
  id5 = sample(K, N, TRUE),                          # large groups (int)
  id6 = sample(N/K, N, TRUE),                        # small groups (int)
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
print(object.size(DF), units="MB")

MB_used = function() as.numeric(system(sprintf("ps -o rss %s | tail -1", Sys.getpid()), intern=TRUE))/1024
suppressPackageStartupMessages(library(dplyr))
MB_used()
ans <- DF %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3), count=n())
MB_used()

With N = 1e4 this generates more than a million groups, and far fewer with .drop = TRUE.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lobstr)

N=1e4L
K=100L
set.seed(108)
DF = data.frame(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE),                          # large groups (int)
  id5 = sample(K, N, TRUE),                          # large groups (int)
  id6 = sample(N/K, N, TRUE),                        # small groups (int)
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
mem_used()
#> 42,526,976 B

grouped <- DF %>%
  group_by(id1, id2, id3, id4, id5, id6)
n_groups(grouped)
#> [1] 1000053
mem_used()
#> 131,497,112 B

grouped <- DF %>%
  group_by(id1, id2, id3, id4, id5, id6, .drop = TRUE)
n_groups(grouped)
#> [1] 10000
mem_used()
#> 44,374,392 B

A million integer vectors will indeed take up a lot of memory. I'm not sure what I can do here.
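
The group count above is consistent with factor expansion: assuming data.frame() converted the three character columns to factors (the default in R 3.5), each has 100 levels, and 100^3 is already a million combinations before the integer columns are considered. A minimal sketch of the same effect at small scale, with made-up column names f1 and f2:

```r
library(dplyr)

# Two factors with 3 levels each, but only 2 observed combinations.
df <- data.frame(
  f1 = factor(c("a", "b"), levels = c("a", "b", "c")),
  f2 = factor(c("x", "y"), levels = c("x", "y", "z"))
)

# .drop = FALSE keeps every combination of factor levels, including empty ones.
n_keep <- df %>% group_by(f1, f2, .drop = FALSE) %>% n_groups()

# .drop = TRUE keeps only the combinations that actually occur in the data.
n_drop <- df %>% group_by(f1, f2, .drop = TRUE) %>% n_groups()

c(keep = n_keep, drop = n_drop)
```

Each group, empty or not, carries its own integer vector of row indices, which is why the number of groups rather than the number of rows drives memory use here.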

@romainfrancois I am running your code on the latest master and .drop = TRUE does not seem to have any effect. Any idea what is wrong? Maybe it requires a dev version of some dependency?

grouped <- DF %>% group_by(id1, id2, id3, id4, id5, id6)
n_groups(grouped)
#[1] 1000053
grouped <- DF %>% group_by(id1, id2, id3, id4, id5, id6, .drop = TRUE)
n_groups(grouped)
#[1] 1000053
─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.0 (2018-04-23)
 os       Ubuntu precise (12.04.5 LTS)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2019-01-11                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source                          
 assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)                  
 backports     1.1.2   2017-12-13 [1] CRAN (R 3.5.0)                  
 base64enc     0.1-3   2015-07-28 [1] CRAN (R 3.5.0)                  
 callr         3.0.0   2018-08-24 [1] CRAN (R 3.5.0)                  
 cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.0)                  
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)                  
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)                  
 devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.0)                  
 digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.0)                  
 dplyr       * 0.8.0   2019-01-11 [1] Github (tidyverse/dplyr@a581466)
 fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.0)                  
 glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)                  
 lobstr      * 1.0.1   2018-12-21 [1] CRAN (R 3.5.0)                  
 magrittr      1.5     2014-11-22 [2] CRAN (R 3.1.3)                  
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)                  
 pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.0)                  
 pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)                  
 pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.0)                  
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)                  
 prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)                  
 processx      3.2.0   2018-08-16 [1] CRAN (R 3.5.0)                  
 ps            1.2.1   2018-11-06 [1] CRAN (R 3.5.0)                  
 purrr         0.2.5   2018-05-29 [1] CRAN (R 3.5.0)                  
 R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.0)                  
 Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)                  
 remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)                  
 rlang         0.3.1   2019-01-08 [1] CRAN (R 3.5.0)                  
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)                  
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)                  
 testthat      2.0.1   2018-10-13 [1] CRAN (R 3.5.0)                  
 tibble        2.0.0   2019-01-04 [1] CRAN (R 3.5.0)                  
 tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.0)                  
 usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)                  
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)                  

Sorry about that, it needs this PR: https://github.com/tidyverse/dplyr/pull/4091, which I'll probably merge today.
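
One way to check whether an installed build honours .drop = TRUE is to compare n_groups() with the number of distinct key combinations actually present in the data. This is only a sketch on a toy data frame; the names df, observed and honours_drop are illustrative and not part of dplyr.

```r
library(dplyr)

# Toy data: factor f has an unused level "c"; k is a plain integer key.
df <- data.frame(
  f = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  k = c(1L, 1L, 2L)
)

# Number of distinct (f, k) combinations that occur in the data (base R).
observed <- nrow(unique(df[c("f", "k")]))

# With .drop = TRUE the grouping should contain only observed combinations.
grouped <- df %>% group_by(f, k, .drop = TRUE)
honours_drop <- n_groups(grouped) == observed

honours_drop
```

If honours_drop is FALSE, the installed build is probably still expanding unused factor levels, as in the session above.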

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
