The following code crashes my R session with dplyr 0.8.0. It works fine with 0.7.8.
library(dplyr)
n <- 400000
a <- data_frame(
  x = 1:n,
  a = sample(1:10, size = n, replace = TRUE),
  b = sample(1:10, size = n, replace = TRUE),
  c = sample(1:10, size = n, replace = TRUE),
  d = sample(1:10, size = n, replace = TRUE),
  e = sample(1:10, size = n, replace = TRUE),
  f = sample(1:10, size = n, replace = TRUE),
  g = sample(1:10, size = n, replace = TRUE),
  h = sample(1:10, size = n, replace = TRUE),
  i = sample(1:10, size = n, replace = TRUE),
  y = runif(n)
)
g_1 <- a %>% group_by(x)
g_2 <- a %>% group_by_at(vars(-y))
g_1 %>% summarise(y = sum(y))
g_2 %>% summarise(y = sum(y))
The last line hangs the R session, which eventually crashes. My dplyr version is 0.8.0.9000 (2019-01-03, installed from GitHub, tidyverse/dplyr@f0993bb).
Thanks, I can reproduce this. It looks related to how the grouping structure is built after the summarise().
While 0a1b62d resolved the problem in the code provided by @dfalbel, I think there is more to fix here.
Using the code below, I am trying to aggregate a ~400 KB dataset (1e4 rows), and the process consumes up to 1.2 GB of memory, which is around 3000 times the input size. If I scale the input up to 4 MB (1e5 rows), the R process is killed by the OS. When using master, I was getting a C stack overflow error.
N=1e4L
K=100L
set.seed(108)
DF = data.frame(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE), # large groups (int)
  id5 = sample(K, N, TRUE), # large groups (int)
  id6 = sample(N/K, N, TRUE), # small groups (int)
  v1 = sample(5, N, TRUE), # int in range [1,5]
  v2 = sample(5, N, TRUE), # int in range [1,5]
  v3 = sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
print(object.size(DF), units="MB")
MB_used = function() as.numeric(system(sprintf("ps -o rss %s | tail -1", Sys.getpid()), intern=TRUE))/1024
suppressPackageStartupMessages(library(dplyr))
MB_used()
ans <- DF %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3), count=n())
MB_used()
With N = 1e4 this generates more than a million groups, and far fewer with .drop = TRUE.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(lobstr)
N=1e4L
K=100L
set.seed(108)
DF = data.frame(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE), # large groups (int)
  id5 = sample(K, N, TRUE), # large groups (int)
  id6 = sample(N/K, N, TRUE), # small groups (int)
  v1 = sample(5, N, TRUE), # int in range [1,5]
  v2 = sample(5, N, TRUE), # int in range [1,5]
  v3 = sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
mem_used()
#> 42,526,976 B
grouped <- DF %>%
  group_by(id1, id2, id3, id4, id5, id6)
n_groups(grouped)
#> [1] 1000053
mem_used()
#> 131,497,112 B
grouped <- DF %>%
  group_by(id1, id2, id3, id4, id5, id6, .drop = TRUE)
n_groups(grouped)
#> [1] 10000
mem_used()
#> 44,374,392 B
A million integer vectors will indeed use a lot of memory. I'm not sure what I can do here.
@romainfrancois I am running your code on the latest master and .drop = TRUE does not seem to have any effect. Any idea what is wrong? Maybe it requires a dev version of some dependency?
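To make the group count above easier to reason about, here is a minimal base R sketch. It rests on an assumption not stated in the thread: that the character id columns are factors, which was the data.frame() default before R 4.0.0 (stringsAsFactors = TRUE). Under that assumption, keeping empty level combinations makes the number of groups scale with the product of the factor level counts rather than with the number of rows. The names lv, potential_groups, observed_groups, per_vector and approx_mb are illustrative only, and the memory figure is a rough lower bound, not a measurement of dplyr itself.

```r
# Assumption: the id columns are factors (data.frame() default before R 4.0.0).
N <- 1e4L
K <- 100L
set.seed(108)
DF <- data.frame(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id3 = sample(sprintf("id%010d", 1:(N / K)), N, TRUE),
  stringsAsFactors = TRUE
)

# Number of levels in each factor column
lv <- vapply(DF, nlevels, integer(1))

# Upper bound on the group count if empty level combinations are kept
potential_groups <- prod(lv)

# Combinations that actually occur in the data (cannot exceed nrow(DF))
observed_groups <- nrow(unique(DF))

# Rough overhead of holding one (possibly empty) integer vector per group;
# the header size is platform dependent, so it is measured, not hard-coded
per_vector <- as.numeric(object.size(integer(0)))
approx_mb <- potential_groups * per_vector / 1024^2

print(lv)
print(c(potential = potential_groups, observed = observed_groups))
print(approx_mb)
```

Comparing potential_groups with observed_groups shows why dropping empty groups matters here: the observed count is bounded by the number of rows, while the potential count grows multiplicatively with every factor column added to the grouping.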
grouped <- DF %>% group_by(id1, id2, id3, id4, id5, id6)
n_groups(grouped)
#[1] 1000053
grouped <- DF %>% group_by(id1, id2, id3, id4, id5, id6, .drop = TRUE)
n_groups(grouped)
#[1] 1000053
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.5.0 (2018-04-23)
 os       Ubuntu precise (12.04.5 LTS)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2019-01-11

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
 backports     1.1.2   2017-12-13 [1] CRAN (R 3.5.0)
 base64enc     0.1-3   2015-07-28 [1] CRAN (R 3.5.0)
 callr         3.0.0   2018-08-24 [1] CRAN (R 3.5.0)
 cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.0)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)
 devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.0)
 digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.0)
 dplyr       * 0.8.0   2019-01-11 [1] Github (tidyverse/dplyr@a581466)
 fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.0)
 glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)
 lobstr      * 1.0.1   2018-12-21 [1] CRAN (R 3.5.0)
 magrittr      1.5     2014-11-22 [2] CRAN (R 3.1.3)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)
 pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.0)
 pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)
 pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.0)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)
 prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)
 processx      3.2.0   2018-08-16 [1] CRAN (R 3.5.0)
 ps            1.2.1   2018-11-06 [1] CRAN (R 3.5.0)
 purrr         0.2.5   2018-05-29 [1] CRAN (R 3.5.0)
 R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.0)
 Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)
 remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)
 rlang         0.3.1   2019-01-08 [1] CRAN (R 3.5.0)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)
 testthat      2.0.1   2018-10-13 [1] CRAN (R 3.5.0)
 tibble        2.0.0   2019-01-04 [1] CRAN (R 3.5.0)
 tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.0)
 usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
Sorry about that, it needs this PR: https://github.com/tidyverse/dplyr/pull/4091 which I'll probably merge today.