data.table: Large memory consumption compared to data.frame (memory leak?)

Created on 7 Dec 2018 · 14 comments · Source: Rdatatable/data.table

Please have a look at the following MRE.

library(data.table)  # 1.11.9
data.table::setDTthreads(1L)

dt <- data.table(EventId = 1L, PerformerId = 1L)
list1 <- vector(mode = "list", length = 300)
list2 <- vector(mode = "list", length = length(list1) * 500)
list2Idx <- 1L

print(pryr::mem_used())  #  89.2 MB
while(length(list1) > 0){
  for(e in 1:500){

    # list2[[list2Idx]] <- as.data.frame(copy(dt))  # good
    list2[[list2Idx]] <- copy(dt)  # bad

    list2Idx <- list2Idx + 1L
  }

  # Remove the last element of list1
  list1[[length(list1)]] <- NULL
}

print(object.size(list2), units = "MB")  # 125.9 Mb
print(pryr::mem_used())  # 174 MB for good case, 2.68 GB for bad

When I wrap the copy(dt) with as.data.frame(), memory consumption is normal, but without it, memory consumption grows way more than I think it should. Is this a bug or is this me misunderstanding how memory is allocated behind the scenes?

A few notes:

  1. This bug occurs in version 1.11.8 and the current devel version
  2. I used data.table::setDTthreads(1L) to eliminate the possibility of this being a multithreading issue
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.9

loaded via a namespace (and not attached):
[1] compiler_3.5.1   pryr_0.1.4       magrittr_1.5     tools_3.5.1      yaml_2.2.0       Rcpp_1.0.0      
[7] stringi_1.2.4    codetools_0.2-15 stringr_1.3.1   

All 14 comments

Playing around with this a bit more, it seems that the nested loops exacerbate the problem but don't explain it entirely. Here are some simpler examples showcasing this:

library(data.table)  # 1.11.9
data.table::setDTthreads(1L)

### Experiment 1 (nested for loops)
dt <- data.table(EventId = 1L, PerformerId = 1L)
mylist <- vector(mode = "list", length = 20000)
print(pryr::mem_used())  #  40.3 MB
for(i in 1:500){
  for(j in 1:400){
    # mylist[[i*j]] <- as.data.frame(copy(dt))  # good
    mylist[[i*j]] <- copy(dt)  # bad
  }
}
print(object.size(mylist), units = "MB")  # 51.4 Mb
print(pryr::mem_used())  # 78.3 MB for good case, 1.33 GB for bad

### Experiment 2 (no nested for loops)
dt <- data.table(EventId = 1L, PerformerId = 1L)
mylist <- vector(mode = "list", length = 20000)
print(pryr::mem_used())  #  40.3 MB
for(i in 1:20000){
    # mylist[[i]] <- as.data.frame(copy(dt))  # good
    mylist[[i]] <- copy(dt)  # bad
}
print(object.size(mylist), units = "MB")  # 26.2 Mb
print(pryr::mem_used())  # 57.6 MB for good case, 387 MB for bad

AFAIR as.data.frame will already do a copy, so there is no need for as.data.frame(copy(.)).

It should, otherwise setDF and as.data.frame would be redundant.
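
A quick sketch to convince yourself (not from the original report, just illustrating the point): modify the original table by reference after converting, and check that the data.frame is unaffected.

library(data.table)

dt <- data.table(EventId = 1L, PerformerId = 1L)
df <- as.data.frame(dt)              # no explicit copy()
set(dt, j = "EventId", value = 99L)  # change dt in place, by reference
df$EventId                           # still 1L, so as.data.frame() did copy

If the two objects shared the same column vector, the by-reference set() would show up in both.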

Thanks guys. I should note that in my real code, I'm not even using the copy() function. It's just something I've used here to make an MRE.

I'm curious if you are able to reproduce the memory usage behavior I'm seeing?

Didn't you lose a zero? 500*400 = 2e5, while the non-nested loop is 2e4.

@jangorecki Oops, you're right! Although, if I change mylist <- vector(mode = "list", length = 20000) to mylist <- vector(mode = "list", length = 200000), the memory consumption difference is even more obvious.

Not sure that this is a bug. A 1-row data.table simply uses more memory than a 1-row data frame.

> pryr::object_size(dt)
1.18 kB
> pryr::object_size(setDF(dt))
872 B

I suspect everything follows from this.
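
One concrete place the extra per-object memory could hide (a hypothesis to probe, not a confirmed diagnosis): every data.table over-allocates its internal vector of column pointers (getOption("datatable.alloccol"), 1024 slots by default) so that columns can later be added by reference, and copy() appears to preserve that over-allocation. object.size() does not seem to count it, but the allocator pays for it on every copy.

library(data.table)

dt <- data.table(EventId = 1L, PerformerId = 1L)
getOption("datatable.alloccol")    # 1024 by default
truelength(copy(dt))               # over-allocated column slots on the copy (>> ncol)
truelength(as.data.frame(dt))      # typically 0; data.frames are not over-allocated

Back-of-envelope: 1024 pointers x 8 bytes is roughly 8 KB per table, and 200,000 copies x 8 KB is roughly 1.6 GB, which is the right order of magnitude for the "bad" numbers above. If this is the cause, lowering options(datatable.alloccol = ...) before the loop should shrink mem_used() correspondingly.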

@HughParsonage my resulting list of data.tables is 79.8 Mb, roughly 55% bigger than my resulting list of data.frames (51.4 Mb). This is fine.

My issue is not about the size of the objects, rather the memory used. When I use data.tables, pryr::mem_used() reports 1.21 GB of memory used compared to just 86.2 MB when building a list of data.frames. (In my real setting, the memory usage exceeds my laptop's limit and crashes my program.)

Here is my updated experimental code and notes below. If you think this is a non-issue, feel free to close. My workaround of using data.frames seems fine for now.

library(data.table)  # 1.11.9
data.table::setDTthreads(1L)

### Experiment 1 (nested for loops)
dt <- data.table(EventId = 1L, PerformerId = 1L)
mylist <- vector(mode = "list", length = 500 * 400)
print(pryr::mem_used())  #  42.6 MB
for(i in 1:500){
  for(j in 1:400){
    mylist[[i*j]] <- as.data.frame(dt)  # good
    # mylist[[i*j]] <- copy(dt)               # bad
  }
}
print(object.size(mylist), units = "MB")  # 79.8 Mb for list of data.tables, 51.4 Mb for data.frames
print(pryr::mem_used())                          # 1.21 GB for list of data.tables, 78.7 MB for data.frames

To close it we have to investigate it first.
as.data.frame(copy(dt)) is not a good way; why won't you just use as.data.frame(dt)? Did that bite you? If it did, then it would be a separate issue to report.

If it matters, I get the same results with data.table 1.11.8:

Restarting R session...

> memory.limit()
[1] 32696
> memory.limit(TRUE)
[1] 46
> library(data.table)
data.table 1.11.8  Latest news: r-datatable.com
> data.table::setDTthreads(1L)
> dt <- data.table(EventId = 1L, PerformerId = 1L)
> mylist <- vector(mode = "list", length = 500 * 400)
> print(pryr::mem_used())
44.4 MB
> for(i in 1:500){
+     for(j in 1:400){
+         mylist[[i*j]] <- as.data.frame(copy(dt))  # good
+         # mylist[[i*j]] <- copy(dt)               # bad
+     }
+ }
> print(object.size(mylist), units = "MB")
51.4 Mb
> print(pryr::mem_used())
92.3 MB
> for(i in 1:500){
+     for(j in 1:400){
+         # mylist[[i*j]] <- as.data.frame(copy(dt))  # good
+         mylist[[i*j]] <- copy(dt)               # bad
+     }
+ }
> print(object.size(mylist), units = "MB")
79.8 Mb
> print(pryr::mem_used())
1.28 GB

@jangorecki's suggestion falls in between:

> for(i in 1:500){
+     for(j in 1:400){
+         mylist[[i*j]] <- as.data.frame(dt)  # jangorecki
+         # mylist[[i*j]] <- copy(dt)               
+     }
+ }
> print(object.size(mylist), units = "MB")
51.4 Mb
> print(pryr::mem_used())
297 MB

@jangorecki While I agree with you that using as.data.frame(dt) is better than as.data.frame(copy(dt)), I think it's somewhat irrelevant to my issue. Nonetheless, I've updated my recent comment above to use as.data.frame(dt). Cheers and thanks for the suggestion.

This is what I am getting on recent-ish devel:

library(data.table)  # 1.11.9
setDTthreads(1L)

dt <- data.table(EventId = 1L, PerformerId = 1L)
mylist <- vector(mode = "list", length = 500 * 400)
print(pryr::mem_used())
#29.9 MB
for(i in 1:500){
  for(j in 1:400){
    mylist[[i*j]] <- as.data.frame(dt)
  }
}
print(object.size(mylist), units = "MB")
#51.4 Mb
print(pryr::mem_used())
#81 MB

Using 4 threads does not make any difference.

@ben519 Could you try adding gc() before checking mem_used()? Maybe on your system gc is happening a little bit later than on mine.
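
I.e. something like this, just to make the suggestion concrete:

invisible(gc())          # force a full garbage collection first
print(pryr::mem_used())  # then read the memory figure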

@jangorecki it looks like you've run the good case. Can you try that again using mylist[[i*j]] <- copy(dt)? In other words, build a list of data.tables instead of a list of data.frames.

@ben519 Yes, confirming: using "bad" I got 1.14 GB, and 1.07 GB after gc(). Using data.frame instead of data.table results in minimal memory usage at the end. Thanks for reporting. Needs to be investigated.

library(data.table)  # 1.11.9
setDTthreads(1L)

dt <- data.table(EventId = 1L, PerformerId = 1L)
mylist <- vector(mode = "list", length = 500 * 400)
print(pryr::mem_used())
#29.9 MB
for(i in 1:500){
  for(j in 1:400){
    mylist[[i*j]] <- copy(dt)
  }
}
print(object.size(mylist), units = "MB")
#79.8 Mb
print(pryr::mem_used())
#1.14 GB