Please have a look at the following MRE.
library(data.table) # 1.11.9
data.table::setDTthreads(1L)
dt <- data.table(EventId = 1L, PerformerId = 1L)
list1 <- vector(mode = "list", length = 300)
list2 <- vector(mode = "list", length = length(list1) * 500)
list2Idx <- 1L
print(pryr::mem_used()) # 89.2 MB
while(length(list1) > 0){
for(e in 1:500){
# list2[[list2Idx]] <- as.data.frame(copy(dt)) # good
list2[[list2Idx]] <- copy(dt) # bad
list2Idx <- list2Idx + 1L
}
# Remove the last element of list1
list1[[length(list1)]] <- NULL
}
print(object.size(list2), units = "MB") # 125.9 Mb
print(pryr::mem_used()) # 174 MB for good case, 2.68 GB for bad
When I wrap copy(dt) in as.data.frame(), memory consumption is normal, but without it, memory consumption grows far more than I think it should. Is this a bug, or am I misunderstanding how memory is allocated behind the scenes?
A few notes:
data.table::setDTthreads(1L) is there to eliminate the possibility of this being a multithreading issue.
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.11.9
loaded via a namespace (and not attached):
[1] compiler_3.5.1 pryr_0.1.4 magrittr_1.5 tools_3.5.1 yaml_2.2.0 Rcpp_1.0.0
[7] stringi_1.2.4 codetools_0.2-15 stringr_1.3.1
Playing around with this a bit more, it seems that the nested loops exacerbate the problem but don't explain it entirely. Here are some simpler examples showcasing this:
library(data.table) # 1.11.9
data.table::setDTthreads(1L)
### Experiment 1 (nested for loops)
dt <- data.table(EventId = 1L, PerformerId = 1L)
mylist <- vector(mode = "list", length = 20000)
print(pryr::mem_used()) # 40.3 MB
for(i in 1:500){
for(j in 1:400){
# mylist[[i*j]] <- as.data.frame(copy(dt)) # good
mylist[[i*j]] <- copy(dt) # bad
}
}
print(object.size(mylist), units = "MB") # 51.4 Mb
print(pryr::mem_used()) # 78.3 MB for good case, 1.33 GB for bad
### Experiment 2 (no nested for loops)
dt <- data.table(EventId = 1L, PerformerId = 1L)
mylist <- vector(mode = "list", length = 20000)
print(pryr::mem_used()) # 40.3 MB
for(i in 1:20000){
# mylist[[i]] <- as.data.frame(copy(dt)) # good
mylist[[i]] <- copy(dt) # bad
}
print(object.size(mylist), units = "MB") # 26.2 Mb
print(pryr::mem_used()) # 57.6 MB for good case, 387 MB for bad
AFAIR as.data.frame will already do a copy, so there is no need for as.data.frame(copy(.)).
It should; otherwise setDF and as.data.frame would be redundant.
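For anyone who wants to verify that claim, here is a minimal check (a sketch, not code from the thread; address() is data.table's helper that reports where an object lives in memory):

library(data.table)
dt <- data.table(x = 1L)
df <- as.data.frame(dt)         # no explicit copy()
address(dt) == address(df)      # FALSE: as.data.frame() returned a new object
address(dt$x) == address(df$x)  # FALSE here too: the column itself was copied

If both comparisons return FALSE on your version, wrapping the call in copy() only buys an extra, immediately discarded copy.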
Thanks guys. I should note that in my real code I'm not even using the copy() function; it's just something I've used here to build an MRE.
I'm curious whether you're able to reproduce the memory usage behavior I'm seeing.
Didn't you lose a zero? 500*400 = 2e5, while the non-nested loop only fills 2e4.
@jangorecki Oops, you're right! That said, if I change mylist <- vector(mode = "list", length = 20000) to mylist <- vector(mode = "list", length = 200000), the memory consumption difference is even more obvious.
Not sure that this is a bug. A 1-row data.table simply uses more memory than a 1-row data frame.
> pryr::object_size(dt)
1.18 kB
> pryr::object_size(setDF(dt))
872 B
I suspect everything follows from this.
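One concrete thing to inspect along those lines, a sketch rather than anything established in this thread: truelength() reports the spare column-pointer slots data.table pre-allocates per table, which object.size() does not account for:

library(data.table)
dt <- data.table(EventId = 1L, PerformerId = 1L)
ncol(dt)        # 2 columns actually in use
truelength(dt)  # spare column slots, governed by options("datatable.alloccol")

If copy() gives every stored table its own over-allocation, each list element would carry roughly truelength * 8 bytes of pointer space that object.size() ignores but pryr::mem_used() sees; at around the default 1024 slots that is roughly 8 KB per table, i.e. about 1.6 GB for 200,000 tables, the right order of magnitude for the numbers reported here. That is only a guess to verify, not a confirmed cause.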
@HughParsonage my resulting list of data.tables is 79.8 Mb, roughly 55% bigger than my resulting list of data.frames (51.4 Mb). This is fine.
My issue is not about the size of the objects, but rather the memory used. When I use data.tables, pryr::mem_used() reports 1.21 GB of memory used, compared to just 86.2 MB when building a list of data.frames. (In my real setting, the memory usage exceeds my laptop's limit and crashes my program.)
Here is my updated experimental code and notes below. If you think this is a non-issue, feel free to close. My workaround of using data.frames seems fine for now.
library(data.table) # 1.11.9
data.table::setDTthreads(1L)
### Experiment 1 (nested for loops)
dt <- data.table(EventId = 1L, PerformerId = 1L)
mylist <- vector(mode = "list", length = 500 * 400)
print(pryr::mem_used()) # 42.6 MB
for(i in 1:500){
for(j in 1:400){
mylist[[i*j]] <- as.data.frame(dt) # good
# mylist[[i*j]] <- copy(dt) # bad
}
}
print(object.size(mylist), units = "MB") # 79.8 Mb for list of data.tables, 51.4 Mb for data.frames
print(pryr::mem_used()) # 1.21 GB for list of data.tables, 78.7 MB for data.frames
To close it, we have to investigate it first.
as.data.frame(copy(dt)) is not a good way to do it; why won't you just use as.data.frame(dt)? Did that bite you? If it did, it would be a separate issue to report.
If it matters, I get the same results with data.table 1.11.8:
Restarting R session...
> memory.limit()
[1] 32696
> memory.limit(TRUE)
[1] 46
> library(data.table)
data.table 1.11.8 Latest news: r-datatable.com
> data.table::setDTthreads(1L)
> dt <- data.table(EventId = 1L, PerformerId = 1L)
> mylist <- vector(mode = "list", length = 500 * 400)
> print(pryr::mem_used())
44.4 MB
> for(i in 1:500){
+ for(j in 1:400){
+ mylist[[i*j]] <- as.data.frame(copy(dt)) # good
+ # mylist[[i*j]] <- copy(dt) # bad
+ }
+ }
> print(object.size(mylist), units = "MB")
51.4 Mb
> print(pryr::mem_used())
92.3 MB
> for(i in 1:500){
+ for(j in 1:400){
+ # mylist[[i*j]] <- as.data.frame(copy(dt)) # good
+ mylist[[i*j]] <- copy(dt) # bad
+ }
+ }
> print(object.size(mylist), units = "MB")
79.8 Mb
> print(pryr::mem_used())
1.28 GB
@jangorecki's suggestion falls in between:
> for(i in 1:500){
+ for(j in 1:400){
+ mylist[[i*j]] <- as.data.frame(dt) # jangorecki
+ # mylist[[i*j]] <- copy(dt)
+ }
+ }
> print(object.size(mylist), units = "MB")
51.4 Mb
> print(pryr::mem_used())
297 MB
@jangorecki While I agree with you that using as.data.frame(dt) is better than as.data.frame(copy(dt)), I think it's somewhat irrelevant to my issue. Nonetheless, I've updated my recent comment above to use as.data.frame(dt). Cheers, and thanks for the suggestion.
This is what I am getting on recent-ish devel:
library(data.table) # 1.11.9
setDTthreads(1L)
dt <- data.table(EventId = 1L, PerformerId = 1L)
mylist <- vector(mode = "list", length = 500 * 400)
print(pryr::mem_used())
#29.9 MB
for(i in 1:500){
for(j in 1:400){
mylist[[i*j]] <- as.data.frame(dt)
}
}
print(object.size(mylist), units = "MB")
#51.4 Mb
print(pryr::mem_used())
#81 MB
Using 4 threads does not make any difference.
@ben519 Could you try adding gc() before checking mem_used()? Maybe on your system gc is happening a little later than on mine.
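That is, something along these lines (a sketch of the suggested measurement, not code from the thread):

invisible(gc())          # force a garbage collection first
print(pryr::mem_used())  # then read the heap usage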
@jangorecki It looks like you've run the good case. Can you try that again using mylist[[i*j]] <- copy(dt)? In other words, build a list of data.tables instead of a list of data.frames.
@ben519 Yes, confirming: using the "bad" case I got 1.14 GB, and 1.07 GB after gc(). Using data.frame instead of data.table results in minimal memory usage at the end. Thanks for reporting. This needs to be investigated.
library(data.table) # 1.11.9
setDTthreads(1L)
dt <- data.table(EventId = 1L, PerformerId = 1L)
mylist <- vector(mode = "list", length = 500 * 400)
print(pryr::mem_used())
#29.9 MB
for(i in 1:500){
for(j in 1:400){
mylist[[i*j]] <- copy(dt)
}
}
print(object.size(mylist), units = "MB")
#79.8 Mb
print(pryr::mem_used())
#1.14 GB