I have encountered a memory usage issue after using fwrite. Boiled down, what I observe is that after calling fwrite on a base data.frame containing a factor column, memory usage grows with each subsequent allocation. The growth does not occur after calling fwrite on a base data.frame without a factor column, nor after calling fwrite on a data.table that contains a factor column.
An example with a base data.frame containing a factor column:
n <- 100000
df <- data.frame(
x1 = runif(n),
x2 = runif(n),
x3 = factor(sample(state.abb, n, replace = TRUE)),
stringsAsFactors = FALSE
)
# Before fwrite
invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 36.6 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
tmp <- tempfile(fileext = ".csv")
data.table::fwrite(df, tmp)
invisible(file.remove(tmp))
# After fwrite
invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 124 MB
#> 204 MB
#> 284 MB
#> 364 MB
#> 444 MB
#> 524 MB
#> 604 MB
#> 684 MB
#> 764 MB
#> 844 MB
If the data.frame does not contain a factor, this issue doesn't occur.
n <- 100000
df <- data.frame(
x1 = runif(n),
x2 = runif(n),
x3 = sample(state.abb, n, replace = TRUE),
stringsAsFactors = FALSE
)
# Before fwrite
invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 37 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
tmp <- tempfile(fileext = ".csv")
data.table::fwrite(df, tmp)
invisible(file.remove(tmp))
# After fwrite
invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
Some additional observations:
- If the data.frame containing a factor is coerced to a data.table before invoking fwrite (i.e., data.table::fwrite(data.table::as.data.table(df), tmp)), this memory usage growth did not occur.
- If the data.frame containing a factor contained fewer than 3 columns, I did not observe the memory usage growth.
- With the chunk option error = TRUE set, I did not observe the memory usage growth in the example of a base data.frame containing a factor. When the chunk option error = FALSE was set, though, I did see the memory usage growth.

I got the same behavior on both Ubuntu 18.04 and Windows 10 using v1.12.8 of data.table.
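To rerun the comparison across these variants without copy-pasting, the measurement can be wrapped in a small function. This is a minimal sketch and not part of the original report: mem_used_mb() and mem_after_alloc() are hypothetical helper names, and they rely on base R's gc() rather than pryr::mem_used(), so absolute numbers will differ somewhat from those shown above. The fwrite step only runs if data.table is installed.

```r
# Hypothetical helpers (not from the original report).
# Memory currently used by R objects, in MB, as reported by gc().
mem_used_mb <- function() {
  g <- gc()
  sum(g[, 2])  # column 2 holds the "used" sizes in Mb (Ncells + Vcells)
}

# Allocate and discard a large vector `reps` times, recording memory each time.
mem_after_alloc <- function(reps = 10, size = 1e7) {
  vapply(seq_len(reps), function(i) {
    mean(runif(size))
    mem_used_mb()
  }, numeric(1))
}

n <- 100000
df <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  x3 = factor(sample(state.abb, n, replace = TRUE))
)

before <- mem_after_alloc()

if (requireNamespace("data.table", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".csv")
  # Swap in data.table::as.data.table(df) here to test the coercion variant.
  data.table::fwrite(df, tmp)
  invisible(file.remove(tmp))
}

after <- mem_after_alloc()

# Steadily increasing values in `after` would indicate the growth described above.
print(round(rbind(before, after), 1))
```

Changing the x3 column to a character vector, or dropping one of the numeric columns, gives the other variants listed above.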
Thanks for the report. I'm not seeing the issue on current master on macOS:
> n <- 100000
> df <- data.frame(
+ x1 = runif(n),
+ x2 = runif(n),
+ x3 = factor(sample(state.abb, n, replace = TRUE)),
+ stringsAsFactors = FALSE
+ )
>
> # Before fwrite
>
> invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
Registered S3 method overwritten by 'pryr':
method from
print.bytes Rcpp
29.1 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
>
> tmp <- tempfile(fileext = ".csv")
> data.table::fwrite(df, tmp)
> invisible(file.remove(tmp))
>
> # After fwrite
>
> invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB
Also not reproduced on a Linux docker image with the CRAN version:
sysname release
"Linux" "4.14.177-139.253.amzn2.x86_64"
version nodename
"#1 SMP Wed Apr 29 09:56:20 UTC 2020" "jupyter-0"
machine login
"x86_64" "unknown"
user effective_user
"root" "root"
Nor on rocker/r-base
Sys.info()
sysname release
"Linux" "4.19.76-linuxkit"
version nodename
"#1 SMP Thu Oct 17 19:31:58 UTC 2019" "633d68db3e33"
machine login
"x86_64" "unknown"
user effective_user
"root" "root"
What would be useful is your R version, compiler and flags if used.
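One way to collect that information from within R is sketched below. It is only a suggestion: r_config() is a hypothetical helper name, the R CMD config calls may need Rtools on Windows, and the data.table lines are skipped if the package is not installed.

```r
# R version, platform, and core count (base R only)
cat(R.version.string, "\n")
info <- Sys.info()[c("sysname", "release", "machine")]
print(info)
cat("Logical cores:", parallel::detectCores(), "\n")

# Hypothetical helper: ask R which compiler and flags it uses for packages.
# Returns NA if the call cannot be run at all.
r_config <- function(var) {
  r_bin <- file.path(R.home("bin"), "R")
  tryCatch(
    suppressWarnings(
      system2(r_bin, c("CMD", "config", var), stdout = TRUE, stderr = TRUE)
    ),
    error = function(e) NA_character_
  )
}
cat("CC:    ", r_config("CC"), "\n")
cat("CFLAGS:", r_config("CFLAGS"), "\n")

if (requireNamespace("data.table", quietly = TRUE)) {
  cat("data.table version:", as.character(packageVersion("data.table")), "\n")
  # verbose = TRUE prints the OpenMP/thread settings data.table detected
  data.table::getDTthreads(verbose = TRUE)
}
```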
I did some further testing and observed the following.
On a Windows machine with 4 cores, the issue occurred with R version 3.6.3. It did not occur with 3.5.3 or 4.0.1.
The same occurred on Ubuntu 18.04 machines with 4 or more cores, with the issue occurring in R 3.6.3, but not in 3.5.3 or 4.0.1.
On an Ubuntu 18.04 machine with 2 cores, the issue did not occur in 3.6.3 (nor in 3.5.3 nor 4.0.1).
On the Ubuntu 18.04 and Windows machines with 4+ cores, the issue did not occur in R version 3.6.3 if fwrite was called with the argument nThread = 1.
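For reference, the single-threaded call can be made in two ways, sketched below on a small data.frame. This reflects only the workaround observed above on R 3.6.3, not a fix for the underlying cause. The example is guarded so it does nothing if data.table is not installed.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  df <- data.frame(
    x1 = runif(1000),
    x2 = runif(1000),
    x3 = factor(sample(state.abb, 1000, replace = TRUE))
  )
  tmp <- tempfile(fileext = ".csv")

  # Option 1: single-threaded for this call only
  data.table::fwrite(df, tmp, nThread = 1)

  # Option 2: single-threaded for all subsequent data.table calls;
  # setDTthreads() returns the previous setting so it can be restored
  old <- data.table::setDTthreads(1)
  data.table::fwrite(df, tmp)
  data.table::setDTthreads(old)

  invisible(file.remove(tmp))
}
```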
In testing on Windows I installed R from CRAN binaries and installed the CRAN Windows binaries of data.table v1.12.8.
In testing on Ubuntu 18.04, starting from scratch I installed R versions with apt-get install r-base-core from the CRAN repository, then installed data.table with:
gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
> install.packages("data.table")
Installing package into ‘/home/ubuntu/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/data.table_1.12.8.tar.gz'
Content type 'application/x-gzip' length 4948391 bytes (4.7 MB)
==================================================
downloaded 4.7 MB
* installing *source* package ‘data.table’ ...
** package ‘data.table’ successfully unpacked and MD5 sums checked
** using staged installation
zlib 1.2.11 is available ok
** libs
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG -fopenmp -fpic -g -O2 -fdebug-prefix-map=/build/r-base-V28x5H/r-base-3.6.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c assign.c -o assign.o