Data.table: Memory usage after using fwrite on a base data frame with a factor column

Created on 23 Jun 2020 · 3 Comments · Source: Rdatatable/data.table

I have encountered a memory usage issue after using fwrite. Boiled down, what I observe is that after calling fwrite on a base data.frame containing a factor column, memory usage keeps growing on subsequent operations (by about 80 MB per iteration in the example below). The growth does not occur after calling fwrite on a base data.frame without a factor column, nor after calling fwrite on a data.table that contains a factor column.

An example with a base data.frame containing a factor column:

n <- 100000
df <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  x3 = factor(sample(state.abb, n, replace = TRUE)),
  stringsAsFactors = FALSE
)

# Before fwrite

invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 36.6 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB

tmp <- tempfile(fileext = ".csv")
data.table::fwrite(df, tmp)
invisible(file.remove(tmp))

# After fwrite

invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 124 MB
#> 204 MB
#> 284 MB
#> 364 MB
#> 444 MB
#> 524 MB
#> 604 MB
#> 684 MB
#> 764 MB
#> 844 MB

If the data.frame does not contain a factor, this issue doesn't occur.

n <- 100000
df <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  x3 = sample(state.abb, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Before fwrite

invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 37 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB

tmp <- tempfile(fileext = ".csv")
data.table::fwrite(df, tmp)
invisible(file.remove(tmp))

# After fwrite

invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
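The before/after checks above rely on pryr::mem_used(). For anyone who wants to repeat the measurement without installing pryr, here is a minimal sketch using base R's gc(). The helper names mem_used_mb and track_memory are hypothetical, and the smaller vector size is only there to keep the run quick; absolute numbers will differ from the pryr output above.

```r
# Total memory used by R objects, in MB, taken from base R's gc() summary.
# Column 2 of the matrix returned by gc() is the "used" size in Mb
# for the Ncells and Vcells rows.
mem_used_mb <- function() sum(gc()[, 2])

# Hypothetical helper: repeat a throwaway allocation `reps` times and
# record memory usage after each round.
track_memory <- function(reps = 5, size = 1e6) {
  vapply(seq_len(reps), function(i) {
    mean(runif(size))  # temporary vector, eligible for collection afterwards
    mem_used_mb()
  }, numeric(1))
}

usage <- track_memory()

# In a healthy session the differences stay close to zero; the growth
# described in this report would show up as steadily positive differences.
print(diff(usage))
```

Calling track_memory() once before and once after fwrite mirrors the "Before fwrite" and "After fwrite" steps in the examples.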

Some additional observations:

  • If the data.frame containing a factor is coerced to a data.table before invoking fwrite (i.e., data.table::fwrite(data.table::as.data.table(df), tmp)), the memory usage growth did not occur.
  • When the data.frame containing a factor had fewer than 3 columns, I did not observe the memory usage growth.
  • Curiously, while building a minimal example I found that in an R Markdown document with the chunk option error = TRUE, the memory usage growth did not appear for the base data.frame containing a factor. With the chunk option error = FALSE, it did.
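The first observation suggests a possible workaround. The following is a minimal sketch of it, assuming data.table is installed. The smaller n and the read-back with fread are additions for illustration and are not part of the original report.

```r
library(data.table)

# Small stand-in for the data.frame from the report: two numeric columns
# and one factor column.
n <- 1000
df <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  x3 = factor(sample(state.abb, n, replace = TRUE))
)

tmp <- tempfile(fileext = ".csv")

# Coerce to a data.table first, then write. as.data.table() returns a new
# data.table; df itself remains a plain data.frame.
fwrite(as.data.table(df), tmp)

# Read the file back to confirm it has the expected shape.
back <- fread(tmp)
invisible(file.remove(tmp))
```

Whether this avoids the growth on a given machine would still need to be checked with the before/after memory measurements shown earlier.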

I got the same behavior on both Ubuntu 18.04 and Windows 10 using v1.12.8 of data.table.


All 3 comments

Thanks for the report. I'm not seeing the issue on current master on macOS:

> n <- 100000
> df <- data.frame(
+   x1 = runif(n),
+   x2 = runif(n),
+   x3 = factor(sample(state.abb, n, replace = TRUE)),
+   stringsAsFactors = FALSE
+ )
> 
> # Before fwrite
> 
> invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
Registered S3 method overwritten by 'pryr':
  method      from
  print.bytes Rcpp
29.1 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
33.4 MB
> 
> tmp <- tempfile(fileext = ".csv")
> data.table::fwrite(df, tmp)
> invisible(file.remove(tmp))
> 
> # After fwrite
> 
> invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB
35.1 MB

Also not reproduced on a Linux docker image with the CRAN version:

                              sysname                               release 
                              "Linux"       "4.14.177-139.253.amzn2.x86_64" 
                              version                              nodename 
"#1 SMP Wed Apr 29 09:56:20 UTC 2020"                           "jupyter-0" 
                              machine                                 login 
                             "x86_64"                             "unknown" 
                                 user                        effective_user 
                               "root"                                "root" 

Nor on rocker/r-base:

Sys.info()
                              sysname                               release 
                              "Linux"                    "4.19.76-linuxkit" 
                              version                              nodename 
"#1 SMP Thu Oct 17 19:31:58 UTC 2019"                        "633d68db3e33" 
                              machine                                 login 
                             "x86_64"                             "unknown" 
                                 user                        effective_user 
                               "root"                                "root" 

It would be useful to know your R version, your compiler, and any compiler flags used.
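One way to gather that information in a single step is sketched below. It assumes data.table is installed; the name env_report is hypothetical. The compiler lines query R CMD config, which may not be available on every setup (for example, Windows without the build tools), so they are wrapped in try().

```r
# Collect the environment details that matter for this report.
env_report <- list(
  r_version  = R.version.string,
  platform   = R.version$platform,
  os         = unname(Sys.info()[c("sysname", "release")]),
  cores      = parallel::detectCores(),
  dt_version = as.character(utils::packageVersion("data.table")),
  dt_threads = data.table::getDTthreads()
)
str(env_report)

# Compiler and flags that R uses when building packages from source.
# These calls may fail or return nothing on some systems, hence try().
r_bin  <- file.path(R.home("bin"), "R")
cc     <- try(system2(r_bin, c("CMD", "config", "CC"), stdout = TRUE), silent = TRUE)
cflags <- try(system2(r_bin, c("CMD", "config", "CFLAGS"), stdout = TRUE), silent = TRUE)
print(cc)
print(cflags)
```

For a binary install from CRAN, the compiler reported here is the one configured for the local R, not necessarily the one used to build the downloaded binary.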

I did some further testing and observed the following.

On a Windows machine with 4 cores, the issue occurred with R version 3.6.3. It did not occur with 3.5.3 or 4.0.1.

The same pattern held on Ubuntu 18.04 machines with 4 or more cores: the issue occurred in R 3.6.3, but not in 3.5.3 or 4.0.1.

On an Ubuntu 18.04 machine with 2 cores, the issue did not occur in 3.6.3 (nor in 3.5.3 or 4.0.1).

On the Ubuntu 18.04 and Windows machines with 4+ cores, the issue did not occur in R version 3.6.3 if fwrite was called with the argument nThread = 1.
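For reference, the single-threaded call described above can be sketched in two ways: per call with the nThread argument, or for the whole session with setDTthreads(). The session-wide variant is an assumption about what might also help; only the nThread = 1 form was tested in this report.

```r
library(data.table)

n <- 1000
df <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  x3 = factor(sample(state.abb, n, replace = TRUE))
)
tmp <- tempfile(fileext = ".csv")

# Option 1: limit only this call to a single thread.
fwrite(df, tmp, nThread = 1)
lines_written <- length(readLines(tmp))  # header line plus n data rows

# Option 2 (untested in this report): limit data.table to one thread,
# write, then restore the previous setting. setDTthreads() returns the
# old value invisibly.
old_threads <- setDTthreads(1)
fwrite(df, tmp)
setDTthreads(old_threads)

invisible(file.remove(tmp))
```

Single-threaded writes will be slower on large data, so this is a stopgap on affected R 3.6.3 setups, not a general recommendation.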

In testing on Windows I installed R from CRAN binaries and installed the CRAN Windows binaries of data.table v1.12.8.

In testing on Ubuntu 18.04, starting from scratch I installed R versions with apt-get install r-base-core from the CRAN repository, then installed data.table with:

gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
> install.packages("data.table")
Installing package into ‘/home/ubuntu/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/data.table_1.12.8.tar.gz'
Content type 'application/x-gzip' length 4948391 bytes (4.7 MB)
==================================================
downloaded 4.7 MB

* installing *source* package ‘data.table’ ...
** package ‘data.table’ successfully unpacked and MD5 sums checked
** using staged installation
zlib 1.2.11 is available ok
** libs
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG    -fopenmp -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-V28x5H/r-base-3.6.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c assign.c -o assign.o