Data.table: Support .gz file format for fread

Created on 4 Jul 2014 · 52 Comments · Source: Rdatatable/data.table

I have several thousand .gz files containing data in CSV format, about 60GB in total when compressed. Decompressing them first and then loading some pieces via fread turns out to be a huge pain. I wonder whether fread could be improved so that it can read compressed file formats directly, just as read.table does?

Perhaps file connection issues are highly relevant, as mentioned in #341, #543, and #561.
Some other references:

http://stackoverflow.com/questions/5764499/decompress-gz-file-using-r

http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html

High feature request fread

renqian

👍26

All 52 comments

You can just do fread('zcat file.gz'), or some loop variation, if you have many files.

eantonya on 7 Jul 2014

👍18
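A minimal sketch of the "loop variation" mentioned in the comment above, assuming a Unix-like system with zcat on the PATH and data.table installed. The demo files and the names files, cmds and DT are made up for illustration; the `zcat <` form is used because it is the one reported in this thread to work on OS X as well.

```r
# Create two tiny gzipped CSVs in a temporary directory (demo data only).
demo_dir <- tempfile("gzdemo")
dir.create(demo_dir)
for (i in 1:2) {
  con <- gzfile(file.path(demo_dir, sprintf("part%d.csv.gz", i)), "w")
  writeLines(c("id,value", sprintf("%d,%d", i, i * 10)), con)
  close(con)
}

# Collect the compressed files and build one shell command per file.
# shQuote() protects paths that contain spaces or other special characters.
files <- list.files(demo_dir, pattern = "\\.gz$", full.names = TRUE)
cmds <- sprintf("zcat < %s", shQuote(files, type = "sh"))

# Only attempt the read when both data.table and zcat are available.
if (requireNamespace("data.table", quietly = TRUE) && nzchar(Sys.which("zcat"))) {
  # fread runs each command and parses its output; rbindlist stacks the results.
  DT <- data.table::rbindlist(lapply(cmds, data.table::fread))
}
```

For thousands of files it may be worth reading them in batches, since every call spawns a separate zcat process.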

:Bump: quite useful (and coming up frequently).

Here's one SO post.

arunsrinivasan on 28 Sep 2014

Yes, this would be a very useful feature to have. Using the command line is a temporary solution at best, since it relies on the underlying system having the tools for decompression. For instance, 'zcat' is not available on Windows unless one installs Cygwin or similar.

Since fread is by far the best tool in R for reading files, it would be a huge performance gain to read gzipped/bzipped/... files directly.

gbonamy on 28 Sep 2014

I just saw @arun's bump in my email, and literally a few hours ago I was ingesting 200+ such files. +1 for usefulness

rsaporta on 30 Sep 2014

Would have been useful here as well :
http://www.magesblog.com/2014/10/visualising-seasonality-of-atlantic.html
Will take a look.

mattdowle on 7 Oct 2014

I agree with @gbonamy: reading directly from zip files would be a fantastic addition!!

xiaodaigh on 20 Nov 2014

Reading from a connection with unz() would also be quite useful. I have a function that downloads a zip file, and only reads one file then throws it away. So if I could use fread(unz(zipfile, file = file)) it would be a great addition.

rmscriven on 16 Mar 2015

👍1
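Until fread accepts an unz() connection, the use case in the comment above can be approximated by extracting just the one member to a temporary directory. This is a sketch only: read_zip_member is a hypothetical helper name, and the default reader is base R's read.csv so the function runs without data.table; pass reader = data.table::fread to get a data.table.

```r
# Hypothetical helper: extract a single member of a zip archive to a
# temporary directory, read it, and clean up afterwards.
read_zip_member <- function(zipfile, member, reader = utils::read.csv, ...) {
  exdir <- tempfile("unz_")
  dir.create(exdir)
  # Remove the extracted copy when the function exits, even on error.
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)
  # unzip() returns the path(s) of the files it extracted.
  path <- utils::unzip(zipfile, files = member, exdir = exdir)
  reader(path, ...)
}
```

A call would look like read_zip_member("archive.zip", "data.csv", reader = data.table::fread), where both file names are placeholders.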

+1 for reading directly from gz files. I would personally use it every day.

statquant on 17 Apr 2015

+1 from me as well.

mspivakov on 26 Apr 2015

fleimgruber on 3 Jun 2015

zx8754 on 1 Jul 2015

rickdonnelly on 15 Jul 2015

qgeissmann on 17 Jul 2015

jayjacobs on 28 Jul 2015

I'm curious - are people requesting this mostly working on Windows? I have trouble seeing the desire for this kind of specialization on Linux.

I personally mostly use .xz compression, but wouldn't care if fread directly supported it - I very frequently pipe the uncompressed result and do some post-processing before loading it in R (e.g. fread('xzcat file.xz | grep smth | awk blah')) and I like not depending on fread's file-format reading abilities - my shell processes are almost always going to be more advanced than whatever is implemented in fread.

eantonya on 28 Jul 2015

👍1

:+1:

zachmayer on 21 Aug 2015

Here is a tip for OS X users.
The zcat syntax on OS X is a little different from that on Linux systems. To read *.gz files, use the following call:

dt <- fread(input = 'zcat < data.gz')

dselivanov on 14 Sep 2015

👍18

This is probably not the most efficient way, but it works for me. You will probably have to change unz to gzfile:

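zread <- function(zf, f, ...) {
  # zf: path to the zip archive; f: name of the member to read
  con <- unz(zf, f)
  # make sure the connection is released even if reading fails
  on.exit(close(con))
  # read every line into memory, then pass fread one newline-joined string
  data.table::fread(paste(readLines(con), collapse = "\n"), ...)
}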

gdkrmr on 8 Dec 2015

readLines is incredibly slow...

dselivanov on 8 Dec 2015

@dselivanov it gets the job done on small files, never tried it on large ones though... my method probably does a lot of useless memory allocation passing the whole file around as a character vector.

gdkrmr on 8 Dec 2015

Just zcat the file, see previous posts

statquant on 8 Dec 2015

+1 for us hapless Windows users and for portability. There may be reasons why fread cannot accept a connection (as in help("connections", package="base")) but if not that would be a great and portable solution. Would also help with some common encoding issues (eg BOMs in UTF-8 files).

cybaea on 21 Dec 2015

+1 Our production server is Linux, but we code and test on Windows machines. Having this functionality built into fread would allow us to create one version of code that works in both environments.

harmonica2 on 5 Mar 2016

👍1

Bouncner on 20 Mar 2016

map2085 on 26 Apr 2016

zcat | isn't an option for very large data files because of:
gzip: stdout: No space left on device

webbp on 15 Jun 2016

👍6

@webbp unless you have an unusual /dev/shm mount - if you can't zcat it to /dev/shm (what fread does internally), then I don't see how native support would help you. Perhaps you just don't have enough RAM.

eantonya on 15 Jun 2016

mbacou on 15 Jun 2016

@eantonya yes, i don't have enough ram; i have only 32g + 64g swap. zcat file.tsv.gz > file.tsv followed by fread('file.tsv') is an easy enough workaround, but i and others (1 2 3) would be grateful for a proper solution.

webbp on 25 Jun 2016

👍2

@webbp You should resize /dev/shm (aka your virtual memory) to be larger than your physical RAM and you won't have that issue. Your (uncompressed) file is either larger than 1/2 of your RAM size (but smaller than your RAM, otherwise fread of uncompressed file would fail) and you have the default value of shm set to half of your RAM, or you increased your RAM after installation and didn't update /etc/fstab.

eantonya on 29 Jun 2016

👍2

lucasmation on 20 Sep 2016

aruberuku on 2 Nov 2016

+1 - also for other connections (gzfile, bzfile, xzfile, unz)

setempler on 18 Nov 2016

+1 My first wish from fread / fwrite

TuSKan on 23 Nov 2016

@webbp, I have the same problem. I cannot use zcat, nice as it is, because /dev/shm is too small on my AWS EC2 instance. I should try to redirect /dev/shm to an EBS disk, but I have not figured out how yet. Meanwhile, "zcat file.tsv.gz > file.tsv followed by fread('file.tsv')" is a painful workaround, but at least it works.

An alternative idea would be to use a specific tmp directory. Any idea?

borisclemencon on 25 Nov 2016

sznadas on 6 Feb 2017

mGalarnyk on 10 Feb 2017

rargelaguet on 15 Feb 2017

Is it possible to make an R package that wraps command line tools for Windows, Mac, and Linux in the same interface? Then we could use the zcat approach with fread whenever that package is installed.

An example of this kind of package

I realized this kind of package would not be allowed on CRAN if it needs to bundle a Windows build of gzip. It would either have to be hosted elsewhere, or users would have to download gzip for Windows themselves.

xhdong-umd on 23 Feb 2017

Uncompressing the file into a temp file on disk will always work, but that could be slow because of disk access. If we read the file into a raw vector in RAM and then uncompress it with memDecompress before feeding the uncompressed raw vector to fread, will that work?

xhdong-umd on 23 Feb 2017
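A rough sketch of the in-memory idea from the comment above. Note that it swaps in a different technique: instead of memDecompress it wraps the raw vector in gzcon(rawConnection(...)), which is known to read gzip-format data. The demo file and all variable names are illustrative, and the approach still holds both the compressed and the uncompressed copy in RAM, so it only suits files that fit comfortably in memory.

```r
# Demo data: write a small gzipped CSV so the example is self-contained.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
writeLines(c("x,y", "1,a", "2,b"), con)
close(con)

# Read the compressed bytes into RAM as a raw vector (no temp file for the
# uncompressed data).
compressed <- readBin(gz_path, what = "raw", n = file.size(gz_path))

# Decompress in memory: gzcon() inflates gzip data as it is read from the
# underlying raw connection.
raw_con <- gzcon(rawConnection(compressed))
lines <- readLines(raw_con)
close(raw_con)

# fread treats a string containing newlines as the data itself.
csv_text <- paste(lines, collapse = "\n")
if (requireNamespace("data.table", quietly = TRUE)) {
  DT <- data.table::fread(csv_text)
}
```

Going through readLines has the same speed drawback noted earlier in the thread, so this mainly demonstrates that no disk round trip is needed.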

I wrote a function that decompresses zip, gz, bzip2, or xz files into a temp file, runs a function on it, then removes the temp file. So we can use temp_unzip(file, fread, ...).

The code is pure R, so it should work on all platforms. I feel the zcat method is good enough for Linux/Mac (I do need to quote the file name sometimes), but too complex for Windows.

The code is inspired by R.utils, but I really don't like its default behavior of removing the input file. Also, I think the R.utils author just adapted the compressFile code for decompressFile. There is a need to call gzfile and bzfile separately for compression, but you don't have to call gzfile, bzfile and xzfile separately for decompression, because gzfile can read all of these compression formats (except zip, for which I used unzip).

Here are some benchmarks:

library(microbenchmark)
microbenchmark(
  fread(eg_csv),
  fread(input = paste0("zcat < '", eg_gz, "'")), 
  temp_unzip(eg_bz, fread),
  temp_unzip(eg_zip, fread),
  temp_unzip(eg_gz, fread),
  times = 1)

Unit: seconds
                                          expr      min       lq     mean   median       uq      max neval
                                 fread(eg_csv) 2.117812 2.117812 2.117812 2.117812 2.117812 2.117812     1
 fread(input = paste0("zcat < '", eg_gz, "'")) 1.984009 1.984009 1.984009 1.984009 1.984009 1.984009     1
                      temp_unzip(eg_bz, fread) 6.304849 6.304849 6.304849 6.304849 6.304849 6.304849     1
                     temp_unzip(eg_zip, fread) 2.481650 2.481650 2.481650 2.481650 2.481650 2.481650     1
                      temp_unzip(eg_gz, fread) 2.487811 2.487811 2.487811 2.487811 2.487811 2.487811     1

xhdong-umd on 2 Mar 2017

👍3
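The temp_unzip code itself is not reproduced in this thread. A minimal sketch of the approach described in the comment above could look like the following, where temp_unzip_sketch is a hypothetical name, zip archives are assumed to contain a single file, and the real function may well differ.

```r
# Hypothetical re-creation of the idea: decompress into a temporary location,
# apply `fun` to the decompressed file, then delete the temporary copy.
temp_unzip_sketch <- function(path, fun, ...) {
  work <- tempfile("temp_unzip_")
  dir.create(work)
  # Always remove the temporary directory; the input file is never touched.
  on.exit(unlink(work, recursive = TRUE), add = TRUE)

  if (grepl("\\.zip$", path, ignore.case = TRUE)) {
    # zip needs unzip(); assumes the archive holds exactly one file.
    target <- utils::unzip(path, exdir = work)[1]
  } else {
    # gzfile() can read gzip, bzip2 and xz files, so one branch covers all three.
    target <- file.path(work, "decompressed")
    src <- gzfile(path, open = "rb")
    dst <- file(target, open = "wb")
    repeat {
      # Copy in chunks so the whole file never has to sit in memory.
      chunk <- readBin(src, what = "raw", n = 1e6)
      if (length(chunk) == 0L) break
      writeBin(chunk, dst)
    }
    close(src)
    close(dst)
  }
  fun(target, ...)
}
```

Usage mirrors the call in the comment, for example temp_unzip_sketch("data.csv.gz", data.table::fread), with the file name as a placeholder.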

One thing to note is that the zcat solution appears to work only if the file is in the same directory from which R is launched:

Error in fread("zcat < data/directory/test.csv.gz") :
  File is empty: /var/folders/41/asdf_kj80000gn/T//RtmpwtAttt/fileebeb5e124cef

frenchja on 13 May 2017

👍1

i forgot about this issue and tried to fread a gz file, only to get a mysterious error causing me to waste time, again, searching for the solution.

3 years later, still waiting for this elementary fix.

map2085 on 17 May 2017

👍4

After further exploration, my error above only occurs when there are spaces in the directory name:

fread("zcat < data/directory\ one/test.csv.gz")

But not with underscores:

fread("zcat < data/directory_two/test.csv.gz")

And can be alleviated by escaping the backslash again:

fread("zcat < data/directory\\ one/test.csv.gz")

Hope this helps. Otherwise, the zcat solution works fine.

frenchja on 17 May 2017

👍3

Another example on StackOverflow why this feature is needed:
data.table fread error - gzip file - set temporary directory

jaapwalhout on 27 Feb 2018

👍3

How about:

library(readr)
DT = as.data.table(read_csv("myfile.gz"))

webbp on 1 Mar 2018

This is considerably slower.

mspivakov on 1 Mar 2018

setDT will be faster than as.data.table. What tool does read_csv use for decompressing the .gz?

MichaelChirico on 2 Mar 2018

@frenchja - agree - though you might prefer to escape those spaces with R's shQuote

malcook on 18 Mar 2018

readr reads the data from a connection (see ?gzfile) into memory: https://github.com/tidyverse/readr/blob/6f0bb65296afa55709fd60cdc5d59a4c89623e36/src/connection.cpp

And it is parsed with read_tokens_ : https://github.com/tidyverse/readr/blob/6f0bb65296afa55709fd60cdc5d59a4c89623e36/src/read.cpp

byapparov on 18 Apr 2018

@frenchja How would this work with paste0? I now have the code below, but it throws an error:

SOME_DIR = "/Users/swvanderlaan/some_dir"
data <- fread('zcat < paste0(SOME_DIR,"/somedata.txt.gz")',
              header = TRUE, na.strings = "NA",
              verbose = TRUE, showProgress = TRUE)

Ah got it, it should be this:

data <- fread(paste0("zcat < '", SOME_DIR, "/somedata.txt.gz", "'"),
              header = TRUE, na.strings = "NA",
              verbose = TRUE, showProgress = TRUE)

swvanderlaan on 28 Jun 2018

@swvanderlaan I tend to use sprintf for cases like this; you should also use file.path and shQuote to be platform-robust:

fread(sprintf('zcat %s', shQuote(file.path(SOME_DIR, 'somedata.txt.gz'))))

MichaelChirico on 2 Jul 2018
