Data.table: Support .gz file format for fread

Created on 4 Jul 2014 · 52 Comments · Source: Rdatatable/data.table

I have several thousand .gz files containing data in CSV format, about 60GB in total as .gz files. Decompressing them first and then loading some pieces via fread turns out to be a huge pain. I wonder whether it is possible to improve fread so that it can read compressed file formats just as read.table does?

Perhaps file connection issues are highly relevant, as mentioned in #341, #543, and #561.
Some other references:

http://stackoverflow.com/questions/5764499/decompress-gz-file-using-r

http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html

High feature request fread

All 52 comments

You can just do fread('zcat file.gz'), or some loop variation, if you have many files.
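A minimal sketch of the "loop variation" mentioned above, assuming a directory of gzipped CSV files that all share the same columns. The helper name read_gz_dir is made up for illustration, `gzip -dc` is used in place of zcat because it behaves the same on Linux and OS X, and the `cmd` argument assumes a reasonably recent data.table:

```r
library(data.table)

# Hypothetical helper (not part of data.table): read every .gz file in a
# directory through a shell pipe and stack the results into one data.table.
read_gz_dir <- function(dir, pattern = "\\.gz$", ...) {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  tables <- lapply(files, function(f) {
    # gzip -dc writes the decompressed data to stdout;
    # shQuote protects paths that contain spaces
    fread(cmd = paste("gzip -dc", shQuote(f)), ...)
  })
  names(tables) <- basename(files)
  # idcol records which file each row came from
  rbindlist(tables, idcol = "source_file")
}
```

This still depends on gzip being on the PATH, so it does not help on a plain Windows install.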

:Bump: quite useful (and coming up frequently).

Here's one SO post.

Yes, this would be a very useful feature to have. Using the command line is a temporary solution at best, since it relies on the underlying system having the decompression tools. For instance, 'zcat' is not available on Windows unless one installs Cygwin etc.

Since fread is by far the best tool in R for reading files, it would be a huge performance gain to read gzipped/bzipped/... files directly.

I just saw @arun's bump in my email, and literally a few hours ago I was ingesting 200+ such files. +1 for usefulness

Would have been useful here as well:
http://www.magesblog.com/2014/10/visualising-seasonality-of-atlantic.html
Will take a look.

I agree with @gbonamy: reading directly from zip files would be a fantastic addition!!

Reading from a connection with unz() would also be quite useful. I have a function that downloads a zip file, and only reads one file then throws it away. So if I could use fread(unz(zipfile, file = file)) it would be a great addition.

++ for reading directly from gz files. I would personally use it every day.

+1 from me as well.

+1

+1

+1

+1

+1

I'm curious - are people requesting this mostly working on Windows? I have trouble seeing the desire for this kind of specialization on Linux.

I personally mostly use .xz compression, but wouldn't care if fread directly supported it - I very frequently pipe the uncompressed result and do some post-processing before loading it in R (e.g. fread('xzcat file.xz | grep smth | awk blah')) and I like not depending on fread's file-format reading abilities - my shell processes are almost always going to be more advanced than whatever is implemented in fread.

:+1:

Just putting my tip here for OS X users.
The zcat syntax on OS X is a little different from that on Linux systems. To read *.gz files, use the following call:

dt <- fread(input = 'zcat < data.gz')

This is probably not the most efficient way, but it works for me; you will probably have to change unz to gzfile:

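zread <- function(zf, f, ...) {
  # zf: path to the zip archive; f: name of the file inside it
  library(data.table)
  con <- unz(zf, f)
  on.exit(close(con))
  # Read every line into memory and pass them to fread as one string
  fread(paste(readLines(con), collapse = "\n"), ...)
}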

readLines is incredibly slow...

@dselivanov it gets the job done on small files, never tried it on large ones though... my method probably does a lot of useless memory allocation passing the whole file around as a character vector.

Just zcat the file, see previous posts

+1 for us hapless Windows users and for portability. There may be reasons why fread cannot accept a connection (as in help("connections", package="base")) but if not that would be a great and portable solution. Would also help with some common encoding issues (eg BOMs in UTF-8 files).

+1 Our production server is Linux, but we code and test on Windows machines. Having this functionality built into fread would allow us to create one version of code that works in both environments.

+1

+1

zcat | isn't an option for very large data files because of:
gzip: stdout: No space left on device

@webbp unless you have an unusual /dev/shm mount - if you can't zcat it to /dev/shm (what fread does internally), then I don't see how native support would help you. Perhaps you just don't have enough RAM.

+1

@eantonya Yes, I don't have enough RAM; I have only 32 GB plus 64 GB of swap. zcat file.tsv.gz > file.tsv followed by fread('file.tsv') is an easy enough workaround, but I and others (1 2 3) would be grateful for a proper solution.

@webbp You should resize /dev/shm (aka your virtual memory) to be larger than your physical RAM and you won't have that issue. Your (uncompressed) file is either larger than 1/2 of your RAM size (but smaller than your RAM, otherwise fread of uncompressed file would fail) and you have the default value of shm set to half of your RAM, or you increased your RAM after installation and didn't update /etc/fstab.

+1

+1

+1 - also for other connections (gzfile, bzfile, xzfile, unz)

+1 My first wish from fread / fwrite

@webbp, I have the same problem. I cannot use zcat, elegant as it is, because /dev/shm is too small on my AWS EC2 instance. I should try to redirect /dev/shm to an EBS disk, but I have not figured out how yet. Meanwhile, "zcat file.tsv.gz > file.tsv followed by fread('file.tsv')" is a painful workaround, but at least it works.

An alternative idea would be to use a specific tmp directory. Any idea?

+1

+1

+1

Is it possible to make an R package that wraps command-line tools for Windows, Mac, and Linux behind the same interface? Then we could use the zcat approach with fread whenever that package is installed.

An example of this kind of package

I realized this kind of package will not be allowed on CRAN if it needs to bundle a Windows build of gzip. It would either have to be hosted elsewhere, or users would have to download gzip for Windows themselves.

Uncompressing the file into a temp file on disk will always work, but it could be slow because of disk access. If we read the file into a raw vector in RAM and then uncompress it with memDecompress before feeding the uncompressed raw vector to fread, would that work?

I wrote a function that decompresses zip, gz, bzip2, or xz files into a temp file, runs a function on it, then removes the temp file. So we can use temp_unzip(file, fread, ...).

The code is pure R, so it should work on all platforms. I feel the zcat method is good enough for Linux/Mac (though I do need to quote the file name sometimes), but too complex for Windows.

The code is inspired by R.utils, but I really don't like its default behavior of removing the input file. Also, I think the R.utils author just adapted the compressFile code for decompressFile. You need to call gzfile and bzfile separately for compression, but for decompression you don't have to call gzfile, bzfile and xzfile separately, because gzfile can read all of these compression formats (except zip, for which I used unzip).
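For readers who want the general shape of such a helper, here is a rough sketch in base R. It is not the code described above, only an illustration of the same idea: the argument names, the chunk size, and the assumption that a zip archive holds a single data file are all made up for the example.

```r
# Hypothetical sketch of a temp_unzip-style helper (illustration only).
# Decompresses `path` into a temp file, calls fun(tempfile, ...), then cleans up.
temp_unzip <- function(path, fun, ..., chunk_bytes = 1e6) {
  if (grepl("\\.zip$", path, ignore.case = TRUE)) {
    # Assumption: the archive holds one data file, so take the first entry
    member <- utils::unzip(path, list = TRUE)$Name[1]
    src <- unz(path, member, open = "rb")
  } else {
    # gzfile() in read mode also detects bzip2 and xz input
    src <- gzfile(path, open = "rb")
  }
  on.exit(close(src), add = TRUE)

  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)

  # Stream the decompressed bytes to disk in chunks to keep memory use flat
  dst <- file(tmp, open = "wb")
  tryCatch({
    repeat {
      chunk <- readBin(src, what = "raw", n = chunk_bytes)
      if (length(chunk) == 0L) break
      writeBin(chunk, dst)
    }
  }, finally = close(dst))

  fun(tmp, ...)
}
```

Usage would look like temp_unzip("myfile.csv.gz", data.table::fread); any function that takes a file path as its first argument should work.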

Here are some benchmarks:

library(microbenchmark)
microbenchmark(
  fread(eg_csv),
  fread(input = paste0("zcat < '", eg_gz, "'")),
  temp_unzip(eg_bz, fread),
  temp_unzip(eg_zip, fread),
  temp_unzip(eg_gz, fread),
  times = 1)
Unit: seconds
                                          expr      min       lq     mean   median       uq      max neval
                                 fread(eg_csv) 2.117812 2.117812 2.117812 2.117812 2.117812 2.117812     1
 fread(input = paste0("zcat < '", eg_gz, "'")) 1.984009 1.984009 1.984009 1.984009 1.984009 1.984009     1
                      temp_unzip(eg_bz, fread) 6.304849 6.304849 6.304849 6.304849 6.304849 6.304849     1
                     temp_unzip(eg_zip, fread) 2.481650 2.481650 2.481650 2.481650 2.481650 2.481650     1
                      temp_unzip(eg_gz, fread) 2.487811 2.487811 2.487811 2.487811 2.487811 2.487811     1

One thing to note is that the zcat solution appears to work only if the file is in the directory from which R was launched:

Error in fread("zcat < data/directory/test.csv.gz") :
  File is empty: /var/folders/41/asdf_kj80000gn/T//RtmpwtAttt/fileebeb5e124cef

i forgot about this issue and tried to fread a gz file, only to get a mysterious error causing me to waste time, again, searching for the solution.

3 years later, still waiting for this elementary fix.

After further exploration, my error above only occurs when there are spaces in the directory name:

fread("zcat < data/directory\ one/test.csv.gz")

But not with underscores:

fread("zcat < data/directory_two/test.csv.gz")

And can be alleviated by escaping the backslash again:

fread("zcat < data/directory\\ one/test.csv.gz")

Hope this helps. Otherwise, the zcat solution works fine.

Another example on StackOverflow why this feature is needed:
data.table fread error - gzip file - set temporary directory

How about:

library(readr)
DT = as.data.table(read_csv("myfile.gz"))

This is considerably slower.

setDT will be faster than as.data.table. What tool does read_csv use to decompress the .gz?

@frenchja - agree - though you might prefer to escape those spaces with R's shQuote
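To illustrate the shQuote suggestion, a small sketch. The helper name gz_cmd is made up, `type = "sh"` is spelled out so the quoting is the same on every platform, and `gzip -dc` stands in for zcat:

```r
# Hypothetical helper: build a decompression command that survives
# spaces in the path by letting shQuote() do the quoting.
gz_cmd <- function(path) {
  paste("gzip -dc", shQuote(path, type = "sh"))
}

gz_cmd("data/directory one/test.csv.gz")
# "gzip -dc 'data/directory one/test.csv.gz'"

# The resulting string can then be handed to fread, e.g.
# data.table::fread(cmd = gz_cmd("data/directory one/test.csv.gz"))
```

This avoids escaping backslashes by hand inside the R string.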

@frenchja How would this work with paste0? I now have the code below, but it throws an error:

SOME_DIR = "/Users/swvanderlaan/some_dir"
data <- fread('zcat < paste0(SOME_DIR,"/somedata.txt.gz")',
              header = TRUE, na.strings = "NA",
              verbose = TRUE, showProgress = TRUE)

Ah got it, it should be this:

data <- fread(paste0("zcat < '", SOME_DIR, "/somedata.txt.gz", "'"),
              header = TRUE, na.strings = "NA",
              verbose = TRUE, showProgress = TRUE)

@swvanderlaan I tend to use sprintf for cases like this; you should also use file.path and shQuote to be platform-robust:

fread(sprintf('zcat %s', shQuote(file.path(SOME_DIR, 'somedata.txt.gz'))))