Submitted by: Chris Neff; Assigned to: Nobody; R-Forge link
I use a corporate internal networked file system for much of my data, so I often need to call read.csv with a file connection. fread doesn't support this yet.
Namely I would like the following to work:
```r
f = file("~/path/to/file.csv")
dt = fread(f)
```
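One possible stop-gap until connections are supported natively: drain the connection to a temporary file and `fread()` that. This is a hedged sketch, not data.table API; `fread_conn` is a made-up helper name.

```r
library(data.table)

# Hypothetical workaround helper (not part of data.table): spill the
# connection's contents to a temp file, then fread the temp file.
fread_conn <- function(conn, ...) {
  if (!isOpen(conn)) open(conn, "r")
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  writeLines(readLines(conn), tmp)  # drain the connection to disk
  fread(tmp, ...)
}
```

With this, `dt <- fread_conn(f)` would behave like the requested `fread(f)`, at the cost of a full extra copy on disk.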
This would be a pretty awesome feature that I am after as well!!
I agree, this would be great.
:+1: I need this!
An interesting use case for this feature would be reading chunks from the CSV file and passing each chunk to a worker that computes additive and/or semi-additive metrics. I have a working example using read.table (read.csv2) to deal with a CSV file that doesn't fit in available memory (assuming all fields are needed in the process). However, it is still not possible to use all workers effectively given the slow nature of read.table. I have high expectations for fread with a file connection as input.
```r
library(doSNOW)
library(data.table)
library(iterators)
library(parallel)

cl <- makeCluster(detectCores(logical = FALSE), type = "SOCK")
registerDoSNOW(cl)

chunkSize = 250000
conn = file("FILE_BIGGER_THAN_AVAILABLE_MEMORY.csv", "r")
header = scan(conn, what = character(), sep = ';', nlines = 1)

it <- iter(function() {
  tryCatch({
    # EXCELLENT OPPORTUNITY TO TEST FREAD'S FILE CONNECTION FEATURE
    chunk = read.csv2(file = conn, header = FALSE, nrows = chunkSize)
    colnames(chunk) = header
    setDT(chunk)
    return(chunk)
  }, error = function(e) {
    # READ.TABLE THROWS ERRORS WHEN A READ IS MADE AFTER EOF
    stop("StopIteration", call. = FALSE)
  })
})

somefun <- function(dt) {
  aggreg = dt[,
    list(Obs = .N),
    by = list(CATEGORICAL_VARIABLE_A)
  ]
  return(aggreg)
}

allaggreg <- foreach(slice = it, .packages = 'data.table',
                     .combine = 'rbind', .inorder = FALSE) %dopar% {
  somefun(slice)
}

setkey(allaggreg, CATEGORICAL_VARIABLE_A)
finalaggreg = allaggreg[,
  list(Obs = sum(Obs, na.rm = TRUE)),
  by = list(CATEGORICAL_VARIABLE_A)
]

close(conn)
stopCluster(cl)
```
+1. I wish I could fread(bzfile("file.csv")).
+1. I'd like to process stdin using data.table with Apache Hive / Hadoop. Here's what I currently do:
```{R}
stream_in = file("stdin")
open(stream_in)
queue = read.table(stream_in, nrows = rows_per_chunk, colClasses = input_classes
                 , col.names = input_cols, na.strings = "\\N")
```
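The loop above can be adapted to fread by pulling chunks off the connection with `readLines()` and parsing them with `fread(text = ...)`, which accepts an in-memory character vector. A self-contained sketch, with a `textConnection` standing in for stdin and made-up column names:

```r
library(data.table)

# textConnection stands in for file("stdin") in this sketch
stream_in <- textConnection("1,2\n3,4\n5,6\n7,8\n9,10")
rows_per_chunk <- 2L

chunks <- list()
repeat {
  lines <- readLines(stream_in, n = rows_per_chunk)
  if (length(lines) == 0L) break  # clean EOF, unlike read.table
  chunks[[length(chunks) + 1L]] <-
    fread(text = lines, header = FALSE, col.names = c("a", "b"))
}
close(stream_in)
result <- rbindlist(chunks)
```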
@mauriciocramos - does this not work for your case:
```r
fread("bunzip2 -c file.csv")
```
or
```r
fread(sprintf("bunzip2 -c %s", "file.csv"))
```
(the `-c` flag makes bunzip2 write to stdout, which fread reads)
I would like this too!
I have a huge CSV file (about 7 GB) and little RAM (about 8 GB). I read it in chunks with a for loop and the skip and nrows parameters to extract some features. While this is very efficient at the beginning, it becomes very slow towards the end: each chunk is located by scanning from the start of the file. I would like fread to remember where I was in the file instead of seeking from the beginning every time; I think using a connection could help.
I hope I was clear. Thank you in advance.
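Keeping one open connection avoids exactly this re-scanning: each `readLines()` call resumes where the previous one stopped, and `fread(text = ...)` parses each chunk at fread speed. A hedged sketch (`read_chunks` is a made-up helper; assumes a data.table version with the `text` argument):

```r
library(data.table)

# Hypothetical helper: stream a big CSV in chunks over one open connection,
# applying `process` to each chunk. The connection keeps its own position,
# so no chunk is located by re-skipping from the start of the file.
read_chunks <- function(path, chunk_lines = 250000L, process) {
  conn <- file(path, "r")
  on.exit(close(conn))
  header <- readLines(conn, n = 1L)       # read the header line once
  repeat {
    lines <- readLines(conn, n = chunk_lines)
    if (length(lines) == 0L) break        # clean EOF, unlike read.table
    process(fread(text = c(header, lines)))
  }
  invisible()
}
```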
@st-pasha made several relevant points to this issue in #1721, for example:
> Generally, fread algorithm likes to see the whole file in order to properly detect types, number of columns, etc. Also, sometimes it needs several passes to get the result correctly.
Based on those points I actually don't think data.table needs to support general file connections or chunking. Sure, it would be convenient, but probably not worth the future trouble. There are already several existing solutions: read.table is actually reasonably efficient if we can specify column types and number of rows ahead of time.

A note to explore after implementing: use textConnection to handle input like
```r
fread('a,b,c
1,2,3
4,5,6')
```
instead of writing it out to disk as is done now, if I'm not mistaken.
Curious here if fread simply handling the logic of spilling the connection to disk & then reading would be enough for this FR?
At a glance I think this implementation doesn't satisfy the "chunked read" use case, am I missing anything else?
@MichaelChirico This would not satisfy my use case. The idea of using large bzip'd files is precisely to avoid spilling to (slow) disk.
If fread indeed needs multiple passes and seeks, then either it should use seek(), or this FR should be closed as WONTFIX.