Submitted by: Chris Neff; Assigned to: Nobody; R-Forge link
I use a corporate internal networked file system for much of my data, so I often need to call read.csv with a file connection. fread doesn't support this yet.
Namely I would like the following to work:
```r
f = file("~/path/to/file.csv")
dt = fread(f)
```
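One possible stop-gap until connections are supported natively: drain the connection to a temporary file and `fread()` that. This is a hedged sketch, not data.table API; `fread_conn` is a made-up helper name.

```r
library(data.table)

# Hypothetical workaround helper (not part of data.table): spill the
# connection's contents to a temp file, then fread the temp file.
fread_conn <- function(conn, ...) {
  if (!isOpen(conn)) open(conn, "r")
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  writeLines(readLines(conn), tmp)  # drain the connection to disk
  fread(tmp, ...)
}
```

With this, `dt <- fread_conn(f)` would behave like the requested `fread(f)`, at the cost of a full extra copy on disk.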
This would be a pretty awesome feature that I am after as well!!
I agree, this would be great.
:+1: I need this!
An interesting use case for this feature would be reading chunks from the CSV file and passing each chunk to a worker that computes additive and/or semi-additive metrics. I have a working example using read.table (read.csv2) to deal with a CSV file that doesn't fit in available memory (assuming all fields are needed in the process). However, it is still not possible to use all workers effectively given the slow nature of read.table. I have high expectations for fread with a file connection as input.
```r
library(doSNOW)
library(data.table)
library(iterators)
library(parallel)

cl <- makeCluster(detectCores(logical = FALSE), type = "SOCK")
registerDoSNOW(cl)

chunkSize = 250000
conn = file("FILE_BIGGER_THAN_AVAILABLE_MEMORY.csv", "r")
header = scan(conn, what = character(), sep = ';', nlines = 1)

it <- iter(function() {
  tryCatch({
    # EXCELLENT OPPORTUNITY TO TEST FREAD'S FILE CONNECTION FEATURE
    chunk = read.csv2(file = conn, header = FALSE, nrows = chunkSize)
    colnames(chunk) = header
    setDT(chunk)
    return(chunk)
  }, error = function(e) {
    # READ.TABLE THROWS ERRORS WHEN A READ IS MADE AFTER EOF
    stop("StopIteration", call. = FALSE)
  })
})

somefun <- function(dt) {
  aggreg = dt[,
    list(Obs = .N),
    by = list(CATEGORICAL_VARIABLE_A)
  ]
  return(aggreg)
}

allaggreg <- foreach(slice = it, .packages = 'data.table',
                     .combine = 'rbind', .inorder = FALSE) %dopar% {
  somefun(slice)
}

setkey(allaggreg, CATEGORICAL_VARIABLE_A)
finalaggreg = allaggreg[,
  list(Obs = sum(Obs, na.rm = TRUE)),
  by = list(CATEGORICAL_VARIABLE_A)
]

close(conn)
stopCluster(cl)
```
+1. I wish I could fread(bzfile("file.csv")).
+1. I'd like to process stdin using data.table with Apache Hive / Hadoop. Here's what I currently do:
```{R}
stream_in = file("stdin")
open(stream_in)
queue = read.table(stream_in, nrows = rows_per_chunk, colClasses = input_classes
                 , col.names = input_cols, na.strings = "\\N")
```
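The loop above can be adapted to fread by pulling chunks off the connection with `readLines()` and parsing them with `fread(text = ...)`, which accepts an in-memory character vector. A self-contained sketch, with a `textConnection` standing in for stdin and made-up column names:

```r
library(data.table)

# textConnection stands in for file("stdin") in this sketch
stream_in <- textConnection("1,2\n3,4\n5,6\n7,8\n9,10")
rows_per_chunk <- 2L

chunks <- list()
repeat {
  lines <- readLines(stream_in, n = rows_per_chunk)
  if (length(lines) == 0L) break  # clean EOF, unlike read.table
  chunks[[length(chunks) + 1L]] <-
    fread(text = lines, header = FALSE, col.names = c("a", "b"))
}
close(stream_in)
result <- rbindlist(chunks)
```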
@mauriciocramos - does this not work for your case:
```r
fread("bunzip2 -c file.csv")
```
or
```r
fread(sprintf("bunzip2 -c %s", "file.csv"))
```
(the `-c` flag makes bunzip2 write to stdout, which fread reads)
I would like this too!
I have a huge CSV file (about 7 GB) and little RAM (about 8 GB). I read it in chunks with a for loop and the skip and nrows parameters to extract some features. While this is very efficient at the beginning, it becomes very slow towards the end: each chunk is located by scanning from the start of the file. I would like fread to remember where I was in the file instead of seeking from the beginning every time; I think using a connection could help.
I hope I was clear. Thank you in advance.
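Keeping one open connection avoids exactly this re-scanning: each `readLines()` call resumes where the previous one stopped, and `fread(text = ...)` parses each chunk at fread speed. A hedged sketch (`read_chunks` is a made-up helper; assumes a data.table version with the `text` argument):

```r
library(data.table)

# Hypothetical helper: stream a big CSV in chunks over one open connection,
# applying `process` to each chunk. The connection keeps its own position,
# so no chunk is located by re-skipping from the start of the file.
read_chunks <- function(path, chunk_lines = 250000L, process) {
  conn <- file(path, "r")
  on.exit(close(conn))
  header <- readLines(conn, n = 1L)       # read the header line once
  repeat {
    lines <- readLines(conn, n = chunk_lines)
    if (length(lines) == 0L) break        # clean EOF, unlike read.table
    process(fread(text = c(header, lines)))
  }
  invisible()
}
```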
@st-pasha made several relevant points to this issue in #1721, for example:
> Generally, fread algorithm likes to see the whole file in order to properly detect types, number of columns, etc. Also, sometimes it needs several passes to get the result correctly.
Based on those points I actually don't think data.table needs to support general file connections or chunking. Sure, it would be convenient, but probably not worth the future trouble. There are already several existing solutions: read.table is actually reasonably efficient if we can specify column types and number of rows ahead of time.

A note to explore after implementing: use textConnection to handle input like
```r
fread('a,b,c
1,2,3
4,5,6')
```
instead of writing it out to disk as is done now, if I'm not mistaken.
Curious here if fread simply handling the logic of spilling the connection to disk & then reading would be enough for this FR?
At a glance I think this implementation doesn't satisfy the "chunked read" use case, am I missing anything else?
@MichaelChirico This would not satisfy my use case. The idea of using large bzip'd files is precisely to avoid spilling to (slow) disk.
If fread indeed needs multiple passes and seeks, then either it should use seek(), or this FR should be closed as WONTFIX.