Data.table: [R-Forge #2605] add filtering option to fread so it can load part of a file

Created on 8 Jun 2014 · 15 comments · Source: Rdatatable/data.table

Submitted by: stat quant; Assigned to: Nobody; R-Forge link

Discussed in data.table list.

fread(input, chunk.nrows = 10000, chunk.filter = <anything acceptable to i of DT[i]>), where the filter could be grep() or any expression of column names.

Labels: feature request, fread, top request


All 15 comments

I'm pretty sure this is the same as what I had in mind recently, but let me elaborate with an example:

read_dt = data.table(
  id = sample(10, 1e7, TRUE),
  var = rnorm(1e7)
)
fwrite(read_dt, file="dt_to_read.csv")
main_dt = data.table(
  id = sample(8, 1e5, TRUE),
  var2 = rnorm(1e5)
)

I'm working with main_dt but want to pull in some matching (based on id) info from read_dt; currently, I need to do something like this:

relevant_read_dt = fread("dt_to_read.csv")[id %in% main_dt[ , unique(id)]]

This is inefficient because I need to read all of read_dt (especially painful as the number of columns of read_dt increases), then immediately throw away ~20% of its rows.

An approach like this:

relevant_read_dt  = fread("dt_to_read.csv", row.select = id %in% main_dt[ , unique(id)])

would only require: 1) reading id from "dt_to_read.csv", 2) evaluating the logical expression id %in% main_dt[ , unique(id)] to get the row numbers to read, 3) fread-ing only those row numbers.
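No row.select argument exists in fread today, but the intent can be sketched as a two-pass workaround (a smaller toy version of the example above; the second pass still reads the whole file, which is exactly what a real row.select would avoid):

```r
library(data.table)

# toy data, matching the example above but smaller
set.seed(1)
read_dt <- data.table(id = sample(10, 1e4, TRUE), var = rnorm(1e4))
f <- tempfile(fileext = ".csv")
fwrite(read_dt, f)
main_dt <- data.table(id = sample(8, 1e3, TRUE), var2 = rnorm(1e3))

# pass 1: read only the id column to decide which rows are relevant
ids  <- fread(f, select = "id")
keep <- which(ids$id %in% unique(main_dt$id))

# pass 2: read again and subset; unlike the proposed row.select,
# this still materialises the whole file before subsetting
relevant_read_dt <- fread(f)[keep]
```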

Hello.
Is the filtering already implemented?
For example, I want to read a very big csv file with 4 columns (Value, XXX, YYY, ZZZ), and I want to read only the lines where Value >= 1.3.

I could do it in two steps, first reading the whole file and then filtering, but this is slower and I could run into problems if the file doesn't fit in memory.

I don't know if we are talking about the same thing or if I misunderstood it.
fread("file", Value>=1.3)
Regards.

@skanskan Yes, this ticket is about exactly that. No, it hasn't been implemented yet.
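Until it is, one workaround on Unix-alikes is to push the filter into an external tool through fread's cmd argument, so R never sees the discarded lines (assuming here that Value is the first column and the file is comma-separated):

```r
library(data.table)

# toy file with the four columns described above
f <- tempfile(fileext = ".csv")
fwrite(data.table(Value = c(0.5, 1.3, 2.0, 1.1),
                  XXX = 1:4, YYY = 5:8, ZZZ = 9:12), f)

# keep the header (NR==1) plus rows whose first field is >= 1.3;
# awk does the filtering before the text ever reaches fread
dt <- fread(cmd = paste0("awk -F, 'NR==1 || $1 >= 1.3' ", f))
```

This of course depends on awk being available on the machine, which is the portability problem raised further down the thread.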

OK, thank you @jangorecki

I was wondering if any progress has been made here? It'd be a super useful feature!

Any update for those stuck with Windows :D ?

Update: my bad, Cygwin works perfectly on Windows, as said above.
Good installation tutorial here:
Restart R, and you're good to go!

To avoid including the header as a data line while still getting the column names, you can write something like this:

library(data.table)
fichier <- "iris.txt"
# keep the column names
cols <- names(fread(fichier, nrows = 0L, sep = ","))

# load a random sample of the file, excluding the header
df <- fread(paste("tail -n+2", fichier, "| shuf -n 15"),
            sep = ",",
            header = FALSE,
            col.names = cols,
            colClasses = list(character = which(cols == "class"))) # define the classes of your columns

Thanks to @thoera for the help !

Regards.

UPDATE 2 :

After some tests, I figured out that the solution I proposed wasn't working in R.
The command line tail -n+2 fichier.txt | shuf -n 15 works in a cmd console, but not in R with fread:
it returns the header as a data line (at a random position, of course).

This issue can be reproduced with the iris dataset and the following code:

setwd("path")
test <- fread("tail -n+2 IRIS.csv | shuf -n 149"
              ,sep = ","
              ,header = FALSE)

You can also try sed 1d IRIS.csv | shuf -n 149: same result.

Does fread handle pipes and command lines more complicated than a single instruction?

Thanks

Vincent.

@VinceLYO and all: this seems nice, but it has a problem; it depends on external commands that are not available on all computers. It would be great to have a self-contained solution using just fread or plain R commands.

@skanskan: I do agree! I think it's a nice feature. Let's hope for this :)

Any updates on this feature? Is anyone working on this?

Any updates on this feature?

Boop

For those who are bumping this issue, be sure to upvote the first post here as well. AFAIK nobody is currently working on implementing this. If anyone would like to, we would be happy to _assign_ them to this issue.
I will clean up this thread a little.


Regarding the FR itself: I don't think it makes sense to introduce a new mechanism for filtering csv files directly. That is basically a lot of effort and maintenance, whereas grep already works pretty well. What could eventually be low-hanging fruit is to examine the filter expression, guess which columns are required to filter, read only those columns in full, perform the filter using the currently implemented algorithms with which=TRUE, and then re-read the csv applying the filter to lines based on the which results. That would be fully implemented in R (not sure about skipping lines); it might not be very efficient, but it should reduce the peak memory required.
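That idea can be roughly sketched as follows. fread_filter is a hypothetical helper, not part of data.table, and since fread cannot skip arbitrary lines, the second pass still reads the full file, so peak memory is reduced only during pass 1:

```r
library(data.table)

# hypothetical helper sketching the approach above; NOT a data.table function
fread_filter <- function(file, filter) {
  e <- substitute(filter)
  # guess which columns the filter expression refers to
  cols <- intersect(all.vars(e), names(fread(file, nrows = 0L)))
  # pass 1: read only those columns, then evaluate the filter on them
  narrow <- fread(file, select = cols)
  keep <- which(eval(e, narrow))
  # pass 2: re-read and subset; skipping lines directly is the part
  # marked "not sure" above
  fread(file)[keep]
}

# demo on a small file
f <- tempfile(fileext = ".csv")
fwrite(data.table(Value = c(0.5, 1.3, 2.0, 1.1), XXX = 1:4), f)
res <- fread_filter(f, Value >= 1.3)
```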

See here: https://stackoverflow.com/a/62240442/3576984

grep / awk don't have the benefit of fread's automatic parallelism, so they can be quite slow compared to fread.
