Data.table: [R-Forge #2605] add filtering option to fread so it can load part of a file

Created on 8 Jun 2014 · 15 comments · Source: Rdatatable/data.table

Submitted by: stat quant; Assigned to: Nobody; R-Forge link

Discussed in data.table list.

fread(input, chunk.nrows = 10000, chunk.filter = <anything acceptable to i of DT[i]>), where the filter could be grep() or any expression of column names.

Labels: feature request, fread, top request


All 15 comments

I'm pretty sure this is the same as what I had in mind recently, but let me elaborate with an example:

read_dt = data.table(
  id = sample(10, 1e7, TRUE),
  var = rnorm(1e7)
)
fwrite(read_dt, file="dt_to_read.csv")
main_dt = data.table(
  id = sample(8, 1e5, TRUE),
  var2 = rnorm(1e5)
)

I'm working with main_dt but want to pull in some matching (based on id) info from read_dt; currently, I need to do something like this:

relevant_read_dt = fread("dt_to_read.csv")[id %in% main_dt[ , unique(id)]]

This is inefficient because I need to read all of read_dt (especially painful as the number of columns of read_dt increases), then immediately throw away ~20% of its rows.

An approach like this:

relevant_read_dt  = fread("dt_to_read.csv", row.select = id %in% main_dt[ , unique(id)])

would only require: 1) reading id from "dt_to_read.csv", 2) evaluating the logical expression id %in% main_dt[ , unique(id)] to get the row numbers to read, 3) fread-ing only those row numbers.
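No row.select argument exists in fread today, but the intent can be sketched as a two-pass workaround (a smaller toy version of the example above; the second pass still reads the whole file, which is exactly what a real row.select would avoid):

```r
library(data.table)

# toy data, matching the example above but smaller
set.seed(1)
read_dt <- data.table(id = sample(10, 1e4, TRUE), var = rnorm(1e4))
f <- tempfile(fileext = ".csv")
fwrite(read_dt, f)
main_dt <- data.table(id = sample(8, 1e3, TRUE), var2 = rnorm(1e3))

# pass 1: read only the id column to decide which rows are relevant
ids  <- fread(f, select = "id")
keep <- which(ids$id %in% unique(main_dt$id))

# pass 2: read again and subset; unlike the proposed row.select,
# this still materialises the whole file before subsetting
relevant_read_dt <- fread(f)[keep]
```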

Hello.
Is the filtering already implemented?
For example, I want to read a very big csv file with 4 columns (Value, XXX, YYY, ZZZ), and I want to read only the lines where Value >= 1.3.

I could do it in two steps, first reading the whole file and then filtering, but this is slower and I could run into problems if the file doesn't fit in memory.

I don't know if we are talking about the same thing or if I misunderstood it.
fread("file", Value>=1.3)
Regards.

@skanskan Yes, this ticket is about exactly that. No, it hasn't been implemented yet.
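Until it is, one workaround on Unix-alikes is to push the filter into an external tool through fread's cmd argument, so R never sees the discarded lines (assuming here that Value is the first column and the file is comma-separated):

```r
library(data.table)

# toy file with the four columns described above
f <- tempfile(fileext = ".csv")
fwrite(data.table(Value = c(0.5, 1.3, 2.0, 1.1),
                  XXX = 1:4, YYY = 5:8, ZZZ = 9:12), f)

# keep the header (NR==1) plus rows whose first field is >= 1.3;
# awk does the filtering before the text ever reaches fread
dt <- fread(cmd = paste0("awk -F, 'NR==1 || $1 >= 1.3' ", f))
```

This of course depends on awk being available on the machine, which is the portability problem raised further down the thread.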

OK, thank you @jangorecki

I was wondering if any progress has been made here? It'd be a super useful feature!

Any update for those stuck with Windows :D ?

Update: my bad, Cygwin works perfectly on Windows, as said above.
Good installation tutorial here:
Restart R, and you're good to go!

To avoid including the header as a data line while still getting the column names, you can write something like this:

library(data.table)
fichier <- "iris.txt"
# keep the column names
cols <- names(fread(fichier, nrows = 0L, sep = ","))

# load a random sample of the file, excluding the header
df <- fread(paste("tail -n+2", fichier, "| shuf -n 15"),
            sep = ",",
            header = FALSE,
            col.names = cols,
            colClasses = list(character = which(cols == "class"))) # define the classes of your columns

Thanks to @thoera for the help !

Regards.

UPDATE 2 :

After some tests, I figured out that the solution I proposed wasn't working in R.
The command line tail -n+2 fichier.txt | shuf -n 15 works in a cmd console, but not in R with fread:
it returns the header as a data line (at a random position, of course).

This issue can be reproduced with the iris dataset and the following code:

setwd("path")
test <- fread("tail -n+2 IRIS.csv | shuf -n 149"
              ,sep = ","
              ,header = FALSE)

You can also try sed 1d IRIS.csv | shuf -n 149: same result.

Does fread handle pipes and command lines more complicated than a single instruction?

Thanks

Vincent.

@VinceLYO and all: this seems nice, but it has a problem; it depends on external commands that are not available on all computers. It would be great to have a self-contained solution using just fread or plain R commands.

@skanskan: I do agree! I think it's a nice feature. Let's hope for this :)

Any updates on this feature? Is anyone working on this?

Any updates on this feature?

Boop

For those who are bumping this issue, be sure to upvote the first post here as well. AFAIK nobody is currently working on implementing this. If anyone would like to, we would be happy to _assign_ them to this issue.
I will clean up this thread a little.


Regarding the FR itself: I don't think it makes sense to introduce a new mechanism for filtering csv files directly. That is basically a lot of effort and maintenance, whereas grep already works pretty well. What could eventually be low-hanging fruit is to examine the filter expression, guess which columns are required to filter, read only those columns in full, perform the filter using the currently implemented algorithms with which=TRUE, and then re-read the csv applying the filter to lines based on the which results. That would be fully implemented in R (not sure about skipping lines); it might not be very efficient, but it should reduce the peak memory required.
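That idea can be roughly sketched as follows. fread_filter is a hypothetical helper, not part of data.table, and since fread cannot skip arbitrary lines, the second pass still reads the full file, so peak memory is reduced only during pass 1:

```r
library(data.table)

# hypothetical helper sketching the approach above; NOT a data.table function
fread_filter <- function(file, filter) {
  e <- substitute(filter)
  # guess which columns the filter expression refers to
  cols <- intersect(all.vars(e), names(fread(file, nrows = 0L)))
  # pass 1: read only those columns, then evaluate the filter on them
  narrow <- fread(file, select = cols)
  keep <- which(eval(e, narrow))
  # pass 2: re-read and subset; skipping lines directly is the part
  # marked "not sure" above
  fread(file)[keep]
}

# demo on a small file
f <- tempfile(fileext = ".csv")
fwrite(data.table(Value = c(0.5, 1.3, 2.0, 1.1), XXX = 1:4), f)
res <- fread_filter(f, Value >= 1.3)
```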

See here: https://stackoverflow.com/a/62240442/3576984

grep / awk don't have the benefit of fread's automatic parallelism, so they can be quite slow compared to fread.
