The tidyr functions don't seem to work with data.table functionality. For example, I would like to fill missing values of a large data.table in the following way:
#install.packages("data.table")
#install.packages("tidyr")
library(data.table)
library(tidyr)
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
dt[, year := fill(year)]
dt[, c("year", "trt") := .(fill(year), fill(trt))]
but the code above throws the following error:
# > dt[, year := fill(year)]
# Error in UseMethod("fill_") :
# no applicable method for 'fill_' applied to an object of class "c('double', 'numeric')"
# > dt[, c("year", "trt") := .(fill(year), fill(trt))]
# Error in UseMethod("fill_") :
# no applicable method for 'fill_' applied to an object of class "c('double', 'numeric')"
Since sequentially applying fill() to each column does not modify in place, the method below is costly for large data.tables.
rm(dt)
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
tracemem(dt)
dt <- fill(dt, year)
dt <- fill(dt, trt)
# > rm(dt)
# > dt <- data.table(
# + year = c(2015, NA, NA, NA),
# + trt = c("A", NA, "B", NA)
# + )
# > tracemem(dt)
# [1] "<00000000088F6118>"
# > dt <- fill(dt, year)
# tracemem[0x00000000088f6118 -> 0x00000000092d8c50]: fill_.data.frame fill_ fill
# tracemem[0x00000000092d8c50 -> 0x00000000092d8b00]: [[<-.data.frame [[<- fill_.data.frame fill_ fill
# tracemem[0x00000000092d8b00 -> 0x00000000092d8ac8]: [[<-.data.frame [[<- fill_.data.frame fill_ fill
# > dt <- fill(dt, trt)
# tracemem[0x00000000092d8ac8 -> 0x000000000951f1c0]: fill_.data.frame fill_ fill
# tracemem[0x000000000951f1c0 -> 0x000000000951f0a8]: [[<-.data.frame [[<- fill_.data.frame fill_ fill
# tracemem[0x000000000951f0a8 -> 0x000000000951f070]: [[<-.data.frame [[<- fill_.data.frame fill_ fill
# >
Thanks for taking a look!
I'd be willing to review pull requests to do this, but I don't have the time (or knowledge of data.table) to do it myself.
@mindymallory, What you describe is not an issue with tidyr function or new feature to implement. It is rather a matter of how you use tidyr functions with data.table.
There are few error in the tests you've done with dt[, year := fill(year)]
fill takes 2 arguments with first a data (data.frame, data.table or tbl) then a bare column name. In your test, you specify the column name as first argument so fill won't work independently of data.table.fill does not return the column you specify but the all data you applied fill on. It is why you can't replace year with the result of the fill function in dt[, year := fill(year)]. With LHS:=RHS, RHS must be a vector of replacement values, which is not the case in year := fill(year).So the code you have tried have these problems independently of tidyr compatibility with data.table.
Here is some ways to use fill inside data.table. It could help other users too.
library(data.table)
library(tidyr)
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000019B796C8>"
If you want to use fill, you must specify first a data then a column. Inside a data.table, use .SD for that, and see that fill return all the columns of dt
dt[, fill(.SD, year)]
#> year trt
#> 1: 2015 A
#> 2: 2015 NA
#> 3: 2015 B
#> 4: 2015 NA
You can then select a column with $ for example, and it returns a vector.
dt[, fill(.SD, year)$year]
#> [1] 2015 2015 2015 2015
If you give several columns to fill, it returns these columns filled.
dt[, fill(.SD, year, trt)]
#> year trt
#> 1: 2015 A
#> 2: 2015 A
#> 3: 2015 B
#> 4: 2015 B
So, how apply fill to replace columns inside a data.table.
You could extract the vector you want and use :=. And do it again for the other colum.
dt[, year := fill(.SD, year)$year]
dt
#> year trt
#> 1: 2015 A
#> 2: 2015 NA
#> 3: 2015 B
#> 4: 2015 NA
dt[, trt := fill(.SD, trt)$trt]
dt
#> year trt
#> 1: 2015 A
#> 2: 2015 A
#> 3: 2015 B
#> 4: 2015 B
you could also apply the change at once using data.table syntax.
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000006104428>"
dt[, c("year", "trt") := .(fill(.SD, year)$year, fill(.SD, trt)$trt)]
dt
#> year trt
#> 1: 2015 A
#> 2: 2015 A
#> 3: 2015 B
#> 4: 2015 B
However, as fill return all the columns of the data argument, you can do without extracting column with $
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000005B57B98>"
dt[, c("year", "trt") := fill(.SD, year, trt)]
dt
#> year trt
#> 1: 2015 A
#> 2: 2015 A
#> 3: 2015 B
#> 4: 2015 B
Finally, you could use an approach more suited for programming.
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<00000000187CF230>"
cols <- c("year", "trt")
dt
#> year trt
#> 1: 2015 A
#> 2: NA NA
#> 3: NA B
#> 4: NA NA
dt[, (cols) := fill_(.SD, cols), .SDcols = cols]
dt
#> year trt
#> 1: 2015 A
#> 2: 2015 A
#> 3: 2015 B
#> 4: 2015 B
Note that using this way, you could apply fill easily on a subset of columns
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA),
trt2 = c(NA, "C", NA, "D")
)
tracemem(dt)
#> [1] "<00000000049EFD30>"
cols <- c("year", "trt")
dt
#> year trt trt2
#> 1: 2015 A NA
#> 2: NA NA C
#> 3: NA B NA
#> 4: NA NA D
dt[, (cols) := fill_(.SD, cols), .SDcols = cols]
dt
#> year trt trt2
#> 1: 2015 A NA
#> 2: 2015 A C
#> 3: 2015 B NA
#> 4: 2015 B D
Hope this answer your question and the feature request you want. Note that I use tracemem to show that there are no copies and that tidyr is compatible with data.table concept, as data.table syntax and concept are used correctly.
@hadley, I think you won't have any PR to review on this subject.
@cderv Thanks for this detailed explanation! You've saved me a lot of memory usage!
Most helpful comment
@mindymallory, What you describe is not an issue with
tidyrfunction or new feature to implement. It is rather a matter of how you usetidyrfunctions withdata.table.There are few error in the tests you've done with
dt[, year := fill(year)]filltakes 2 arguments with first a data (data.frame, data.table or tbl) then a bare column name. In your test, you specify the column name as first argument sofillwon't work independently ofdata.table.filldoes not return the column you specify but the all data you appliedfillon. It is why you can't replaceyearwith the result of thefillfunction indt[, year := fill(year)]. WithLHS:=RHS,RHSmust be a vector of replacement values, which is not the case inyear := fill(year).So the code you have tried have these problems independently of
tidyrcompatibility withdata.table.Here is some ways to use
fillinsidedata.table. It could help other users too.If you want to use
fill, you must specify first a data then a column. Inside adata.table, use.SDfor that, and see thatfillreturn all the columns ofdtYou can then select a column with
$for example, and it returns a vector.If you give several columns to fill, it returns these columns filled.
So, how apply
fillto replace columns inside adata.table.You could extract the vector you want and use
:=. And do it again for the other colum.you could also apply the change at once using
data.tablesyntax.However, as
fillreturn all the columns of thedataargument, you can do without extracting column with$Finally, you could use an approach more suited for programming.
Note that using this way, you could apply
filleasily on a subset of columnsHope this answer your question and the feature request you want. Note that I use
tracememto show that there are no copies and thattidyris compatible withdata.tableconcept, asdata.tablesyntax and concept are used correctly.@hadley, I think you won't have any PR to review on this subject.