The tidyr functions don't seem to work with data.table functionality. For example, I would like to fill missing values of a large data.table in the following way:
#install.packages("data.table")
#install.packages("tidyr")
library(data.table)
library(tidyr)
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
dt[, year := fill(year)]
dt[, c("year", "trt") := .(fill(year), fill(trt))]
but the code above throws the following error:
# > dt[, year := fill(year)]
# Error in UseMethod("fill_") :
# no applicable method for 'fill_' applied to an object of class "c('double', 'numeric')"
# > dt[, c("year", "trt") := .(fill(year), fill(trt))]
# Error in UseMethod("fill_") :
# no applicable method for 'fill_' applied to an object of class "c('double', 'numeric')"
Since sequentially applying fill()
to each column does not modify in place, the method below is costly for large data.tables.
rm(dt)
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
tracemem(dt)
dt <- fill(dt, year)
dt <- fill(dt, trt)
# > rm(dt)
# > dt <- data.table(
# + year = c(2015, NA, NA, NA),
# + trt = c("A", NA, "B", NA)
# + )
# > tracemem(dt)
# [1] "<00000000088F6118>"
# > dt <- fill(dt, year)
# tracemem[0x00000000088f6118 -> 0x00000000092d8c50]: fill_.data.frame fill_ fill
# tracemem[0x00000000092d8c50 -> 0x00000000092d8b00]: [[<-.data.frame [[<- fill_.data.frame fill_ fill
# tracemem[0x00000000092d8b00 -> 0x00000000092d8ac8]: [[<-.data.frame [[<- fill_.data.frame fill_ fill
# > dt <- fill(dt, trt)
# tracemem[0x00000000092d8ac8 -> 0x000000000951f1c0]: fill_.data.frame fill_ fill
# tracemem[0x000000000951f1c0 -> 0x000000000951f0a8]: [[<-.data.frame [[<- fill_.data.frame fill_ fill
# tracemem[0x000000000951f0a8 -> 0x000000000951f070]: [[<-.data.frame [[<- fill_.data.frame fill_ fill
# >
Thanks for taking a look!
I'd be willing to review pull requests to do this, but I don't have the time (or knowledge of data.table) to do it myself.
@mindymallory, What you describe is not an issue with tidyr
function or new feature to implement. It is rather a matter of how you use tidyr
functions with data.table
.
There are few error in the tests you've done with dt[, year := fill(year)]
fill
takes 2 arguments with first a data (data.frame, data.table or tbl) then a bare column name. In your test, you specify the column name as first argument so fill
won't work independently of data.table
.fill
does not return the column you specify but the all data you applied fill
on. It is why you can't replace year
with the result of the fill
function in dt[, year := fill(year)]
. With LHS:=RHS
, RHS
must be a vector of replacement values, which is not the case in year := fill(year)
.So the code you have tried have these problems independently of tidyr
compatibility with data.table
.
Here is some ways to use fill
inside data.table
. It could help other users too.
library(data.table)
library(tidyr)
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000019B796C8>"
If you want to use fill
, you must specify first a data then a column. Inside a data.table
, use .SD
for that, and see that fill
return all the columns of dt
dt[, fill(.SD, year)]
#> year trt
#> 1: 2015 A
#> 2: 2015 NA
#> 3: 2015 B
#> 4: 2015 NA
You can then select a column with $
for example, and it returns a vector.
dt[, fill(.SD, year)$year]
#> [1] 2015 2015 2015 2015
If you give several columns to fill, it returns these columns filled.
dt[, fill(.SD, year, trt)]
#> year trt
#> 1: 2015 A
#> 2: 2015 A
#> 3: 2015 B
#> 4: 2015 B
So, how apply fill
to replace columns inside a data.table
.
You could extract the vector you want and use :=
. And do it again for the other colum.
dt[, year := fill(.SD, year)$year]
dt
#> year trt
#> 1: 2015 A
#> 2: 2015 NA
#> 3: 2015 B
#> 4: 2015 NA
dt[, trt := fill(.SD, trt)$trt]
dt
#> year trt
#> 1: 2015 A
#> 2: 2015 A
#> 3: 2015 B
#> 4: 2015 B
you could also apply the change at once using data.table
syntax.
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000006104428>"
dt[, c("year", "trt") := .(fill(.SD, year)$year, fill(.SD, trt)$trt)]
dt
#> year trt
#> 1: 2015 A
#> 2: 2015 A
#> 3: 2015 B
#> 4: 2015 B
However, as fill
return all the columns of the data
argument, you can do without extracting column with $
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000005B57B98>"
dt[, c("year", "trt") := fill(.SD, year, trt)]
dt
#> year trt
#> 1: 2015 A
#> 2: 2015 A
#> 3: 2015 B
#> 4: 2015 B
Finally, you could use an approach more suited for programming.
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<00000000187CF230>"
cols <- c("year", "trt")
dt
#> year trt
#> 1: 2015 A
#> 2: NA NA
#> 3: NA B
#> 4: NA NA
dt[, (cols) := fill_(.SD, cols), .SDcols = cols]
dt
#> year trt
#> 1: 2015 A
#> 2: 2015 A
#> 3: 2015 B
#> 4: 2015 B
Note that using this way, you could apply fill
easily on a subset of columns
dt <- data.table(
year = c(2015, NA, NA, NA),
trt = c("A", NA, "B", NA),
trt2 = c(NA, "C", NA, "D")
)
tracemem(dt)
#> [1] "<00000000049EFD30>"
cols <- c("year", "trt")
dt
#> year trt trt2
#> 1: 2015 A NA
#> 2: NA NA C
#> 3: NA B NA
#> 4: NA NA D
dt[, (cols) := fill_(.SD, cols), .SDcols = cols]
dt
#> year trt trt2
#> 1: 2015 A NA
#> 2: 2015 A C
#> 3: 2015 B NA
#> 4: 2015 B D
Hope this answer your question and the feature request you want. Note that I use tracemem
to show that there are no copies and that tidyr
is compatible with data.table
concept, as data.table
syntax and concept are used correctly.
@hadley, I think you won't have any PR to review on this subject.
@cderv Thanks for this detailed explanation! You've saved me a lot of memory usage!
Most helpful comment
@mindymallory, What you describe is not an issue with
tidyr
function or new feature to implement. It is rather a matter of how you usetidyr
functions withdata.table
.There are few error in the tests you've done with
dt[, year := fill(year)]
fill
takes 2 arguments with first a data (data.frame, data.table or tbl) then a bare column name. In your test, you specify the column name as first argument sofill
won't work independently ofdata.table
.fill
does not return the column you specify but the all data you appliedfill
on. It is why you can't replaceyear
with the result of thefill
function indt[, year := fill(year)]
. WithLHS:=RHS
,RHS
must be a vector of replacement values, which is not the case inyear := fill(year)
.So the code you have tried have these problems independently of
tidyr
compatibility withdata.table
.Here is some ways to use
fill
insidedata.table
. It could help other users too.If you want to use
fill
, you must specify first a data then a column. Inside adata.table
, use.SD
for that, and see thatfill
return all the columns ofdt
You can then select a column with
$
for example, and it returns a vector.If you give several columns to fill, it returns these columns filled.
So, how apply
fill
to replace columns inside adata.table
.You could extract the vector you want and use
:=
. And do it again for the other colum.you could also apply the change at once using
data.table
syntax.However, as
fill
return all the columns of thedata
argument, you can do without extracting column with$
Finally, you could use an approach more suited for programming.
Note that using this way, you could apply
fill
easily on a subset of columnsHope this answer your question and the feature request you want. Note that I use
tracemem
to show that there are no copies and thattidyr
is compatible withdata.table
concept, asdata.table
syntax and concept are used correctly.@hadley, I think you won't have any PR to review on this subject.