Tidyr: Feature Request: Make tidyr functions play nicely with data.table

Created on 4 Mar 2016  路  3Comments  路  Source: tidyverse/tidyr

The tidyr functions don't seem to work with data.table functionality. For example, I would like to fill missing values of a large data.table in the following way:

#install.packages("data.table")
#install.packages("tidyr")
library(data.table)
library(tidyr)
dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)

dt[, year := fill(year)]
dt[, c("year", "trt") := .(fill(year), fill(trt))]

but the code above throws the following error:

# > dt[, year := fill(year)]
# Error in UseMethod("fill_") : 
#  no applicable method for 'fill_' applied to an object of class "c('double', 'numeric')"
# > dt[, c("year", "trt") := .(fill(year), fill(trt))]
# Error in UseMethod("fill_") : 
#  no applicable method for 'fill_' applied to an object of class "c('double', 'numeric')"

Since sequentially applying fill() to each column does not modify in place, the method below is costly for large data.tables.

rm(dt)
dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
dt <- fill(dt, year)
dt <- fill(dt, trt)

# > rm(dt)
# > dt <- data.table(
# +     year = c(2015, NA, NA, NA), 
# +     trt = c("A", NA, "B", NA)
# + )
# > tracemem(dt)
# [1] "<00000000088F6118>"
# > dt <- fill(dt, year)
# tracemem[0x00000000088f6118 -> 0x00000000092d8c50]: fill_.data.frame fill_ fill 
# tracemem[0x00000000092d8c50 -> 0x00000000092d8b00]: [[<-.data.frame [[<- fill_.data.frame  fill_ fill 
# tracemem[0x00000000092d8b00 -> 0x00000000092d8ac8]: [[<-.data.frame [[<- fill_.data.frame  fill_ fill 
# > dt <- fill(dt, trt)
# tracemem[0x00000000092d8ac8 -> 0x000000000951f1c0]: fill_.data.frame fill_ fill 
# tracemem[0x000000000951f1c0 -> 0x000000000951f0a8]: [[<-.data.frame [[<- fill_.data.frame fill_  fill 
# tracemem[0x000000000951f0a8 -> 0x000000000951f070]: [[<-.data.frame [[<- fill_.data.frame fill_   fill 
# > 

Thanks for taking a look!

Most helpful comment

@mindymallory, What you describe is not an issue with tidyr function or new feature to implement. It is rather a matter of how you use tidyr functions with data.table.

There are few error in the tests you've done with dt[, year := fill(year)]

  • First, fill takes 2 arguments with first a data (data.frame, data.table or tbl) then a bare column name. In your test, you specify the column name as first argument so fill won't work independently of data.table.
  • Then, fill does not return the column you specify but the all data you applied fill on. It is why you can't replace year with the result of the fill function in dt[, year := fill(year)]. With LHS:=RHS, RHS must be a vector of replacement values, which is not the case in year := fill(year).

So the code you have tried have these problems independently of tidyr compatibility with data.table.

Here is some ways to use fill inside data.table. It could help other users too.

library(data.table)
library(tidyr)
dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000019B796C8>"

If you want to use fill, you must specify first a data then a column. Inside a data.table, use .SD for that, and see that fill return all the columns of dt

dt[, fill(.SD, year)]
#>    year trt
#> 1: 2015   A
#> 2: 2015  NA
#> 3: 2015   B
#> 4: 2015  NA

You can then select a column with $ for example, and it returns a vector.

dt[, fill(.SD, year)$year]
#> [1] 2015 2015 2015 2015

If you give several columns to fill, it returns these columns filled.

dt[, fill(.SD, year, trt)]
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

So, how apply fill to replace columns inside a data.table.

You could extract the vector you want and use :=. And do it again for the other colum.

dt[, year := fill(.SD, year)$year]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015  NA
#> 3: 2015   B
#> 4: 2015  NA
dt[, trt := fill(.SD, trt)$trt]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

you could also apply the change at once using data.table syntax.

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000006104428>"
dt[, c("year", "trt") := .(fill(.SD, year)$year, fill(.SD, trt)$trt)]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

However, as fill return all the columns of the data argument, you can do without extracting column with $

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000005B57B98>"
dt[, c("year", "trt") := fill(.SD, year, trt)]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

Finally, you could use an approach more suited for programming.

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<00000000187CF230>"
cols <- c("year", "trt")
dt
#>    year trt
#> 1: 2015   A
#> 2:   NA  NA
#> 3:   NA   B
#> 4:   NA  NA
dt[, (cols) := fill_(.SD, cols), .SDcols = cols]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

Note that using this way, you could apply fill easily on a subset of columns

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA),
  trt2 = c(NA, "C", NA, "D")
)
tracemem(dt)
#> [1] "<00000000049EFD30>"
cols <- c("year", "trt")
dt
#>    year trt trt2
#> 1: 2015   A   NA
#> 2:   NA  NA    C
#> 3:   NA   B   NA
#> 4:   NA  NA    D
dt[, (cols) := fill_(.SD, cols), .SDcols = cols]
dt
#>    year trt trt2
#> 1: 2015   A   NA
#> 2: 2015   A    C
#> 3: 2015   B   NA
#> 4: 2015   B    D

Hope this answer your question and the feature request you want. Note that I use tracemem to show that there are no copies and that tidyr is compatible with data.table concept, as data.table syntax and concept are used correctly.

@hadley, I think you won't have any PR to review on this subject.

All 3 comments

I'd be willing to review pull requests to do this, but I don't have the time (or knowledge of data.table) to do it myself.

@mindymallory, What you describe is not an issue with tidyr function or new feature to implement. It is rather a matter of how you use tidyr functions with data.table.

There are few error in the tests you've done with dt[, year := fill(year)]

  • First, fill takes 2 arguments with first a data (data.frame, data.table or tbl) then a bare column name. In your test, you specify the column name as first argument so fill won't work independently of data.table.
  • Then, fill does not return the column you specify but the all data you applied fill on. It is why you can't replace year with the result of the fill function in dt[, year := fill(year)]. With LHS:=RHS, RHS must be a vector of replacement values, which is not the case in year := fill(year).

So the code you have tried have these problems independently of tidyr compatibility with data.table.

Here is some ways to use fill inside data.table. It could help other users too.

library(data.table)
library(tidyr)
dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000019B796C8>"

If you want to use fill, you must specify first a data then a column. Inside a data.table, use .SD for that, and see that fill return all the columns of dt

dt[, fill(.SD, year)]
#>    year trt
#> 1: 2015   A
#> 2: 2015  NA
#> 3: 2015   B
#> 4: 2015  NA

You can then select a column with $ for example, and it returns a vector.

dt[, fill(.SD, year)$year]
#> [1] 2015 2015 2015 2015

If you give several columns to fill, it returns these columns filled.

dt[, fill(.SD, year, trt)]
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

So, how apply fill to replace columns inside a data.table.

You could extract the vector you want and use :=. And do it again for the other colum.

dt[, year := fill(.SD, year)$year]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015  NA
#> 3: 2015   B
#> 4: 2015  NA
dt[, trt := fill(.SD, trt)$trt]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

you could also apply the change at once using data.table syntax.

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000006104428>"
dt[, c("year", "trt") := .(fill(.SD, year)$year, fill(.SD, trt)$trt)]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

However, as fill return all the columns of the data argument, you can do without extracting column with $

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000005B57B98>"
dt[, c("year", "trt") := fill(.SD, year, trt)]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

Finally, you could use an approach more suited for programming.

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<00000000187CF230>"
cols <- c("year", "trt")
dt
#>    year trt
#> 1: 2015   A
#> 2:   NA  NA
#> 3:   NA   B
#> 4:   NA  NA
dt[, (cols) := fill_(.SD, cols), .SDcols = cols]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

Note that using this way, you could apply fill easily on a subset of columns

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA),
  trt2 = c(NA, "C", NA, "D")
)
tracemem(dt)
#> [1] "<00000000049EFD30>"
cols <- c("year", "trt")
dt
#>    year trt trt2
#> 1: 2015   A   NA
#> 2:   NA  NA    C
#> 3:   NA   B   NA
#> 4:   NA  NA    D
dt[, (cols) := fill_(.SD, cols), .SDcols = cols]
dt
#>    year trt trt2
#> 1: 2015   A   NA
#> 2: 2015   A    C
#> 3: 2015   B   NA
#> 4: 2015   B    D

Hope this answer your question and the feature request you want. Note that I use tracemem to show that there are no copies and that tidyr is compatible with data.table concept, as data.table syntax and concept are used correctly.

@hadley, I think you won't have any PR to review on this subject.

@cderv Thanks for this detailed explanation! You've saved me a lot of memory usage!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

andrewpbray picture andrewpbray  路  8Comments

strengejacke picture strengejacke  路  8Comments

davidhunterwalsh picture davidhunterwalsh  路  4Comments

albertotb picture albertotb  路  7Comments

yusuzech picture yusuzech  路  3Comments