Recently I have been trying to replace the tidyverse everywhere and see whether data.table can do the same jobs more efficiently. I know data.table can now handle operations like tidyr's nest and unnest. Could someone show the data.table way to run every example in tidyr::nest and tidyr::chop? Any hints? Thanks.
library(tidyr)
library(data.table)
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
df %>% nest(data = c(y, z))
#> # A tibble: 3 x 2
#> x data
#> <dbl> <list>
#> 1 1 <tibble [3 × 2]>
#> 2 2 <tibble [2 × 2]>
#> 3 3 <tibble [1 × 2]>
df %>% chop(c(y, z))
#> # A tibble: 3 x 3
#> x y z
#> <dbl> <list> <list>
#> 1 1 <int [3]> <int [3]>
#> 2 2 <int [2]> <int [2]>
#> 3 3 <int [1]> <int [1]>
dt <- as.data.table(df)
dt[, .(data = list(.SD)), keyby = x]
#> x data
#> <num> <list>
#> 1: 1 <data.table[3x2]>
#> 2: 2 <data.table[2x2]>
#> 3: 3 <data.table[1x2]>
dt[, lapply(.SD, list), keyby = x, .SDcols = c('y', 'z')]
#> x y z
#> <num> <list> <list>
#> 1: 1 1,2,3 6,5,4
#> 2: 2 4,5 3,2
#> 3: 3 6 1
Created on 2020-03-09 by the reprex package (v0.3.0)
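To nest only some columns, as tidyr's nest(data = c(y, z)) does, .SDcols can restrict what ends up in .SD. This is a minimal sketch; the extra column w is made up for illustration and does not appear in the result because it is neither the grouping column nor listed in .SDcols:

```r
library(data.table)

# Same data as above, plus a hypothetical extra column `w`.
dt <- data.table(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1, w = letters[1:6])

# Nest only y and z within each value of x.
nested <- dt[, .(data = list(.SD)), keyby = x, .SDcols = c("y", "z")]

# Each cell of `data` is a data.table holding that group's y and z rows.
nested$data[[1]]
```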
You might also check out @TysonStanley 's tidyfast package and his recent talk about list columns in data.table:
https://resources.rstudio.com/rstudio-conf-2020/list-columns-in-data-table-tyson-s-barrett
Thank you for the prompt feedback. I tried some of these, but ran into trouble.
By "example" I mean every case under "Examples" in the tidyr::nest and tidyr::chop documentation. I find that data.table treats integer and double quite differently, which makes unnesting in data.table fail. I also don't know how to write consistent code to nest or unnest multiple columns, whether the cells hold data.tables, vectors, or even lists. It gets complicated. Here is an example:
library(tidyr)
df <- tibble(
x = 1:3,
y = list(
NULL,
tibble(a = 1, b = 2),
tibble(a = 1:3, b = 3:1)
)
)
df %>% unnest(y)
df %>% unnest(y, keep_empty = TRUE)
####################################
library(data.table)
dt <- data.table(
x = 1:3,
y = list(
NULL,
data.table(a = 1, b = 2),
data.table(a = 1:3, b = 3:1)
)
)
dt[,unlist(y),by = x]
# Error in `[.data.table`(dt, , unlist(y), by = x) :
# Column 1 of result for group 3 is type 'integer' but expecting type 'double'. Column types must be consistent for each group.
For that, you'll have to wait for this:
https://github.com/Rdatatable/data.table/pull/4156
In turn I've been waiting for the work on rbindlist to get merged first to see if I can't leverage some of that code to make cases like this easier rather than writing some more custom code.
For now, you'll have to know your input types -- dt[ , as.numeric(unlist(y)), by = x]
> You might also check out @TysonStanley's tidyfast package and his recent talk about list columns in data.table: https://resources.rstudio.com/rstudio-conf-2020/list-columns-in-data-table-tyson-s-barrett
Thanks, I know his work, there is an article too. Check https://osf.io/f6pxw/download.
Actually, I'm trying to write some of this myself, but there seem to be many problems ahead, which makes it very hard to do the same things as in tidyr. If data.table cannot beat tidyr in speed for similar tasks, I might have to use tidyr, which seems to work on data.table objects too. But I would prefer to keep the whole workflow in data.table if possible.
A workaround for this:
library(data.table)
dt <- data.table(
x = 1:3,
y = list(
NULL,
data.table(a = 1, b = 2),
data.table(a = 1:3, b = 3:1)
)
)
rbindlist(dt$y, idcol = "x")
#> x a b
#> <int> <num> <num>
#> 1: 2 1 2
#> 2: 3 1 3
#> 3: 3 2 2
#> 4: 3 3 1
rbindlist(dt$y, idcol = "x")[J(dt$x), on = "x"]
#> x a b
#> <int> <num> <num>
#> 1: 1 NA NA
#> 2: 2 1 2
#> 3: 3 1 3
#> 4: 3 2 2
#> 5: 3 3 1
Created on 2020-03-09 by the reprex package (v0.3.0)
Well, this is not quite correct, because it only works when x happens to be 1:N ...
A verbose but working solution would be:
library(data.table)
dt <- data.table(
x = 11:13,
y = list(
NULL,
data.table(a = 1, b = 2),
data.table(a = 1:3, b = 3:1)
)
)
dt[, ID := seq_len(.N)]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
#> a b x
#> <num> <num> <int>
#> 1: 1 2 12
#> 2: 1 3 13
#> 3: 2 2 13
#> 4: 3 1 13
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]
#> a b x
#> <num> <num> <int>
#> 1: NA NA 11
#> 2: 1 2 12
#> 3: 1 3 13
#> 4: 2 2 13
#> 5: 3 1 13
Created on 2020-03-09 by the reprex package (v0.3.0)
Thanks for pointing that out; I changed x and saw the problem. Do you think we should solve it from the top or from the bottom? I've tried many times, but have only come up with a very ugly, though working, example:
library(data.table)
library(magrittr)
dt <- data.table(
x = 1:3,
y = list(
NULL,
data.table(a = 1, b = 2),
data.table(a = 1:3, b = 3:1)
)
)
dt %>%
split(dt$x) %>%
lapply(function(dt) merge(dt$x,dt$y)) %>%
rbindlist(fill = TRUE) %>%
.[, which(unlist(lapply(., function(x) !all(is.na(x))))),
with = FALSE]
#> x a b
#> 1: 2 1 2
#> 2: 3 1 3
#> 3: 3 2 2
#> 4: 3 3 1
I don't know what you are trying to do. Does my updated example work for you or not? Please check.
Still, there's no uniform way to handle it. Some examples:
dt <- data.table(
x = c(2,3,1),
y = list(
1:3,
4:5,
7:9
)
)
# not working
dt[, ID := seq_len(.N)]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]
# still works
tidyr::unnest(dt,y)
I am thinking about what could be in each cell of the table: a data.table, a vector, or a list. How can we handle all of them in a consistent way (which tidyr already seems to do)?
Just apply as.data.table() to each element first.
library(data.table)
dt <- data.table(
x = c(2,3,1),
y = list(
1:3,
4:5,
7:9
)
)
# tidyr
tidyr::unnest(dt,y)
#> # A tibble: 8 x 2
#> x y
#> <dbl> <int>
#> 1 2 1
#> 2 2 2
#> 3 2 3
#> 4 3 4
#> 5 3 5
#> 6 1 7
#> 7 1 8
#> 8 1 9
# data.table
dt[, ID := seq_len(.N)]
rbindlist(lapply(dt$y, as.data.table), idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
#> V1 x
#> 1: 1 2
#> 2: 2 2
#> 3: 3 2
#> 4: 4 3
#> 5: 5 3
#> 6: 7 1
#> 7: 8 1
#> 8: 9 1
rbindlist(lapply(dt$y, as.data.table), idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]
#> V1 x
#> 1: 1 2
#> 2: 2 2
#> 3: 3 2
#> 4: 4 3
#> 5: 5 3
#> 6: 7 1
#> 7: 8 1
#> 8: 9 1
Created on 2020-03-09 by the reprex package (v0.3.0)
Nice, turn everything into data.table and then use rbindlist. I'll try with more examples, this is great!
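The "convert every cell to a data.table, then rbindlist and join back" idea from this thread could be wrapped in one function. The sketch below is only that: unnest_dt is a hypothetical helper, not part of data.table, and it assumes each cell is NULL, an atomic vector, or a data.frame/data.table, and that no existing column is called ".id". Unlike the snippets above it leaves the input table unmodified and names a vector column after the list column instead of V1.

```r
library(data.table)

# Hypothetical helper (not a data.table function): unnest one list column.
unnest_dt <- function(dt, col, keep_empty = FALSE) {
  # Turn every non-NULL cell into a data.table.
  cells <- lapply(dt[[col]], function(cell) {
    if (is.null(cell)) return(NULL)
    if (is.data.frame(cell)) return(as.data.table(cell))
    # Atomic vector: one column, named after the list column.
    setnames(data.table(cell), col)
  })
  # Stack the cells; .id records the row each cell came from.
  # NULL cells are skipped but still count towards the position.
  stacked <- rbindlist(cells, idcol = ".id", fill = TRUE)
  # The remaining (outer) columns, tagged with the same row position.
  keys <- dt[, setdiff(names(dt), col), with = FALSE]
  keys[, .id := seq_len(.N)]
  # Join back; nomatch = NA keeps rows whose cell was empty.
  res <- stacked[keys, on = ".id", nomatch = if (keep_empty) NA else 0L]
  res[, .id := NULL]
  setcolorder(res, setdiff(names(keys), ".id"))
  res[]
}

# Cells holding data.tables (or NULL)
dt1 <- data.table(
  x = 11:13,
  y = list(NULL, data.table(a = 1, b = 2), data.table(a = 1:3, b = 3:1))
)
unnest_dt(dt1, "y")
unnest_dt(dt1, "y", keep_empty = TRUE)

# Cells holding plain vectors
dt2 <- data.table(x = c(2, 3, 1), y = list(1:3, 4:5, 7:9))
unnest_dt(dt2, "y")
```

The integer/double mismatch is absorbed by rbindlist, which coerces columns to a common type, so no by-group type check is involved.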
Any hints for doing tidyr::unchop in data.table?
I need an example. I don't know the differences between tidyr::unchop() and tidyr::unnest().
Example:
library(tidyr)
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
df %>% chop(c(y, z))
#> # A tibble: 3 x 3
#> x y z
#> <dbl> <list> <list>
#> 1 1 <int [3]> <int [3]>
#> 2 2 <int [2]> <int [2]>
#> 3 3 <int [1]> <int [1]>
df %>% unchop(c(y,z))
#> # A tibble: 6 x 3
#> x y z
#> <dbl> <int> <int>
#> 1 1 1 6
#> 2 1 2 5
#> 3 1 3 4
#> 4 2 4 3
#> 5 2 5 2
#> 6 3 6 1
df %>% unnest(c(y,z))
#> # A tibble: 6 x 3
#> x y z
#> <dbl> <int> <int>
#> 1 1 1 6
#> 2 1 2 5
#> 3 1 3 4
#> 4 2 4 3
#> 5 2 5 2
#> 6 3 6 1
While there is no difference in this example, I don't know how to reverse the chop in data.table.
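One possible way to reverse the chop in data.table is to repeat the outer column once per element and flatten each list column with unlist. This is a sketch under assumptions, not an equivalent of tidyr::unchop: it assumes the list cells in a given row all have the same length and hold atomic vectors of a single type, and it drops rows whose cells are NULL or empty.

```r
library(data.table)

dt <- data.table(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)

# chop: one row per x, with y and z collapsed into list columns
chopped <- dt[, lapply(.SD, list), keyby = x, .SDcols = c("y", "z")]

# unchop: repeat x once per element of y, then flatten y and z
cols <- c("y", "z")
unchopped <- chopped[, c(
  list(x = rep(x, lengths(y))),
  lapply(.SD, unlist)
), .SDcols = cols]

unchopped
```

For this data the result has the same six rows as the original dt. Because unlist discards attributes, columns such as factors or dates would need extra handling.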