Data.table: List column support in data.table

Created on 9 Mar 2020 · 16 Comments · Source: Rdatatable/data.table

Recently, I have been trying to replace the tidyverse entirely and see whether data.table can do the same things more efficiently. I know that data.table can now do what nest and unnest do in tidyr. However, is there a data.table equivalent for every example in the documentation of tidyr::nest and tidyr::chop? Any hints? Thanks.

question

hope-data-science

All 16 comments

library(tidyr)
library(data.table)

df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
df %>% nest(data = c(y, z))
#> # A tibble: 3 x 2
#>       x data            
#>   <dbl> <list>          
#> 1     1 <tibble [3 × 2]>
#> 2     2 <tibble [2 × 2]>
#> 3     3 <tibble [1 × 2]>
df %>% chop(c(y, z))
#> # A tibble: 3 x 3
#>       x y         z        
#>   <dbl> <list>    <list>   
#> 1     1 <int [3]> <int [3]>
#> 2     2 <int [2]> <int [2]>
#> 3     3 <int [1]> <int [1]>


dt <- as.data.table(df)
dt[, .(data = list(.SD)), keyby = x]
#>        x              data
#>    <num>            <list>
#> 1:     1 <data.table[3x2]>
#> 2:     2 <data.table[2x2]>
#> 3:     3 <data.table[1x2]>
dt[, lapply(.SD, list), keyby = x, .SDcols = c('y', 'z')]
#>        x      y      z
#>    <num> <list> <list>
#> 1:     1  1,2,3  6,5,4
#> 2:     2    4,5    3,2
#> 3:     3      6      1

Created on 2020-03-09 by the reprex package (v0.3.0)

shrektan on 9 Mar 2020

You might also check out @TysonStanley's tidyfast package and his recent talk about list columns in data.table:

https://resources.rstudio.com/rstudio-conf-2020/list-columns-in-data-table-tyson-s-barrett

MichaelChirico on 9 Mar 2020

Thank you for the prompt feedback. I tried some of these, but ran into trouble.
By "example", I mean every case in the "Examples" section of tidyr::nest and tidyr::chop. I find that data.table treats integer and double quite differently, which makes unnesting in data.table fail. And I don't know how to write consistent code to nest or unnest multiple columns, whether the cells hold data.tables, vectors, or even lists. It is just complicated. Let me give an example:

library(tidyr)

df <- tibble(
  x = 1:3,
  y = list(
    NULL,
    tibble(a = 1, b = 2),
    tibble(a = 1:3, b = 3:1)
  )
)
df %>% unnest(y)
df %>% unnest(y, keep_empty = TRUE)

####################################

library(data.table)
dt <- data.table(
  x = 1:3,
  y = list(
    NULL,
    data.table(a = 1, b = 2),
    data.table(a = 1:3, b = 3:1)
  )
)

dt[,unlist(y),by = x]

# Error in `[.data.table`(dt, , unlist(y), by = x) : 
#   Column 1 of result for group 3 is type 'integer' but expecting type 'double'. Column types must be consistent for each group.

hope-data-science on 9 Mar 2020

For that, you'll have to wait for this:

https://github.com/Rdatatable/data.table/pull/4156

In turn I've been waiting for the work on rbindlist to get merged first to see if I can't leverage some of that code to make cases like this easier rather than writing some more custom code.

For now, you'll have to know your input types -- dt[ , as.numeric(unlist(y)), by = x]
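
To make the suggestion concrete, here is a minimal sketch using the dt from the example above. It assumes every value in the list column can be coerced to double. Note that unlist() also flattens columns a and b into a single vector, so this is a stopgap for the type error rather than an equivalent of tidyr::unnest.

```r
library(data.table)

dt <- data.table(
  x = 1:3,
  y = list(
    NULL,
    data.table(a = 1, b = 2),      # double columns
    data.table(a = 1:3, b = 3:1)   # integer columns
  )
)

# Grouped j is expected to return the same column type for every group.
# At the time of this thread, mixing double (group 2) and integer (group 3)
# raised the error quoted above; tryCatch() captures it if it still occurs.
err <- tryCatch(dt[, unlist(y), by = x], error = function(e) conditionMessage(e))

# Coercing each group's result to one known type avoids the mismatch.
# The NULL cell contributes no rows.
res <- dt[, as.numeric(unlist(y)), by = x]
res
```

The result has one unnamed value column (V1) holding the values of a followed by the values of b for each x.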

MichaelChirico on 9 Mar 2020

You might also check out @TysonStanley's tidyfast package and his recent talk about list columns in data.table:

https://resources.rstudio.com/rstudio-conf-2020/list-columns-in-data-table-tyson-s-barrett

Thanks, I know his work; there is an article too. See https://osf.io/f6pxw/download.
Actually, I'm trying to write some of these functions myself, but there seem to be many problems ahead, which makes it very hard to reproduce what tidyr does. If data.table cannot beat tidyr in speed for similar tasks, I might have to use tidyr, which seems to work on data.table objects too. But I would prefer to keep the whole workflow in data.table if possible.

hope-data-science on 9 Mar 2020

A workaround for this:

library(data.table)
dt <- data.table(
  x = 1:3,
  y = list(
    NULL,
    data.table(a = 1, b = 2),
    data.table(a = 1:3, b = 3:1)
  )
)

rbindlist(dt$y, idcol = "x")
#>        x     a     b
#>    <int> <num> <num>
#> 1:     2     1     2
#> 2:     3     1     3
#> 3:     3     2     2
#> 4:     3     3     1
rbindlist(dt$y, idcol = "x")[J(dt$x), on = "x"]
#>        x     a     b
#>    <int> <num> <num>
#> 1:     1    NA    NA
#> 2:     2     1     2
#> 3:     3     1     3
#> 4:     3     2     2
#> 5:     3     3     1

Created on 2020-03-09 by the reprex package (v0.3.0)

shrektan on 9 Mar 2020

Well, this is not quite correct, because it only works when x happens to be 1:N ...
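
The reason is that idcol records each element's position in dt$y, not the value of x. A minimal sketch of the problem, with one possible way to look the positions up in dt$x (the column name x_value is only illustrative):

```r
library(data.table)

dt <- data.table(
  x = 11:12,
  y = list(NULL, data.table(a = 1, b = 2))
)

# idcol numbers the list elements 1, 2, ...; the NULL element is skipped,
# so the only row is labelled 2 (its position), not 12 (its x value).
flat <- rbindlist(dt$y, idcol = "x")

# Translate positions back into the real x values by indexing dt$x.
flat[, x_value := dt$x[x]]
flat
```

A join on J(dt$x) would find no match here, because no position equals 11 or 12.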

shrektan on 9 Mar 2020

A verbose but working solution would be:

library(data.table)
dt <- data.table(
  x = 11:13,
  y = list(
    NULL,
    data.table(a = 1, b = 2),
    data.table(a = 1:3, b = 3:1)
  )
)
dt[, ID := seq_len(.N)]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
#>        a     b     x
#>    <num> <num> <int>
#> 1:     1     2    12
#> 2:     1     3    13
#> 3:     2     2    13
#> 4:     3     1    13
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]
#>        a     b     x
#>    <num> <num> <int>
#> 1:    NA    NA    11
#> 2:     1     2    12
#> 3:     1     3    13
#> 4:     2     2    13
#> 5:     3     1    13

Created on 2020-03-09 by the reprex package (v0.3.0)

shrektan on 9 Mar 2020

Thanks for pointing that out; I changed x and saw the problem. Do you think we should solve it top-down or bottom-up? I've tried many times, but have only come up with a very ugly, though working, example:

library(data.table)
library(magrittr)
dt <- data.table(
  x = 1:3,
  y = list(
    NULL,
    data.table(a = 1, b = 2),
    data.table(a = 1:3, b = 3:1)
  )
)

dt %>% 
  split(dt$x) %>% 
  lapply(function(dt) merge(dt$x,dt$y)) %>% 
  rbindlist(fill = TRUE) %>% 
  .[, which(unlist(lapply(., function(x) !all(is.na(x))))), 
    with = FALSE]
#>    x a b
#> 1: 2 1 2
#> 2: 3 1 3
#> 3: 3 2 2
#> 4: 3 3 1

hope-data-science on 9 Mar 2020

I don't know what you are trying to do. Does my updated example work for you or not (please check)?

shrektan on 9 Mar 2020

Still, there's no uniform way to handle it. Some examples:

library(data.table)
dt <- data.table(
  x = c(2,3,1),
  y = list(
    1:3,
    4:5,
    7:9
  )
)

# not working
dt[, ID := seq_len(.N)]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]

# still works
tidyr::unnest(dt,y)

I am thinking about what could be in each cell of the table: a data.table, a vector, or a list. How could we handle all of them in a consistent way (which tidyr already seems to do)?

hope-data-science on 9 Mar 2020

Just apply as.data.table() to each element first.

library(data.table)
dt <- data.table(
  x = c(2,3,1),
  y = list(
    1:3,
    4:5,
    7:9
  )
)

# tidyr
tidyr::unnest(dt,y)
#> # A tibble: 8 x 2
#>       x     y
#>   <dbl> <int>
#> 1     2     1
#> 2     2     2
#> 3     2     3
#> 4     3     4
#> 5     3     5
#> 6     1     7
#> 7     1     8
#> 8     1     9

# data.table
dt[, ID := seq_len(.N)]
rbindlist(lapply(dt$y, as.data.table), idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
#>    V1 x
#> 1:  1 2
#> 2:  2 2
#> 3:  3 2
#> 4:  4 3
#> 5:  5 3
#> 6:  7 1
#> 7:  8 1
#> 8:  9 1
rbindlist(lapply(dt$y, as.data.table), idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]
#>    V1 x
#> 1:  1 2
#> 2:  2 2
#> 3:  3 2
#> 4:  4 3
#> 5:  5 3
#> 6:  7 1
#> 7:  8 1
#> 8:  9 1

Created on 2020-03-09 by the reprex package (v0.3.0)

shrektan on 9 Mar 2020

Nice: turn everything into a data.table and then use rbindlist. I'll try it with more examples, this is great!
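
The approach from this thread can be wrapped into one function. What follows is only a sketch: unnest_dt and the temporary column .row_id are hypothetical names, not part of data.table, and the helper assumes each cell is NULL, a data.frame-like object, or an atomic vector (plain lists inside cells and a column made up entirely of NULLs are not handled).

```r
library(data.table)

# Hypothetical helper: unnest one list column `col` of `dt`.
unnest_dt <- function(dt, col, keep_empty = FALSE) {
  # Convert every cell to a data.table. NULL cells stay NULL so that
  # rbindlist() skips them while idcol still counts their position.
  cells <- lapply(dt[[col]], function(cell) {
    if (is.null(cell)) return(NULL)
    if (is.data.frame(cell)) return(as.data.table(cell))
    setnames(data.table(cell), col)  # atomic vector: one column named `col`
  })
  flat <- rbindlist(cells, idcol = ".row_id", fill = TRUE)

  # The remaining columns, tagged with the same positional id.
  keep <- setdiff(names(dt), col)
  ids <- dt[, keep, with = FALSE][, .row_id := .I]

  # Join the flattened cells back to their rows; nomatch = NA keeps rows
  # whose cell was NULL (similar in spirit to keep_empty = TRUE in tidyr).
  out <- flat[ids, on = ".row_id", nomatch = if (keep_empty) NA else 0L]
  out[, .row_id := NULL]
  setcolorder(out, keep)
  out[]
}

# Cells holding data.tables (with a NULL cell)
dt1 <- data.table(
  x = 11:13,
  y = list(NULL, data.table(a = 1, b = 2), data.table(a = 1:3, b = 3:1))
)
res1 <- unnest_dt(dt1, "y")
res1_keep <- unnest_dt(dt1, "y", keep_empty = TRUE)

# Cells holding plain vectors
dt2 <- data.table(x = c(2, 3, 1), y = list(1:3, 4:5, 7:9))
res2 <- unnest_dt(dt2, "y")
```

Because the join uses a positional id rather than x itself, the helper does not depend on x being 1:N or even unique.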

hope-data-science on 9 Mar 2020

Any hints for doing tidyr::unchop in data.table?

hope-data-science on 9 Mar 2020

I need an example. I don't know the differences between tidyr::unchop() and tidyr::unnest().

shrektan on 9 Mar 2020

Example:

library(tidyr)
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
df %>% chop(c(y, z))
#> # A tibble: 3 x 3
#>       x y         z        
#>   <dbl> <list>    <list>   
#> 1     1 <int [3]> <int [3]>
#> 2     2 <int [2]> <int [2]>
#> 3     3 <int [1]> <int [1]>
df %>% unchop(c(y,z))
#> # A tibble: 6 x 3
#>       x     y     z
#>   <dbl> <int> <int>
#> 1     1     1     6
#> 2     1     2     5
#> 3     1     3     4
#> 4     2     4     3
#> 5     2     5     2
#> 6     3     6     1
df %>% unnest(c(y,z))
#> # A tibble: 6 x 3
#>       x     y     z
#>   <dbl> <int> <int>
#> 1     1     1     6
#> 2     1     2     5
#> 3     1     3     4
#> 4     2     4     3
#> 5     2     5     2
#> 6     3     6     1

While there is no difference in this example, I don't know how to reverse the chop in data.table.
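
One possible way to go back, offered as a sketch rather than an official unchop equivalent, is to unlist every list column within each group. It assumes the list columns in a row all have the same length and that each column has the same element type in every group; the names chopped and unchopped are only illustrative.

```r
library(data.table)

dt <- data.table(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)

# chop: one row per x, with y and z collected into list columns
chopped <- dt[, lapply(.SD, list), keyby = x, .SDcols = c("y", "z")]

# unchop: unlist each list column group by group
unchopped <- chopped[, lapply(.SD, unlist), by = x, .SDcols = c("y", "z")]
unchopped
```

If the element types differ between groups (integer in one, double in another), the grouped call can hit the same type-consistency error discussed earlier in the thread.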

hope-data-science on 9 Mar 2020
