Data.table: proper nest/unnest functions

Created on 30 Jun 2019 · 16Comments · Source: Rdatatable/data.table

Hello,

I am trying to get the equivalent of nest() / unnest() in data.table and I wonder if a new function would make more sense. Consider this little example

tibble(group = c(1,1,1,2,2,2),
      val = c('hello', 'world','hello', 'world','hello', 'world'),
      col = c(1,2,3,4,5,6)) %>% 
  group_by(group) %>% nest() %>% 
  mutate(newval = group + 1)

# A tibble: 2 x 3
  group data             newval
  <dbl> <list>            <dbl>
1     1 <tibble [3 x 2]>      2
2     2 <tibble [3 x 2]>      3

I am not sure the same can be done with data.table. Look at the mysterious output I get after this. Am I missing something?

mydf[, list(listcol=list(data.table(val, col))), by=group][
  , newval := group + 1]
   group    val newval
1:     1 <list>      3
2:     1 <list>      3
3:     1 <list>      3
4:     2 <list>      3
5:     2 <list>      3
6:     2 <list>      3

feature request

Source

randomgambit

Most helpful comment

This is what I thought, we're so used at considering data.table and tidyverse as competing paradigms that we don't see how much they can work hand in hand, and would probably even more if more effort was invested in that direction :

.nest <- function(...) list(data.table(...))

mydt <- as.data.table(mydf)

mydt <- mydt[, .(data = .nest(.SD)),  by = group]

# the following works with lapply but I wanted to show that map works just fine,
# including the formula notation and the potential quasiquotation stuff.

mydt[, myreg := map(data, myreg)][
  , myoutput := map(myreg, mycoef)]

mydt
# group         data myreg myoutput
# 1:     1 <data.table>  <lm> <tbl_df>
# 2:     2 <data.table>  <lm> <tbl_df>

I agree that it prints less nice. And I find the need to switch between assignment by copy and reference annoying, but to be fair it's quite nice and readable.

I tried to use tibble:::print.tbl() on mydt and it failed, but I think it's something the tidyverse team could sort out, it might even be a tidyverse bug in the sense that data.tables are data.frames and it's supposed to support data.frames.

Now to unnest a specific column :

# this removes other list columns but it does the same on your mydf tibble
mydt[,unnest(.SD, myoutput)]
# group        term     estimate std.error     statistic   p.value
# 1:     1 (Intercept)  2.00000e+00  1.000000  2.000000e+00 0.2951672
# 2:     1 I(val)world -5.43896e-16  1.732051 -3.140185e-16 1.0000000
# 3:     2 (Intercept)  5.00000e+00  1.414214  3.535534e+00 0.1754797
# 4:     2 I(val)world  8.15844e-16  1.732051  4.710277e-16 1.0000000

# to keep all we can use a join
mydt[,unnest(.SD), .SDcols = c("group","myoutput")][mydt, on="group"]
#    group        term     estimate std.error     statistic   p.value         data myreg myoutput
# 1:     1 (Intercept)  2.00000e+00  1.000000  2.000000e+00 0.2951672 <data.table>  <lm> <tbl_df>
# 2:     1 I(val)world -5.43896e-16  1.732051 -3.140185e-16 1.0000000 <data.table>  <lm> <tbl_df>
# 3:     2 (Intercept)  5.00000e+00  1.414214  3.535534e+00 0.1754797 <data.table>  <lm> <tbl_df>
# 4:     2 I(val)world  8.15844e-16  1.732051  4.710277e-16 1.0000000 <data.table>  <lm> <tbl_df>

moodymudskipper on 5 Jul 2019

👍2

All 16 comments

Re the mysterious output, yeah that does look odd (having newval := group + 1 and yet that output, and also listcol not being a column name...). Could you provide a reproducible example, preferably not using packages that are not required? Some more guidance is here: https://github.com/Rdatatable/data.table/wiki/Support

As far as I know, nest and unnest are feasible in data.table already, though there is a proposal to add a fast unnest: https://github.com/Rdatatable/data.table/issues/2146

(Wrapping in list as you have done is nesting. For unnest, there are several options depending on what specifically you want to do. Rep and unlist work for many cases: https://stackoverflow.com/a/38796822)

franknarf1 on 30 Jun 2019

👍1

sure you can simply use

mydf <- data.table(group = c(1,1,1,2,2,2),
           val = c('hello', 'world','hello', 'world','hello', 'world'),
           col = c(1,2,3,4,5,6))

and this seems to work

> mydf[, list(listcol=list(data.table(val, col))), by=group]
   group      listcol
1:     1 <data.table>
2:     2 <data.table>

but the rest seems to break down the list-column format. Ideally, I would like to use purrr::map() on the nest data.frame. Yes, using the best for both worlds :)

randomgambit on 30 Jun 2019

wow but the nest() ticket has been open for more than 2 years!

randomgambit on 30 Jun 2019

@randomgambit your problem is that you used assignment by reference on top of an aggregating call, it seems like data.table doesn't like it too much, this works :

library(data.table)
mydf <- data.table(group = c(1,1,1,2,2,2),
                   val = c('hello', 'world','hello', 'world','hello', 'world'),
                   col = c(1,2,3,4,5,6)) 

mydf <- mydf[, .(listcol=list(data.table(val, col))), by=group]
mydf [, newval := group + 1]
mydf
#>    group      listcol newval
#> 1:     1 <data.table>      2
#> 2:     2 <data.table>      3

This also works, with the strange effect that mydf needs to called twice to be printed (bug ?) :

mydf <- data.table(group = c(1,1,1,2,2,2),
                   val = c('hello', 'world','hello', 'world','hello', 'world'),
                   col = c(1,2,3,4,5,6)) 

mydf <- mydf[, .(listcol=list(data.table(val, col))), by=group][, newval := group + 1]
mydf
mydf
#>    group      listcol newval
#> 1:     1 <data.table>      2
#> 2:     2 <data.table>      3

We could create a function .nest as a shortcut :

.nest <- function(...) list(data.table(...))

mydf <- mydf[, .(listcol= .nest(val, col)), by=group]
mydf [, newval := group + 1]
mydf
#>    group      listcol newval
#> 1:     1 <data.table>      2
#> 2:     2 <data.table>      3

As this function doesn't make sense outside of [.data.table it could be defined in its scope only.

Could you be more precise about what you would like to do with purrr::map but feel you can't ?

As for unnest it seems that we can get help from tidyr without any conversion to tibble. It might be possible to get the best of both worlds after all!

library(tidyr)
unnest(mydf)
#>    group newval   val col
#> 1:     1      2 hello   1
#> 2:     1      2 world   2
#> 3:     1      2 hello   3
#> 4:     2      3 world   4
#> 5:     2      3 hello   5
#> 6:     2      3 world   6

# or
mydf[,unnest(.SD)]
#>    group newval   val col
#> 1:     1      2 hello   1
#> 2:     1      2 world   2
#> 3:     1      2 hello   3
#> 4:     2      3 world   4
#> 5:     2      3 hello   5
#> 6:     2      3 world   6

moodymudskipper on 5 Jul 2019

This also works, with the strange effect that mydf needs to called twice to be printed (bug ?)

This is expected behavior since at least version 1.9.6; see here for reference.

jaapwalhout on 5 Jul 2019

Ah that is pretty great. As for

Could you be more precise about what you would like to do with purrr::map but feel you can't ?

Here is in my opinion what makes tibble superior to DT (for now :)) for data analysis. Look at this wonderful output:

mydf <- tibble(group = c(1,1,1,2,2,2),
                   val = c('hello', 'world','hello', 'world','hello', 'world'),
                   col = c(1,2,3,4,5,6)) 

myreg <- function(df){
  lm(col ~ I(val), data = df)
}

mycoef <- function(obj){
  obj %>% broom::tidy()
}

mydf %>% 
  group_by(group) %>% 
  nest() %>% 
  mutate(myreg = map(data, ~myreg(.x)),
         myoutput = map(myreg, ~mycoef(.x)))

# A tibble: 2 x 4
  group data             myreg    myoutput        
  <dbl> <list>           <list>   <list>          
1     1 <tibble [3 x 2]> <S3: lm> <tibble [2 x 5]>
2     2 <tibble [3 x 2]> <S3: lm> <tibble [2 x 5]>

Here, by just looking at the R console I can see that myreg is a lm object, and that the regression by group went fine because myoutput contains non-empty tibbles.

I would love being able to encapsulate DTs or tibbles in a DT and get the same functionality. But this requires proper unnest()/nest() functions (and proper printing functions ... see my other github issue https://github.com/Rdatatable/data.table/issues/3671).

Here, it is important to be able to unnest myoutput while preserving the other variables (and possibly the other list-column myreg). I am not sure this works right now in DT.

Thanks!

randomgambit on 5 Jul 2019

The only real difference I see in the data.table approach (below what I would do...) is the print doesn't show the size of the objects in the list column.
To unnest you can use tidyr as pointed out earlier.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, coalesce, first, last

mydf <- tibble(group = c(1,1,1,2,2,2),
               val = c('hello', 'world','hello', 'world','hello', 'world'),
               col = c(1,2,3,4,5,6)) 

myreg <- function(df){
  lm(col ~ I(val), data = df)
}

mycoef <- function(obj){
  obj %>% broom::tidy()
}

mydt <- setDT(mydf)

mydt[, .(data = list(.SD)), group]  %>% 
  .[, {
    myreg = lapply(data, FUN = function(x) myreg(x))
    myoutput = lapply(myreg, FUN = function(x) mycoef(x))
    list(data = data, myreg = myreg, myoutput = myoutput)
  }, group]
#>    group         data myreg myoutput
#> 1:     1 <data.table>  <lm> <tbl_df>
#> 2:     2 <data.table>  <lm> <tbl_df>

# OR this other way with more intermediary steps and unnesting at the end

mydt[, .(data = list(.SD)), group] %>% 
  .[, .(data, myreg = lapply(data, FUN = function(x) myreg(x))), group] %>% 
  .[, .(data, myreg, myoutput = lapply(myreg, FUN = function(x) mycoef(x))), group] %>% 
  unnest(myoutput)
#>    group        term     estimate std.error     statistic   p.value
#> 1:     1 (Intercept)  2.00000e+00  1.000000  2.000000e+00 0.2951672
#> 2:     1 I(val)world -5.43896e-16  1.732051 -3.140185e-16 1.0000000
#> 3:     2 (Intercept)  5.00000e+00  1.414214  3.535534e+00 0.1754797
#> 4:     2 I(val)world  8.15844e-16  1.732051  4.710277e-16 1.0000000

^{Created on 2019-07-05 by the reprex package (v0.3.0)}

danielarantes on 5 Jul 2019

👍1

.nest <- function(...) list(data.table(...))

mydt <- as.data.table(mydf)

mydt <- mydt[, .(data = .nest(.SD)),  by = group]

# the following works with lapply but I wanted to show that map works just fine,
# including the formula notation and the potential quasiquotation stuff.

mydt[, myreg := map(data, myreg)][
  , myoutput := map(myreg, mycoef)]

mydt
# group         data myreg myoutput
# 1:     1 <data.table>  <lm> <tbl_df>
# 2:     2 <data.table>  <lm> <tbl_df>

I agree that it prints less nice. And I find the need to switch between assignment by copy and reference annoying, but to be fair it's quite nice and readable.

Now to unnest a specific column :

# this removes other list columns but it does the same on your mydf tibble
mydt[,unnest(.SD, myoutput)]
# group        term     estimate std.error     statistic   p.value
# 1:     1 (Intercept)  2.00000e+00  1.000000  2.000000e+00 0.2951672
# 2:     1 I(val)world -5.43896e-16  1.732051 -3.140185e-16 1.0000000
# 3:     2 (Intercept)  5.00000e+00  1.414214  3.535534e+00 0.1754797
# 4:     2 I(val)world  8.15844e-16  1.732051  4.710277e-16 1.0000000

# to keep all we can use a join
mydt[,unnest(.SD), .SDcols = c("group","myoutput")][mydt, on="group"]
#    group        term     estimate std.error     statistic   p.value         data myreg myoutput
# 1:     1 (Intercept)  2.00000e+00  1.000000  2.000000e+00 0.2951672 <data.table>  <lm> <tbl_df>
# 2:     1 I(val)world -5.43896e-16  1.732051 -3.140185e-16 1.0000000 <data.table>  <lm> <tbl_df>
# 3:     2 (Intercept)  5.00000e+00  1.414214  3.535534e+00 0.1754797 <data.table>  <lm> <tbl_df>
# 4:     2 I(val)world  8.15844e-16  1.732051  4.710277e-16 1.0000000 <data.table>  <lm> <tbl_df>

moodymudskipper on 5 Jul 2019

👍2

ah Daniel I hadn't seen your answer, well that gives more options!

moodymudskipper on 5 Jul 2019

Very interesting. I am all if for a better integration between DT and tidyverse.
@moodymudskipper when you say:

And I find the need to switch between assignment by copy and reference annoying

Also, perhaps the best solution is to make tibble:::print.tbl() accept DTs. The more I think about it, the more this looks like a genius option. I can file an issue on the tibble repo if you like.

randomgambit on 5 Jul 2019

Actually I was wrong to describe it as a bug, it's an unexported method so it's reasonable that it works on tibbles only, and actually the following works so no need to file anything (except it could support data.table as a special class, as it doesn't now) :

as_tibble(mydt)
# A tibble: 2 x 4
  group data             myreg  myoutput        
  <dbl> <list>           <list> <list>          
1     1 <df[,2] [3 x 2]> <lm>   <tibble [2 x 5]>
2     2 <df[,2] [3 x 2]> <lm>   <tibble [2 x 5]>

moodymudskipper on 5 Jul 2019

The trick to use mydt[TRUE] to make sure we're copying is really nice, and for quick exploration mydt[T] is just 3 more keys.

moodymudskipper on 5 Jul 2019

👍1

got it.

Also this syntax is quite interesting
mydt[,unnest(.SD), .SDcols = c("group","myoutput")][mydt, on="group"]

My understanding is that this only works because mydt is copied on the first place. So if I were to use := inside the first call then I should necessarily use the mydf[TRUE] insane hack :)

randomgambit on 5 Jul 2019

Calling it hack is not appropriate, each data table query that doesn't update by reference will copy. The hack about TRUE is that it does shallow copy, not really relevant here. Making copies on [ is a regular behaviour.

jangorecki on 5 Jul 2019

Thanks Jan, I was thinking 1:.N and NULL were the same behavior... so actually DT[TRUE] and DT[1:.N]/DT[NULL] have different effects (the latter being full copies)

MichaelChirico on 6 Jul 2019

I think we can close this issue as duplicate of #2146. We don't need fast nest function because AFAIU it is a matter of wrapping in a list. Any comments?

jangorecki on 30 Jul 2019

Was this page helpful?

0 / 5 - 0 ratings