Hello,
I am trying to get the equivalent of nest() / unnest() in data.table and I wonder if a new function would make more sense. Consider this little example
tibble(group = c(1,1,1,2,2,2),
val = c('hello', 'world','hello', 'world','hello', 'world'),
col = c(1,2,3,4,5,6)) %>%
group_by(group) %>% nest() %>%
mutate(newval = group + 1)
# A tibble: 2 x 3
group data newval
<dbl> <list> <dbl>
1 1 <tibble [3 x 2]> 2
2 2 <tibble [3 x 2]> 3
I am not sure the same can be done with data.table. Look at the mysterious output I get after this. Am I missing something?
mydf[, list(listcol=list(data.table(val, col))), by=group][
, newval := group + 1]
group val newval
1: 1 <list> 3
2: 1 <list> 3
3: 1 <list> 3
4: 2 <list> 3
5: 2 <list> 3
6: 2 <list> 3
Re the mysterious output, yeah that does look odd (having newval := group + 1 and yet that output, and also listcol not being a column name...). Could you provide a reproducible example, preferably not using packages that are not required? Some more guidance is here: https://github.com/Rdatatable/data.table/wiki/Support
As far as I know, nest and unnest are feasible in data.table already, though there is a proposal to add a fast unnest: https://github.com/Rdatatable/data.table/issues/2146
(Wrapping in list as you have done is nesting. For unnest, there are several options depending on what specifically you want to do. Rep and unlist work for many cases: https://stackoverflow.com/a/38796822)
sure you can simply use
mydf <- data.table(group = c(1,1,1,2,2,2),
val = c('hello', 'world','hello', 'world','hello', 'world'),
col = c(1,2,3,4,5,6))
and this seems to work
> mydf[, list(listcol=list(data.table(val, col))), by=group]
group listcol
1: 1 <data.table>
2: 2 <data.table>
but the rest seems to break down the list-column format. Ideally, I would like to use purrr::map() on the nest data.frame. Yes, using the best for both worlds :)
wow but the nest() ticket has been open for more than 2 years!
@randomgambit your problem is that you used assignment by reference on top of an aggregating call, it seems like data.table doesn't like it too much, this works :
library(data.table)
mydf <- data.table(group = c(1,1,1,2,2,2),
val = c('hello', 'world','hello', 'world','hello', 'world'),
col = c(1,2,3,4,5,6))
mydf <- mydf[, .(listcol=list(data.table(val, col))), by=group]
mydf [, newval := group + 1]
mydf
#> group listcol newval
#> 1: 1 <data.table> 2
#> 2: 2 <data.table> 3
This also works, with the strange effect that mydf needs to called twice to be printed (bug ?) :
mydf <- data.table(group = c(1,1,1,2,2,2),
val = c('hello', 'world','hello', 'world','hello', 'world'),
col = c(1,2,3,4,5,6))
mydf <- mydf[, .(listcol=list(data.table(val, col))), by=group][, newval := group + 1]
mydf
mydf
#> group listcol newval
#> 1: 1 <data.table> 2
#> 2: 2 <data.table> 3
We could create a function .nest as a shortcut :
.nest <- function(...) list(data.table(...))
mydf <- mydf[, .(listcol= .nest(val, col)), by=group]
mydf [, newval := group + 1]
mydf
#> group listcol newval
#> 1: 1 <data.table> 2
#> 2: 2 <data.table> 3
As this function doesn't make sense outside of [.data.table it could be defined in its scope only.
Could you be more precise about what you would like to do with purrr::map but feel you can't ?
As for unnest it seems that we can get help from tidyr without any conversion to tibble. It might be possible to get the best of both worlds after all!
library(tidyr)
unnest(mydf)
#> group newval val col
#> 1: 1 2 hello 1
#> 2: 1 2 world 2
#> 3: 1 2 hello 3
#> 4: 2 3 world 4
#> 5: 2 3 hello 5
#> 6: 2 3 world 6
# or
mydf[,unnest(.SD)]
#> group newval val col
#> 1: 1 2 hello 1
#> 2: 1 2 world 2
#> 3: 1 2 hello 3
#> 4: 2 3 world 4
#> 5: 2 3 hello 5
#> 6: 2 3 world 6
This also works, with the strange effect that mydf needs to called twice to be printed (bug ?)
This is expected behavior since at least version 1.9.6; see here for reference.
Ah that is pretty great. As for
Could you be more precise about what you would like to do with purrr::map but feel you can't ?
Here is in my opinion what makes tibble superior to DT (for now :)) for data analysis. Look at this wonderful output:
mydf <- tibble(group = c(1,1,1,2,2,2),
val = c('hello', 'world','hello', 'world','hello', 'world'),
col = c(1,2,3,4,5,6))
myreg <- function(df){
lm(col ~ I(val), data = df)
}
mycoef <- function(obj){
obj %>% broom::tidy()
}
mydf %>%
group_by(group) %>%
nest() %>%
mutate(myreg = map(data, ~myreg(.x)),
myoutput = map(myreg, ~mycoef(.x)))
# A tibble: 2 x 4
group data myreg myoutput
<dbl> <list> <list> <list>
1 1 <tibble [3 x 2]> <S3: lm> <tibble [2 x 5]>
2 2 <tibble [3 x 2]> <S3: lm> <tibble [2 x 5]>
Here, by just looking at the R console I can see that myreg is a lm object, and that the regression by group went fine because myoutput contains non-empty tibbles.
I would love being able to encapsulate DTs or tibbles in a DT and get the same functionality. But this requires proper unnest()/nest() functions (and proper printing functions ... see my other github issue https://github.com/Rdatatable/data.table/issues/3671).
Here, it is important to be able to unnest myoutput while preserving the other variables (and possibly the other list-column myreg). I am not sure this works right now in DT.
Thanks!
The only real difference I see in the data.table approach (below what I would do...) is the print doesn't show the size of the objects in the list column.
To unnest you can use tidyr as pointed out earlier.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(purrr)
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following object is masked from 'package:purrr':
#>
#> transpose
#> The following objects are masked from 'package:dplyr':
#>
#> between, coalesce, first, last
mydf <- tibble(group = c(1,1,1,2,2,2),
val = c('hello', 'world','hello', 'world','hello', 'world'),
col = c(1,2,3,4,5,6))
myreg <- function(df){
lm(col ~ I(val), data = df)
}
mycoef <- function(obj){
obj %>% broom::tidy()
}
mydt <- setDT(mydf)
mydt[, .(data = list(.SD)), group] %>%
.[, {
myreg = lapply(data, FUN = function(x) myreg(x))
myoutput = lapply(myreg, FUN = function(x) mycoef(x))
list(data = data, myreg = myreg, myoutput = myoutput)
}, group]
#> group data myreg myoutput
#> 1: 1 <data.table> <lm> <tbl_df>
#> 2: 2 <data.table> <lm> <tbl_df>
# OR this other way with more intermediary steps and unnesting at the end
mydt[, .(data = list(.SD)), group] %>%
.[, .(data, myreg = lapply(data, FUN = function(x) myreg(x))), group] %>%
.[, .(data, myreg, myoutput = lapply(myreg, FUN = function(x) mycoef(x))), group] %>%
unnest(myoutput)
#> group term estimate std.error statistic p.value
#> 1: 1 (Intercept) 2.00000e+00 1.000000 2.000000e+00 0.2951672
#> 2: 1 I(val)world -5.43896e-16 1.732051 -3.140185e-16 1.0000000
#> 3: 2 (Intercept) 5.00000e+00 1.414214 3.535534e+00 0.1754797
#> 4: 2 I(val)world 8.15844e-16 1.732051 4.710277e-16 1.0000000
Created on 2019-07-05 by the reprex package (v0.3.0)
This is what I thought, we're so used at considering data.table and tidyverse as competing paradigms that we don't see how much they can work hand in hand, and would probably even more if more effort was invested in that direction :
.nest <- function(...) list(data.table(...))
mydt <- as.data.table(mydf)
mydt <- mydt[, .(data = .nest(.SD)), by = group]
# the following works with lapply but I wanted to show that map works just fine,
# including the formula notation and the potential quasiquotation stuff.
mydt[, myreg := map(data, myreg)][
, myoutput := map(myreg, mycoef)]
mydt
# group data myreg myoutput
# 1: 1 <data.table> <lm> <tbl_df>
# 2: 2 <data.table> <lm> <tbl_df>
I agree that it prints less nice. And I find the need to switch between assignment by copy and reference annoying, but to be fair it's quite nice and readable.
I tried to use tibble:::print.tbl() on mydt and it failed, but I think it's something the tidyverse team could sort out, it might even be a tidyverse bug in the sense that data.tables are data.frames and it's supposed to support data.frames.
Now to unnest a specific column :
# this removes other list columns but it does the same on your mydf tibble
mydt[,unnest(.SD, myoutput)]
# group term estimate std.error statistic p.value
# 1: 1 (Intercept) 2.00000e+00 1.000000 2.000000e+00 0.2951672
# 2: 1 I(val)world -5.43896e-16 1.732051 -3.140185e-16 1.0000000
# 3: 2 (Intercept) 5.00000e+00 1.414214 3.535534e+00 0.1754797
# 4: 2 I(val)world 8.15844e-16 1.732051 4.710277e-16 1.0000000
# to keep all we can use a join
mydt[,unnest(.SD), .SDcols = c("group","myoutput")][mydt, on="group"]
# group term estimate std.error statistic p.value data myreg myoutput
# 1: 1 (Intercept) 2.00000e+00 1.000000 2.000000e+00 0.2951672 <data.table> <lm> <tbl_df>
# 2: 1 I(val)world -5.43896e-16 1.732051 -3.140185e-16 1.0000000 <data.table> <lm> <tbl_df>
# 3: 2 (Intercept) 5.00000e+00 1.414214 3.535534e+00 0.1754797 <data.table> <lm> <tbl_df>
# 4: 2 I(val)world 8.15844e-16 1.732051 4.710277e-16 1.0000000 <data.table> <lm> <tbl_df>
ah Daniel I hadn't seen your answer, well that gives more options!
Very interesting. I am all if for a better integration between DT and tidyverse.
@moodymudskipper when you say:
And I find the need to switch between assignment by copy and reference annoying
Is this related to https://github.com/Rdatatable/data.table/issues/3682?
Also, perhaps the best solution is to make tibble:::print.tbl() accept DTs. The more I think about it, the more this looks like a genius option. I can file an issue on the tibble repo if you like.
Actually I was wrong to describe it as a bug, it's an unexported method so it's reasonable that it works on tibbles only, and actually the following works so no need to file anything (except it could support data.table as a special class, as it doesn't now) :
as_tibble(mydt)
# A tibble: 2 x 4
group data myreg myoutput
<dbl> <list> <list> <list>
1 1 <df[,2] [3 x 2]> <lm> <tibble [2 x 5]>
2 2 <df[,2] [3 x 2]> <lm> <tibble [2 x 5]>
The trick to use mydt[TRUE] to make sure we're copying is really nice, and for quick exploration mydt[T] is just 3 more keys.
got it.
Also this syntax is quite interesting
mydt[,unnest(.SD), .SDcols = c("group","myoutput")][mydt, on="group"]
My understanding is that this only works because mydt is copied on the first place. So if I were to use := inside the first call then I should necessarily use the mydf[TRUE] insane hack :)
Calling it hack is not appropriate, each data table query that doesn't update by reference will copy. The hack about TRUE is that it does shallow copy, not really relevant here. Making copies on [ is a regular behaviour.
Thanks Jan, I was thinking 1:.N and NULL were the same behavior... so actually DT[TRUE] and DT[1:.N]/DT[NULL] have different effects (the latter being full copies)
I think we can close this issue as duplicate of #2146. We don't need fast nest function because AFAIU it is a matter of wrapping in a list. Any comments?
Most helpful comment
This is what I thought, we're so used at considering data.table and tidyverse as competing paradigms that we don't see how much they can work hand in hand, and would probably even more if more effort was invested in that direction :
I agree that it prints less nice. And I find the need to switch between assignment by copy and reference annoying, but to be fair it's quite nice and readable.
I tried to use
tibble:::print.tbl()onmydtand it failed, but I think it's something the tidyverse team could sort out, it might even be a tidyverse bug in the sense that data.tables are data.frames and it's supposed to support data.frames.Now to
unnesta specific column :