Dplyr: rbind on grouped data produces a "nested data frame"

Created on 22 Sep 2016 · 15 Comments · Source: tidyverse/dplyr

Digging around possible causes for this strange behaviour led me to bind_rows, but it would still be great if the following bug could be resolved or a warning to use bind_rows could be printed. Code that worked fine suddenly failed when I changed the grouping variable, leading to an output of rbind like:

> rbind(d2, d3)
   y         foo       z        
d2 Integer,4 Numeric,4 Numeric,4
d3 Integer,8 Integer,8 Numeric,8

Example:

library(dplyr)

d1 <- data_frame(x = 1:8, y = rep(1:4, each = 2), z = rep(1:4, 2))

d2 <- d1 %>% group_by(y) %>% summarise(foo = mean(x), z = 0)
d3 <- d1 %>% group_by(y, z) %>% summarise(foo = mean(x))

# doesn't work
rbind(d2, d3)

# works
rbind(d2, d3 %>% ungroup())

# works
rbind(d3, d3)

# works
rbind(d2, d2)

# works
bind_rows(d2, d3)


d2 <- d1 %>% group_by(y) %>% summarise(foo = mean(x))
d3 <- d1 %>% group_by(y, z) %>% summarise(foo = mean(x)) %>% select(-z)

# doesn't work
rbind(d2, d3)

# works 
rbind(d2, d3 %>% ungroup())
All 15 comments

Similarly, rbind between a data.frame and a grouped tbl_df does not return the expected result.

df1 <- data.frame("a" = sample(10), "b" = sample(10))
df2 <- group_by(data.frame("a" = sample(10, replace = TRUE), "b" = sample(10)), a)

rbind(df1, df2)
    a          b
df1 Integer,10 Integer,10
df2 Integer,10 Integer,10

Could someone let us know whether this is intended? It did not behave like this in 0.4, and the result looks very odd to me.

It does work as expected with bind_rows(), though. @hadley: Do you think there's a good way to make rbind() do the right thing for grouped df-s?

I think we can probably make it do better by defining either an rbind() or an rbind2() method.

That would have to be S4, right? Could you please tag this as appropriate (bug/feature)?

I'm not sure on the S3/S4 issue - I'd need to look into the dispatch issues in more detail.

Somewhat related to https://github.com/edzer/sfr/issues/49

We could do the same as data.table does: see https://github.com/hadley/dplyr/issues/606#issuecomment-56529411, and also https://github.com/tidyverse/tibble/issues/34#issuecomment-275356594 for the related tibble issue.

CC @billdenney.

We had to revert this due to #2667: it caused more problems than it fixed.

@hadley Looks like we should also remove the grouped_df method? It's probably better to lose the group information than to get a matrix when combining a grouped_df with a tbl_df or data.frame.

Could an option be added to cbind and rbind so that they perform as described in this issue? Otherwise, there will be a lot of code that looks like:

rbind(as.data.frame(df1), as.data.frame(df2))
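
If that coercion is needed at many call sites, it could be wrapped once. The following is a minimal sketch in base R; rbind_plain() is a hypothetical helper name (not part of dplyr or base R), and it deliberately gives up any grouping by coercing every input with as.data.frame() before calling base rbind().

```r
# Hypothetical helper (not part of dplyr or base R): coerce every input
# to a plain data.frame, then row-bind the results with base rbind().
rbind_plain <- function(...) {
  dfs <- lapply(list(...), as.data.frame)
  do.call(rbind, dfs)
}

# Usage with the objects from the earlier comment:
# rbind_plain(df1, df2)
```

Because every argument reaches rbind() as a plain data.frame, the inputs no longer differ in class. The result is always ungrouped, so any grouping has to be reapplied afterwards.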

@billdenney we're not in control of the cbind and rbind generics.

@billdenney Just use bind_rows().

@hadley bind_rows (and bind_cols) is fair... Is it possible to add a warning to cbind and rbind as a reminder, so that they say something like "cbind may give unexpected results with tbl_df, please use bind_cols" and "rbind may give unexpected results with tbl_df, please use bind_rows"?

(I haven't fully followed the parts about why it's infeasible, but I like warnings as reminders, and the warning will hopefully help prevent future bug reports like this one. If it's already there, no worries; I can't easily test the development version in my current environment.)

It is not possible to reliably issue a warning because of the way the dispatch mechanism of cbind and rbind works.
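
For context, ?cbind documents that rbind() and cbind() do not dispatch through UseMethod(): an S3 method is used only when every argument that has one selects the same method, and otherwise the call drops through to the internal default. The base R sketch below illustrates that rule with two made-up classes, alpha and beta (both hypothetical, chosen only for this example). It is an illustration of the documented rule, not dplyr's code.

```r
# Two made-up S3 classes, each with its own rbind() method.
# The methods are defined at top level so that rbind() can find them.
rbind.alpha <- function(...) "alpha method"
rbind.beta  <- function(...) "beta method"

a <- structure(list(v = 1:2), class = "alpha")
b <- structure(list(v = 3:4), class = "beta")

# Both arguments select rbind.alpha(), so that method is called.
same <- rbind(a, a)

# The arguments select different methods, so neither is called and the
# internal default combines the two lists into a matrix instead.
mixed <- rbind(a, b)

print(same)
print(is.matrix(mixed))
```

A method defined for one class therefore never sees a call in which another argument selects a different method, so it has no opportunity to warn. This is consistent with the matrix-like output in the original report, where the two inputs have different classes, although the thread does not spell out which methods were involved.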

Well, we could hack into cbind() and rbind() the way data.table does: https://github.com/hadley/dplyr/issues/606#issuecomment-56529411, https://github.com/tidyverse/tibble/issues/34#issuecomment-275356594. I have suggested this before, so there may be a reason we're not following this path.

It doesn't look ideal to have both dplyr and data.table hack the base functions. We'd probably get unpredictable results depending on which objects are supplied and which package was loaded first. To work around that, though, we could load data.table (if it is installed) before applying our own hack.
