Dplyr: rbind on grouped data produces a "nested data frame"

Created on 22 Sep 2016 · 15 Comments · Source: tidyverse/dplyr

Digging into possible causes for this strange behaviour led me to bind_rows(), but it would still be great if the following bug could be fixed, or if rbind() could print a warning recommending bind_rows(). Code that worked fine suddenly failed when I changed the grouping variable; rbind() then returned output like:

> rbind(d2, d3)
   y         foo       z        
d2 Integer,4 Numeric,4 Numeric,4
d3 Integer,8 Integer,8 Numeric,8

Example:

library(dplyr)

d1 <- data_frame(x = 1:8, y = rep(1:4, each = 2), z = rep(1:4, 2))

d2 <- d1 %>% group_by(y) %>% summarise(foo = mean(x), z = 0)
d3 <- d1 %>% group_by(y, z) %>% summarise(foo = mean(x))

# doesn't work
rbind(d2, d3)

# works
rbind(d2, d3 %>% ungroup())

# works
rbind(d3, d3)

# works
rbind(d2, d2)

# works
bind_rows(d2, d3)


d2 <- d1 %>% group_by(y) %>% summarise(foo = mean(x))
d3 <- d1 %>% group_by(y, z) %>% summarise(foo = mean(x)) %>% select(-z)

# doesn't work
rbind(d2, d3)

# works 
rbind(d2, d3 %>% ungroup())

bug


mkuhn

👍3 😄1

Most helpful comment

It doesn't look ideal to have both dplyr and data.table hack the base functions. We'd probably get unpredictable results depending on which objects are supplied and which package was loaded first. To work around that, we could load data.table, if it is installed, before applying our hack.

lionel- on 17 Apr 2017

👍2

All 15 comments

Similarly, rbind() of a data.frame and a grouped tbl_df does not return the expected result.

df1 <- data.frame(a = sample(10), b = sample(10))
df2 <- group_by(data.frame(a = sample(10, replace = TRUE), b = sample(10)), a)

> rbind(df1, df2)
    a          b
df1 Integer,10 Integer,10
df2 Integer,10 Integer,10

Could someone let us know whether this is intended? It did not behave like this in 0.4, and the result looks very odd to me.

Fablepongiste on 14 Oct 2016
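
For readers wondering where the "Integer,10" cells come from: this looks like what base rbind() returns when no data frame method ends up being used and each argument is treated as a plain list of columns. Below is a minimal base-R sketch of that fallback. It uses unclass() to stand in for the failed dispatch, which is an assumption made for illustration and not a description of dplyr's internals.

```r
# Two ordinary data frames with fixed contents.
df1 <- data.frame(a = 1:3, b = 4:6)
df2 <- data.frame(a = 7:9, b = 10:12)

# unclass() strips the "data.frame" class, leaving plain lists of columns,
# so rbind() falls back to its default code and binds the lists as rows.
m <- rbind(unclass(df1), unclass(df2))

is.matrix(m)   # TRUE: the result is a matrix, not a data frame
typeof(m)      # "list": each cell holds a whole column vector
dim(m)         # 2 2: one row per input, one column per variable
m[[1, 1]]      # 1 2 3: the entire column `a` of df1 sits in one cell
```

Each row of the result corresponds to one input object and each cell holds an entire column, which matches the shape of the output reported above.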

It does work as expected with bind_rows(), though. @hadley: Do you think there's a good way to make rbind() do the right thing for grouped df-s?

krlmlr on 7 Nov 2016

I think we can probably make it do better by defining either an rbind() or an rbind2() method.

hadley on 7 Nov 2016

👍2

That would have to be S4, right? Could you please tag this as appropriate (bug/feature)?

krlmlr on 7 Nov 2016

I'm not sure on the S3/S4 issue - I'd need to look into the dispatch issues in more detail.

Somewhat related to https://github.com/edzer/sfr/issues/49

hadley on 7 Nov 2016

We could do the same as data.table does: see https://github.com/hadley/dplyr/issues/606#issuecomment-56529411, and https://github.com/tidyverse/tibble/issues/34#issuecomment-275356594 for the related tibble issue.

CC @billdenney.

krlmlr on 10 Feb 2017

We had to revert this due to #2667 — it caused more problems than it fixed.

hadley on 17 Apr 2017

@hadley Looks like we should also remove the grouped_df method? It's probably better to lose the group information than to get a matrix when combining a grouped_df with a tbl_df or data.frame.

lionel- on 17 Apr 2017

Could an option be added to cbind and rbind so that they perform as described in this issue? Otherwise, there will be a lot of code that looks like:

rbind(as.data.frame(df1), as.data.frame(df2))

billdenney on 17 Apr 2017
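
The as.data.frame() pattern above can be wrapped once instead of being repeated at every call site. The sketch below is base R only; rbind_plain is a hypothetical helper name, not a function in dplyr or base R, and the "grouped_df" class is attached by hand purely as a stand-in for a real grouped tibble.

```r
# Hypothetical helper (not part of dplyr or base R): coerce every input to a
# plain data.frame first, then row-bind the results with base rbind().
rbind_plain <- function(...) {
  dfs <- lapply(list(...), as.data.frame)
  do.call(rbind, dfs)
}

df1 <- data.frame(a = 1:3, b = 4:6)

# Stand-in for a grouped object: an ordinary data frame with an extra class
# in front. This is an assumption for illustration, not a real grouped_df.
df2 <- structure(data.frame(a = 7:9, b = 10:12),
                 class = c("grouped_df", "data.frame"))

out <- rbind_plain(df1, df2)
class(out)  # "data.frame"
nrow(out)   # 6
```

Note that any grouping information is discarded by this approach; bind_rows() remains the option recommended elsewhere in this thread.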

@billdenney we're not in control of the cbind and rbind generics.

lionel- on 17 Apr 2017

@billdenney Just use bind_rows().

hadley on 17 Apr 2017

@hadley bind_rows (and bind_cols) is fair... Is it possible to add a warning to cbind and rbind, so that they say something like "cbind may give unexpected results with tbl_df, please use bind_cols" and "rbind may give unexpected results with tbl_df, please use bind_rows"?

(I haven't fully followed the parts about why it's infeasible, but I like warnings as reminders. And the warning will hopefully help prevent future bug reports like this one. If it's already there, no worries; I can't easily test the development version in my current environment.)

billdenney on 17 Apr 2017

It is not possible to reliably issue a warning because of the way cbind's and rbind's dispatch mechanism works.

lionel- on 17 Apr 2017
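
Within one's own scripts, a reminder of this kind can be approximated with a thin wrapper that is called in place of rbind(). This is only a user-level sketch: rbind_warn is a hypothetical name, it does nothing for code that calls rbind() directly, and it is not how dplyr or data.table approach the problem. The "grouped_df" class is again attached by hand as a stand-in.

```r
# Hypothetical wrapper: warn if any argument carries the "grouped_df" class,
# then hand everything over to base::rbind() unchanged.
rbind_warn <- function(..., deparse.level = 1) {
  args <- list(...)
  is_grouped <- vapply(args, inherits, logical(1), what = "grouped_df")
  if (any(is_grouped)) {
    warning("rbind() may give unexpected results with grouped data; ",
            "consider dplyr::bind_rows()", call. = FALSE)
  }
  base::rbind(..., deparse.level = deparse.level)
}

plain <- data.frame(a = 1:2)

# Stand-in for a grouped object (an assumption for illustration only).
fake_grouped <- structure(data.frame(a = 3:4),
                          class = c("grouped_df", "data.frame"))

# Capture the warning message instead of letting it print.
msg <- tryCatch(rbind_warn(plain, fake_grouped),
                warning = function(w) conditionMessage(w))
```

The check runs before rbind() is called, so it does not depend on which method rbind() eventually selects.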

Well, we could hack into cbind() and rbind() the way data.table does: https://github.com/hadley/dplyr/issues/606#issuecomment-56529411, https://github.com/tidyverse/tibble/issues/34#issuecomment-275356594. I have suggested this before, so there may be a reason we're not following this path.

krlmlr on 17 Apr 2017

lionel- on 17 Apr 2017

👍2
