Dplyr: Create a group_indices as a new variable

Created on 30 May 2015 · 21 Comments · Source: tidyverse/dplyr

Some packages like ggplot2 act on groups defined by one variable only (as opposed to groups defined by several variables). It would be nice to have a function, say group(), that creates a new integer variable from groups defined by multiple variables:

Batting %>% mutate(group = group(teamID, yearID))
Batting %>% group_by(teamID, yearID) %>% mutate(group = group())

This function could also have a na.rm argument. The default should return a missing value for the observation if some grouping variable for this observation is missing.

group_indices is not suited for that, since (i) it requires df as an argument and (ii) it does not work inside mutate:

df <- data_frame(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3,3, 4))
df %>% mutate(g = group_indices(df, v1))
# Error: cannot handle

A workaround for now:

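group <- function(..., na.rm = FALSE){
  # Collect the grouping vectors into a data frame, one column per vector
  df <- data.frame(list(...))
  if (na.rm){
    # Rows with a missing value in any grouping column get NA as their index
    out <- rep(NA, nrow(df))
    complete <- complete.cases(df)
    indices <- df %>% filter(complete) %>% group_indices_(.dots = names(df))
    out[complete] <- indices
  } else{
    # Default branch: every combination, including those with NA, gets an index
    out <- group_indices_(df, .dots = names(df))
  }
  out
}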
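
For comparison, here is a minimal base R sketch of the same idea that avoids the dplyr dependency. The names `group_id` and `na_as_missing` are hypothetical and appear nowhere in dplyr or in this thread. It assumes the arguments are equal-length atomic vectors and numbers the groups in sorted order, with NA sorting last.

```r
# Hypothetical helper (not part of dplyr): integer group ids built from one or
# more equal-length vectors, using base R only.
group_id <- function(..., na_as_missing = TRUE) {
  df <- data.frame(..., stringsAsFactors = FALSE)

  # Rows that take part in the numbering; the rest keep NA.
  keep <- if (na_as_missing) complete.cases(df) else rep(TRUE, nrow(df))
  sub <- df[keep, , drop = FALSE]

  # Sort the kept rows by all columns (NA sorts last), then start a new id
  # at the first row of each distinct combination.
  ord <- do.call(order, unname(as.list(sub)))
  ids_sorted <- cumsum(!duplicated(sub[ord, , drop = FALSE]))

  # Write the ids back to the original row positions.
  out <- rep(NA_integer_, nrow(df))
  out[which(keep)[ord]] <- ids_sorted
  out
}

v1 <- c(NA, NA, 2, 2, 3)
v2 <- c(NA, NA, 3, 3, 4)

group_id(v1, v2)
#> [1] NA NA  1  1  2
group_id(v1, v2, na_as_missing = FALSE)
#> [1] 3 3 1 1 2
```

With `na_as_missing = TRUE`, rows that have a missing value in any grouping vector get NA, which is the default behavior requested above. With `FALSE`, the NA combination gets its own index.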

feature

matthieugomez

👍5

Most helpful comment

@romainfrancois

Why was the issue closed? While @matthieugomez found another solution for his particular use case, the original request would be an extremely useful feature. It would be great to be able to do what you suggested above:

df %>% group_by(v1) %>% mutate( g = group_indices() )
# or 
df %>% group_by(v1) %>% mutate( g = group() )

jarkub on 19 May 2016

👍38

All 21 comments

You use group_indices like this:

> df <- data_frame(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3,3, 4))
> df %>% group_by(v1) %>% group_indices
[1] 3 3 1 1 2

@hadley perhaps we could support something like:

df %>% group_by(v1) %>% mutate( g = group_indices() )
# or 
df %>% group_by(v1) %>% mutate( g = group() )

romainfrancois on 8 Jul 2015

👍5

In every case where I used this function (for instance before using ggplot2 or plm), I wanted the group index to be missing for observations where any of the grouping columns was missing. This is different from the behavior of group_indices. Maybe dplyr is not the best place to implement such a function?

matthieugomez on 8 Jul 2015

In that case, perhaps you can close the issue?

romainfrancois on 8 Jul 2015

I think the initial idea is great: using group_indices() the way we use rleid() in data.table.

voxnonecho on 8 Jul 2015

I would expect to be able to use this inside a mutate. I use data.table a lot, like @voxnonecho, and the lack of an easy way to do this in dplyr is a bit of a pain.

stephlocke on 28 Mar 2017

👍5

I can offer:

library(dplyr, warn.conflicts = FALSE)
df <- data_frame(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3,3, 4))
df %>% group_by(v1) %>% { mutate(ungroup(.), g = group_indices(.)) }
#> # A tibble: 5 × 3
#>      v1    v2     g
#>   <dbl> <dbl> <int>
#> 1    NA    NA     3
#> 2    NA    NA     3
#> 3     2     3     1
#> 4     2     3     1
#> 5     3     4     2

@hadley: This would be much simpler if we had a hybrid handler for group_indices() that just does the right thing for grouped data frames.

krlmlr on 28 Mar 2017

👍7

I want to shelve this for now, and come back to it when we broadly reconsider what other pronouns would be useful inside dplyr verbs.

hadley on 28 Mar 2017

howdy, thread. any updates on using group_indices inside of mutate? :)

df %>% group_by(v1) %>% mutate( g = group_indices() )

ericgtaylor on 3 Oct 2017

👍8

I see that another option is:

df$g <- df %>% group_by(v1) %>% group_indices

but it would be great if we could do this in a seamless pipe, as others have said above.

dfrail24 on 5 Feb 2018

👍1

I'll add that if you want to use group_indices inside a pipe, you can always do this:

df %>% bind_cols(g = group_indices(., group_var1, group_var2))

This works seamlessly in a pipe but does not feel very neat.

Zedseayou on 14 Feb 2018

👍1

Happy to reconsider.

krlmlr on 28 Feb 2018

@hadley: Do we have a better idea what other pronouns are useful inside dplyr verbs?

krlmlr on 28 Feb 2018

I pushed some code in the feature-1185-group branch so that group_indices() is hybrid-interpreted.

df <- data_frame(v1 = c(3, 3, 2, 2, 3, 1), v2 = 1:6) %>% group_by(v1)
mutate(df, g = group_indices())
#> # A tibble: 6 x 3
#> # Groups:   v1 [3]
#>      v1    v2     g
#>   <dbl> <int> <int>
#> 1    3.     1     3
#> 2    3.     2     3
#> 3    2.     3     2
#> 4    2.     4     2
#> 5    3.     5     3
#> 6    1.     6     1

Because of the internal implementation, this gives 0 as the group for every row when the data frame is not grouped. Is that OK?

df <- data_frame(v1 = c(3, 3, 2, 2, 3, 1), v2 = 1:6)
mutate(df, g = group_indices())
#> # A tibble: 6 x 3
#>      v1    v2     g
#>   <dbl> <int> <int>
#> 1    3.     1     0
#> 2    3.     2     0
#> 3    2.     3     0
#> 4    2.     4     0
#> 5    3.     5     0
#> 6    1.     6     0

romainfrancois on 9 Apr 2018

🎉2

This sounds good, but I think group_indices() should return an all-1 vector for an ungrouped data frame for consistency:

dplyr::group_indices(mtcars)
#>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Created on 2018-04-10 by the reprex package (v0.2.0).

krlmlr on 10 Apr 2018

I made the change in the branch. Why is group() returning -1 in the first place?

romainfrancois on 10 Apr 2018

Thanks. What is group()?

krlmlr on 10 Apr 2018

Oh, I see -- it's in the C++ code.

krlmlr on 10 Apr 2018

I have no idea. Do the tests still pass if we change that -1 to 1?

krlmlr on 10 Apr 2018

This would have to be 0. I'll check.

romainfrancois on 10 Apr 2018

👍1

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
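
For readers arriving after the lock: later dplyr releases (1.0.0 and up) provide `cur_group_id()`, which can be called inside `mutate()` on a grouped data frame. This function is not mentioned anywhere in the thread above, so the sketch below assumes you have such a version installed. It reuses the example data from the discussion.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3, 3, 4))

# cur_group_id() gives the integer id of the current group, so the index can
# be added without leaving the pipe.
out <- df %>%
  group_by(v1) %>%
  mutate(g = cur_group_id()) %>%
  ungroup()

out$g
#> [1] 3 3 1 1 2
```

As with `group_indices()`, the NA group gets its own index rather than a missing value, so the na.rm-style behavior from the original request still needs a separate step, for example setting `g` to NA where a grouping column is missing.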