Dplyr: Create a group_indices as a new variable

Created on 30 May 2015  ·  21 Comments  ·  Source: tidyverse/dplyr

Some packages like ggplot2 act on groups defined by one variable only (as opposed to groups defined by several variables). It would be nice to have a function, say group(), that creates a new integer variable from groups defined by multiple variables:

Batting %>% mutate(group = group(teamID, yearID))
Batting %>% group_by(teamID, yearID) %>% mutate(group = group())

This function could also have an na.rm argument. By default, it should return a missing value for any observation in which a grouping variable is missing.

group_indices is not suited to this because (i) it requires df as an argument and (ii) it does not work inside mutate:

df <- data_frame(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3,3, 4))
df %>% mutate(g = group_indices(df, v1))
# Error: cannot handle

A workaround for now:

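# Workaround using the dplyr API available at the time of this issue.
# Returns one integer group index per row.
group <- function(..., na.rm = FALSE){
  df <- data.frame(list(...))
  if (na.rm){
    # Rows with any missing grouping value get NA instead of an index
    out <- rep(NA_integer_, nrow(df))
    complete <- complete.cases(df)
    indices <- df %>% filter(complete) %>% group_indices_(.dots = names(df))
    out[complete] <- indices
  } else{
    # Missing values form their own group, as with group_indices()
    out <- group_indices_(df, .dots = names(df))
  }
  out
}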
feature

All 21 comments

You use group_indices like this:

> df <- data_frame(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3,3, 4))
> df %>% group_by(v1) %>% group_indices
[1] 3 3 1 1 2
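
The numbering in that output looks like it follows the sorted group keys, with NA placed last. Below is a minimal base-R sketch that reproduces the same vector under that assumption. It is an illustration only and says nothing about how dplyr computes the indices internally.

```r
# Base-R illustration only; this is not dplyr's implementation.
v1 <- c(NA, NA, 2, 2, 3)

# Distinct keys in sorted order, with NA placed last.
keys <- sort(unique(v1), na.last = TRUE)

# Position of each row's key among the sorted keys; match() pairs NA with NA.
g <- match(v1, keys)
print(g)
#> [1] 3 3 1 1 2
```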

@hadley perhaps we could support something like:

df %>% group_by(v1) %>% mutate( g = group_indices() )
# or 
df %>% group_by(v1) %>% mutate( g = group() )

In every case where I used this function (for instance before using ggplot2 or plm), I wanted the group index to be missing for observations where any of the grouping columns was missing. This differs from the behavior of group_indices. Maybe dplyr is not the best place to implement such a function?

In that case, perhaps you can close the issue?

I think the initial idea is great: using group_indices() the way we use rleid() in data.table.
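
The behavior described here (an NA index whenever any grouping value is missing) can be sketched in base R with interaction(), which yields NA for a row as soon as one of its factors is NA. The helper name group_na is hypothetical and is not part of dplyr; treat this as one possible sketch rather than a recommended implementation.

```r
# Hypothetical helper (not a dplyr function): integer group index per row,
# NA for rows where any grouping value is missing.
group_na <- function(...) {
  cols <- list(...)
  # interaction() combines the columns into one factor; drop = TRUE removes
  # unused combinations, lex.order = TRUE orders levels by the first column first.
  key <- interaction(cols, drop = TRUE, lex.order = TRUE)
  as.integer(key)
}

v1 <- c(NA, NA, 2, 2, 3)
v2 <- c(NA, NA, 3, 3, 4)
print(group_na(v1, v2))
#> [1] NA NA  1  1  2

# A row with only one missing grouping value also gets NA.
print(group_na(c(1, 1, 2, 2), c("a", NA, "a", "b")))
#> [1]  1 NA  2  3
```

Because the helper takes plain vectors, it could be called on data frame columns directly, for example group_na(df$v1, df$v2).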

@romainfrancois

Why was the issue closed? While @matthieugomez found another solution for his particular use case, the original request would be an extremely useful feature. It would be great to be able to do what you suggested above:

df %>% group_by(v1) %>% mutate( g = group_indices() )
# or 
df %>% group_by(v1) %>% mutate( g = group() )

I would expect to be able to use this inside a mutate. I use data.table a lot, like @voxnonecho, and the lack of an easy way to do this in dplyr is a bit of a pain.

I can offer:

library(dplyr, warn.conflicts = FALSE)
df <- data_frame(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3,3, 4))
df %>% group_by(v1) %>% { mutate(ungroup(.), g = group_indices(.)) }
#> # A tibble: 5 × 3
#>      v1    v2     g
#>   <dbl> <dbl> <int>
#> 1    NA    NA     3
#> 2    NA    NA     3
#> 3     2     3     1
#> 4     2     3     1
#> 5     3     4     2

@hadley: This would be much simpler if we had a hybrid handler for group_indices() that just does the right thing for grouped data frames.

I want to shelve this for now, and come back to it when we broadly reconsider what other pronouns would be useful inside dplyr verbs.

howdy, thread. any updates on using group_indices inside of mutate? :)

df %>% group_by(v1) %>% mutate( g = group_indices() )

I see that another option is doing
df$g <- df %>% group_by(v1) %>% group_indices
but it would be great if we could do this in a seamless pipe like others have voiced above.

I'll add that if you want to use group_indices inside a pipe, you can always do this:
df %>% bind_cols(g = group_indices(., group_var1, group_var2))
which does work seamlessly in a pipe but does not feel very neat.
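
For readers on a newer dplyr than the versions discussed in this thread: dplyr 1.0.0 and later export cur_group_id(), which returns the current group's index inside mutate() on a grouped data frame. The sketch below assumes such a version is installed. It is not the hybrid group_indices() handler proposed in this thread.

```r
# Assumes dplyr >= 1.0.0, which is newer than the versions used in this thread.
library(dplyr, warn.conflicts = FALSE)

df <- tibble(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3, 3, 4))

res <- df %>%
  group_by(v1) %>%
  mutate(g = cur_group_id()) %>%  # one integer id per group
  ungroup()

print(res$g)
#> [1] 3 3 1 1 2
```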

Happy to reconsider.

@hadley: Do we have a better idea what other pronouns are useful inside dplyr verbs?

I pushed some code in the feature-1185-group branch so that group_indices() is hybrid-interpreted.

df <- data_frame(v1 = c(3, 3, 2, 2, 3, 1), v2 = 1:6) %>% group_by(v1)
mutate(df, g = group_indices())
#> # A tibble: 6 x 3
#> # Groups:   v1 [3]
#>      v1    v2     g
#>   <dbl> <int> <int>
#> 1    3.     1     3
#> 2    3.     2     3
#> 3    2.     3     2
#> 4    2.     4     2
#> 5    3.     5     3
#> 6    1.     6     1

Because of the internal implementation, this gives 0 as the group for every row when the data frame is not grouped. Is that OK?

df <- data_frame(v1 = c(3, 3, 2, 2, 3, 1), v2 = 1:6)
mutate(df, g = group_indices())
#> # A tibble: 6 x 3
#>      v1    v2     g
#>   <dbl> <int> <int>
#> 1    3.     1     0
#> 2    3.     2     0
#> 3    2.     3     0
#> 4    2.     4     0
#> 5    3.     5     0
#> 6    1.     6     0

This sounds good, but I think group_indices() should return an all-1 vector for an ungrouped data frame for consistency:

dplyr::group_indices(mtcars)
#>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Created on 2018-04-10 by the reprex package (v0.2.0).
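
One way to read the consistency argument: with no grouping variables, the whole data frame is a single group, so every row gets index 1. The base-R sketch below illustrates that convention. The helper name group_id and its vars argument are hypothetical, and this is not dplyr's implementation.

```r
# Hypothetical helper (not a dplyr function) illustrating the "all 1" convention.
group_id <- function(df, vars = character()) {
  # No grouping variables: the whole data frame is one group.
  if (length(vars) == 0L) return(rep(1L, nrow(df)))
  # Otherwise, number the observed key combinations in sorted order.
  as.integer(interaction(df[vars], drop = TRUE, lex.order = TRUE))
}

df <- data.frame(v1 = c(3, 3, 2, 2, 3, 1), v2 = 1:6)

print(group_id(df))
#> [1] 1 1 1 1 1 1

print(group_id(df, "v1"))
#> [1] 3 3 2 2 3 1
```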

I made the change in the branch. Why is group() returning -1 in the first place?

Thanks. What is group()?

Oh, I see -- it's in the C++ code.

I have no idea. Do the tests still pass if we change that -1 to 1?

This would have to be 0; I'll check.

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
