Dplyr: group_split should provide an option to return a named list

Created on 25 Feb 2019 · 13 Comments · Source: tidyverse/dplyr

group_split() is a powerful addition to dplyr 0.8.0; however, I often need the list elements to be identified by the group's name (e.g. for exporting as JSON).

To get a named list I currently need to:

group_names <- df %>% group_keys(col1) %>% pull(1)
group_list <- df %>%
   group_split(col1) %>%
   set_names(group_names)

which seems too verbose, and the grouping needs to be computed twice.

As a simple improvement, consider adding a "named" option to group_split():

group_list <- df %>%
   group_split(col1, named = TRUE)

dan-reznik

👍18

Most helpful comment

@romainfrancois

base::split is very commonly used in pipe chains alongside tidyverse functions, and its lack of NSE support is very awkward (for example, in this SO question the user is confused that they must use mtcars %>% split(.$cyl) instead of mtcars %>% split(cyl)).

Thus group_split is a very welcome feature. I was disappointed however by the choice to remove names altogether, which I think is a bit brutal and in the end is unfortunately enough for me to discard it most of the time.
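A minimal sketch of the difference, using mtcars as in the example above (the variable names are only for illustration, and a reasonably recent dplyr is assumed):

```r
library(dplyr, warn.conflicts = FALSE)

# base::split() needs the column spelled out via the magrittr dot
by_cyl_base <- mtcars %>% split(.$cyl)

# group_split() accepts the bare column name
by_cyl_dplyr <- mtcars %>% group_split(cyl)

names(by_cyl_base)   # one name per distinct cyl value
names(by_cyl_dplyr)  # no names at all
```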

The doc says:

The primary use case for group_split() is with already grouped data frames

I'm not sure about that. I might be wrong, but I believe its primary use will really be as a drop-in replacement for base::split, and that the absence of names is problematic.

Likewise:

it does not name the elements of the list based on the grouping as this typically loses information and is confusing.

If I'm right that typically we don't split by existing groups but use group_split as a replacement of base::split, I'd also venture to say that typically we split by one variable only, so we wouldn't lose information in this case by naming the elements of the output. group_split() however, in order to be consistent, currently always loses information.

I think a .naming_fun argument would make everybody happy:

group_split(mtcars, cyl) # keep current behavior, as it's general and the user won't get strange group names when splitting by date etc.
group_split(mtcars, cyl, .naming_fun = identity) # same as base::split(mtcars, mtcars$cyl)
group_split(mtcars, cyl, gear, .naming_fun = ~paste(..., sep = "_")) # will build names from the values defining the groups, given as arguments from left to right

A .name_repair argument could be added as well to tweak the resulting names, for instance to make them syntactically valid.

PS: the keep argument is not dotted. Is this on purpose? If not, since the function is still marked as experimental, there might still be time to rename it to .keep.
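For what it's worth, the .naming_fun idea can be approximated today with a small wrapper. This is only a sketch: group_split_named() and its .naming_fun argument are hypothetical and not part of dplyr, and the default separator is an arbitrary choice.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not a dplyr function): split by groups and name each
# element by applying `.naming_fun` to the grouping key columns.
group_split_named <- function(.tbl, ...,
                              .naming_fun = function(...) paste(..., sep = "_")) {
  grouped <- group_by(.tbl, ...)
  keys <- group_keys(grouped)
  # Each key column is passed as one positional argument, left to right
  nms <- do.call(.naming_fun, unname(as.list(keys)))
  rlang::set_names(group_split(grouped), as.character(nms))
}

# One grouping variable: names are the key values themselves
by_cyl <- group_split_named(mtcars, cyl, .naming_fun = identity)
names(by_cyl)

# Two grouping variables: names are built with the default paste(..., sep = "_")
by_cyl_gear <- group_split_named(mtcars, cyl, gear)
names(by_cyl_gear)
```

The grouping is computed once, so the names and the list elements come from the same group order.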

moodymudskipper on 21 Jun 2019

👍11

All 13 comments

I too feel it would be nice to let the list have names, but it is a bit more complicated than a simple named = TRUE because there can be multiple grouping columns. I think it needs to accept functions. For example:

# names are "col1: val1, col2: val2"
df %>%
   group_split(col1, col2, .name = ggplot2::label_both)

# names are "val1 val2", simply paste()ed values
df %>%
   group_split(col1, col2, .name = TRUE)

yutannihilation on 26 Feb 2019

👍6

@dan-reznik if you are going to call both group_keys() and group_split(), I'd suggest you group_by() first and then call both with no arguments. Something like this, perhaps, which you can easily wrap into a function to use instead of dplyr::group_split():

library(dplyr, warn.conflicts = FALSE)

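named_group_split <- function(.tbl, ...) {
  # Group once so that keys and splits come from the same grouping
  grouped <- group_by(.tbl, ...)
  # Splice the key columns into paste() to build one name per group, e.g. "4 / 0"
  names <- rlang::eval_bare(rlang::expr(paste(!!!group_keys(grouped), sep = " / ")))

  grouped %>%
    group_split() %>%
    rlang::set_names(names)
}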

mtcars %>% 
  named_group_split(cyl, am)
#> $`4 / 0`
#> # A tibble: 3 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#> 2  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 3  21.5     4  120.    97  3.7   2.46  20.0     1     0     3     1
#> 
#> $`4 / 1`
#> # A tibble: 8 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
#> 2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 3  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> 4  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#> 5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#> 6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
#> 7  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
#> 8  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2
#> 
#> $`6 / 0`
#> # A tibble: 4 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#> 2  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#> 3  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> 4  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#> 
#> $`6 / 1`
#> # A tibble: 3 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  19.7     6   145   175  3.62  2.77  15.5     0     1     5     6
#> 
#> $`8 / 0`
#> # A tibble: 12 x 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  2  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  3  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#>  4  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#>  5  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
#>  6  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
#>  7  10.4     8  460    215  3     5.42  17.8     0     0     3     4
#>  8  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
#>  9  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
#> 10  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
#> 11  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
#> 12  19.2     8  400    175  3.08  3.84  17.0     0     0     3     2
#> 
#> $`8 / 1`
#> # A tibble: 2 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  15.8     8   351   264  4.22  3.17  14.5     0     1     5     4
#> 2  15       8   301   335  3.54  3.57  14.6     0     1     5     8

romainfrancois on 4 Mar 2019

👍2

That seems like quite the workaround. I love the functionality of named_group_split, but wouldn't it be nice to be able to supply a dynamic function that sets the names, much like you can use labeller in ggplot?

slhck on 29 May 2019

Perhaps it would, but that can definitely live in some other package.

romainfrancois on 31 May 2019

I'm with @moodymudskipper on this. Every time I go to use group_split and friends, I find myself going back to base::split or doing coding gymnastics to get my names back.

JasonAizkalns on 9 Jul 2019

👍3

That's not happening.

If names are really needed, see the function above, which can go in another package. See also group_map().
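A minimal sketch of the group_map() route, assuming a recent dplyr in which group_map() returns a list; the summary columns are made up for illustration. The group's key is available inside the function as .y, so it can be kept as data instead of as a list name:

```r
library(dplyr, warn.conflicts = FALSE)

# group_map() passes each group's rows as .x and its one-row key tibble as .y
res <- mtcars %>%
  group_by(cyl) %>%
  group_map(~ tibble(cyl = .y$cyl, n = nrow(.x), mean_mpg = mean(.x$mpg)))

# Each list element carries its own key, so nothing depends on names
summary_tbl <- bind_rows(res)
summary_tbl
```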

romainfrancois on 15 Jul 2019

Which other package would you recommend this be in?

slhck on 15 Jul 2019

Whichever you use. It's also just a simple function, so it could just live next to where you use it.

We strongly believe giving it names is an anti-pattern, hence that's not happening in dplyr, but we do provide some tools you can use to make your own versions with names.
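One way that names built from several keys can lose information, sketched with made-up data (this is an illustration only, not a rationale stated in this thread): base::split() joins key values with ".", so distinct key pairs can end up with the same name.

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up data: two distinct (g1, g2) key pairs
df <- data.frame(
  g1 = c("a.b", "a"),
  g2 = c("c", "b.c"),
  value = 1:2
)

# base::split() pastes the keys with ".", so both key pairs map to "a.b.c"
pieces <- split(df, list(df$g1, df$g2), drop = TRUE)
names(pieces)

# group_split() keeps the two groups apart, with the keys stored as columns
n_groups <- length(group_split(df, g1, g2))
n_groups
```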

romainfrancois on 15 Jul 2019

I second this feature request!

The documentation for group_split says:

"it does not name the elements of the list based on the grouping as this typically loses information and is confusing"

But I did not see any example in the docs where there is a loss of information and things become confusing in terms of how list elements are named by base::split.

library(tidyverse)

df <- dplyr::filter(mpg, drv %in% c("4", "f"), fl %in% c("p", "r"))

names(split(x = df, f = list(df$drv, df$fl), drop = TRUE))
#> [1] "4.p" "f.p" "4.r" "f.r"

names(group_split(df, drv, fl))
#> NULL

IndrajeetPatil on 17 Aug 2019

👍5

I love this too. I use @romainfrancois's named_group_split() function pretty often because it makes it so much easier to split data and write the resulting named list to files.
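A sketch of that workflow, assuming CSV output into a temporary directory. The directory name and the "_" separator are arbitrary choices here; "_" is used instead of " / " so that the names are safe to use as file names.

```r
library(dplyr, warn.conflicts = FALSE)

grouped <- group_by(mtcars, cyl, am)
keys <- group_keys(grouped)

# Build one name per group from its key values, e.g. "4_0"
nms <- paste(keys$cyl, keys$am, sep = "_")
pieces <- rlang::set_names(group_split(grouped), nms)

# Write each element to "<name>.csv" in a temporary directory
out_dir <- file.path(tempdir(), "mtcars_split")
dir.create(out_dir, showWarnings = FALSE)
for (nm in names(pieces)) {
  write.csv(pieces[[nm]], file.path(out_dir, paste0(nm, ".csv")), row.names = FALSE)
}

list.files(out_dir)
```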

tungmilan on 25 Sep 2019

👍5

I also support the addition of an argument to keep the names. group_map and the function above are nice, but I don't see a reason not to add an argument to group_split.

thekryz on 23 Dec 2019

👍1

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/