group_split() is a powerful addition to dplyr 0.8.0. However, I often need the list elements to be identified by the group's name (e.g. for exporting as JSON).
To get a named list I currently need to:
group_names <- df %>% group_keys(col1) %>% pull(1)
group_list <- df %>%
group_split(col1) %>%
set_names(group_names)
which seems too verbose, and the grouping has to be computed twice.
As a simple improvement, consider adding a "named" option to group_split():
group_list <- df %>%
group_split(col1, named = TRUE)
I too feel it's nice to let the list have names, but it would be a bit more complicated than a simple named = TRUE because there can be multiple grouping columns. I think it needs to accept functions. For example:
# names are "col1: val1, col2: val2"
df %>%
group_split(col1, col2, .name = ggplot2::label_both)
# names are "val1 val2", simply paste()ed values
df %>%
group_split(col1, col2, .name = TRUE)
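To make the "col1: val1, col2: val2" naming concrete, here is a minimal sketch of a labelling function applied to the group keys. label_keys() is a hypothetical helper written for this illustration (it is not ggplot2::label_both and not part of dplyr), and it assumes the keys come from group_keys() on an already grouped data frame.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: build one "col: val, col: val" label per group
# from a data frame of group keys (one row per group).
label_keys <- function(keys) {
  # Prefix every value with its column name, column by column.
  labelled <- Map(function(col, val) paste0(col, ": ", val), names(keys), keys)
  # Join the per-column labels row-wise with ", ".
  do.call(paste, c(unname(labelled), sep = ", "))
}

keys <- mtcars %>%
  group_by(cyl, am) %>%
  group_keys()

labels <- label_keys(keys)
labels[1]
#> [1] "cyl: 4, am: 0"
```

The resulting vector has one label per group, in the same order as group_split() returns its pieces, so it could be passed to set_names().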
@dan-reznik if you are going to call both group_keys() and group_split(), I'd suggest you group_by() first and then call both with no arguments. Something like this perhaps, which you can easily wrap into a function to use instead of dplyr::group_split():
library(dplyr, warn.conflicts = FALSE)
named_group_split <- function(.tbl, ...) {
grouped <- group_by(.tbl, ...)
names <- rlang::eval_bare(rlang::expr(paste(!!!group_keys(grouped), sep = " / ")))
grouped %>%
group_split() %>%
rlang::set_names(names)
}
mtcars %>%
named_group_split(cyl, am)
#> $`4 / 0`
#> # A tibble: 3 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 2 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 3 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
#>
#> $`4 / 1`
#> # A tibble: 8 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 3 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 4 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
#> 5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
#> 6 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 7 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> 8 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
#>
#> $`6 / 0`
#> # A tibble: 4 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 2 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 3 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 4 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#>
#> $`6 / 1`
#> # A tibble: 3 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
#>
#> $`8 / 0`
#> # A tibble: 12 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 2 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 3 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
#> 4 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
#> 5 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
#> 6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
#> 7 10.4 8 460 215 3 5.42 17.8 0 0 3 4
#> 8 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
#> 9 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
#> 10 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
#> 11 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
#> 12 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
#>
#> $`8 / 1`
#> # A tibble: 2 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
#> 2 15 8 301 335 3.54 3.57 14.6 0 1 5 8
That seems like quite the workaround. I love the functionality of named_group_split, but wouldn't it be nice to be able to supply a dynamic function that sets the names, much like you can use labeller in ggplot?
Perhaps it would, but that can definitely live in some other package.
@romainfrancois
base::split is a function very commonly used in pipe chains along with tidyverse functions, and its lack of NSE support is very awkward (for example, in this SO question the user is confused that he must use mtcars %>% split(.$cyl) instead of mtcars %>% split(cyl)).
Thus group_split is a very welcome feature. I was disappointed however by the choice to remove names altogether, which I think is a bit brutal and in the end is unfortunately enough for me to discard it most of the time.
The doc says:
The primary use case for group_split() is with already grouped data frames
I'm not sure about that. I might be wrong, but I believe its primary use will really be as a drop-in replacement for base::split, and that the absence of names is problematic.
Likewise:
it does not name the elements of the list based on the grouping as this typically loses information and is confusing.
If I'm right that typically we don't split by existing groups but use group_split as a replacement of base::split, I'd also venture to say that typically we split by one variable only, so we wouldn't lose information in this case by naming the elements of the output. group_split() however, in order to be consistent, currently always loses information.
I think a .naming_fun argument would make everybody happy:
group_split(mtcars, cyl) # keep current behavior, as it's general and the user won't get strange group names when splitting by date etc.
group_split(mtcars, cyl, .naming_fun = identity) # same names as base::split(mtcars, mtcars$cyl)
group_split(mtcars, cyl, gear, .naming_fun = ~ paste(..., sep = "_")) # builds names from the values defining the groups, given as arguments from left to right
A .name_repair argument could be added as well to tweak the resulting names, for instance to make them syntactically valid.
PS: the keep argument is not dotted. Is this on purpose? If not, since the function is still marked as experimental, there might still be time to rename it to .keep.
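For anyone who wants to try the .naming_fun idea without any change to dplyr, here is a minimal sketch of a user-level wrapper. The name group_split_named() and its .naming_fun argument are hypothetical (neither exists in dplyr), and the sketch assumes the naming function receives one vector per grouping column, from left to right.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical wrapper, not part of dplyr: split by groups and name the
# pieces with a user-supplied function of the key columns.
group_split_named <- function(.tbl, ...,
                              .naming_fun = function(...) paste(..., sep = "_")) {
  # Accept plain functions as well as formula lambdas such as ~ paste(...).
  naming_fun <- rlang::as_function(.naming_fun)
  grouped <- group_by(.tbl, ...)
  keys <- group_keys(grouped)
  # Pass each key column as one unnamed argument; the grouping is computed once.
  nms <- do.call(naming_fun, unname(as.list(keys)))
  grouped %>%
    group_split() %>%
    rlang::set_names(as.character(nms))
}

by_cyl <- group_split_named(mtcars, cyl, .naming_fun = identity)
names(by_cyl)
#> [1] "4" "6" "8"

by_cyl_am <- group_split_named(mtcars, cyl, am, .naming_fun = ~ paste(..., sep = "_"))
names(by_cyl_am)
#> [1] "4_0" "4_1" "6_0" "6_1" "8_0" "8_1"
```

Nothing here checks that the generated names are unique or syntactically valid, which is the gap a .name_repair step would have to fill.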
I'm with @moodymudskipper on this. Every time I go to use group_split and friends, I find myself going back to base::split or doing coding gymnastics to get my names back.
That's not happening.
If names are really needed, see the function above, which can go in another package. See also group_map().
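As a rough illustration of the group_map() route, the sketch below reads each group's key from the second argument of the callback instead of from list names. It assumes a dplyr version in which group_map() returns a list (0.8.1 or later, as far as I know), and the label format is only an example.

```r
library(dplyr, warn.conflicts = FALSE)

# For each group, .x is the group's rows and .y is a one-row tibble of keys,
# so the key values are available without naming the list elements.
summaries <- mtcars %>%
  group_by(cyl, am) %>%
  group_map(~ tibble(
    label = paste(.y$cyl, .y$am, sep = " / "),
    n = nrow(.x)
  ))

result <- bind_rows(summaries)
result$label[1]
#> [1] "4 / 0"
```

The same callback could write each group to a file named after its key, which covers the export use case without a named list.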
Which other package would you recommend this be in?
Whichever you use. Also, it's just a simple function, so it could simply live next to where you use it.
We strongly believe giving it names is an anti pattern, hence that's not happening in dplyr, but we do provide some tools you can use to make your own versions with names.
I second this feature request!
The documentation for group_split says:
"it does not name the elements of the list based on the grouping as this typically loses information and is confusing"
But I did not see any example in the docs where there is a loss of information and things become confusing in terms of how list elements are named by base::split.
library(tidyverse)
df <- dplyr::filter(mpg, drv %in% c("4", "f"), fl %in% c("p", "r"))
names(split(x = df, f = list(df$drv, df$fl), drop = TRUE))
#> [1] "4.p" "f.p" "4.r" "f.r"
names(group_split(df, drv, fl))
#> NULL
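For what it's worth, here is one guess at the kind of information loss the documentation may be referring to. This is an assumption about the intent, not an example taken from the docs: names are always character, so key types are lost, and pasting several keys together can make distinct key combinations indistinguishable.

```r
library(dplyr, warn.conflicts = FALSE)

# 1. Names are character vectors, so the numeric type of cyl is lost,
#    while group_keys() keeps it.
by_cyl <- split(mtcars, mtcars$cyl)
name_class <- class(names(by_cyl))
key_class <- class(group_keys(group_by(mtcars, cyl))$cyl)
name_class
#> [1] "character"
key_class
#> [1] "numeric"

# 2. Two different key pairs produce the same pasted name.
keys <- data.frame(a = c("x.y", "x"), b = c("z", "y.z"), stringsAsFactors = FALSE)
pasted <- paste(keys$a, keys$b, sep = ".")
pasted
#> [1] "x.y.z" "x.y.z"
```

With a single, well-behaved grouping column neither problem bites, which is why the names feel harmless in the common case.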
I love this too. I use @romainfrancois's named_group_split() function pretty often because it makes it so much easier to split a data frame and write the resulting named list to files.
I also support the addition of an argument to keep the names. group_map and the function above are nice, but I don't see a reason not to add an argument to group_split.
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/