Currently mutate() and summarise() only work with vectorised functions: functions that take a vector as input and return a vector (or "scalar") as output. I don't see any reason why summarise() and mutate() couldn't also accept tibbles. The existing restrictions would continue to apply, so that in summarise() the tibble would have to have exactly one row, and in mutate() it would have to have either one row or n rows.
In other words, the following two lines of code should be equivalent:
df %>%
  summarise(mean = mean(x), sd = sd(x))
df %>%
  summarise(tibble(mean = mean(x), sd = sd(x)))
This would allow you to extract that repeated pattern out into a function:
# and hence
mean_sd <- function(df, var) {
  tibble(mean = mean(df[[var]]), sd = sd(df[[var]]))
}
df %>%
  summarise(mean_sd(df, "x"))
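For readers who want to try the pattern, here is a minimal sketch. It assumes dplyr >= 1.0.0, where an unnamed one-row data frame returned inside summarise() is spliced into the output. The helper name mean_sd_vec is hypothetical and, unlike mean_sd() above, it takes a vector rather than a data frame and a column name, so that it sees only the current group's values.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not from the thread): takes a vector and returns a
# one-row tibble, which satisfies the "exactly one row" rule for summarise().
mean_sd_vec <- function(x) {
  tibble(mean = mean(x), sd = sd(x))
}

# The unnamed tibble contributes two columns, with one row per group.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_sd_vec(mpg))

print(by_cyl)
```

Passing the column itself, rather than the whole data frame, is what lets the helper respect grouping.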
We'd need to work on documentation to help people develop effective functions of this nature, and develop tools so that you could easily specify input variables (using whatever the next iteration of lazyeval provides) and name the outputs. But that's largely a second-order concern: we can figure out those details later.
Supporting tibbles in this way would be particularly useful for dplyr as it would help to clarify the nature of functions like separate() and unite(), which are currently data frame wrappers around simple vector functions.
These ideas are most important for summarise() and mutate(), but I think we should apply the same principles to filter() and arrange() as well.
cc @lionel- @jennybc @krlmlr
What should happen if the structure of the tibbles varies from call to call (grouped)? bind_rows() semantics? Either way, I feel this should happen after #2311.
This is really helpful, especially for filter().
There are cases where functions return the tibble in rows, which is not accepted by summarise().
df %>%
  summarise(tibble(quantile(x, probs = c(0.025, 0.5, 0.975))))
results in:
Error in `[<-.data.frame`(`*tmp*`, , value = list(model = c("exponential_detection_model", :
replacement element 2 is a matrix/data frame of 3 rows, need 1
In addition: Warning message:
In `[<-.data.frame`(`*tmp*`, , value = list(model = c("exponential_detection_model", :
replacement element 1 has 3 rows to replace 1 rows
Is there a way to transpose the tibble?
@aornugent: You can (ab?)use bind_rows() instead of tibble(). Example:
bind_rows(quantile(iris$Sepal.Length))
This returns:
# A tibble: 1 x 5
`0%` `25%` `50%` `75%` `100%`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4.3 5.1 5.8 6.4 7.9
Now that we have := and sort of going back to the initial #154, perhaps the lhs of := can be richer, i.e. something like this parses:
mtcars %>%
  group_by(cyl) %>%
  summarise(tie(mpg0, mpg25, mpg50, mpg75, mpg100) := quantile(mpg))
From this 🐦 thread https://twitter.com/romain_francois/status/943399604065849344
I toyed with this syntax on the tie 📦 here: https://github.com/romainfrancois/tie
> iris %>%
+ dplyr::group_by(Species) %>%
+ bow( tie(min, max) := range(Sepal.Length) )
# A tibble: 3 x 3
Species min max
<fct> <dbl> <dbl>
1 setosa 4.30 5.80
2 versicolor 4.90 7.00
3 virginica 4.90 7.90
>
> x <- "min"
> iris %>%
+ dplyr::group_by(Species) %>%
+ bow( tie(!!x, max) := range(Sepal.Length) )
# A tibble: 3 x 3
Species min max
<fct> <dbl> <dbl>
1 setosa 4.30 5.80
2 versicolor 4.90 7.00
3 virginica 4.90 7.90
Now it just does a classic summarise of the rhs of := wrapped in a list call, and then re-extracts into what is specified in the lhs, i.e. it does this:
> iris %>%
+ group_by(Species) %>%
+ summarise( ..tmp.. = list(range(Sepal.Length)) ) %>%
+ mutate( min = map_dbl(..tmp.., 1), max = map_dbl(..tmp.., 2) ) %>%
+ select( -..tmp..)
# A tibble: 3 x 3
Species min max
<fct> <dbl> <dbl>
1 setosa 4.30 5.80
2 versicolor 4.90 7.00
3 virginica 4.90 7.90
This is awesome, thank you.
:+1: :100: for the functionality.
The names bow and tie don't seem quite right (too cutesy). Simply c() would be consistent with zeallot.
Yeah sure. I typically don’t use the same standards when making a poc :package: as when working on dplyr.
Back in the original suggestion:
df %>%
  summarise(tibble(mean = mean(x), sd = sd(x)))
The usual rule for summarise is that the result has length 1. We could extend this to the data frame case by only allowing data frames with 1 row.
For mutate, we could say we would only accept data frames with n rows, n being the size of the group.
What would we do if the expression has a name, e.g.
df %>%
  summarise(y = tibble(mean = mean(x), sd = sd(x)))
Should the resulting columns be named c("y_mean", "y_sd") or c("mean", "sd")?
I think the names case is easiest - we follow whatever strategy rlang::flatten() uses.
Wondering about nesting now, would it make sense instead that:
df %>%
  summarise(data = tibble(mean = mean(x), sd = sd(x)))
does the same thing as
df %>%
  summarise(data = list(tibble(mean = mean(x), sd = sd(x))))
(without the always confusing list)
Relevant discussion in #2132
We can rationalize this in vctrs parlance:
Expressions in summarise() should give one "observation", which includes data frames with one row, and matrices with one row.
Expressions in mutate() should give n() observations, including data frames and matrices of n() rows.
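To make "observation" concrete, the sketch below uses vctrs::vec_size(), which counts rows for data frames and matrices rather than columns or cells. It is only an illustration of the size concept, not a description of how dplyr checks results internally.

```r
library(vctrs)

# Size is the number of observations: elements for a plain vector,
# rows for a data frame or a matrix.
size_vector <- vec_size(1:3)
size_df     <- vec_size(data.frame(mean = 1, sd = 2))
size_matrix <- vec_size(matrix(1:4, nrow = 1))

# length() answers a different question for data frames: it counts columns.
length_df <- length(data.frame(mean = 1, sd = 2))

print(c(vector = size_vector, df = size_df, matrix = size_matrix, length_df = length_df))
```

Under this definition a one-row data frame and a one-row matrix both count as a single observation, which is what lets them stand in for a scalar summary.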
Right now we have complex code that handles promotion, so that e.g. if one expression gives an int for one group and then a double, we end up with a numeric vector. Not sure we want to keep this, or rather hint users to produce type-safe functions.
I think we'll be able to use vctrs code to handle the coercion — it'll just be most efficient with type-stable functions because we won't need to make any coercion-copies.
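As a rough illustration of the coercion being discussed, the sketch below combines an integer and a double with vctrs::vec_c(), then shows that incompatible types raise an error instead of being silently converted. The variable names are made up for the example, and this says nothing about how dplyr wires vctrs in internally.

```r
library(vctrs)

# Pretend one group produced an integer summary and another a double.
int_result <- 1L
dbl_result <- 2.5

# vec_c() finds the common type (double here) and coerces both inputs to it.
combined <- vec_c(int_result, dbl_result)
common   <- vec_ptype_common(int_result, dbl_result)

# Types with no common type are an error rather than a silent conversion.
clash <- tryCatch(vec_c(1L, "a"), error = function(e) e)

print(typeof(combined))
print(class(clash)[1])
```

A type-stable function would return the same type for every group, so combining the results would need no coercion at all.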
Update on naming: I think it now seems reasonable that named tibbles would produce a df-col:
# Tibble is spliced into output, producing two new columns
df %>% summarise(tibble(mean = mean(x), sd = sd(x)))
# Produces a single df-col containing two variables.
df %>% summarise(summary = tibble(mean = mean(x), sd = sd(x)))
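A small runnable sketch of the two cases, assuming dplyr >= 1.0.0 and a toy df invented for the example:

```r
library(dplyr, warn.conflicts = FALSE)

# Toy data, made up for illustration.
df <- tibble(x = c(1, 2, 3, 4))

# Unnamed tibble: its columns become columns of the result.
spliced <- df %>% summarise(tibble(mean = mean(x), sd = sd(x)))

# Named tibble: the result has one column, itself a data frame (a df-col).
packed <- df %>% summarise(summary = tibble(mean = mean(x), sd = sd(x)))

print(names(spliced))
print(names(packed))
print(names(packed$summary))
```

The df-col keeps the two summaries grouped under one name, and they can still be reached with packed$summary$mean.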
Update on sizes: I think it now seems reasonably obvious that columns in summarise must have a _size_ of 1, and columns in mutate must have a size of n.
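The mutate side of the size rule can be sketched the same way, again assuming dplyr >= 1.0.0. The helper centre_scale is hypothetical: it returns a tibble with one row per input element, so within each group its size equals the group size.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: an n-row tibble for an input of length n.
centre_scale <- function(x) {
  tibble(centred = x - mean(x), scaled = (x - mean(x)) / sd(x))
}

# Evaluated once per group; the unnamed tibble adds two columns.
out <- mtcars %>%
  group_by(cyl) %>%
  mutate(centre_scale(mpg)) %>%
  ungroup()

print(select(out, cyl, mpg, centred, scaled))
```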
To handle the "quantile" problem, we'll need a quantile() wrapper that returns a tibble with one column for each quantile (https://github.com/r-lib/funs/issues/24). We'll also need a set of "colwise()" wrappers around standard summary functions:
df %>% summarise(agg_quantile(x))
df %>% summarise(col_mean(starts_with("x")), col_min(ends_with("y")))
We'll have to think carefully about how these functions compose: what does colwise quantile look like? What if you want to summarise multiple variables with multiple functions?
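One possible shape for such a quantile wrapper is sketched below. The name quantile_wide and its q-prefixed column naming are assumptions made for this example and are not the agg_quantile() proposed above. It assumes dplyr >= 1.0.0.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical wrapper: a one-row tibble with one column per requested quantile.
quantile_wide <- function(x, probs = c(0.25, 0.5, 0.75)) {
  q <- quantile(x, probs = probs, names = FALSE)
  names(q) <- paste0("q", probs * 100)
  as_tibble(as.list(q))
}

mpg_quantiles <- mtcars %>%
  group_by(cyl) %>%
  summarise(quantile_wide(mpg))

print(mpg_quantiles)
```

Returning the quantiles as columns rather than rows is what keeps the result at size 1 per group. How to name the columns when several variables are summarised at once is left open here, as in the discussion above.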
Is this implemented already? In the dev_0_9_0 branch I see:
library(tidyverse)
tibble(a = 1) %>% mutate(tibble(b = 2))
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
tibble(a = 1) %>% mutate(tibble(b = 2), c = b)
#> Error in get(as.character(FUN), mode = "function", envir = envir): object '.f' of mode 'function' was not found
Created on 2019-10-24 by the reprex package (v0.3.0)
Do we unpack (auto-splice) at the end or right after processing an expression? This is relevant for https://github.com/tidyverse/tibble/issues/581.
Was mistakenly using compat map() as purrr::map(). Fixed now:
library(tidyverse)
tibble(a = 1) %>% mutate(tibble(b = 2))
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
tibble(a = 1) %>% mutate(tibble(b = 2), c = b)
#> # A tibble: 1 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 1 2 2
but yeah, it's on:
library(dplyr, warn.conflicts = FALSE)
mtcars %>%
group_by(cyl) %>%
summarise(as_tibble(as.list(quantile(mpg))))
#> # A tibble: 3 x 6
#> cyl `0%` `25%` `50%` `75%` `100%`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 21.4 22.8 26 30.4 33.9
#> 2 6 17.8 18.6 19.7 21 21.4
#> 3 8 10.4 14.4 15.2 16.2 19.2