Dplyr: Automatically unpack unnamed df-cols

Created on 16 Dec 2016 · 18 Comments · Source: tidyverse/dplyr

Currently mutate() and summarise() only work with vectorised functions: functions that take a vector as input and return a vector (or "scalar") as output. I don't see any reason why summarise() and mutate() couldn't also accept tibbles. The existing restrictions would continue to apply so that in summarise() the tibble would have to have exactly one row, and in mutate() it would have to have either one row or n rows.

In other words, the following two lines of code should be equivalent:

df %>%
  summarise(mean = mean(x), sd = sd(x))

df %>%
  summarise(tibble(mean = mean(x), sd = sd(x)))

This would allow you to extract that repeated pattern out into a function:

# Helper that returns a one-row tibble of summaries
mean_sd <- function(df, var) {
  tibble(mean = mean(df[[var]]), sd = sd(df[[var]]))
}
df %>% 
  summarise(mean_sd(df, "x"))
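
As a rough sketch of how such a helper could look once unpacking is available (assuming dplyr 1.0.0 or later, where unnamed data frame results are spliced into the output): the variant below takes the vector itself rather than the data frame and a column name, so it is evaluated once per group. The name `mean_sd` and the use of `mtcars` are illustrative only.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: takes a vector and returns a one-row tibble.
mean_sd <- function(x) {
  tibble(mean = mean(x), sd = sd(x))
}

# The unnamed tibble is unpacked into two ordinary columns per group.
result <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_sd(mpg))

names(result)
```

Passing the vector instead of `df` keeps the helper usable inside grouped pipelines, because `mpg` is resolved against the current group.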

We'd need to work on documentation to help people develop effective functions of this nature, and develop tools so that you could easily specify input variables (using whatever the next iteration of lazyeval provides) and name the outputs. But that's largely a second-order concern: we can figure out those details later.

Supporting tibbles in this way would be particularly useful for tidyr, as it would help to clarify the nature of functions like separate() and unite(), which are currently data frame wrappers around simple vector functions.

These ideas are most important for summarise() and mutate() but I think we should apply the same principles to filter() and arrange() as well.

cc @lionel- @jennybc @krlmlr

feature

hadley

👍17

All 18 comments

What should happen if the structure of the tibbles varies from call to call (grouped)? bind_rows() semantics? Either way, I feel this should happen after #2311.

krlmlr on 21 Feb 2017

This is really helpful, especially for filter().

There are cases where functions return the tibble in rows, which is not accepted by summarise().

df %>%
   summarise(tibble(quantile(x, probs = c(0.025, 0.5, 0.975))))

results in:

Error in `[<-.data.frame`(`*tmp*`, , value = list(model = c("exponential_detection_model",  : 
  replacement element 2 is a matrix/data frame of 3 rows, need 1
In addition: Warning message:
In `[<-.data.frame`(`*tmp*`, , value = list(model = c("exponential_detection_model",  :
  replacement element 1 has 3 rows to replace 1 rows

Is there a way to transpose the tibble?

aornugent on 29 Mar 2017

👍2

@aornugent: You can (ab?)use bind_rows() instead of tibble(). Example:

bind_rows(quantile(iris$Sepal.Length))

This returns:

# A tibble: 1 x 5
   `0%` `25%` `50%` `75%` `100%`
  <dbl> <dbl> <dbl> <dbl>  <dbl>
1   4.3   5.1   5.8   6.4    7.9

huftis on 2 Nov 2017

Now that we have := and sort of going back to the initial #154, perhaps the lhs of := can be richer, i.e. something like this parses:

mtcars %>% 
  group_by(cyl) %>% 
  summarise( tie(mpg0,mpg25,mpg50,mpg75,mpg100) := quantile(mpg) )

From this 🐦 thread https://twitter.com/romain_francois/status/943399604065849344

romainfrancois on 20 Dec 2017

I toyed with this syntax on the tie 📦 here: https://github.com/romainfrancois/tie

> iris %>% 
+   dplyr::group_by(Species) %>% 
+   bow( tie(min, max) := range(Sepal.Length) )
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90
> 
> x <- "min"
> iris %>% 
+   dplyr::group_by(Species) %>% 
+   bow( tie(!!x, max) := range(Sepal.Length) )
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90

Now it just does a classic summarise of the rhs of := wrapped in a list call, and then re-extracts into what is specified in the lhs, i.e. it does this:

> iris %>% 
+   group_by(Species) %>% 
+   summarise( ..tmp.. = list(range(Sepal.Length)) ) %>% 
+   mutate( min = map_dbl(..tmp.., 1), max = map_dbl(..tmp.., 2) ) %>% 
+   select( -..tmp..)
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90

romainfrancois on 23 Feb 2018

👍3

This is awesome, thank you.

aornugent on 5 Mar 2018

:+1: :100: for the functionality.

The names bow and tie don't seem quite right (too cutesy). Simply c() would be consistent with zeallot.

t-kalinowski on 11 Apr 2018

Yeah sure. I typically don’t use the same standards when making a poc :package: as when working on dplyr.

romainfrancois on 11 Apr 2018

Back in the original suggestion:

df %>%
  summarise(tibble(mean = mean(x), sd = sd(x)))

The usual rule for summarise() is that the result has length 1. We could extend this to the data frame case by only allowing data frames with 1 row.

For mutate(), we could say we only accept data frames with n rows, n being the size of the group.

What would we do if the expression has a name? E.g.

df %>%
  summarise(y = tibble(mean = mean(x), sd = sd(x)))

The options for the resulting column names:

c("y_mean", "y_sd")
c("mean", "sd")
not allow it

romainfrancois on 23 Apr 2018

I think the names case is easiest - we follow whatever strategy rlang::flatten() uses.

hadley on 23 Apr 2018

Wondering about nesting now. Would it make sense instead that:

df %>%
  summarise(data = tibble(mean = mean(x), sd = sd(x)))

does the same thing as

df %>%
  summarise(data = list(tibble(mean = mean(x), sd = sd(x))))

(without the always confusing list)

Relevant discussion in #2132

romainfrancois on 30 Apr 2018

We can rationalize this in vctrs parlance:

expressions in summarise() should give one "observation", which includes data frames with one row, and matrices with one row.
expressions in mutate should give n() observations, including data frames and matrices of n() rows.
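
A small sketch of the two size rules, assuming a dplyr release in which unnamed data frame results are unpacked (1.0.0 or later). The data frame `df` and the column names `lo`, `hi`, `centered` and `share` are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(g = c("a", "a", "b"), x = c(1, 3, 10))

# summarise(): each group contributes a tibble with exactly one row.
per_group <- df %>%
  group_by(g) %>%
  summarise(tibble(lo = min(x), hi = max(x)))

# mutate(): each group contributes a tibble with as many rows as the group.
per_row <- df %>%
  group_by(g) %>%
  mutate(tibble(centered = x - mean(x), share = x / sum(x)))
```

Here `per_group` has one row per group, and `per_row` keeps all three input rows with two extra columns.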

Right now we have complex code that handles promotion, so that e.g. if one expression gives an int for one group and then a double we end up with a numeric vector. Not sure we want to keep this, or rather hint users to produce type safe functions.

romainfrancois on 17 Oct 2018

I think we'll be able to use vctrs code to handle the coercion — it'll just be most efficient with type-stable functions because we won't need to make any coercion-copies.
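
To make the coercion point concrete, here is a sketch of how vctrs combines per-group results of different types. It only illustrates the vctrs rules using `vec_c()`; it is not a description of how dplyr is implemented internally.

```r
library(vctrs)

# An integer result from one group and a double from another
# share a common type, so they combine to a double vector.
combined <- vec_c(1L, 2.5)
typeof(combined)

# Types with no common type are an error rather than a silent conversion.
outcome <- tryCatch(
  vec_c(1L, "a"),
  error = function(e) "incompatible"
)
outcome
```

A type-stable summary function never reaches the second case and avoids the extra copy implied by the first.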

hadley on 17 Oct 2018

Update on naming: I think it now seems reasonable that named tibbles would produce a df-col:

# Tibble is spliced into output, producing two new columns
df %>% summarise(tibble(mean = mean(x), sd = sd(x)))

# Produces a single df-col containing two variables.
df %>% summarise(summary = tibble(mean = mean(x), sd = sd(x)))

Update on sizes: I think it now seems reasonably obvious that columns in summarise must have a _size_ of 1, and columns in mutate must have a size of n.
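
One way to check the difference between the two forms, sketched under the assumption of dplyr 1.0.0 or later and tidyr 1.0.0 or later. The data frame `df` is made up; `tidyr::unpack()` with `names_sep` is one existing way to flatten a df-col into prefixed columns afterwards.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = c(1, 2, 3, 4))

# Unnamed tibble: spliced into two top-level columns.
spliced <- df %>% summarise(tibble(mean = mean(x), sd = sd(x)))

# Named tibble: kept as a single df-col called `summary`.
packed <- df %>% summarise(summary = tibble(mean = mean(x), sd = sd(x)))

names(spliced)         # mean, sd
names(packed)          # summary
names(packed$summary)  # mean, sd

# Flatten the df-col later, prefixing the inner names with the outer one.
flat <- tidyr::unpack(packed, summary, names_sep = "_")
names(flat)            # summary_mean, summary_sd
```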

hadley on 8 Feb 2019

👍2

To handle the "quantile" problem, we'll need a quantile() wrapper that returns a tibble with one column for each quantile (https://github.com/r-lib/funs/issues/24). We'll also need a set of "colwise()" wrappers around standard summary functions:

df %>% summarise(agg_quantile(x))
df %>% summarise(col_mean(starts_with("x")), col_min(ends_with("y")))

We'll have to think carefully about how these functions compose: what does colwise quantile look like? What if you want to summarise multiple variables with multiple functions?
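
A minimal sketch of what a quantile wrapper of this kind might look like, assuming dplyr 1.0.0 or later. The name `quantile_tbl` and the `q25`-style column names are hypothetical; this is not the `agg_quantile()` proposed above, and it does not address the colwise composition question.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical wrapper: one row, one column per requested quantile.
quantile_tbl <- function(x, probs = c(0.25, 0.5, 0.75)) {
  q <- quantile(x, probs = probs, names = FALSE)
  names(q) <- paste0("q", probs * 100)
  as_tibble(as.list(q))
}

# Each group yields a one-row tibble, which is unpacked into columns.
result <- mtcars %>%
  group_by(cyl) %>%
  summarise(quantile_tbl(mpg))

names(result)
```

Because the wrapper always returns one row, it satisfies the size rule for summarise() regardless of how many quantiles are requested.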

hadley on 8 Feb 2019

Is this implemented already? In the dev_0_9_0 branch I see:

library(tidyverse)
tibble(a = 1) %>% mutate(tibble(b = 2))
#> # A tibble: 1 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
tibble(a = 1) %>% mutate(tibble(b = 2), c = b)
#> Error in get(as.character(FUN), mode = "function", envir = envir): object '.f' of mode 'function' was not found

Created on 2019-10-24 by the reprex package (v0.3.0)

Do we unpack (auto-splice) at the end or right after processing an expression? This is relevant for https://github.com/tidyverse/tibble/issues/581.

krlmlr on 24 Oct 2019

Was mistakenly using compat map() as purrr::map(). Fixed now:

library(tidyverse)
tibble(a = 1) %>% mutate(tibble(b = 2))
#> # A tibble: 1 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
tibble(a = 1) %>% mutate(tibble(b = 2), c = b)
#> # A tibble: 1 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     2

romainfrancois on 25 Oct 2019

but yeah, it's on:

library(dplyr, warn.conflicts = FALSE)
mtcars %>% 
  group_by(cyl) %>% 
  summarise(as_tibble(as.list(quantile(mpg))))
#> # A tibble: 3 x 6
#>     cyl  `0%` `25%` `50%` `75%` `100%`
#>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1     4  21.4  22.8  26    30.4   33.9
#> 2     6  17.8  18.6  19.7  21     21.4
#> 3     8  10.4  14.4  15.2  16.2   19.2

romainfrancois on 25 Oct 2019

👍4 🚀3 🎉3 ❤1
