Dplyr: Matrix-y version of across()

Created on 9 Feb 2020  ·  11 Comments  ·  Source: tidyverse/dplyr

A matrix-y version of across() will generally be what people want. The major downside would be making it too magical, which is especially risky for across() because it is hard to debug: it must always be embedded in a verb.

Maybe we need an across() variant that usually returns a matrix, and returns a vector when called within rowwise()?

each-row ↕️ feature

All 11 comments

If, in the rowwise context, across() is likely to be used mainly to select rather than to transform (i.e. without supplying fns, as in df %>% rowwise() %>% mutate(foo = bar(across())) where bar needs a vector), perhaps there is room for two distinct functions rather than turning across() into something too versatile.

I'm not sure I'm following. Do you have some pretend code, @hadley?

@romainfrancois, check discussion in #4840

@romainfrancois Also discussion at https://github.com/tidyverse/dplyr/issues/4837

The experimental verb lay (https://github.com/romainfrancois/lay) is probably a better idea:

  • across() would remain type stable
  • how to apply a function on the tibble returned by across() would be identical whether one is using rowwise() or not

The basic problem (as nicely described by @bwiernik) is that you might want to compute a "rowwise" summary like so:

library(dplyr, warn.conflicts = FALSE)

df <- tibble(w = runif(3), x = runif(3), y = runif(3), z = runif(3))
df %>% rowwise() %>% mutate(m = mean(c(w, x, y, z)))
#> # A tibble: 3 x 5
#> # Rowwise: 
#>       w     x      y      z     m
#>   <dbl> <dbl>  <dbl>  <dbl> <dbl>
#> 1 0.364 0.229 0.850  0.0777 0.380
#> 2 0.729 0.282 0.0116 0.778  0.450
#> 3 0.466 0.459 0.599  0.432  0.489

This is obviously tedious if you have many columns.

Currently there are two ways to use across() here:

# Use rowwise() and coerce to a vector:
df %>% rowwise() %>% mutate(m = mean(unlist(across(w:z))))
#> # A tibble: 3 x 5
#> # Rowwise: 
#>       w     x      y      z     m
#>   <dbl> <dbl>  <dbl>  <dbl> <dbl>
#> 1 0.364 0.229 0.850  0.0777 0.380
#> 2 0.729 0.282 0.0116 0.778  0.450
#> 3 0.466 0.459 0.599  0.432  0.489

# Use existing rowwise function:
df %>% mutate(m = rowMeans(across(w:z)))
#> # A tibble: 3 x 5
#>       w     x      y      z     m
#>   <dbl> <dbl>  <dbl>  <dbl> <dbl>
#> 1 0.364 0.229 0.850  0.0777 0.380
#> 2 0.729 0.282 0.0116 0.778  0.450
#> 3 0.466 0.459 0.599  0.432  0.489

# Or use apply():
df %>% mutate(m = apply(across(w:z), 1, mean))
#> # A tibble: 3 x 5
#>       w     x      y      z     m
#>   <dbl> <dbl>  <dbl>  <dbl> <dbl>
#> 1 0.364 0.229 0.850  0.0777 0.380
#> 2 0.729 0.282 0.0116 0.778  0.450
#> 3 0.466 0.459 0.599  0.432  0.489

If we had a variant of across() that returned a matrix instead of a data frame, and where the function was applied across rows, rather than columns, we could write:

df %>% rowwise() %>% mutate(m = mean(something(w:z)))
df %>% mutate(m = something(w:z, mean))

It's a bit hard to know what to call this function, but its name should probably echo across(), since the two are closely related.

OTOH, if this somehow became an additional feature of across(), it would also solve #4770, because you could write (e.g.) across(is.numeric, ~ .x > 0, row_fn = any).
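The row_fn argument above is hypothetical. With across() as it exists, a similar "any selected column satisfies the predicate" filter can be sketched by summing the logical columns row by row. The data frame d and its columns are made up for illustration, and this assumes dplyr >= 1.0.0:

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up example data: row "a" has no positive value in x or y
d <- tibble(
  id = c("a", "b", "c"),
  x  = c(-1, 2, -3),
  y  = c(-4, -5, 6)
)

# across() returns a tibble of logicals; rowSums() counts the TRUEs per row,
# so "> 0" plays the role of a row-wise any()
kept <- d %>%
  filter(rowSums(across(c(x, y), ~ .x > 0)) > 0)

kept$id
```

This leans on an existing row-wise base function (rowSums) rather than a new argument, in the same spirit as the rowMeans() example earlier in the thread.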

Perhaps vector_across() or similar would be a good name for a separate function returning an nrow × ncol matrix which, following R's subsetting rules for matrices, becomes a vector if there is only one row or column.
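The subsetting behaviour referred to here is base R's default drop = TRUE. A small illustration with a made-up 3 × 2 matrix (the object name m is only for this sketch and is not part of any proposed API):

```r
# Made-up matrix with columns x = 1:3 and y = 4:6
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("x", "y")))

one_row_vec <- m[1, ]                # dimensions dropped: a named integer vector
one_row_mat <- m[1, , drop = FALSE]  # drop = FALSE keeps a 1 x 2 matrix

is.matrix(one_row_vec)
is.matrix(one_row_mat)
```

A function that returns a matrix for many rows but a plain vector for a single row would therefore behave like default matrix subsetting, which is convenient inside rowwise() but means the return type depends on the data.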

oh ok, so that's essentially this: https://github.com/romainfrancois/lay

Maybe over() as in %>% mutate(m = over(w:z, mean))

I'd argue that having across() handle both a function to apply to each column and another function to apply to each "row" would be a bit much for one function.

What about through()?

Here is a dull implementation just to illustrate the syntax:

through <- function(vars, fn) apply(across({{vars}}), 1, rlang::as_function(fn))

library(dplyr)

iris %>% tibble() %>%
  mutate(Petal.Sum = through(starts_with("Petal"), sum))
#> # A tibble: 150 x 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Sum
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>       <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa       1.60
#>  2          4.9         3            1.4         0.2 setosa       1.60

#> # … with 140 more rows

iris %>% tibble() %>%
  mutate(Petal.Sum = through(starts_with("Petal"), ~ sum(.x)))
#> # A tibble: 150 x 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Sum
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>       <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa       1.60
#>  2          4.9         3            1.4         0.2 setosa       1.60

We've decided to implement c_across() for this use case.
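For readers arriving later: c_across() is available in dplyr 1.0.0 and later. A minimal sketch of the rowwise mean from earlier in the thread, assuming that version or newer is installed. The set.seed() call and the final ungroup() are additions for reproducibility and tidiness, not part of the original example:

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)  # added so the random columns are reproducible
df <- tibble(w = runif(3), x = runif(3), y = runif(3), z = runif(3))

# Inside rowwise(), c_across() combines the selected columns of the
# current row into a single vector, so mean() sees four numbers per row
res <- df %>%
  rowwise() %>%
  mutate(m = mean(c_across(w:z))) %>%
  ungroup()

# Should agree with the rowMeans() approach shown earlier
all.equal(res$m, unname(rowMeans(as.matrix(df))))
```

This keeps across() type stable (it still returns a tibble) and gives the rowwise case its own function, along the lines argued for in the comments above.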
