Dplyr: Need equivalent of any_vars() with across()

Created on 17 Jan 2020 · 9Comments · Source: tidyverse/dplyr

Interim work around is to use rowAny():

rowAny <- function(x) rowSum(x) > 0
df %>% filter(rowAny(across(is.numeric, ~ .x > 0))

each-col ↔️ feature

Source

hadley

👍4

Most helpful comment

In terms of syntax, it looks weird IMO to wrap the result of across() in some reducer. What about something like if_any() and if_all():

library(palmerpenguins)
library(dplyr, warn.conflicts = FALSE)
library(purrr)

if_any <- function(...) {
  reduce(across(...), `|`)
}
if_all <- function(...) {
  reduce(across(...), `&`)
}

penguins %>% 
  filter(if_any(starts_with("bill"), ~ . > 42))
#> # A tibble: 203 x 8
#>    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
#>  1 Adelie  Torge…           42.5          20.7              197        4500
#>  2 Adelie  Torge…           46            21.5              194        4200
#>  3 Adelie  Dream            42.2          18.5              180        3550
#>  4 Adelie  Dream            44.1          19.7              196        4400
#>  5 Adelie  Dream            42.3          21.2              191        4150
#>  6 Adelie  Torge…           45.8          18.9              197        4150
#>  7 Adelie  Torge…           42.8          18.5              195        4250
#>  8 Adelie  Torge…           42.1          19.1              195        4000
#>  9 Adelie  Torge…           42.9          17.6              196        4700
#> 10 Adelie  Dream            43.2          18.5              192        4100
#> # … with 193 more rows, and 2 more variables: sex <fct>, year <int>
penguins %>% 
  filter(if_all(starts_with("bill"), ~ . > 42))
#> # A tibble: 0 x 8
#> # … with 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
#> #   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
#> #   year <int>

^{Created on 2020-11-06 by the reprex package (v0.3.0.9001)}

romainfrancois on 6 Nov 2020

👍2

All 9 comments

Something like rowAny (and a rowAll) would be great also for using inside case_when when one wants to build the same condition for multiple variables, instead of having to use multiple | or multiple &. (not sure if I should open a new issue).

Thank you!

iago-pssjd on 3 Aug 2020

A few notes:

Notionally, I don't like rowAny() because it doesn't really do what its name indicates. IMHO a slightly more intuitive alternative:
```
rowwise_any <- function(x) purrr::pmap_lgl(x, any)
df %>% filter(rowwise_any(across(where(is.numeric), ~ .x > 0)))
```

@courtiol suggests to use lay::lay() to achieve the same:

df %>% filter(lay::lay(across(where(is.numeric)), ~ any(.x > 0)))

Neither of the above are really intuitive in my eyes. Spontaneously I would wish for an "any-of" version of the existing "all-of" across(), e.g. across_any(), so we could just:
```
df %>% filter(across_any(where(is.numeric), ~ .x > 0))
```
Or maybe instead introduce a flag like .all:
```
df %>% filter(across(.all = FALSE, where(is.numeric), ~ .x > 0))
```
But I'm not deep enough into dplyr semantics/internals to be able to assess if either of the above would really be a good API...

Performance-wise it seems that rowAny() is by far the best of the above three workarounds.

Benchmark reprex:

rowAny <- function(x) rowSums(x) > 0
rowwise_any <- function(x) purrr::pmap_lgl(x, any)
iris_big <- tibble::as_tibble(iris[sample(1:nrow(iris), 5e+5, replace = TRUE), ])

bench::mark(rowAny = iris_big %>% dplyr::filter(rowAny(dplyr::across(where(is.numeric), ~ .x == 1))),
            rowwise_any = iris_big %>% dplyr::filter(rowwise_any(dplyr::across(where(is.numeric), ~ .x == 1))),
            lay = iris_big %>% dplyr::filter(lay::lay(dplyr::across(where(is.numeric)), ~ any(.x == 1))))

#> # A tibble: 3 x 13
#>   expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result       memory         time     gc        
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>       <list>         <list>   <list>    
#> 1 rowAny       30.11ms  45.36ms    22.0      27.7MB    12.0     11     6   500.99ms <tibble [26… <Rprofmem[,3]… <bch:tm… <tibble […
#> 2 rowwise_any    1.93s    1.93s     0.519    16.3MB     5.19     1    10      1.93s <tibble [26… <Rprofmem[,3]… <bch:tm… <tibble […
#> 3 lay            8.74s    8.74s     0.114    18.2MB     2.74     1    24      8.74s <tibble [26… <Rprofmem[,3]… <bch:tm… <tibble […

salim-b on 15 Sep 2020

An additional alternative to the ones given by @salim-b would be the function any_cols introduced by @lionel- in issue #5536.

I ran in https://rstudio.cloud the same benchmark as above just adding this function:

any_cols <- function(df) reduce(df, `|`, .init = FALSE)
rowAny <- function(x) rowSums(x) > 0
rowwise_any <- function(x) purrr::pmap_lgl(x, any)
iris_big <- tibble::as_tibble(iris[sample(1:nrow(iris), 5e+5, replace = TRUE), ])

bench::mark(rowAny = iris_big %>% dplyr::filter(rowAny(dplyr::across(where(is.numeric), ~ .x == 1))),
            rowwise_any = iris_big %>% dplyr::filter(rowwise_any(dplyr::across(where(is.numeric), ~ .x == 1))),
            lay = iris_big %>% dplyr::filter(lay::lay(dplyr::across(where(is.numeric)), ~ any(.x == 1))),
            any_cols = iris_big %>% dplyr::filter(any_cols(across(where(is.numeric), ~ .x == 1))))
#> # A tibble: 4 x 13
#>   expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                memory                     time          gc               
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                <list>                     <list>        <list>           
#> 1 rowAny       21.75ms  33.59ms    24.9      27.7MB     15.6    16    10    641.8ms <tibble [26,677 × 5]> <Rprofmem[,3] [30 × 3]>    <bch:tm [16]> <tibble [16 × 3]>
#> 2 rowwise_any     1.5s     1.5s     0.669    16.3MB     16.1     1    24       1.5s <tibble [26,677 × 5]> <Rprofmem[,3] [1,678 × 3]> <bch:tm [1]>  <tibble [1 × 3]> 
#> 3 lay            5.27s    5.27s     0.190    18.2MB     11.0     1    58      5.27s <tibble [26,677 × 5]> <Rprofmem[,3] [1,920 × 3]> <bch:tm [1]>  <tibble [1 × 3]> 
#> 4 any_cols     15.09ms   15.8ms    49.1        22MB     17.7    25     9   509.37ms <tibble [26,677 × 5]> <Rprofmem[,3] [56 × 3]>    <bch:tm [25]> <tibble [25 × 3]>

iago-pssjd on 6 Oct 2020

Currently filter() does it's own reducing internally with & when multiple columns are involved:

SEXP eval_filter_one(SEXP quos, SEXP mask, SEXP caller, R_xlen_t n, SEXP env_filter) {
  // then reduce to a single logical vector of size n
  SEXP reduced = PROTECT(Rf_allocVector(LGLSXP, n));

  // init with TRUE
  int* p_reduced = LOGICAL(reduced);
  for (R_xlen_t i = 0; i < n ; i++, ++p_reduced) {
    *p_reduced = TRUE;
  }

  // reduce
  R_xlen_t nquos = XLENGTH(quos);
  for (R_xlen_t i=0; i < nquos; i++) {
    SEXP current_expression = PROTECT(Rf_ScalarInteger(i+1));
    Rf_defineVar(dplyr::symbols::current_expression, current_expression, env_filter);

    SEXP res = PROTECT(rlang::eval_tidy(VECTOR_ELT(quos, i), mask, caller));

    filter_check_size(res, i, n, quos);
    filter_check_type(res, i, quos);

    if (TYPEOF(res) == LGLSXP) {
      reduce_lgl(reduced, res, n);
    } else if(Rf_inherits(res, "data.frame")) {
      R_xlen_t ncol = XLENGTH(res);
      for (R_xlen_t j=0; j<ncol; j++) {
        reduce_lgl(reduced, VECTOR_ELT(res, j), n);
      }
    }

    UNPROTECT(2);
  }

  UNPROTECT(1);
  return reduced;
}

Perhaps this could be an argument of filter() to control the reducing, instead of calling a function on the result of across() ?

romainfrancois on 15 Oct 2020

👍2

In terms of syntax, it looks weird IMO to wrap the result of across() in some reducer. What about something like if_any() and if_all():

library(palmerpenguins)
library(dplyr, warn.conflicts = FALSE)
library(purrr)

if_any <- function(...) {
  reduce(across(...), `|`)
}
if_all <- function(...) {
  reduce(across(...), `&`)
}

penguins %>% 
  filter(if_any(starts_with("bill"), ~ . > 42))
#> # A tibble: 203 x 8
#>    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
#>  1 Adelie  Torge…           42.5          20.7              197        4500
#>  2 Adelie  Torge…           46            21.5              194        4200
#>  3 Adelie  Dream            42.2          18.5              180        3550
#>  4 Adelie  Dream            44.1          19.7              196        4400
#>  5 Adelie  Dream            42.3          21.2              191        4150
#>  6 Adelie  Torge…           45.8          18.9              197        4150
#>  7 Adelie  Torge…           42.8          18.5              195        4250
#>  8 Adelie  Torge…           42.1          19.1              195        4000
#>  9 Adelie  Torge…           42.9          17.6              196        4700
#> 10 Adelie  Dream            43.2          18.5              192        4100
#> # … with 193 more rows, and 2 more variables: sex <fct>, year <int>
penguins %>% 
  filter(if_all(starts_with("bill"), ~ . > 42))
#> # A tibble: 0 x 8
#> # … with 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
#> #   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
#> #   year <int>

^{Created on 2020-11-06 by the reprex package (v0.3.0.9001)}

romainfrancois on 6 Nov 2020

👍2

What about the previous idea of yours (@romainfrancois) of doing this internally and having an argument in filter (e.g. .combine = c(&, |))?
For the if_all idea, perhaps some may get confused by looking for a similarity between that and the superseeded suffixed functions such as mutate_all... (but perhaps not).
Would turning any() and all() as S3 methods be an option to just have those functions, while keeping base compatibility via dispatch? I guess you would rather not mask more base fns though...

courtiol on 6 Nov 2020

I'd like this to be partial, i.e. apply to part of the filter(), in the same vein as results of across() participate in a mutate() or summarise() result.

Back on the approach of using something around across(), perhaps either() would work:

library(palmerpenguins)
library(dplyr, warn.conflicts = FALSE)

either <- function(...) {
  data <- suppressMessages(vctrs::df_list(..., .name_repair = "universal"))
  purrr::reduce(data, `|`)
}

penguins %>% 
  filter(
    either(across(starts_with("bill"), ~ . > 42), flipper_length_mm > 180)
  )
#> # A tibble: 331 x 8
#>    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
#>  1 Adelie  Torge…           39.1          18.7              181        3750
#>  2 Adelie  Torge…           39.5          17.4              186        3800
#>  3 Adelie  Torge…           40.3          18                195        3250
#>  4 Adelie  Torge…           36.7          19.3              193        3450
#>  5 Adelie  Torge…           39.3          20.6              190        3650
#>  6 Adelie  Torge…           38.9          17.8              181        3625
#>  7 Adelie  Torge…           39.2          19.6              195        4675
#>  8 Adelie  Torge…           34.1          18.1              193        3475
#>  9 Adelie  Torge…           42            20.2              190        4250
#> 10 Adelie  Torge…           37.8          17.1              186        3300
#> # … with 321 more rows, and 2 more variables: sex <fct>, year <int>

either() would be some sort of generalized or() that understand auto splicing:

either(
  c(TRUE, FALSE), 
  c(FALSE, TRUE)
)
#> [1] TRUE TRUE

either(
  data.frame(x = c(TRUE, FALSE), y = c(FALSE, TRUE)), 
  c(FALSE, FALSE)
)
#> [1] TRUE TRUE

^{Created on 2020-11-09 by the reprex package (v0.3.0.9001)}

romainfrancois on 9 Nov 2020

👍1

Examples from ?filter_if give :

# examples from ?filter_if(): 

mtcars %>% filter_all(all_vars(. > 150))
mtcars %>% filter(across(everything(), ~ .x > 150))

# Or the union:
mtcars %>% filter_all(any_vars(. > 150))
mtcars %>% filter(if_any(everything(), ~ .x > 150))
mtcars %>% filter(either(across(everything(), ~ .x > 150)))

# You can vary the selection of columns on which to apply the
# predicate. filter_at() takes a vars() specification:
mtcars %>% filter_at(vars(starts_with("d")), any_vars((. %% 2) == 0))
mtcars %>% filter(if_any(starts_with("d"), ~ (. %% 2) == 0))
mtcars %>% filter(either(across(starts_with("d"), ~ (. %% 2) == 0)))

# And filter_if() selects variables with a predicate function:
is_int <- function(x) all(floor(x) == x)
mtcars %>% filter_if(is_int, all_vars(. != 0))
mtcars %>% filter(across(where(is_int), ~ .x != 0))

# filter_if  + any_vars
is_int <- function(x) all(floor(x) == x)
mtcars %>% filter_if(is_int, any_vars(. != 0))
mtcars %>% filter(if_any(where(is_int), ~ .x != 0))
mtcars %>% filter(either(across(where(is_int), ~ .x != 0)))

romainfrancois on 9 Nov 2020

either() applies an in-row operation and there is a syntax for that already, so perhaps all what we need is already here:

library(dplyr, warn.conflicts = FALSE)

is_int <- function(x) all(floor(x) == x)

mtcars %>%
  rowwise() %>%
  filter(any(c_across(where(is_int)) == 6))  ## filter rows containing at least one 6 at integer cols
#> # A tibble: 7 x 11
#> # Rowwise: 
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#> 3  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#> 4  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#> 5  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> 6  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#> 7  19.7     6  145    175  3.62  2.77  15.5     0     1     5     6

mtcars %>%
  rowwise() %>%
  filter(all(c_across(where(is_int)) != 0)) ##  filter rows containing no 0 at integer cols
#> # A tibble: 7 x 11
#> # Rowwise: 
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
#> 2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 3  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> 4  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#> 5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#> 6  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
#> 7  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2

^{Created on 2020-11-09 by the reprex package (v0.3.0)}

courtiol on 9 Nov 2020

Was this page helpful?

0 / 5 - 0 ratings