Dplyr: should mutate_at lambdas be able to refer to other columns in .data?

Created on 15 Feb 2019 · 17Comments · Source: tidyverse/dplyr

As a simple example, let's say you have a normalising constant in one dataframe column and you need to normalise many other columns. mutate_at seems like it would be convenient to do this, but you can't refer to other columns in the dataframe apart from . cleanly.

Example:

library(tidyverse)

## Works but hardcoding dataset name feels lame.
iris %>%
  mutate_at(vars(starts_with("Sepal")),
            ~ . / iris$Petal.Width) %>%
  head()
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         25.5       17.50          1.4         0.2  setosa
#> 2         24.5       15.00          1.4         0.2  setosa
#> 3         23.5       16.00          1.3         0.2  setosa
#> 4         23.0       15.50          1.5         0.2  setosa
#> 5         25.0       18.00          1.4         0.2  setosa
#> 6         13.5        9.75          1.7         0.4  setosa

## Fails
iris %>%
  mutate_at(vars(starts_with("Sepal")),
            ~ . / Petal.Width) %>%
  head()
#> Error in mutate_impl(.data, dots): Evaluation error: object 'Petal.Width' not found.

## Fails differently
iris %>%
  mutate_at(vars(starts_with("Sepal")),
            ~ . / .data$Petal.Width) %>%
  head()
#> Error in mutate_impl(.data, dots): Column `Sepal.Length` must be length 150 (the number of rows) or one, not 0

^{Created on 2019-02-15 by the reprex package (v0.2.1)}

bug

Source

MilesMcBain

All 17 comments

Not sure what's happening here, but this works:

library(dplyr, warn.conflicts = FALSE)

iris %>%
  mutate_at(vars(starts_with("Sepal")),
            list(~ . / .data$Petal.Width)) %>%
  head()
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         25.5       17.50          1.4         0.2  setosa
#> 2         24.5       15.00          1.4         0.2  setosa
#> 3         23.5       16.00          1.3         0.2  setosa
#> 4         23.0       15.50          1.5         0.2  setosa
#> 5         25.0       18.00          1.4         0.2  setosa
#> 6         13.5        9.75          1.7         0.4  setosa

^{Created on 2019-02-15 by the reprex package (v0.2.1.9000)}

yutannihilation on 15 Feb 2019

Oh thanks!

Well that sort of solves MY problem, but there seems to be some inconsistency here worth cleaning up. I also discovered this, which really feels like it should work:

library(tidyverse)

## fails
iris %>% 
  mutate_at(vars(starts_with("Sepal")),
            function(col, normaliser){
              col / normaliser
            },
            .data$Petal.Width) %>%
  head()
#> Error in mutate_impl(.data, dots): Column `Sepal.Length` must be length 150 (the number of rows) or one, not 0

## works
iris %>%
  mutate_at(vars(starts_with("Sepal")),
            function(col, normaliser){
              col / normaliser
            },
            iris$Petal.Width) %>%
  head()
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         25.5       17.50          1.4         0.2  setosa
#> 2         24.5       15.00          1.4         0.2  setosa
#> 3         23.5       16.00          1.3         0.2  setosa
#> 4         23.0       15.50          1.5         0.2  setosa
#> 5         25.0       18.00          1.4         0.2  setosa
#> 6         13.5        9.75          1.7         0.4  setosa

^{Created on 2019-02-15 by the reprex package (v0.2.1)}

MilesMcBain on 15 Feb 2019

👍1

If .funs is a formula, it is processed by rlang::as_function(), but, if .funs is a list, it is processed by dplyr:::as_fun(). Probably, the former should also be handled by as_fun()? (Or, does as_function() need .env argument specified?)

https://github.com/tidyverse/dplyr/blob/ce7f2bc2ecfe1c6085126da8cc4d4640ec8f4f1f/R/funs.R#L85-L99

yutannihilation on 15 Feb 2019

👍1

I thought we can just remove this if branch, but things seem a bit more complicated...

https://github.com/tidyverse/dplyr/blob/d2e77ee6a83cd301cb7c1e73383347d69e757a9d/R/funs.R#L91-L92

The change make some tests fail, because as_fun() is yet different from as_function() in that it accepts . only.

library(dplyr, warn.conflicts = FALSE)
library(rlang)

d <- head(mtcars, 2)

# c.f. https://github.com/tidyverse/dplyr/blob/d2e77ee6a83cd301cb7c1e73383347d69e757a9d/tests/testthat/test-colwise-select.R#L104
select_if(d, ~ is_integerish(.x))
#>               mpg cyl disp  hp vs am gear carb
#> Mazda RX4      21   6  160 110  0  1    4    4
#> Mazda RX4 Wag  21   6  160 110  0  1    4    4

select_if(d, list(~ is_integerish(.x)))
#> Error: Can't convert a list to function

^{Created on 2019-02-15 by the reprex package (v0.2.1)}

yutannihilation on 15 Feb 2019

Should this work? (might be related to #4150)

library(dplyr, warn.conflicts = FALSE)
summarise_at(data.frame(x = c(1:10, NA)), vars(starts_with("x")), ~mean(.), na.rm = TRUE)
#>    x
#> 1 NA

^{Created on 2019-02-17 by the reprex package (v0.2.1)}

yutannihilation on 17 Feb 2019

In summary, we can provide the followings as .funs:

a function (either pre-defined or defined in-place)
a name of a function (deprecated, kept just for backword-compatibility)
a purrr-style lambda function notation

a list of either form above

library(dplyr, warn.conflicts = FALSE)

# a function
summarise_at(data.frame(x = c(1:10, NA)), vars(x), mean, na.rm = TRUE)
#>     x
#> 1 5.5
summarise_at(data.frame(x = c(1:10, NA)), vars(x), function(x, ...) mean(x, ...), na.rm = TRUE)
#>     x
#> 1 5.5

# a name of a function
summarise_at(data.frame(x = c(1:10, NA)), vars(x), "mean", na.rm = TRUE)
#>     x
#> 1 5.5

# a purrr-style lambda notation
summarise_at(data.frame(x = c(1:10, NA)), vars(x), ~mean(.), na.rm = TRUE)
#>    x
#> 1 NA


# wrapped by list
summarise_at(data.frame(x = c(1:10, NA)), vars(x), 
             list(mean, function(x, ...) mean(x, ...), "mean", ~mean(.)), 
             na.rm = TRUE)
#>   fn1 fn2 mean..3 mean..4
#> 1 5.5 5.5     5.5     5.5

^{Created on 2019-02-17 by the reprex package (v0.2.1)}

Among those, currently it's only a bare purrr-style lambda function notation that is handled by rlang::as_function() and ignores .... Semantically, I feel this is the right behaviour. I don't think na.rm = TRUE should be passed to mean() in the following.

summarise_at(data.frame(x = c(1:10, NA)), vars(starts_with("x")), ~mean(.), na.rm = TRUE)

But... probably this is needed for backward-compatibility for funs()?

By the way, I'm not sure if this should work. Is ... inside of the context of mutate()?

iris %>% 
  mutate_at(vars(starts_with("Sepal")),
            function(col, normaliser){
              col / normaliser
            },
            .data$Petal.Width) %>%
  head()

yutannihilation on 17 Feb 2019

All these should work:

library(dplyr, warn.conflicts = FALSE)

iris <- head(iris, 3)

iris %>%
  mutate_if(is.numeric, ~ . / iris$Petal.Width)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         25.5        17.5          7.0           1  setosa
#> 2         24.5        15.0          7.0           1  setosa
#> 3         23.5        16.0          6.5           1  setosa
iris %>%
  mutate_if(is.numeric, ~ . / Petal.Width)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         25.5        17.5          7.0           1  setosa
#> 2         24.5        15.0          7.0           1  setosa
#> 3         23.5        16.0          6.5           1  setosa
iris %>%
  mutate_if(is.numeric, ~ . / .data$Petal.Width)
#> Error: Column `Sepal.Length` must be length 3 (the number of rows) or one, not 0

iris %>%
  mutate_if(is.numeric, ~ .x / iris$Petal.Width)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         25.5        17.5          7.0           1  setosa
#> 2         24.5        15.0          7.0           1  setosa
#> 3         23.5        16.0          6.5           1  setosa
iris %>%
  mutate_if(is.numeric, ~ .x / Petal.Width)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         25.5        17.5          7.0           1  setosa
#> 2         24.5        15.0          7.0           1  setosa
#> 3         23.5        16.0          6.5           1  setosa
iris %>%
  mutate_if(is.numeric, ~ .x / .data$Petal.Width)
#> Error: Column `Sepal.Length` must be length 3 (the number of rows) or one, not 0


iris %>%
  mutate_if(is.numeric, list(~ . / iris$Petal.Width))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         25.5        17.5          7.0           1  setosa
#> 2         24.5        15.0          7.0           1  setosa
#> 3         23.5        16.0          6.5           1  setosa
iris %>%
  mutate_if(is.numeric, list(~ . / Petal.Width))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         25.5        17.5          7.0           1  setosa
#> 2         24.5        15.0          7.0           1  setosa
#> 3         23.5        16.0          6.5           1  setosa
iris %>%
  mutate_if(is.numeric, list(~ . / .data$Petal.Width))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         25.5        17.5          7.0           1  setosa
#> 2         24.5        15.0          7.0           1  setosa
#> 3         23.5        16.0          6.5           1  setosa

iris %>%
  mutate_if(is.numeric, list(~ .x / iris$Petal.Width))
#> Error: object '.x' not found
iris %>%
  mutate_if(is.numeric, list(~ .x / Petal.Width))
#> Error: object '.x' not found
iris %>%
  mutate_if(is.numeric, list(~ .x / .data$Petal.Width))
#> Error: object '.x' not found

There are I believe two issues:

When formulas are used, e.g. ~ . / Petal.Width we go through rlang::as_function() there is a bit of internal handling for this, but apparently .data is not in scope
List of formulas go through as_fun instead of as_function so .x is not properly recognized as equivalent to .

romainfrancois on 22 Feb 2019

@MilesMcBain this should not work:

iris %>% 
  mutate_at(vars(starts_with("Sepal")),
            function(col, normaliser){
              col / normaliser
            },
            .data$Petal.Width)

The additional arguments are not quoted or dealt with specially, so .data$Petal.Width is evaluated normally.

romainfrancois on 22 Feb 2019

@yutannihilation I guess you'd reason that if you supply a formula, you are in total control of the arguments, however here's one trick:

library(dplyr)

tbl <- data.frame(x = c(1:10, NA))
tbl %>% 
  summarise_at(vars(starts_with("x")), ~ mean(.) , na.rm = TRUE)
#>    x
#> 1 NA
tbl %>% 
  summarise_at(vars(starts_with("x")), ~ mean(...) , na.rm = TRUE)
#>     x
#> 1 5.5

romainfrancois on 22 Feb 2019

👍1

Wow, cool... I didn't notice this.

yutannihilation on 22 Feb 2019

It might just be an accidental feature 🤗

romainfrancois on 22 Feb 2019

The additional arguments are not quoted or dealt with specially

FYI, I found we can use !!! and !! (only for LHS) for additional arguments by the power of rlang::list2(). Is this intensional feature...?

library(dplyr, warn.conflicts = FALSE)

d <- data.frame(x = 1:4)

mutate_at(d, "x", list(function(x, y, z) x + y + z), y = 10, z = 100)
#>     x
#> 1 111
#> 2 112
#> 3 113
#> 4 114

mutate_at(d, "x", list(function(x, y, z) x + y + z), !!!list(y = 10, z = 100))
#>     x
#> 1 111
#> 2 112
#> 3 113
#> 4 114

a <- quo(y)
b <- quo(z)
mutate_at(d, "x", list(function(x, y, z) x + y + z), !!a := 10, !!b := 100)
#>     x
#> 1 111
#> 2 112
#> 3 113
#> 4 114

^{Created on 2019-02-22 by the reprex package (v0.2.1)}

yutannihilation on 22 Feb 2019

Sounds legit, the list2() call is in dplyr:::as_fun_list()

romainfrancois on 22 Feb 2019

I see.

yutannihilation on 22 Feb 2019

Is this intensional feature...?

I think it's accidental and I wouldn't rely on it. For instance purrr doesn't support splicing in argument lists and probably will not.

(Unless the mapped function supports splicing itself of course.)

lionel- on 4 Mar 2019

Oh, should I file a new issue for it? I guess here will be soon closed by #4217.

yutannihilation on 4 Mar 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/