Dplyr: select() works with a vector of variable names

Created on 19 Jan 2018 · 19 Comments · Source: tidyverse/dplyr

select() works with a vector of variable names, but I think it shouldn't. Let me show why:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union


## This is confusing. Shouldn't this give an error?
a <- tibble(c = 21:25, a = 1:5, b = 11:15)
bb <- c("a", "b")
a %>% select(bb)
#> # A tibble: 5 x 2
#>       a     b
#>   <int> <int>
#> 1     1    11
#> 2     2    12
#> 3     3    13
#> 4     4    14
#> 5     5    15

## This is the expected behavior
a <- tibble(c = 21:25, a = 1:5, b = 11:15)
b <- c("a", "b")
a %>% select(b)
#> # A tibble: 5 x 1
#>       b
#>   <int>
#> 1    11
#> 2    12
#> 3    13
#> 4    14
#> 5    15

It may be confusing for users to get different results depending on the name of their object.

My suggestion is to give an error and force the user to write a %>% select(one_of(bb))
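
For illustration, here is a minimal sketch of the explicit form suggested above. The column named bb is a hypothetical addition (it is not in the original reprex), included only to show that one_of() reads bb from the calling environment rather than from the data:

```r
library(dplyr)

# Hypothetical data: same columns as the reprex, plus a column named `bb`
# that collides with the name of the local variable below.
a <- tibble(c = 21:25, a = 1:5, b = 11:15, bb = 31:35)
bb <- c("a", "b")

# one_of() is an ordinary function call, so its argument `bb` is looked up
# in the calling environment: the character vector, not the column.
explicit <- a %>% select(one_of(bb))

# A bare symbol is looked up in the data first, so the column `bb` wins here.
bare <- a %>% select(bb)

names(explicit)  # "a" "b"
names(bare)      # "bb"
```

With the explicit helper, the result no longer depends on whether the data happens to contain a column with the same name as the variable.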

All 19 comments

Thanks. This is documented as a convenience feature of select() and rename() at the very bottom of the examples for select(). Changing this would take a lot of effort now, but might be worthwhile still. To remove ambiguity, you can already:

  • use .data to only refer to columns
  • use !! to refer to local variables
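
As a rough sketch of the second bullet (the data and variable names here are illustrative, chosen so that the local variable collides with a column name):

```r
library(dplyr)

df <- tibble(c = 21:25, a = 1:5, b = 11:15)
b <- c("a", "c")  # local variable that shares its name with the column `b`

# Bare symbol: the column `b` masks the local variable `b`.
by_column <- df %>% select(b)

# Unquoting with !! evaluates `b` in the calling environment before
# select() sees it, so the character vector c("a", "c") is used instead.
by_variable <- df %>% select(!!b)

names(by_column)    # "b"
names(by_variable)  # "a" "c"
```

The unquoted form makes the intent readable from the code alone: !!b can only refer to the local variable, never to a column.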

I strongly think a user should be able to infer what is going to be attempted by reading the code (even with incomplete knowledge of what values are set to). The fact that the same symbol may look up a column (or columns) by the name of the symbol, the index of a column stored in the symbol, or the names of columns stored in the symbol is going to be a source of bugs.

@JohnMount Totally agree, otherwise it may be confusing.

Thanks. Would you mind adding a comment to the linked issue in tidyselect?

I think this bug (and I very much agree that it is a bug, or at least a dangerous misfeature) is a duplicate of #2904 (‘select(df, colname) sometimes impersonates select_(df, .dots=colname)’).

There are two issues conflated here:

  • Support for vectors of column names in select(). This is consistent with the support for column index and we are likely not going to change this.

  • Lookup of symbols in data mask and then in lexical context. This is consistent with R semantics for data masking functions. We have tried changing it for select() and it was a disaster. See https://www.tidyverse.org/articles/2017/09/erratum-tidyr-0.7.0/

Is data mask a tidyeval concept or an R concept? I can't find the term in https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf .

Both. If you supply a list or data frame as second argument to base::eval() you are creating a data mask (i.e. an environment populated with objects from a data set that masks the lexical context).
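
A small base R sketch of that behavior (the object names are made up for illustration): symbols are looked up in the data frame first, and only then in the surrounding environment.

```r
df <- data.frame(a = 1:3, b = 4:6)
b <- 100  # local variable, masked by the column `b` of df
k <- 10   # local variable with no matching column in df

# Both `a` and `b` are found in the data frame; the local `b` is ignored.
masked <- eval(quote(a + b), df)

# `a` comes from the data frame, `k` falls through to the calling environment.
fallback <- eval(quote(a + k), df)

masked    # 5 7 9
fallback  # 11 12 13
```

The same expression can therefore change meaning when a column named like a local variable is added to the data, which is the ambiguity discussed in this thread.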

In general we advise always unquoting with !! when you're referring to an object from the lexical environment. This way you won't hit data frame objects.

There is usually no need to be explicit about data lookup but if you want to make sure you can subset from the .data pronoun.

@lionel-: I don't think we should support vectors of column names or indexes taken from variables in select(), because the behavior is also inconsistent for indexes. Okay to open an issue in tidyselect?

c <- 1:3
tidyselect::vars_select(letters, c)
#>   c 
#> "c"
cc <- 1:3
tidyselect::vars_select(letters, cc)
#>   a   b   c 
#> "a" "b" "c"

Created on 2018-02-28 by the reprex package (v0.2.0).

I don't mind explicit character vectors or explicit indexes:

tidyselect::vars_select(letters, c("a", "b", "c"))
#>   a   b   c 
#> "a" "b" "c"
tidyselect::vars_select(letters, 1:3)
#>   a   b   c 
#> "a" "b" "c"

Created on 2018-02-28 by the reprex package (v0.2.0).

I think we should not make any more changes to selection semantics. For better or worse data masking of symbols is standard semantics in (the fexpr subset of) R and we should just teach people to use !! to pick variables from the environment. WDYT @hadley?

We've had a lot of churn in the semantics of these functions, and I think we're currently at a good place (it's not perfect, but it's at least well thought out and systematic). I would prefer not to reconsider at this point.

Can we agree on showing a message in this case? The message is easy to turn off (with !!sym() or !!!syms()), and I think there's some potential for confusion otherwise:

library(tidyverse)
data <- tibble(a = 1:3, b = 4:6)
c <- "a"
data %>% select(c)
#> Error: `c` must evaluate to column positions or names, not a function
cc <- "a"
data %>% select(cc)
#> # A tibble: 3 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
data <- data %>% mutate(cc = 7:9)
data %>% select(cc)
#> # A tibble: 3 x 1
#>      cc
#>   <int>
#> 1     7
#> 2     8
#> 3     9
data %>% select(!!sym(cc))
#> # A tibble: 3 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3

Created on 2018-02-28 by the reprex package (v0.2.0).

(In this example, the first and second select() calls would give a message.)

I read https://www.tidyverse.org/articles/2017/09/erratum-tidyr-0.7.0/; I wasn't fully aware of the problem. I still think a message would be a way to make users aware that they are doing something potentially unsafe:

tidyselect::vars_select(letters, cc)
#> Note: Selecting a, b, c, because we found an object cc. 
#> Use !!!syms(cc) to remove this note and avoid potential confusion with a variable.
#>   a   b   c 
#> "a" "b" "c"

Either way, this issue belongs in tidyselect.

@jtrecenti: Would you mind opening an issue in tidyselect that summarizes and links to this issue?

Yes, that issue was based on my lack of understanding of the problem. All I'm suggesting now is to give a message if the user is doing something potentially ambiguous. It's up to you.

This ambiguity is general across all data-masking functions in R: lm(), ggplot2::aes(), tidyr, dplyr, etc. While select() is a bit different, I'm wary of making any more changes to evaluation.

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
