Dplyr: Bad error for nonexistent column that happens to be named 'id'

Created on 16 Apr 2020  ·  9 Comments  ·  Source: tidyverse/dplyr

Discovered in actual usage: I had recently renamed a column that used to be called id.

I think the missing column id should generate the same error as foo, not delegate to a vctrs error.

library(dplyr)
#>     intersect, setdiff, setequal, union
packageVersion("dplyr")
#> [1] '0.8.99.9002'

dat <- tibble(x = 1)

select(dat, foo)
#> Error: Can't subset columns that don't exist.
#> x Column `foo` doesn't exist.

select(dat, id)
#> Error: `id()` was deprecated in dplyr 0.5.0 and is now defunct.
#> Please use `vctrs::vec_group_id()` instead.

Created on 2020-04-16 by the reprex package (v0.3.0.9001)

All 9 comments

I don't think we can do much about this. @lionel- might have some ideas.

This is by design: we thought it was more important to have a simple syntax for predicate selection than complete unambiguity regarding data frame columns. Maybe there's still time to reconsider before dplyr 1.0 if it turns out to be confusing. How likely is it that column names clash with existing functions?

For programming we have an unambiguous alternative: all_of("foo") or force(id).
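A minimal sketch of the all_of() alternative mentioned above, assuming dplyr 1.0 or later is installed. The tibbles and the id column are made up for illustration. A string passed through all_of() is only matched against column names, so it should not fall back to a function named id():

```r
library(dplyr)

# Illustrative data; the `id` column is an assumption for this example.
dat <- tibble(x = 1, id = 2)

# all_of() takes a character vector of column names, so "id" can only
# refer to the column, never to a function called id().
picked <- select(dat, all_of("id"))
names(picked)

# If the column is missing, selection fails instead of silently
# picking up a function of the same name.
no_id <- tibble(x = 1)
result <- tryCatch(
  select(no_id, all_of("id")),
  error = function(e) "column lookup failed"
)
result
```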

Could id perhaps be given a class that instructs tidyselect to ignore it?

I think we need a general strategy here. Another example: users might expect a column named data after some tidyr transformation. If it's not there for some reason, tidyselect will find utils::data instead and try to interpret it as a predicate.

I wonder if we could offer a hint here? I.e. when we execute a function and it errors, add a hint that's something like "Using function id since there's no variable id in your data frame".
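A rough sketch of how such a hint could be produced, in base R only. The helper hint_function_clash() is hypothetical and is not part of dplyr or tidyselect; it only illustrates the check "not a column, but a function of that name is visible":

```r
# Hypothetical helper (not a dplyr/tidyselect API): returns a hint string
# when `name` is not a column of `data` but a function called `name` can
# be found from here, and NULL otherwise.
hint_function_clash <- function(data, name) {
  is_column <- name %in% names(data)
  # mode = "function" skips non-function objects with the same name
  is_function <- exists(name, mode = "function")
  if (!is_column && is_function) {
    sprintf(
      "Using function `%s()` since there's no variable `%s` in your data frame.",
      name, name
    )
  } else {
    NULL
  }
}

dat <- data.frame(x = 1)
hint_function_clash(dat, "mean")  # a hint: `mean` is a function, not a column
hint_function_clash(dat, "x")     # NULL: `x` is a real column
```

A real implementation would presumably attach the hint to the error raised while evaluating the function, rather than checking up front.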

I just realized this is sort of our own version of “object of type closure is not subsettable”.

Unfortunately I think some of these problematic (non) variable names do come up a lot (id, data).

Another thing to consider regarding the syntax: some users expect to use lambda formulas as predicates:
https://github.com/r-lib/tidyselect/issues/187
https://stackoverflow.com/questions/61283841/cant-use-purrr-style-lambda-function-with-select-dplyr-1-0-0-dev.

This is relevant because one alternative is to use a function constructor to solve the ambiguity:

data %>% select(fn(~ is.numeric(.)))

# With hypothetical purrr function operators for creating union or intersection of predicates
data %>% select(or(~ is.numeric(.), ~ is.character(.)))
data %>% select(or(is.numeric, is.character))

Edit: Or we could just add support for formulas, but then the precedence of | and & might be confusing: they'd become part of the predicate function instead of joining predicates. E.g. ~ is.numeric(.) | ~ is.factor(.) is equivalent to ~ (is.numeric(.) | ~ is.factor(.)).
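The precedence point can be checked in base R by inspecting how the expression parses. This is only an illustration of R's operator precedence, not of anything tidyselect does:

```r
# Quote the expression so it is only parsed, never evaluated.
expr <- quote(~ is.numeric(.) | ~ is.factor(.))

# The outermost call is a single one-sided formula ...
outer_op <- as.character(expr[[1]])
# ... whose body is a call to `|` ...
inner_op <- as.character(expr[[2]][[1]])
# ... and the right-hand operand of `|` is the second `~`.
rhs_op <- as.character(expr[[2]][[3]][[1]])

c(outer_op, inner_op, rhs_op)
```

So the whole expression is one formula containing |, not two formulas joined by |.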

Just had this confusing error with pivot_longer():

Error: This tidyselect interface doesn't support predicates yet.
ℹ Contact the package author and suggest using `eval_select()`.

I was trying to pivot a variable named mean, which didn't exist. pivot_longer() doesn't support predicates yet, but if it did, it would have run mean() on the variables as if it were a predicate.

If we keep the simple syntax for predicates, we should at least improve the error message when the function returns numeric values:

iris %>% select(function(x) 1L)
#> Error: Can't coerce element 1 from a integer to a logical
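As a sketch of what a clearer check might look like, here is a base R validator for predicate results. check_predicate_result() and its message wording are hypothetical, not the actual tidyselect implementation:

```r
# Hypothetical helper: apply a predicate to one column and insist on a
# single non-missing TRUE/FALSE, naming the offending column and type.
check_predicate_result <- function(fn, column, name) {
  out <- fn(column)
  if (!is.logical(out) || length(out) != 1L || is.na(out)) {
    stop(
      sprintf(
        "Predicate must return a single `TRUE` or `FALSE` for column `%s`, not %s of length %d.",
        name, class(out)[1], length(out)
      ),
      call. = FALSE
    )
  }
  out
}

# A well-behaved predicate passes through unchanged.
ok <- check_predicate_result(is.numeric, iris$Sepal.Length, "Sepal.Length")

# A function returning an integer triggers the descriptive error.
msg <- tryCatch(
  check_predicate_result(function(x) 1L, iris$Sepal.Length, "Sepal.Length"),
  error = function(e) conditionMessage(e)
)
msg
```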