Dplyr: Support more advanced patterns in matches()

Created on 16 Nov 2017  路  6Comments  路  Source: tidyverse/dplyr

I tried to use matches() to match a slightly complex regular expression but it didn't support it so I had to use stringr instead. It seems that matches() uses grep rather than stringr.

test <- tibble(
                       hh15_test = c(1,2,3), 
                       hh02_test = c(3,3,3), 
                       hh3_test=c(4,4,5), 
                       blah = c(1,2,3)
)

#Doesn't work
select(test, matches("(hh1)|(^(?!hh))"))

#Error in grep(needle, haystack, ...) : 
#  invalid regular expression '(hh1)|(^(?!hh))', reason 'Invalid regexp'

#Works
select(test, str_subset(current_vars(), "(hh1)|(^(?!hh))"))

# A tibble: 3 x 2
#  hh15_test  blah
      <dbl> <dbl>
#1         1     1
#2         2     2
#3         3     3
verbs

Most helpful comment

Hi,

looking at the source code, matches uses grep function to apply the regex and grep uses Extended Regular Expression by default. (see help("regex", package = "base")). It can use PERL regex with perl=T but not activated in dplyr's matches function.
stringr uses another engine called ICU (close to perl syntax). Based on that you'll understand there are severals flavors of regex and all does not have the same functionalities. Here, it seems base R regex doesn't know about negative lookahead - so it does not recognize the regex.

Here is demo codes for illustration of all this
EDIT: Modified reprex to be consistent with @mungojam previous editing

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-11-17

library(dplyr, warn.conflicts = F)
test <- tibble(
  hh15_test = c(1,2,3), 
  hh02_test = c(3,3,3), 
  hh3_test=c(4,4,5), 
  blah = c(1,2,3)
)
test
#> # A tibble: 3 x 4
#>   hh15_test hh02_test hh3_test  blah
#>       <dbl>     <dbl>    <dbl> <dbl>
#> 1         1         3        4     1
#> 2         2         3        4     2
#> 3         3         3        5     3
# does not recognized
select(test, matches("(hh1)|^(?!hh)"))
#> Error in grep(needle, haystack, ...): expression r茅guli猫re '(hh1)|^(?!hh)' incorrecte, 脿 cause de 'Invalid regexp'
# does not recognized either
grep("(hh1)|^(?!hh)", names(test))
#> Error in grep("(hh1)|^(?!hh)", names(test)): expression r茅guli猫re '(hh1)|^(?!hh)' incorrecte, 脿 cause de 'Invalid regexp'
# It is recognized now
grep("(hh1)|^(?!hh)", names(test), perl = T)
#> [1] 1 4
# stringr recognizes it too
stringr::str_subset(names(test), "(hh1)|^(?!hh)")
#> [1] "hh15_test" "blah"

At last, you said that this works
select(test, str_subset(current_vars(), "(hh1)|(?!hh)"))
but it selects all the columns. Is this the intended behaviour ?

In conclusion, your issue raises a question about which flavors dplyr should work with.
Also, maybe regex should work the same around tidyverse now that stringr is part of it and it seems that dplyr differs in term of regex. But as you can use stringr inside dplyr, maybe matches not needed. Let's see what the tidyverse team has to say on all this. I just wanted to add some informations on this issue.

All 6 comments

Until this most recent release of the tidyverse package, stringr hadn't been part of the "core" (i.e. wasn't attached unless loaded separately from tidyverse). There's also the dependency factor.

Hi,

looking at the source code, matches uses grep function to apply the regex and grep uses Extended Regular Expression by default. (see help("regex", package = "base")). It can use PERL regex with perl=T but not activated in dplyr's matches function.
stringr uses another engine called ICU (close to perl syntax). Based on that you'll understand there are severals flavors of regex and all does not have the same functionalities. Here, it seems base R regex doesn't know about negative lookahead - so it does not recognize the regex.

Here is demo codes for illustration of all this
EDIT: Modified reprex to be consistent with @mungojam previous editing

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-11-17

library(dplyr, warn.conflicts = F)
test <- tibble(
  hh15_test = c(1,2,3), 
  hh02_test = c(3,3,3), 
  hh3_test=c(4,4,5), 
  blah = c(1,2,3)
)
test
#> # A tibble: 3 x 4
#>   hh15_test hh02_test hh3_test  blah
#>       <dbl>     <dbl>    <dbl> <dbl>
#> 1         1         3        4     1
#> 2         2         3        4     2
#> 3         3         3        5     3
# does not recognized
select(test, matches("(hh1)|^(?!hh)"))
#> Error in grep(needle, haystack, ...): expression r茅guli猫re '(hh1)|^(?!hh)' incorrecte, 脿 cause de 'Invalid regexp'
# does not recognized either
grep("(hh1)|^(?!hh)", names(test))
#> Error in grep("(hh1)|^(?!hh)", names(test)): expression r茅guli猫re '(hh1)|^(?!hh)' incorrecte, 脿 cause de 'Invalid regexp'
# It is recognized now
grep("(hh1)|^(?!hh)", names(test), perl = T)
#> [1] 1 4
# stringr recognizes it too
stringr::str_subset(names(test), "(hh1)|^(?!hh)")
#> [1] "hh15_test" "blah"

At last, you said that this works
select(test, str_subset(current_vars(), "(hh1)|(?!hh)"))
but it selects all the columns. Is this the intended behaviour ?

In conclusion, your issue raises a question about which flavors dplyr should work with.
Also, maybe regex should work the same around tidyverse now that stringr is part of it and it seems that dplyr differs in term of regex. But as you can use stringr inside dplyr, maybe matches not needed. Let's see what the tidyverse team has to say on all this. I just wanted to add some informations on this issue.

Thanks for the digging @cderv. You've also spotted a bug in my simplified reproduction. The second case was actually throwing an error too. It's working now in the second case as expected.

@hadley: What flavor of regular expressions should we accept in matches()?

Moving this to https://github.com/tidyverse/tidyselect/issues/58 as matches was factored out of dplyr.

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

Was this page helpful?
0 / 5 - 0 ratings