`tidyr::extract()` doesn't extract values correctly when the text in a column contains `\n`. In the example below, `a` and `d` are ignored.
library(magrittr)  # provides %>%
data.frame(x = "a\nb-c\nd") %>%
  tidyr::extract(x, into = c("ab", "cd"), regex = "(.+)-(.+)")
returns
ab cd
1 b c
instead of
ab cd
1 a\nb c\nd
`.` doesn't match line terminators. I think this is a limitation of tidyr's regex and a documentation problem like #693.
In regex, `.` does not match the newline character by default (the mode in which it does is sometimes called _dotall_). This is not specific to tidyr, and you can find more on it by searching the web. You need to explicitly look for this character using `\n` in your regex (or `\s` for any whitespace character).
Or, if the regex tool supports it, you can pass an option to make `.` match line terminators.
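To make the behaviour concrete, here is a minimal sketch that calls stringi directly (the variable names are only for illustration). It assumes the stringi package is installed and only checks whether each pattern matches:

```r
library(stringi)

s <- "a\nb"

# By default, `.` stops at a line terminator, so this does not match.
dot_default <- stri_detect_regex(s, "a.b")

# Naming the newline explicitly, or using the whitespace class, does match.
explicit_newline <- stri_detect_regex(s, "a\\nb")
whitespace_class <- stri_detect_regex(s, "a\\sb")

# With the dotall option set, `.` also matches the line terminator.
dot_dotall <- stri_detect_regex(
  s, "a.b",
  opts_regex = stri_opts_regex(dotall = TRUE)
)

print(c(dot_default, explicit_newline, whitespace_class, dot_dotall))
#> [1] FALSE  TRUE  TRUE  TRUE
```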
These alternative regexes will work:
library(tidyr)
data.frame(x="a\nb-c\nd") %>%
extract(x,into = c("ab","cd"), regex = "([\\w\\n]+)-([\\w\\n]+)")
#> ab cd
#> 1 a\nb c\nd
data.frame(x="a\nb-c\nd") %>%
extract(x,into = c("ab","cd"), regex = "([\\s\\S]+)-([\\s\\S]+)")
#> ab cd
#> 1 a\nb c\nd
data.frame(x="a\nb-c\nd") %>%
extract(x,into = c("ab","cd"), regex = "(.+\\n.+)-(.+\\n.+)")
#> ab cd
#> 1 a\nb c\nd
Using stringr, you can provide the option `dotall` so that `.` will match line terminators.
See `?stringr::regex`
stringr::str_match("a\nb-c\nd", pattern = stringr::regex("(.+)-(.+)", dotall = TRUE))[1, ]
#> [1] "a\nb-c\nd" "a\nb" "c\nd"
Created on 2019-09-18 by the reprex package (v0.3.0)
Currently, you can't pass any regex option to the underlying stringi function in tidyr.
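One possible workaround, sketched here outside of tidyr, is to call stringi yourself and attach the capture groups as columns. This is only an illustration: it assumes the stringi package is installed, reuses the example data from this issue, and is not what tidyr does internally.

```r
library(stringi)

df <- data.frame(x = "a\nb-c\nd", stringsAsFactors = FALSE)

# Match with the dotall option set explicitly, so `.` also matches "\n".
m <- stri_match_first_regex(
  df$x, "(.+)-(.+)",
  opts_regex = stri_opts_regex(dotall = TRUE)
)

# Column 1 of the result is the full match; columns 2 and 3 are the groups.
df$ab <- m[, 2]
df$cd <- m[, 3]
```

Unlike `extract()`, this keeps the original `x` column; drop it afterwards if you don't need it.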
You can make `.` match newlines by adding the flag `(?s)` at the beginning of the regex. You can turn it off again (within the same regular expression) with `(?-s)`. These options are documented in ?stringi::`stri_opts_regex`. If you look closely, each option in `stri_opts_regex()` lists the corresponding regex flag at the end of its description. They are also mentioned in
?stringi::`stringi-search-regex`
library(tidyr)  # also re-exports tibble()
text <- tibble(text = c('ab\n', '\n'))
extract(text, text, into = "newlines", regex = "(?s)(.)")
# # A tibble: 2 x 1
#   newlines
#   <chr>
# 1 a
# 2 "\n"
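Applied to the example from the top of this issue, the inline flag could be used as in the sketch below. It assumes tidyr is installed; the names `res` and `res_scoped` are only for illustration.

```r
library(tidyr)

df <- data.frame(x = "a\nb-c\nd", stringsAsFactors = FALSE)

# (?s) switches dotall on for the whole pattern.
res <- tidyr::extract(
  df, x, into = c("ab", "cd"),
  regex = "(?s)(.+)-(.+)"
)

# (?-s) switches it off again, so the second group stops at the newline.
res_scoped <- tidyr::extract(
  df, x, into = c("ab", "cd"),
  regex = "(?s)(.+)-(?-s)(.+)"
)
```

Here `res` should hold `"a\nb"` and `"c\nd"`, while in `res_scoped` the `cd` column should contain only `"c"`.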