Tidyr: `extract` doesn't work properly when text contains `\n`

Created on 18 Sep 2019  ·  3 Comments  ·  Source: tidyverse/tidyr

tidyr::extract() doesn't extract values correctly when text in column contains \n. In the example below, a and d are ignored.

data.frame(x = "a\nb-c\nd") %>% 
    tidyr::extract(x, into = c("ab", "cd"), regex = "(.+)-(.+)")

returns

  ab cd
1 b c

instead of

    ab  cd
1 a\nb c\nd

All 3 comments

. doesn't match line terminators. I think this is a limitation of tidyr's regex and a documentation problem like #693.

In most regex engines, . does not match the newline character unless a mode usually called _dotall_ is enabled. This is not specific to tidyr, and a web search for "dotall" turns up plenty of background. You need to match the newline explicitly using \n in your regex (or \s for any whitespace character).
Alternatively, if the regex tool supports it, you can pass an option that makes . match line terminators.
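
The difference can be seen without tidyr at all. The following is a minimal sketch in base R, assuming the PCRE engine (`perl = TRUE`); R's default engine may treat . differently, so this illustrates the general regex behaviour rather than what tidyr does internally. The variable names `plain` and `dotall` are only for illustration.

```r
x <- "a\nb-c\nd"

# Default PCRE behaviour: . stops at a newline, so only "b" and "c" are captured.
plain <- regmatches(x, regexec("(.+)-(.+)", x, perl = TRUE))[[1]]

# With the inline (?s) flag, . also matches "\n", so both groups span the line break.
dotall <- regmatches(x, regexec("(?s)(.+)-(.+)", x, perl = TRUE))[[1]]

print(plain)
print(dotall)
```

The first element of each result is the full match and the remaining elements are the capture groups.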

Any of these alternative regexes will work:

library(tidyr)
data.frame(x="a\nb-c\nd") %>% 
  extract(x,into = c("ab","cd"), regex = "([\\w\\n]+)-([\\w\\n]+)")
#>     ab   cd
#> 1 a\nb c\nd
data.frame(x="a\nb-c\nd") %>% 
  extract(x,into = c("ab","cd"), regex = "([\\s\\S]+)-([\\s\\S]+)")
#>     ab   cd
#> 1 a\nb c\nd
data.frame(x="a\nb-c\nd") %>% 
  extract(x,into = c("ab","cd"), regex = "(.+\\n.+)-(.+\\n.+)")
#>     ab   cd
#> 1 a\nb c\nd

With stringr, you can set the dotall option so that . matches line terminators.
See ?stringr::regex

stringr::str_match("a\nb-c\nd", pattern = stringr::regex("(.+)-(.+)", dotall = TRUE))[1, ]
#> [1] "a\nb-c\nd" "a\nb"      "c\nd"

Created on 2019-09-18 by the reprex package (v0.3.0)

Currently, tidyr gives you no way to pass regex options to the underlying stringi function.
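
If an option object is needed rather than a modified pattern, one possible workaround is to call stringi directly. This is a sketch that assumes the stringi package is installed; it bypasses tidyr entirely, and the variable name `m` is only for illustration.

```r
library(stringi)

x <- "a\nb-c\nd"

# dotall = TRUE lets . match line terminators for this call only.
m <- stri_match_first_regex(
  x,
  "(.+)-(.+)",
  opts_regex = stri_opts_regex(dotall = TRUE)
)

# m is a character matrix: column 1 is the full match, columns 2 and 3 are the groups.
print(m[1, 2])
print(m[1, 3])
```

The captured columns could then be bound back onto the data frame by hand.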

You can make . match newlines by adding the flag (?s) at the beginning of the regex. You can turn it off again (within the same regular expression) with (?-s). These options are documented in ?stringi::`stri_opts_regex`. If you look closely, each option in stri_opts_regex() lists the corresponding regex flag at the end of its description. They are also mentioned in
?stringi::`stringi-search-regex`

library(tidyr)

text <- tibble(text = c('ab\n', '\n'))
extract(text, text, into = "newlines", regex = "(?s)(.)")
#> # A tibble: 2 x 1
#>   newlines
#>   <chr>   
#> 1 a       
#> 2 "\n"
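
Applied to the data from the original report, the same inline flag should give the result that was expected there. This is a sketch, assuming tidyr is installed; `df` and `res` are illustrative names, and `stringsAsFactors = FALSE` is set only so the column is a character vector on older R versions.

```r
library(tidyr)

df <- data.frame(x = "a\nb-c\nd", stringsAsFactors = FALSE)

# (?s) turns on dotall for the whole pattern, so each (.+) can span the newline.
res <- extract(df, x, into = c("ab", "cd"), regex = "(?s)(.+)-(.+)")

print(res$ab)
print(res$cd)
```

Because the flag lives inside the pattern itself, no extra argument to extract() is needed.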