Data.table: Scope for %plike%

Created on 13 Jul 2019 · 12Comments · Source: Rdatatable/data.table

Just reading the new dev notes and noticed #3333. I was going to actually feature request %likep% (would make sense to conform to %plike%) the other day, but decided against it (thought maybe the consensus was that less convenience wrappers were more ideal for data.table. Any particular reason why data.table can't incorporate another, leveraging the perl = TRUE argument?

Often you get considerable speed improvements, and a bunch of other features / behaviors

# Following packages required .
# install.packages(c("stringi", "microbenchmark")

# load data.table.
library(data.table)

# Create a data.table of 100,000 random strings (20 chars in length).
DT = data.table(x = stringi::stri_rand_strings(100000, 20))

# Define a trivial regex pattern.
regex_pattern = "car|blah|far|nah"

# Create an alternative to %like% that sets `perl = TRUE`.
`%likep%` = function (vector, pattern) {
    if (is.factor(vector)) {
        as.integer(vector) %in% grep(pattern, levels(vector), perl = TRUE)
    }
    else {
        grepl(pattern, vector, perl = TRUE)
    }
}

# Microbenchmark the results to demonstrate speed improvements.
microbenchmark::microbenchmark(like = {(DT[x %like% regex_pattern])}, likep = (DT[x %likep% regex_pattern]))
# Unit: milliseconds
#   expr     min       lq     mean   median       uq      max neval
#   like 84.1235 86.56265 91.51547 87.74410 91.16710 159.6292   100
#  likep 16.0932 16.64750 17.81476 16.95985 17.82195  34.1415   100

Source

KyleHaynes

All 12 comments

maybe we could just set Perl=TRUE by default to current %like%? off the top
of my head, that shouldn't break anything?

On Sat, Jul 13, 2019, 10:53 AM Kyle Haynes notifications@github.com wrote:

Just reading the new dev notes and noticed #3333
https://github.com/Rdatatable/data.table/issues/3333. I was going to
actually feature request %likep% (would make sense to conform to %plike%)
the other day, but decided against it (thought maybe the consensus was that
less convenience wrappers were more ideal for data.table. Any particular
reason why data.table can't incorporate another, leveraging the perl =
TRUE argument?

Often you get considerable speed improvements, and a bunch of other
features / behaviors
https://stackoverflow.com/questions/47240375/regular-expressions-in-base-r-perl-true-vs-the-default-pcre-vs-tre

Following packages required .# install.packages(c("stringi", "microbenchmark")

load data.table.

library(data.table)

Create a data.table of 100,000 random strings (20 chars in length).DT = data.table(x = stringi::stri_rand_strings(100000, 20))

Define a trivial regex pattern.regex_pattern = "car|blah|far|nah"

Create an alternative to %like% that sets perl = TRUE.

%likep% = function (vector, pattern) {
if (is.factor(vector)) {
as.integer(vector) %in% grep(pattern, levels(vector), perl = TRUE)
}
else {
grepl(pattern, vector, perl = T)
}
}

Microbenchmark the results to demonstrate speed improvements.microbenchmark::microbenchmark(like = {(DT[x %like% regex_pattern])}, likep = (DT[x %likep% regex_pattern]))# Unit: milliseconds# expr min lq mean median uq max neval# like 84.1235 86.56265 91.51547 87.74410 91.16710 159.6292 100# likep 16.0932 16.64750 17.81476 16.95985 17.82195 34.1415 100

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/Rdatatable/data.table/issues/3702?email_source=notifications&email_token=AB2BA5OBGQC2EUF4P2W7XA3P7E7SLA5CNFSM4ICVLFI2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4G6752RQ,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AB2BA5MB3YQFK22HOOV76CLP7E7SLANCNFSM4ICVLFIQ
.

MichaelChirico on 13 Jul 2019

It would throw a warning in %flike% as it's conflicting to declare both as TRUE.

My only justification for another function is to ...

Retain the default behavior of %like%, so backwards compatibility
It feels right that %like% should inherit base-R behavior.

The above points are obviously my own opinions, not sure if they completely reflect that of the data.table authors / users..?

KyleHaynes on 13 Jul 2019

👍1

I'm not sure there's any backwards compatibility issue (in my experience
Perl regex in R are a superset of non-perl?)

if so, pcre are (a) more flexible and as you say (b) more efficient, don't
see why not to diverge from base (we have like already, after all)

like() would retain Perl option and %flike% would be made to have the
underlying like() call with Perl=false

On Sun, Jul 14, 2019, 4:15 AM Kyle Haynes notifications@github.com wrote:

It would throw a warning in %flike% as it's conflicting to declare both
as TRUE.

My only justification for another function is to ...

Retain the default behavior of %like%, so backwards compatibility

It feels right that %like% should inherit base-R behavior.

The above points are obviously my own opinions, not sure if they
completely reflect that of the data.table authors / users..?

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/Rdatatable/data.table/issues/3702?email_source=notifications&email_token=AB2BA5LJOGVFB3X64BI6I3TP7IZVJA5CNFSM4ICVLFI2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZ3YUVI#issuecomment-511150677,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AB2BA5LDDNWBBFDKLRK2JXLP7IZVJANCNFSM4ICVLFIQ
.

MichaelChirico on 13 Jul 2019

I'm not sure there's any backwards compatibility issue (in my experience
Perl regex in R are a superset of non-perl?)

Not entirely. The Perl variant often appears to extend on TRE (mainly with lookarounds), but, they do have slight different behaviors:

# TRE matches . to any literal character, including line break:
grepl(".", "\n", perl = F)
[1] TRUE

# Perl does not:
grepl(".", "\n", perl = T)
[1] FALSE

TRE also offers stuff that Perl doesn't. E.g.:

grepl("[:alpha:]","a", perl = F)
[1] TRUE

grepl("[:alpha:]","a", perl = T)
#Error in grepl("[:alpha:]", "1", perl = T) : 
# invalid regular expression '[:alpha:]'
# In addition: Warning message:
# In grepl("[:alpha:]", "1", perl = T) : PCRE pattern compilation error
#       'POSIX named classes are supported only within a class'
#        at '[:alpha:]'

if so, pcre are (a) more flexible and as you say (b) more efficient, don't
see why not to diverge from base (we have like already, after all)

I might be being daft but %like% doesn't diverge, like it's a convenience wrapper obviously, bu it currently inherits all default arguments?

like() would retain Perl option and %flike% would be made to have the
underlying like() call with Perl=false

Yeah, that's certainly possible.

I'm happy with whatever direction you guys wanna go in, I just would be super stoked if perl = TRUE was incorporated either in %like% or as another convenience wrapper.

KyleHaynes on 14 Jul 2019

bu it currently inherits all default arguments?

yes, what I'm getting at is that like() is ~95% a synonym for grepl(), one of the advantages of having split from that is we can choose our own defaults (with an eye to backwards compatibility)

thanks for spelling out the pre/tre differences so clearly; I'm guessing that in fact we'd be breaking some code if we changed the default. the examples you gave, is there a workaround to accomplish the same RE in perl? if so, and we're really insistent on using perl, we could do a deprecation cycle.

I'm not opposed to %plike%, just trying to make sure there's some orthogonal value-add as discipline to prevent the %[a-z]*like% family from exploding

MichaelChirico on 14 Jul 2019

yes, what I'm getting at is that like() is ~95% a synonym for grepl(), one of the advantages of having split from that is we can choose our own defaults (with an eye to backwards compatibility)

Ahh right. That makes sense. If the direction was to be insistent on using perl in %like%, would a workaround be to declare the arguments in the convenience function and have doco to indicate how easy it is to change these (if required), or is that too wishy-washy?

# %like%, new behavior as `perl = TRUE`:
`%like%` = function (vector, pattern, perl = TRUE) {
    if (is.factor(vector)) {
        as.integer(vector) %in% grep(pattern, levels(vector), perl = perl)
    }
    else {
        grepl(pattern, vector, perl = perl)
    }
}

# Some old dataset.
iris = as.data.table(copy(iris))
iris[, x := "\n"]

# Check that . matches any character, like it used to ...
iris[x %like% "."]
# Ohh no! It doesn't. As a user of R that's aware default arguments can change overtime, I'd immediately look at help ?'%like%' to which it would indicate if you want to revert to previous behavior then do the following...

# Change default back to FALSE.
formals(`%like%`)$perl = FALSE

# Re-run.
iris[x %like% "."]
# Phew, back to previous behavior.

I'm not opposed to %plike%, just trying to make sure there's some orthogonal value-add as discipline to prevent the %[a-z]*like% family from exploding

Totally agree, less is more. If you consider the above a potential solution, you could pass all arguments and allow the user to dictate if required. To extend on this I already pondered ignore.case = T & perl = T (as I often use it with grepl), but having another convenience function around that would probably result in it being less of a convenience and more of a burden of remembering what it was called.

KyleHaynes on 14 Jul 2019

For the examples above at least, you can make TRE and PCRE match with small modifications. If only these and other minor differences existed, you could introduce data.table options which prefixed the pattern with the appropriate modifier.

Adding some of these to the ?like manual page might be a good idea. ?grepl and ?regex have some of the densest, arcane help pages -- with excellent reason.

> # alternates "single-line" modifier from the one used at compile time.
> # Use (?m) instead of (?s) for multi-line (Though this is the default for R.)
> # Can be combined, so (?im) sets caseless matching.
> grepl("(?s).", "\n", perl = TRUE)
[1] TRUE
> grepl(".", "\n", perl = FALSE)
[1] TRUE
> # Don't match new-line under TRE
> grepl("(?n).", "\n", perl = FALSE)
[1] FALSE
> 
> # Enclose character classes within an extra set of brackets.
> # Double brackets also works under TRE.
> grepl("[[:alpha:]]","a", perl = TRUE)
[1] TRUE
> grepl("[:alpha:]","a", perl = FALSE)
[1] TRUE

Not mentioned here, but in #3552 was caseless and fixed string matching. You can indicate caseless using (?i).

> grepl("this does not match", 'THIS DOES NOT MATCH')
[1] FALSE
> grepl('(?i)this does match', 'THIS DOES MATCH')
[1] TRUE

I'm not aware of a flag to indicate fixed matching, though wrapping regex ... in \\Q...\\E will indicate to the engine it contains no meta characters (except PERL uses $ and @ for interpolation inside this, see ?regex in R). I don't think this is the same as passing fixed = TRUE, because there is virtually no speed difference.

smingerson on 14 Jul 2019

@smingerson thanks for you input, that's super neat! I'm totally happy to let @MichaelChirico indicate the best way forward, as I have no real strong opinions and am obviously not a key contributor to (my fav R package) data.table.

Adding some of these to the ?like manual page might be a good idea. ?grepl and ?regex have some of the densest, arcane help pages -- with excellent reason.

I do agree with both of your points as well.

KyleHaynes on 15 Jul 2019

One major feature PCRE has that TRE does not (as far as I can tell), is recursion. I've never had occasion to use this, but it looks like one major difference.

# Example regex from https://www.regular-expressions.info/recurse.html
to_match <- "(((why yes))))"
grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = TRUE)
#> [1] TRUE
# Does not match because parentheses are unbalanced
# and we require them to be by the addition of [^)] at the end.
grepl("\\((?>[^()]|(?R))*\\)[^)]", to_match, perl = TRUE)
#> [1] FALSE

grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = FALSE)
#> Error in grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = FALSE): invalid regular expression '\((?>[^()]|(?R))*\)', reason 'Invalid regexp'

TRE has a feature PCRE lacks in approximate regex matching, but since that is not accessed through grepl() I don't know whether that point has any value.

smingerson on 16 Jul 2019

I think it is much safer to leave %like% as it is now, and just add %plike%.

jangorecki on 30 Nov 2019

👍1

@MichaelChirico - if you agree, I'm more than happy to create a pull request.

KyleHaynes on 18 Dec 2019

👍1

@KyleHaynes you are welcome, you are already in contributors team so please push to Rdatatable/data.table repo directly

jangorecki on 18 Dec 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

FR: assign names to CJ() like data.table() does

franknarf1 · 3Comments

fread Unable to handle mis-quoted field if it is out-of-sample

st-pasha · 3Comments

fread may fail to parse valid file when dec=','

st-pasha · 3Comments

Regression in unique.data.table() as of data.table 1.12.0

jameslamb · 3Comments

grouping sets

jangorecki · 3Comments

Data.table: Scope for %plike%

All 12 comments

Following packages required .# install.packages(c("stringi", "microbenchmark")

load data.table.

Create a data.table of 100,000 random strings (20 chars in length).DT = data.table(x = stringi::stri_rand_strings(100000, 20))

Define a trivial regex pattern.regex_pattern = "car|blah|far|nah"

Create an alternative to %like% that sets perl = TRUE.

Related issues

Create an alternative to %like% that sets `perl = TRUE`.