Truffleruby: TRegex regular expressions engine comparison errors

Created on 28 Dec 2020 · 9Comments · Source: oracle/truffleruby

Running some cucumber tests with --compare-regex-engines=true led to these discrepancies

match_in_region(/(?-mix:\A([0-9]+)_([_a-z0-9]*)\.?([_a-z0-9]*)?\.rb\z)/, "20190116152522_enable_postgis_extension.rb"@UTF-8, 0, 42, false, true, 0) gate
    0 - 42
    0 - 14
    15 - 39
     - 
    20190116152522
    enable_postgis_extension

but we expected
    0 - 42
    0 - 14
    15 - 39
    39 - 39
    20190116152522
    enable_postgis_extension

and

match_in_region(/(?-mix:\A([0-9]+)_([_a-z0-9]*)\.?([_a-z0-9]*)?\.rb\z)/, "20190116152523_create_schools.rb"@UTF-8, 0, 32, false, true, 0) gate
    0 - 32
    0 - 14
    15 - 29
     - 
    20190116152523
    create_schools

but we expected
    0 - 32
    0 - 14
    15 - 29
    29 - 29
    20190116152523
    create_schools

and

match_in_region(/(?-mix:^0{2}?(00)?(44)?(0)?([1-357-9]\d{9}|[18]\d{8}|8\d{6})$)/, "07123456789"@UTF-8, 0, 11, false, true, 0) gate
    NO MATCH
but we expected
    0 - 11
     - 
     - 
    0 - 1
    1 - 11


    0
    7123456789

and

match_in_region(/(?-mix:^0{2}?(00)?44)/, "447123456789"@UTF-8, 0, 12, false, true, 0) gate
    NO MATCH
but we expected
    0 - 2
     -

and

match_in_region(/(?-mix:^0{2}?(00)?(44)(0)?([1-357-9]\d{9}|[18]\d{8}|8\d{6})$)/, "447123456789"@UTF-8, 0, 12, false, true, 0) gate
    NO MATCH
but we expected
    0 - 12
     - 
    0 - 2
     - 
    2 - 12

    44

    7123456789

and

match_in_region(/(?-mix:^0{2}?(00)?(44)?(0)?([1-357-9]\d{9}|[18]\d{8}|8\d{6})$)/, "07123456789"@UTF-8, 0, 11, false, true, 0) gate
    NO MATCH
but we expected
    0 - 11
     - 
     - 
    0 - 1
    1 - 11


    0
    7123456789

and

match_in_region(/(?-mix:^0{2}?(00)?44)/, "447123456789"@UTF-8, 0, 12, false, true, 0) gate
    NO MATCH
but we expected
    0 - 2
     -

and

match_in_region(/(?-mix:^0{2}?(00)?(44)(0)?([1-357-9]\d{9}|[18]\d{8}|8\d{6})$)/, "447123456789"@UTF-8, 0, 12, false, true, 0) gate
    NO MATCH
but we expected
    0 - 12
     - 
    0 - 2
     - 
    2 - 12

    44

    7123456789

compatibility

Source

petenorth

Most helpful comment

I'm happy to report that yes! I've been working on this over the course of January, as it turned out that thoroughly fixing the first of the two issues listed above involves mimicking some unforeseen nuances in the way Ruby implements regular expressions. I now have a sufficiently general implementation of the fix. I'll clean it up and push it to master soon.

The TRegex engine is rooted in an implementation of ECMAScript regular expressions, which have some minor but rather annoying semantic differences to Ruby regular expressions. If you're curious, below are some of the features of Ruby regular expressions that will now be correctly reflected in the TRegex engine.

In ECMAScript, when a loop matches the minimum required number of iterations, any further iterations are only matched provided they consume some characters from the input. This is a measure intended to stop infinite loops once they no longer consume any input. Ruby has a similar guard, but it admits extra iterations if they either consume characters or change the state of capture groups. Thus it is possible to have extra iterations that don't consume any characters but that store empty strings as matches of capture groups. In the example below, ECMAScript executes the outer ? loop zero times, since executing it once would consume no characters. As a result, the contents of capture group 1 are null. On the other hand, Ruby executes the loop once, because the execution modifies the contents of capture group 1 so that it contains the empty string.

Node.js (ECMAScript):
> /(a*)? /.exec("")
[ '', undefined, index: 0, input: '', groups: undefined ]

MRI (Ruby):
irb(main):001:1" /(a*)?/.match("")
=> #<MatchData "" 1:"">

This has now been fixed to provide the correct behavior, even in cases such as the following, when a loop can perform a series of empty matches in order to change the state of several capture groups and then continue by consuming more characters in subsequent iterations.

irb(main):001:0> /(a|\2b|\3()|())*/.match("aaabbb")
=> #<MatchData "aaabbb" 1:"" 2:"" 3:"">

In ECMAScript, when a loop fails the empty check (an iteration matches only the empty string), the engine terminates the loop by rejecting this branch and backtracking to another alternative (eventually backtracking to the point where it chooses not to re-enter the loop and consider it finished). On the other hand, in Ruby, when a loop fails the empty check (an iteration matches only the empty string and it does not modify the state of the capture groups), the engine continues with the current branch by proceeding to the continuation of the loop. Most notably, it doesn't try to backtrack and alter decisions made inside the loop until some future failure forces it to. This can be illustrated on the following example, where ECMAScript will backtrack into the loop and choose the second alternative, whereas Ruby will proceed with the empty match.

Node.js (ECMAScript)
> /(?:|a)*/.exec('a')
[ 'a', index: 0, input: 'a', groups: undefined ]

MRI (Ruby):
irb(main):001:0> /(?:|a)*/.match('a')
=> #<MatchData "">

jirkamarsik on 8 Feb 2021

🚀2

All 9 comments

Thank you for trying it out and the great bug report.

@jirkamarsik Could you take a look at this?

To make sure, does TRegex recognize (?-mix: (which is what Regexp#to_s will prepend to say there are no options)?
In the first and second examples, it seems ([_a-z0-9]*)? which matches "" sets no group offsets on TRegex, but it sets equal group offsets on Joni.

eregon on 4 Jan 2021

TRegex does support the (?-mix: syntax.

The issue in the first two cases is about the semantics of nesting quantifiers that match empty strings. If [_a-z0-9]* matches an empty string, then the outer quantifier, ([_a-z0-9]*)?, treats that as a failure and the capture group is not matched. This behavior is in some form part of most regex systems, as it prevents infinite loops when repeatedly matching empty strings in quantifiers such as *. Here, the behavior kicks in also for ?, which leads to different results from Ruby.

The issue in the rest of the cases is a parser issue. Ruby parses A{1,2}? as a non-greedy quantifier (because of the ?), equivalent to A|AA, but it parses A{2}? as two nested quantifiers, equivalent to AA|.

I'll work up for a fix for both. :-)

jirkamarsik on 4 Jan 2021

Any progress on this?

petenorth on 4 Feb 2021

In ECMAScript, when a loop matches the minimum required number of iterations, any further iterations are only matched provided they consume some characters from the input. This is a measure intended to stop infinite loops once they no longer consume any input. Ruby has a similar guard, but it admits extra iterations if they either consume characters or change the state of capture groups. Thus it is possible to have extra iterations that don't consume any characters but that store empty strings as matches of capture groups. In the example below, ECMAScript executes the outer ? loop zero times, since executing it once would consume no characters. As a result, the contents of capture group 1 are null. On the other hand, Ruby executes the loop once, because the execution modifies the contents of capture group 1 so that it contains the empty string.

Node.js (ECMAScript):
> /(a*)? /.exec("")
[ '', undefined, index: 0, input: '', groups: undefined ]

MRI (Ruby):
irb(main):001:1" /(a*)?/.match("")
=> #<MatchData "" 1:"">

irb(main):001:0> /(a|\2b|\3()|())*/.match("aaabbb")
=> #<MatchData "aaabbb" 1:"" 2:"" 3:"">

In ECMAScript, when a loop fails the empty check (an iteration matches only the empty string), the engine terminates the loop by rejecting this branch and backtracking to another alternative (eventually backtracking to the point where it chooses not to re-enter the loop and consider it finished). On the other hand, in Ruby, when a loop fails the empty check (an iteration matches only the empty string and it does not modify the state of the capture groups), the engine continues with the current branch by proceeding to the continuation of the loop. Most notably, it doesn't try to backtrack and alter decisions made inside the loop until some future failure forces it to. This can be illustrated on the following example, where ECMAScript will backtrack into the loop and choose the second alternative, whereas Ruby will proceed with the empty match.

Node.js (ECMAScript)
> /(?:|a)*/.exec('a')
[ 'a', index: 0, input: 'a', groups: undefined ]

MRI (Ruby):
irb(main):001:0> /(?:|a)*/.match('a')
=> #<MatchData "">

jirkamarsik on 8 Feb 2021

🚀2

@jirkamarsik Should this be fixed by https://github.com/oracle/graal/commit/5389f3cc7077f24bcf62f1315530e445ec5646c7 ?

Just seen https://github.com/oracle/truffleruby/commit/a6282a3464017fc53aac61ee3bae8c42a9a2ff6f so will run the Cucumber tests again tomorrow after the nightly Docker image build that I use runs.

petenorth on 26 Feb 2021

Yes, we were waiting that a6282a3464017fc53aac61ee3bae8c42a9a2ff6f merges so that TruffleRuby picks the fixes.

eregon on 27 Feb 2021

The only mismatch reported is listed below (it may have been been present before but I chose a representative sample of mismatches when posting this issue originally) .

The Cucumber test themselves pass now using either regex engine.

match_in_region(/(?x-mi:
      ^
      ([ ]*) # indentations
      (.+) # key
      (?::(?=(?:\s|$))) # :  (without the lookahead the #key includes this when : is present in value)
      [ ]?
      (['"]?) # optional opening quote
      (.*) # value
      \3 # matching closing quote
      $
    )/, "BUNDLE_WITHOUT: "development""@UTF-8, 0, 29, false, true, 0) gate
    0 - 29
    0 - 0
    0 - 14
    16 - 17
    17 - 28
     - 

    BUNDLE_WITHOUT
    "
    development

but we expected
    0 - 29
    0 - 0
    0 - 14
    16 - 17
    17 - 28

    BUNDLE_WITHOUT
    "
    development

petenorth on 28 Feb 2021

Thanks for running the test suite! We found a fix for this mismatch as well and the fix is now in the CI pipeline, so it should be merged and picked up by TruffleRuby soon.

Thanks again for testing the regex implementation!

jirkamarsik on 1 Mar 2021

👍1

The most recent build has fixed it.

Many Thanks.

petenorth on 4 Mar 2021

Was this page helpful?

0 / 5 - 0 ratings