Running some cucumber tests with --compare-regex-engines=true led to these discrepancies
match_in_region(/(?-mix:\A([0-9]+)_([_a-z0-9]*)\.?([_a-z0-9]*)?\.rb\z)/, "20190116152522_enable_postgis_extension.rb"@UTF-8, 0, 42, false, true, 0) gate
0 - 42
0 - 14
15 - 39
-
20190116152522
enable_postgis_extension
but we expected
0 - 42
0 - 14
15 - 39
39 - 39
20190116152522
enable_postgis_extension
and
match_in_region(/(?-mix:\A([0-9]+)_([_a-z0-9]*)\.?([_a-z0-9]*)?\.rb\z)/, "20190116152523_create_schools.rb"@UTF-8, 0, 32, false, true, 0) gate
0 - 32
0 - 14
15 - 29
-
20190116152523
create_schools
but we expected
0 - 32
0 - 14
15 - 29
29 - 29
20190116152523
create_schools
and
match_in_region(/(?-mix:^0{2}?(00)?(44)?(0)?([1-357-9]\d{9}|[18]\d{8}|8\d{6})$)/, "07123456789"@UTF-8, 0, 11, false, true, 0) gate
NO MATCH
but we expected
0 - 11
-
-
0 - 1
1 - 11
0
7123456789
and
match_in_region(/(?-mix:^0{2}?(00)?44)/, "447123456789"@UTF-8, 0, 12, false, true, 0) gate
NO MATCH
but we expected
0 - 2
-
and
match_in_region(/(?-mix:^0{2}?(00)?(44)(0)?([1-357-9]\d{9}|[18]\d{8}|8\d{6})$)/, "447123456789"@UTF-8, 0, 12, false, true, 0) gate
NO MATCH
but we expected
0 - 12
-
0 - 2
-
2 - 12
44
7123456789
and
match_in_region(/(?-mix:^0{2}?(00)?(44)?(0)?([1-357-9]\d{9}|[18]\d{8}|8\d{6})$)/, "07123456789"@UTF-8, 0, 11, false, true, 0) gate
NO MATCH
but we expected
0 - 11
-
-
0 - 1
1 - 11
0
7123456789
and
match_in_region(/(?-mix:^0{2}?(00)?44)/, "447123456789"@UTF-8, 0, 12, false, true, 0) gate
NO MATCH
but we expected
0 - 2
-
and
match_in_region(/(?-mix:^0{2}?(00)?(44)(0)?([1-357-9]\d{9}|[18]\d{8}|8\d{6})$)/, "447123456789"@UTF-8, 0, 12, false, true, 0) gate
NO MATCH
but we expected
0 - 12
-
0 - 2
-
2 - 12
44
7123456789
Thank you for trying it out and the great bug report.
@jirkamarsik Could you take a look at this?
To make sure, does TRegex recognize (?-mix: (which is what Regexp#to_s will prepend to say there are no options)?
In the first and second examples, it seems ([_a-z0-9]*)? which matches "" sets no group offsets on TRegex, but it sets equal group offsets on Joni.
TRegex does support the (?-mix: syntax.
The issue in the first two cases is about the semantics of nesting quantifiers that match empty strings. If [_a-z0-9]* matches an empty string, then the outer quantifier, ([_a-z0-9]*)?, treats that as a failure and the capture group is not matched. This behavior is in some form part of most regex systems, as it prevents infinite loops when repeatedly matching empty strings in quantifiers such as *. Here, the behavior kicks in also for ?, which leads to different results from Ruby.
The issue in the rest of the cases is a parser issue. Ruby parses A{1,2}? as a non-greedy quantifier (because of the ?), equivalent to A|AA, but it parses A{2}? as two nested quantifiers, equivalent to AA|.
I'll work up for a fix for both. :-)
Any progress on this?
I'm happy to report that yes! I've been working on this over the course of January, as it turned out that thoroughly fixing the first of the two issues listed above involves mimicking some unforeseen nuances in the way Ruby implements regular expressions. I now have a sufficiently general implementation of the fix. I'll clean it up and push it to master soon.
The TRegex engine is rooted in an implementation of ECMAScript regular expressions, which have some minor but rather annoying semantic differences to Ruby regular expressions. If you're curious, below are some of the features of Ruby regular expressions that will now be correctly reflected in the TRegex engine.
? loop zero times, since executing it once would consume no characters. As a result, the contents of capture group 1 are null. On the other hand, Ruby executes the loop once, because the execution modifies the contents of capture group 1 so that it contains the empty string.Node.js (ECMAScript):
> /(a*)? /.exec("")
[ '', undefined, index: 0, input: '', groups: undefined ]
MRI (Ruby):
irb(main):001:1" /(a*)?/.match("")
=> #<MatchData "" 1:"">
This has now been fixed to provide the correct behavior, even in cases such as the following, when a loop can perform a series of empty matches in order to change the state of several capture groups and then continue by consuming more characters in subsequent iterations.
irb(main):001:0> /(a|\2b|\3()|())*/.match("aaabbb")
=> #<MatchData "aaabbb" 1:"" 2:"" 3:"">
Node.js (ECMAScript)
> /(?:|a)*/.exec('a')
[ 'a', index: 0, input: 'a', groups: undefined ]
MRI (Ruby):
irb(main):001:0> /(?:|a)*/.match('a')
=> #<MatchData "">
@jirkamarsik Should this be fixed by https://github.com/oracle/graal/commit/5389f3cc7077f24bcf62f1315530e445ec5646c7 ?
Just seen https://github.com/oracle/truffleruby/commit/a6282a3464017fc53aac61ee3bae8c42a9a2ff6f so will run the Cucumber tests again tomorrow after the nightly Docker image build that I use runs.
Yes, we were waiting that a6282a3464017fc53aac61ee3bae8c42a9a2ff6f merges so that TruffleRuby picks the fixes.
The only mismatch reported is listed below (it may have been been present before but I chose a representative sample of mismatches when posting this issue originally) .
The Cucumber test themselves pass now using either regex engine.
match_in_region(/(?x-mi:
^
([ ]*) # indentations
(.+) # key
(?::(?=(?:\s|$))) # : (without the lookahead the #key includes this when : is present in value)
[ ]?
(['"]?) # optional opening quote
(.*) # value
\3 # matching closing quote
$
)/, "BUNDLE_WITHOUT: "development""@UTF-8, 0, 29, false, true, 0) gate
0 - 29
0 - 0
0 - 14
16 - 17
17 - 28
-
BUNDLE_WITHOUT
"
development
but we expected
0 - 29
0 - 0
0 - 14
16 - 17
17 - 28
BUNDLE_WITHOUT
"
development
Thanks for running the test suite! We found a fix for this mismatch as well and the fix is now in the CI pipeline, so it should be merged and picked up by TruffleRuby soon.
Thanks again for testing the regex implementation!
The most recent build has fixed it.
Many Thanks.
Most helpful comment
I'm happy to report that yes! I've been working on this over the course of January, as it turned out that thoroughly fixing the first of the two issues listed above involves mimicking some unforeseen nuances in the way Ruby implements regular expressions. I now have a sufficiently general implementation of the fix. I'll clean it up and push it to
mastersoon.The TRegex engine is rooted in an implementation of ECMAScript regular expressions, which have some minor but rather annoying semantic differences to Ruby regular expressions. If you're curious, below are some of the features of Ruby regular expressions that will now be correctly reflected in the TRegex engine.
?loop zero times, since executing it once would consume no characters. As a result, the contents of capture group 1 are null. On the other hand, Ruby executes the loop once, because the execution modifies the contents of capture group 1 so that it contains the empty string.This has now been fixed to provide the correct behavior, even in cases such as the following, when a loop can perform a series of empty matches in order to change the state of several capture groups and then continue by consuming more characters in subsequent iterations.