Chapel: Regexp multiLine option not working as expected

Created on 15 May 2020 · 7 Comments · Source: chapel-lang/chapel

Summary of Problem

Regexp.compile seems to ignore multiLine=true.

Steps to Reproduce

Source Code:

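use Regexp;

config var multiLine = true;
// Compile once; the multiLine option is the setting under test.
const reLeadingWhitespace = compile("^a$", multiLine=multiLine);
{
  // A lone "a": should match with or without multiLine.
  const s = "a";
  var indents = reLeadingWhitespace.matches(s);
  writeln('indents.size:', indents.size);
}
{
  // Same input with a trailing newline: should match only when multiLine=true.
  const s = "a\n";
  var indents = reLeadingWhitespace.matches(s);
  writeln('indents.size:', indents.size);
}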

Output of running above program:

./example --multiLine=true
1
0
./example --multiLine=false
1
0

Expected output:

./example --multiLine=true
1
1
./example --multiLine=false
1
0

See associated regexr as a correctness reference.

Configuration Information

  • Output of chpl --version: chpl version 1.23.0 pre-release (188e1b3772)
Libraries / Modules Bug

All 7 comments

I chatted a bit about this issue with @e-kayrakli and @mppf offline.

Some next steps to consider:

  • Is opts.multiline == 'm' in the re2 wrapper code?
  • Can we write a standalone C program that uses the wrapper around the re2 C++ interface?

I dug a bit deeper and I think this is a documentation issue (re2 itself has
pretty limited documentation).

multiLine has no effect (it is hardwired to false) unless you are using POSIX
syntax, which we don't by default. The solution is either to use POSIX syntax by
passing posix=true to compile, or to use the m flag in your regular expression.
So, the above code can be rewritten as:

use Regexp;

config var multiLine = true;

var regexpString = "^a$";
if multiLine then
  regexpString = "(?m:"+regexpString+")";
else
  regexpString = "(?:"+regexpString+")";


const reLeadingWhitespace = compile(regexpString);
{
const s =   "a";
var indents = reLeadingWhitespace.matches(s);
writeln('indents.size:', indents.size);
}

{
const s =   "a\n";
var indents = reLeadingWhitespace.matches(s);
writeln('indents.size:', indents.size);
}

where the m flag is applied to the group.

See https://github.com/google/re2/blob/e48b461c1e3e09574300587672c2498b77bc24dc/re2/re2.h#L579-L585
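
As a quick cross-check of the semantics described above, here is a minimal sketch in Go. It rests on an assumption worth stating loudly: Go's standard regexp package accepts RE2 syntax but is a separate implementation, so this illustrates how the `(?m:...)` flag group behaves in that syntax and is not a test of the Chapel Regexp module or its re2 wrapper. The helper name countMatches is made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// countMatches is a hypothetical helper for this sketch: it returns the
// number of non-overlapping matches of pattern in s.
// Assumption: Go's regexp package stands in for re2 here because it
// accepts the same RE2 syntax; it is not the Chapel wrapper.
func countMatches(pattern, s string) int {
	re := regexp.MustCompile(pattern)
	return len(re.FindAllStringIndex(s, -1))
}

func main() {
	// Without the m flag, ^ and $ match only at the start and end of the text.
	// With (?m:...), they also match at the start and end of each line.
	for _, pattern := range []string{"^a$", "(?m:^a$)"} {
		for _, s := range []string{"a", "a\n"} {
			fmt.Printf("pattern=%q input=%q matches=%d\n",
				pattern, s, countMatches(pattern, s))
		}
	}
}
```

The counts come out as 1 and 0 for the plain pattern and 1 and 1 for the `(?m:...)` version, which lines up with the expected output listed in the issue.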

Things we can do:

  1. Document this and hope that the user reads it.
  2. Give a warning for cases re2 silently ignores.

We should definitely do 1.

I think we can also do 2, but it'll have to be a runtime warning, as these flags
are not params. We could add param overloads and issue compilerWarnings only for
those, but I am not sure the asymmetry is a good idea. Given that this is
probably not going to happen in performance-critical code, runtime warnings
shouldn't be a huge deal apart from the annoyance. We could also think about
adding a CHPL_REGEXP_QUIET setting or something similar to control the behavior.

@e-kayrakli - do you know a reason why we cannot just put this

if multiLine then regexpString = "(?m:"+regexpString+")";

in the regexp module itself based on the already existing multiline option?

I don't. And it may be an option.
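
To make the proposal above concrete, here is a rough sketch of what such a wrapper could look like. It is written in Go only because Go's regexp package accepts RE2 syntax and is easy to run standalone; the function name compileWithOptions and its signature are hypothetical and do not reflect the actual Chapel module code.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileWithOptions is a hypothetical stand-in for a module-level compile
// routine that honors a multiLine option by wrapping the pattern in a
// (?m:...) flag group. Assumption: Go's regexp (RE2 syntax) approximates re2.
func compileWithOptions(pattern string, multiLine bool) (*regexp.Regexp, error) {
	// Validate the user's pattern on its own first. Otherwise an invalid
	// pattern such as "a)(b" would turn into the valid "(?m:a)(b)" once wrapped.
	if _, err := regexp.Compile(pattern); err != nil {
		return nil, err
	}
	if multiLine {
		pattern = "(?m:" + pattern + ")"
	}
	return regexp.Compile(pattern)
}

func main() {
	for _, multiLine := range []bool{true, false} {
		re, err := compileWithOptions("^a$", multiLine)
		if err != nil {
			fmt.Println("compile error:", err)
			return
		}
		n := len(re.FindAllString("a\n", -1))
		fmt.Printf("multiLine=%v matches in %q: %d\n", multiLine, "a\n", n)
	}
}
```

One design point this sketch surfaces: blind string concatenation can change which patterns are accepted, so validating the original pattern before wrapping (as done here) or relying on the engine's own option would be safer than wrapping alone.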

But I don't have experience in passing flags to ~capture~ groups like this.

If the user-provided regexpString already contained groups, would wrapping those in (?m: ... ) be problematic? I don't know how nested groups work with re2.

I believe that the (?m:) syntax does not create a new capture group (but if it does there is other syntax to ask it not to do that).

That's right, fixed my comment above.

Nonetheless, I just wanted to express my uncertainty about introducing a nesting level with that approach. @ben-albrecht captured that much better than I did, though.
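
The capture-group and nesting questions raised above can be probed with a small experiment. The sketch below again uses Go's regexp package as a stand-in, on the assumption that its RE2 syntax support mirrors re2 closely enough for this purpose; the helpers captureCount and submatches are invented for the example, and the result says nothing definitive about the Chapel wrapper itself.

```go
package main

import (
	"fmt"
	"regexp"
)

// captureCount reports how many capturing groups a pattern defines.
// (Hypothetical helper for this sketch.)
func captureCount(pattern string) int {
	return regexp.MustCompile(pattern).NumSubexp()
}

// submatches returns every match of pattern in s together with its
// capture groups. (Hypothetical helper for this sketch.)
func submatches(pattern, s string) [][]string {
	return regexp.MustCompile(pattern).FindAllStringSubmatch(s, -1)
}

func main() {
	plain := `^(a)(b+)$`
	wrapped := `(?m:` + plain + `)` // user pattern with groups, nested in a flag group

	// The (?m:...) wrapper is a non-capturing flag group, so the number of
	// capture groups is unchanged and existing group indices stay the same.
	fmt.Println("capture groups:", captureCount(plain), captureCount(wrapped))

	// Nested user groups still capture as before, now once per line.
	for _, m := range submatches(wrapped, "ab\nabb\n") {
		fmt.Printf("match=%q group1=%q group2=%q\n", m[0], m[1], m[2])
	}
}
```

In this stand-in, both patterns report two capture groups, and the wrapped pattern matches each line with its original groups intact, which supports the point that `(?m:...)` adds a nesting level without adding a capture group.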
