Crystal: Scan should be multiline by default?

Created on 8 Aug 2019  路  30Comments  路  Source: crystal-lang/crystal

I wanted to search all lines that started with something ("foo"), I wrote this in Ruby:

string = "foo bar\nfoo baz"
p string.scan(/^foo/) # => ["foo", "foo"]

But in Crystal the above result is:

[Regex::MatchData("foo")]

Ignoring the fact that Crystal returns a different type (which makes sense for Crystal), should scan always do a multiline lookup? It works the same as in Ruby if I do:

string = "foo bar\nfoo baz"
p string.scan(/^foo/m) # note the m modifier

Most helpful comment

Just a huge breaking change

Yes, though I can imagine we'll eventually switch to oniguruma or onigmo, and it's better to make these breaking changes before 1.0. Plus I think Crystal isn't that popular yet to have many apps using regex in a way that it will break many things. Maybe yes, but then again this is the time to make breaking changes.

All 30 comments

I didn't do a deep dive into the code yet since it's kind of involved, but this comment here: https://github.com/ruby/ruby/blob/master/string.c#L9282-L9284 seems to me to suggest that they continue running through the string on a non-match, while we just bail in that case: https://github.com/crystal-lang/crystal/blob/master/src/string.cr#L3722

@jhass, I'm not sure how your comment relates to the original post.

The suggestion is OK but has many caveats.

  1. how to do non-multiline scan then?
  2. backwards-incompatible.

I think what @jhass pointed out is relevant: when we don't find a next match we stop searching, while ruby always advanced by at least one char. That's why /^foo/ will eventually find a match in the next line(s).

I think it's more intuitive for scan to scan everything, all lines, because the name implies that, even if the /m modifier becomes a bit more redundant. Then if you only want to do a non-multiline scan you can just get the first line and scan over that.

I'm pretty sure there's a reason Ruby chose that, and when I found out the difference I was surprised it didn't work as I expected it (in Crystal).

@oprypin If you only want to look at the first line, I suggest something like

string =~ /\A.*foo/

I suppose the heart of it is the question about what the /m stands for - is it about allowing single matches to span multiple lines (I think it is), or is it about what part of the string that is matched against that will be used.

@asterite Sounds reasonable :+1:

Actually, this:

"foo\nbar" =~ /^bar/

gives 4 in Ruby and nil in Crystal... yet in Ruby you have the m modifier. I guess I don't understand how this works in Ruby and what is the m for... Oh, here it is:

/pat/m - Treat a newline as a character matched by .

But I don't know why adding m works in my original example.

Then I tried this:

$ irb
irb(main):001:0> "foo\nbar" =~ /.*bar/ # ?? but I thought . didn't match newline here...
=> 4
irb(main):002:0> "\nbar" =~ /.bar/m
=> 0
irb(main):003:0> "\nbar" =~ /.bar/ # This is fine...
=> nil

I don't know... I think we are missing something...

"foo\nbar" =~ /.*bar/ # ?? but I thought . didn't match newline here...

Try with .+, not .*. Then I bet it won't match

In PCRE, Python and many others,
MULTILINE (m) means that ^ can match at line bounds, etc;
DOTALL means that . can match a newline

And I guess Ruby always has MULTILINE on?
And has m stand for DOTALL?

@asterite regarding the first example, .* doesn't match newline there. It doesn't match anything at all (ie the empty string).

Oprypin: In Ruby, ^ always match any line boundary, yes. You have to use \A to only match start of string.

In PCRE, Python and many others,
MULTILINE (m) means that ^ can match at line bounds, etc;
DOTALL means that . can match a newline
And I guess Ruby always has MULTILINE on?
And has m stand for DOTALL?

It seems that's the case. I wonder what happened in Ruby 1.8.7 because they used PCRE, but I can't seem to install it via rbenv (I get an error). Maybe they changed the meaning of MULTILINE between 1.8.7 and 1.9.

In any case, I think we should copy Ruby here: MULTILINE always on, and the m modifier means DOTALL.

In my (very little) experience using regex, I never cared about the m modifier until a regex stops working and then I remember "oh, yeah, I always want to match multiple lines, it's just that in the other cases it wasn't needed".

I'd agree with this sentiment.
Just a huge breaking change and also more difficulty in matching start/end of string.
It could actually cause hidden security issues. If some sanitization was checking the whole string with ^$ then maybe now it can let through arbitrary text as long as there's a newline...

Don't forget about confusion with PCRE where (?m) will always mean something else even if we translate //m to be DOTALL

I guess when creating a Regex we could check the start and just replace (?m) with (?s) (well, whenever m appears in that starting parentheses).

I found Ruby's way very intuitive so I'd like to use it in Crystal too. Plus people coming from Ruby will find it similar, and then we avoid confusions.

I think we should slowly start defining how a regex works independently of the underlying engine, although that's a super hard task (a regex is a beast with tons of options and modifiers).

It also appears oniguruma outperforms all other engines, so an option is to switch to that: https://stackoverflow.com/questions/11996988/how-good-is-oniguruma-compared-to-other-cross-platform-regexp-libraries

But it's probably an even harder task...

" [..] and also more difficulty in matching start/end of string."

No, because \A / \zmatches that.

Just a huge breaking change

Yes, though I can imagine we'll eventually switch to oniguruma or onigmo, and it's better to make these breaking changes before 1.0. Plus I think Crystal isn't that popular yet to have many apps using regex in a way that it will break many things. Maybe yes, but then again this is the time to make breaking changes.

The fact that ^ and $ in Ruby's regular expressions means multiline implicitly also leads to this kind of docs needed https://guides.rubyonrails.org/security.html#regular-expressions

So this kind of behavior should be at least a surprise for many newcomers. Might be or might be not a big issue of course.

We can use the same security guide and educate people, or make multiline more explicit and avoid this surprise (maybe).

Making Crystal regular expressions work the same as in Ruby would make Crystal easier to pick up for Ruby developers.

The problem is that there is no easy alternative to consistently match beginng/end of line. You'd need some weird combination of ^/$ with \r and \n. Using ^ and $ for beginning and end of line makes this really simple.

FYI regarding

I guess when creating a Regex we could check the start and just replace (?m) with (?s) (well, whenever m appears in that starting parentheses).

(? modifiers are allowed in the middle of a regex as well, so going down that path appears unviable.

https://regex101.com/r/CzzUBo/1/

@oprypin Good point.

I guess the only alternatives are:

  • keep with /m being multiline, like it was in Ruby 1.8
  • use oniguruma (or onigmo) instead of PCRE

When I have some time I'd like to try oniguruma. If it's faster and more intuitive (and I think they support unicode classes like [[:alpha:]] or something like that) then I'm all for it.

Of course if anyone wants to try creating bindings for oniguruma and replacing Regex/Match with it then please do so! oniguruma doesn't seem to be documented a lot and we'll have to use the samples they provide.

use oniguruma (or onigmo) instead of PCRE

馃憤 鉂わ笍

No, there's also the option of applying this change anyway. With two sub-options:

  • Just accept that PCRE disagrees.
  • Do not add any conflict with
    PCRE, just make MULTILINE the default without any other alterations.

You mean, making MULTILINE be true by default, and changing /m to mean DOTALL, but do nothing with (?m)?

Yes.
And maybe not even changing /m to mean DOTALL -- that's the 2nd bullet point.

Then what would /m mean? Or we just remove it?

I fiddled a bit with oniguruma, it seems to be slower than PCRE... 馃槥

Useful reference here btw, with a whole bunch of "except in Ruby"
https://www.rexegg.com/regex-disambiguation.html#modifiers

Was this page helpful?
0 / 5 - 0 ratings

Related issues

asterite picture asterite  路  71Comments

malte-v picture malte-v  路  77Comments

MakeNowJust picture MakeNowJust  路  64Comments

HCLarsen picture HCLarsen  路  162Comments

akzhan picture akzhan  路  67Comments