Crystal: Support match? : Bool for String and Regex

Created on 27 Mar 2020 · 5 Comments · Source: crystal-lang/crystal

Ruby

In Ruby, String and Regexp both have a match? method which returns true or false depending on whether there is a match, rather than a MatchData.

This is quite useful, as often all you want to know is _if_ there is a match or not, rather than the details.

So, for example, if I want to know whether all the letters in a content string are in caps (e.g. "yelling") without needing the details,
I'd just like to call match? and get true or false back.

e.g.

[6] pry(main)> 'THE dog IS NOT IN CAPS'.match?(/^[^a-z]*[A-Z][^a-z]*$/)
=> false
[7] pry(main)> 'THE CAT IS IN CAPS'.match?(/^[^a-z]*[A-Z][^a-z]*$/)
=> true
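
As a minimal sketch of the difference described above, the following Ruby snippet puts match? next to match. The constant YELLING and the helpers yelling? and yelling_details are hypothetical names made up for this illustration; match? requires Ruby 2.4 or later.

```ruby
# Hypothetical names for illustration only: YELLING, yelling?, yelling_details.
YELLING = /^[^a-z]*[A-Z][^a-z]*$/

# match? answers only "does it match?" and does not set $~.
def yelling?(text)
  text.match?(YELLING)
end

# match returns a MatchData on success and nil on failure.
def yelling_details(text)
  text.match(YELLING)
end

puts yelling?("THE dog IS NOT IN CAPS")
puts yelling?("THE CAT IS IN CAPS")
p yelling_details("THE dog IS NOT IN CAPS")
puts yelling_details("THE CAT IS IN CAPS").class
```

The first two calls print false and true, as in the pry session; the last two show that match yields nil or a MatchData instead of a Boolean.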

Crystal

At the moment String and Regex only support match : MatchData?, not match?.

Achieving the same in Crystal ("spare me the details - just tell me if it matches") is possible with a double bang (!!)...

$ cat issue.cr 
puts !!"THE dog IS NOT IN CAPS".match(/^[^a-z]*[A-Z][^a-z]*$/)
puts !!"THE CAT IS IN CAPS".match(/^[^a-z]*[A-Z][^a-z]*$/)
$ crystal issue.cr 
false
true

...but I think it would be more intuitive (and readable) if there were a match? : Bool.

All 5 comments

I think this would be nice to have, though I guess it should be named matches?.

The advantage would be to avoid creating a MatchData, so it would be slightly faster. But we should check whether it's really faster, because MatchData in Crystal is a struct. Some memory is allocated by PCRE, though, and I don't know if that can be avoided.

If it's really faster I'd be okay with it; if not, I think MatchData? is just fine, since nil is falsey in any boolean expression.

man pcreapi says

If neither the actual string matched nor any captured substrings are of interest, pcre_exec() may be called with ovector passed as NULL and ovecsize as zero.

So we can at least avoid the memory allocation for PCRE substring captures. However, I don't know how much this affects performance.

I implemented matches? naively:

patch:

diff --git a/src/regex.cr b/src/regex.cr
index db5ad35cf..d6aeee121 100644
--- a/src/regex.cr
+++ b/src/regex.cr
@@ -481,8 +481,7 @@ class Regex

     ovector_size = (@captures + 1) * 3
     ovector = Pointer(Int32).malloc(ovector_size)
-    ret = LibPCRE.exec(@re, @extra, str, str.bytesize, byte_index, (options | Options::NO_UTF8_CHECK), ovector, ovector_size)
-    if ret > 0
+    if internal_matches?(str, byte_index, options, ovector, ovector_size)
       match = MatchData.new(self, @re, str, byte_index, ovector, @captures)
     else
       match = nil
@@ -491,6 +490,26 @@ class Regex
     $~ = match
   end

+  def matches?(str, pos = 0, options = Regex::Options::None) : Bool
+    if byte_index = str.char_index_to_byte_index(pos)
+      matches_at_byte_index?(str, byte_index, options)
+    else
+      false
+    end
+  end
+
+  def matches_at_byte_index?(str, byte_index = 0, options = Regex::Options::None) : Bool
+    return false if byte_index > str.bytesize
+
+    internal_matches?(str, byte_index, options, nil, 0)
+  end
+
+  private def internal_matches?(str, byte_index, options, ovector, ovector_size)
+    ret = LibPCRE.exec(@re, @extra, str, str.bytesize, byte_index, (options | Options::NO_UTF8_CHECK), ovector, ovector_size)
+    # TODO: `ret < -1` indicates a PCRE error, which should be handled properly.
+    ret >= 0
+  end
+
   # Returns a `Hash` where the values are the names of capture groups and the
   # keys are their indexes. Non-named capture groups will not have entries in
   # the `Hash`. Capture groups are indexed starting from `1`.

benchmark script:

require "benchmark"

# This regular expression comes from marked.js.
# https://github.com/markedjs/marked/blob/master/lib/marked.js#L466-L469
regex = /^!?\[((?:\[[^\[\]]*\]|\\.|`[^`]*`|[^\[\]\\`])*?)\]\(\s*(<(?:\\[<>]?|[^\s<>\\])*>|[^\s\x00-\x1f]*)(?:\s+("(?:\\"?|[^"\\])*"|'(?:\\'?|[^'\\])*'|\((?:\\\)?|[^)\\])*\)))?\s*\)/

README = File.read("README.md")

Benchmark.ips do |x|
  x.report("match") { README.size.times { |i| regex.match(README, i) } }
  x.report("matches?") { README.size.times { |i| regex.matches?(README, i) } }
end

Compiling the benchmark script with --release and running it inside the crystal repo gives:

   match   3.43k (291.45µs) (± 5.38%)  201kB/op   1.73× slower
matches?   5.95k (168.01µs) (± 6.00%)   0.0B/op        fastest

I'll open a new PR if needed.

Yes, please!
