Go: regexp: port RE2's DFA matcher to the regexp package

Created on 10 Jul 2015  ·  35 Comments  ·  Source: golang/go

The regexp package currently chooses between the standard NFA matcher, onepass, or the backtracker. This proposal is for porting over RE2's DFA matcher to be used as an option by exec.

NeedsFix Performance Proposal Proposal-Accepted

Most helpful comment

Forgive the length of this post. TLDR: This post summarizes Chivers' paper and my thoughts on DFA implementation for Go in general.

Chivers discusses optimization techniques implemented in the jsre regular expression library. Sections 3-5 are the pertinent ones for this post.

  • __3.1: Executable Architecture__: NFA simulation is implemented with a bytecode virtual machine, where the states, transition functions, and parallel simulation are represented by the program counter, a set of instructions, and separate execution threads.
  • __3.3: Counted repetitions and NFA state__: In order to support arbitrarily complex repeated subexpressions, jsre supports counted loops similar to an assembly translation with a loop control variable and a conditional jump. However, Chivers states that in order to avoid merging threads that are not in the same NFA state, one must “label states using both the program counter and the loop counter. Threads may be merged if they are at the same point in the input stream and both the program counter and any loop counters have the same value.[1]”
  • __4.1: Preview Definition__: A preview is a sequence of sets, where the ith symbol of every string in the language is a member of the ith set in the preview. For example:

    1. The language denoted by the regular expression abcd|ef is {abcd, ef}. The preview for this language is {{a,e}, {b,f}, {c}, {d}}.

    2. The minimum length of a string described by a preview is called its validity length (or validity index). The validity length of this preview is 2.

    3. A preview generalizes the one-byte lookahead common to existing regular expression libraries.

  • __4.2: Preview Construction__:

    1. Base case: Regular expression has single symbol x. Preview is {x}.

    2. Union: Preview(R|S) = {Preview(R)_i ∪ Preview(S)_i | i ∈ ℕ}

      1. ValidityLength(Preview(R|S)) = min(ValidityLength(Preview(R)), ValidityLength(Preview(S)))

    3. Concatenation: Denote ValidityLength(Preview(R)) by V_R. Preview(RS) = {Preview(R)_{i+V_R} ∪ Preview(S)_i | i ∈ ℕ}

      1. ValidityLength(Preview(RS)) = ValidityLength(Preview(R)) + ValidityLength(Preview(S))



    4. Kleene Star: Concatenate a preview with itself repeatedly, e.g. Preview(RR), Preview(Preview(RR)R), Preview(Preview(Preview(RR)R)R), …

      1. ValidityLength = 0, because the starred expression matches the empty string.



  • __4.3: From preview to DFAs__:

    1. Each set in the preview is the next transition function.

    2. Produces false positives, but not false negatives.
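
The construction rules above can be sketched in Go. This is a minimal illustration of the union rule only (the `Preview`, `Literal`, and `Union` names are my own, not from the regexp package), building the preview of abcd|ef from section 4.1:

```go
package main

import "fmt"

// Preview is a sequence of rune sets: the ith symbol of every string in
// the language must belong to the ith set. Illustrative types only.
type Preview struct {
	Sets     []map[rune]bool // Sets[i]: symbols allowed at position i
	Validity int             // minimum length of any string in the language
}

// Literal builds the preview of a literal string: one singleton set per symbol.
func Literal(s string) Preview {
	p := Preview{}
	for _, r := range s {
		p.Sets = append(p.Sets, map[rune]bool{r: true})
	}
	p.Validity = len(p.Sets)
	return p
}

// Union implements Preview(R|S)_i = Preview(R)_i ∪ Preview(S)_i, with
// ValidityLength equal to the smaller of the two operands' validity lengths.
func Union(a, b Preview) Preview {
	p := Preview{Validity: a.Validity}
	if b.Validity < p.Validity {
		p.Validity = b.Validity
	}
	n := len(a.Sets)
	if len(b.Sets) > n {
		n = len(b.Sets)
	}
	for i := 0; i < n; i++ {
		set := map[rune]bool{}
		if i < len(a.Sets) {
			for r := range a.Sets[i] {
				set[r] = true
			}
		}
		if i < len(b.Sets) {
			for r := range b.Sets[i] {
				set[r] = true
			}
		}
		p.Sets = append(p.Sets, set)
	}
	return p
}

func main() {
	// Preview of abcd|ef should be {{a,e}, {b,f}, {c}, {d}}, validity length 2.
	p := Union(Literal("abcd"), Literal("ef"))
	fmt.Println(len(p.Sets), p.Validity, p.Sets[0]['a'] && p.Sets[0]['e'], p.Sets[3]['d'])
}
```

Concatenation and Kleene star follow the same pattern: shift S's sets by V_R, and for the star repeat the self-concatenation up to the chosen preview depth.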

5: Optimization with previews:

I found that the paper lacked enough examples, so I added one for each optimization.

  • __5.1: Anchor preview scan__: An unanchored match typically tries successive positions in the input as anchor points from which to match a substring.

    1. One simple and lightweight optimization is to use a preview to eliminate non-candidates before trying to match the expression from that anchor point.

    2. Chivers provides an example of an RE for an IP address: ([0-9]{1,3}\.){3}[0-9]{1,3}

    3. The preview for the RE would be the following: {{[0-9]}, {\., [0-9]}, {\., [0-9]}}

    4. Therefore, based on the preview, the following input strings are filtered accordingly

      8.6.7.123        candidate
      192.16.7.123     candidate
      .1.2.234         not a candidate
      abc.234.532.445  not a candidate
      123.b.~.445      candidate

    5. Chivers shows a speedup of 5x from this optimization when tested on a large corpus.
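
As a rough illustration of this anchor filtering (the predicate representation and `isCandidate` name are my own, not Chivers' implementation), here is how the IP-address preview classifies the inputs above:

```go
package main

import "fmt"

// isCandidate reports whether the preview admits an anchor point at
// position at: the next len(preview) bytes must each belong to the
// corresponding preview set. Sketch of the section 5.1 idea only.
func isCandidate(input string, at int, preview []func(byte) bool) bool {
	for i, inSet := range preview {
		if at+i >= len(input) || !inSet(input[at+i]) {
			return false
		}
	}
	return true
}

func main() {
	digit := func(b byte) bool { return '0' <= b && b <= '9' }
	dotOrDigit := func(b byte) bool { return b == '.' || digit(b) }
	// Preview of ([0-9]{1,3}\.){3}[0-9]{1,3}: {{[0-9]}, {., [0-9]}, {., [0-9]}}
	preview := []func(byte) bool{digit, dotOrDigit, dotOrDigit}

	for _, s := range []string{"8.6.7.123", "192.16.7.123", ".1.2.234", "abc.234.532.445", "123.b.~.445"} {
		fmt.Println(s, isCandidate(s, 0, preview))
	}
}
```

Note that 123.b.~.445 passes the filter even though it is not an IP address: previews admit false positives, never false negatives.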

  • __5.3: Matching multiple alternate expressions__: A common use case for regular expressions is to union multiple different expressions and test each alternative on a different thread.

    1. Before accepting the overhead of dispatching another thread, previews allow one to test that a (sub) expression may match.
    2. For example, let’s say we wanted to match the expression (asking|aardvark|barhopping|bartender|barring)
    3. Here’s a preview of size 3: {{a,b},{s,a},{k,t,r}}
    4. Chivers shows that “[i]t is therefore straightforward to arrange an efficient search strategy by re-ordering the alternative expressions into a tree and using preview tests at each node.[1]” Note that if the preview check fails, we reject (or do not use that anchor point in the case of unanchored matches).
    5. Based on this example, here is a tree that I have marked up to include the preview checks and thread dispatch locations.
      ```
      $
      │ {a,b}?
      ├── a
      │   │ {s,a}?
      │   ├── s─{k,t,r}?─k─(new thread)─i──n──g
      │   └── a─{k,t,r}?─r─(new thread)─d──v──a──r──k
      └── b─{s,a}?─a─{k,t,r}?─r
                              ├─(new thread)─ h──o──p──p──i──n──g
                              ├─(new thread)─ t──e──n──d──e──r
                              └─(new thread)─ r──i──n──g
      ```
  • __5.5: Loop Optimization__: For a given RE of the form [c]*e, where c is a character set and e is an expression

    1. If c and First(Preview(e)) have no members in common (that is, they are disjoint), we can terminate the loop for [c]* as soon as we encounter [^c].
    2. Normally, one has to dispatch a thread on each iteration to check that e also matches. However, if the sets are disjoint, that can’t happen.
    3. Consider the RE [abc]*d+
    4. Preview(d+) = {{d}, {d}, …}
    5. First(Preview(d+)) = {d}
    6. {a,b,c} is disjoint from {d}, so as soon as we encounter an element that is not in the character set [abc], we exit the loop.
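
A hedged sketch of this loop optimization, specialized to the concrete RE [abc]*d+ (the function is illustrative; a real matcher would operate on compiled instructions):

```go
package main

import "fmt"

// matchABCstarDplus reports whether input, anchored at position 0, matches
// [abc]*d+. Because {a,b,c} is disjoint from First(Preview(d+)) = {d}, the
// loop consumes input without dispatching a thread per iteration: d+ can
// only begin at the first byte outside [abc]. Illustrative sketch only.
func matchABCstarDplus(input string) bool {
	i := 0
	for i < len(input) && (input[i] == 'a' || input[i] == 'b' || input[i] == 'c') {
		i++ // consume [abc]* with no per-iteration thread
	}
	return i < len(input) && input[i] == 'd' // test d+ exactly once, at loop exit
}

func main() {
	fmt.Println(matchABCstarDplus("abcbad")) // loop exits at 'd': match
	fmt.Println(matchABCstarDplus("abcbae")) // loop exits at 'e': no match
}
```
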
  • __5.6: String Prefix Optimization__: This optimization uses the disjoint property from __5.5__. Chivers proves that for a repeated test of a character set [x] followed by a disjoint suffix y, as in [x]*y: if the character test stops and we fail to match the suffix, then any anchor point between the original anchor point and the suffix will also stop the character test and fail to match the suffix.

    1. Given this input:   aaaaxaaxy

    2. Match attempt 1:  aaaaxy

    3. The match fails on the sixth character. Any anchor point chosen before the first x will also fail at the same position.

    4. The next logical anchor point is after the first x, allowing us to optimally select anchor points rather than exhaustively trying each one.

    5. Input:                    aaaaxaaxy

    6. Match attempt 2:          aaxy

    7. Chivers notes that this optimization can be applied to arbitrary sub-expressions, where all that is needed is a preview of the (sub)expression and the disjoint property. This optimization can lead to substantial performance improvements on unanchored searches in expressions with Kleene closures.

  • __5.8: Avoiding Quadratic Complexity__: Trying an anchor point at every position in the input results in roughly O(n²) runtime.

    1. A common solution to this is to prefix the expression with .* which creates a thread for every symbol in the alphabet. This exploits the thread merging of NFA simulation but has the disadvantage that if the beginning of the expression can have a large character set, then the number of threads can equal the number of input positions consumed.
      ```
      For the regular expression x{5}.
      xxxxx
      01234 // states written out
      Input: xaxxxaxxxxxabc

      Note: f denotes failure.
      Exhaustive anchor point testing. (Each line is a new anchor point)
      xaxxxaxxxxxabc
      0f
      f
      012f
      01f
      0f
      f
      01234 // success
      0123f
      012f
      01f
      0f
      f
      f
      f

      Using .* prefix to exploit thread merging. (Each line is a new thread)
      .*xxxxx
      0 12345

      xaxxxaxxxxxabc
      00000000000000
      1f
      123f
      12345 // success
      1234f
      ```

    2. Chivers states that “[a]t best thread handling may result in slower evaluation because of thread overheads; at worst this may result in early exhaustion of the available thread space.[1]”

    3. An alternative solution applies to Kleene closures embedded anywhere in the expression. If we fail to match from the first anchor position using parallel simulation, we end up with a “guarded area” that corresponds to all of the characters matched by the Kleene closure. When we try a second anchor position and we reach the same state of the Kleene closure, if our position in the input is in the guarded area, then we know that the expression evaluated from this anchor will fail.
      ```
      Input: abcabczzzxyy
      Expression: abc.*xyz
      States: 0123 456

      Note: G denotes the guarded area.
      f denotes failure.
      Each line after the colon denotes another thread.

      Attempt 1:
      Input: abcabczzzxyy
      States: 012333333333
      : 45f
      Guarded: GGGGGGGGG

      Attempt 2:
      Input: abcabczzzxyy
      Guarded: GGGGGGGGG
      States: 012f // because 3 would land in the guarded area
      ```

    4. Chivers shows that “in addition to providing linear search performance, this approach minimizes thread usage in the NFA simulation.[2]”
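
To make the guarded-area idea concrete, here is a Go sketch specialized to the abc.*xyz example (the helper name and structure are my own, not the paper's code):

```go
package main

import (
	"fmt"
	"strings"
)

// guardedSearch sketches the guarded-area idea for the pattern abc.*xyz:
// after a failed attempt, the positions scanned by .* become guarded, and
// any later anchor whose .* would start inside the guarded area is
// rejected without re-scanning. Illustrative sketch only.
func guardedSearch(input string) bool {
	guardedFrom := len(input) + 1 // start of the guarded area; none yet
	for anchor := 0; anchor+3 <= len(input); anchor++ {
		if input[anchor:anchor+3] != "abc" {
			continue
		}
		dotStart := anchor + 3
		if dotStart >= guardedFrom {
			continue // .* would re-enter the guarded area: known to fail
		}
		if strings.Contains(input[dotStart:], "xyz") {
			return true
		}
		guardedFrom = dotStart // everything .* just scanned is now guarded
	}
	return false
}

func main() {
	fmt.Println(guardedSearch("abcabczzzxyy")) // second anchor skipped via the guard
	fmt.Println(guardedSearch("abcabczzzxyz"))
}
```

On the input abcabczzzxyy from the example, the second anchor (position 3) is rejected immediately because its .* would begin at position 6, inside the area already scanned and guarded by attempt 1.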

Thoughts on implementation

Considering Go tries to have robust handling of Unicode symbols, previews allow one to more efficiently match characters that have multi-byte encodings. Given that Go's regexp compiler also outputs bytecode, it can be extended to support the preview scan instruction, which I assume stores a flag denoting the ith character's membership in the ith set of the preview. Perhaps this feature would also allow Go's implementation to operate on bytes instead of runes, considering it essentially has k-lookahead.

It seems that the primary advantage of having previews is to minimize thread usage during NFA simulation. New threads are not spawned until the input matches the preview, and invalid anchor positions can be quickly eliminated, preventing new threads from being spawned.

A preview of size i for a language R contains at most i sets and describes a superset of R. Because of this, a rejection from the preview implies a rejection from R, but not the other way around. Additionally, a preview can be represented as a DFA, with the ith subset being a transition function from state i to state i+1. Construction of previews would happen during compilation (or lazily--see below), but we'd have to do additional testing to determine their optimal sizes.

A balanced approach to DFA construction seems to be the lazy subset construction described in Russ's article[3]. I also agree with @rsc's comment that we should keep a bound on the space used by the DFA. However, if we run out of space, we should not throw away the automaton. Instead, we construct a partial DFA with an upper bound on the allotted space. If the DFA doesn't require all of the allotted space, that’s great! We have a complete DFA. On the other hand, if we run out of space, only a certain fraction of the machine will be deterministic[4]. This would still offer an average case speedup for the deterministic portion of the machine.
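
A minimal sketch of such a bounded, partially deterministic cache (the types and the compute callback, which stands in for one NFA simulation step, are illustrative assumptions, not regexp internals):

```go
package main

import "fmt"

// stateKey identifies a DFA state by the set of NFA states it stands for.
// A real implementation would derive this from the compiled program.
type stateKey string

type partialDFA struct {
	states map[stateKey]map[byte]stateKey // cached transitions
	limit  int                            // upper bound on cached states
}

// next returns the transition for (from, b), caching it only while the
// state budget lasts; past the limit the machine stays partially
// deterministic and the transition is recomputed on each use.
func (d *partialDFA) next(from stateKey, b byte, compute func(stateKey, byte) stateKey) stateKey {
	if row, ok := d.states[from]; ok {
		if to, ok := row[b]; ok {
			return to // deterministic fast path
		}
	}
	to := compute(from, b)
	if len(d.states) < d.limit {
		if d.states[from] == nil {
			d.states[from] = map[byte]stateKey{}
		}
		d.states[from][b] = to
	}
	return to
}

func main() {
	d := &partialDFA{states: map[stateKey]map[byte]stateKey{}, limit: 2}
	step := func(s stateKey, b byte) stateKey { return s + stateKey(b) }
	fmt.Println(d.next("q", 'a', step)) // computed and cached
	fmt.Println(d.next("q", 'a', step)) // served from the cache
}
```

Unlike a flush-on-overflow cache, the cached fraction keeps paying off for the rest of the match.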

I wonder if the current implementation has the notion of composable machines (or subroutines). An enhancement to finite state machines that doesn't add any language-recognition power but offers a practical performance benefit is the ability to feed the result of a separate machine or subroutine into the first. This allows one to take the union and intersection of machines, and therefore their languages, simply by accepting if one or both of the machines accept, and vice versa for rejection. In the context of subroutines, we would still have access to the same thread state, allowing the machine to record start and end positions of matches. Additionally, when we compute the preview of a subexpression that has multiple instances inside an expression, we can reuse that preview for each instance, provided that the subexpression has a separate machine or subroutine.

Finally, partial DFAs might be useful for previews as well. Similar to cached DFA construction, we can compute the previews only for code paths that are reachable. The first time an input character is tested as a member of a preview set, we construct the preview and cache it for lookup later. Although preview sizes are small, a large number of subexpressions in the language may demand many previews to the point where some of them will be NFAs constructed on the fly.

Overall, a reasonable strategy for implementation would be:

  1. Cached Partial DFA Construction
  2. Instruction for preview scan

    • Optimization for alternate expressions to improve Unicode code point matching

    • Optimization for Anchor Points

    • Loop optimization

    • String Prefix Optimization

    • Guarded area anchor point

      _We could settle here if performance improvements were notable enough._

  3. Also construct preview DFAs lazily
  4. Explore the idea of subroutines.

[1] Chivers, H. (2016). Optimising unicode regular expression evaluation with previews. Software: Practice and Experience, 47(5), 669-688. doi:10.1002/spe.2436

[2] Chivers, H. (2016, August 15). York CS Seminars [Video]. Retrieved from https://youtu.be/tIRpyQpJ3Ug

[3] Cox, R. (2007, January). Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...). Retrieved from https://swtch.com/~rsc/regexp/regexp1.html

[4] Navarro, G. (1997). A Partial Deterministic Automaton for Approximate String Matching.

All 35 comments

CC @rsc

IIRC, @rsc and @robpike were discussing this a few weeks ago and @rsc said he'd hit some design stumbling block where the C++ implementation didn't have an obvious Go translation.

Also, you were mentioning that the UTF-8 orientation of Go's regexp package and its runes make the 256-sized tables (which are just bytes in the C++ version) potentially tricky. Not sure whether restricting the DFA optimization to ASCII-only regexps helps or not.

One problem @rsc mentioned was the allocation of state data, but he gave me a solution so I don't think that will be a big problem.

After looking some more at the rune vs byte issue, I don't think we would need to skip the DFA for non-ascii regexps. The DFA makes a set of character ranges, and indexes the output transitions by range. We could do the same range lookup for input runes < 256 (or some other arbitrary table size) and do a slower lookup for larger runes.

Of course, there are a number of other things that don't map easily. I think most of those will have replacements and for any that don't we'll be able to do something slightly slower than RE2 but still be much faster than the nfa matcher.

I would like to understand why the NFA is not performant first. I am afraid that adding a DFA (I presume for performance reasons) may hit the same problems.

@michaelmatloob and I talked about this at Gophercon. I'm okay with adding this provided:

  1. A bound on the space used by the DFA can be kept.
  2. When the DFA cache runs out of space and must be flushed, the allocated memory can be reused directly. (That is, we don't allocate a new cache and rely on a GC cycle to reclaim the old cache memory.)

I'm working on this and hope to have it ready for 1.7.

(How do I add myself as an assignee?)

CL https://golang.org/cl/22189 mentions this issue.

CL https://golang.org/cl/22246 mentions this issue.

This isn't going to get in for Go 1.7.

For those interested in trying it out, I've got a standalone copy of my work thus far in matloob.io/regexp. It passes all the regexp tests and is 2-4x faster than regexp for hard regexps and large inputs. It's 5-10x faster for the new BenchmarkMatchHard1 benchmark.

@matloob plan on sending new CLs this cycle?

Ooh I really want to... but...

rsc had some ideas that I want to look at before moving forward with this. I think things will need to be on hold till he gets back.

In the meantime I'll sync the cl to tip and re-check-in an even harder benchmark.

@matloob can you share some of @rsc ideas ?

Russ would like to implement the ideas in the following paper: https://doi.org/10.1002/spe.2436. I haven't had a chance to read the paper yet (and it's behind a paywall).

If it helps, there is a video of the seminar that Chivers held about one month before the paper was published. The "preview" idea seems elegant and more appealing (to me, at least) than dealing with SIMD instructions for various architectures.

The first page of the paper mentions the use of loop counting to implement counted repetitions. If that idea is of interest, there are at least two papers about "extended" finite automata that predate Chivers' work. I must also point out that Smith, Estan and Jha patented their work.

@matloob @jubalh thanks for the links ...

@matloob if your code is strictly faster already, is there any reason not to do the further optimizations later? Or is the idea that substantial changes to regexes require thorough testing, and you'd like to avoid doing it twice? (Seems like tests can be reused, though…)

Unfortunately I haven't had significant time to do research into this, but I think the idea here was that we could improve the performance of Go regexp matching while avoiding the complexity of the RE2 DFA.

I don't know if that's the case. (I haven't even read the Chivers paper yet!) But I think we want to do our due diligence here before adding complexity that it will be very hard to remove.

@matloob, any update here? Still want to do this sometime?

I haven't had the chance to review the Chivers paper, and don't know when I'd be able to dedicate the time to investigating it further.

On the other hand bringing in the DFA matcher is far easier, and I'd be able to do that work if we decided that's something we wanted.


I have checked myself and the performance of the golang regexp library is horrible.

In a simple benchmark RE2 was 10 times faster. It might not be representative.

RE2's compiler is 1283 lines long; Go's regexp compiler is 290 lines long.

Porting the RE2 code should be enough.

The Teddy algorithm used in Hyperscan is much better than even RE2's DFA:
https://01.org/hyperscan/blogs/jpviiret/2017/regex-set-scanning-hyperscan-and-re2set

Perhaps somebody can translate it from the Rust implementation at https://github.com/jneem/teddy

@adsouza Note that the Teddy algorithm is a specific form of literal optimization that only really applies to a small number of small literals.

Has anyone evaluated the DFA implementation in https://github.com/google/codesearch/blob/master/regexp/match.go yet?

@junyer The implementation in codesearch keeps a state cache of size 256 for each DFA state (although the alphabet spans the size of a rune). A cache like this could be placed into a sync.Pool, to play nicely with the GC. We would still be swapping a lot given a single DFA cache is 2K bytes, so it would be nice to reduce the size of the bitmap.

I'm trying to understand toByteProg, which I believe modifies a syntax.Prog to break up a >1-byte-range rune instruction into multiple 1-byte-range rune instructions (correct me if I'm wrong). I think @rsc mentions this idea of constructing smaller FSMs to handle unicode in https://swtch.com/~rsc/regexp/regexp1.html under _Character sets._

A preview approach might be to use unicode character ranges to create a preview DFA of depth 3 to filter out invalid byte sequences. This presents a tradeoff between the number of states (from the depth value) and size of the DFA cache, which we could tune to our liking.

I'm trying to understand toByteProg, which I believe modifies a syntax.Prog to break up a >1-byte-range rune instruction into multiple 1-byte-range rune instructions (correct me if I'm wrong).

If you can read Rust, then this is exactly what the utf8-ranges crate does: https://github.com/BurntSushi/utf8-ranges There isn't much code, so it should be pretty easy to port. The idea is itself inspired by what RE2 does.

This approach is how you get the alphabet size down to 256, even while supporting all of Unicode. You can further reduce the alphabet size by merging the symbols in your alphabet that are indistinguishable from each other for a specific regex. @rsc talks about this in his third regexp article IIRC.

Regexps are still slow as hell #26943

@kamilgregorczyk, you can file constructive bug reports without saying things are "slow as hell" or "unacceptable". People are more likely to deprioritize bug reports from people who seem rude and more likely to help people being polite.

  1. Performance of regexp in Go is unacceptable; that's the truth, and there's
    nothing rude about it.
  2. Such a design flaw (which has been known for quite a long time) affects my
    work and most likely other people's work, and apparently some built-ins
    aren't optimized in any way AND there are no plans to fix it.

That's really bad for a language, as I thought I could trust Go. In fact I
wanted to start using it for my commercial projects, but for now it goes
back to the toy bucket again (the first time was when I tried iris and its
owner deleted all commits).


What do you even mean by 'unacceptable performance'? Do you care to define/explain? Do you mean the happy case, the average one, or the worst one? For each of those you get different answers. Did you know that Go is actually orders of magnitude _faster_ than PCRE in some worst cases? Have you heard about https://en.wikipedia.org/wiki/ReDoS, for example?

The best mix of performance/safety characteristics across the different cases is a design choice, not the simple, universal "truth" that was falsely claimed.

Of course it differs; in some cases it will be fine and in some it won't, and
it can break any benchmarks/tests/whatever. It's still slower than
Java/Python/Ruby in my case and in some other cases that were reported. What
scares me away is that there are no plans to even start fixing this.


You could say something like "I find its performance unacceptable" or "It's unacceptable for my application", but saying that it's flat-out "unacceptable" is what's coming across as rude in your tone. It's been almost 9 years of people largely accepting its performance, even if there are always things that could be better. The bug is one such thing.

Should this issue be closed in favor of https://github.com/golang/go/issues/26623 @bradfitz ?

Let's keep this open. This is a specific task with good discussion and history.
