The regexp package currently chooses between the standard NFA matcher, onepass, or the backtracker. This proposal is for porting over RE2's DFA matcher to be used as an option by exec.
CC @rsc
IIRC, @rsc and @robpike were discussing this a few weeks ago and @rsc said he'd hit some design stumbling block where the C++ implementation didn't have an obvious Go translation.
Also, you were mentioning that the UTF-8-edness of Go's regexp package and its runes makes the 256-entry tables (which are just bytes in the C++ version) potentially tricky. Not sure whether restricting the DFA optimization to ASCII-only regexps helps or not.
One problem @rsc mentioned was the allocation of state data, but he gave me a solution so I don't think that will be a big problem.
After looking some more at the rune vs byte issue, I don't think we would need to skip the DFA for non-ASCII regexps. The DFA makes a set of character ranges, and indexes the output transitions by range. We could do the same range lookup for input runes < 256 (or some other arbitrary table size) and do a slower lookup for larger runes.
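To make that concrete, here is a minimal sketch of such a split transition lookup: a dense table for small runes, a binary search over sorted ranges for the rest. All names are hypothetical, not from any CL:

```go
package main

import (
	"fmt"
	"sort"
)

// dfaState is a hypothetical DFA state: a dense next-state table for runes
// below 256, plus sorted rune ranges for everything above.
type dfaState struct {
	next   [256]*dfaState // dense lookup for small runes
	ranges []runeRange    // sorted by hi, for the slow path
}

type runeRange struct {
	lo, hi rune
	next   *dfaState
}

// step picks the transition for r: an array index for small runes,
// a binary search over ranges otherwise.
func (s *dfaState) step(r rune) *dfaState {
	if r < 256 {
		return s.next[r]
	}
	i := sort.Search(len(s.ranges), func(i int) bool { return r <= s.ranges[i].hi })
	if i < len(s.ranges) && s.ranges[i].lo <= r {
		return s.ranges[i].next
	}
	return nil // dead state
}

func main() {
	var a, b dfaState
	a.next['x'] = &b
	a.ranges = []runeRange{{lo: 'α', hi: 'ω', next: &b}}
	fmt.Println(a.step('x') == &b, a.step('λ') == &b, a.step('Z') == nil)
}
```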
Of course, there are a number of other things that don't map easily. I think most of those will have replacements, and for any that don't we'll be able to do something slightly slower than RE2 but still much faster than the NFA matcher.
I would like to understand first why the NFA matcher is not performant. I am afraid that adding a DFA (presumably for performance reasons) may hit the same problems.
@michaelmatloob and I talked about this at Gophercon. I'm okay with adding this provided:
I'm working on this and hope to have it ready for 1.7.
(How do I add myself as an assignee?)
CL https://golang.org/cl/22189 mentions this issue.
CL https://golang.org/cl/22246 mentions this issue.
This isn't going to get in for Go 1.7.
For those interested in trying it out, I've got a standalone copy of my work thus far in matloob.io/regexp. It passes all the regexp tests and is 2-4x faster than regexp for hard regexps and large inputs. It's 5-10x faster for the new BenchmarkMatchHard1 benchmark.
@matloob do you plan on sending new CLs this cycle?
Ooh I really want to... but...
rsc had some ideas that I want to look at before moving forward with this. I think things will need to be on hold till he gets back.
In the meantime I'll sync the cl to tip and re-check-in an even harder benchmark.
@matloob can you share some of @rsc's ideas?
Russ would like to implement the ideas in the following paper: https://doi.org/10.1002/spe.2436. I haven't had a chance to read the paper yet (and it's behind a paywall).
If it helps, there is a video of the seminar that Chivers held about one month before the paper was published. The "preview" idea seems elegant and more appealing (to me, at least) than dealing with SIMD instructions for various architectures.
The first page of the paper mentions the use of loop counting to implement counted repetitions. If that idea is of interest, there are at least two papers about "extended" finite automata that predate Chivers' work. I must also point out that Smith, Estan and Jha patented their work.
@matloob @jubalh thanks for the links ...
@matloob if your code is strictly faster already, is there any reason not to do the further optimizations later? Or is the idea that substantial changes to regexps require thorough testing, and you'd like to avoid doing it twice? (Seems like tests can be reused, though…)
Unfortunately I haven't had significant time to do research into this, but I think the idea here was that we could improve the performance of Go regexp matching while avoiding the complexity of the RE2 DFA.
I don't know if that's the case. (I haven't even read the Chivers paper yet!) But I think we want to do our due diligence here before adding complexity that will be very hard to remove.
@matloob, any update here? Still want to do this sometime?
I haven't had the chance to review the Chivers paper, and don't know when I'd be able to dedicate the time to investigating it further.
On the other hand bringing in the DFA matcher is far easier, and I'd be able to do that work if we decided that's something we wanted.
Forgive the length of this post. TLDR: This post summarizes Chivers' paper and my thoughts on DFA implementation for Go in general.
Chivers discusses optimization techniques implemented in the jsre regular expression library. Sections 3-5 are the pertinent ones for this post.
__3: Previews__: A preview of a language is a sequence of character sets that over-approximates the first few characters of the strings in the language. For example, the language of `abcd|ef` is `{abcd, ef}`, and the preview for this language is `{{a,e}, {b,f}, {c}, {d}}`. For a single character `x`, the preview is `{x}`. Previews compose:

- Alternation: `Preview(R|S) = {Preview(R)_i ∪ Preview(S)_i | i ∈ ℕ}`, and `ValidityLength(Preview(R|S)) = min(ValidityLength(Preview(R)), ValidityLength(Preview(S)))`.
- Concatenation: denote `ValidityLength(Preview(R))` by `V_R`. Then `Preview(RS) = {Preview(R)_{i+V_R} ∪ Preview(S)_i | i ∈ ℕ}`, and `ValidityLength(Preview(RS)) = ValidityLength(Preview(R)) + ValidityLength(Preview(S))`.
- Kleene closure: concatenate a preview to itself many times, e.g. `Preview(RR)`, `Preview(Preview(RR)R)`, `Preview(Preview(Preview(RR)R)R)`, …; here `ValidityLength = 0`, because the closure matches the empty string.
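To make the definitions concrete, here is a minimal sketch of a preview used as a rejection filter, under my reading of the paper (the names and representation are mine, not the paper's):

```go
package main

import "fmt"

// preview stores one character set per position. Only the first validLen
// sets may be used to reject: for abcd|ef the sets {c} and {d} exist, but
// ValidityLength = min(4, 2) = 2, since "ef" ends after two characters.
type preview struct {
	sets     []map[rune]bool
	validLen int
}

// rejects reports whether input can be ruled out using only the preview:
// if the i-th character is outside the i-th set (for i < validLen), no
// string of the language starts here. Passing proves nothing; the preview
// is a superset test, so survivors still need the full matcher.
func (p preview) rejects(input []rune) bool {
	for i := 0; i < p.validLen; i++ {
		if i >= len(input) || !p.sets[i][input[i]] {
			return true
		}
	}
	return false
}

func main() {
	p := preview{
		sets: []map[rune]bool{
			{'a': true, 'e': true},
			{'b': true, 'f': true},
			{'c': true},
			{'d': true},
		},
		validLen: 2,
	}
	fmt.Println(p.rejects([]rune("abcd"))) // false: worth running the matcher
	fmt.Println(p.rejects([]rune("ef")))   // false: "ef" matches, must survive
	fmt.Println(p.rejects([]rune("ax")))   // true: cheap rejection
}
```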
5: Optimization with previews:
I found the paper lacking in examples, so I added one for each optimization.
`([0-9]{1,3}\.){3}[0-9]{1,3}` has the preview `{{[0-9]}, {\., [0-9]}, {\., [0-9]}}`:
```
8.6.7.123        candidate
192.16.7.123     candidate
.1.2.234         not a candidate
abc.234.532.445  not a candidate
123.b.~.445      candidate
```
__5.3: Matching multiple alternate expressions__: A common use case for regular expressions is to union multiple different expressions and test each alternative on a different thread.
`(asking|aardvark|barhopping|bartender|barring)` has the preview `{{a,b},{s,a},{k,t,r}}`. New threads are spawned only after a prefix passes the preview:
```
$
│
{a,b}?
│
├── a
│   │
│   {s,a}?
│   │
│   ├── s→{k,t,r}?→k→(new thread)→i→n→g
│   │
│   └── a→{k,t,r}?→r→(new thread)→d→v→a→r→k
│
└── b→{s,a}?→a→{k,t,r}?→r
      ├─(new thread)→ h→o→p→p→i→n→g
      ├─(new thread)→ t→e→n→d→e→r
      └─(new thread)→ r→i→n→g
```
__5.5: Loop Optimization__: For a given RE of the form `[c]*e`, where `c` is a character set and `e` is an expression: if `c` and `First(Preview(e))` have no members in common (that is, they are disjoint), we can terminate the loop for `[c]*` as soon as we encounter `[^c]`. Ordinarily we could not simply run the loop to exhaustion, because a character in `[c]` might be one where `e` also matches. However, if the sets are disjoint, that can't happen. Example: `[abc]*d+`.
`Preview(d+) = {{d}, {d}, …}`, so `First(Preview(d+)) = {d}`. `{a,b,c}` is disjoint from `{d}`, so as soon as we encounter a character that is not in the character set `[abc]`, we exit the loop.
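A minimal sketch of this early loop exit for `[abc]*d+`; the helper is hypothetical, checking only the first suffix character, which is enough to show the optimization:

```go
package main

import "fmt"

// loopThenFirst consumes loopSet* and then checks whether the next character
// can start the suffix e. Because loopSet and First(Preview(e)) are disjoint,
// there is no need to fork a thread that starts e inside the run: the first
// character outside loopSet is the only place e can begin.
func loopThenFirst(input string, loopSet, first func(byte) bool) bool {
	i := 0
	for i < len(input) && loopSet(input[i]) {
		i++ // consume [abc]*; no thread needed to try starting d+ here
	}
	return i < len(input) && first(input[i])
}

func main() {
	inABC := func(b byte) bool { return b == 'a' || b == 'b' || b == 'c' }
	isD := func(b byte) bool { return b == 'd' }
	fmt.Println(loopThenFirst("abcabd", inABC, isD)) // true: d follows the run
	fmt.Println(loopThenFirst("abcxd", inABC, isD))  // false: x ends the attempt
}
```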
The same idea helps select anchor points: for a character test `[x]` with a disjoint suffix `y`, as in the case of `[x]*y`, if the character test stops and we fail to match the suffix, then any anchor point between our original anchor point and the suffix will also stop the character test and fail to match the suffix; every anchor inside the run of `x` will also fail at the same position. We can therefore skip the whole run of `x`, allowing us to optimally select anchor points rather than exhaustively trying each one.

__5.8: Avoiding Quadratic Complexity__: Trying an anchor point at every position in the input results in roughly O(n²) runtime. A common solution to this is to prefix the expression with `.*`, which creates a thread for every symbol in the alphabet. This exploits the thread merging of NFA simulation, but has the disadvantage that if the beginning of the expression can have a large character set, then the number of threads can equal the number of input positions consumed.
```
For the regular expression x{5}.
xxxxx
01234 // states written out
Input: xaxxxaxxxxxabc
Note: f denotes failure.
Exhaustive anchor point testing. (Each line is a new anchor point)
xaxxxaxxxxxabc
0f
f
012f
01f
0f
f
01234 // success
0123f
012f
01f
0f
f
f
f
Using .* prefix to exploit thread merging. (Each line is a new thread)
.*xxxxx
0 12345
xaxxxaxxxxxabc
00000000000000
1f
123f
12345 // success
1234f
```
Chivers states that "[a]t best thread handling may result in slower evaluation because of thread overheads; at worst this may result in early exhaustion of the available thread space."[1]
An alternative solution applies to Kleene closures embedded anywhere in the expression. If we fail to match from the first anchor position using parallel simulation, we end up with a "guarded area" that corresponds to all of the characters matched by the Kleene closure. When we try a second anchor position and we reach the same state of the Kleene closure, if our position in the input is in the guarded area, then we know that the expression evaluated from this anchor will fail.
```
Input: abcabczzzxyy
Expression: abc.*xyz
States: 0123 456
Note: G denotes the guarded area.
f denotes failure.
Each line after the colon denotes another thread.
Attempt 1:
Input: abcabczzzxyy
States: 012333333333
: 45f
Guarded: GGGGGGGGG
Attempt 2:
Input: abcabczzzxyy
Guarded: GGGGGGGGG
States: 012f // because 3 would land in the guarded area
```
Chivers shows that "in addition to providing linear search performance, this approach minimizes thread usage in the NFA simulation."[2]
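A minimal sketch of how the guard check might be applied, assuming the simulation records the input span consumed by the closure on the failed attempt (all names hypothetical):

```go
package main

import "fmt"

// guard marks the input span consumed by a Kleene closure state during a
// failed attempt. If a later attempt reaches the same closure state at a
// position inside the guard, it must fail the same way, so the thread can
// be killed without rescanning.
type guard struct {
	state  int // pc of the Kleene closure instruction
	lo, hi int // half-open input span covered on the failed attempt
}

func (g guard) kills(state, pos int) bool {
	return state == g.state && g.lo <= pos && pos < g.hi
}

func main() {
	// From the example above: abc.*xyz on abcabczzzxyy.
	// Attempt 1 parks state 3 (the .* closure) over positions 3..11 and fails.
	g := guard{state: 3, lo: 3, hi: 12}
	// Attempt 2 (anchored at position 3) reaches state 3 at position 6.
	fmt.Println(g.kills(3, 6)) // true: prune immediately
}
```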
Thoughts on implementation
Considering Go tries to have robust handling of Unicode symbols, previews allow one to more efficiently match characters that have multi-byte encodings. Given that Go's regexp compiler also outputs bytecode, it can be extended to support the preview scan instruction, which I assume stores a flag denoting the ith character's membership in the ith set of the preview. Perhaps this feature would also allow Go's implementation to operate on bytes instead of runes, considering it essentially has k-lookahead.
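Purely as illustration, a preview-scan step could look something like the following; `instPreview` is hypothetical, and no such opcode exists in `regexp/syntax`:

```go
package main

import "fmt"

// instPreview is a hypothetical extra bytecode instruction: succeed only if
// the next k runes each fall in the corresponding preview set; otherwise the
// thread dies before it is ever scheduled.
type instPreview struct {
	sets []func(rune) bool // one membership test per lookahead position
	next uint32            // pc to jump to on success
}

func (in instPreview) run(input []rune, pos int) (pc uint32, ok bool) {
	for i, inSet := range in.sets {
		if pos+i >= len(input) || !inSet(input[pos+i]) {
			return 0, false
		}
	}
	return in.next, true
}

func main() {
	// Preview {{a,b},{s,a}} from the alternation example above.
	p := instPreview{
		sets: []func(rune) bool{
			func(r rune) bool { return r == 'a' || r == 'b' },
			func(r rune) bool { return r == 's' || r == 'a' },
		},
		next: 7,
	}
	pc, ok := p.run([]rune("asking"), 0)
	fmt.Println(pc, ok) // 7 true
}
```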
It seems that the primary advantage of having previews is to minimize thread usage during NFA simulation. New threads are not spawned until the input matches the preview, and invalid anchor positions can be quickly eliminated, preventing new threads from being spawned.
The preview for a language R of size i is a superset of the set of strings in R, which contains at most i elements. Because of this, a rejection from the preview implies a rejection from R, but not the other way around. Additionally, a preview can be represented as a DFA, with the ith subset being a transition function from state i to state i+1. Construction of previews would happen during compilation (or lazily; see below), but we'd have to do additional testing to determine their optimal sizes.
A balanced approach to DFA construction seems to be the lazy subset construction described in Russ's article[3]. I also agree with @rsc's comment that we should keep a bound on the space used by the DFA. However, if we run out of space, we should not throw away the automaton. Instead, we construct a partial DFA with an upper bound on the allotted space. If the DFA doesn't require all of the allotted space, that's great! We have a complete DFA. On the other hand, if we run out of space, only a certain fraction of the machine will be deterministic[4]. This would still offer an average-case speedup for the deterministic portion of the machine.
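A minimal sketch of such a bounded, lazily built state cache; the cap, key encoding, and names are illustrative only:

```go
package main

import "fmt"

const maxDFAStates = 1 << 12 // cap on cached states; tune empirically

// key identifies a set of NFA states; a real implementation would use a
// canonical encoding of the sorted state set.
type key string

type dstate struct {
	next map[byte]*dstate // filled in lazily, one edge per input byte seen
	// ... match flags, underlying NFA state set, etc.
}

type lazyDFA struct {
	states map[key]*dstate
}

// state returns the cached DFA state for k, building it on first use.
// When the cache is full it returns nil, and the caller falls back to NFA
// simulation for this step instead of discarding the whole automaton:
// the machine stays deterministic wherever it fit.
func (d *lazyDFA) state(k key) *dstate {
	if s, ok := d.states[k]; ok {
		return s
	}
	if len(d.states) >= maxDFAStates {
		return nil // partial DFA: this region stays nondeterministic
	}
	s := &dstate{next: make(map[byte]*dstate)}
	d.states[k] = s
	return s
}

func main() {
	d := &lazyDFA{states: make(map[key]*dstate)}
	fmt.Println(d.state("q0,q3") != nil) // built on demand
}
```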
I wonder if the current implementation has the notion of composable machines (or subroutines). An enhancement to finite state machines that doesn't add any language-recognition power but offers a practical performance benefit is the ability to feed the result of a separate machine or subroutine into the first. This allows one to take the union and intersection of machines, and therefore their languages, simply by accepting if one or both of the machines accept, and vice versa for rejection. In the context of subroutines, we would still have access to the same thread state, allowing the machine to record start and end positions of matches. Additionally, when we compute the preview of a subexpression that has multiple instances inside an expression, we can reuse that preview for each instance, provided that the subexpression has a separate machine or subroutine.
Finally, partial DFAs might be useful for previews as well. Similar to cached DFA construction, we can compute the previews only for code paths that are reachable. The first time an input character is tested as a member of a preview set, we construct the preview and cache it for lookup later. Although preview sizes are small, a large number of subexpressions in the language may demand many previews to the point where some of them will be NFAs constructed on the fly.
Overall, a reasonable strategy for implementation would be:
[1] Chivers, H. (2016). Optimising unicode regular expression evaluation with previews. Software: Practice and Experience, 47(5), 669-688. doi:10.1002/spe.2436
[2] Chivers, H. (2016, August 15). Retrieved from https://youtu.be/tIRpyQpJ3Ug. York CS Seminars
[3] Cox, R. (2007, January). Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...). Retrieved from https://swtch.com/~rsc/regexp/regexp1.html
[4] Navarro, G. (1997). A Partial Deterministic Automaton for Approximate String Matching.
I have checked myself, and the performance of the Go regexp library is horrible.
In a simple benchmark, RE2 was 10 times faster. It might not be representative.
The RE2 compiler is 1283 lines of code; the Go regexp compiler is 290.
Porting the RE2 code should be enough.
The Teddy algorithm used in Hyperscan is much better than even RE2's DFA:
https://01.org/hyperscan/blogs/jpviiret/2017/regex-set-scanning-hyperscan-and-re2set
Perhaps somebody can translate it from the Rust implementation at https://github.com/jneem/teddy
@adsouza Note that the Teddy algorithm is a specific form of literal optimization that only really applies to a small number of small literals.
Has anyone evaluated the DFA implementation in https://github.com/google/codesearch/blob/master/regexp/match.go yet?
@junyer The implementation in codesearch keeps a state cache of size 256 for each DFA state (although the alphabet spans the size of a rune). A cache like this could be placed into a `sync.Pool` to play nicely with the GC. We would still be swapping a lot, given that a single DFA cache is 2K bytes, so it would be nice to reduce the size of the bitmap.
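Roughly what that could look like; the 256-entry table matches the 2K figure above, and the names are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// stateCache is one per-DFA-state transition table: 256 next-state indexes,
// 2K bytes at 8 bytes per entry.
type stateCache [256]int64

var cachePool = sync.Pool{
	New: func() interface{} { return new(stateCache) },
}

func match(input []byte) bool {
	c := cachePool.Get().(*stateCache)
	defer cachePool.Put(c) // return the table to the pool instead of leaving garbage
	// ... run the DFA using c as scratch space ...
	_ = input
	return false
}

func main() {
	fmt.Println(match([]byte("hello")))
}
```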
I'm trying to understand `toByteProg`, which I believe modifies a `syntax.Prog` to break up a >1-byte-range rune instruction into multiple 1-byte-range rune instructions (correct me if I'm wrong). I think @rsc mentions this idea of constructing smaller FSMs to handle Unicode in https://swtch.com/~rsc/regexp/regexp1.html under _Character sets_.
A preview approach might be to use Unicode character ranges to create a preview DFA of depth 3 to filter out invalid byte sequences. This presents a tradeoff between the number of states (from the depth value) and the size of the DFA cache, which we could tune to our liking.
I'm trying to understand toByteProg, which I believe modifies a syntax.Prog to break up a >1-byte-range rune instruction into multiple 1-byte-range rune instructions (correct me if I'm wrong).
If you can read Rust, then this is exactly what the utf8-ranges crate does: https://github.com/BurntSushi/utf8-ranges There isn't much code, so it should be pretty easy to port. The idea is itself inspired by what RE2 does.
This approach is how you get the alphabet size down to 256, even while supporting all of Unicode. You can further reduce the alphabet size by merging the symbols in your alphabet that are indistinguishable from each other for a specific regex. @rsc talks about this in his third regexp article IIRC.
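A minimal sketch of that alphabet compression: map each byte to an equivalence class and index transitions by class. This is a standalone illustration, not regexp's internals:

```go
package main

import "fmt"

// classesFor assigns each byte an equivalence class, given the boundaries a
// specific regex actually distinguishes. For [a-z]+ only three classes
// exist: everything below 'a', the range 'a'..'z', and everything above.
func classesFor(boundaries []byte) (classOf [256]byte, n int) {
	class := byte(0)
	b := 0
	for i := 0; i < 256; i++ {
		for b < len(boundaries) && int(boundaries[b]) == i {
			class++ // a new class starts at each boundary byte
			b++
		}
		classOf[i] = class
	}
	return classOf, int(class) + 1
}

func main() {
	// For [a-z]+, classes change at 'a' and at 'z'+1.
	classOf, n := classesFor([]byte{'a', 'z' + 1})
	fmt.Println(n)                                        // 3 classes
	fmt.Println(classOf['_'], classOf['m'], classOf['~']) // 0 1 2
	// Transition tables now need n columns instead of 256.
}
```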
Regexps are still slow as hell #26943
@kamilgregorczyk, you can file constructive bug reports without saying things are "slow as hell" or "unacceptable". People are more likely to deprioritize bug reports from people who seem rude and more likely to help people being polite.
That's really bad for a language, as I thought I could trust Go. In fact, I wanted to start using it for my commercial projects, but for now it goes back to the toy bucket again (the first time was when I tried iris and its owner deleted all the commits).
What do you even mean by 'unacceptable performance'? Do you care to define/explain? Do you mean the happy case, the average, or the worst one? For each of those you get different answers. Do you know that Go is actually orders of magnitude _faster_ in some worst cases compared to PCRE? Have you heard about https://en.wikipedia.org/wiki/ReDoS, for example?
What the best mix of performance/safety characteristics is in the different cases is a design choice, not a simple nor universal "truth" as falsely claimed.
Of course it differs; in some cases it will be fine and in some it won't, whatever the benchmarks/tests say. It's still slower than Java/Python/Ruby in my case and in some other cases that were reported. What scares me away is that there are no plans to even start fixing this.
You could say something like "I find its performance unacceptable" or "It's unacceptable for my application", but saying that it's flat-out "unacceptable" is what's coming across as rude in your tone. It's been almost 9 years of people largely accepting its performance, even if there are always things that could be better. The bug is one such thing.
Should this issue be closed in favor of https://github.com/golang/go/issues/26623, @bradfitz?
Let's keep this open. This is a specific task with good discussion and history.