Runtime: Regex performance compared with other programming languages

Created on 29 Sep 2017 · 61 comments · Source: dotnet/runtime

There is a mariomka/regex-benchmark repo that runs a regex benchmark for different programming languages. I'm just wondering why C# Regex performance comes in last, way slower than any other language. Is there any way to speed up Regex performance in .NET, or any reason why .NET Regex is that slow?

area-System.Text.RegularExpressions enhancement

All 61 comments

Note: I would have expected to achieve about what Java is able to get, at least.

.... the fact that the IP address numbers are so wildly different across the board suggests some engine differences, or something. Certainly, specifying RegexOptions.ECMAScript cuts the non-IP numbers by about 1/3, which would be telling (still a bad showing, though).

Thank you for pointing us to this benchmark. @ViktorHofer made a change https://github.com/dotnet/corefx/pull/24158 post 2.0 which should help substantially. They are already passing in the compiled flag; it is being ignored.

This should give us performance similar to desktop .NET Framework. We would like to do much better though. @vancem has suggested we could create a "lightweight" regex codepath used for the 80% case where the regexes are "simple". The ones in this benchmark look like they might not be "simple enough" however.

We would welcome collaboration on regex performance. Are either of you interested?

The ones in this benchmark look like they might not be "simple enough" however.

The benchmark lists three expressions

Email: [\w\.+-]+@[\w\.-]+\.[\w\.-]+
URI: [\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?
IPv4: (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])
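
For reference, these patterns can be exercised with a minimal C# harness along these lines (the input file name is a placeholder; the real benchmark harness also times each run):

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Count matches for the three benchmark patterns; RegexOptions.Compiled is
// the flag discussed in this thread (it was being ignored before dotnet/corefx#24158).
string input = File.ReadAllText("input-text.txt"); // placeholder input path

string[] patterns =
{
    @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+",                        // email
    @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?", // URI
    @"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}" +
    @"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])",             // IPv4
};

foreach (string pattern in patterns)
{
    var regex = new Regex(pattern, RegexOptions.Compiled);
    Console.WriteLine($"{pattern}: {regex.Matches(input).Count}");
}
```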

The only one I would not expect the 'simplified' parser to handle is the IPv4 example, because it uses the {3} expression to repeat something 3 times (if you had expanded it out three times, it would fit my conception of 'simplified'). Of course, the whole strategy I laid out was that we would add more complex features as usage dictates.

Personally I don't think the IPv4 regex is a great example from a software engineering perspective. It is trying to keep the numbers < 256, and that is painful to do in a regex. Instead I would expect people to use a simpler pattern like (\d+)\.(\d+)\.(\d+)\.(\d+), convert the strings to numbers, and do the numeric validation as a semantic check.
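
A minimal sketch of that approach (the helper name is mine, not from the comment):

```csharp
using System.Text.RegularExpressions;

// Simple structural pattern plus a numeric range check as a semantic step,
// instead of encoding "0-255" into the regex itself.
static bool IsValidIPv4(string candidate)
{
    Match m = Regex.Match(candidate, @"^(\d+)\.(\d+)\.(\d+)\.(\d+)$");
    if (!m.Success)
        return false;

    for (int i = 1; i <= 4; i++)
        if (!int.TryParse(m.Groups[i].Value, out int octet) || octet > 255)
            return false;

    return true;
}
```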

@danmosemsft said:

Thank you for pointing us to this benchmark. @ViktorHofer made a change dotnet/corefx#24158 post 2.0 which should help substantially. They are already passing in the compiled flag; it is being ignored.

This should give us performance similar to desktop .NET Framework. We would like to do much better though. @vancem has suggested we could create a "lightweight" regex codepath used for the 80% case where the regexes are "simple". The ones in this benchmark look like they might not be "simple enough" however.

The benchmark lists it as .NET Core, but desktop (which is what I'm on at the moment) doesn't do any better.

@Clockwork-Muse that is interesting. Usually compiled helps. This will need some profiling.

btw, on Mono RegexOptions.Compiled is currently ignored due to a bug (which explains why it's so slow in the benchmark).

One final thing on RegEx. Today when we say 'compiled' we mean generating IL at RUNTIME (using Reflection.Emit). Better than that would be to generate the IL (frankly, C#) at COMPILE TIME. This was always a stumbling block before because we did not have ways of injecting tools at build time without users having to do something painful, but NuGet packages support this and we should take advantage of it (frankly, my expectation is that > 90% of all RegEx patterns are known at COMPILE TIME). It would be straightforward to generate C# code (it would even be readable) that would recognize the pattern and would be SUPER FAST (and would have a nice comment that shows the Regex it recognizes).
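
To illustrate, here is a hand-written sketch of what generated code for a trivial pattern could look like (hypothetical; no such generator existed when this was written):

```csharp
using System;

// Hand-written illustration of code a generator might emit for: ^[0-9]+$
// (a real generator would emit equivalent logic for arbitrary patterns,
// plus a comment showing the pattern it recognizes, as suggested above).
static bool IsMatch_AllDigits(ReadOnlySpan<char> input)
{
    if (input.IsEmpty)
        return false;

    foreach (char c in input)
    {
        if (c < '0' || c > '9')
            return false;
    }

    return true;
}
```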

In many other languages, 'compile' is the name for the regex parsing step. But .NET provides a separate Compiled option, and it is easy to use incorrectly.


I have made a serious mistake. I used to think that all regex engines are much the same. Today, after careful testing, I found that this is not true.
The C# benchmark time is spent in the matching phase, rather than the parsing phase.
It seems the C# regex engine's implementation is really complicated, and its performance needs to be optimized.

Better than that would be to generate the IL (frankly C#) at COMPILE TIME

Just some additional information.
In .NET Framework we support this by calling Regex.CompileToAssembly(..) (http://referencesource.microsoft.com/#System/regex/system/text/regularexpressions/Regex.cs,1193), which produces an assembly that can be referenced at compile time. With that you get IntelliSense and all the other benefits from it, BUT there is still pain involved, as this feature isn't really dynamic at all.
In .NET Core we currently don't support this feature and throw PlatformNotSupportedException (with the latest nightlies) when trying to use the API. We do support IL generation at runtime now, but only in netcoreapp applications, not UWP.
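
For context, a minimal sketch of how that .NET Framework-only API is used (the assembly and class names below are made up):

```csharp
using System.Reflection;
using System.Text.RegularExpressions;

// Describe the regex class to generate (names below are illustrative).
var info = new RegexCompilationInfo(
    @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+", // pattern (email, from the benchmark)
    RegexOptions.None,
    "EmailRegex",                    // generated class name
    "Precompiled",                   // generated namespace
    true);                           // public

// Writes Precompiled.dll to disk; reference it at compile time and
// instantiate the generated class directly: new Precompiled.EmailRegex().
Regex.CompileToAssembly(new[] { info }, new AssemblyName("Precompiled"));
```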

What @vancem suggested with NuGet packages sounds interesting. If we want to, we could even publish common patterns like email addresses, domains, IP addresses and others as compiled regex assemblies. But that's something for future discussions, as we would first need to bring back the mentioned API in .NET Core (which isn't trivial).

Costing covers making a plan.

We have made improvements here but do not plan more for 2.1 beyond what is already tracked. Moving to Future to collect future work.

Some specific use-cases would benefit greatly from faster Regex.
For example: dev tools/extensions tend to use them extensively to find patterns in code (think // TODO), filter filenames and so on.

The best-performing regex engine is actually a reusable C library that includes a JIT; maybe it's worth looking into integrating it rather than writing your own regex JIT.
It's under a BSD license; GitHub mirror:
https://github.com/luvit/pcre2

I know every regex engine has its own flavor, but PCRE is compatible with Perl regexes which are rather broad feature-wise!

@vancem's idea of referencing a NuGet package to precompile all static regexes would be awesome when it comes to startup time.

EDIT: BTW, if you want to do some comparisons, someone wrote a mixed-assembly wrapper around PCRE here: https://github.com/ltrzesniewski/pcre-net

Thanks, we know about both pcre and pcre-net. We had offline discussions about adopting pcre but nothing concrete yet.

  1. Need to mention ripgrep: https://blog.burntsushi.net/ripgrep/
    It uses the Rust regex library (based on RE2) and has other optimizations (like an Intel Teddy re-implementation).
    It is worth studying.

  2. In the PowerShell repo we discovered that Select-String cmdlet performance is very bad. There are two reasons for this:

    1. The slow CoreFX Regex engine.
    2. Tons of string allocations (see the sketch below).
      Regex only works with _strings_ and we have to read files line by line, which leads to a huge number of string allocations. We really need to enhance Regex to work not only with strings but also with ReadOnlySpan and buffers (perhaps incrementally).
      It seems the enhancement doesn't require rewriting the whole Regex engine and could be implemented in the near future. Thoughts?
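
A minimal sketch of the allocation pattern described in point 2 (file name and pattern are placeholders):

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

var todoRegex = new Regex(@"//\s*TODO", RegexOptions.Compiled);
int count = 0;

using var reader = new StreamReader("large-file.log"); // placeholder path
string line;
while ((line = reader.ReadLine()) != null) // allocates a new string per line
{
    if (todoRegex.IsMatch(line))
        count++;
}

// A span-based IsMatch(ReadOnlySpan<char>) overload would let this loop scan
// a reusable char buffer instead, avoiding the per-line string allocations.
Console.WriteLine(count);
```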

(frankly, my expectation is that > 90% of all RegEx patterns are known at COMPILE TIME)

The @vancem expectation seems to be true. I would add that we can expect most regex patterns to be _simple_, because complex patterns are difficult to write and difficult to debug. (A developer can always break a complex pattern into pieces and apply them consistently without loss of performance.) If so, a developer could choose a simpler and faster implementation of the RegEx engine. I mean that we could have several different implementations and allow the developer to choose the option that best suits their situation.
The options could be a simple string search, a fast but feature-limited DFA, and a full-featured NFA engine.
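
As a minimal sketch of the cheapest of those options, a literal fast path could look like this (the heuristic and helper name are mine):

```csharp
using System;
using System.Text.RegularExpressions;

static class FastRegex
{
    private static readonly char[] Metacharacters = @"\[](){}|.*+?^$".ToCharArray();

    // If the pattern contains no metacharacters it is a plain literal, and
    // matching reduces to a substring search - no regex machinery needed.
    public static bool IsMatch(string input, string pattern) =>
        pattern.IndexOfAny(Metacharacters) < 0
            ? input.Contains(pattern)            // literal fast path
            : Regex.IsMatch(input, pattern);     // full engine fallback
}
```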

@ViktorHofer would your proposed Span-based API likely fix the string allocation problem? If so, was the reason it is on pause that it was feasible to implement for the interpreted codepath but rather expensive for the compiled codepath? If so, I wonder if the benefit is sufficiently large that we could implement it only for interpreted, and throw (for now) for compiled.

@iSazonov just curious, do your regexes tend to be compiled?

@danmosemsft If your question is about the Select-String cmdlet, then I would expect that we need to compile regex patterns. (But I see in the cmdlet code that we are not doing this, and this should be fixed.)

Compiling a pattern is not necessarily the right choice. I believe they can't be unloaded once compiled.

In general, yes, but PowerShell already internally compiles script blocks for performance, so compiling a regex is not a problem. We could get a performance boost for large files, especially with TC.

Taking a closer look at ripgrep, I discovered that it uses heuristics to select one of 7 (!) search methods: https://github.com/rust-lang/regex/blob/77140e7c1628ac31026d2421c6a4c3b0eb19506c/src/literal/mod.rs#L39
It will definitely take a lot of work for the .NET Core library to catch up with the search leaders.

It seems the Span.Sort() API was already approved. Did the MSFT team consider a Span-based Match()/IsMatch() option?

The .NET Regex engine executes expressions through backtracking. Some other engines build a finite state automaton instead. It is possible to match a string in O(N) time as opposed to possibly exponential time with backtracking.

Two downsides:

  1. Engineering work
  2. This does not support all features (e.g. backreferences)

Most regex expressions that occur in practice are supported. FSA matching is extremely fast.

If we care about regex performance we should not try to squeeze out a few percent here or there. The switch to an FSA implementation (for supported expressions which are most) can provide an exponential speedup.
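
A classic illustration of that exponential blow-up (my example, not from the thread; the first call can take seconds, so run with care):

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Nested quantifiers force the backtracking engine to try roughly 2^n ways
// of splitting the run of 'a's before concluding there is no match. An
// FSA-based engine rejects the same input in linear time.
var evil = new Regex("^(a+)+$");
string input = new string('a', 25) + "!";

var sw = Stopwatch.StartNew();
bool matched = evil.IsMatch(input);
Console.WriteLine($"{matched} after {sw.ElapsedMilliseconds} ms");
```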

Also, compilation is very important as a constant factor optimization. It was added recently to .NET Core.

I also think that we could have several engines and choose the right one after regex parsing.

I do agree regex performance has fallen behind. The way forward may not be major engineering work in the existing Regex engine: as well as the cost, there is some risk of breaking behavior (possibly necessarily).

My thinking was we may want to do a combination of

  1. selected tweaks to the existing engine, such as maybe an API that takes spans.
  2. offer an alternative engine based on existing code, possibly behind a new API (the existing API has some inefficiencies, such as creating objects for groups/matches/captures, and some idiosyncrasies, such as .NET being the only engine I know of that supports a captures collection, and it also returns all matches at once)

For 2. I did some experimentation with PCRE as discussed here
https://github.com/ltrzesniewski/pcre-net/issues/12
https://github.com/dotnet/corefxlab/compare/master...danmosemsft:pcre?expand=1
Although in that case I stuck it behind the existing API, I began to think it would make better sense to offer a smaller, more optimized API that better matched PCRE semantics.

cc @ltrzesniewski

The best quality of implementation would be to upgrade the existing API. Exact same syntax and API surface. It might make sense to add a parallel API surface that is more optimized (less allocations etc.).

I understand it's attractive to just drop the existing API and create something new in a greenfield way. The better solution for customers, though, is to avoid that fragmentation and just fix the engine. This requires thorough compatibility work but surely this is possible. Does Microsoft not have access to large code bodies that use Regex? It must be possible to compile a compatibility test suite.

If effort was not a concern we'd create a new fast engine for supported expressions (FSA based with no backtracking) and keep the old engine for unsupported expressions. Almost all practically occurring expressions are formed from a rather simple syntax subset. This would be completely transparent to existing code. Customers do not need to learn new ways of doing things.

just fix the engine

This is not cheap. Writing a new regex engine that is competitive is a big investment and would take time. Where possible it would be good to build on the work others have done and spend our dev dollars in other places. Making it fully compatible is likely an opposing requirement anyway.

Almost all practically occurring expressions are formed from a rather simple syntax subset.

You are quite right - we did an audit of a large quantity of code and this is correct. You can get an idea of the nuances of existing behavior by looking at the tests I disabled in the branch above to get tests passing against PCRE.

I began to think it would make better sense to offer a smaller, more optimized API that better matched PCRE semantics.

That's understandable if performance is the only thing you're going after, but IMO that would be a missed opportunity. PCRE has a very rich feature set which you can leverage to do things that are just not possible in other engines. That's why I decided to support everything in PCRE.NET - it gives the user so many tools.

There's one feature I left out because I deemed it unnecessary and I had provided a suitable equivalent. And sure enough, just yesterday someone posted an issue asking me to implement it. People want these features. :)

Of course, you could provide simpler APIs that would use faster code paths alongside more feature-complete APIs to get the best of both worlds.

Writing a different API for each regex engine makes sense from a feature availability standpoint.

The .NET Regex engine executes expressions through backtracking. Some other engines build a finite state automaton instead. It is possible to match a string in O(N) time as opposed to possibly exponential time with backtracking.

PCRE (and thus PCRE.NET) provides both implementations. But in the case of PCRE, the O(N) algorithm is not necessarily faster. Read here for more details, but in short:

It is substantially slower than the standard algorithm. This is partly because it has to search for all possible matches, but is also because it is less susceptible to optimization.

Also, PCRE doesn't implement compiled regexes with this algorithm. Engines like RE2 that are designed from the ground up with this algorithm in mind are probably faster.

I wonder whether it could help to make changes to the "regular expression pattern parser" rather than in the matching engine itself. Sort of like static analysis for regex patterns, and rewriting the patterns to get maximum performance from the existing engine without changing the match results.

Specifically, I'm thinking of functionality like "unrolling the loop", or automatically wrapping some subexpressions in an atomic group to curb unnecessary backtracking if there is no chance of it finding a match anyway. (Please forgive me for the lack of concrete examples here, I am on mobile atm.)
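
A minimal sketch of the atomic-group idea (the patterns here are illustrative, not from the comment):

```csharp
using System;
using System.Text.RegularExpressions;

// Both patterns accept the same inputs here, but the atomic group (?>...)
// discards backtracking positions inside each iteration, so a non-matching
// input fails fast instead of triggering exponential backtracking.
var naive  = new Regex(@"^(\w+\s?)*$");
var atomic = new Regex(@"^(?>\w+\s?)*$");

string bad = new string('a', 25) + "!";
Console.WriteLine(atomic.IsMatch(bad)); // false, returns almost immediately
// naive.IsMatch(bad) would take exponentially long on the same input.
```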

I guess this would only really help when the original author of the expression didn't already try to optimize it, or decided it would be more maintainable in a simpler form, etc., but it could perhaps be useful for cases where users can provide their own patterns. What do you think? I imagine the development effort would be a lot smaller than trying to rewrite the match engine or support third-party engines. It could even be part of a separate API that developers could opt in to, in case that is a concern.

That said, one thing I would find useful, and am considering working on myself when/if time permits (and submitting as a NuGet package), would be a "regex AST". (Currently, the patterns that are parsed are stored in classes marked as internal.) This could power the "static analysis" I suggested earlier, and have the advantage of being able to "rewrite" a pattern originally written for one engine to be suitable for use in another. (i.e. whether \h means horizontal whitespace or a hexadecimal digit etc.)
The reason I mention this, is that if multiple engines will be supported in future, I imagine it would be beneficial to have this functionality as part of the framework. :)
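
As a minimal sketch, such a public AST might look like this (all type names are hypothetical):

```csharp
using System.Collections.Generic;

// Node shapes for a hypothetical public regex AST; an analyzer could walk
// these to rewrite patterns (atomic grouping, loop unrolling) or to
// translate between engine dialects (e.g. the meaning of \h).
public abstract record RegexNode;
public sealed record LiteralNode(string Text) : RegexNode;
public sealed record CharClassNode(string Set, bool Negated) : RegexNode;
public sealed record RepeatNode(RegexNode Child, int Min, int? Max, bool Lazy) : RegexNode;
public sealed record SequenceNode(IReadOnlyList<RegexNode> Children) : RegexNode;
public sealed record AlternationNode(IReadOnlyList<RegexNode> Branches) : RegexNode;
public sealed record GroupNode(RegexNode Child, string CaptureName, bool Atomic) : RegexNode;
```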

I like the idea of having a new regex engine in .NET. In that sense I like the work proposed in https://github.com/ltrzesniewski/pcre-net/issues/12.

This would not be affiliated with the official .NET libraries, right? It could be an independent project.

If such a powerful library exists then I agree that there is limited need for improving Regex. It seems better to recommend to users: If performance is not a concern use the most convenient tool. If performance is important use the alternative engine. It's just a NuGet package away.

As a .NET user I'd rather see the engineering time spent elsewhere given these considerations.

I like the idea of having a new regex engine in .NET. In that sense I like the work proposed in ltrzesniewski/pcre-net#12.

PCRE.NET no longer uses C++/CLI, I've rewritten the interop part entirely so I could target netstandard2.0.
It also now supports Windows, Linux and OS X.

I left that issue open for the perf aspect, on which I haven't worked much.

This would not be affiliated with the official .NET libraries, right? It could be an independent project.

I see no problem in principle with it being used in .NET Core itself: the PCRE license is friendly, and it is well maintained. We already depend on the native binary for zlib, built from sources in this repo.

I see no problem in principle with it being used in .NET Core itself

From Russ Cox https://research.swtch.com/deps#avoid_the_dependency

Avoid the dependency
If a dependency seems too risky and you can’t find a way to isolate it, the best answer may be to avoid it entirely, or at least to avoid the parts you’ve identified as most problematic.
For example, as we better understood the risks and costs associated with PCRE, our plan for Google Code Search evolved from “use PCRE directly,” to “use PCRE but sandbox the parser,” to “write a new regular expression parser but keep the PCRE execution engine,” to “write a new parser and connect it to a different, more efficient open-source execution engine.” Later we rewrote the execution engine as well, so that no dependencies were left, and we open-sourced the result: RE2.

We could allow more freedom for improvements if we made the regex parser a public API.
The parser could be made configurable (which features to support).
It would allow creating new engines (e.g., in CoreFXLab) without regressions in the current engine, and adding new heuristics (in custom code too, not just in the repo).

There is no holy grail here. Regex is a complex subject and any solution is made up of compromises.

PCRE2 alone offers NFA, DFA, JIT and a plethora of options, because none is strictly better than the others in all situations.
RE2 is designed to execute untrusted Regex from users in linear time, with caps on resources (memory) usage.

If I have special requirements for regex in my application (whether safety, extreme performance, partial matches, etc.) I don't mind picking up a Nuget package that offers a specific engine for that situation.

System.Text.RegularExpressions is special because it's built into the platform and it will be the go-to solution for most users. It's also the de facto contract for any library that accepts a regular expression as input (e.g. validation, routing, and more).
IMHO as such it's not important that it fits _all_ needs, but it should be able to fit most common needs with great performance.

So for me a first big improvement would be to make the current API _faster_.
A further improvement could be to introduce new APIs that are better suited to low-allocation operations (maybe based on Span, that create captures lazily, etc.). Maybe introduce new features if it makes sense.

A more complex goal would be to design a standard Regex interface that would enable inter-operable regex engines as Nuget packages...

For what it's worth, here are my results (on Ubuntu) using the benchmark at the top of this issue, targeting 2.2 and with the compile flag switched back on (it looks like they removed it).

| Language | Email (ms) | URI (ms) | IPv4 (ms) | Total (ms) |
| ------------- |:-------------:| -----:|-----:| -----:|
| PHP | 26.28 | 24.16 | 8.12 | 58.56 |
| C PCRE2 | 29.38 | 26.50 | 7.24 | 63.12 |
| Rust | 31.88 | 29.08 | 8.40 | 69.36 |
| Javascript | 94.18 | 66.55 | 2.17 | 162.90 |
| Crystal | 174.25 | 155.35 | 19.25 | 348.86 |
| D ldc | 174.77 | 173.53 | 4.88 | 353.19 |
| D dmd | 259.59 | 242.22 | 7.37 | 509.18 |
| Perl | 205.00 | 277.62 | 36.93 | 519.54 |
| Python PyPy | 220.50 | 197.99 | 367.01 | 785.50 |
| Ruby | 413.85 | 362.56 | 56.83 | 833.24 |
| Python 2 | 323.50 | 235.29 | 556.56 | 1115.35 |
| Python 3 | 457.60 | 289.60 | 480.00 | 1227.20 |
| Kotlin | 441.16 | 542.09 | 356.90 | 1340.15 |
| Java | 435.57 | 548.09 | 377.47 | 1361.12 |
| Go | 409.63 | 409.72 | 631.65 | 1451.00 |
| C# .Net Core | 1506.23 | 1332.92 | 103.60 | 2942.75 |
| C# Mono | 3914.31 | 3332.87 | 218.64 | 7465.82 |

As mentioned above, it's best not to read too much into this given this is a simple benchmark. But it's clear there is plenty of scope for improvement.

I notice the input file is UTF-8, and .NET (and perhaps Java?) has to process it as UTF-16. The ability to run regex over UTF-8 would presumably be helpful.

System.Text.RegularExpressions is special because it's built into the platform and it will be the go-to solution for most users. ...
So for me a first big improvement would be to make the current API faster.
A more complex goal would be to design a standard Regex interface that would enable inter-operable regex engines as Nuget packages...

Perhaps as @iSazonov suggests we can allow other engines to be plugged in to the existing S.T.RE API, so that existing code could run with little or no modification. This is an approach that has been used in Perl: one can plug in PCRE or RE2 using re::engine::PCRE and re::engine::RE2 without otherwise changing code (apparently).

As noted above, many "everyday" patterns would have much the same semantics in different engines, but not all. One could deal with this in two ways: attempt to fall back to the default implementation if this can be reliably determined (this is what the RE2 Perl package apparently does if compilation fails, e.g. because of lookbehind), or make it opt-in. Opt-in could be done e.g. with a new RegexOptions value or other API, or possibly at some higher level such as a thread static. Some of the API would have to selectively throw, e.g. anything related to Captures objects. It's unfortunate our Regex API surface is so large.

@danmosemsft Could you please share more information about your tests? Is it Ordinal or OrdinalIgnoreCase?

Re UTF-8: I ran an experiment and did not get a perf win in the file-read scenario. I guess it is because file reading is slower than transcoding UTF-8 to UTF-16. I guess we would see more of a perf win in an in-memory-only scenario.

Re design: the main question is whether we should implement a plugin model or separate APIs.

I ported https://github.com/kokke/tiny-regex-c to C#: https://github.com/iSazonov/Regex0. It implements a small Regex subset for matching only. The new IsMatch() works 5-10x faster than C# Regex. I could pull this into corefxlab to discuss development directions.

@iSazonov GitHub gives me a "page not found" error when trying to follow the link to view your C# port of tiny-regex-c - maybe the repo is private atm?

Sorry, fixed.

The .NET regular expression engine is not slower. The test includes the RegexOptions.Compiled option, so the benchmark pays the cost of compilation and each pattern is only run once. If RegexOptions.Compiled were removed, we would see much more reasonable performance.
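
A quick way to separate the two costs, along the lines of this observation (pattern and input are placeholders):

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

string pattern = @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+"; // placeholder pattern
string input = "contact: someone@example.com";    // placeholder input

var sw = Stopwatch.StartNew();
var rx = new Regex(pattern, RegexOptions.Compiled);
rx.IsMatch(input); // construction + first match pay the one-time IL-emit/JIT cost
Console.WriteLine($"construct + first match: {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int i = 0; i < 100_000; i++) rx.IsMatch(input);
Console.WriteLine($"100k steady-state matches: {sw.ElapsedMilliseconds} ms");
```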

Note that there have been some big improvements made to Regex perf in master (since 3.1 forked). Perhaps someone is interested in measuring with those.

@danmosemsft Could you please point to the PRs?

@iSazonov these. I do not know whether @stephentoub plans more work.

https://github.com/dotnet/runtime/pulls?q=is%3Apr+label%3Aarea-System.Text.RegularExpressions+is%3Aclosed+author%3Astephentoub

It would be fun to get more benchmark data; maybe I can run the mariomka benchmark from the top of this issue when I have access to that machine at home.

These significant recent wins suggest to me the existing engine is not at the limit of where we can get it. Also, that it may be less important to offer a way to plug in another engine behind the existing API (because the existing API is faster now)

I think the future is likely a combination of both

  1. continue optimizing existing engine, possibly adding new API for Span, etc.
  2. encourage and possibly contribute to alternative .NET API such as eg PCRE.NET

cc @pgovind fyi.

I do not know whether @stephentoub plans more work.

I do.

recent wins suggest to me the existing engine is not at the limit of where we can get it

I agree. There are still things that can be done to make further meaningful improvements beyond what's already been done for .NET 5.

Cool. I'll try to run the benchmarks next week on my machine to see where we're at.

(been staying off GitHub while on vacation, but I'll be putting up another significant PR early next week)

For what it's worth, here are my results (on Ubuntu) using the benchmark at the top of this issue, targeting 2.2 and with the compile flag switched back on (it looks like they removed it)

@danmosemsft, can you run your test again, but using .NET 5 master?

Pasted from https://github.com/mariomka/regex-benchmark/pull/14#issuecomment-623038745

| Language | Email (ms) | URI (ms) | IPv4 (ms) | Total (ms) |
| --- | ---: | ---: | ---: | ---: |
| PHP | 26.57 | 24.27 | 8.20 | 59.05 |
| C PCRE2 | 29.40 | 25.70 | 7.15 | 62.25 |
| Rust | 30.90 | 29.14 | 8.29 | 68.33 |
| C++ Boost | 62.97 | 62.49 | 23.13 | 148.58 |
| Javascript | 95.32 | 67.16 | 2.19 | 164.68 |
| Perl | 151.18 | 90.58 | 33.98 | 275.74 |
| Crystal | 172.62 | 148.42 | 19.01 | 340.04 |
| C# .Net Core 50 ECMA | 215.35 | 172.11 | 17.35 | 404.80 |
| D ldc | 223.99 | 221.51 | 6.71 | 452.21 |
| C# .Net Core 50 | 314.79 | 254.98 | 17.40 | 587.17 |
| D dmd | 292.91 | 303.55 | 8.84 | 605.29 |
| Ruby | 365.49 | 324.14 | 59.95 | 749.58 |
| Kotlin | 237.89 | 300.62 | 408.42 | 946.93 |
| Java | 233.96 | 317.21 | 429.29 | 980.46 |
| Python 2 | 317.74 | 210.41 | 521.91 | 1050.06 |
| Go | 411.11 | 399.17 | 634.96 | 1445.24 |
| C++ STL | 604.55 | 488.13 | 422.20 | 1514.88 |
| C# .Net Core 31 | 1504.76 | 1256.17 | 104.65 | 2865.57 |
| C# Mono | 3923.71 | 3301.79 | 218.56 | 7444.07 |

I also offered a PR to add RegexOptions.ECMAScript (i.e., to give the ECMA results above), since none of the C, Rust, C++ and Java versions are doing Unicode matching.

since none of the C, Rust, C++ and Java versions are doing Unicode matching.

I wonder if Rust does not understand UTF-8.

Note that https://github.com/dotnet/runtime/pull/35824 should take this down substantially further. But it's still a fair bit off Rust.

I've tested recent Mono (llvm-jit mode) vs daily .NET 5 on this benchmark:
Mono is 30% slower than CoreCLR.

@EgorBo that would be recent Mono with all of @stephentoub's improvements to the library, I assume. How does it compare with/without RegexOptions.Compiled? Are both regressed, or is it specific to ref-emit?

@EgorBo that would be recent Mono with all of @stephentoub's improvements to the library, I assume. How does it compare with/without RegexOptions.Compiled? Are both regressed, or is it specific to ref-emit?

Yes, it's a mono built from dotnet/runtime master so it uses all the latest Regex changes.
Numbers for "Compiled" mode:

coreclr:
7.1213 - 5
51.5561 - 92
54.7056 - 5301

mono-llvm-jit:
13.63 - 5
86.0223 - 92
87.768 - 5301

mono-jit:
13.1713 - 5
101.0694 - 92
103.0671 - 5301

For non-Compiled mode, the difference between CoreCLR and mono-llvm-jit is smaller (~30%).
NOTE: Mono can be faster in AOT-LLVM mode (because it runs more LLVM optimizations than the LLVM in JIT mode), but it looks like it's broken at the moment; I will try again once I fix it.

Thanks. What's your interpretation of the numbers - is this hitting an area in mono that is known to have a perf gap?

Thanks. What's your interpretation of the numbers - is this hitting an area in mono that is known to have a perf gap?

Not sure; I had a quick profiler run (via Xcode Instruments) and it looks like a lot of time is spent in the GC.

We did a lot of perf work in this area in 5.0. Can this issue be closed now?

I think so - we're doing very substantially better on both regex-redux and the mariomka benchmark. If there's more work, it should probably be driven by a new benchmark or new insights. Also, @EgorBo there's a Mono-specific issue; that should get its own issue, not least because different engineers would be working on it. (Perhaps it is important for Blazor? @marek-safar)

Perhaps it is important for Blazor?

One of the 3 Blazor test apps we are working with uses Regex -

https://github.com/SteveSandersonMS/CarChecker/blob/e9c7e7114e7936e38389074f1b5005e4b27a0e8e/Shared/VehiclePart.cs#L43-L50

Note that on WASM, RuntimeFeature.IsDynamicCodeCompiled always returns false, which means the Regex will never use the IL Compiler and instead will always be interpreted.

https://github.com/dotnet/runtime/blob/54a09d24142864767d50c83c2a64c0e1e37a34c2/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs#L76-L81

I'm not aware of any regex perf problems/asks for browser

