Runtime: Developers can precompile their Regex code for faster startup

Created on 14 Nov 2020  ·  27 Comments  ·  Source: dotnet/runtime

Regex has an interpreted mode and a compiled mode. The compiled mode takes longer to start, but is generally faster. Some users want both startup time and performance; other users want to run on a platform where JITted code is not allowed, and also want performance. For those users, .NET Framework allowed the generated code to be saved out but .NET Core/.NET 5 does not plan to support saving emitted IL.
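To make the trade-off between the two modes concrete, here is a minimal sketch that times construction plus the first match. `RegexModes` and `TimeFirstUse` are hypothetical names used only for illustration; `Regex`, `RegexOptions.Compiled` and `Stopwatch` are the existing APIs.

```C#
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexModes
{
    // Hypothetical helper: measures construction plus the first match for one mode.
    // With RegexOptions.Compiled this first use includes the one-time IL generation cost;
    // with RegexOptions.None the pattern is interpreted instead.
    public static (bool Matched, TimeSpan Elapsed) TimeFirstUse(
        string pattern, RegexOptions options, string input)
    {
        var stopwatch = Stopwatch.StartNew();
        var regex = new Regex(pattern, options);
        bool matched = regex.IsMatch(input);
        stopwatch.Stop();
        return (matched, stopwatch.Elapsed);
    }
}
```

Calling `RegexModes.TimeFirstUse("[0-9]+", RegexOptions.None, "abc123")` and then the same call with `RegexOptions.Compiled` gives a rough view of the startup cost being discussed; actual numbers depend on the pattern and the machine.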

Instead, we could develop a Roslyn Source Generator that would generate the necessary code (as C#) at compile time.

Essentially no new API should be needed on Regex, as the generated code would hook into it in the regular way, by deriving from RegexRunner and implementing FindFirstChar() and Go(). This API is all public/protected today (indeed, it happens that .NET Core can load regexes written out by .NET Framework).
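As a rough illustration of that hook, the sketch below hand-writes what a generator might emit for the pattern `[0-9]+`. The type names (`DigitsRegex`, `DigitsRunnerFactory`, `DigitsRunner`) are hypothetical. It assumes the protected members that `CompileToAssembly` output relied on (`pattern`, `roptions`, `factory`, `capsize`, `runtext`, `runtextpos`, `runtextend`, `Capture`) behave as they did for that output, and that only string inputs are matched; newer runtimes may treat this legacy path as unsupported.

```C#
using System.Text.RegularExpressions;

// Hypothetical hand-written equivalent of generated code for "[0-9]+".
internal sealed class DigitsRegex : Regex
{
    public DigitsRegex()
    {
        pattern = "[0-9]+";
        roptions = RegexOptions.None;
        factory = new DigitsRunnerFactory();
        capsize = 1; // only the implicit group 0
    }
}

internal sealed class DigitsRunnerFactory : RegexRunnerFactory
{
    protected override RegexRunner CreateInstance() => new DigitsRunner();
}

internal sealed class DigitsRunner : RegexRunner
{
    private static bool IsDigit(char c) => c >= '0' && c <= '9';

    // No backtracking is needed here, so a minimal track count is enough.
    protected override void InitTrackCount() => runtrackcount = 1;

    // Advance runtextpos to the next position where a match could start.
    protected override bool FindFirstChar()
    {
        string text = runtext ?? string.Empty;
        for (int i = runtextpos; i < runtextend; i++)
        {
            if (IsDigit(text[i]))
            {
                runtextpos = i;
                return true;
            }
        }
        runtextpos = runtextend; // nothing left to try
        return false;
    }

    // Attempt a match at runtextpos and record group 0 on success.
    protected override void Go()
    {
        string text = runtext ?? string.Empty;
        int start = runtextpos;
        int end = start;
        while (end < runtextend && IsDigit(text[end]))
        {
            end++;
        }
        if (end > start)
        {
            Capture(0, start, end);
            runtextpos = end;
        }
    }
}
```

With these types, `new DigitsRegex().Match("abc123def")` goes through the ordinary Regex entry points while the matching logic comes entirely from the hand-written runner.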

TBD: the user experience, e.g., how one must annotate regexes in order to trigger the generator (assuming it doesn't attempt to read all the code to try to infer them), and how the generated code is wired up.

For example, one could imagine requiring an annotation like this (we could also expose API on the source generator so it could consume a list of regexes):

[RegexSourceGenerator(@"(?:[0-9]{1,3}\.){3}[0-9]{1,3}", RegexOptions.IgnoreCase)]
public partial Regex CreateIpAddressRegex();
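For comparison, the attribute that later shipped in .NET 7 is `GeneratedRegexAttribute`, applied to a static partial method in a partial type. The sketch below assumes a .NET 7 or later SDK; the `IpRegexes` type and `IpAddress` method names are hypothetical, and this is not the exact shape proposed in this issue.

```C#
using System.Text.RegularExpressions;

internal static partial class IpRegexes
{
    // The source generator supplies the body of this partial method at compile time.
    [GeneratedRegex(@"(?:[0-9]{1,3}\.){3}[0-9]{1,3}", RegexOptions.IgnoreCase)]
    public static partial Regex IpAddress();
}
```

Callers then use `IpRegexes.IpAddress().IsMatch(input)` in place of constructing a `Regex` at run time.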

This new attribute would need to be exposed somewhere, probably on the Regex assembly since the code being compiled would be referencing that anyway.

Also TBD: the implementation details necessary to wire into the existing Regex implementation. This is probably most of the work.

cc @pgovind @eerhardt @stephentoub this is just a summary of our email thread. Anything to add?

In terms of customers, we have at least one major 1st party service that heavily uses regex and highly values both throughput and startup time. This is probably P1 for them.

@danroth27 in a Blazor app, can I assume that most users in the .NET 6 timeframe will be content with either using JavaScript regexes, or interpreted mode .NET regexes? In my mind this is P2/P3.

@marek-safar similar question for Xamarin: on platforms that don't JIT, Xamarin apps presumably either interpret the ref-emitted code or use the regex interpreted mode, so this would not be a big win for them. We had talked about P2.

2 Libraries User Story area-System.Text.RegularExpressions

All 27 comments

Tagging @eerhardt, @pgovind, @jeffhandley as subscribed to this area.
See info in area-owners.md if you want to be subscribed.


Issue Details

Author: danmosemsft
Assignees: -
Labels:

area-System.Text.RegularExpressions, untriaged

Milestone: -

cc @mjsabby

With a source generator based implementation, will the difference between compiled and interpreted regex modes disappear, i.e. will they both end up using the same 'generated managed code' version?

@am11 if you enable it, and presumably annotate appropriately.

There should already be no observable difference, other than perf.

Great! In addition to annotations (or instead of them, if fine-grained control is not required?), an AppContext setting could also be useful here, letting end users make a project-wide choice.

<!-- enduser-lib.csproj -->
<ItemGroup>
  <!-- existing options, where we also do not care about low granularity by means of annotations -->
  <RuntimeHostConfigurationOption Include="System.Globalization.Invariant" Value="true" />
  <RuntimeHostConfigurationOption Include="System.Globalization.UseNls" Value="true" />
  <RuntimeHostConfigurationOption Include="System.Threading.ThreadPool.MaxThreads" Value="true" />
  <!-- etc. -->

  <!-- new, could be annotations and this option, or just this option -->
  <RuntimeHostConfigurationOption Include="System.Text.RegularExpressions.UseSourceGenerated" Value="true" />
</ItemGroup>
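On the consuming side, a setting like this is typically read through `AppContext`. A minimal sketch follows, assuming the hypothetical switch name from the project file above (no such switch is known to exist) and a hypothetical `RegexFeatureSwitch` helper:

```C#
using System;

public static class RegexFeatureSwitch
{
    // Hypothetical switch name, copied from the project-file example above.
    public const string Name = "System.Text.RegularExpressions.UseSourceGenerated";

    // TryGetSwitch returns false when the switch was never set, so treat that as "off".
    public static bool IsEnabled =>
        AppContext.TryGetSwitch(Name, out bool enabled) && enabled;
}
```

A library could check `RegexFeatureSwitch.IsEnabled` once at startup to choose a code path. This only covers reading the flag; it does not address how existing `Regex` call sites would be redirected.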

@am11 I'm not sure it's feasible to make code that uses existing Regex members use source generated implementation.

> I'm not sure it's feasible to make code that uses existing Regex members use source generated implementation.

There are technically ways, but I currently don't think we should pursue them.

For example, imagine we added a static method like:
```C#
public abstract class RegexRunnerFactory
{
    public static void Register(string pattern, RegexOptions options, Func<RegexRunnerFactory> createFactory);
}
```

and then modified the Regex implementation to consult, upon Regex construction, a dictionary of registered runner factories populated by Register. That would enable any Regex usage to instead be backed by something provided via a Register call. Then, in addition to the regex source generator spitting out the relevant types, it would also spit out a module initializer that called Register, e.g.
```C#
var r = new Regex("[a-f0-9]*", RegexOptions.IgnoreCase);
```

would be recognized by the source generator and would spit out:
```C#
[ModuleInitializer]
internal static void Initialize1() => RegexRunnerFactory.Register("[a-f0-9]*", RegexOptions.IgnoreCase, () => new RegexRunner1Factory());
```
where RegexRunner1Factory contains/references all the generated regex code.
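The registration idea can be sketched outside of Regex itself. In the sketch below, `RegexRegistry`, `GeneratedRegistrations`, `Register` and `Create` are all hypothetical names, and the registry stores `Func<Regex>` rather than runner factories so that it runs without touching Regex internals. The `RegexOptions.Compiled` instance is only a stand-in for pregenerated code.

```C#
using System;
using System.Collections.Concurrent;
using System.Runtime.CompilerServices;
using System.Text.RegularExpressions;

public static class RegexRegistry
{
    // Keyed by (pattern, options), mirroring the proposed Register signature.
    private static readonly ConcurrentDictionary<(string Pattern, RegexOptions Options), Func<Regex>> s_factories = new();

    public static void Register(string pattern, RegexOptions options, Func<Regex> createRegex) =>
        s_factories[(pattern, options)] = createRegex;

    // Use a registered factory when one exists, otherwise fall back to a normal Regex.
    public static Regex Create(string pattern, RegexOptions options) =>
        s_factories.TryGetValue((pattern, options), out var createRegex)
            ? createRegex()
            : new Regex(pattern, options);
}

internal static class GeneratedRegistrations
{
    // Stand-in for what a generator would emit; runs when the module is loaded.
    [ModuleInitializer]
    internal static void Initialize1() =>
        RegexRegistry.Register("[a-f0-9]*", RegexOptions.IgnoreCase,
            () => new Regex("[a-f0-9]*", RegexOptions.IgnoreCase | RegexOptions.Compiled));
}
```

`RegexRegistry.Create("[a-f0-9]*", RegexOptions.IgnoreCase)` then returns the registered instance, while any unregistered pattern falls through to `new Regex(...)`.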

But there are a variety of downsides to an approach like this, including but not limited to it bloating the IL with the regex code for all regexes and not just those a developer opted-in to having be source generated. I'm currently of the opinion it's better to opt-in on a case-by-case basis and have it be clear in the source what's resulting in a potentially huge amount of IL being spit into the binary.

Note that the original proposal would require existing code to be refactored:

  • Places where commonly used regexes are newed up need to call the factory method instead
  • Places that use the static regex methods (Match, Matches, Replace, IsMatch, Split) need to be modified to call the factory method and call instance methods on the result instead.

In the latter case, there is currently an internal cache of Regex objects (by default 15) that is consulted by the static methods; by changing to the factory method you lose the benefit of that cache, but of course most of the purpose of that cache is to help avoid re-generating the IL, which would no longer happen anyway.
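A small before/after sketch of the second refactoring, using a hypothetical `HexValidator` type. `CreateHexRegex` here is only a stand-in for the generated factory method and simply constructs a normal `Regex`.

```C#
using System.Text.RegularExpressions;

public static class HexValidator
{
    // One instance is created up front and reused by the "after" path.
    private static readonly Regex s_hex = CreateHexRegex();

    // Before: the static method consults Regex's internal cache on each call.
    public static bool IsHexViaStatic(string input) =>
        Regex.IsMatch(input, "^[a-f0-9]*$", RegexOptions.IgnoreCase);

    // After: instance method on the object returned by the factory method.
    public static bool IsHexViaInstance(string input) => s_hex.IsMatch(input);

    // Stand-in for the generated factory method.
    private static Regex CreateHexRegex() =>
        new Regex("^[a-f0-9]*$", RegexOptions.IgnoreCase);
}
```

The size of the cache used by the static methods can be inspected or changed through the existing `Regex.CacheSize` property.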

If it was important to avoid the refactorings above (eg., so that you could "warm up" the key regexes you plan to use in your startup code, say, without changing existing points of use) it would be easy to expose API for the generated code to add its Regex object to the existing cache (for static methods) or add its runner to a new RegexRunnerCache (for the instance methods, or possibly both). But I assume anyone opting into this is fine changing their existing points of use.

@danmosemsft I don't have any customer evidence that this is an important scenario. You are correct that by default we run in non-SRE mode on all existing mobile configurations.

As for existing code that uses the .NET Framework ability to write out compiled regexes using CompileToAssembly (like @mjsabby), that would have to be modified: instead of newing up the type derived from Regex, it would have to be refactored to use the factory method. I think that's fine. I don't see any particular advantage of a type over a factory method, and an annotated partial method is a more convenient way to hook into compilation.

> it would have to be refactored to use the factory method

Which, for a scenario involving dynamically generating lots of regexes, might mean generating source that in turn relies on a source generator.

@stephentoub Except one source generator can't depend on another source generator: https://github.com/dotnet/roslyn/discussions/48358.

> Except one source generator can't depend on another source generator

Needn't be an actual source generator, just a program that spits out C# source that uses the regex source generator. I was careful in my wording above ;-)

How will this work for F#?

> How will this work for F#?

F# continues to support Regex the same ways it always has. The feature here is an additional mechanism relying on and specific to Roslyn.

In other words, F# is completely excluded from this optimisation.

> This new attribute would need to be exposed somewhere, probably on the Regex assembly since the code being compiled would be referencing that anyway.

This doesn't need to be defined ahead of time. It could be one of the outputs of the source generator.

> How will this work for F#?
>
> F# continues to support Regex the same ways it always has. The feature here is an additional mechanism relying on and specific to Roslyn.

Will the dotnet/fsharp project be involved at all in these ideas? Why is the dotnet/runtime project doing work for the dotnet/roslyn project?

So many questions...

> This doesn't need to be defined ahead of time. It could be one of the outputs of the source generator.

But if the attribute is the thing the source generator keys off of, you wouldn't get IntelliSense when authoring use of the attribute, would you?

@nbevans F# and source generators are tracked here: https://github.com/fsharp/fslang-suggestions/issues/864

I encourage questions like this (what about F#?) to be routed there, because there are several options and we'll take the one that is the best for F# when we consider the state of the .NET ecosystem.

This issue tracks adding a way to push regex compilation into compile-time, which is inherently going to depend on how a language can support that kind of scenario. For C#, it is via a C# Source Generator that the Roslyn C# compiler supports. That's what this means:

> The feature here is an additional mechanism relying on and specific to Roslyn.

As mentioned, this doesn't somehow exclude F# from using the Regex API. It is an additional mechanism that optimizes an existing code path. That's the general model for source generators, at least certainly for something like the runtime where there is an extremely high bar for compatibility.

If you're concerned about F# support for source generators then the issue I linked is the best place to engage.

@stephentoub Correct we'd write a tool to spit out C#, although you could imagine we could just as well spit out what the source generator would, but perhaps it's easier to leverage the work being done here. I imagine this issue covers changes in the public surface area and not the specific way of invocation?

I'm sure you know that and I'm probably imagining reticence in your comment that isn't actually there, but I'm chiming in per the tag that this will be quite useful to us.

> although you could imagine we could just as well spit out what the source generator would

I assume you mean if there were a programmatic API for invoking the generator you'd use it, not that you'd implement your own equivalent generator (if you actually mean the latter, please first contribute that implementation :wink:)

We're more than happy to use the API you provide :)

@mjsabby just to make sure I understand your need: you essentially have a huge list of regexes and, analogous to what you do today, you want to pre-generate the C# and compile it independently of the product assemblies that would instantiate the regexes?

So I guess you would start with (generate with a script maybe) a long list of partial methods with attributes like above, each exposed as a public member on some type(s), and the assembly would contain only that plus the generated code; then your separate implementation assemblies call into that. Does that need extra programmatic API?
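A script of that kind could be quite small. The sketch below emits C# source containing one annotated partial method per regex. `RegexStubWriter` is a hypothetical name, and the emitted `RegexSourceGenerator` attribute is the hypothetical one from this issue, so the output is only meaningful to a generator that recognizes it.

```C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexStubWriter
{
    // Emits a partial class with one annotated partial method per entry.
    public static string Write(
        string typeName,
        IEnumerable<(string MethodName, string Pattern, RegexOptions Options)> entries)
    {
        var sb = new StringBuilder();
        sb.AppendLine("using System.Text.RegularExpressions;");
        sb.AppendLine();
        sb.AppendLine($"public static partial class {typeName}");
        sb.AppendLine("{");
        foreach (var (methodName, pattern, options) in entries)
        {
            // Verbatim string literal: only double quotes need escaping.
            string literal = "@\"" + pattern.Replace("\"", "\"\"") + "\"";
            sb.AppendLine($"    [RegexSourceGenerator({literal}, {FormatOptions(options)})]");
            sb.AppendLine($"    public static partial Regex {methodName}();");
        }
        sb.AppendLine("}");
        return sb.ToString();
    }

    // Turns "IgnoreCase, Multiline" into "RegexOptions.IgnoreCase | RegexOptions.Multiline".
    private static string FormatOptions(RegexOptions options) =>
        string.Join(" | ", options.ToString()
            .Split(new[] { ", " }, StringSplitOptions.None)
            .Select(name => "RegexOptions." + name));
}
```

The resulting text would be written to a `.cs` file and compiled into its own assembly, with the separate implementation assemblies calling the public factory methods.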

We call Regex.CompileToAssembly at build time. This generates an assembly from the passed-in RegexCompilationInfo[]. We would alter this tool to do whatever is required so that it works with the new model of generating C#, which will perhaps still compile it all into its own assembly.

> How will this work for F#?
>
> F# continues to support Regex the same ways it always has. The feature here is an additional mechanism relying on and specific to Roslyn.

Are you sure it's a good idea to add Roslyn specific features to a language agnostic runtime?

> Are you sure it's a good idea to add Roslyn specific features to a language agnostic runtime?

It's not in the runtime. It'd be a separate library used by tooling. Just like analyzers.
