edit by @ViktorHofer, moved initial post down.
Spanifying Regex removes (a) unnecessary string allocations that tend to decrease perf and (b) allows different types of Memory to be processed.
API proposal and implementation: https://github.com/ViktorHofer/corefx/pull/1
This diff contains the Memory overloads and the MatchEvaluator overloads. See discussion above if we should introduce new ref types for Match, Group & Capture.
namespace System.Text.RegularExpressions
{
public partial class Capture
{
internal Capture() { }
public int Index { get { throw null; } }
public int Length { get { throw null; } }
public string Value { get { throw null; } }
+ public ReadOnlySpan<char> ValueSpan { get { throw null; } }
public override string ToString() { throw null; }
}
public partial class Match : System.Text.RegularExpressions.Group
{
internal Match() { }
public static System.Text.RegularExpressions.Match Empty { get { throw null; } }
public virtual System.Text.RegularExpressions.GroupCollection Groups { get { throw null; } }
public System.Text.RegularExpressions.Match NextMatch() { throw null; }
public virtual string Result(string replacement) { throw null; }
+ public virtual bool TryResult(string replacement, Span<char> destination, out int charsWritten) { throw null; }
public static System.Text.RegularExpressions.Match Synchronized(System.Text.RegularExpressions.Match inner) { throw null; }
}
public partial class Regex : System.Runtime.Serialization.ISerializable
{
+ public ref struct SplitEnumerator
+ {
+ public ReadOnlySpan<char> Current { get { throw null; } }
+ public SplitEnumerator GetEnumerator() { throw null; }
+ public bool MoveNext() { throw null; }
+ }
protected internal System.Collections.Hashtable caps;
protected internal System.Collections.Hashtable capnames;
protected internal int capsize;
protected internal string[] capslist;
protected internal System.Text.RegularExpressions.RegexRunnerFactory factory;
public static readonly System.TimeSpan InfiniteMatchTimeout;
protected internal System.TimeSpan internalMatchTimeout;
protected internal string pattern;
protected internal System.Text.RegularExpressions.RegexOptions roptions;
protected Regex() { }
protected Regex(System.Runtime.Serialization.SerializationInfo info, System.Runtime.Serialization.StreamingContext context) { }
public Regex(string pattern) { }
public Regex(string pattern, System.Text.RegularExpressions.RegexOptions options) { }
public Regex(string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { }
public static int CacheSize { get { throw null; } set { } }
[System.CLSCompliant(false)]
protected System.Collections.IDictionary Caps { get { throw null; } set { } }
[System.CLSCompliant(false)]
protected System.Collections.IDictionary CapNames { get { throw null; } set { } }
public System.TimeSpan MatchTimeout { get { throw null; } }
public System.Text.RegularExpressions.RegexOptions Options { get { throw null; } }
public bool RightToLeft { get { throw null; } }
public static string Escape(string str) { throw null; }
+ public static bool TryEscape(ReadOnlySpan<char> str, Span<char> destination, out int charsWritten) { throw null; }
public string[] GetGroupNames() { throw null; }
public int[] GetGroupNumbers() { throw null; }
public string GroupNameFromNumber(int i) { throw null; }
public int GroupNumberFromName(string name) { throw null; }
protected void InitializeReferences() { }
public bool IsMatch(string input) { throw null; }
+ public bool IsMatch(ReadOnlySpan<char> input) { throw null; }
public bool IsMatch(string input, int startat) { throw null; }
public static bool IsMatch(string input, string pattern) { throw null; }
public static bool IsMatch(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
public static bool IsMatch(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+ public static bool IsMatch(ReadOnlySpan<char> input, string pattern, System.Text.RegularExpressions.RegexOptions options = System.Text.RegularExpressions.RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
public System.Text.RegularExpressions.Match Match(string input) { throw null; }
+ public System.Text.RegularExpressions.Match Match(ReadOnlyMemory<char> input) { throw null; }
public System.Text.RegularExpressions.Match Match(string input, int startat) { throw null; }
public System.Text.RegularExpressions.Match Match(string input, int beginning, int length) { throw null; }
public static System.Text.RegularExpressions.Match Match(string input, string pattern) { throw null; }
public static System.Text.RegularExpressions.Match Match(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
public static System.Text.RegularExpressions.Match Match(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+ public static System.Text.RegularExpressions.Match Match(ReadOnlyMemory<char> input, string pattern, System.Text.RegularExpressions.RegexOptions options = RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
public System.Text.RegularExpressions.MatchCollection Matches(string input) { throw null; }
+ public System.Text.RegularExpressions.MatchCollection Matches(ReadOnlyMemory<char> input) { throw null; }
public System.Text.RegularExpressions.MatchCollection Matches(string input, int startat) { throw null; }
public static System.Text.RegularExpressions.MatchCollection Matches(string input, string pattern) { throw null; }
public static System.Text.RegularExpressions.MatchCollection Matches(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
public static System.Text.RegularExpressions.MatchCollection Matches(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+ public static System.Text.RegularExpressions.MatchCollection Matches(ReadOnlyMemory<char> input, string pattern, System.Text.RegularExpressions.RegexOptions options = RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
public string Replace(string input, string replacement) { throw null; }
public string Replace(string input, string replacement, int count) { throw null; }
+ public bool TryReplace(ReadOnlySpan<char> input, string replacement, Span<char> destination, out int charsWritten, int count = -1) { throw null; }
public string Replace(string input, string replacement, int count, int startat) { throw null; }
public static string Replace(string input, string pattern, string replacement) { throw null; }
public static string Replace(string input, string pattern, string replacement, System.Text.RegularExpressions.RegexOptions options) { throw null; }
public static string Replace(string input, string pattern, string replacement, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+ public static bool TryReplace(ReadOnlySpan<char> input, string pattern, string replacement, Span<char> destination, out int charsWritten, System.Text.RegularExpressions.RegexOptions options = System.Text.RegularExpressions.RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
public static string Replace(string input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator) { throw null; }
public static string Replace(string input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator, System.Text.RegularExpressions.RegexOptions options) { throw null; }
public static string Replace(string input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+ public static bool TryReplace(ReadOnlySpan<char> input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator, Span<char> destination, out int charsWritten, System.Text.RegularExpressions.RegexOptions options = System.Text.RegularExpressions.RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
public string Replace(string input, System.Text.RegularExpressions.MatchEvaluator evaluator) { throw null; }
public string Replace(string input, System.Text.RegularExpressions.MatchEvaluator evaluator, int count) { throw null; }
+ public bool TryReplace(ReadOnlySpan<char> input, System.Text.RegularExpressions.MatchEvaluator evaluator, Span<char> destination, out int charsWritten, int count = -1) { throw null; }
public string Replace(string input, System.Text.RegularExpressions.MatchEvaluator evaluator, int count, int startat) { throw null; }
public string[] Split(string input) { throw null; }
public string[] Split(string input, int count) { throw null; }
public string[] Split(string input, int count, int startat) { throw null; }
+ public SplitEnumerator Split(ReadOnlySpan<char> input, int count = 0) { throw null; }
public static string[] Split(string input, string pattern) { throw null; }
public static string[] Split(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
public static string[] Split(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+ public static SplitEnumerator Split(ReadOnlySpan<char> input, string pattern, RegexOptions options = RegexOptions.None, TimeSpan? matchTimeout = null) { throw null; }
void System.Runtime.Serialization.ISerializable.GetObjectData(System.Runtime.Serialization.SerializationInfo si, System.Runtime.Serialization.StreamingContext context) { }
public override string ToString() { throw null; }
public static string Unescape(string str) { throw null; }
+ public static bool TryUnescape(ReadOnlySpan<char> str, Span<char> destination, out int charsWritten) { throw null; }
protected bool UseOptionC() { throw null; }
protected bool UseOptionR() { throw null; }
protected internal static void ValidateMatchTimeout(System.TimeSpan matchTimeout) { }
}
}
I had a discussion with Jan offline and he pointed out that we might want to introduce a ref struct MatchValue
type that is returned by APIs that take Span/Memory as an input.
The problem with just using the current
class Match
is that it gives you unsecure access to the Span. For example, you can send the Match object to other thread and start working on the Span there, while the current thread unwinds and frees the memory.
ref struct MatchValue would avoid this issue
(And also saved the alllocation)
The issues with that is that we currently have the following hiearchy: Match --> Group --> Capture and that Groups and Match contain collections of Captures/Groups.
Yes, the flip from class to valuetype tends to be like this. E.g. when we have introduced ValueTask to CoreFX, a bunch of parallel ValueSomething types went with it.
It is a topic for API review discussion
One option is to just not have Span version of the APIs that returns these collections or have callbacks
For things like Regex.Replace, the startat argument means copy everything up to startat to the destination Span, and then run regular Replace that does not take startat method. So this looks like a convenience method to me - it saves you from typing a tiny bit of code in rare cases to achieve the same effect.
Should we add these startat convenience overloads for Span also?
If yes, this commit should be reverted https://github.com/ViktorHofer/corefx/pull/1/commits/bf7d7f986faaa212476ceeac09ab71c8ab6c7fd5
If you call the Span version of Regex.Split and pass RegexOptions.RightToLeft
to it the yield order of the enumerator will also be right to left as we start looking for matches from right to left. The current implementation (which is not an enumerator!) reverses the captured strings before returning.
I'm not aware of any other cases in the BCL where we have a GetEnumerator method like this on a struct enumerator. I understand you want it to be able to directly foreach the results without introducing an enumerable struct to serve as the return type, but I'm not sure this is a pattern we want to introduce. You should be sure to highlight this as part of any API review discussion.
copied danmosemsft's initial post:
Right now System.Text.RegularExpression.* only works against strings. Matching against buffers requires a string be created from the buffer. That string might not be needed again (especially if there's no match).
Motivating example: a grep tool that enumerates files in a directory tree, and scans their content against a pattern. It should not need to create strings except when a line in the buffer matches the pattern.
@stephentoub mentioned that while Regex.IsMatch could likely have a Span
This proposal obviously isn't ready for review.
@ViktorHofer @JeremyKuhne
@ViktorHofer could you take on making this into a reviewable proposal? It seems to be a fit for your regex perf mission..
@JeremyKuhne presumably your regex filter for GetFiles could operate more efficiently if IsMatch could accept a span?
Yes I already started thinking about it. Will need bit more investigation.
Even if we can do this in phases it would be awesome. Presumably it wouldn't be too hard to change RegexRunner.runtext to ReadOnlySpan
+1
This would really help Select-String in PowerShell. Tried to parallelize it and ran into GC death on string allocations for the inputs.
So +1
Just as a summary of the Select-String
use case:
Run the Match function on every line of a file, given an encoding.
ReadOnlyMemory
seems like a good start, but there needs to be some transform from the readonly memory to a enumeration of the decoded char/rune of the specified encoding.
Maybe also provide
IEnumerable<Match> MatchInFile(string fullname, string pattern, Encoding encoding)
IEnumerable<Match> ReplaceInFile(string fullname, string pattern, string replacement, Encoding encoding)
cc @JeremyKuhne
Some discussion in https://github.com/dotnet/corefx/issues/27267
@powercode for your MatchInFile
proposal, please open a new issue using the regular API propsal format. It would need to add significant value over a simple loop over getting lines from a filestream. I'm not sure that it does. If Regex could consume an actual stream, it might. Some discussion in dotnet/corefx#27267.
Design in progress but will miss 2.1
Presumably it wouldn't be too hard to change RegexRunner.runtext to ReadOnlySpan and that would open up hitting statics on Regex? Haven't looked too deeply, but the key is not the pattern, but the input.
RegexRunner.runtext is exposed (internal protected) and therefore can't be changed easily without breaking already compiled assemblies.
We should only expose these APIs if we can handle both, interpreted as well as compiled regexes, because otherwise we'd make compiled regexes even slower than today (because we need to copy the ReadOnlyMemory<char>
/ReadOnlySpan<char>
to a string).
The current API shape has two modes IsMatch
which doesn't return any results and Match
which returns class instances that point back to the original input. Ideally, we'd like the former to be able operate on spans and the latter over memories. However, this isn't easily doable without hacks (e.g. MemoryOrPinnedSpan
) as they both use the same infratructure.
We don't like the MemoryOrPinnedSpan
type and having to pass pointers internally as this seems like a farm for future AVs.
Jan proposed that we add public ref structs like CaptureValue
and MatchValue
so that we can return spans, but this wouldn't be straight forward to do with the current API shape we where have to return collections of these. This would become messy or force a persistent regex engine where the caller can enumerate the matches.
It seems the most sensbile model for the current API shape of regex is to only accept memories. Yes, it's less safe than spans, but it's more flexible and more suitable for the current API shape without crazy hacks in the implementation.
If we can change our infastructure to use ref structs internally, we can add an overload for IsMatch
that operates over a span instead of a memory.
It might be worthwhile looking into what a brand new ideal API surface for Regex would look like (both in terms of reducing allocations as well as using a better internal engine). If that shape is close to what we have, bolting it on top might be best. If it isn't it seems to be better to offer a brand new API in a new namespace.
Any progress on this?
Just my 2 cents but I'm excited for this to be a reality since it seems so logical to expose .ValueSpan
as an option on classes and sub classes of Capture. @terrajobst: I wish I had a better understanding of why this could be bad with compiled Regex.
Isn't the real difference this?
string Value => Text.Substring(Index, Length);
vs
ReadOnlySpan<char> ValueSpan => Text.AsSpan().Slice(Index, Length);
I'm definitely not a pro at Spans, but I get the underlying benefit.
Right I agree, the ValueSpan
change should go into 3.0 if possible. That said, we need an api-review for it first.
I have another use case for this proposal.
I'm writing a compiler for a simple scripting language. Provided how simple the language is, I'm planning to just use a simple regex-based tokenizer (an LL-recursive descent lexer or something would be overkill).
Now, I have exactly 25,3 MB worth of text to process. Having to create substrings for every single token means millions of allocations.
(Still, this isn't exactly a major problem for me. Performance isn't critical. I can easily allow corert to spend a few seconds allocating/deallocating strings. I ended up here mainly because I "wanted to do the right thing" and was curious to see why Regex API doesn't accept spans.)
Most helpful comment
Any progress on this?