Our regex engine can now spend a decent amount of time inside of span helpers like:
https://github.com/dotnet/runtime/blob/2355a1095ded03ff9e69cf239273f00eaadb3dfb/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs#L467
and in particular, we seem to hit this path:
https://github.com/dotnet/runtime/blob/2355a1095ded03ff9e69cf239273f00eaadb3dfb/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs#L492-L499
fairly frequently. It'd be great to investigate whether we can do anything to improve the performance of these IndexOfAny helpers, whether it's by improving how we do the vectorization, or utilizing intrinsics directly if that would help, etc. I believe @tannergooding had some ideas.
For example, we spend ~30% of the time in the regex redux benchmark in this helper:

cc: @danmosemsft
From initial glance, we never hit the vectorized path for less than 64 bytes on most modern machines (32 bytes on older machines without AVX/AVX2 support or for ARM machines).
Further, we also enforce "alignment" and so we will do sequential scanning for the first few cases (data is not likely to be aligned properly) and therefore also for trailing elements.
I imagine the biggest benefit would be just handling the initial leading/trailing elements as unaligned (doing at most 2 unaligned iterations and the remaining aligned) which would avoid the small 1-4 iteration loops with many branches (the unaligned vector operation should be faster even if the load happens to cross a cache line boundary; since it will cut down the 8 comparisons/branches and the loop).
Other than that, we'd likely get some benefit from switching to use Hardware Intrinsics (rather than Vector<T>). This is namely because Vector<T> doesn't currently support containment and isn't properly "vex" aware. So, essentially, on modern x86 computers we emit more instructions than necessary and as we aren't encoding the operations "efficiently". Hardware Intrinsics fully support both of these and so can clean up the codegen a bit. Utilizing hardware intrinsics would also allow us to vectorize more cases (as we could cover 16 bytes and above; not just 32 or 64-bytes and above).
Also CC. @GrabYourPitchforks
@stephentoub @tannergooding I would like to give this a shot. Seems like a good opportunity to sharpen my intrinsics skills. Focus seems to be on improving perf for less than 64 bytes, hence plan could be:
the initial leading/trailing elements as unaligned (doing at most 2 unaligned iterations and the remaining aligned) which would avoid the small 1-4 iteration loops with many branches (the unaligned vector operation should be faster even if the load happens to cross a cache line boundary; since it will cut down the 8 comparisons/branches and the loop).
- Then perhaps exploring (although perhaps still falling back to
Vector<T>for ARM if relevant)
switching to use Hardware Intrinsics (rather than Vector).
What is in place currently?
Have 4 PRs for these that I've not moved over as they had been open for about a year
Intrinsicify SpanHelpers.IndexOfAny(char,char) https://github.com/dotnet/coreclr/pull/22877
Intrinsicify SpanHelpers.IndexOfAny(char,char,char) https://github.com/dotnet/coreclr/pull/22878
Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) https://github.com/dotnet/coreclr/pull/22879
Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) https://github.com/dotnet/coreclr/pull/22880
So was waiting for other of my intrinsic changes to be merged before bothering e.g. https://github.com/dotnet/runtime/pull/32371
Have 4 PRs for these
@benaadams ha of course you do! :+1: I'll let you finish then. Let me know if you need any help or something :)
@GrabYourPitchforks seems there's a PR dependency chain that includes #32371 -- could you help shuffle it along?
@benaadams were you still going to give this a try?
@benaadams were you still going to give this a try?
Intrinsicify SpanHelpers.IndexOfAny(char,char) https://github.com/dotnet/runtime/pull/40589
Intrinsicify SpanHelpers.IndexOfAny(char,char,char) https://github.com/dotnet/runtime/pull/40590
Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) https://github.com/dotnet/runtime/pull/40591
Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) https://github.com/dotnet/runtime/pull/40592
Most helpful comment
Have 4 PRs for these that I've not moved over as they had been open for about a year
Intrinsicify SpanHelpers.IndexOfAny(char,char)https://github.com/dotnet/coreclr/pull/22877Intrinsicify SpanHelpers.IndexOfAny(char,char,char)https://github.com/dotnet/coreclr/pull/22878Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char)https://github.com/dotnet/coreclr/pull/22879Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char)https://github.com/dotnet/coreclr/pull/22880So was waiting for other of my intrinsic changes to be merged before bothering e.g. https://github.com/dotnet/runtime/pull/32371