Runtime: Improve SpanHelpers.IndexOfAny throughput / vectorization

Created on 31 Jan 2020  路  8Comments  路  Source: dotnet/runtime

Our regex engine can now spend a decent amount of time inside of span helpers like:
https://github.com/dotnet/runtime/blob/2355a1095ded03ff9e69cf239273f00eaadb3dfb/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs#L467
and in particular, we seem to hit this path:
https://github.com/dotnet/runtime/blob/2355a1095ded03ff9e69cf239273f00eaadb3dfb/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs#L492-L499
fairly frequently. It'd be great to investigate whether we can do anything to improve the performance of these IndexOfAny helpers, whether it's by improving how we do the vectorization, or utilizing intrinsics directly if that would help, etc. I believe @tannergooding had some ideas.

For example, we spend ~30% of the time in the regex redux benchmark in this helper:
image

cc: @danmosemsft

area-System.Runtime tenet-performance

Most helpful comment

What is in place currently?

Have 4 PRs for these that I've not moved over as they had been open for about a year

Intrinsicify SpanHelpers.IndexOfAny(char,char) https://github.com/dotnet/coreclr/pull/22877

Intrinsicify SpanHelpers.IndexOfAny(char,char,char) https://github.com/dotnet/coreclr/pull/22878

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) https://github.com/dotnet/coreclr/pull/22879

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) https://github.com/dotnet/coreclr/pull/22880

So was waiting for other of my intrinsic changes to be merged before bothering e.g. https://github.com/dotnet/runtime/pull/32371

All 8 comments

From initial glance, we never hit the vectorized path for less than 64 bytes on most modern machines (32 bytes on older machines without AVX/AVX2 support or for ARM machines).
Further, we also enforce "alignment" and so we will do sequential scanning for the first few cases (data is not likely to be aligned properly) and therefore also for trailing elements.

I imagine the biggest benefit would be just handling the initial leading/trailing elements as unaligned (doing at most 2 unaligned iterations and the remaining aligned) which would avoid the small 1-4 iteration loops with many branches (the unaligned vector operation should be faster even if the load happens to cross a cache line boundary; since it will cut down the 8 comparisons/branches and the loop).

Other than that, we'd likely get some benefit from switching to use Hardware Intrinsics (rather than Vector<T>). This is namely because Vector<T> doesn't currently support containment and isn't properly "vex" aware. So, essentially, on modern x86 computers we emit more instructions than necessary and as we aren't encoding the operations "efficiently". Hardware Intrinsics fully support both of these and so can clean up the codegen a bit. Utilizing hardware intrinsics would also allow us to vectorize more cases (as we could cover 16 bytes and above; not just 32 or 64-bytes and above).

Also CC. @GrabYourPitchforks

@stephentoub @tannergooding I would like to give this a shot. Seems like a good opportunity to sharpen my intrinsics skills. Focus seems to be on improving perf for less than 64 bytes, hence plan could be:

  • Ensure there are benchmarks in place for <64 bytes cases. Ideally, getting patterns observed by @stephentoub What is in place currently?
  • First handling:

the initial leading/trailing elements as unaligned (doing at most 2 unaligned iterations and the remaining aligned) which would avoid the small 1-4 iteration loops with many branches (the unaligned vector operation should be faster even if the load happens to cross a cache line boundary; since it will cut down the 8 comparisons/branches and the loop).

  • Then perhaps exploring (although perhaps still falling back to Vector<T> for ARM if relevant)
    switching to use Hardware Intrinsics (rather than Vector).

What is in place currently?

Have 4 PRs for these that I've not moved over as they had been open for about a year

Intrinsicify SpanHelpers.IndexOfAny(char,char) https://github.com/dotnet/coreclr/pull/22877

Intrinsicify SpanHelpers.IndexOfAny(char,char,char) https://github.com/dotnet/coreclr/pull/22878

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) https://github.com/dotnet/coreclr/pull/22879

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) https://github.com/dotnet/coreclr/pull/22880

So was waiting for other of my intrinsic changes to be merged before bothering e.g. https://github.com/dotnet/runtime/pull/32371

Have 4 PRs for these

@benaadams ha of course you do! :+1: I'll let you finish then. Let me know if you need any help or something :)

@GrabYourPitchforks seems there's a PR dependency chain that includes #32371 -- could you help shuffle it along?

@benaadams were you still going to give this a try?

@benaadams were you still going to give this a try?

Intrinsicify SpanHelpers.IndexOfAny(char,char) https://github.com/dotnet/runtime/pull/40589

Intrinsicify SpanHelpers.IndexOfAny(char,char,char) https://github.com/dotnet/runtime/pull/40590

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) https://github.com/dotnet/runtime/pull/40591

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) https://github.com/dotnet/runtime/pull/40592

Was this page helpful?
0 / 5 - 0 ratings

Related issues

EgorBo picture EgorBo  路  3Comments

GitAntoinee picture GitAntoinee  路  3Comments

jkotas picture jkotas  路  3Comments

matty-hall picture matty-hall  路  3Comments

jzabroski picture jzabroski  路  3Comments