Runtime: Improve SpanHelpers.IndexOfAny throughput / vectorization

Created on 31 Jan 2020 · 8Comments · Source: dotnet/runtime

Our regex engine can now spend a decent amount of time inside of span helpers like:
https://github.com/dotnet/runtime/blob/2355a1095ded03ff9e69cf239273f00eaadb3dfb/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs#L467
and in particular, we seem to hit this path:
https://github.com/dotnet/runtime/blob/2355a1095ded03ff9e69cf239273f00eaadb3dfb/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs#L492-L499
fairly frequently. It'd be great to investigate whether we can do anything to improve the performance of these IndexOfAny helpers, whether it's by improving how we do the vectorization, or utilizing intrinsics directly if that would help, etc. I believe @tannergooding had some ideas.

For example, we spend ~30% of the time in the regex redux benchmark in this helper:

cc: @danmosemsft

area-System.Runtime tenet-performance

Source

stephentoub

👍1

Most helpful comment

What is in place currently?

Have 4 PRs for these that I've not moved over as they had been open for about a year

Intrinsicify SpanHelpers.IndexOfAny(char,char) https://github.com/dotnet/coreclr/pull/22877

Intrinsicify SpanHelpers.IndexOfAny(char,char,char) https://github.com/dotnet/coreclr/pull/22878

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) https://github.com/dotnet/coreclr/pull/22879

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) https://github.com/dotnet/coreclr/pull/22880

So was waiting for other of my intrinsic changes to be merged before bothering e.g. https://github.com/dotnet/runtime/pull/32371

benaadams on 3 Apr 2020

👍3

All 8 comments

From initial glance, we never hit the vectorized path for less than 64 bytes on most modern machines (32 bytes on older machines without AVX/AVX2 support or for ARM machines).
Further, we also enforce "alignment" and so we will do sequential scanning for the first few cases (data is not likely to be aligned properly) and therefore also for trailing elements.

I imagine the biggest benefit would be just handling the initial leading/trailing elements as unaligned (doing at most 2 unaligned iterations and the remaining aligned) which would avoid the small 1-4 iteration loops with many branches (the unaligned vector operation should be faster even if the load happens to cross a cache line boundary; since it will cut down the 8 comparisons/branches and the loop).

Other than that, we'd likely get some benefit from switching to use Hardware Intrinsics (rather than Vector<T>). This is namely because Vector<T> doesn't currently support containment and isn't properly "vex" aware. So, essentially, on modern x86 computers we emit more instructions than necessary and as we aren't encoding the operations "efficiently". Hardware Intrinsics fully support both of these and so can clean up the codegen a bit. Utilizing hardware intrinsics would also allow us to vectorize more cases (as we could cover 16 bytes and above; not just 32 or 64-bytes and above).

tannergooding on 31 Jan 2020

👍1

Also CC. @GrabYourPitchforks

tannergooding on 31 Jan 2020

@stephentoub @tannergooding I would like to give this a shot. Seems like a good opportunity to sharpen my intrinsics skills. Focus seems to be on improving perf for less than 64 bytes, hence plan could be:

Ensure there are benchmarks in place for <64 bytes cases. Ideally, getting patterns observed by @stephentoub What is in place currently?
First handling:

the initial leading/trailing elements as unaligned (doing at most 2 unaligned iterations and the remaining aligned) which would avoid the small 1-4 iteration loops with many branches (the unaligned vector operation should be faster even if the load happens to cross a cache line boundary; since it will cut down the 8 comparisons/branches and the loop).

Then perhaps exploring (although perhaps still falling back to Vector<T> for ARM if relevant)
switching to use Hardware Intrinsics (rather than Vector).

nietras on 3 Apr 2020

What is in place currently?

Have 4 PRs for these that I've not moved over as they had been open for about a year

Intrinsicify SpanHelpers.IndexOfAny(char,char) https://github.com/dotnet/coreclr/pull/22877

Intrinsicify SpanHelpers.IndexOfAny(char,char,char) https://github.com/dotnet/coreclr/pull/22878

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) https://github.com/dotnet/coreclr/pull/22879

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) https://github.com/dotnet/coreclr/pull/22880

So was waiting for other of my intrinsic changes to be merged before bothering e.g. https://github.com/dotnet/runtime/pull/32371

benaadams on 3 Apr 2020

👍3

Have 4 PRs for these

@benaadams ha of course you do! :+1: I'll let you finish then. Let me know if you need any help or something :)

nietras on 3 Apr 2020

@GrabYourPitchforks seems there's a PR dependency chain that includes #32371 -- could you help shuffle it along?

danmosemsft on 3 Apr 2020

@benaadams were you still going to give this a try?

ericstj on 6 Aug 2020

@benaadams were you still going to give this a try?

Intrinsicify SpanHelpers.IndexOfAny(char,char) https://github.com/dotnet/runtime/pull/40589

Intrinsicify SpanHelpers.IndexOfAny(char,char,char) https://github.com/dotnet/runtime/pull/40590

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) https://github.com/dotnet/runtime/pull/40591

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) https://github.com/dotnet/runtime/pull/40592

benaadams on 9 Aug 2020

❤1 👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

JIT: Optimize boxing(x) == boxing(y) to false.

EgorBo · 3Comments

RC2 netstandard1.5 'Object' is defined in an assembly that is not referenced.

jchannon · 3Comments

Default hashcode for struct with double or float field is wrong

jkotas · 3Comments

Generic interface method not being called

v0l · 3Comments

Workaround for apps dependent on DTC in respect to porting to NetCore 3.0

iCodeWebApps · 3Comments