Runtime: Optimize library code using arm64 intrinsics

Created on 6 Mar 2020 · 37 Comments · Source: dotnet/runtime

The following classes/functions in the libraries use Intel x86/x64 intrinsics. These are the places where _ISA_.IsSupported is checked. This information was collected manually and might not be complete. Some of these function names represent many overloads. Some vectorized helper methods are not shown here: cases where a function checks IsSupported and then calls a specific helper function to do the actual work, such as one for SSE2 or AVX2 specifically. There are other cases where Vector<T> is used, but arm64 already supports that (it should be verified that the arm64 Vector<T> code is complete and performant).
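
As a rough illustration of the dispatch pattern described above, here is a minimal sketch, not the actual library source. The class name `IsaDispatchSketch` and the `SoftwareFallback` helper are hypothetical names chosen for this example; `Lzcnt` and `ArmBase` are the existing intrinsics classes in System.Runtime.Intrinsics.X86 and System.Runtime.Intrinsics.Arm.

```csharp
using System;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Hypothetical example type; it only illustrates the IsSupported dispatch shape.
public static class IsaDispatchSketch
{
    // Counts the leading zero bits of a 32-bit value.
    public static int LeadingZeroCount(uint value)
    {
        if (Lzcnt.IsSupported)
        {
            // x86/x64 path, taken only when the LZCNT ISA is available.
            return (int)Lzcnt.LeadingZeroCount(value);
        }

        if (ArmBase.IsSupported)
        {
            // Arm path: the kind of branch this issue asks to add.
            return ArmBase.LeadingZeroCount(value);
        }

        // Portable path for hardware without either ISA.
        return SoftwareFallback(value);
    }

    // Hypothetical helper: scans from the most significant bit downwards.
    private static int SoftwareFallback(uint value)
    {
        int count = 0;
        for (uint mask = 0x8000_0000; mask != 0 && (value & mask) == 0; mask >>= 1)
        {
            count++;
        }
        return count;
    }
}
```

Each `IsSupported` check is expected to be resolved by the JIT for the current machine, so only one branch should remain in the generated code. The software fallback still needs its own test coverage.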

When an arm64-specific intrinsics optimization has been added for one of these, it should be "checked off".

The sections below are listed in the presumed order of priority for implementation. (There is no assumed priority order for the individual functions in each section.)

It is expected that System.Collections.BitArray, System.Numerics, and System.SpanHelpers will be "arm64 intrinsi-fied" for .NET 5. If possible, System.Buffers and System.Text will as well, but that is not considered required.

System.Collections.BitArray https://github.com/dotnet/runtime/issues/33309

  • [x] System.Collections.BitArray - constructor
  • [x] System.Collections.BitArray.And()
  • [x] System.Collections.BitArray.Or()
  • [x] System.Collections.BitArray.Xor()
  • [x] System.Collections.BitArray.Not()
  • [x] System.Collections.BitArray.CopyTo()

System.Runtime.Intrinsics https://github.com/dotnet/runtime/issues/33496

Vector64

  • [x] As()
  • [x] AsInt64()
  • [x] AsUInt64()
  • [x] AsDouble()
  • [x] CreateScalarUnsafe(int value);
  • [x] CreateScalarUnsafe(uint value);
  • [x] CreateScalarUnsafe(float value);
  • [x] CreateScalarUnsafe(byte value);
  • [x] CreateScalarUnsafe(sbyte value);
  • [x] CreateScalarUnsafe(short value);
  • [x] CreateScalarUnsafe(ushort value);
  • [x] CreateScalar(uint)
  • [x] CreateScalar(float)
  • [x] CreateScalar(sbyte)
  • [x] CreateScalar(ushort)
  • [x] CreateScalar(short)
  • [x] CreateScalar(byte)
  • [x] CreateScalar(int)
  • [x] Create(sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte)
  • [x] Create(byte, byte, byte, byte, byte, byte, byte, byte)
  • [x] Create(ushort, ushort, ushort, ushort)
  • [x] Create(short, short, short, short)
  • [x] Create(float, float)
  • [x] Create(int, int)
  • [x] Create(ulong)
  • [x] Create(uint)
  • [x] Create(uint, uint)
  • [x] Create(float)
  • [x] Create(sbyte)
  • [x] Create(long)
  • [x] Create(int)
  • [x] Create(short)
  • [x] Create(double)
  • [x] Create(byte)
  • [x] Create(ushort)
  • [x] GetElement(int index)
  • [x] ToScalar()
  • [x] ToVector128()
  • [x] ToVector128Unsafe()
  • [x] WithElement(Vector64, int, T)

Vector128

  • [x] As()
  • [x] AsVector()
  • [x] AsVector4(Vector128)
  • [x] AsVector128(Vector)
  • [x] AsVector128(Vector4)
  • [x] CreateScalarUnsafe(int value);
  • [x] CreateScalarUnsafe(uint value);
  • [x] CreateScalarUnsafe(float value);
  • [x] CreateScalarUnsafe(long value);
  • [x] CreateScalarUnsafe(ulong value);
  • [x] CreateScalarUnsafe(double value);
  • [x] CreateScalarUnsafe(byte value);
  • [x] CreateScalarUnsafe(sbyte value);
  • [x] CreateScalarUnsafe(short value);
  • [x] CreateScalarUnsafe(ushort value);
  • [x] CreateScalar(ulong)
  • [x] CreateScalar(uint)
  • [x] CreateScalar(ushort)
  • [x] CreateScalar(sbyte)
  • [x] CreateScalar(float)
  • [x] CreateScalar(int)
  • [x] CreateScalar(short)
  • [x] CreateScalar(double)
  • [x] CreateScalar(byte)
  • [x] CreateScalar(long)
  • [x] Create(sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte)
  • [x] Create(byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte)
  • [x] Create(ushort, ushort, ushort, ushort, ushort, ushort, ushort, ushort)
  • [x] Create(short, short, short, short, short, short, short, short)
  • [x] Create(uint, uint, uint, uint)
  • [x] Create(float, float, float, float)
  • [x] Create(int, int, int, int)
  • [x] Create(ulong, ulong)
  • [x] Create(Vector64, Vector64)
  • [x] Create(Vector64, Vector64)
  • [x] Create(Vector64, Vector64)
  • [x] Create(Vector64, Vector64)
  • [x] Create(Vector64, Vector64)
  • [x] Create(Vector64, Vector64)
  • [x] Create(Vector64, Vector64)
  • [x] Create(Vector64, Vector64)
  • [x] Create(byte)
  • [x] Create(double)
  • [x] Create(short)
  • [x] Create(int)
  • [x] Create(long)
  • [x] Create(Vector64, Vector64)
  • [x] Create(float)
  • [x] Create(sbyte)
  • [x] Create(uint)
  • [x] Create(ulong)
  • [x] Create(double, double)
  • [x] Create(long, long)
  • [x] Create(Vector64, Vector64)
  • [x] Create(ushort)
  • [x] GetElement(int index)
  • [x] GetLower()
  • [x] GetUpper()
  • [x] WithElement(Vector128, int, T)
  • [x] WithLower(Vector64)
  • [x] WithUpper(Vector64)
  • [x] ToScalar()

Vector256

  • [x] Software fallback

System.Numerics

System.Numerics.BitOperations https://github.com/dotnet/runtime/issues/33495

  • [x] System.Numerics.BitOperations.LeadingZeroCount()
  • [x] System.Numerics.BitOperations.Log2()
  • [x] System.Numerics.BitOperations.PopCount()
  • [x] System.Numerics.BitOperations.TrailingZeroCount()

System.Numerics.Matrix4x4 #33565

  • [x] System.Numerics.Matrix4x4.Transpose()
  • [x] System.Numerics.Matrix4x4.Lerp()
  • [x] System.Numerics.Matrix4x4.operator-()
  • [x] System.Numerics.Matrix4x4.operator+()
  • [x] System.Numerics.Matrix4x4.operator*()
  • [x] System.Numerics.Matrix4x4.operator==()
  • [x] System.Numerics.Matrix4x4.operator!=()

System.SpanHelpers #33707

  • [x] System.SpanHelpers.IndexOf(byte)
  • [x] System.SpanHelpers.IndexOf(char)
  • [x] System.SpanHelpers.IndexOfAny(byte)
  • ~[ ] System.SpanHelpers.SequenceCompareTo(byte)~ (SIMD vector implementation is fast enough)
  • ~[ ] System.SpanHelpers.SequenceEqual(byte)~ (SIMD vector implementation is fast enough)
  • ~[ ] System.SpanHelpers.LocateFirstFoundByte()~ (Only used by SIMD version of IndexOf and IndexOfAny which are already optimized by ARM64 intrinsics)

System.Buffers #35033

(Not completed in 5.0.0; moved to 6.0.0)

  • [ ] System.Buffers.Text.Base64.DecodeFromUtf8()
  • [ ] System.Buffers.Text.Base64.EncodeToUtf8()

System.Text

System.Text.ASCIIUtility #35034

  • [x] System.Text.ASCIIUtility.GetIndexOfFirstNonAsciiByte()
  • [ ] System.Text.ASCIIUtility.GetIndexOfFirstNonAsciiChar() - (Not completed in 5.0.0; moved to 6.0.0) - PR #39507
  • [x] System.Text.ASCIIUtility.NarrowFourUtf16CharsToAsciiAndWriteToBuffer()
  • [ ] System.Text.ASCIIUtility.NarrowUtf16ToAscii() - (Not completed in 5.0.0; moved to 6.0.0) - PR #39509
  • [x] System.Text.ASCIIUtility.WidenAsciiToUtf16()
  • [x] System.Text.ASCIIUtility.WidenFourAsciiBytesToUtf16AndWriteToBuffer()
  • [x] System.Text.ASCIIUtility.CountNumberOfLeadingAsciiBytesFromUInt32WithSomeNonAsciiData()

System.Text.Unicode #35035

  • [x] System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar()
  • [x] System.Text.Unicode.Utf8Utility.TranscodeToUtf8()
  • [x] System.Text.Unicode.Utf8Utility.GetPointerToFirstInvalidByte()

System.Text.Encodings.Web #35036

  • [x] System.Text.Encodings.Web.DefaultJavaScriptEncoder.FindFirstCharacterToEncodeUtf8()
  • [x] System.Text.Encodings.Web.DefaultJavaScriptEncoderBasicLatin.FindFirstCharacterToEncode()
  • [x] System.Text.Encodings.Web.DefaultJavaScriptEncoderBasicLatin.FindFirstCharacterToEncodeUtf8()
  • [x] System.Text.Encodings.Web.TextEncoder.FindFirstCharacterToEncodeUtf8()
  • [x] System.Text.Encodings.Web.UnsafeRelaxedJavaScriptEncoder.WillEncode()
  • [x] System.Text.Encodings.Web.UnsafeRelaxedJavaScriptEncoder.FindFirstCharacterToEncodeUtf8()

Labels: Epic, arch-arm64, area-Meta, up-for-grabs

All 37 comments

System.Text.ASCIIUtility, System.Text.Unicode, and System.SpanHelpers are the most important ones for web workloads.

@jkotas It's my understanding that ASP.NET is currently not running benchmarks on arm64. Hopefully that will change, then we'd be able to see benefits from optimizing these.

Also, BitArray is chosen to be first to implement because it's simple, has benchmarks defined, and can easily be used as a proof-of-concept of the basics of the intrinsics implementation -- not because it's the most important class.

CC. @CarolEidt, @echesakovMSFT, @GrabYourPitchforks, @TamarChristinaArm

Would love to help with this! Though I remember the majority of the ARM intrinsics being unavailable the last time I looked (only available via an experimental NuGet package). Has this been addressed? I have seen a few API review sessions where ARM intrinsics are discussed but I don't know if they are implemented.

One concern is that I can't quite run tests and benchmarks locally since I do not own an ARM machine (well, I do - but it's an RPi 3B+). I know those exist as part of CI/CD but are the perf numbers available to non-MSFT people?

> I have seen a few API review sessions where ARM intrinsics are discussed but I don't know if they are implemented.

The implementation is in progress, so there will be feature gaps.

> One concern is that I can't quite run tests and benchmarks locally since I do not own an ARM machine (well, I do - but it's an RPi 3B+). I know those exist as part of CI/CD but are the perf numbers available to non-MSFT people?

That's a bigger problem. The CI will do feature testing on ARM64, but there is currently no perf testing available to non-MSFT (and minimal perf testing available internally). You probably don't want to depend on the CI to do all functional testing for you. I _think_ it's possible to use the RPi for this if you install, say, Ubuntu 64-bit.

> I think it's possible to use the RPi for this if you install, say, Ubuntu 64-bit.

I have successfully run perf tests on a RPi 4 with Ubuntu 64-bit; I assume it would also work on the RPi 3B+.

cc @jeffschwMSFT

System.Numerics.Intrinsics #33496, this (and the namespaces under it) should be System.Runtime.Intrinsics.

I think we could also add System.Buffers.Binary.BinaryPrimitives.ReverseEndianness to the list. A quick search of the CoreCLR sources does not suggest that they are intrinsic; however, I may well be wrong.

For System.Runtime.Intrinsics (https://github.com/dotnet/runtime/issues/33496), it is correct that Vector256<T> is unsupported on ARM/ARM64. The functions already have a software fallback however, so no work is needed there.

However, ARM/ARM64 do support Vector64<T> (unlike x86) and functions which are special cased on Vector128<T> should be mirrored down. Likewise, there are functions like Vector128.GetLower which should be specialized (either in the JIT or in software).
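
To make the relationship between the two vector sizes concrete, here is a small sketch of how the existing `Vector128.GetLower`, `GetUpper`, and `Create(Vector64, Vector64)` APIs relate a `Vector128<T>` to its two `Vector64<T>` halves. The class name `VectorHalvesSketch` is a hypothetical name for this example, and the sketch shows only the API semantics, not how the JIT specializes these calls.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical example type showing how Vector128<T> splits into Vector64<T> halves.
public static class VectorHalvesSketch
{
    public static void Main()
    {
        Vector128<int> full = Vector128.Create(1, 2, 3, 4);

        // The lower half holds elements 0 and 1; the upper half holds elements 2 and 3.
        Vector64<int> lower = full.GetLower();
        Vector64<int> upper = full.GetUpper();

        // Recombining the two halves yields the original 128-bit value.
        Vector128<int> rebuilt = Vector128.Create(lower, upper);

        Console.WriteLine(lower.GetElement(0));
        Console.WriteLine(upper.GetElement(1));
        Console.WriteLine(rebuilt.Equals(full));
    }
}
```

On hardware without a native 64-bit vector register type, these calls are expected to go through the software fallback, so the sketch should behave the same on x64 and arm64.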

I attempted to use Arm64 intrinsics to optimize System.Numerics.BitOperations.LeadingZeroCount but it was too early :-) https://github.com/dotnet/coreclr/pull/26815/files (tested on a real device)

Also SpanHelpers.SequenceEqual<byte> after https://github.com/dotnet/runtime/pull/32371

Could I give the Vector64/128 methods a go (at least the parts that I can manage to make work)? Only if nobody is actively working on them, of course.

> It's my understanding that ASP.NET is currently not running benchmarks on arm64. Hopefully that will change, then we'd be able to see benefits from optimizing these.

@BruceForstall It has been a while and you might already know it, but for those who don't: ASP.NET is now running the benchmarks for ARM64. To see the continuous results you need to go to the default Power BI dashboard and check the "ARM64" checkbox:

[image]

Moreover, thanks to awesome work done by @sebastienros and @brianrob you can run the benchmarks using ASP.NET infrastructure and profile them using perfcollect. The produced trace file contains all available information about managed and native methods used by the app. As of today, the trace file can be opened only with PerfView.

[image]

The commands required to run the benchmarks:

git clone https://github.com/aspnet/Benchmarks.git
cd benchmarks\src\BenchmarksDriver
# Plaintext
dotnet run -- --jobs ..\BenchmarksApps\Kestrel\PlatformBenchmarks\benchmarks.plaintext.json --scenario "PlaintextPlatform" --collect-trace
# Json
dotnet run -- --jobs ..\BenchmarksApps\Kestrel\PlatformBenchmarks\benchmarks.json.json --scenario "JsonPlatform" --collect-trace
# Fortunes
dotnet run -- --jobs ..\BenchmarksApps\Kestrel\PlatformBenchmarks\benchmarks.html.json --scenario "FortunesPlatform" --database "PostgreSQL" --collect-trace

Every command should also contain the addresses of the ARM server and client machines, which you can get from @sebastienros:

--server "$secret1" --client "$secret2" 

If you want to test how a change to a given System*.dll affects the performance of the TechEmpower benchmark, you need to pass the file to the driver:

--output-file "C:\Projects\runtime\artifacts\bin\System.Net.Sockets\netcoreapp5.0-Unix-Release\System.Net.Sockets.dll"

The provided file is copied to the output folder of the published ASP.NET app, so it works the same way as running dotnet publish -c Release and replacing the file manually in the bin folder.

/cc @kunalspathak @tannergooding

FWIW I've been profiling the TechEmpower benchmarks for the last few weeks. The most commonly used methods that are important from the performance perspective are:

  • ConcurrentQueue.Enqueue and ConcurrentQueue.TryDequeue, which are used for thread pool work item scheduling and are soon going to be used by the epoll thread to handle IO notifications
  • span.IndexOfAny which is used by the request parser
  • span.CopyTo(span) which is used for writing the output response
  • System.Text.Encodings.Web.TextEncoder::Encode(string) which is used by the Fortunes benchmark to encode given string (it calls System.Text.Internal.AllowedCharactersBitmap::FindFirstCharacterToEncode(char*,int32) which I can see on the list above)
  • The Fortunes benchmark also converts strings from Unicode to UTF-8, using the following methods: System.Private.CoreLib!System.Text.UTF8Encoding::GetByteCount(string), System.Private.CoreLib!System.Text.UTF8Encoding::GetBytes(valuetype System.ReadOnlySpan,valuetype System.Span),
    System.Private.CoreLib!System.Text.Unicode.Utf16Utility::GetPointerToFirstInvalidChar(char*,int32,int64&,int32&)

@adamsitnik Thanks for all the great info! We'll be sure to track ASP.NET-related benchmarks with intrinsics changes.

That's a really nice dashboard @adamsitnik ! Are you able to share which hardware or core the tests run on? Just curious what the benchmarks are run on.

> Are you able to share which hardware or core the tests run on? Just curious what the benchmarks are run on.

All I know is that it's a 32 Core ARM64 machine running on Ubuntu (4.15.0-76-generic).

@sebastienros should know the full spec

Rack-Mount, 1U
ThinkSystem HR330A
1x 32-Core/3.0GHz eMAG CPU
64GB DDR4 (8x8GB)
1x 960GB NVMe M.2 SSD
1x Single-Port 50GbE NIC
2x Serial Ports
1x 1GbE Management Port

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           1
NUMA node(s):        1
Vendor ID:           APM
Model:               2
Model name:          X-Gene
Stepping:            0x3
CPU max MHz:         3300.0000
CPU min MHz:         363.9700
BogoMIPS:            80.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
NUMA node0 CPU(s):   0-31

Awesome, thanks @sebastienros

At this time, we don't believe there will be enough time (or resources) to implement the following optimizations for .NET 5:

System.Numerics.Matrix4x4 #33565
System.Text.ASCIIUtility #35034
System.Text.Unicode #35035
System.Text.Encodings.Web #35036

As @jkotas writes above, this will leave some potential web workload performance improvements on the table.

cc @tannergooding @danmosemsft

There is a meeting Monday morning for me to onboard 3-4 others with porting these remaining types. #33565 is one that I'll be using to help walk them through the process of porting the other code.

What is the current deadline for getting these changes in by?

CC. @jeffhandley

@tannergooding That's great to hear. The deadline is whatever your team chooses as the deadline, for how you manage 5.0 work. The CLR CodeGen team doesn't expect to have enough time/resources to do these ourselves for 5.0 (we chose to do lots of this libraries work initially, to validate the intrinsics work and seed the optimizations), but certainly other teams/people can do them instead.

> What is the current deadline for getting these changes in by?

The platform teams (us) are aiming to be feature complete by Preview 8 snap which is July 15 (internal link)

@BruceForstall - we'll keep you updated on our planning and progress after we kick off the work next Monday.

@tannergooding what is our test strategy for all these -- presumably we're just relying on our regular test bed and having enough hardware variation.

In particular, how do we get coverage on the software fallback paths? In the past, we've said that the ARM machines cover this path for us.

@danmosemsft We have two AzDO pipelines for stressing the ISAs, which includes disabling hardware intrinsics:

runtime-coreclr jitstress-isas-arm: https://dev.azure.com/dnceng/public/_build?definitionId=665
runtime-coreclr jitstress-isas-x86: https://dev.azure.com/dnceng/public/_build?definitionId=664

These are defined using the following modes (in eng\pipelines\common\templates\runtimes\run-test-job.yml, with the tags defined in src\coreclr\tests\testenvironment.proj):

${{ if in(parameters.testGroup, 'jitstress-isas-arm') }}:
  scenarios:
  - jitstress_isas_incompletehwintrinsic
  - jitstress_isas_nohwintrinsic
  - jitstress_isas_nohwintrinsic_nosimd
  - jitstress_isas_nosimd
${{ if in(parameters.testGroup, 'jitstress-isas-x86') }}:
  scenarios:
  - jitstress_isas_incompletehwintrinsic
  - jitstress_isas_nohwintrinsic
  - jitstress_isas_nohwintrinsic_nosimd
  - jitstress_isas_nosimd
  - jitstress_isas_x86_noaes
  - jitstress_isas_x86_noavx
  - jitstress_isas_x86_noavx2
  - jitstress_isas_x86_nobmi1
  - jitstress_isas_x86_nobmi2
  - jitstress_isas_x86_nofma
  - jitstress_isas_x86_nohwintrinsic
  - jitstress_isas_x86_nolzcnt
  - jitstress_isas_x86_nopclmulqdq
  - jitstress_isas_x86_nopopcnt
  - jitstress_isas_x86_nosse
  - jitstress_isas_x86_nosse2
  - jitstress_isas_x86_nosse3
  - jitstress_isas_x86_nosse3_4
  - jitstress_isas_x86_nosse41
  - jitstress_isas_x86_nosse42
  - jitstress_isas_x86_nossse3

@tannergooding can comment on whether that is still sufficient for x86/x64, arm, as well as scenarios like R2R. Note that ARM32 doesn't support hardware intrinsics, so requires the fallback path.

@jeffhandley is work planned for 5.0 complete? If so please let's close this now.

@danmosemsft - There are still some APIs under System.Text.ASCIIUtility and System.Buffers that are not optimized. You can check the status of those methods in the PR description.

@danmosemsft I'm aggregating the status of these and will make a decision today whether the remaining items should be moved to 6.0.

JsonReaderHelper.IndexOfOrLessThan is available with Arm intrinsics https://github.com/dotnet/runtime/pull/41097 (though discussion about packaging for wasm; which could always be dropped out)

Here's the list of intrinsics efforts not merged into release/5.0. I recommend that we move them all to 6.0.

  • System.Buffers #35033
    • System.Buffers.Text.Base64.DecodeFromUtf8()
    • System.Buffers.Text.Base64.EncodeToUtf8()
  • System.Text.ASCIIUtility #35034
    • System.Text.ASCIIUtility.GetIndexOfFirstNonAsciiChar() - Open PR #39507
    • System.Text.ASCIIUtility.NarrowUtf16ToAscii() - Open PR #39509
  • System.Text.Json.JsonReaderHelper (from @benaadams)
    • IndexOfOrLessThan() - Open PR #41097

I will take the following actions:

  1. Update the text of this issue description for the Buffers and ASCIIUtility items to note that those will be completed in 6.0
  2. Move #35033 (Buffers) to the 6.0.0 milestone
  3. Comment on #35034 (ASCIIUtility) which methods were not completed in 5.0.0, and then close that issue
  4. Create a new issue for the remaining ASCIIUtility methods

This all looks good to me. Thanks @jeffhandley for summarizing and closing out the 5.0 release.

And thanks everyone for the great work with all these arm64 improvements!

Agreed. And I think we can please close this then.

> And thanks everyone for the great work with all these arm64 improvements!

+100 ! and appreciation for @BruceForstall for establishing clarity originally by creating this list.

Aforementioned actions taken. #41292 will track the remaining ASCIIUtility methods in 6.0.0 and #35033 was moved to 6.0.0.
