The following classes/functions in the libraries have Intel x86/x64 intrinsics usage. These are where _ISA_.IsSupported() is called. This information was collected manually and might not be complete. Some of these function names represent many overloads. There are some vectorized helper methods not shown here -- where a function calls IsSupported and then calls a specific helper function to do the actual work, such as for SSE2 or AVX2 specifically. There are other cases where Vector<T> is used, but arm64 already supports that (it should be verified that the arm64 Vector<T> code is complete and performant).
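As a rough illustration of the pattern described above, here is a minimal sketch of a method that checks IsSupported for an x86 ISA, adds an arm64 branch next to it, and keeps a scalar fallback. The class and method names (AndSketch, AndInPlace) are hypothetical and not taken from the libraries; the sketch assumes .NET 5 or later (for the AdvSimd class) and a project that allows unsafe code.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, not code from the libraries: computes dst[i] &= src[i].
public static class AndSketch
{
    public static unsafe void AndInPlace(Span<int> dst, ReadOnlySpan<int> src)
    {
        if (dst.Length != src.Length)
            throw new ArgumentException("Spans must have the same length.");

        int i = 0;
        // Last index at which a full 128-bit block (4 ints) still fits.
        int lastBlock = dst.Length - Vector128<int>.Count;

        fixed (int* pDst = dst)
        fixed (int* pSrc = src)
        {
            if (Sse2.IsSupported)
            {
                // Existing x86/x64 path.
                for (; i <= lastBlock; i += Vector128<int>.Count)
                {
                    Vector128<int> a = Sse2.LoadVector128(pDst + i);
                    Vector128<int> b = Sse2.LoadVector128(pSrc + i);
                    Sse2.Store(pDst + i, Sse2.And(a, b));
                }
            }
            else if (AdvSimd.IsSupported)
            {
                // arm64 path added alongside the x86 one.
                for (; i <= lastBlock; i += Vector128<int>.Count)
                {
                    Vector128<int> a = AdvSimd.LoadVector128(pDst + i);
                    Vector128<int> b = AdvSimd.LoadVector128(pSrc + i);
                    AdvSimd.Store(pDst + i, AdvSimd.And(a, b));
                }
            }

            // Scalar tail, which is also the full software fallback
            // when neither ISA is available.
            for (; i < dst.Length; i++)
                pDst[i] &= pSrc[i];
        }
    }
}
```

Because each IsSupported check is treated as a constant by the JIT, only one of the branches should remain in the generated code on a given machine.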
When each of these has added an arm64-specific intrinsics optimization, it should be "checked off".
The sections below are ordered in the presumed priority order that they should be implemented in. (There is no assumed priority order for the individual functions in each section.)
It is expected that System.Collections.BitArray, System.Numerics, and System.SpanHelpers will be "arm64 intrinsi-fied" for .NET 5. If possible, System.Buffers and System.Text will as well, but that is not considered required.
IndexOf and IndexOfAny, which are already optimized by ARM64 intrinsics) (Not completed in 5.0.0; moved to 6.0.0)
System.Text
System.Text.ASCIIUtility, System.Text.Unicode and System.SpanHelpers are the most important ones for Web workloads.
@jkotas It's my understanding that ASP.NET is currently not running benchmarks on arm64. Hopefully that will change, then we'd be able to see benefits from optimizing these.
Also, BitArray is chosen to be first to implement because it's simple, has benchmarks defined, and can easily be used as a proof-of-concept of the basics of the intrinsics implementation -- not because it's the most important class.
CC. @CarolEidt, @echesakovMSFT, @GrabYourPitchforks, @TamarChristinaArm
Would love to help with this! Though, I remember the majority of the ARM intrinsics being unavailable the last time I looked (only available via an experimental NuGet package). Has this been addressed? I have seen a few API review sessions where ARM intrinsics are discussed but I don't know if they are implemented.
One concern is that I can't quite run tests and benchmarks locally since I do not own an ARM machine (well, I do - but it's an RPi 3B+). I know those exist as part of CI/CD but are the perf numbers available to non-MSFT people?
> I have seen a few API review sessions where ARM intrinsics are discussed but I don't know if they are implemented.
The implementation is in progress, so there will be feature gaps.
> One concern is that I can't quite run tests and benchmarks locally since I do not own an ARM machine (well, I do - but it's an RPi 3B+). I know those exist as part of CI/CD but are the perf numbers available to non-MSFT people?
That's a bigger problem. The CI will do feature testing on ARM64, but there is currently no perf testing available to non-MSFT (and minimal available internal). You probably don't want to depend on the CI to do all functional testing for you. I _think_ it's possible to use the RPi for this if you install, say, Ubuntu 64-bit.
> I think it's possible to use the RPi for this if you install, say, Ubuntu 64-bit.
I have successfully run perf tests on a RPi 4 with Ubuntu 64-bit; I assume it would also work on the RPi 3B+.
cc @jeffschwMSFT
System.Numerics.Intrinsics #33496, this (and the namespaces under it) should be System.Runtime.Intrinsics.
I think we could also add System.Buffers.Binary.BinaryPrimitives.ReverseEndianness to the list. A trivial search of the CoreCLR sources does not seem to suggest that they are intrinsic; however, I may well be wrong.
For System.Runtime.Intrinsics (https://github.com/dotnet/runtime/issues/33496), it is correct that Vector256<T> is unsupported on ARM/ARM64. The functions already have a software fallback however, so no work is needed there.
However, ARM/ARM64 do support Vector64<T> (unlike x86) and functions which are special cased on Vector128<T> should be mirrored down. Likewise, there are functions like Vector128.GetLower which should be specialized (either in the JIT or in software).
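For readers less familiar with these types, the following minimal sketch shows what Vector128.GetLower returns at the API level: a Vector64<T> holding the lower half of the 128-bit vector. The class and method names (GetLowerSketch, LowerHalf) are hypothetical; the sketch only demonstrates observable behavior and says nothing about how the JIT specializes the call on arm64.

```csharp
using System.Runtime.Intrinsics;

// Hypothetical helper used only for illustration.
public static class GetLowerSketch
{
    // Returns the two lowest int elements of a 128-bit vector.
    public static (int First, int Second) LowerHalf(Vector128<int> value)
    {
        // GetLower yields the lower 64 bits as a Vector64<int> (two ints).
        Vector64<int> lower = value.GetLower();
        return (lower.GetElement(0), lower.GetElement(1));
    }
}
```

Calling GetLowerSketch.LowerHalf(Vector128.Create(1, 2, 3, 4)) yields the pair (1, 2) regardless of whether the hardware-accelerated or the software path is taken.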
I attempted to use Arm64 intrinsics to optimize System.Numerics.BitOperations.LeadingZeroCount but it was too early :-) https://github.com/dotnet/coreclr/pull/26815/files (tested on a real device)
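A minimal sketch of what such an optimization could look like is shown below. It is not the code from the linked PR; the class name LzcntSketch is hypothetical, and the sketch assumes .NET 5 or later, where ArmBase.LeadingZeroCount is available.

```csharp
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper; not the actual BitOperations implementation.
public static class LzcntSketch
{
    public static int LeadingZeroCount(uint value)
    {
        if (Lzcnt.IsSupported)
        {
            // x86/x64: LZCNT returns 32 for an input of 0.
            return (int)Lzcnt.LeadingZeroCount(value);
        }

        if (ArmBase.IsSupported)
        {
            // ARM: CLZ also returns 32 for an input of 0.
            return ArmBase.LeadingZeroCount(value);
        }

        // Software fallback: shift left until the top bit is set.
        if (value == 0)
        {
            return 32;
        }

        int count = 0;
        while ((value & 0x80000000u) == 0)
        {
            value <<= 1;
            count++;
        }
        return count;
    }
}
```

The simple loop in the fallback is chosen for readability; a production fallback would more likely use a table-based or branch-free approach.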
Also SpanHelpers.SequenceEqual<byte> after https://github.com/dotnet/runtime/pull/32371
Could I give Vector64/128 methods a go (at least, parts of them that I can manage to make it work)? Only if nobody is actively working on it, of course.
> It's my understanding that ASP.NET is currently not running benchmarks on arm64. Hopefully that will change, then we'd be able to see benefits from optimizing these.
@BruceForstall It has been a while and you might already know it, but for those who don't: ASP.NET is now running the benchmarks for ARM64. To see the continuous results you need to go to the default Power BI dashboard and check the "ARM64" checkbox.
Moreover, thanks to awesome work done by @sebastienros and @brianrob you can run the benchmarks using the ASP.NET infrastructure and profile them using perfcollect. The produced trace file contains all available information about managed and native methods used by the app. As of today, the trace file can be opened only with PerfView.
The commands required to run the benchmarks:
git clone https://github.com/aspnet/Benchmarks.git
cd benchmarks\src\BenchmarksDriver
# Plaintext
dotnet run -- --jobs ..\BenchmarksApps\Kestrel\PlatformBenchmarks\benchmarks.plaintext.json --scenario "PlaintextPlatform" --collect-trace
# Json
dotnet run -- --jobs ..\BenchmarksApps\Kestrel\PlatformBenchmarks\benchmarks.json.json --scenario "JsonPlatform" --collect-trace
# Fortunes
dotnet run -- --jobs ..\BenchmarksApps\Kestrel\PlatformBenchmarks\benchmarks.html.json --scenario "FortunesPlatform" --database "PostgreSQL" --collect-trace
Every command should contain the addresses of the ARM server and client machines, which you can get from @sebastienros:
--server "$secret1" --client "$secret2"
If you want to test how your change in a given System*.dll affects the performance of the TechEmpower benchmark, you need to pass it to the driver:
--output-file "C:\Projects\runtime\artifacts\bin\System.Net.Sockets\netcoreapp5.0-Unix-Release\System.Net.Sockets.dll"
The provided file is copied to the output folder of the published ASP.NET app, so it works the same way as running dotnet publish -c Release and changing the file manually in the bin folder.
/cc @kunalspathak @tannergooding
FWIW I've been profiling the TechEmpower benchmarks for the last few weeks. The most commonly used methods that are important from the performance perspective are:
- ConcurrentQueue.Enqueue and ConcurrentQueue.TryDequeue, which are used for thread pool work item scheduling and are soon going to be used by the epoll thread to handle IO notifications
- span.IndexOfAny, which is used by the request parser
- span.CopyTo(span), which is used for writing the output response
- System.Text.Encodings.Web.TextEncoder::Encode(string), which is used by the Fortunes benchmark to encode a given string (it calls System.Text.Internal.AllowedCharactersBitmap::FindFirstCharacterToEncode(char*,int32), which I can see on the list above)
- System.Private.CoreLib!System.Text.UTF8Encoding::GetByteCount(string), System.Private.CoreLib!System.Text.UTF8Encoding::GetBytes(valuetype System.ReadOnlySpan,valuetype System.Span), System.Private.CoreLib!System.Text.Unicode.Utf16Utility::GetPointerToFirstInvalidChar(char*,int32,int64&,int32&)
@adamsitnik Thanks for all the great info! We'll be sure to track ASP.NET-related benchmarks with intrinsics changes.
That's a really nice dashboard @adamsitnik ! Are you able to share which hardware or core the tests run on? Just curious what the benchmarks are run on.
> Are you able to share which hardware or core the tests run on? Just curious what the benchmarks are run on.
All I know is that it's a 32 Core ARM64 machine running on Ubuntu (4.15.0-76-generic).
@sebastienros should know the full spec
Rack-Mount, 1U
ThinkSystem HR330A
1x 32-Core/3.0GHz eMAG CPU
64GB DDR4 (8x8GB)
1x 960GB NVMe M.2 SSD
1x Single-Port 50GbE NIC
2x Serial Ports
1x 1GbE Management Port
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
Vendor ID: APM
Model: 2
Model name: X-Gene
Stepping: 0x3
CPU max MHz: 3300.0000
CPU min MHz: 363.9700
BogoMIPS: 80.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
NUMA node0 CPU(s): 0-31
Awesome, thanks @sebastienros
At this time, we don't believe there will be enough time (or resources) to implement the following optimizations for .NET 5:
System.Numerics.Matrix4x4 #33565
System.Text.ASCIIUtility #35034
System.Text.Unicode #35035
System.Text.Encodings.Web #35036
As @jkotas writes above, this will leave some potential web workload performance improvements on the table.
cc @tannergooding @danmosemsft
There is a meeting Monday morning for me to onboard 3-4 others with porting these remaining types. #33565 is one that I'll be using to help walk them through the process of porting the other code.
What is the current deadline for getting these changes in by?
CC. @jeffhandley
@tannergooding That's great to hear. The deadline is whatever your team chooses as the deadline, for how you manage 5.0 work. The CLR CodeGen team doesn't expect to have enough time/resources to do these ourselves for 5.0 (we chose to do lots of this libraries work initially, to validate the intrinsics work and seed the optimizations), but certainly other teams/people can do them instead.
> What is the current deadline for getting these changes in by?
The platform teams (us) are aiming to be feature complete by Preview 8 snap which is July 15 (internal link)
@BruceForstall - we'll keep you updated on our planning and progress after we kick off the work next Monday.
@tannergooding what is our test strategy for all these -- presumably we're just relying on our regular test bed and having enough hardware variation.
In particular, how do we get coverage on the software fallback paths? In the past, we've said that the ARM machines cover this path for us.
@danmosemsft We have two AzDO pipelines for stressing the ISAs, which includes disabling hardware intrinsics:
runtime-coreclr jitstress-isas-arm: https://dev.azure.com/dnceng/public/_build?definitionId=665
runtime-coreclr jitstress-isas-x86: https://dev.azure.com/dnceng/public/_build?definitionId=664
These are defined using the following modes (in eng\pipelines\common\templates\runtimes\run-test-job.yml, with the tags defined in src\coreclr\tests\testenvironment.proj):
${{ if in(parameters.testGroup, 'jitstress-isas-arm') }}:
  scenarios:
  - jitstress_isas_incompletehwintrinsic
  - jitstress_isas_nohwintrinsic
  - jitstress_isas_nohwintrinsic_nosimd
  - jitstress_isas_nosimd
${{ if in(parameters.testGroup, 'jitstress-isas-x86') }}:
  scenarios:
  - jitstress_isas_incompletehwintrinsic
  - jitstress_isas_nohwintrinsic
  - jitstress_isas_nohwintrinsic_nosimd
  - jitstress_isas_nosimd
  - jitstress_isas_x86_noaes
  - jitstress_isas_x86_noavx
  - jitstress_isas_x86_noavx2
  - jitstress_isas_x86_nobmi1
  - jitstress_isas_x86_nobmi2
  - jitstress_isas_x86_nofma
  - jitstress_isas_x86_nohwintrinsic
  - jitstress_isas_x86_nolzcnt
  - jitstress_isas_x86_nopclmulqdq
  - jitstress_isas_x86_nopopcnt
  - jitstress_isas_x86_nosse
  - jitstress_isas_x86_nosse2
  - jitstress_isas_x86_nosse3
  - jitstress_isas_x86_nosse3_4
  - jitstress_isas_x86_nosse41
  - jitstress_isas_x86_nosse42
  - jitstress_isas_x86_nossse3
@tannergooding can comment on whether that is still sufficient for x86/x64, arm, as well as scenarios like R2R. Note that ARM32 doesn't support hardware intrinsics, so requires the fallback path.
@jeffhandley is work planned for 5.0 complete? If so please let's close this now.
@danmosemsft - There are still some APIs under System.Text.ASCIIUtility and System.Buffers that are not optimized. You can check the status of those methods in the PR description.
@danmosemsft I'm aggregating the status of these and will make a decision today whether the remaining items should be moved to 6.0.
JsonReaderHelper.IndexOfOrLessThan is available with Arm intrinsics https://github.com/dotnet/runtime/pull/41097 (though discussion about packaging for wasm; which could always be dropped out)
Here's the list of intrinsics efforts not merged into release/5.0. I recommend that we move them all to 6.0.
I will take the following actions:
This all looks good to me. Thanks @jeffhandley for summarizing and closing out the 5.0 release.
And thanks everyone for the great work with all these arm64 improvements!
Agreed. And I think we can close this then.
> And thanks everyone for the great work with all these arm64 improvements!
+100 ! and appreciation for @BruceForstall for establishing clarity originally by creating this list.
Aforementioned actions taken. #41292 will track the remaining ASCIIUtility methods in 6.0.0 and #35033 was moved to 6.0.0.