I've noticed while doing some benchmarking that .NET Core is noticeably (21%) slower than .NET Framework on my AMD Ryzen 1200 based PC for a certain piece of code.
Running the same benchmark on an Intel i7 6700 based PC shows .NET Core running significantly (41%) faster than Framework.
Code to reproduce is here:
https://github.com/LordBenjamin/DotNetCore-Ryzen-Performance-Repro/
Benchmarks
OriginalBenchmark.cs Quite close to my actual code. Slower on Ryzen and faster on Intel.
StreamWriteBenchmark.csDistilled down after I noticed that commenting out the call to MemoryStream.Write(byte[] buffer, int offset, int count); in OriginalBenchmark.cs leaves .NET Core consistently faster on both CPUs. Core performance is significantly worse on Ryzen, but comparable to Framework on Intel.
Affected Frameworks
I can reproduce using both .NET Core 2.1 and 3.0. Framework version is 4.7.2 in both cases.
Results
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
AMD Ryzen 3 1200, 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview6-012264
[Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
Core : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
| Method | Job | Runtime | Mean | Error | StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite | Clr | Clr | 68.21 us | 0.4611 us | 0.4088 us |
| StreamWrite | Core | Core | 118.42 us | 0.5408 us | 0.5058 us |
| Original | Clr | Clr | 502.4 us | 6.2900 us | 5.8840 us |
| Original | Core | Core | 630.2 us | 3.8080 us | 3.5620 us |
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.1.700
[Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.3801.0
Core : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
| Method | Job | Runtime | Mean | Error | StdDev |
|------------ |----- |-------- |---------:|----------:|----------:|
| StreamWrite | Clr | Clr | 75.37 us | 0.9133 us | 0.7626 us|
| StreamWrite | Core | Core | 75.39 us | 0.8624 us | 0.8067 us|
| Original | Clr | Clr | 598.0 us |11.6530 us | 14.74 us |
| Original | Core | Core | 341.6 us | 6.8020 us | 10.79 us |
Happy to supply further information or change and re-run benchmarks as required.
Additional benchmark data from a machine with a Ryzen 1600X. Same characteristics:
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
AMD Ryzen 5 1600X, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=2.2.100
[Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
Core : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
| Method | Job | Runtime | Mean | Error | StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite | Clr | Clr | 59.30 us | 0.4784 us | 0.4241 us |
| StreamWrite | Core | Core | 124.52 us | 1.5506 us | 1.3746 us |
May not be strictly related to Ryzen - I've managed to reproduce on a Xeon (albeit running inside a VM):
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.504 (1809/October2018Update/Redstone5), VM=Hyper-V
Intel Xeon Platinum 8168 CPU 2.70GHz, 1 CPU, 36 logical and 18 physical cores
.NET Core SDK=2.1.700
[Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
Core : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
| Method | Job | Runtime | Mean | Error | StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite | Clr | Clr | 89.87 us | 1.7903 us | 2.1312 us |
| StreamWrite | Core | Core | 110.88 us | 0.8309 us | 0.7772 us |
Is it possible to show a comparison of 2.1 and 3.0?
Re-ran on Ryzen 1200 using 3.0 - I can run on the Intel machines later if required.
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
AMD Ryzen 3 1200, 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview6-012264
[Host] : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
Core : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT
| Method | Job | Runtime | Mean | Error | StdDev |
|------------ |----- |-------- |----------:|---------:|---------:|
| StreamWrite | Clr | Clr | 68.10 us | 1.341 us | 1.696 us |
| StreamWrite | Core | Core | 111.50 us | 1.235 us | 1.095 us |
Got a new CPU (Ryzen 3600) recently and .NET 4.7.2 runs considerably faster, but Core 3.0 runs even slower:
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.615 (1809/October2018Update/Redstone5)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview6-012264
[Host] : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
Core : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT
| Method | Job | Runtime | Mean | Error | StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite | Clr | Clr | 44.45 us | 0.0486 us | 0.0430 us |
| StreamWrite | Core | Core | 138.92 us | 0.4383 us | 0.4100 us |
Perhaps @tannergooding has thoughts.
I am just about to refresh my private PC with Ryzen 3600, so unless someone beats me on that, I can look into it once I have the new machine set up. I expects that to happen in about two weeks.
I'll profile on my 1800X when I get into the office. I'll be getting a 3900X in a few days and can test on that then as well.
After investigating, almost the entire time for both .NET Core and full framework is being spent in a memcpy routine. In full framework, it is calling memcpy directly; However, in .NET Core it is calling JIT_MemCpy in src/vm/amd64/CrtHelpers.asm and that looks to be causing the ~2x slowdown.
Looks like this was changed in https://github.com/dotnet/coreclr/pull/7198
Interesting, I was almost sure it will be that :-).
It would be interesting to try if changing the rep movsb cutoff that's set to 512 bytes would help Ryzen to perform better.
I'll give that a try. Also going to grab perf metrics using the CRT implementation method.
For reference, the AMD Optimization Manuals currently give the following guidelines for rep stos

From the guidance given and the current source code, the following things are likely hurting perf:
Reverting to use the CRT implementation makes it the same speed as .NET Framework (it actually looks to be consistently about 1% faster, but that is likely within error).
Changing the JIT_MemCpy algorithm to remove the rep stosb path makes the code ~2% slower than the .NET Framework code.
As @jkotas called out in the original PR (https://github.com/dotnet/coreclr/pull/7198#discussion_r78892798), having a private copy of memset/memcpy has a number of drawbacks. It also certainly hasn't undergone nearly the same amount of testing/profiling as the CRT implementation.
I would be very much in favor of removing this path, especially since it causes a nearly 2x slowdown on Ryzen machines. It is also missing a bunch of logic around prefetching, ensuring data alignment, etc that the CRT implementation does currently have.
@jkotas, @janvorli. I this something we should consider fixing for 3.0, considering it is a nearly 2x penalty for copying memory on Ryzen CPUs?
The fix is relatively trivial, removes us having a custom memcpy routine (which is already 64-bit Windows only), and won't significantly regress Intel CPUs (based on the numbers in the original PR, it was 33% average for inputs under 512 bytes and no real difference for inputs over 512 bytes).
In any case, you should submit a fix to master for .NET 5. We can then decide whether it will get backported to release/3.0.
I put up https://github.com/dotnet/coreclr/pull/25750 to resolve the issue for .NET 5
Most helpful comment
Reverting to use the CRT implementation makes it the same speed as .NET Framework (it actually looks to be consistently about 1% faster, but that is likely within error).
Changing the
JIT_MemCpyalgorithm to remove therep stosbpath makes the code ~2% slower than the .NET Framework code.As @jkotas called out in the original PR (https://github.com/dotnet/coreclr/pull/7198#discussion_r78892798), having a private copy of
memset/memcpyhas a number of drawbacks. It also certainly hasn't undergone nearly the same amount of testing/profiling as the CRT implementation.I would be very much in favor of removing this path, especially since it causes a nearly 2x slowdown on Ryzen machines. It is also missing a bunch of logic around prefetching, ensuring data alignment, etc that the CRT implementation does currently have.