Runtime: MemoryStream.Write() slow on Ryzen CPUs

Created on 29 Jun 2019  路  18Comments  路  Source: dotnet/runtime

I've noticed while doing some benchmarking that .NET Core is noticeably (21%) slower than .NET Framework on my AMD Ryzen 1200 based PC for a certain piece of code.

Running the same benchmark on an Intel i7 6700 based PC shows .NET Core running significantly (41%) faster than Framework.

Code to reproduce is here:
https://github.com/LordBenjamin/DotNetCore-Ryzen-Performance-Repro/

Benchmarks

  • OriginalBenchmark.cs

Quite close to my actual code. Slower on Ryzen and faster on Intel.

  • StreamWriteBenchmark.cs

Distilled down after I noticed that commenting out the call to MemoryStream.Write(byte[] buffer, int offset, int count); in OriginalBenchmark.cs leaves .NET Core consistently faster on both CPUs. Core performance is significantly worse on Ryzen, but comparable to Framework on Intel.

Affected Frameworks

I can reproduce using both .NET Core 2.1 and 3.0. Framework version is 4.7.2 in both cases.

Results

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
AMD Ryzen 3 1200, 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview6-012264
  [Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT


|      Method |  Job | Runtime |      Mean |     Error |    StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite |  Clr |     Clr |  68.21 us | 0.4611 us | 0.4088 us |
| StreamWrite | Core |    Core | 118.42 us | 0.5408 us | 0.5058 us |
| Original    |  Clr |     Clr |  502.4 us | 6.2900 us | 5.8840 us |
| Original    | Core |    Core |  630.2 us | 3.8080 us | 3.5620 us |
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.1.700
  [Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.3801.0
  Core   : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT


|      Method |  Job | Runtime |     Mean |     Error |    StdDev |
|------------ |----- |-------- |---------:|----------:|----------:|
| StreamWrite |  Clr |     Clr | 75.37 us | 0.9133 us |  0.7626 us|
| StreamWrite | Core |    Core | 75.39 us | 0.8624 us |  0.8067 us|
| Original    |  Clr |     Clr | 598.0 us |11.6530 us | 14.74 us  |
| Original    | Core |    Core | 341.6 us | 6.8020 us | 10.79 us  |

Happy to supply further information or change and re-run benchmarks as required.

amd-perf arch-x64 area-System.IO tenet-performance

Most helpful comment

Reverting to use the CRT implementation makes it the same speed as .NET Framework (it actually looks to be consistently about 1% faster, but that is likely within error).

Changing the JIT_MemCpy algorithm to remove the rep stosb path makes the code ~2% slower than the .NET Framework code.

As @jkotas called out in the original PR (https://github.com/dotnet/coreclr/pull/7198#discussion_r78892798), having a private copy of memset/memcpy has a number of drawbacks. It also certainly hasn't undergone nearly the same amount of testing/profiling as the CRT implementation.

I would be very much in favor of removing this path, especially since it causes a nearly 2x slowdown on Ryzen machines. It is also missing a bunch of logic around prefetching, ensuring data alignment, etc that the CRT implementation does currently have.

All 18 comments

Additional benchmark data from a machine with a Ryzen 1600X. Same characteristics:

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
AMD Ryzen 5 1600X, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=2.2.100
  [Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT


|      Method |  Job | Runtime |      Mean |     Error |    StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite |  Clr |     Clr |  59.30 us | 0.4784 us | 0.4241 us |
| StreamWrite | Core |    Core | 124.52 us | 1.5506 us | 1.3746 us |

May not be strictly related to Ryzen - I've managed to reproduce on a Xeon (albeit running inside a VM):

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.504 (1809/October2018Update/Redstone5), VM=Hyper-V
Intel Xeon Platinum 8168 CPU 2.70GHz, 1 CPU, 36 logical and 18 physical cores
.NET Core SDK=2.1.700
  [Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT


|      Method |  Job | Runtime |      Mean |     Error |    StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite |  Clr |     Clr |  89.87 us | 1.7903 us | 2.1312 us |
| StreamWrite | Core |    Core | 110.88 us | 0.8309 us | 0.7772 us |

Is it possible to show a comparison of 2.1 and 3.0?

Re-ran on Ryzen 1200 using 3.0 - I can run on the Intel machines later if required.

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
AMD Ryzen 3 1200, 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview6-012264
  [Host] : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT


|      Method |  Job | Runtime |      Mean |    Error |   StdDev |
|------------ |----- |-------- |----------:|---------:|---------:|
| StreamWrite |  Clr |     Clr |  68.10 us | 1.341 us | 1.696 us |
| StreamWrite | Core |    Core | 111.50 us | 1.235 us | 1.095 us |

Got a new CPU (Ryzen 3600) recently and .NET 4.7.2 runs considerably faster, but Core 3.0 runs even slower:

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.615 (1809/October2018Update/Redstone5)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview6-012264
  [Host] : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT


|      Method |  Job | Runtime |      Mean |     Error |    StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite |  Clr |     Clr |  44.45 us | 0.0486 us | 0.0430 us |
| StreamWrite | Core |    Core | 138.92 us | 0.4383 us | 0.4100 us |

Perhaps @tannergooding has thoughts.

I am just about to refresh my private PC with Ryzen 3600, so unless someone beats me on that, I can look into it once I have the new machine set up. I expects that to happen in about two weeks.

I'll profile on my 1800X when I get into the office. I'll be getting a 3900X in a few days and can test on that then as well.

After investigating, almost the entire time for both .NET Core and full framework is being spent in a memcpy routine. In full framework, it is calling memcpy directly; However, in .NET Core it is calling JIT_MemCpy in src/vm/amd64/CrtHelpers.asm and that looks to be causing the ~2x slowdown.

Looks like this was changed in https://github.com/dotnet/coreclr/pull/7198

Interesting, I was almost sure it will be that :-).

It would be interesting to try if changing the rep movsb cutoff that's set to 512 bytes would help Ryzen to perform better.

I'll give that a try. Also going to grab perf metrics using the CRT implementation method.

For reference, the AMD Optimization Manuals currently give the following guidelines for rep stos
image

From the guidance given and the current source code, the following things are likely hurting perf:

  • Not using the largest possible operand size
  • Not attempting to align the source/destination to 16-bytes (or at least the destination if both can't be)
  • Not using other techniques if the repeat count is larger than the data cache size

Reverting to use the CRT implementation makes it the same speed as .NET Framework (it actually looks to be consistently about 1% faster, but that is likely within error).

Changing the JIT_MemCpy algorithm to remove the rep stosb path makes the code ~2% slower than the .NET Framework code.

As @jkotas called out in the original PR (https://github.com/dotnet/coreclr/pull/7198#discussion_r78892798), having a private copy of memset/memcpy has a number of drawbacks. It also certainly hasn't undergone nearly the same amount of testing/profiling as the CRT implementation.

I would be very much in favor of removing this path, especially since it causes a nearly 2x slowdown on Ryzen machines. It is also missing a bunch of logic around prefetching, ensuring data alignment, etc that the CRT implementation does currently have.

@jkotas, @janvorli. I this something we should consider fixing for 3.0, considering it is a nearly 2x penalty for copying memory on Ryzen CPUs?

The fix is relatively trivial, removes us having a custom memcpy routine (which is already 64-bit Windows only), and won't significantly regress Intel CPUs (based on the numbers in the original PR, it was 33% average for inputs under 512 bytes and no real difference for inputs over 512 bytes).

In any case, you should submit a fix to master for .NET 5. We can then decide whether it will get backported to release/3.0.

I put up https://github.com/dotnet/coreclr/pull/25750 to resolve the issue for .NET 5

Was this page helpful?
0 / 5 - 0 ratings