Runtime: Possible Performance Regression from 2.0.7 to 2.1-RC1 on Windows Server 2016 w/ Hyper-V Role

Created on 9 May 2018 · 41Comments · Source: dotnet/runtime

@Mike-EEE commented on Wed May 09 2018

Hello .NET Core Team and Community,

I have a very interesting performance issue for you here. I have produced some code that demonstrates a regressing difference in performance when switching from the .NET Core 2.0.x runtime to 2.1-RC1 on my Windows Server 2016 machine that I am using as a Hyper-V server. Please note that I am also using Visual Studio 2017 15.7.1 in producing my output here.

To start, this issue has context that is most recently rooted in a Benchmark.NET issue that you can view here. I have created a series of five benchmarks using the exact same underlying test code. Each have a slight modification that produce different results in 2.0.7 when run on my Windows Server host machine (note: the host machine and not a VM instance). You can see an overview of this information here.

I took the code that is in the 3rd example, and by simply rebuilding it and running it in the 2.1 runtime, it produced slower results. I have created this 2.1 branch here for your convenience.

(Please do note that the results of each scenario are found in the README.md of their respective branches.)

Additionally, with a small change, I was once again able to produce the faster result times as seen in 2.0.7 but within the 2.1 runtime. I have published this branch here. This is consistent with the bizarre behavior referenced in the five earlier examples, but seen here in a different form (or configuration, for lack of a better term) in 2.1.

I feel it's worthy to note that I believe that this is a gremlin of a problem that I have been chasing in .NET Core tooling since 1.1. I have encountered the most weird of circumstances around this behavior while trying to benchmark code using Benchmark.NET. You can see previous iterations of my previous, more complicated attempts at wrestling this beast here for your reference:
https://github.com/dotnet/BenchmarkDotNet/issues/330
https://github.com/dotnet/BenchmarkDotNet/issues/433

Now that I have been able to show a (regressing) discrepancy in performance between 2.0.7 and 2.1 RC1 with a simple recompile, I believe this could be due to a tooling/SDK problem that has been around for some time now. Fortunately, I have been able to capture it using a very simple project this time and am more than happy to share it now in hopes to finally tracking down this very weird issue. 😄

Finally, I will provide an overview of my Windows Server specifications here.

        Operating System
            Windows Server 2016 Standard 64-bit
        CPU
            Intel Core i7 4820K @ 3.70GHz   38 °C
            Ivy Bridge-E 22nm Technology
        RAM
            48.0GB DDR3 @ 792MHz (9-9-9-27)
        Motherboard
            ASUSTeK COMPUTER INC. RAMPAGE IV EXTREME (LGA2011)  30 °C
        Graphics
            VW246 (1920x1080@60Hz)
            G246HL (1920x1080@60Hz)
            VW246 (1920x1080@60Hz)
            VW246 (1920x1080@60Hz)
            2047MB NVIDIA GeForce GTX 660 (Gigabyte)    30 °C
        Storage
            238GB Samsung SSD 840 PRO Series (SSD)  25 °C
            892GB Microsoft Storage Space Device (SSD)  28 °C
            463GB Microsoft Storage Space Device (SSD)  28 °C
            1536GB SYNOLOGY iSCSI Storage SCSI Disk Device (iSCSI)
        Optical Drives
            No optical disk drives detected
        Audio
            High Definition Audio Device
Operating System
    Windows Server 2016 Standard 64-bit
    Computer type: Virtual
    Installation Date: 5/8/2017 4:41:12 PM

CPU
        Intel Core i7 4820K
            Cores   4
            Threads 8
            Name    Intel Core i7 4820K
            Code Name   Ivy Bridge-E
            Package Socket 2011 LGA
            Technology  22nm
            Specification   Intel Core i7-4820K CPU @ 3.70GHz
            Family  6
            Extended Family 6
            Model   E
            Extended Model  3E
            Stepping    4
            Revision    S0/S1
            Instructions    MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, Intel 64, NX, AES, AVX
            Virtualization  Not supported
            Hyperthreading  Supported, Enabled
            Fan Speed   594 RPM
            Rated Bus Speed 3167.2 MHz
            Stock Core Speed    3700 MHz
            Stock Bus Speed 100 MHz
            Average Temperature 38 °C
                Caches
                    L1 Data Cache Size  4 x 32 KBytes
                    L1 Instructions Cache Size  4 x 32 KBytes
                    L2 Unified Cache Size   4 x 256 KBytes
                    L3 Unified Cache Size   10240 KBytes
                Cores
                        Core 0
                            Core Speed  4552.8 MHz
                            Multiplier  x 46.0
                            Bus Speed   99.0 MHz
                            Rated Bus Speed 3167.2 MHz
                            Temperature 35 °C
                            Threads APIC ID: 0, 1
                        Core 1
                            Core Speed  4651.8 MHz
                            Multiplier  x 47.0
                            Bus Speed   99.0 MHz
                            Rated Bus Speed 3167.2 MHz
                            Temperature 36 °C
                            Threads APIC ID: 2, 3
                        Core 2
                            Core Speed  4552.8 MHz
                            Multiplier  x 46.0
                            Bus Speed   99.0 MHz
                            Rated Bus Speed 3167.2 MHz
                            Temperature 42 °C
                            Threads APIC ID: 4, 5
                        Core 3
                            Core Speed  4552.8 MHz
                            Multiplier  x 46.0
                            Bus Speed   99.0 MHz
                            Rated Bus Speed 3167.2 MHz
                            Temperature 38 °C
                            Threads APIC ID: 6, 7
RAM
        Memory slots
            Total memory slots  8
            Used memory slots   6
            Free memory slots   2
        Memory
            Type    DDR3
            Size    49152 MBytes
            DRAM Frequency  791.8 MHz
            CAS# Latency (CL)   9 clocks
            RAS# to CAS# Delay (tRCD)   9 clocks
            RAS# Precharge (tRP)    9 clocks
            Cycle Time (tRAS)   27 clocks
            Command Rate (CR)   2T
        Physical Memory
            Memory Usage    64 %
            Total Physical  48 GB
            Available Physical  17 GB
            Total Virtual   96 GB
            Available Virtual   64 GB
        SPD
            Number Of SPD Modules   6
                Slot dotnet/coreclr#1
                    Type    DDR3
                    Size    8192 MBytes
                    Manufacturer    Kingston
                    Max Bandwidth   PC3-10700 (667 MHz)
                    Part Number KHX2133C11D3/8GX
                    Week/year   42 / 13
                    SPD Ext.    XMP
                        Timing table
                                JEDEC dotnet/coreclr#1
                                    Frequency   457.1 MHz
                                    CAS# Latency    6.0
                                    RAS# To CAS#    6
                                    RAS# Precharge  6
                                    tRAS    17
                                    tRC 23
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#2
                                    Frequency   533.3 MHz
                                    CAS# Latency    7.0
                                    RAS# To CAS#    7
                                    RAS# Precharge  7
                                    tRAS    20
                                    tRC 27
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#3
                                    Frequency   609.5 MHz
                                    CAS# Latency    8.0
                                    RAS# To CAS#    8
                                    RAS# Precharge  8
                                    tRAS    22
                                    tRC 30
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#4
                                    Frequency   666.7 MHz
                                    CAS# Latency    9.0
                                    RAS# To CAS#    9
                                    RAS# Precharge  9
                                    tRAS    24
                                    tRC 33
                                    Voltage 1.500 V
                                XMP-2132
                                    Frequency   1066 MHz
                                    CAS# Latency    11.0
                                    RAS# To CAS#    12
                                    RAS# Precharge  11
                                    tRAS    30
                                    Voltage 1.600 V
                Slot dotnet/coreclr#2
                    Type    DDR3
                    Size    8192 MBytes
                    Manufacturer    Kingston
                    Max Bandwidth   PC3-10700 (667 MHz)
                    Part Number KHX2133C11D3/8GX
                    Week/year   42 / 13
                    SPD Ext.    XMP
                        Timing table
                                JEDEC dotnet/coreclr#1
                                    Frequency   457.1 MHz
                                    CAS# Latency    6.0
                                    RAS# To CAS#    6
                                    RAS# Precharge  6
                                    tRAS    17
                                    tRC 23
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#2
                                    Frequency   533.3 MHz
                                    CAS# Latency    7.0
                                    RAS# To CAS#    7
                                    RAS# Precharge  7
                                    tRAS    20
                                    tRC 27
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#3
                                    Frequency   609.5 MHz
                                    CAS# Latency    8.0
                                    RAS# To CAS#    8
                                    RAS# Precharge  8
                                    tRAS    22
                                    tRC 30
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#4
                                    Frequency   666.7 MHz
                                    CAS# Latency    9.0
                                    RAS# To CAS#    9
                                    RAS# Precharge  9
                                    tRAS    24
                                    tRC 33
                                    Voltage 1.500 V
                                XMP-2132
                                    Frequency   1066 MHz
                                    CAS# Latency    11.0
                                    RAS# To CAS#    12
                                    RAS# Precharge  11
                                    tRAS    30
                                    Voltage 1.600 V
                Slot dotnet/coreclr#3
                    Type    DDR3
                    Size    8192 MBytes
                    Manufacturer    Kingston
                    Max Bandwidth   PC3-10700 (667 MHz)
                    Part Number KHX2133C11D3/8GX
                    Week/year   42 / 13
                    SPD Ext.    XMP
                        Timing table
                                JEDEC dotnet/coreclr#1
                                    Frequency   457.1 MHz
                                    CAS# Latency    6.0
                                    RAS# To CAS#    6
                                    RAS# Precharge  6
                                    tRAS    17
                                    tRC 23
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#2
                                    Frequency   533.3 MHz
                                    CAS# Latency    7.0
                                    RAS# To CAS#    7
                                    RAS# Precharge  7
                                    tRAS    20
                                    tRC 27
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#3
                                    Frequency   609.5 MHz
                                    CAS# Latency    8.0
                                    RAS# To CAS#    8
                                    RAS# Precharge  8
                                    tRAS    22
                                    tRC 30
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#4
                                    Frequency   666.7 MHz
                                    CAS# Latency    9.0
                                    RAS# To CAS#    9
                                    RAS# Precharge  9
                                    tRAS    24
                                    tRC 33
                                    Voltage 1.500 V
                                XMP-2132
                                    Frequency   1066 MHz
                                    CAS# Latency    11.0
                                    RAS# To CAS#    12
                                    RAS# Precharge  11
                                    tRAS    30
                                    Voltage 1.600 V
                Slot dotnet/coreclr#4
                    Type    DDR3
                    Size    8192 MBytes
                    Manufacturer    Kingston
                    Max Bandwidth   PC3-10700 (667 MHz)
                    Part Number KHX2133C11D3/8GX
                    Week/year   42 / 13
                    SPD Ext.    XMP
                        Timing table
                                JEDEC dotnet/coreclr#1
                                    Frequency   457.1 MHz
                                    CAS# Latency    6.0
                                    RAS# To CAS#    6
                                    RAS# Precharge  6
                                    tRAS    17
                                    tRC 23
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#2
                                    Frequency   533.3 MHz
                                    CAS# Latency    7.0
                                    RAS# To CAS#    7
                                    RAS# Precharge  7
                                    tRAS    20
                                    tRC 27
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#3
                                    Frequency   609.5 MHz
                                    CAS# Latency    8.0
                                    RAS# To CAS#    8
                                    RAS# Precharge  8
                                    tRAS    22
                                    tRC 30
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#4
                                    Frequency   666.7 MHz
                                    CAS# Latency    9.0
                                    RAS# To CAS#    9
                                    RAS# Precharge  9
                                    tRAS    24
                                    tRC 33
                                    Voltage 1.500 V
                                XMP-2132
                                    Frequency   1066 MHz
                                    CAS# Latency    11.0
                                    RAS# To CAS#    12
                                    RAS# Precharge  11
                                    tRAS    30
                                    Voltage 1.600 V
                Slot dotnet/coreclr#5
                    Type    DDR3
                    Size    8192 MBytes
                    Manufacturer    Kingston
                    Max Bandwidth   PC3-10700 (667 MHz)
                    Part Number KHX2133C11D3/8GX
                    Week/year   42 / 13
                    SPD Ext.    XMP
                        Timing table
                                JEDEC dotnet/coreclr#1
                                    Frequency   457.1 MHz
                                    CAS# Latency    6.0
                                    RAS# To CAS#    6
                                    RAS# Precharge  6
                                    tRAS    17
                                    tRC 23
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#2
                                    Frequency   533.3 MHz
                                    CAS# Latency    7.0
                                    RAS# To CAS#    7
                                    RAS# Precharge  7
                                    tRAS    20
                                    tRC 27
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#3
                                    Frequency   609.5 MHz
                                    CAS# Latency    8.0
                                    RAS# To CAS#    8
                                    RAS# Precharge  8
                                    tRAS    22
                                    tRC 30
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#4
                                    Frequency   666.7 MHz
                                    CAS# Latency    9.0
                                    RAS# To CAS#    9
                                    RAS# Precharge  9
                                    tRAS    24
                                    tRC 33
                                    Voltage 1.500 V
                                XMP-2132
                                    Frequency   1066 MHz
                                    CAS# Latency    11.0
                                    RAS# To CAS#    12
                                    RAS# Precharge  11
                                    tRAS    30
                                    Voltage 1.600 V
                Slot dotnet/coreclr#6
                    Type    DDR3
                    Size    8192 MBytes
                    Manufacturer    Kingston
                    Max Bandwidth   PC3-10700 (667 MHz)
                    Part Number KHX2133C11D3/8GX
                    Week/year   42 / 13
                    SPD Ext.    XMP
                        Timing table
                                JEDEC dotnet/coreclr#1
                                    Frequency   457.1 MHz
                                    CAS# Latency    6.0
                                    RAS# To CAS#    6
                                    RAS# Precharge  6
                                    tRAS    17
                                    tRC 23
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#2
                                    Frequency   533.3 MHz
                                    CAS# Latency    7.0
                                    RAS# To CAS#    7
                                    RAS# Precharge  7
                                    tRAS    20
                                    tRC 27
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#3
                                    Frequency   609.5 MHz
                                    CAS# Latency    8.0
                                    RAS# To CAS#    8
                                    RAS# Precharge  8
                                    tRAS    22
                                    tRC 30
                                    Voltage 1.500 V
                                JEDEC dotnet/coreclr#4
                                    Frequency   666.7 MHz
                                    CAS# Latency    9.0
                                    RAS# To CAS#    9
                                    RAS# Precharge  9
                                    tRAS    24
                                    tRC 33
                                    Voltage 1.500 V
                                XMP-2132
                                    Frequency   1066 MHz
                                    CAS# Latency    11.0
                                    RAS# To CAS#    12
                                    RAS# Precharge  11
                                    tRAS    30
                                    Voltage 1.600 V

Please let me know if you require any further information around this issue or my environment and I will be happy to get it for you.

Thank you for any assistance you can provide towards solving this issue!

area-CodeGen-coreclr optimization

Source

Petermarcu

🎉2

Most helpful comment

I created an issue for that https://github.com/dotnet/BenchmarkDotNet/issues/756

@AndyAyersMS is there any way to enforce the alignment today?

adamsitnik on 15 May 2018

🎉1 👍1

All 41 comments

Moving this here to start the investigation. @valenis @jaredpar @livarcocc . I dont know if this is going to end up being a runtime, compiler, or SDK issue.

Petermarcu on 9 May 2018

🎉1

LOL... @Petermarcu we have shared agreement with this. 😆 Thank you for getting this to the right area... to start, at least. 👍

Also, a quick note that when I had posted this this morning, I was not able to reproduce exactly on my Surface Pro, but later I was indeed able to see the divergence while copying from my VM. Unfortunately I do not know the exact reproduction but I will track it down for you and update here when I do.

Everyone there is used to it by now, but I am swimming from keeping track of the different runtimes on different machines and keeping it all straight. Please pardon the mess here while I get acclimated.

Mike-E-angelo on 9 May 2018

Sorry if I'm being dense, but I've read the description several times now and I'm still not clear on... what exactly do you see regressing?

stephentoub on 9 May 2018

Hi @stephentoub sorry for the confusion. I will try my best to break it down here.

I am running the code found in this repository on .NET Core 2.0.7 and seeing these results. (~19us)
I am then taking that same code without modification -- with the exception of a .csproj change -- found in this repository recompiling to .NET Core 2.1 RC1 and seeing these results (~21us).

Mike-E-angelo on 10 May 2018

On my machine I get identical results:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17661
Intel Core i5-4440 CPU 3.10GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
Frequency=3026466 Hz, Resolution=330.4184 ns, Timer=TSC
.NET Core SDK=2.1.300-rc1-008673
  [Host] : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=5
WarmupCount=5

  Method |     Mean |     Error |    StdDev |   Gen 0 | Allocated |
-------- |---------:|----------:|----------:|--------:|----------:|
 ToArray | 25.80 us | 0.2487 us | 0.0646 us | 12.6343 |  39.13 KB |

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17661
Intel Core i5-4440 CPU 3.10GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
Frequency=3026466 Hz, Resolution=330.4184 ns, Timer=TSC
.NET Core SDK=2.1.300-rc1-008673
  [Host] : .NET Core 2.1.0-rc1 (CoreCLR 4.6.26426.02, CoreFX 4.6.26426.04), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=5
WarmupCount=5

  Method |     Mean |     Error |    StdDev |   Gen 0 | Allocated |
-------- |---------:|----------:|----------:|--------:|----------:|
 ToArray | 25.82 us | 0.1774 us | 0.0461 us | 12.6343 |  39.13 KB |

Also, it's not clear how your results are actually different. You have a delta of 1.53us between 2.0 and 2.1 and the 2.0 result has error = 1.584us. And the benchmark includes allocations so its result needs to be taken with a grain of salt anyway.

mikedn on 10 May 2018

Hi @mikedn thank you for your input here. I admit there are a lot of considerations here. I am embarrassed to admit this, but I have easily spent over 100 hours on this issue (in 3 different forms, this issue being the most recent), mostly last year on 1.1 tooling. I am moving into the 2nd week of time into this issue for this year. My first inclination was that this was a Benchmark.NET issue, but once I saw the deviations between 2.0 and 2.1 from a simple recompile I started to think this was something more fundamental.

Anyways, it is not clear from your results, but are you running your benchmarks on a Windows Server 2016? Mine is fully updated and reads OS=Windows 10.0.14393.2189 (1607/AnniversaryUpdate/Redstone1) in the results. As I have mentioned I have not been able to reproduce this on my Surface Laptop so what you see there is what I see here on my Surface Pro.

Additionally, what you describe is accurate. There is only ~1.5us difference between the results. The point here though is that these results are 100% consistent and reproducible. That is, subsequent runs never deviate from this number save for maybe within the range of ~.1us to ~.25us max. If I run with multiple launches, they all execute within this "band" for lack of a better word (I think "mode" is what BDN calls it).

Mike-E-angelo on 10 May 2018

Anyways, it is not clear from your results, but are you running your benchmarks on a Windows Server 2016?

Nope, it's the current Win10 insider preview.

That is, subsequent runs never deviate from this number save for maybe within the range of .1us to .25us max.

Well, in 5 runs I got one result that was more than 1us away from the rest:

 ToArray | 24.26 us | 0.0264 us | 0.0069 us | 12.6343 |  39.13 KB |
 ToArray | 24.39 us | 1.107 us  | 0.2877 us | 12.6343 |  39.13 KB |
 ToArray | 25.34 us | 2.444 us  | 0.6349 us | 12.6343 |  39.13 KB |
 ToArray | 24.29 us | 0.2169 us | 0.0563 us | 12.6343 |  39.13 KB |
 ToArray | 24.25 us | 0.1349 us | 0.0350 us | 12.6343 |  39.13 KB |

I think you're looking too much into this. A real difference may exist but it's too small and the numbers are too noisy to tell. And if you drag in different hardware, OS versions and even virtual machines then you'll soon end up nowhere. You may as well try to count the atoms in the universe.

mikedn on 10 May 2018

😄1

so what you see there is what I see here on my Surface Pro

What Surface Pro? Surface Pro 4 uses Skylake CPUs but your results indicate that you're using a Haswell CPU. I'll try to run the benchmark on my Surface later today.

mikedn on 10 May 2018

Here you go @mikedn. Try running the code found in this branch on your machine. Please make sure the BenchmarkDotNet.Artifacts folder is deleted on the first run. You should experience the deviance here between the first run and second (and subsequent) runs. I was able to reproduce this on my Surface Pro, at least. The first run without the folder is always faster than the 2nd run. The 2nd and subsequent runs are always in the upper/slower band/mode.

Now this could indeed be a bug in BDN, so I am using this as an example of the consistent deviation that is experienced here for now (although I believe this is a symptom of the greater issue at large).

Here is what I experience with the first run on my Surface Pro with the BenchmarkDotNet.Artifacts folder deleted:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.431 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-4650U CPU 1.70GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
Frequency=2240905 Hz, Resolution=446.2483 ns, Timer=TSC
  [Host] : .NET Core 2.1.0-rc1 (CoreCLR 4.6.26426.02, CoreFX 4.6.26426.04), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=3
WarmupCount=3

  Method |     Mean |    Error |    StdDev |
-------- |---------:|---------:|----------:|
 ToArray | 21.61 us | 3.962 us | 0.2239 us |

And from my second and subsequent runs:

  Method |     Mean |     Error |    StdDev |
-------- |---------:|----------:|----------:|
 ToArray | 24.72 us | 0.2784 us | 0.0157 us |

I got one result that was more than 1us away from the rest

Right, I get what you are saying. Please forgive my lack of vocabulary in describing this problem. Deviance does occur as you demonstrate, but what I am saying is that there is a "band" of deviance that differs altogether from another one. It also doesn't help that I wasn't very accurate in providing my perceived differences in the deviance you are describing. I did further update that range to be .1us-.25us as 3/5 results you provide fit that criteria.

In any case, I would be most interested in seeing your results from the code above.

Mike-E-angelo on 10 May 2018

Ah @mikedn all the results you have seen to date with the exception of the one I just posted have been from my Windows Server 2016 that acts as a Hyper-V server. I just now posted my very first Surface Pro _3_ results.

Additionally, I would like to note that I also have a development VM based on Windows 10 that runs on that Windows Server and it gets a completely different set of results altogether. As Benchmark.NET mentions, VMs have been known to get different results so I have not yet included these in my findings to reduce the overall complexity and information here. As I said earlier... a lot of considerations going on here. 😆

Mike-E-angelo on 10 May 2018

Here is what I experience with the first run on my Surface Pro with the BenchmarkDotNet.Artifacts folder deleted:

The folder issue seems to be completely unrelated to the .NET Core 2.0 vs. 2.1 issue. I tried this myself and I don't see any difference between results with the folder deleted. But I don't think it's completely impossible for some small differences to exists. As far as I can tell BenchmarkDotNet creates that folder and a file in it before running the tests so there's probably a very small chance that doing that affects the system somehow.

mikedn on 10 May 2018

I am super appreciative of you taking the time to help me through this, @mikedn. I assure you my sanity has been challenged numerous times by this issue. 😆

The folder issue seems to be completely unrelated to the .NET Core 2.0 vs. 2.1 issue.

I agree. That is why I stated: "Now this could indeed be a bug in BDN, so I am using this as an example of the consistent deviation that is experienced here for now (although I believe this is a symptom of the greater issue at large)."

I tried this myself and I don't see any difference between results with the folder deleted.

Boo, bummer. Back to the drawing board here. What I have yet to achieve is a discrepancy between a simple 2.0 -> 2.1 build on my Surface Pro 3. I will work towards seeing if I can get this to happen.

As far as I can tell BenchmarkDotNet creates that folder and a file in it before running the tests so there's probably a very small chance that doing that affects the system somehow.

Right, but now consider that I am able to get the same "bands", but in the _opposite_ direction. That is, running _without_ the folder makes it _slower_. This is seen in Example 4.

Mike-E-angelo on 10 May 2018

Ah bummer, I cannot edit the original message, but here are the five examples I provide:

In the list below, I provide a link to the example branch, followed by the results of the first run and then the second run, each denoted by their resulting bounds, if that makes sense. Please note that the "second run" is also consistent with further subsequent runs as well.

Example 1: lower bound -> upper bound
Example 2: upper bound -> upper bound
Example 3: lower bound -> lower bound
Example 4: upper bound -> lower bound
Example 5: lower bound -> lower bound

Again I would like to reiterate that each of these are produced by shaping the ceremony in some way with the Benchmark.NET benchmark (for instance, adding a diagnoser or, inexplicably, adding a System.ComponentModel.DescriptionAttribute), so these could all very well be Benchmark.NET-related. The reason it is now on this repo is that I am now experiencing the same phenomena by simply recompiling 2.0 -> 2.1 and running the exact same code.

Mike-E-angelo on 10 May 2018

Until I get to my Surface I ran the benchmark on an older CPU/OS and got a slightly larger difference between 2.0 and 2.1, larger enough that it may matter:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-3770 CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical and 4 physical cores
Frequency=3312784 Hz, Resolution=301.8609 ns, Timer=TSC
.NET Core SDK=2.1.300-rc1-008673
  [Host] : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=5
WarmupCount=5

  Method |     Mean |     Error |    StdDev |  Gen 0 | Allocated |
-------- |---------:|----------:|----------:|-------:|----------:|
 ToArray | 22.57 us | 0.4560 us | 0.1184 us | 9.5215 |  39.13 KB |
 ToArray | 22.17 us | 1.0280 us | 0.2669 us | 9.5215 |  39.13 KB |
 ToArray | 22.34 us | 0.8832 us | 0.2294 us | 9.5215 |  39.13 KB |
 ToArray | 22.28 us | 0.4800 us | 0.1247 us | 9.5215 |  39.13 KB |
 ToArray | 22.36 us | 0.7602 us | 0.1975 us | 9.5215 |  39.13 KB |

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-3770 CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical and 4 physical cores
Frequency=3312784 Hz, Resolution=301.8609 ns, Timer=TSC
.NET Core SDK=2.1.300-rc1-008673
  [Host] : .NET Core 2.1.0-rc1 (CoreCLR 4.6.26426.02, CoreFX 4.6.26426.04), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=5
WarmupCount=5

  Method |     Mean |    Error  |    StdDev |  Gen 0 | Allocated |
-------- |---------:|----------:|----------:|-------:|----------:|
 ToArray | 25.11 us | 1.5240 us | 0.3957 us | 9.5215 |  39.13 KB |
 ToArray | 25.53 us | 2.1760 us | 0.5652 us | 9.5215 |  39.13 KB |
 ToArray | 24.97 us | 0.5880 us | 0.1527 us | 9.5215 |  39.13 KB |
 ToArray | 24.94 us | 0.4506 us | 0.1170 us | 9.5215 |  39.13 KB |
 ToArray | 24.77 us | 2.0160 us | 0.5235 us | 9.5215 |  39.13 KB |

5 repeated runs persistently show a 3us difference.

The same difference can be reproduced without LINQ's ToArray:
```C#
var i = 0;
var r = new int[_data.Length];
var s = _select;
foreach (var d in _data)
r[i++] = s(d);
return r;

Turns out that there is a small difference in the code generated by the JIT between 2.0 and 2.1:
```asm
; 2.0
00007FFC5DA90009 49 63 CF             movsxd      rcx,r15d  
00007FFC5DA9000C 48 8B 54 CB 10       mov         rdx,qword ptr [rbx+rcx*8+10h]  
    55:             {
    56:                 k[i++] = s(x);
00007FFC5DA90011 8D 4F 01             lea         ecx,[rdi+1]  
00007FFC5DA90014 44 8B E1             mov         r12d,ecx  
00007FFC5DA90017 48 8B C6             mov         rax,rsi  
00007FFC5DA9001A 48 8B 48 08          mov         rcx,qword ptr [rax+8]  
00007FFC5DA9001E FF 50 18             call        qword ptr [rax+18h]  
00007FFC5DA90021 41 3B 7E 08          cmp         edi,dword ptr [r14+8]  
00007FFC5DA90025 73 25                jae         00007FFC5DA9004C  
00007FFC5DA90027 48 63 D7             movsxd      rdx,edi  
00007FFC5DA9002A 41 89 44 96 10       mov         dword ptr [r14+rdx*4+10h],eax  
00007FFC5DA9002F 41 FF C7             inc         r15d  
    54:             foreach (string x in _data)
00007FFC5DA90032 41 3B EF             cmp         ebp,r15d  
00007FFC5DA90035 41 8B FC             mov         edi,r12d  
00007FFC5DA90038 7F CF                jg          00007FFC5DA90009 
; 2.1
00007FFC5DAF27BC 49 63 CE             movsxd      rcx,r14d  
00007FFC5DAF27BF 48 8B 54 CB 10       mov         rdx,qword ptr [rbx+rcx*8+10h]  
    55:             {
    56:                 k[i++] = s(x);
00007FFC5DAF27C4 8D 4F 01             lea         ecx,[rdi+1]  
00007FFC5DAF27C7 44 8B E1             mov         r12d,ecx  
00007FFC5DAF27CA 48 8B C6             mov         rax,rsi  
; the lea below wasn't here before
00007FFC5DAF27CD 48 8D 48 08          lea         rcx,[rax+8]  
00007FFC5DAF27D1 48 8B 09             mov         rcx,qword ptr [rcx]  
00007FFC5DAF27D4 FF 50 18             call        qword ptr [rax+18h]  
00007FFC5DAF27D7 3B 7D 08             cmp         edi,dword ptr [rbp+8]  
00007FFC5DAF27DA 73 24                jae         00007FFC5DAF2800  
00007FFC5DAF27DC 48 63 D7             movsxd      rdx,edi  
00007FFC5DAF27DF 89 44 95 10          mov         dword ptr [rbp+rdx*4+10h],eax  
00007FFC5DAF27E3 41 FF C6             inc         r14d  
    54:             foreach (string x in _data)
00007FFC5DAF27E6 45 3B FE             cmp         r15d,r14d  
00007FFC5DAF27E9 41 8B FC             mov         edi,r12d  
00007FFC5DAF27EC 7F CE                jg          00007FFC5DAF27BC

I don't know yet if this codegen difference has anything to do with the slowdown. It might, especially considering that it appears to impact only older CPUs.

Even if it is the reason for the 2.0-2.1 slowdown this is unlikely to have anything to do with the artifacts folder or any other difference you have seen.

mikedn on 10 May 2018

🎉1

It might, especially considering that it appears to impact only older CPUs.

I'm getting a public dress down over my old equipment that I clearly need to upgrade. 🤣

Awesome, thank you for your additional findings, @mikedn. Again, it means a lot for you to take this time and add some sanity back to my world here.

this is unlikely to have anything to do with the artifacts folder or any other difference you have seen.

Yes, to be sure I have 3 different machines producing different results using the same code. However, each machine does manage to find these two "bands" using different configurations of code. Again, might be BDN-specific, but this along with the 2.0->2.1 discrepancy has got my suspicions up around this.

Let me see if I can find a 2.0->2.1 build repro that works on my Surface Pro 3 and I'll get back to you here.

Mike-E-angelo on 10 May 2018

Alright @mikedn I have got something for you, but it is dependent on how I have understood your comments above. I have managed to create a repo for the repro here. Please note that this is strictly a 2.0 -> 2.1 recompile and BenchmarkDotNet.Artifacts has nothing to do with this.

This was run on my Surface Pro 3 and this time the results are in the _other_ direction (from slower to faster).

At first glance, this may look correct due to performance improvements in 2.1. However, what got my attention is that you previously noted above that the IL has changed from 2.0 to 2.1 in such a way that it would suggest that there is a slight performance _hit_ rather than an _improvement_.

So it would seem that the code would run a bit slower, not faster, if I understand correctly. Please feel free to correct me if I have something fundamentally misunderstood here.

As seen here in the results, on .NET Core 2.0.7 I see a consistent ~24.5us and on .NET 2.1 I see a consistent ~21us when running this on my Surface Pro 3 after a simple recompile between the two runtimes.

I will continue now to see if I can find any further examples to help classify this problem. Specifically, I would like to reproduce the _other_ direction that we have been discussing as seen on my Windows Server, where the performance times regress from faster to slower. I thought I would share this here to provide more data for now, and also to verify my understanding thus far around the IL.

Mike-E-angelo on 10 May 2018

Please note that this is strictly a 2.0 -> 2.1 recompile and BenchmarkDotNet.Artifacts has nothing to do with this.

The readme in your repro seems to indicate that the 2.1 result is actually obtained on 2.0, both results have the line [Host] : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT in them.

mikedn on 10 May 2018

Ah indeed it does @mikedn. I was trying to cut out information from the previous template that included the artifacts folder and was obviously more focused on that than getting the summary data in. 😞

The results were accurate, though. Thank you for pointing that out. I have posted the update in the readme. I did do a re-publish/sanity check twice to make sure. Here are the results from that here, for your convenience:

// * Summary *

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.431 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-4650U CPU 1.70GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
Frequency=2240905 Hz, Resolution=446.2483 ns, Timer=TSC
  [Host] : .NET Core 2.1.0-rc1 (CoreCLR 4.6.26426.02, CoreFX 4.6.26426.04), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=5
WarmupCount=5

Method | Mean | Error | StdDev | Gen 0 | Allocated |
-------- |---------:|---------:|----------:|--------:|----------:|
ToArray | 21.72 us | 1.690 us | 0.4390 us | 18.8599 | 39.13 KB |

Mike-E-angelo on 10 May 2018

Hrm, I played a bit with this on my Haswell desktop and it does look like there's something fishy going on but I don't know exactly what. The issues seems to be that the benchmark numbers are impacted by the presence of attributes on your Array class.

With 2.1 and [Config(typeof(Config))] I get ~24us. If I add back the Description attribute (or any other attribute it seems) then the time drops to ~21us. If add more attributes the time goes back to 24us.

With 2.0 it seems to work the opposite way and that would explain how come 2.1 seems faster in this case.

Right now I have no idea why those attributes impact the timings. One possibility is that BenchmarkDotNet queries them and the different number of attributes result in a different allocation pattern that has a certain impact on GC. But that's pure guesswork.

As before, the issue can be reproduce by using a normal for loop instead of ToArray. In that case it's easy to check the generate assembly code - it's identical (including loop head alignment) for a given runtime version.

mikedn on 10 May 2018

🎉1

it does look like there's something fishy going on

Glad someone can finally share my pain here. 😆

But that's pure guesswork

Right. I like your theory. Now that we have established some verified traction here I would also like to note that I have seen the exact same thing occur with local (readonly) references used within the benchmark class. That is, commenting out unused references would also cause these discrepancies. It's a specific type of reference, too. They have to be referenced from a particular assembly, project, and/or namespace.

Further -- and this is where it gets _really_ weird -- commenting out classes altogether would result in discrepancies. This is actually where my pain all started with https://github.com/dotnet/BenchmarkDotNet/issues/330 (note that this was in 4.6.1 but with .NET Core 1.1 tooling).

EDIT: Also, I am encountering this with simply moving types from one assembly to another. It seems to oscillate between different phases like what I have outlined above in the different examples. Each one of those examples can be considered a "phase". A further adjustment (like removing a reference, or adding a class) moves to the next phase and therefore impacting performance numbers.

As before, the issue can be reproduce by using a normal for loop instead of ToArray

Indeed, looking at that it makes sense now. I saw all the IL and conflated your for loop with additional IL having got lost in it and never scrolling back up to see the source. 😆 TBH my IL/disassembly/JIT is very poor and it is an area I am trying to improve.

Mike-E-angelo on 10 May 2018

But that's pure guesswork

Another more likely to be true guess is that this is caused by code alignment issues. I can reproduce the unexpected speed-up by simply adding an empty for loop before the actual loop (yeah, the JIT does not eliminate empty for loops :grin:):
```C#
var i = 0;
var r = new int[_data.Length];
var s = _select;
for (int n = 0; n < 4; n++)
;
foreach (string x in _data)
r[i++] = s(x);
return r;

Does adding/removing a class attribute can have the same effect? Yes, it can. If some code is looking for attributes the attribute constructor needs to run so it has to be JIT compiled. Methods that are compiled after the attribute constructor is run will end up on a different address than before.

In 5 runs with and without the `Description` attribute the `foreach` loop ends up at these addresses:

; fast
00007FF9 D298 2D6C ; loop end at 2D9C
00007FF9 D298 2D6C
00007FF9 D299 2D6C
00007FF9 D29B 2D6C
00007FF9 D299 2D6C
; slow
00007FF9 D29A 27BC ; loop end at 27EC
00007FF9 D299 27BC
00007FF9 D29A 27BC
00007FF9 D29A 27BC
00007FF9 D299 27BC
```
These addresses are remarkably stable. They vary only by multiple of 64KB from run to run and that's unlikely to matter.

CPUs (at least Intel ones) decode instructions in 16 byte blocks so failure to align loops (and more generally, any "hot" branch target) to 16 byte boundaries can lead to performance problems. But this loops is never 16 byte aligned so that's likely not the reason for the strange speed-up.

But newer CPUs (Haswell? I need to check) also have some loop buffers that cache decoded loop instructions. Those operate on 32 byte blocks. So? The loop is never 32 byte aligned just like it wasn't 16 byte aligned.

Yes, but the slow loop (at 27BC) happens to start near the end of a 32 byte block. That might be a problem, a loop iteration just started executing and then almost immediately a new 32 byte block needs to be fetched possibly delaying the execution of the rest of the loop.

I'm not sure if the above is a realistic explanation of what is going on but there's very good chance that the issue is one way or another related to code alignment.

Further evidence that is code alignment related:

placing the empty for loop after the foreach loop has no speed-up effect
the empty for loop can be replaced by a new DescriptionAttribute(); statement with the same speed-up effect
moving the new DescriptionAttribute statement after the foreach loop again has no speed-up effect. This makes it much less likely that the issue is GC/memory related

mikedn on 11 May 2018

❤1

I cannot begin to share my appreciation for your efforts here @mikedn! Thank you so much for being able to put some much needed context/explanation around this for me. As I have mentioned, I have been able to see the quirkiness as well from adding new code, whether it is a new reference assignment and/or instantiation, or even simply a nested class. Each one of these will produce a different type of result (or "mode") in my findings, leading to the utmost derailed form of insanity for me, LOL.

So, I like what you are saying here and the direction it is taking.

My immediate question right now is: is this an architecture issue? That is, if I drop back into x86, this should probably go away, right? FWIW I am in the middle of setting up my environments to support .NET Core 2.1 x86 to see if this is the case. I can report my findings here if need-be.

The other question is: is this a .NET Core-specific issue? That is, do you think that this can be fixed or are we running into a CPU-architecture issue here? Apologies if that is a silly question as again I am learning all the "guts" (as it were) of the internals here so I am not so familiar with them (yet 😄).

EDIT: If you have a good resource on "code alignment" (this is the first time I have heard of it) I would be more than appreciative if you could share it. 👍

Mike-E-angelo on 11 May 2018

This alignment issue is probably a "feature" on most modern CPUs. It will show up in cases where performance depends critically on one small loop. So it's less of a problem for realistic applications where there are many key loops and/or the loops do more work per iteration.

We have a few issues open on this already (#11607, dotnet/runtime#9912) and other examples of the annoying impact of not having alignment (#16854).

The subtlety here in your case is that innocuous looking non-code things like attributes can alter alignment.

The challenge we face is that aligning all methods and labels mainly bloats code size and generally won't improve performance. Alignment is only helpful when the particular bits of code that might end up misaligned are also frequently executed.

Perhaps tiered jitting can come to the rescue.

AndyAyersMS on 11 May 2018

👍1

Perhaps tiered jitting can come to the rescue.

Ah! That is a new feature in 2.1 RC1, right? Are you saying that the existing features might be able to work some magic here? Or perhaps future additional features too this feature?

Also, this is sounding like a x64-specific bug (or "feature" 😆) if I am reading the tea leaves here correctly. Will switching to x86 offer a work around until this is addressed?

The issue in my world is that performance results will say that my code is 15-20% faster on one run vs 1-2% faster the next. _So, which is it_? LOL. It might not look like much when you look at the numbers directly on their own, but if you are comparing it an alternative implementation, it gets very hazy very quickly. Hope that makes sense.

An obvious consequence of course is that you can post the faster times publicly somehow but you know it's just a matter of time before some sleuth tries out your code and finds out the slower, times, too. I'm trying to cut out the embarrassments in my life already as it is. 😛

Mike-E-angelo on 11 May 2018

Tiering (in preview in 2.1) gives us a potential future avenue to address the problem. It does not address it today.

x86 is just as vulnerable to this as x64.

You typically don't need to worry about the impact of code alignment on perf in realistic applications. It usually only shows up prominently in microbenchmarks.

AndyAyersMS on 11 May 2018

🎉1

Bummer. This is no good, then. It's great to know why this is happening, though (or at least, a theory).

It usually only shows up prominently in microbenchmarks.

Ah, I am going to reach back in time here and say that I have seen this outside of microbenchmark scenarios. For instance, for our ExtendedXmlSerializer benchmarks, we were seeing this when doing comparisons between our serializer and the classic XmlSerializer. It was occurring in _both_ tests and causing a lot of confusion on which metric(s) to use. We ultimately just gave up on it because it was taking up too much time.

So I guess perhaps I need to know better what is meant by _microbenchmarks_ or perhaps maybe there is an even bigger problem here? To me, microbenchmarks is more like what we're talking about here in these tests (simpler operations like a for-loop, etc), not an entire serialization/deserialization story as I have illuminated.

Of course, we are talking about code written over a year ago and it was mostly me wrangling all this new tech/tooling in .NET Core 1.1 (compiling took FOREVER for both net461 and netcoreapp1.1!), so there might be other factors involved here.

Mike-E-angelo on 11 May 2018

Alright, I have done a bit more research on this matter. Sadly, as you stated @AndyAyersMS this does occur convincingly in x86 as well. Some notes:

Interestingly enough, my host (non-VM) machine managed to provide stable, consistent results through all my example branches in my repo with x86. It wasn't until I reverted back to the primary solution I was working on before this rabbit hole that it started to show the strain. The benchmark project in this solutions is MUCH more comprehensive and involved than the one shared on my repo. For starters, it has five benchmarks and this seems to do the trick as far as exposing the pain here.
I've noticed that 2.1 does a better job of providing more stable results than 2.0. That is, in 2.0 you will see different bands in subsequent runs. In 2.1, the configuration is pretty much set and you will not see deviations from the band that it renders. I say "pretty much" because in my more comprehensive test suite I was able to see the different bands in subsequent runs there, too.
If you want to reproduce this issue easily, use a VM. 😛 I'm not sure if it's the virtual memory or what, but it's very obvious in this environment and I am sure this is the reason why Benchmark.NET warns to use a non-VM machine.
Worth noting here. I had 2.0 x86 and x64 runtimes installed on my host machine, along with 2.1 equivalent x86 and x64 runtimes, for a total of four installs. At some point, my x64 2.0 example tests stopped providing the expected/posted results from my repo. This happened 3 times, but I wasn't able to consistently reproduce it. If I either uninstalled another runtime or repaired the x64 2.0 runtime, the results would begin to reflect the expected/posted results. Most weird.

Outstanding questions:

Deleting the BenchmarkDotNet.Artifacts folder. I am curious if the currently working code alignment theory explains the odd discrepancy with executions with and without this folder? I still see this in 2.1 and again to reiterate, the performance goes both ways (on some configurations it is slower without the folder, others it is faster without the folder).
Is this only impacting Haswell processors? If so, I might buy a newer laptop specifically to work around this issue. SorryNotSorry, I'm sooooo done dealing with this sucker if I can help it, LOL.
Actually those are the only questions I have. If Skylake (or another CPU) is not impacted by this, I am open to any cool suggestions for a laptop devoted to performance, if you have them. 🎉

Mike-E-angelo on 12 May 2018

I am curious if the currently working code alignment theory explains the odd discrepancy with executions with and without this folder?

It might. Deleting the folder might involve different code paths that result in different/additional methods being JIT compiled. This will affect the alignment of methods that are being benchmarked.

Is this only impacting Haswell processors?

And likely newer CPU generations such as Skylake.

If so, I might buy a newer laptop specifically to work around this issue. SorryNotSorry, I'm sooooo done dealing with this sucker if I can help it, LOL.

You'd need to buy many laptops having whatever CPUs you care about. That's probably not feasible. Even for a large company it would be rather difficult to have a ton of hardware around, run all kinds of benchmarks on them and gather and analyze all results.

mikedn on 12 May 2018

👍1

Cool... thanks for your help here, @mikedn.

And likely newer CPU generations such as Skylake.

OK good to know. FWIW, after I posted my last message I recalled that I have an Azure instance with a Xeon processor sitting around. I logged into it and was able to run my examples on it to reproduce this issue easily on both 2.1 x86 and x64.

The killer thing about this problem is that it is very subtle if you are not paying attention to it (like you said you might as well count the atoms in the universe 😆). But once you do notice it, it becomes very obvious to the point you wonder why it's not a bigger/larger/known/discussed problem. I am actually thinking it is happening on a much wider scale but no one is saying anything because they are not seeing it (nor do they know they could/should).

Mike-E-angelo on 12 May 2018

If you see or suspect you see this happening in programs that are not microbenchmarky, then we'd appreciate hearing more.

There are many microarchitectural issues that can show themselves when the overall performance you're measuring depends critically on one loop or a few branches, or the speed of one or two memory loads. These exhibit similar behavior patterns -- multiple "stable" values for measurements, changing from machine to machine, day to day, etc. Performance can vary based on a whole host of subtle things like
ASLR, installation of service packs, the total volume of strings in your environment, the number of characters in the current working directory, which versions of hardware and drivers you have installed, etc.

It is often hard or impossible to control for all this. I once spent a lot of time chasing down an odd regression that turned out to be 4K aliasing between a stack variable and a global variable-- which was "fixed" by reordering the declaration of global variables. But thankfully these cases tend to be rare.

Super-robust benchmarking schemes like Stabilizer try and use randomization to effectively average out this "noise". But these techniques have not yet made it into the mainstream. Perhaps BenchmarkDotNet will get there someday.

So in the meantime, it's just something we have to live with. Measuring performance by measuring time is imperfect and subject to noise. Sometimes this noise is surprisingly large and influenced in unexpected and subtle ways.

In bigger and more "realistic" applications we generally expect that we get a similar kind of averaging out and so the overall performance should not be sensitive to fine grained details.

AndyAyersMS on 13 May 2018

👍1

I feel your pain @AndyAyersMS. As I mentioned, it seems that I have put in over 100 hours now into this issue myself in all its baffling and most-irritating various forms. I cannot get over how I simply cannot let this thing go without a fight. 😆 As for seeing this in programs that do not involve microbenchmarks, I am not sure what that would look like. The only reason I am able to see the discrepancy here is because of the readings provided by Benchmark.NET. I wouldn't know how to start to look for and/or identify this issue without a reading/guide/indicator of some sort. However, if anything, this has definitely made me more aware of the underlying structure of what takes place at the hardware(-ish) level.

The complete, entertaining irony here, of course, is that all this tooling and frameworking is supposed to save poor souls like me from spending alllllll this time learning all the guts and infrastructure of what really goes on BUT HERE I AM ANYWAYS! 🎉

OK... that said, I have been doing some extra work around this today and have landed on a very reproducible and _SIMPLE_ repro that I have provided for everyone to check out here if you are interested:

DotNetCore.CodeAlignment

In the above repo, rather than having a bunch of different branches as with my old example set, I managed to produce this issue by creating a Benchmark.NET switch runner with 3 simple cases. The third and final case _always_ ends up with a different result than than the results achieved in the first two.

At least they did on my machines. All four of them. 😄 I even created a beefy Azure VM to try this on and that reproduced this issue as well. More information is in the readme.

All THAT said, with how consistent these results are now, I still half-way (quarter-way?) suspect that there still might be a chance that this is a Benchmark.NET issue that is intertwined with this one.

Again, more investigation is necessary, but I thought I would share this for now. It's still benchmarky, and it's all 2.1, but it's still a reference point of data that _may_ or may _not_ be valuable here. :)

Thank you again for all your help and for any additional feedback you may have as well (as always!). 👍

Mike-E-angelo on 13 May 2018

I created an issue for that https://github.com/dotnet/BenchmarkDotNet/issues/756

@AndyAyersMS is there any way to enforce the alignment today?

adamsitnik on 15 May 2018

🎉1 👍1

There is JitAlignLoops but that only does 16 byte alignment. That may be useful in some cases but it seems that there the problem is different and would require 32 byte alignment.

mikedn on 15 May 2018

but it seems that there the problem is different and would require 32 byte alignment.

It would still worth a try. 16 byte alignment would imply that the loop head will be either at the start of a 32 bytes chunk or right in the middle, it will never be near the end like it seems to happen in this case.

mikedn on 15 May 2018

👍1

Awesome... thank you @adamsitnik for opening that issue. FWIW I did try using new EnvironmentVariable("COMPlus_JitAlignLoops", "1") in my Benchmark.NET testing and that did not seem to impact the problem observed here. I might have done something wrong, however.

Mike-E-angelo on 15 May 2018

FWIW I did try using new EnvironmentVariable("COMPlus_JitAlignLoops", "1")

Are you still using the in-process toolchain? It might not work in this case.

mikedn on 15 May 2018

👍1

Ah! Good point. I will try this now and report back here.

Also, what about using x86... would that impact this switch? I was thinking about trying this as well.

Mike-E-angelo on 15 May 2018

Also, what about using x86... would that impact this switch?

x86 is equally affected by code alignment. But due to differences in code size you may get lucky and not observe the issue, at least in this particular benchmark.

mikedn on 15 May 2018

DING DING DING!!! WE HAVE A WINNER!!! I cannot tell you how happy I am to finally figure this out, or at least get me to a place where I can be a little more confident about the results here and get me unblocked to start writing code again. 😆

FWIW @mikedn I didn't even have to bother with x86. The problem is how I was using the COMPlus_JitAlignLoops variable. The trick is to use BOTH as described in my provided link. Before, I was running only one job with this setting set to "1", so I was only getting one result, which could have been one or the other. Using two jobs with one set to "0" and the other set to "1" is what yields the magic here.

It seems that for now, this issue can be recognized and/or handled by Benchmark.NET. From the immediate-time perspective, I can use the configuration as outlined in the aforementioned link and get unblocked. Medium-time perspective might include a report and/or setting. Long-term setting would be a fix here in .NET Core.

However, from my perspective, I consider this issue resolved as I have enough to unblock me and move me forward in confidence _immediately_. If you feel this issue is covered and identified and covered in other issues (as @AndyAyersMS has previously stated with dotnet/runtime#8108, dotnet/runtime#9912, and dotnet/runtime#9908), then please feel free to close this issue.

A HUGE thank you to everyone for your patience, expertise, and assistance here! I am indebted and entirely grateful. :)

Mike-E-angelo on 15 May 2018

A quick shout of appreciation here to everyone here that helped figure out this issue. I am very appreciative of your efforts and have assembled a list of thanks and general gratitude towards the state of .NET that I posted and can be seen here, FWIW. Thanks again. 👍

Mike-E-angelo on 25 May 2018

Closing per request.