Benchmarkdotnet: Enforce code alignment

Created on 15 May 2018  路  18Comments  路  Source: dotnet/BenchmarkDotNet

Repro: https://github.com/Mike-EEE/DotNetCore.CodeAlignment
Discussion: https://github.com/dotnet/coreclr/issues/17932

AFAIK this is not possible as of today. At least I don't know yet how to enforce it.

discussion

Most helpful comment

I completely agree with @AndyAyersMS: it's not a good idea to enable COMPlus_JitAlignLoops _by default_. However, we have a problem that should be solved. I have seen perf fluctuation problems because of the "random" alignment algorithm in RyuJIT before, but I didn't know about COMPlus_JitAlignLoops. Some of these problems were solved via UnrollFactor=16 for nanobenchmarks (I believe that such cases are also suffering from different branch prediction strategies on Haswell+, but it's another topic for discussion). It seems that unrolling doesn't solve all the problems, so we have to come up with something else.
BenchmarkDotNet should help people to write reproducible benchmarks. Unfortunately, we can't guarantee it because of some JIT "features." But we can tell people about these problems. I suggest to do the following things:

  1. Detect cases when this phenomenon significantly affects results. E.g., if we know that benchmarks use RyuJIT AND there are two benchmarks with a small difference between them AND (probably some other diagnostics), we can print a warning under the summary table and suggest to use COMPlus_JitAlignLoops in order to stabilize results.
    @adamsitnik, what do you think? Does it make any sense?
  2. Since JitAlignLoops is very important JitAlignLoops for performance, it makes sense to include information about it in the summary info (e.g. RyuJIT[AlignLoops]). We can use the same mechanism for other important flags (e.g., COMPLUS_EXPERIMENTAL_TieredCompilation / COMPlus_TieredCompilation).
    @adamsitnik, @AndyAyersMS what is the most reliable way to check these features? I know that we can manually check environment variables (probably we have to also check registry flags, runtimeconfig.json, MSBuild properties, or something else). However, it would be better to have a reliable API for that.

All 18 comments

I've been following the same issue.

I was thinking that instead of enforcing code alignment, maybe it would be better for BDN to report on it, as part of the analysis. For example, "the code for this method is NOT aligned, which may cause performance issues".

We could report it today, I know that some of our users use DisassemblyDiagnoser to find out the start address of given method.

At best you can report method alignment but that's not very relevant as what matters is loop head alignment (and sometimes the more general branch target alignment).

Would running the benchmark several times, each time in a new process, help? That way, the final result should roughly represent the "average" alignment, at least assuming the CLR is sufficiently random about where it places code.

at least assuming the CLR is sufficiently random about where it places code.

If only it was random :grin:. Experiments show that it can be pretty stubborn about placing code - it's all determined by the order methods are being compiled. You'd need to introduce some randomness in this, possibly by having a particular BDN method that is called or not depending on a random value.

I am all for reporting! That would be amazing. At least something to show awareness so that someone like me doesn't go off and spend eternity chasing perceived ghosts. 馃槅

If we can somehow do better than that, then even better. 馃憤 But reporting for me would work _very_ well in the interim in that I would know there is some funny business going on. I could then test on the other machines that I have to sort of get a better picture.

To provide more context in my case, I am using baselines to compare results of similar methods. When this issue doesn't impact results, I am seeing differences like 1-2%. When this issue _does_ impact, that jumps to 10-20% (!).

If I was able to see a significant difference, along with a warning of some sort, then I would know to not only curb my enthusiasm (馃槃) but to also try other machines to see what they say as well.

So the Scaled column and getting an accurate sense of relative performance comparison is of primary importance to me here, rather than the Mean column for absolute comparison (although that is valuable, too, of course).

At best you can report method alignment but that's not very relevant as what matters is loop head alignment (and sometimes the more general branch target alignment).

I tried to run some benchmarks with following config (you can pass env vars with BDN):

class Program
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Program).GetTypeInfo().Assembly).Run(args, 
            DefaultConfig.Instance
                .With(Job.Default.With(CsProjCoreToolchain.NetCoreApp21).With(new[] { new EnvironmentVariable("COMPlus_JitAlignLoops", "1") }).WithId("1"))
                .With(Job.Default.With(CsProjCoreToolchain.NetCoreApp21).With(new[] { new EnvironmentVariable("COMPlus_JitAlignLoops", "0") }).WithId("0")));
    }
}

[DisassemblyDiagnoser]
public class Jit_LoopUnrolling
{
    [Benchmark]
    public int Sum()
    {
        int sum = 0;
        for (int i = 0; i < 1024; i++)
            sum += i;
        return sum;
    }
}

I saw the difference in the disassm:

00007fff`71aecaf0 BenchmarkDotNet.Samples.JIT.Jit_LoopUnrolling.Sum()
00007fff`71aecaf0 33c0            xor     eax,eax
00007fff`71aecaf2 33d2            xor     edx,edx
00007fff`71aecaf4 03c2            add     eax,edx
00007fff`71aecaf6 ffc2            inc     edx
00007fff`71aecaf8 81fa00040000    cmp     edx,400h
00007fff`71aecafe 7cf4            jl      00007fff`71aecaf4
00007fff`71aecb00 c3              ret
00007fff`71b0cc50 BenchmarkDotNet.Samples.JIT.Jit_LoopUnrolling.Sum()
00007fff`71b0cc50 33c0            xor     eax,eax
00007fff`71b0cc52 33d2            xor     edx,edx
00007fff`71b0cc54 0f1f4000        nop     dword ptr [rax]
00007fff`71b0cc58 0f1f840000000000 nop     dword ptr [rax+rax]
00007fff`71b0cc60 03c2            add     eax,edx
00007fff`71b0cc62 ffc2            inc     edx
00007fff`71b0cc64 81fa00040000    cmp     edx,400h
00007fff`71b0cc6a 7cf4            jl      00007fff`71b0cc60
00007fff`71b0cc6c c3              ret

But the difference was very small (4ns)

| Method | Job | EnvironmentVariables | Mean | Error | StdDev |
|------- |---- |------------------------ |---------:|---------:|---------:|
| Sum | 0 | COMPlus_JitAlignLoops=0 | 281.6 ns | 5.338 ns | 4.993 ns |
| Sum | 1 | COMPlus_JitAlignLoops=1 | 285.3 ns | 5.484 ns | 5.632 ns |

Would running the benchmark several times, each time in a new process, help?

it's possible today with job.WithLaunchCount(nrOfProcesses)

Experiments show that it can be pretty stubborn about placing code - it's all determined by the order methods are being compiled.

It's an advantage for us because we can get very similar results for same code.

You'd need to introduce some randomness in this, possibly by having a particular BDN method that is called or not depending on a random value.

This would be definitely possible, however I am not sure if people would be willing to pay the price (waiting much longer for the results).

@AndreyAkinshin what do you think? Maybe we should expose a mode for running the benchmarks twice: once for COMPlus_JitAlignLoops=1, once COMPlus_JitAlignLoops=0?

Btw the stabilizer project mentioned by @AndyAyersMS went even one step further:

Programs built with Stabilizer run with randomly-placed functions, stack frames, and heap objects. Functions and stack frames are moved repeatedly during execution. A random memory layout eliminates the effect of layout on performance, and repeated randomization leads to normally-distributed execution times.

I am not sure if people would be willing to pay the price (waiting much longer for the results).

I for one would certainly not mind waiting if it is a setting/attribute that we opt-in to enable. Along such lines...

Maybe we should expose a mode for running the benchmarks twice: once for COMPlus_JitAlignLoops=1, once COMPlus_JitAlignLoops=0

THIS! So, I think this setting should:

(Btw totally open to feedback here, and writing off the top of my head):

  • Have the ability to opt-in to enable.
  • Have an Automatic or Detect mode that is the default value. This is where Benchmark.NET will detect if there is an alignment issue. If so, then it performs two different runs as suggested. Otherwise, it's Always.
  • Have a way to select the output to render: Fastest that chooses the faster of the two results, Slowest chooses the slower of the two, Average is the average, and Both shows both results. The trick here is that Both obviously shows two rows vs. one row with everything else.

I have been spending the past hour and it looks like we have caught our gigantic ghost of a whale here. Using the configuration that @adamsitnik provided and getting out the in-process mode as @mikedn suggested, I was able to _consistently_ tease apart our two modes using the code from the repro repo:

Method | EnvironmentVariables | Mean | Error | StdDev |
------- |------------------------ |---------:|----------:|----------:|
First | COMPlus_JitAlignLoops=0 | 21.83 us | 13.686 us | 0.7733 us |
First | COMPlus_JitAlignLoops=1 | 19.19 us | 4.377 us | 0.2473 us |

Method | EnvironmentVariables | Mean | Error | StdDev |
------- |------------------------ |---------:|---------:|----------:|
Second | COMPlus_JitAlignLoops=0 | 19.66 us | 1.356 us | 0.0766 us |
Second | COMPlus_JitAlignLoops=1 | 21.88 us | 2.699 us | 0.1525 us |

Method | EnvironmentVariables | Mean | Error | StdDev | Scaled | ScaledSD |
------- |------------------------ |---------:|---------:|----------:|-------:|---------:|
First | COMPlus_JitAlignLoops=0 | 21.81 us | 3.170 us | 0.1791 us | 1.00 | 0.00 |
Second | COMPlus_JitAlignLoops=0 | 19.94 us | 5.967 us | 0.3372 us | 0.91 | 0.01 |
| | | | | | | |
First | COMPlus_JitAlignLoops=1 | 19.31 us | 5.633 us | 0.3183 us | 1.00 | 0.00 |
Second | COMPlus_JitAlignLoops=1 | 21.20 us | 1.586 us | 0.0896 us | 1.10 | 0.02 |

SO PRETTY! 鉂わ笍 Notice the Scaled column as I was discussing earlier. Huge difference here between the different results, considering.

In any case, this is a much better alternative in the interim. And here I was about to throw in the towel! 馃ぃ

https://youtu.be/qorFS7X1i4Q?t=2m21s

(Feeling very much like this due to having seen things that aren't really there. 馃槈)

(BTW, I realize that I just went a little crazy with that feature request, but I wanted to underscore here that I am finally unblocked from this issue using the configuration above -- _immediately_. As far as I am concerned, anything else is an added bonus here. Thanks again to everyone for your help. 馃憤)

FWIW I was also able to verify this behavior is consistent with my alternate repro suite which features additional cases. All four of my available machines were able to produce the two different values by way of COMPlus_JitAlignLoops per above.

(Guess I should have done this _before_ I started dancing atop the heads of people. 馃槅)

For occasional one-off experiment COMPlus_JitAlignLoops can be helpful in trying to confirm or refute that code alignment is a cause of unexpected performance fluctuations.

But, I would not advocate using COMPlus_JitAlignLoops as part of a regular performance program or as any kind of default setting just yet -- it's something we never test, may not always work as expected, will be ignored on some architectures, has some inherent limitations (we don't constrain method entry alignment, and don't recognize all loops), does not use optimally encoded NOP sequences, won't fix issues in prejitted code, and is only supported by RyuJit.

@AndyAyersMS thanks for the warning!

Before we introduce any new JIT-related changes to BDN I will ask you first for validation of the idea.

I completely agree with @AndyAyersMS: it's not a good idea to enable COMPlus_JitAlignLoops _by default_. However, we have a problem that should be solved. I have seen perf fluctuation problems because of the "random" alignment algorithm in RyuJIT before, but I didn't know about COMPlus_JitAlignLoops. Some of these problems were solved via UnrollFactor=16 for nanobenchmarks (I believe that such cases are also suffering from different branch prediction strategies on Haswell+, but it's another topic for discussion). It seems that unrolling doesn't solve all the problems, so we have to come up with something else.
BenchmarkDotNet should help people to write reproducible benchmarks. Unfortunately, we can't guarantee it because of some JIT "features." But we can tell people about these problems. I suggest to do the following things:

  1. Detect cases when this phenomenon significantly affects results. E.g., if we know that benchmarks use RyuJIT AND there are two benchmarks with a small difference between them AND (probably some other diagnostics), we can print a warning under the summary table and suggest to use COMPlus_JitAlignLoops in order to stabilize results.
    @adamsitnik, what do you think? Does it make any sense?
  2. Since JitAlignLoops is very important JitAlignLoops for performance, it makes sense to include information about it in the summary info (e.g. RyuJIT[AlignLoops]). We can use the same mechanism for other important flags (e.g., COMPLUS_EXPERIMENTAL_TieredCompilation / COMPlus_TieredCompilation).
    @adamsitnik, @AndyAyersMS what is the most reliable way to check these features? I know that we can manually check environment variables (probably we have to also check registry flags, runtimeconfig.json, MSBuild properties, or something else). However, it would be better to have a reliable API for that.

I am currently working on the .NET Core 2.2 vs 3.0 comparison and the loop alignment has hit me again.

I was thinking about the stabilizer project mentioned by @AndyAyersMS to me a long time ago and I was wondering if re-jitting the benchmark method after every iteration would help?

@AndyAyersMS : would it be possible to implement and expose an API similar to RuntimeHelpers.PrepareMethod that would always re-compile the provided method as Tier 1 and store it in a different address?

It might possible to force rejitting, sure. One simpleminded way to do this would be to attach as a profiler and pretend to modify the IL (note this might have other undesirable impacts). Getting the transient closure of all methods invoked by the benchmark to rejit would be tricky.

If you run each trial in a new process you should get some code and heap jittering via ASLR. I don't know how well-distributed this would end up being.

We could also force 32 bit alignment of Tier1 methods (on x86/x64) which should reduce some of the variability. Also, from what I understand, these code alignment issues aren't as common on Arm64 so it would also be interesting (once we have enough data) to look at how variance levels differ across Arm64/x64.

One randomization thing you could do fairly easily is jitter the stack. You could have the benchmark invoker go through a proxy layer that does smallish different-sized locallocs before invoking the method, and make sure that proxy layer is not an initlocals method (so the localloc region is not zeroed). Or say have a bunch of different proxies that have local structs of different sizes.

Stack address related perf artifacts are somewhat rare, but they do exist (eg 4K aliasing from say a stack slot and a static).

We have all agreed that BenchmarkDotNet should not use any custom settings and force the code alignment on its own.

Currently, the JIT Team is experimenting with method and loop alignment for .NET 6 to stabilize the benchmark results. This work can be tracked at https://github.com/dotnet/runtime/issues/43227

I am going to close this issue as there are no actionable work items on the BDN side.

Thank you all for the great discussion!

Was this page helpful?
0 / 5 - 0 ratings