Benchmarkdotnet: [Discussion] Changes on Engine, diagnosers and result reporting?

Created on 8 Nov 2016 · 10Comments · Source: dotnet/BenchmarkDotNet

DISCLAIMER: this issue is only to start a discussion, I have no good solution for it yet, sorry.

There're two major issues with current engine implementation.

Engine should report measurements by itself, as here. If missed, things go wrong; see #296.
The diagnoser support is implemented on top of result reporting. This hurts as there're additional allocations, and there's no proper synchronization. The diagnoser can receive signal after the benchmark was actually started.

To be honest I have no good idea how to fix it. As initial proposal:

Add diagnoser run as a separate engine stage.
During run-with-diagnoser stage:
- Notify the diagnoser using any system synchronization primitive (e.g ManualResetEvent) (we do not want to allocate anything during run).
- Wait for confirmation from the host (we do not want to run iteration until diagnoser receives the notification).
- Run the iteration.
- Notify diagnoser the run was completed & wait for confirmation.
During normal stages:
- Do not write anything into console / output.
- Result reporting should be responsibility of the caller, engine should only collect measurements into preallocated list.

I think, there's no need to use any framework for IPC as it'll incur additional allocations and it'll add more latency. Not to mention there's no such lightweight framework out-of-the-box, only sockets, pipes and mem-mapped files available for netCore.

Diagnosers discussion

Source

ig-sinicyn

All 10 comments

I am not sure if we have any problem today.

Currently if there is any Diagnoser attached we perform separate run for Diagnostics only (code). Non-diagnostic output is ignored, so it does not matter if don't we print the results. We don't care about them because they might be affected by diagnosers side-effects.
Reading "signals" (code) is done in synchronous way. It means that child process is blocked until parent process finishes the reading. So it's impossible for diagnoser to attach to process too late or too early.

adamsitnik on 8 Nov 2016

👍1

Proof: with the recent diagnoser changes I was able to get 100% accurate memory diagnostics #284

adamsitnik on 8 Nov 2016

@adamsitnik

Non-diagnostic output is ignored, so it does not matter if don't we print the results

Ok, what about case when there's no diagnosers? Can we delay result reporting till all runs are completed? Or, at least, introduce more obvious contract for it?

Reading "signals" (code) is done in synchronous way

Afair standard input/output is buffered. So there can be multiple race combinations, simplest one:

Host                                |   Benchmark process
                                    |
                                    |             writes Signals.AfterSetup
receives Signals.AfterSetup         | 
                                    |             runs benchmark
diagnoser.AfterSetup called         |

both of these are not an issue for common benchmarks as they can easily take minute or two but I'd prefer to have something more robust for perftests as they're typically limited with 1-2 sec timeout.

ig-sinicyn on 8 Nov 2016

👍1

Ok, what about case when there's no diagnosers? Can we delay result reporting till all runs are completed? Or, at least, introduce more obvious contract for it?

We could do so, so far @AndreyAkinshin wanted to have continuous updates.

Afair standard input/output is buffered. So there can be multiple race combinations, simplest one

I was not aware of that. Thanks!

@ig-sinicyn I assume that you want to merge your perf test runner changes and the current Engine does not fit in well? Have you thought about creating a new class that would implement IEngine and reuse the code from Engine that you want (running tests, not displaying results to console etc)

adamsitnik on 8 Nov 2016

👍1

We could do so, so far @AndreyAkinshin wanted to have continuous updates.

Ok, great!:)

I assume that you want to merge your perf test runner changes and the current Engine does not fit in well?

Actually no, at least for now. I'm not sure most of our changes are required at all with bench.net v0.10.
Will do some comparison tests later.

Our current design is following: each benchmark method performs few thousands iterations by itself

            [CompetitionBaseline]
            public bool Test00IsDefined()
            {
                var a = false;
                for (var i = 0; i < Count; i++)
                    a = EnumHelper.IsDefined(F.C | F.D);
                return a;
            }

            [CompetitionBenchmark(28.00, 34.00)]
            public bool Test02EnumIsDefined()
            {
                var a = false;
                for (var i = 0; i < Count; i++)
                    a = Enum.IsDefined(typeof(F), F.C | F.D);
                return a;
            }

(yep, EnumHelper.IsDefined() method is ~30x faster than Enum.IsDefined())

the engine measures time for each call (as if the invokeCount parameter here is 1)
and there're at least 300 runs measured.

This produce precise results and entire test takes ~2.5 sec (no perf optimizations yet, will be faster),

BenchmarkDotNet=v0.9.9-develop
OS=Microsoft Windows NT 10.0.14393.0
Processor=Intel(R) Core(TM) i7-3537U CPU 2.00GHz, ProcessorCount=4
Frequency=2435869 Hz, Resolution=410.5311 ns, Timer=TSC
Host Runtime=Clr 4.0.30319.42000, Arch=64-bit  [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1586.0
Job Runtime(s):
    Clr 4.0.30319.42000, Arch=64-bit  [RyuJIT]

Job=CompetitionDefaultJobAnyCpu  AnalyzeLaunchVariance=False  EvaluateOverhead=False  
RemoveOutliers=True  Platform=AnyCpu  Force=False  
Toolchain=InProcessToolchain  InvocationCount=1  LaunchCount=1  
RunStrategy=Throughput  TargetCount=300  UnrollFactor=1  
WarmupCount=100  

              Method |        Mean |     StdDev | Lnml(min) | Lnml(max) |      Median |         Min |         Max | Scaled | Scaled-StdDev |
-------------------- |------------ |----------- |---------- |---------- |------------ |------------ |------------ |------- |-------------- |
     Test00IsDefined |  12.1898 us |  0.5738 us |      1.00 |      1.00 |  12.3159 us |  10.6738 us |  13.9581 us |   1.00 |          0.00 |
 Test02EnumIsDefined | 387.1198 us | 19.6775 us |     31.75 |     31.75 | 384.2571 us | 338.2776 us | 442.5525 us |  31.83 |          2.21 |


============= IsDefinedCase ==============
// ? CodeJam.EnumHelperPerfTests+IsDefinedCase, CodeJam-Tests.Performance

---- Run 1, total runs (expected): 1 -----
// ? #1.1  02.551s, Informational@Analyser: CompetitionAnnotateAnalyser: All competition limits are ok.
// !<-- ------ xml_annotation_begin ------
<CompetitionBenchmarks>
    <Competition Target="CodeJam.EnumHelperPerfTests+IsDefinedCase, CodeJam-Tests.Performance">
        <Candidate Target="Test00IsDefined" Baseline="true" />
        <Candidate Target="Test02EnumIsDefined" MinRatio="28.00" MaxRatio="34.00" />
    </Competition>
</CompetitionBenchmarks>
// !--> ------- xml_annotation_end -------

it uses our own in-process toolchain, so we have no issues with replacing the engine so far.

Our code is very close to the current implementation in bench.net so it looks like we can drop our custom engine with release of v0.10, but there's still an issue with diagnosers.
Our toolchain captures console output into preallocated buffer and I do not want to use it for diagnoser notifications as it'll screw the timings.
We can make custom diagnoser notifications using Mutex or even events (all in all, the hacks are needed for in-process benchmarks only), but I'd prefer to not invent the wheel and to use code from bench.net while it's possible.

ig-sinicyn on 8 Nov 2016

UPD Yeah, I've checked all our perftests and current engine implementation from Benchmark.Net works fine for us. I've just switched to it as a default option. Cheers!

The only issue we have for now is diagnoser support for in-process benchmarks. We definitely need something better than reading the stdout : )

ig-sinicyn on 13 Nov 2016

@ig-sinicyn that's great!

I have finished the universal memory diagnoser, but due to dependency to netcoreapp1.1 which is still in beta we need to wait until the end of November to get it RTM so we don't need to have two separate nuget packages again like we did with dnxcore50.

@AndreyAkinshin IMHO Then we could merge Igor's changes to separate branch of BDN and I could work on having it work together with recent diagnosers changes without passing the results through standard input/output ;)

So we could have something like:
0.10.1 - build-in, accurate and cross platform memory diagnoser
0.10.2 - in-process benchmark runner

What do you guys think?

adamsitnik on 13 Nov 2016

🎉1

LGTM