Benchmarkdotnet: be accurate about allocated bytes/op

Created on 5 Jun 2016  Â·  20Comments  Â·  Source: dotnet/BenchmarkDotNet

Bench.Net 0.9.7

Code:

    class Program
    {
        static void Main(string[] args)
        {
            new SimpleBenchmark().BaselineOutput();
            var config = new ManualConfig();
            config.Add(DefaultConfig.Instance);
            config.Add(new MemoryDiagnoser());

            BenchmarkRunner.Run<SimpleBenchmark>(config);
        }
    }

    public class SimpleBenchmark
    {
        private volatile int x;

        [Benchmark]
        public int Baseline()
        {
            x = 0;
            for (int i = 0; i < 10000000; i++)
            {
                x++;
            }
            return x;
        }

        public void BaselineOutput()
        {
            var m = GC.GetTotalMemory(false);
            var sw = Stopwatch.StartNew();

            Baseline();

            sw.Stop();
            var m2 = GC.GetTotalMemory(false);
            Console.WriteLine($"Allocated: {m2 - m}, elapsed {sw.Elapsed.TotalMilliseconds}ms");
        }
    }

Output:

Allocated: 0, elapsed 25,7643ms
...
   Method |     Median |    StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
--------- |----------- |---------- |------ |------ |------ |------------------- |
 Baseline | 24.9992 ms | 0.1676 ms |     - |     - |     - |           2 696,37 |

The code obviously runs without allocations. The issue is blocker for us as it means we cannot trust output from the diagnoser. If you had any idea or workaround, write a comment please:)

As far as I can see this can be fixed only with introduction of two-way notifications before and after each of benchmark run.

There's alternate solution: run the diagnosers in-process with the benchmark. Sadly, it fires a new issue: there should be a way to report results from the diagnoser back to the main process. Ouch:)

Diagnosers

All 20 comments

The code obviously runs without allocations. The issue is blocker for us as it means we cannot trust output from the diagnoser.

Sorry about that, I'll make sure that it gets fixed, we want BenchmarkDotNet to be a tool that people can rely on.

This output shows what's going wrong (see the final line):

// Run, Diagnostic
...

Pilot 1: 4 op, 85538476.98 ns, 21.3846 ms/op
Pilot 2: 8 op, 167628168.21 ns, 20.9535 ms/op
Pilot 3: 16 op, 319523745.99 ns, 19.9702 ms/op

IdleWarmup 1: 16 op, 3040.92 ns, 190.0576 ns/op
IdleWarmup 2: 16 op, 760.23 ns, 47.5144 ns/op
IdleWarmup 3: 16 op, 760.23 ns, 47.5144 ns/op
IdleWarmup 4: 16 op, 1140.35 ns, 71.2716 ns/op
IdleWarmup 5: 16 op, 2660.81 ns, 166.3004 ns/op

IdleTarget 1: 16 op, 760.23 ns, 47.5144 ns/op
...
IdleTarget 20: 16 op, 760.23 ns, 47.5144 ns/op

MainWarmup 1: 16 op, 335495428.92 ns, 20.9685 ms/op
...
MainWarmup 10: 16 op, 342577736.42 ns, 21.4111 ms/op

MainTarget 1: 16 op, 332159537.41 ns, 20.7600 ms/op
...
MainTarget 10: 16 op, 332681435.66 ns, 20.7926 ms/op

Result 1: 16 op, 332158777.18 ns, 20.7599 ms/op
...
Result 10: 16 op, 332680675.43 ns, 20.7925 ms/op

// Run, Diagnostic - 908 Ops in total, across 58 Runs

We currently capture the memory allocations across the entire Diagnostic process/run (as you state), then divide by the # of Ops, which in this case is 908, a relatively small number.

When I designed the Memory Diagnoser, I did it with the assumption that the # of Ops would be large and so any memory allocated outside of the [Benchmark] methods would not show up in the calculations. Generally this means that you see something like 0.04 Bytes Per/Op, when it should be 0.0, but as you've shown, this is a bad assumption, sorry about that.

As far as I can see this can be fixed only with introduction of two-way notifications before and after each of benchmark run.

Agree, that's a more accurate way. One issue is that ETW events are only fired when 100K of memory has been allocated, i.e. it sums up allocations up until they exceed 100K, then fires a single event. So we might still get notified of some allocations that happen outside the [Benchmark] methods, but it'll be _a lot_ better than the current situation.

Just one other note, the reason that I didn't use GC.GetTotalMemory(..) is the following. If you have code like this:

var m = GC.GetTotalMemory(false);

// Some code that allocates a lot of memory and then triggers a GC 

var m2 = GC.GetTotalMemory(false);

You could end up in the situation where m2 is the same as m1, even though a lot of memory had been allocated between the 2 calls. If the GC had cleaned all the memory up by the time m2 was calculated, it wouldn't been accounted for.

@mattwarren as far as I can remember the ETW is very tricky to hold it the right way.
You should collect all trace events and parse them only after the entire run is completed. Here's a proof from TraceEventProgrammersGuide (supplied with TraceEvent NuGet package ):

In real time mode, events are buffered and there is at least a second or so delay (typically 3 sec) between the firing of the event and the reception by the session (to allow events to be delivered in efficient clumps of many events).

As for me, I'd prefer to use resource monitoring instead.

the reason that I didn't use GC.GetTotalMemory(..) is the following

Yep, I know about this. Added just to proof that the diagnoser produces incorrect numbers.

as far as I can remember the ETW is very tricky to hold it the right way.

Yep :-) but they're incredibly useful/powerful, so it's probably worth the hassle?!

You should collect all trace events and parse them only after the entire run is completed. Here's a proof from TraceEventProgrammersGuide (supplied with TraceEvent NuGet package ):

That scenario we have covered, see this code

As for me, I'd prefer to use resource monitoring instead.

Wow, I never know about those, they look useful thanks for the link.

There's one other option that I came across last week. Creating our own ETW events, i.e. something like "Benchmark-Start", "Benchmark-Stop". Then when we process the ETW event stream we look out for those events and only process the memory allocation events that happen between a start/stop pair.

Maybe that's the easiest way to do it.

That scenario we have covered, see this code

Err, I'm afraid this could not work. The delays can be larger than 3 seconds. In our project I use custom etw events to mark a start/end of segments that should be analysed.

As a sidenote: is separate run for diagnoser required at all? In our projects the etw overhead is negligible and it can be lovered even further by redirecting into etl file.

There's one other option that I came across last week

Yep, I've used use same approach to log the events. Should work, at least with full .net stack ( there's no .Net Core support, https://github.com/dotnet/corefx/issues/4864 )

I don't know if that is enough still, because ETW events are collected into per-CPU buffers. To the best of my understanding, you might receive your custom end event before an allocation event that actually occurred earlier, but in a different buffer.

Generally speaking, the GC diagnoser is a best-effort solution. The 100k alloc ticks are only an approximation anyway. For the # collections we could definitely switch to using performance counters or GC.CollectionCount in the benchmark process.

As for writing events to a file, that would actually introduce overhead and non-determinism on slower disks that I think would be best to avoid if possible.

On Jun 6, 2016, at 13:06, ig-sinicyn [email protected] wrote:

That scenario we have covered, see this code

Err, I'm afraid this could not work. Dhe delays can be larger. In our project I use custom etw events to mark a start/end of segments that should be analysed.

As a sidenote: is separate run for diagnoser required at all? In our projects the etw overhead is negligible and it can be lovered even further by redirecting into etl file.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.

As a sidenote: is separate run for diagnoser required at all? In our projects the etw overhead is negligible and it can be lovered even further by redirecting into etl file.

This is something I talked about with @AndreyAkinshin a while ago. The reasoning is that the other runs are sensitive and we want to ensure that our measurements are as accurate as possible. Because BenchmarkDotNet reports results down to ns precision, any overhead is bad. We really don't want diagnostics to affect the measurements in any way, so we thought it was safer to keep them completely seperate.

Generally speaking, the GC diagnoser is a best-effort solution. The 100k alloc ticks are only an approximation anyway.

@goldshtn in what way? I know they are bucketed up, but you seem to get the actual total for the bucket, i.e. I've seen totals like "100,435 bytes". My understanding is that it buckets up allocations until it exceeds 100K, then it sends an ETW event with the exact total, or am I missing something? Are there allocations that it doesn't include?

@goldshtn

Ok, thanks for the clarification!

@mattwarren
Ok, got this one too. As it turns out there's no quick and easy way to make current implementation anyway better.

The only approach I could imagine is to replace the entire MethodInvoker with custom implementation that will collect GC count / allocations before/after each targetAction() call. Definitely not a good thing as it will not work with out-of-process benchmarks and will break sooner or later.

Ok, will finish our benchmark runner instead. The only trouble we have for now is terrifying results when run on appveyor (look at MainTarget 2 timings),

MainTarget 1: 20 op,  10292100 ns, 514.6050 us/op
MainTarget 2: 20 op, 585070800 ns, 29.2535 ms/op
MainTarget 3: 20 op,  11789000 ns, 589.4500 us/op
MainTarget 4: 20 op,  10457900 ns, 522.8950 us/op
MainTarget 5: 20 op,  10462400 ns, 523.1200 us/op
MainTarget 6: 20 op,  10755600 ns, 537.7800 us/op
MainTarget 7: 20 op,  10349900 ns, 517.4950 us/op
MainTarget 8: 20 op,  10295600 ns, 514.7800 us/op
MainTarget 9: 20 op,  10301600 ns, 515.0800 us/op
MainTarget 10: 20 op, 10402800 ns, 520.1400 us/op

The benchmark is dummy loop with counter, no IO or anything else that could introduce delays.

As far as I can see the best approach will be to perform multiple reruns of benchmarks, say 3 to 5 times. Will try:)

@goldshtn in what way? I know they are bucketed up, but you seem to get the actual total for the bucket, i.e. I've seen totals like "100,435 bytes". My understanding is that it buckets up allocations until it exceeds 100K, then it sends an ETW event with the exact total, or am I missing something? Are there allocations that it doesn't include?

Well, if you allocate 80k you won't get an event. So that's not very accurate :-)

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

As it turns out there's no quick and easy way to make current implementation anyway better.

I will at least investigate using custom ETW events to indicate the Start and Stop of the [Benchmark] method execution(s). This will also help with #186, which is another problem with the current approach. I will also look at the idea of using GC.Collection count in-process (as proposed by @goldshtn).

Also I'll update the docs to indicate that the Memory Diagnoser is a best-effort attempt to show what's going on (or maybe put it directly in the benchmark output). At least people will then know that it's numbers are meant as an indication, which I think it what most people really want, i.e. "_does this piece of code allocate or not_" is normally the question they're trying to answer.

Ok, will finish our benchmark runner instead. The only trouble we have for now is terrifying results when run on appveyor (look at MainTarget 2 timings),

I wonder if this is related to the tests running on a VM? See #167 for a bit more info.

Well, if you allocate 80k you won't get an event. So that's not very accurate :-)

Ahh I see what you're saying now, hadn't thought about that.

Btw while we are at it, maybe you could also use the type info provided by the allocation event to give extra detail on what kind of allocations were performed. Or just refer folks to PerfView -- there's a point where BDN turns into a profiler... But you could e.g. say something like "most objects you allocated were strings, try StringBuilder" or whatever.

On Jun 6, 2016, at 14:02, Matt Warren [email protected] wrote:

As it turns out there's no quick and easy way to make current implementation anyway better.

I will at least investigate using custom ETW events to indicate the Start and Stop of the [Benchmark] method execution(s). This will also help with #186, which is another problem with the current approach. I will also look at the idea of using GC.Collection count in-process (as proposed by @goldshtn).

Also I'll update the docs to indicate that the Memory Diagnoser is a best-effort attempt to show what's going on (or maybe put it directly in the benchmark output). At least people will then know that it's numbers are meant as an indication, which I think it what most people really want, i.e. "does this piece of code allocate or not" is normally the question they're trying to answer.

Ok, will finish our benchmark runner instead. The only trouble we have for now is terrifying results when run on appveyor (look at MainTarget 2 timings),

I wonder if this is related to the tests running on a VM? See #167 for a bit more info.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

Btw while we are at it, maybe you could also use the type info provided by the allocation event to give extra detail on what kind of allocations were performed. Or just refer folks to PerfView -- there's a point where BDN turns into a profiler... But you could e.g. say something like "most objects you allocated were strings, try StringBuilder" or whatever.

I looked into that and from my understanding (I might be wrong!) the type info is only from the last allocation that triggered the 100K threshold. So you might have 99K of String allocations, but if a byte [] tips it over 100K, it's the byte [] and it's stack-trace that is reported.

However, because BenchmarkDotNet run's lots of iterations of your benchmark this might not be such an issue

But yes it would be really nice to have this, especially with Stack Traces.

@mattwarren

I wonder if this is related to the tests running on a VM?

Looks like a root of the issue:) Ok, will perform multiple runs with fingers crossed:)

I think your description is accurate, but indeed with a lot of repetitions the samples should be good enough. That's also what PerfView does, and from my experience it's pretty good when you alloc >100MB or so.

On Jun 6, 2016, at 14:14, Matt Warren [email protected] wrote:

Btw while we are at it, maybe you could also use the type info provided by the allocation event to give extra detail on what kind of allocations were performed. Or just refer folks to PerfView -- there's a point where BDN turns into a profiler... But you could e.g. say something like "most objects you allocated were strings, try StringBuilder" or whatever.

I looked into that and from my understanding (I might be wrong!) the type info is only from the last allocation that triggered the 100K threshold. So you might have 99K of String allocations, but if a byte [] tips it over 100K, it's the byte [] and it's stack-trace that is reported.

However, because BenchmarkDotNet run's lots of iterations of your benchmark this might not be such an issue

But yes it would be really nice to have this, especially with Stack Traces.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

Expose GC allocation statistics interesting discussion with Maoni about possible ways of getting GC allocation stats. If They expose mentioned api I write MemoryDiagnoser for .NET Core same day ;)

Thanks for the link.

I never knew about AppDomain.MonitoringTotalAllocatedMemorySize, that might solve all our problems!

Was this page helpful?
0 / 5 - 0 ratings