Benchmarkdotnet: Assistance Understanding Results

Created on 25 Jan 2017  路  20Comments  路  Source: dotnet/BenchmarkDotNet

Hello,

I looked around in the documentation, but I am having trouble finding a page that breaks down the results. My question is pretty straight-forward, however. Consider the following results:

// *** Execute ***
// Launch: 1 / 1

// Benchmark Process Environment Information:
// Runtime=Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1586.0
// GC=Concurrent Workstation
// Job: DefaultJob

Pilot 1: 16 op, 1858421.24 ns, 116.1513 us/op
Pilot 2: 32 op, 3473019.82 ns, 108.5319 us/op
Pilot 3: 64 op, 7440327.1 ns, 116.2551 us/op
Pilot 4: 128 op, 13879901.98 ns, 108.4367 us/op
Pilot 5: 256 op, 27233689.48 ns, 106.3816 us/op
Pilot 6: 512 op, 53794583.43 ns, 105.0675 us/op
Pilot 7: 1024 op, 108592409.89 ns, 106.0473 us/op
Pilot 8: 2048 op, 217038415.49 ns, 105.9758 us/op

IdleWarmup 1: 2048 op, 9963.24 ns, 4.8649 ns/op
IdleWarmup 2: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleWarmup 3: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleWarmup 4: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleWarmup 5: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleWarmup 6: 2048 op, 5811.89 ns, 2.8378 ns/op

IdleTarget 1: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 2: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 3: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 4: 2048 op, 6088.65 ns, 2.9730 ns/op
IdleTarget 5: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 6: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 7: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 8: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 9: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 10: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 11: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 12: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 13: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 14: 2048 op, 5811.89 ns, 2.8378 ns/op
IdleTarget 15: 2048 op, 5811.89 ns, 2.8378 ns/op

MainWarmup 1: 2048 op, 215666809.29 ns, 105.3061 us/op
MainWarmup 2: 2048 op, 215869948.71 ns, 105.4052 us/op
MainWarmup 3: 2048 op, 216000577.87 ns, 105.4690 us/op
MainWarmup 4: 2048 op, 216725680.42 ns, 105.8231 us/op
MainWarmup 5: 2048 op, 216335453.47 ns, 105.6325 us/op
MainWarmup 6: 2048 op, 215730463.33 ns, 105.3371 us/op
MainWarmup 7: 2048 op, 215070398.6 ns, 105.0148 us/op
MainWarmup 8: 2048 op, 215513486.08 ns, 105.2312 us/op
MainWarmup 9: 2048 op, 216232776.74 ns, 105.5824 us/op
MainWarmup 10: 2048 op, 215372616.92 ns, 105.1624 us/op

MainTarget 1: 2048 op, 215521235.26 ns, 105.2350 us/op
MainTarget 2: 2048 op, 214699821.38 ns, 104.8339 us/op
MainTarget 3: 2048 op, 214979068.89 ns, 104.9702 us/op
MainTarget 4: 2048 op, 215165879.66 ns, 105.0615 us/op
MainTarget 5: 2048 op, 214530446.28 ns, 104.7512 us/op
MainTarget 6: 2048 op, 215041062.39 ns, 105.0005 us/op
MainTarget 7: 2048 op, 214752681.91 ns, 104.8597 us/op
MainTarget 8: 2048 op, 215758415.76 ns, 105.3508 us/op
MainTarget 9: 2048 op, 214879989.99 ns, 104.9219 us/op
MainTarget 10: 2048 op, 215089494.81 ns, 105.0242 us/op
MainTarget 11: 2048 op, 214539856.01 ns, 104.7558 us/op
MainTarget 12: 2048 op, 215711090.36 ns, 105.3277 us/op
MainTarget 13: 2048 op, 214641148.96 ns, 104.8052 us/op
MainTarget 14: 2048 op, 214801667.85 ns, 104.8836 us/op
MainTarget 15: 2048 op, 214990969.43 ns, 104.9761 us/op

Result 1: 2048 op, 215515404.92 ns, 105.2321 us/op
Result 2: 2048 op, 214693991.04 ns, 104.8311 us/op
Result 3: 2048 op, 214973238.55 ns, 104.9674 us/op
Result 4: 2048 op, 215160049.32 ns, 105.0586 us/op
Result 5: 2048 op, 214524615.94 ns, 104.7483 us/op
Result 6: 2048 op, 215035232.05 ns, 104.9977 us/op
Result 7: 2048 op, 214746851.57 ns, 104.8569 us/op
Result 8: 2048 op, 215752585.41 ns, 105.3479 us/op
Result 9: 2048 op, 214874159.65 ns, 104.9190 us/op
Result 10: 2048 op, 215083664.47 ns, 105.0213 us/op
Result 11: 2048 op, 214534025.67 ns, 104.7529 us/op
Result 12: 2048 op, 215705260.02 ns, 105.3248 us/op
Result 13: 2048 op, 214635318.62 ns, 104.8024 us/op
Result 14: 2048 op, 214795837.51 ns, 104.8808 us/op
Result 15: 2048 op, 214985139.09 ns, 104.9732 us/op
GC:  204 0 0 0 30720


// Run, Diagnostic
// BeforeAnythingElse

// Benchmark Process Environment Information:
// Runtime=Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1586.0
// GC=Concurrent Workstation
// Job: DefaultJob

// AfterSetup
// BeforeCleanup
Result 1: 2048 op, 204250853.75 ns, 99.7319 us/op
Result 2: 2048 op, 204721616.89 ns, 99.9617 us/op
Result 3: 2048 op, 203804445.19 ns, 99.5139 us/op
Result 4: 2048 op, 205355666.49 ns, 100.2713 us/op
Result 5: 2048 op, 204185815.92 ns, 99.7001 us/op
Result 6: 2048 op, 204581854.76 ns, 99.8935 us/op
Result 7: 2048 op, 204505469.91 ns, 99.8562 us/op
Result 8: 2048 op, 204818204.98 ns, 100.0089 us/op
Result 9: 2048 op, 205661759.4 ns, 100.4208 us/op
Result 10: 2048 op, 204213214.84 ns, 99.7135 us/op
Result 11: 2048 op, 205521720.51 ns, 100.3524 us/op
Result 12: 2048 op, 204999203.86 ns, 100.0973 us/op
Result 13: 2048 op, 205166364.91 ns, 100.1789 us/op
Result 14: 2048 op, 205401884.86 ns, 100.2939 us/op
Result 15: 2048 op, 204544215.85 ns, 99.8751 us/op
GC:  216 0 0 1963911104 30720

As you can see, there's a solid stream of results in 104-105us area. But then there's a final "cleanup" that puts the results in 99-100. It doesn't seem that the lower numbers are used towards the final result, so I am curious on what they are about.

Additionally, I wanted to verify that also the "GC" which I understand to be Garbage collection, followed by objects collected in generations 0, 1, and 2, the total byte count of memory used, but I am unsure what the last number is.

Thank you for any assistance you can provide. :)

Most helpful comment

@Mike-EEE your case with conditional weak table is very interesting!

By default BDN induces GC for each of it's iterations. But what is iteration? Let's consider following example:

Result 1: 2048 op, 215515404.92 ns, 105.2321 us/op
Result 2: 2048 op, 214693991.04 ns, 104.8311 us/op
Result 3: 2048 op, 214973238.55 ns, 104.9674 us/op

here it's 3 iterations, each executed benchmarked method 2048 times (more). So after every 2048 we call GC.Collect. But these collections are excluded from the results you can see. Code responsible for that:

var finalGcStats = GcStats.ReadFinal(IsDiagnoserAttached);
var forcedCollections = GcStats.FromForced(forcedFullGarbageCollections);
var workGcHasDone = finalGcStats - forcedCollections - initialGcStats;

We had a long discussion about that #211 . You can disable it on demand by using:

Job.Default.With(
    new GcMode
    {
        Force = false
    });

All 20 comments

We need to enable AppDomain's resource tracking to get the number of allocated bytes. This can have side effects. We perform a separate run for the diagnosers only, take gc stats, and ignore the timings for the last run. It's done only when some diagnoser is attached, and MemoryDiagnoser is enabled by default since v0.10.1.

// Run, Diagnostic indicates that this is separate run, for diagnosers only.

Recently I wrote a blog post about the MemoryDiagnoser, the goal was to describe the internals: adamsitnik.com/the-new-Memory-Diagnoser/. You might find it usefull.

Awesome @adamsitnik! Exactly what I am looking for. I will check it out and follow up with any questions on the blog post. :) Thanks again!

I opened this issue because I want to investigate why do we have such results.

Ah @AndreyAkinshin, if you need a test case, here is the test within the commit that produced the odd behavior:
https://github.com/wojtpl2/ExtendedXmlSerializer/blob/7a1918c57917ba3ee6374c3a972e669b3797ea5a/test/ExtendedXmlSerializer.Performance.Tests/Benchmarks.cs#L55-L73

Part of the PR if interested in the context/discussion:
https://github.com/wojtpl2/ExtendedXmlSerializer/pull/40

So to backup here, are you saying the results in the memory diagnoser should be similar to the previous executions?

@Mike-EEE

if you need a test case, here is the test within the commit that produced the odd behavior:

Thanks!

So to backup here, are you saying the results in the memory diagnoser should be similar to the previous executions?

I expected that diagnosers results should be greater than results without diagnosers (because we have some overhead in the diagnoser case). For now, I don't understand why it works faster with diagnosers.

OK cool, that does make sense to me, as well. In this case I had a boxing reference on an ImmutableArray that ended up causing the odd issues discussed in that PR. It was very weird behavior, as I mentioned, where one test was always faster than the other, but never were they both "fast" as the merged commit ultimately ended up being, thankfully. Tricky bits. :)

So another thing I am noticing. If I run the same benchmark, it will emit this line:

GC: 291 0 0 0 61440

But sometimes it will be:
GC: 287 0 0 0 61440

Additionally, during the memory diagnoser it displays:

GC: 275 0 0 2311548120 61440

Or:
GC: 279 0 0 2311548120 61440

Are these results supposed to be consistent numbers or is it expected to be off a few (looks like four in this case)?

The GC line contains:

$"{ResultsLinePrefix} {Gen0Collections} {Gen1Collections} {Gen2Collections} {AllocatedBytes} {TotalOperations}";

To get the AllocatedBytes for .NET Framework we need to enable app domain monitoring. We don't do it for the main runs, because it could slow down the benchmarks. The result is any way printed, but ignored.

{Gen0Collections} {Gen1Collections} {Gen2Collections} contains the number of garbage collections in given generations. The thing is that GC itself is a heuristic and might act differently for the same input and program. So that's why these numbers might be different. But with our / 1k scaling the difference should be almost invisible.

The most important are the results printed in the table. Allocated bytes should always be consistent, Gen X might have a small difference (0.001)

Great @adamsitnik that really clears it up for me. Thank you! So silly (and hopefully final :smile:) question here: is the GC that happens during Benchmark.NET manually invoked on Benchmark's side or does it happen "naturally" as it would in a regular application?

The reason why I ask is that I had a design whereby I was using static singletons for application components that utilized ConditionalWeakTable's. Well, it turned out those were slower than using non-static ConcurrentDictionary's instead. The reason why is that the only references to the keys (TypeInfos in this case) were in those tables, and when the GC occurred, it would flush out those keys and then the cached behavior that the tables sought to leverage would have to be executed again. This caused a bit of a disruption to my design, as I had to get rid of all the singletons and replace them with instantiated dictionaries instead.

It was only after I went through all the effort that I thought that maybe the reason why it was being slower is that Benchmark.NET invokes it manually, and might be impacting the results in that regard. If so, it would be valuable to be able to disable it to see how it impacts performance.

Thanks again for any insight you can provide.

@Mike-EEE your case with conditional weak table is very interesting!

By default BDN induces GC for each of it's iterations. But what is iteration? Let's consider following example:

Result 1: 2048 op, 215515404.92 ns, 105.2321 us/op
Result 2: 2048 op, 214693991.04 ns, 104.8311 us/op
Result 3: 2048 op, 214973238.55 ns, 104.9674 us/op

here it's 3 iterations, each executed benchmarked method 2048 times (more). So after every 2048 we call GC.Collect. But these collections are excluded from the results you can see. Code responsible for that:

var finalGcStats = GcStats.ReadFinal(IsDiagnoserAttached);
var forcedCollections = GcStats.FromForced(forcedFullGarbageCollections);
var workGcHasDone = finalGcStats - forcedCollections - initialGcStats;

We had a long discussion about that #211 . You can disable it on demand by using:

Job.Default.With(
    new GcMode
    {
        Force = false
    });

AWESOME! Yes, that is good to know. I am in the middle of a significant refactor right now but when I get the chance I am going to re-enable those singletons and see if I can get my design back. :) I will update here with any findings. 馃憤

OK it turns out I was CACHING ALL THE THINGS! and this was causing the performance issues I was seeing in regards to weak vs. strong (concurrent) dictionaries. That is, the behavior that I was caching was much faster than the time required to do a lookup. Good learning lesson. :)

Right now I have four classes that are using weak tables, and the difference between using strong dictionaries is about 1us, which is pretty decent. However, if I place Job.Default.With(new GcMode {Force = false}); as a statement before I call BenchmarkSwitcher.Run, the times end up being about the same. So I am curious if I am using it correctly or if there is somewhere else I should place it. I of course looked in the documentation but did not see anything regarding this. In any case, I am happy with where things are at the moment and appreciate your assistance!

However, if I place Job.Default.With(new GcMode {Force = false}); as a statement before I call BenchmarkSwitcher.Run, the times end up being about the same

@Mike-EEE Do you get any GC collections? Is GC doing anything? It's possible that your benchmarks are not allocating any memory, so the GC does nothing. With force = true it would at least perform some cleanup.

Here is an example result, @adamsitnik:

                            Method |       Mean |    StdDev |  Gen 0 | Allocated |
---------------------------------- |----------- |---------- |------- |---------- |
   SerializationClassWithPrimitive | 50.9038 us | 0.3897 us | 4.4108 |  37.57 kB |
 DeserializationClassWithPrimitive | 57.3928 us | 0.2000 us | 2.5228 |   27.8 kB |

I guess my question is... how much time would a sweep take? Or I guess how much overhead would you expect it to add? Would it be noticeable? I should also say that I am seeing variations in the results between 2-3us. So in the case above with SerializationClassWithPrimitive the mean strikes between 50-53us. I believe that is expected, however.

how much time would a sweep take?

It depends on multiple things:

  • generation size (which is changing in time due some heuristics)
  • gc mode: server vs workstation, concurrent or not
  • what % of the objects are trash and can be reclaimed? what will be promoted to next generation?

You could measure it with BenchmarkDotNet, but it would be limited to the scenario you implement (we cant mimic the rest of the app, we just measure the part you want to benchmark)

[Params(true, false)]
public bool WithGc {get; set; }

[Params(0, 1, 2)]
public in Generation {get; set;}

[Benchmark]
public void Test()
{
   YourScenario();

   if(WithGc)
   {
      GC.Collect(Generation);
   }
}

executed with our new GcMode() { Force = false; }

OK this is really great information, @adamsitnik. Thank you for taking the time to explain. In all the test runs that I have performed with BDN, I have never seen a generation 1 or 2; everything has been relegated to generation 0. It also seems as if GC is executed during test runs, which is sort of obvious, but worth noting, as the Gen 0 column states that GC is collected nearly 4.4 times for every 1,000 operations for the first test, and ~2.5 times every 1,000 operations in the 2nd test (if I am reading this correctly). It doesn't seem that these sweeps are removing references held by the WeakConditionalTables, and those references must be held elsewhere, or else it would seem that performance would be more significantly impacted.

To demonstrate, here is the first test run with all configured parameters as suggested above:

BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4820K CPU 3.70GHz, ProcessorCount=8
Frequency=3613273 Hz, Resolution=276.7574 ns, Timer=TSC
  [Host]     : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1586.0
  DefaultJob : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1586.0


                          Method | WithGc | Generation |        Mean |    StdDev |    Gen 0 |    Gen 1 |    Gen 2 | Allocated |
-------------------------------- |------- |----------- |------------ |---------- |--------- |--------- |--------- |---------- |
 SerializationClassWithPrimitive |  False |          0 |  48.1159 us | 0.1996 us |   4.4271 |        - |        - |   36.1 kB |
 SerializationClassWithPrimitive |  False |          1 |  48.1915 us | 0.1098 us |   4.2969 |        - |        - |   36.1 kB |
 SerializationClassWithPrimitive |  False |          2 |  48.6290 us | 0.1708 us |   4.3620 |        - |        - |   36.1 kB |
 SerializationClassWithPrimitive |   True |          0 |  82.3417 us | 0.3189 us | 997.5911 | 108.6751 |        - |  40.96 kB |
 SerializationClassWithPrimitive |   True |          1 |  98.9735 us | 0.2618 us | 995.4427 | 995.4427 |        - |  40.58 kB |
 SerializationClassWithPrimitive |   True |          2 | 328.3520 us | 0.9580 us | 991.1458 | 991.1458 | 991.1458 |  40.96 kB |

You can see that forcing GC nearly doubles the amount of time taken. It would seem that if references were being held by WCT were being swept during execution, then the mean would reflect this accordingly (and does not).

Also of note, when I run this with Force = false (essentially uncommenting this line) these are the results I get:

                          Method | WithGc | Generation |        Mean |    StdDev |    Gen 0 |    Gen 1 |    Gen 2 | Allocated |
-------------------------------- |------- |----------- |------------ |---------- |--------- |--------- |--------- |---------- |
 SerializationClassWithPrimitive |  False |          0 |  48.3399 us | 0.1595 us |   4.3620 |        - |        - |   36.1 kB |
 SerializationClassWithPrimitive |  False |          1 |  48.6142 us | 0.1512 us |   4.3620 |        - |        - |   36.1 kB |
 SerializationClassWithPrimitive |  False |          2 |  48.5202 us | 0.1215 us |   4.4922 |        - |        - |   36.1 kB |
 SerializationClassWithPrimitive |   True |          0 |  80.7770 us | 0.1604 us | 997.4609 | 108.5449 |        - |  40.96 kB |
 SerializationClassWithPrimitive |   True |          1 |  98.8535 us | 0.3849 us | 995.3125 | 995.3125 |        - |  40.58 kB |
 SerializationClassWithPrimitive |   True |          2 | 334.0823 us | 0.6462 us | 990.8854 | 990.8854 | 990.8854 |  40.96 kB |

I am still learning to process the results here, but they look nearly identical. So I am curious if I am using this correctly or if perhaps we are running into a bug of some sort (or if I am simply misunderstanding something).

In any case, please note that I was able to further refine and optimize the code so that ~48us and ~55us are the current results of our tests, which is more than acceptable as in v1 code they were ~75us and ~100us respectively. So performance is no longer the driving factor here, but understanding is, if that makes sense. :)

Hey @adamsitnik never mind about this issue now. I finally figured out the issue: poor configuration (described in https://github.com/dotnet/BenchmarkDotNet/issues/363#issuecomment-282535089). Whew... problem(s!) solved. 馃憤

Since the original problem was solved, I close this issue.

Looks like this issue wasn't closed, so I will do that now. Thanks again for all your assistance in helping me understand your great product!

Was this page helpful?
0 / 5 - 0 ratings