BenchmarkDotNet: Bug or Guidance?

Created on 19 Dec 2016 · 10 Comments · Source: dotnet/BenchmarkDotNet

Hello Benchmark.NET Community,

First off, excellent project. I am very impressed with the efforts here. I have been learning it in my efforts over at the ExtendedXmlSerializer.

It seems that I have run into a problem that I can't figure out, unfortunately. I am building a serializer for EXSv2.0, and I have gotten Benchmark.NET to say that it is running at around 71us/op (lowest) to 75us/op (highest). Which is perfectly awesome.

However, I have run into a strange condition and I am hoping to get some guidance on it. It turns out that when I remove a class that _reduces_ code and the amount of calls made in the serializer, the results mysteriously jump from 71-75 to 74-77.

I have tried profiling this in dotTrace, and it conveys what I would expect: the code with the removed class is faster than the code with the class. Running this in Visual Studio (while not as accurate) also shows this to be true, as well. So it seems that Benchmark.NET is the outlier here.

So, I am curious about what could be wrong here. Am I misunderstanding the readings, perhaps? Is there another consideration I should be made aware of?

FWIW, I have created a branch that demonstrates this behavior here:
https://github.com/Mike-EEE/ExtendedXmlSerializer/tree/issues/Benchmark.NET/330

There are two tests in the Benchmarks file and together they show what I am experiencing:

FastButShouldBeSlowExtendedXmlSerializerTest.Benchmark (this is currently getting 71-75us on my machine)
SlowButShouldBeFastExtendedXmlSerializerTest.Benchmark (you can see that essentially I remove this class here -- leading to 74-77us, which again is not expected)

Please let me know if there are any questions around this that I can help answer, and/or if I have anything misunderstood. This is very possible since I have only been using your excellent project for a week now. :)

Thank you,
Michael

question

Mike-E-angelo

Most helpful comment

I updated BenchmarkDotNet to v0.10.0 and rewrote your benchmark in the following way:

[RankColumn, WelchTTestPValueColumn]
public class ExtendedXmlSerializerTest
{
    private readonly TestClassOtherClass _obj = new TestClassOtherClass();
    private readonly string _xml;
    private readonly IExtendedXmlSerializer _serializer1 = new FastButShouldBeSlowExtendedXmlSerializer();
    private readonly IExtendedXmlSerializer _serializer2 = new SlowButShouldBeFastExtendedXmlSerializer();

    public ExtendedXmlSerializerTest()
    {
        _obj.Init();
        _xml = _serializer1.Serialize(_obj);
        _xml = _serializer2.Serialize(_obj);
    }

    [Benchmark(Baseline = true)]
    public string Benchmark1() => _serializer1.Serialize(_obj);

    [Benchmark]
    public string Benchmark2() => _serializer2.Serialize(_obj);
}

The [RankColumn, WelchTTestPValueColumn] attributes help to check the difference between benchmark methods. Here are the results on my laptop:

// * Detailed results *
ExtendedXmlSerializerTest.Benchmark1: DefaultJob
Mean = 95.5063 us, StdErr = 0.2568 us (0.27%); N = 15, StdDev = 0.9947 us
Min = 93.5106 us, Q1 = 95.0572 us, Median = 95.7841 us, Q3 = 96.2840 us, Max = 96.5502 us
IQR = 1.2268 us, LowerFence = 93.2170 us, UpperFence = 98.1243 us
ConfidenceInterval = [95.0029 us; 96.0096 us] (CI 95%)
Skewness = -0.902343070783061, Kurtosis = 2.29455215199608


ExtendedXmlSerializerTest.Benchmark2: DefaultJob
Mean = 95.4350 us, StdErr = 0.0914 us (0.1%); N = 15, StdDev = 0.3541 us
Min = 94.9368 us, Q1 = 95.1838 us, Median = 95.3325 us, Q3 = 95.7133 us, Max = 96.1441 us
IQR = 0.5295 us, LowerFence = 94.3895 us, UpperFence = 96.5076 us
ConfidenceInterval = [95.2558 us; 95.6142 us] (CI 95%)
Skewness = 0.36094218325461, Kurtosis = 1.86039432864697


Total time: 00:00:23 (23.6 sec)

// * Summary *

BenchmarkDotNet=v0.10.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU 2.20GHz, ProcessorCount=8
Frequency=2143475 Hz, Resolution=466.5321 ns, Timer=TSC
Host Runtime=Clr 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1586.0
Job Runtime(s):
        Clr 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]


     Method |       Mean |    StdDev |     Median | Scaled | Scaled-StdDev | t-test p-value | Rank |
----------- |----------- |---------- |----------- |------- |-------------- |--------------- |----- |
 Benchmark1 | 95.5063 us | 0.9947 us | 95.7841 us |   1.00 |          0.00 |         1.0000 |    1 |
 Benchmark2 | 95.4350 us | 0.3541 us | 95.3325 us |   1.00 |          0.01 |         0.7969 |    1 |

Thus, according to Welch's t-test, there is no statistically significant difference between these two methods. Note that I shut down every application that could be closed (even some system processes) before I ran the benchmark. It seems that your methods have a large variance and that their performance depends on many factors. It is really easy to get a 2–5us difference between means (a few percent of the total time) because of third-party processes, and it is also easy to get a situation where one method seems faster than the other several times in a row. You could increase the precision by increasing the duration of a single iteration and the total number of iterations. But you would still have to close all applications (including the IDE and browser) and check the distributions.
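
As a rough cross-check of the t-test column, the Welch statistic can be recomputed by hand from the Mean, StdDev and N values printed in the detailed results above. The following is only a minimal sketch: the helper names (WelchT, WelchDf, StdErr) are made up for illustration and are not part of BenchmarkDotNet, and it stops at the t statistic and degrees of freedom rather than computing a p-value.

```csharp
using System;

// Hypothetical helpers for illustration only; not BenchmarkDotNet APIs.

// Welch's t statistic from summary statistics of two samples.
double WelchT(double mean1, double sd1, int n1, double mean2, double sd2, int n2)
{
    // Standard error of the difference between the two means.
    double se = Math.Sqrt(sd1 * sd1 / n1 + sd2 * sd2 / n2);
    return (mean1 - mean2) / se;
}

// Welch-Satterthwaite approximation of the degrees of freedom.
double WelchDf(double sd1, int n1, double sd2, int n2)
{
    double a = sd1 * sd1 / n1;
    double b = sd2 * sd2 / n2;
    return (a + b) * (a + b) / (a * a / (n1 - 1) + b * b / (n2 - 1));
}

// Standard error of a single mean: it shrinks with the square root of N.
double StdErr(double sd, int n) => sd / Math.Sqrt(n);

// Values copied from the detailed results above (times in us, N = 15 each).
double t = WelchT(95.5063, 0.9947, 15, 95.4350, 0.3541, 15);
double df = WelchDf(0.9947, 15, 0.3541, 15);

Console.WriteLine($"t = {t:F3}, df = {df:F1}");
Console.WriteLine($"StdErr(Benchmark1) = {StdErr(0.9947, 15):F4} us");
```

With these inputs t comes out at roughly 0.26 with about 17.5 degrees of freedom, far below the value of about 2.1 that would be needed for significance at the 5% level, which is consistent with the reported p-value of 0.7969. The StdErr helper also shows why more iterations help: the standard error only falls with the square root of N, so halving it takes about four times as many iterations.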

AndreyAkinshin on 19 Dec 2016

All 10 comments

Hello @Mike-EEE, glad you like our project.

I will look at your problem in a few days. Which benchmark should I run?

AndreyAkinshin on 19 Dec 2016

Great! Thank you, @AndreyAkinshin. And yes, the exact benchmark is slightly relevant to the discussion, isn't it? 😄 Details details! That benchmark would be ExtendedXmlSerializerTest.SerializationClassWithPrimitive and I have also updated the original post with this information as well. Please let me know if there is anything further I can provide to assist with this issue.

Mike-E-angelo on 19 Dec 2016

@Mike-EEE, could you create a revision with both implementations? Let's say ExtendedXmlSerializer and ExtendedXmlSerializer2? I understand that it will look ugly, but it will really help to produce a nice performance investigation.

AndreyAkinshin on 19 Dec 2016

Ah yes that would be helpful, wouldn't it, @AndreyAkinshin? 😄 I guess I was thinking there might be something obvious to consider and didn't want to go down the road of putting more time into this than I already have if there was a "did you try this?" consideration.

So I have created a new branch altogether, and everything you need now is in the Benchmarks file, with two tests each exercising a different serializer. I have also taken the liberty of renaming the serializers and subsequent test classes so that they are easier to tell apart:

Hopefully that will help you out. Please let me know if there is anything further I can do if not. 👍

Mike-E-angelo on 19 Dec 2016

WOW thank you for taking the time to look at this, @AndreyAkinshin! It is very much appreciated.

So yes, to start with, I was not using the latest version of Benchmark.NET -- shame on me! I did update to it and used the tests as you described and I am seeing the exact same thing:

     Method |       Mean |    StdDev | Scaled | Scaled-StdDev | t-test p-value | Rank |
----------- |----------- |---------- |------- |-------------- |--------------- |----- |
 Benchmark1 | 74.6404 us | 0.1122 us |   1.00 |          0.00 |         1.0000 |    2 |
 Benchmark2 | 74.4679 us | 0.0454 us |   1.00 |          0.00 |      5.41e-005 |    1 |

However, there seems to be something that occurs when you combine these two classes in the same benchmark class, because when I separate them out (as I have been testing extensively up to this point), I am definitely seeing a difference between the two.

Here's what I get for FastButShouldBeSlowExtendedXmlSerializerTest configured as prescribed above with 0.10.1:

    [RankColumn, WelchTTestPValueColumn]
    public class FastButShouldBeSlowExtendedXmlSerializerTest
    {
        private readonly TestClassOtherClass _obj = new TestClassOtherClass();
        private readonly string _xml;
        private readonly IExtendedXmlSerializer _serializer1 = new FastButShouldBeSlowExtendedXmlSerializer();

        public FastButShouldBeSlowExtendedXmlSerializerTest()
        {
            _obj.Init();
            _xml = _serializer1.Serialize(_obj);
        }

        [Benchmark(Baseline = true)]
        public string Benchmark1() => _serializer1.Serialize(_obj);
    }

Here is the lowest value I could produce after many attempts. Note that I did not get 71 this time, but I got the 72 relatively quickly:

     Method |       Mean |    StdDev | Scaled | Scaled-StdDev | t-test p-value | Rank |
----------- |----------- |---------- |------- |-------------- |--------------- |----- |
 Benchmark1 | 72.1818 us | 0.0115 us |   1.00 |          0.00 |         1.0000 |    1 |

And for SlowButShouldBeFastExtendedXmlSerializerTest:

    [RankColumn, WelchTTestPValueColumn]
    public class SlowButShouldBeFastExtendedXmlSerializerTest
    {
        private readonly TestClassOtherClass _obj = new TestClassOtherClass();
        private readonly string _xml;
        private readonly IExtendedXmlSerializer _serializer2 = new SlowButShouldBeFastExtendedXmlSerializer();

        public SlowButShouldBeFastExtendedXmlSerializerTest()
        {
            _obj.Init();
            _xml = _serializer2.Serialize(_obj);
            // _xml = _serializer2.Serialize(_obj);
        }

        [Benchmark(Baseline = true)]
        public string Benchmark1() => _serializer2.Serialize(_obj);
    }

Mostly I was getting 75s-77s, but I did land a 74:

     Method |       Mean |    StdDev | Scaled | Scaled-StdDev | t-test p-value | Rank |
----------- |----------- |---------- |------- |-------------- |--------------- |----- |
 Benchmark1 | 74.4981 us | 0.1098 us |   1.00 |          0.00 |         1.0000 |    1 |

Note that I never even landed a 77 or even 76 with FastButShouldBeSlowExtendedXmlSerializerTest.

Also, I have been testing this without any apps running, and by turning off Windows Defender and pretty much my whole system tray. :) I understand that there are fluctuations in system load and that times are not consistent. However, what _has_ been consistent is that the two tests simply never go beyond certain thresholds (when they are separated into separate Benchmark classes), and that the class that should be "faster" and the one that should be "slower"... well, isn't.

So, I guess I am curious whether you find that there are differing values when you separate out the tests into two different benchmark classes? I understand and realize that values can and will differ, and that testing times are a nebulous, inconsistent art. However, seeing as I have _never_ gotten below ~74 for SlowButShouldBeFastExtendedXmlSerializerTest or above ~75 for FastButShouldBeSlowExtendedXmlSerializerTest, I suspect that something is amiss here.

However, I also concede that I can be horribly incorrect about this as I am continuing to learn here! Thanks again for your time and assistance in any case. 👍

Mike-E-angelo on 19 Dec 2016

Ok, I will check this case, but I need more time for that.

AndreyAkinshin on 19 Dec 2016

Hey @AndreyAkinshin, seeing as you were originally going to take a look at this issue in a few days, that is perfectly acceptable to me. 😄 No worries on this end. Please take your time!

Mike-E-angelo on 19 Dec 2016

@Mike-EEE can this one be closed? Did you find the answer?

adamsitnik on 2 Apr 2017

Ah, yes @adamsitnik we can indeed. I think we can chalk this one up to tripping over myself while learning way too many things at once. Thanks again to you and @AndreyAkinshin for all your help here!

Mike-E-angelo on 2 Apr 2017
