Go: testing: benchmark iteration reports incorrectly

Created on 26 Sep 2020  Â·  20Comments  Â·  Source: golang/go

What version of Go are you using (go version)?

$ go version
go version devel +8266570ba7 Fri Sep 25 19:10:05 2020 +0000 darwin/amd64

Does this issue reproduce with the latest release?

Yes. 1.15.2.

What operating system and processor architecture are you using (go env)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/changkun/Library/Caches/go-build"
GOENV="/Users/changkun/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/changkun/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/changkun/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/changkun/dev/godev/go-gerrit"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/changkun/dev/godev/go-gerrit/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/changkun/dev/godev/go-gerrit/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/jd/vym7tt9s2_379d4ccjcj1xrw0000gn/T/go-build637901099=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

According to the doc:

The benchmark function must run the target code b.N times. During benchmark execution, b.N is adjusted until the benchmark function lasts long enough to be timed reliably. The output

BenchmarkRandInt-8    68453040            17.8 ns/op

means that the loop ran 68453040 times at a speed of 17.8 ns per loop.

But the following test fails and the benchmark runs the target code more than b.N times.

// main_test.go
func TestResultIteration(t *testing.T) {
    var v int32
    r := testing.Benchmark(func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            atomic.AddInt32(&v, 1)
        }
    })

    if r.N != int(v) {
        t.Fatalf("incorrect iteration count, want %d got %d", v, r.N)
    }
}

What did you expect to see?

PASS

What did you see instead?

=== RUN   TestResultIteration
    main_test.go:27: incorrect iteration count, want 359815341 got 258805240
--- FAIL: TestResultIteration (1.67s)
FAIL
NeedsInvestigation

Most helpful comment

The implementation could run the benchmark incrementally other than trying multiple times to predict the iteration. In this way, the testing facility could offer a more consistent result.

Running the benchmark incrementally would potentially be faster, but not necessarily more consistent. Many benchmarks have a non-trivial transient at the start of the benchmark (for example, due to CPU cache misses). Summing the results of multiple incremental runs would also sum the transient effects, whereas taking only the last run (with the final computed N) includes only a single transient, and should therefore have less noise for certain benchmarks.

All 20 comments

Because the benchmark function is run more than once.

(~/src) % cat main_test.go
package main

import (
        "sync/atomic"
        "testing"
)

// main_test.go
func TestResultIteration(t *testing.T) {
        var v int32
        var count int
        testing.Benchmark(func(b *testing.B) {
                count++
                for i := 0; i < b.N; i++ {
                        atomic.AddInt32(&v, 1)
                }
        })
        t.Logf("benchmark function ran %v times", count)

}
(~/src) % go test -bench=. -v main_test.go
=== RUN   TestResultIteration
    main_test.go:18: benchmark function ran 6 times
--- PASS: TestResultIteration (1.57s)
PASS
ok      command-line-arguments  1.582s

@davecheney Yes. Should a fix go for the doc or go for the code?

@davecheney Yes. Should a fix go for the doc or go for the code?

First, what's the actual problem?

First, what's the actual problem?

The test states the actual problem.

I don't think this is a bug, testing.Benchmark returns a BenchmarkResult which is the result of _a_ benchmark run.

https://golang.org/pkg/testing/#BenchmarkResult

I don't think this is a bug, testing.Benchmark returns a BenchmarkResult, which is the result of _a_ benchmark run.

Maybe. But the doc left the impression that the target code runs b.N times. If I have a shared resource, running the benchmark multiple times leads to more manipulation on the shared resource, which isn't what was expected.

The implementation could run the benchmark incrementally other than trying multiple times to predict the iteration. In this way, the testing facility could offer a more consistent result.

The other option could be at least document this behavior.

Change https://golang.org/cl/257647 mentions this issue: testing: grow benchmark iteration incrementally

@bcmills @jayconrod @matloob

Doesn't seem like a bug.

The benchmark function runs more than once with varying numbers of iterations. It starts with a small number of iterations so that if the benchmark function is slow, it doesn't take too long. It ramps up to a larger number of iterations for functions, since shorter runs will be noisy. BenchmarkResult only includes the results from the last run. It would be a bug for it to include results from earlier runs, since we expect them to be noisy.

What would change in the documentation? It seems concise but complete. The sentence "During benchmark execution, b.N is adjusted until the benchmark function lasts long enough to be timed reliably." describes this behavior.

@jayconrod Thanks for your response.

It ramps up to a larger number of iterations for functions, since shorter runs will be noisy

Is this the core reason that the current implementation iterates multiple times to predict the actual run batch size? As mentioned above, the implementation could run the benchmark incrementally other than trying multiple times. Statistically speaking, the final average using benchstat can avoid noise sources and provide comparable results.

"During benchmark execution, b.N is adjusted until the benchmark function lasts long enough to be timed reliably." describes this behavior.

Not really. The first impression could be: "b.N increases incrementally until it meets the time constraints." The confusing is how "be timed reliability" was defined in this case. There is no further prior knowledge on the benchmark function, it is totally possible with the default time window (1s) is too short, and a developer is responsible to decide the bench time. Thus the feasible way to produce comparable results is still to further verify with benchstat.

Will you willing to review CL 257647? It suggests code change other than doc change and also addresses #27217.

Is this the core reason that the current implementation iterates multiple times to predict the actual run batch size?

I think so. I'm not super-familiar with this code, but that's probably how I would write it.

As mentioned above, the implementation could run the benchmark incrementally other than trying multiple times.

I think by "incrementally", you mean that go test would accumulate the elapsed time and iteration count over several calls to the benchmark function instead of just returning the time and count from the last run (please correct me if I have that wrong).

That could add significant noise in fast benchmarks where the loop body only takes a few cycles to execute, i.e., fewer cycles than it takes to measure elapsed time and do the other benchmarking calculations. It's been a while since I did low-level benchmarking, but on older architectures, getting the time took a couple hundred cycles since it flushed CPU pipelines. That stuff adds a lot of noise.

That could add significant noise in fast benchmarks where the loop body only takes a few cycles to execute, i.e., fewer cycles than it takes to measure elapsed time and do the other benchmarking calculations. It's been a while since I did low-level benchmarking, but on older architectures, getting the time took a couple hundred cycles since it flushed CPU pipelines. That stuff adds a lot of noise.

Indeed, that's the theoretical impression. But I benchmarked a few benchmarks, for instance, the go1 bench.

Two executions of the go1 bench before CL 257647 (tip v.s. tip):

name                     old time/op    new time/op    delta
BinaryTree17-6              2.08s ± 2%     2.02s ± 2%  -2.81%  (p=0.000 n=9+9)
Fannkuch11-6                1.91s ± 1%     1.89s ± 0%  -0.70%  (p=0.000 n=10+10)
FmtFprintfEmpty-6          29.8ns ± 2%    29.1ns ± 1%  -2.17%  (p=0.001 n=10+8)
FmtFprintfString-6         51.4ns ± 4%    50.7ns ± 3%    ~     (p=0.071 n=10+10)
FmtFprintfInt-6            55.2ns ± 3%    54.9ns ± 3%    ~     (p=0.183 n=10+10)
FmtFprintfIntInt-6         84.0ns ± 1%    83.5ns ± 3%    ~     (p=0.093 n=10+10)
FmtFprintfPrefixedInt-6    94.2ns ± 1%    93.4ns ± 3%    ~     (p=0.136 n=10+10)
FmtFprintfFloat-6           151ns ± 0%     149ns ± 0%  -1.06%  (p=0.000 n=6+10)
FmtManyArgs-6               361ns ± 0%     358ns ± 1%  -0.94%  (p=0.000 n=10+10)
GobDecode-6                3.82ms ± 1%    3.75ms ± 1%  -1.83%  (p=0.000 n=10+10)
GobEncode-6                3.01ms ± 1%    2.94ms ± 1%  -2.20%  (p=0.000 n=9+9)
Gzip-6                      165ms ± 1%     161ms ± 0%  -2.41%  (p=0.000 n=9+10)
Gunzip-6                   24.7ms ± 1%    24.4ms ± 1%  -1.21%  (p=0.000 n=10+9)
HTTPClientServer-6         56.3µs ± 2%    54.7µs ± 0%  -2.92%  (p=0.000 n=9+8)
JSONEncode-6               6.35ms ± 2%    6.19ms ± 1%  -2.48%  (p=0.000 n=10+9)
JSONDecode-6               28.0ms ± 0%    27.9ms ± 1%  -0.48%  (p=0.008 n=9+9)
Mandelbrot200-6            3.31ms ± 0%    3.30ms ± 0%  -0.36%  (p=0.004 n=8+9)
GoParse-6                  2.27ms ± 0%    2.27ms ± 1%  -0.32%  (p=0.046 n=8+9)
RegexpMatchEasy0_32-6      45.7ns ± 0%    46.1ns ± 2%  +0.78%  (p=0.018 n=10+9)
RegexpMatchEasy0_1K-6       126ns ± 0%     126ns ± 1%    ~     (p=0.405 n=8+10)
RegexpMatchEasy1_32-6      41.9ns ± 3%    42.4ns ± 9%    ~     (p=0.373 n=10+10)
RegexpMatchEasy1_1K-6       213ns ± 1%     210ns ± 1%  -1.11%  (p=0.000 n=9+10)
RegexpMatchMedium_32-6     3.90ns ± 1%    3.84ns ± 0%  -1.31%  (p=0.000 n=9+9)
RegexpMatchMedium_1K-6     21.7µs ± 0%    21.9µs ± 6%    ~     (p=0.315 n=8+10)
RegexpMatchHard_32-6       1.04µs ± 1%    1.04µs ± 1%    ~     (p=0.914 n=8+10)
RegexpMatchHard_1K-6       31.4µs ± 1%    31.0µs ± 1%  -1.34%  (p=0.000 n=10+10)
Revcomp-6                   264ms ± 0%     260ms ± 2%  -1.41%  (p=0.004 n=9+9)
Template-6                 35.1ms ± 0%    34.9ms ± 1%  -0.38%  (p=0.021 n=8+9)
TimeParse-6                 224ns ± 1%     222ns ± 0%  -0.76%  (p=0.000 n=10+8)
TimeFormat-6                219ns ± 0%     218ns ± 0%  -0.46%  (p=0.000 n=9+9)

name                     old speed      new speed      delta
GobDecode-6               201MB/s ± 1%   205MB/s ± 1%  +1.87%  (p=0.000 n=10+10)
GobEncode-6               255MB/s ± 1%   261MB/s ± 1%  +2.25%  (p=0.000 n=9+9)
Gzip-6                    118MB/s ± 1%   121MB/s ± 0%  +2.47%  (p=0.000 n=9+10)
Gunzip-6                  786MB/s ± 1%   796MB/s ± 1%  +1.22%  (p=0.000 n=10+9)
JSONEncode-6              306MB/s ± 2%   313MB/s ± 1%  +2.54%  (p=0.000 n=10+9)
JSONDecode-6             69.3MB/s ± 0%  69.6MB/s ± 1%  +0.49%  (p=0.007 n=9+9)
GoParse-6                25.5MB/s ± 0%  25.6MB/s ± 1%  +0.33%  (p=0.038 n=8+9)
RegexpMatchEasy0_32-6     700MB/s ± 1%   695MB/s ± 2%  -0.72%  (p=0.017 n=10+9)
RegexpMatchEasy0_1K-6    8.07GB/s ± 1%  8.12GB/s ± 1%    ~     (p=0.105 n=10+10)
RegexpMatchEasy1_32-6     764MB/s ± 3%   756MB/s ± 9%    ~     (p=0.315 n=10+10)
RegexpMatchEasy1_1K-6    4.81GB/s ± 1%  4.87GB/s ± 0%  +1.31%  (p=0.000 n=10+9)
RegexpMatchMedium_32-6    257MB/s ± 1%   260MB/s ± 0%  +1.31%  (p=0.000 n=9+9)
RegexpMatchMedium_1K-6   47.3MB/s ± 0%  46.8MB/s ± 6%    ~     (p=0.304 n=8+10)
RegexpMatchHard_32-6     30.6MB/s ± 1%  30.7MB/s ± 1%    ~     (p=0.897 n=8+10)
RegexpMatchHard_1K-6     32.6MB/s ± 1%  33.1MB/s ± 1%  +1.36%  (p=0.000 n=10+10)
Revcomp-6                 962MB/s ± 0%   976MB/s ± 2%  +1.44%  (p=0.004 n=9+9)
Template-6               55.3MB/s ± 0%  55.5MB/s ± 1%  +0.38%  (p=0.021 n=8+9)

Go 1 bench before and after the CL (tip v.s. tip+CL):

name                     old time/op    new time/op    delta
BinaryTree17-6              2.02s ± 2%     1.72s ± 1%  -14.78%  (p=0.000 n=9+9)
Fannkuch11-6                1.89s ± 0%     1.89s ± 0%   -0.19%  (p=0.004 n=10+8)
FmtFprintfEmpty-6          29.1ns ± 1%    29.0ns ± 1%     ~     (p=0.292 n=8+10)
FmtFprintfString-6         50.7ns ± 3%    50.4ns ± 1%     ~     (p=0.780 n=10+10)
FmtFprintfInt-6            54.9ns ± 3%    54.6ns ± 1%     ~     (p=0.743 n=10+8)
FmtFprintfIntInt-6         83.5ns ± 3%    84.8ns ± 1%   +1.64%  (p=0.022 n=10+10)
FmtFprintfPrefixedInt-6    93.4ns ± 3%    94.5ns ± 2%     ~     (p=0.085 n=10+10)
FmtFprintfFloat-6           149ns ± 0%     150ns ± 1%   +0.54%  (p=0.019 n=10+10)
FmtManyArgs-6               358ns ± 1%     357ns ± 1%     ~     (p=0.301 n=10+10)
GobDecode-6                3.75ms ± 1%    3.75ms ± 1%     ~     (p=0.633 n=10+8)
GobEncode-6                2.94ms ± 1%    2.97ms ± 2%   +0.76%  (p=0.031 n=9+9)
Gzip-6                      161ms ± 0%     160ms ± 0%   -0.46%  (p=0.003 n=10+9)
Gunzip-6                   24.4ms ± 1%    24.2ms ± 1%   -0.71%  (p=0.004 n=9+10)
HTTPClientServer-6         54.7µs ± 0%    52.4µs ± 1%   -4.20%  (p=0.000 n=8+9)
JSONEncode-6               6.19ms ± 1%    6.31ms ± 1%   +1.92%  (p=0.000 n=9+10)
JSONDecode-6               27.9ms ± 1%    27.9ms ± 0%     ~     (p=0.190 n=9+9)
Mandelbrot200-6            3.30ms ± 0%    3.27ms ± 1%   -0.87%  (p=0.000 n=9+10)
GoParse-6                  2.27ms ± 1%    2.28ms ± 1%   +0.46%  (p=0.017 n=9+10)
RegexpMatchEasy0_32-6      46.1ns ± 2%    44.6ns ± 1%   -3.16%  (p=0.000 n=9+9)
RegexpMatchEasy0_1K-6       126ns ± 1%     126ns ± 1%     ~     (p=0.192 n=10+9)
RegexpMatchEasy1_32-6      42.4ns ± 9%    41.3ns ± 1%     ~     (p=0.666 n=10+10)
RegexpMatchEasy1_1K-6       210ns ± 1%     208ns ± 1%   -0.92%  (p=0.000 n=10+8)
RegexpMatchMedium_32-6     3.84ns ± 0%    3.73ns ± 1%   -2.92%  (p=0.000 n=9+10)
RegexpMatchMedium_1K-6     21.9µs ± 6%    22.0µs ± 1%     ~     (p=0.156 n=10+9)
RegexpMatchHard_32-6       1.04µs ± 1%    1.03µs ± 1%     ~     (p=0.127 n=10+9)
RegexpMatchHard_1K-6       31.0µs ± 1%    31.2µs ± 1%   +0.65%  (p=0.005 n=10+10)
Revcomp-6                   260ms ± 2%     265ms ± 1%   +1.69%  (p=0.002 n=9+8)
Template-6                 34.9ms ± 1%    35.1ms ± 0%     ~     (p=0.063 n=9+9)
TimeParse-6                 222ns ± 0%     227ns ± 0%   +2.25%  (p=0.000 n=8+10)
TimeFormat-6                218ns ± 0%     216ns ± 0%   -0.92%  (p=0.000 n=9+9)

name                     old speed      new speed      delta
GobDecode-6               205MB/s ± 1%   205MB/s ± 1%     ~     (p=0.633 n=10+8)
GobEncode-6               261MB/s ± 1%   259MB/s ± 2%   -0.75%  (p=0.031 n=9+9)
Gzip-6                    121MB/s ± 0%   121MB/s ± 0%   +0.46%  (p=0.003 n=10+9)
Gunzip-6                  796MB/s ± 1%   801MB/s ± 1%   +0.71%  (p=0.004 n=9+10)
JSONEncode-6              313MB/s ± 1%   308MB/s ± 1%   -1.89%  (p=0.000 n=9+10)
JSONDecode-6             69.6MB/s ± 1%  69.6MB/s ± 0%     ~     (p=0.216 n=9+9)
GoParse-6                25.6MB/s ± 1%  25.4MB/s ± 1%   -0.46%  (p=0.017 n=9+10)
RegexpMatchEasy0_32-6     695MB/s ± 2%   717MB/s ± 1%   +3.22%  (p=0.000 n=9+9)
RegexpMatchEasy0_1K-6    8.12GB/s ± 1%  8.15GB/s ± 1%     ~     (p=0.243 n=10+9)
RegexpMatchEasy1_32-6     756MB/s ± 9%   775MB/s ± 1%     ~     (p=0.684 n=10+10)
RegexpMatchEasy1_1K-6    4.87GB/s ± 0%  4.90GB/s ± 1%   +0.64%  (p=0.010 n=9+10)
RegexpMatchMedium_32-6    260MB/s ± 0%   268MB/s ± 1%   +3.06%  (p=0.000 n=9+10)
RegexpMatchMedium_1K-6   46.8MB/s ± 6%  46.6MB/s ± 1%     ~     (p=0.156 n=10+9)
RegexpMatchHard_32-6     30.7MB/s ± 1%  30.9MB/s ± 1%     ~     (p=0.108 n=10+9)
RegexpMatchHard_1K-6     33.1MB/s ± 1%  32.9MB/s ± 1%   -0.64%  (p=0.005 n=10+10)
Revcomp-6                 976MB/s ± 2%   960MB/s ± 1%   -1.67%  (p=0.002 n=9+8)
Template-6               55.5MB/s ± 1%  55.4MB/s ± 0%     ~     (p=0.066 n=9+9)

which doesn't show significant noise in fast benchmarks (with the CL) on i5-8500B. Instead, the execution time of the benchmark is significantly improved from 438.864s to 215.693s.

These are slower benchmarks than I was thinking of. I'd expect to see more noise in something that runs in 10 ns or less, like acquiring an uncontested lock.

The execution time improvement is nice, but precision and accuracy are the highest priority. Since I don't think the current methodology is actually wrong, any proposed change should at least hold those steady. There are some pretty sizable differences in times above (BinaryTree17 is different by 14.78%).

There are some pretty sizable differences in times above (BinaryTree17 is different by 14.78%).

The diff that shows bigger diff is clearly not caused by the CL, as you can tell from the others. Since we are talking about microbenchmarks here, thus I don't fully convince by this argument. Could you maybe suggest more counter-examples?
Of course, a more strict benchmark test in a strictly controlled environment can be conducted for the CL, but what's showed here is a quick test for the first impression.

As you agreed, we are talking about code that runs in 10ns or less. It is pretty easy to verify with atomics:

func BenchmarkAtomic(b *testing.B) {
    var v int32
    for i := 0; i < b.N; i++ {
        atomic.AddInt32(&v, 1)
    }
}

tip vs tip+CL:

name       old time/op  new time/op  delta
Atomic-12  4.58ns ± 0%  4.58ns ± 0%   ~     (p=1.094 n=8+9)

What is the concrete problem you are trying to address here?

(*testing.B.N) _requests_ the number of iterations for the user-supplied function _to_ run, it does not _report_ the number of iterations actually run. (Compare #38677.)

For improving the overall execution time of fast benchmarks, see #10930.

The implementation could run the benchmark incrementally other than trying multiple times to predict the iteration. In this way, the testing facility could offer a more consistent result.

Running the benchmark incrementally would potentially be faster, but not necessarily more consistent. Many benchmarks have a non-trivial transient at the start of the benchmark (for example, due to CPU cache misses). Summing the results of multiple incremental runs would also sum the transient effects, whereas taking only the last run (with the final computed N) includes only a single transient, and should therefore have less noise for certain benchmarks.

Many benchmarks have a non-trivial transient at the start of the benchmark (for example, due to CPU cache misses) [...] should therefore have less noise for certain benchmarks.

Yes, I know. But the argument of suggesting running it incrementally is that the noise can be omitted when later verify with significance testing.

What is the concrete problem you are trying to address here?

The original purpose is to address "I have a shared resource, running the benchmark leads to more manipulation on the shared resource, which isn't what was expected." A subsequent investigation found that it is also one of the causes of https://github.com/golang/go/issues/27217, as described in CL 257647 message.

the argument of suggesting running it incrementally is that the noise can be omitted when later verify with significance testing.

Note that the testing package _does not_ currently perform any significance testing when computing b.N.

Understood. Close because no changes are suggested to proceed.

Was this page helpful?
0 / 5 - 0 ratings