Go: proposal: testing: add (*PB).NextIndex() (int, bool) method

Created on 6 Dec 2020 · 14 comments · Source: golang/go

What version of Go are you using (go version)?

$ go version
go version go1.15.5 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/xx/Library/Caches/go-build"
GOENV="/Users/xx/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/xx/gocode/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/xx/gocode"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.15.5/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.15.5/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/xr/hsc7mk1n5glg6vfqb4z5nlkh0000gn/T/go-build663468880=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

testing.PB.Next() returns only a boolean reporting whether there is another iteration to execute when running parallel benchmarks.
This is insufficient in many situations where one needs to know the current iteration index.
See the Go code in the attached stencil.go.zip file.
Consider the serial version of the 3-point stencil example, which reads from an input slice in and writes to an output slice out, computing in[i-1] + in[i] + in[i+1] for each iteration index i and storing the result in out[i].

func stencilSerial() {
    r := testing.Benchmark(func(b *testing.B) {
        in := make([]int, b.N)
        out := make([]int, b.N)
        for i := 0; i < b.N; i++ {
            in[i] = i
        }

        for i := 0; i < b.N; i++ {
            left := 0
            if i > 0 {
                left = in[i-1]
            }
            right := 0
            if i < b.N-1 {
                right = in[i+1]
            }
            out[i] = left + in[i] + right
        }
    })
    fmt.Printf("\n stencilSerial %v", r)
}

Try to convert it to a parallel benchmark; here is a first, racy version, which is definitely incorrect.
The for i := 0; pb.Next(); i++ loop counter is local to each goroutine, so multiple goroutines will overwrite the same index out[i] while other indices are never read or updated; hence the code is functionally incorrect.
Furthermore, the cache-line ping-ponging degrades the performance of this parallel code (shown at the end of this issue).

func stencilParallelRacy() {
    r := testing.Benchmark(func(b *testing.B) {
        in := make([]int, b.N)
        out := make([]int, b.N)
        for i := 0; i < b.N; i++ {
            in[i] = i
        }
        b.RunParallel(func(pb *testing.PB) {
            for i := 0; pb.Next(); i++ {
                left := 0
                if i > 0 {
                    left = in[i-1]
                }
                right := 0
                if i < b.N-1 {
                    right = in[i+1]
                }
                // racy update, and not all out[i] entries are written!
                out[i] = left + in[i] + right
            }
        })
    })
    fmt.Printf("\n stencilParallelRacy %v", r)
}

A somewhat sane parallel version creates the in and out slices inside each goroutine, as shown below.
However, this benchmark has several problems.
1) It bloats memory use by a factor of GOMAXPROCS, and one cannot know the right slice size to create for each goroutine due to dynamic load balancing.
2) It is not functionally equivalent to the original, because not all indices of the input slice will be read, nor all indices of the output slice written.

func stencilParallel() {
    r := testing.Benchmark(func(b *testing.B) {
        b.RunParallel(func(pb *testing.PB) {
            in := make([]int, b.N)
            out := make([]int, b.N)
            for i := 0; i < b.N; i++ {
                in[i] = i
            }

            for i := 0; pb.Next(); i++ {
                left := 0
                if i > 0 {
                    left = in[i-1]
                }
                right := 0
                if i < b.N-1 {
                    right = in[i+1]
                }
                out[i] = left + in[i] + right
            }
        })
    })
    fmt.Printf("\n stencilParallel %v", r)
}

We need a func (pb *PB) NextIndex() (int, bool) API that also returns the iteration index, so that each iteration inside the parallel region can take the appropriate action. See the code below, with the proposed NextIndex API in action.

func stencilParallelNextIndex() {
    r := testing.Benchmark(func(b *testing.B) {
        in := make([]int, b.N)
        out := make([]int, b.N)
        for i := 0; i < b.N; i++ {
            in[i] = i
        }
        b.RunParallel(func(pb *testing.PB) {
            for {
                i, ok := pb.NextIndex()
                if !ok {
                    break
                }
                left := 0
                if i > 0 {
                    left = in[i-1]
                }
                right := 0
                if i < b.N-1 {
                    right = in[i+1]
                }
                out[i] = left + in[i] + right
            }
        })
    })
    fmt.Printf("\n stencilParallelNextIndex %v", r)
}

It is worth observing the running time of each of these examples (I have implemented NextIndex locally) when scaling GOMAXPROCS from 1 to 8.

GOMAXPROCS=1 go run stencil.go

 stencilSerial            233812798   6.19 ns/op
 stencilParallelRacy      323430168   5.72 ns/op
 stencilParallel          314360209   3.81 ns/op
 stencilParallelNextIndex 309754171   3.89 ns/op

GOMAXPROCS=2 go run stencil.go

 stencilSerial            234993747   6.34 ns/op
 stencilParallelRacy      437399902   5.94 ns/op
 stencilParallel          432057882   6.89 ns/op
 stencilParallelNextIndex 443609341   2.66 ns/op

GOMAXPROCS=4 go run stencil.go

 stencilSerial            243237640   6.39 ns/op
 stencilParallelRacy      542654800   3.41 ns/op
 stencilParallel          194487830   7.74 ns/op
 stencilParallelNextIndex 572145619   2.10 ns/op

GOMAXPROCS=8 go run stencil.go

 stencilSerial            240228440   6.31 ns/op
 stencilParallelRacy      697957074   4.28 ns/op
 stencilParallel          100000000   15.0 ns/op
 stencilParallelNextIndex 651700406   1.87 ns/op

stencilSerial, as expected, stays put.
stencilParallelRacy scales poorly due to heavy cache-line contention (and, of course, produces incorrect results).
stencilParallel shows pathetic parallel scaling due to its bloated memory use and the associated initialization in each goroutine.
stencilParallelNextIndex shows higher (although not perfectly linear) throughput with more GOMAXPROCS, which is what one wants when writing parallel benchmarks.

This is a small change that introduces one new API.

What did you expect to see?

NA

What did you see instead?

NA

Proposal


All 14 comments

Can you show sample documentation for the NextIndex method? I don't understand what it returns. Thanks.

Could a particular benchmark solve this problem by using its own local variable with atomic.AddInt32? Could the testing package do that more efficiently in some way?

The documentation for (*PB).NextIndex() (int, bool) could look like the following.

NextIndex reports whether there are more iterations to execute.
If there are more iterations to execute, a unique integer index in the range [0, b.N) is additionally returned.
Due to the concurrent execution of the iterations, there is no ordering guarantee for the indices returned by NextIndex().
However, it is guaranteed that every index in [0, b.N) is returned once and only once.
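
In code form, the proposed method would look roughly like this sketch (only the signature is taken from the proposal itself):

// NextIndex reports whether there are more iterations to execute.
// If there are, it also returns a unique integer index in the range [0, b.N).
// Indices may be returned in any order due to concurrent execution,
// but every index in [0, b.N) is returned exactly once.
func (pb *PB) NextIndex() (int, bool)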

There is definitely a performance advantage to adding the NextIndex API in the testing package. It eliminates the per-iteration atomic addition that a hand-rolled counter would need; instead, there is one atomic add per PB.grain (https://golang.org/src/testing/benchmark.go?s=20045:20070#L706).

Furthermore, there is a programmatic advantage to keeping it inside the testing package; making each parallel benchmark repeatedly manage its own atomic.AddInt32 is tedious, error-prone, and makes the code ugly.
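
For contrast, here is a sketch of what each benchmark has to write today to hand out unique indices with its own counter (the counter name and the use of sync/atomic are illustrative assumptions):

var ctr uint64 // shared by all goroutines of this benchmark
b.RunParallel(func(pb *testing.PB) {
    for pb.Next() {
        i := int(atomic.AddUint64(&ctr, 1) - 1) // one atomic op per iteration
        _ = i // use i to index shared data
    }
})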

Why is this part of Next? Why not just have a pb.Index that returns the current goroutine's index?

@rsc is there a provable superiority (programmatic or performance) of having Index over NextIndex? I see the performance advantage of NextIndex, which both advances the iterator and provides the index. If one wants to access only the index, because they call it many times from different parts of the program (although I find that less efficient), then Index would be justified.

is there a provable superiority (programmatic or performance) of having Index over NextIndex?

b.RunParallel(func(pb *testing.PB) {
    for {
        i, ok := pb.NextIndex()
        if !ok {
            break
        }
        // ...
    }
})

instead of

b.RunParallel(func(pb *testing.PB) {
    for pb.Next() {
        i := pb.Index()
        // ...
    }
})

makes me ask the inverse question: is there a provable superiority (programmatic or performance) of having NextIndex over Index?

I'm unconvinced the loss of readability is justified by having one less function call per iteration (hypothetical, considering inlining).

@antichris, Index() is an interesting proposal, but I would still like to argue in favor of NextIndex() for the following two reasons.

  1. Usability:
    pb.Index() is not a standalone API, unlike how you use it in your second example; it is intricately coupled with what happened at pb.Next(). pb.Index() cannot return an index when pb.Next() has not been called even once by the calling goroutine, nor once pb.Next() has crossed the last iteration. Hence, the API should look more like (*PB) Index() (int, bool). With this signature, its usage does not look very different from NextIndex(); granted, you can ignore the second return value when the call sits inside the same loop where pb.Next() was performed, but in full generality, calls to Index() have to check it. In summary, the readability benefits of Index() are not significant.

  2. Performance:
    Let's not discount or speculate about performance by assuming inlining will happen or that the compiler is omniscient. Look at the code below, which exercises both the NextIndex() and Index() variants. On my iMac with 8 cores (2x SMT), there is a 32% performance loss when using Next+Index compared with NextIndex. Your mileage may vary.

$ ./Index 
benchIndex      267    3771623 ns/op
benchNextIndex      372    2846961 ns/op
package main

import (
    "fmt"
    "testing"
)

func benchNextIndex() {
    r := testing.Benchmark(func(b *testing.B) {
        for i := 0; i < 100000; i++ {
            b.RunParallel(func(pb *testing.PB) {
                sum := uint64(0)
                for {
                    v, ok := pb.NextIndex()
                    if !ok {
                        break
                    }
                    sum += v
                }
            })
        }
    })
    fmt.Printf("benchNextIndex %v\n", r)
}

func benchIndex() {
    r := testing.Benchmark(func(b *testing.B) {
        for i := 0; i < 100000; i++ {
            b.RunParallel(func(pb *testing.PB) {
                sum := uint64(0)
                for pb.Next() {
                    v, _ := pb.Index() // Comment this line to gain 32% performance
                    sum += v
                }
            })
        }
    })
    fmt.Printf("benchIndex %v\n", r)
}

func main() {
    benchIndex()
    benchNextIndex()
}

Of course, this raises the question of how I implemented these functions. The code snippets below should give you an idea:

// A PB is used by RunParallel for running parallel benchmarks.
type PB struct {
        globalN *uint64 // shared between all worker goroutines iteration counter
        grain   uint64  // acquire that many iterations from globalN at once
        cache   uint64  // local cache of acquired iterations
        bN      uint64  // total number of iterations to execute (b.N)
        where   uint64  // last n we got  <===== added this line
}
// Next reports whether there are more iterations to execute.
func (pb *PB) Next() bool {
        if pb.cache == 0 { 
                n := atomic.AddUint64(pb.globalN, pb.grain)
                if n <= pb.bN {
                        pb.cache = pb.grain
                } else if n < pb.bN+pb.grain {
                        pb.cache = pb.bN + pb.grain - n 
                } else {
                        return false
                }   
                pb.where = n - pb.grain // <===== added this line
        }
        pb.where++ // <===== added this line
        pb.cache--
        return true
}

// Index provides the current index.
func (pb *PB) Index() (uint64, bool) {
        if pb.where == 0 {
                return 0, false
        }
        if pb.where > pb.bN {
                return 0, false
        }
        return pb.where - 1, true
}

// NextIndex provides the next index and reports whether there are more iterations to execute.
func (pb *PB) NextIndex() (uint64, bool) {
        if pb.cache == 0 {
                n := atomic.AddUint64(pb.globalN, pb.grain)
                if n <= pb.bN {
                        pb.cache = pb.grain
                } else if n < pb.bN+pb.grain {
                        pb.cache = pb.bN + pb.grain - n
                } else {
                        return 0, false
                }
                pb.where = n - pb.grain
        }
        retVal := pb.where
        pb.where++
        pb.cache--
        return retVal, true
}

@chabbimilind, with your code unaltered, I get an average

name       time/op
Index      1.64ms ± 1%
NextIndex  1.61ms ± 1%

over 50 runs, which amounts to at most a 2% performance loss on my machine. There must be something wrong with yours.

If Index is reduced to just return the goroutine's index,

func (pb *PB) Index() uint64 { return pb.where - 1 }

there is no performance loss at all; there is actually even a 0.42% gain, according to benchstat (I had to rename both benchmarks to match so that they could be compared):

name         old time/op  new time/op  delta
(Next)Index  1.61ms ± 1%  1.61ms ± 1%  -0.42%  (p=0.000 n=48+50)

That might be due to the elimination of an XCHGL AX, AX at 0x000f in benchNextIndex.func1.1, which is replaced by a NOP in benchIndex.func1.1; but I suspect that is a more or less architecture-specific optimization, together with the inlining that happened to both functions (just as I speculated earlier).

This, I believe, is the way pb.Index() is intended (when @rsc suggested it) to be used — always exclusively after a pb.Next() call:

// ...
for pb.Next() {
    i := pb.Index()
    // use i
}
// ...

The removal of the checks (and overhead) it had in your version might make it less safe to use improperly: it would return -1 before the first call to Next, and the last index after Next has started returning false, which might not be what one expects to get. Although, I think, one should not expect anything great from calling it outside of the appropriate context (which is only after Next has returned true).

Even you said so yourself:

pb.Index() is not a standalone API unlike how you use in your second example. pb.Index() is intricately coupled with what happened at pb.Next().

Looking at the code sample I suppose you were referring to (which is partially reproduced just above here), I can't really understand what you mean by "unlike how I use", though.

Your mileage may vary.

Oh, it did, my friend. Indeed it did.

@antichris how many threads (CPU cores) on your machine?

Are you proposing returning -1 on failure from pb.Index()? If that is the desired way Go standard APIs work, fine!
I have recorded the fact that there can be noticeable performance losses. I understand the focus of Golang is not high performance.

how many threads (CPU cores) on your machine?

4 cores, 8 threads, according to the manufacturer's [datasheet].

I think I know where you're headed with this, though. That's why I rewrote your benchmark code as "proper" Go benchmarks, so that they can be run with all the wonderful features that test flags provide, including (but not limited to) ergonomic control over the run count and GOMAXPROCS.


package main_test

import "testing"

func BenchmarkNextIndex(b *testing.B) {
    for i := 0; i < 100000; i++ {
        b.RunParallel(func(pb *testing.PB) {
            sum := uint64(0)
            for {
                v, ok := pb.NextIndex()
                if !ok {
                    break
                }
                sum += v
            }
        })
    }
}

func BenchmarkIndex(b *testing.B) {
    for i := 0; i < 100000; i++ {
        b.RunParallel(func(pb *testing.PB) {
            sum := uint64(0)
            for pb.Next() {
                v, _ := pb.Index()
                // v := pb.Index() // Use this line instead of the above
                                   // when a single value is returned.
                sum += v
            }
        })
    }
}

The invocation I used for running the benchmarks:

go test -bench=. -cpu=1,2,4,8 -count=10 | tee results

I did not want to babysit my machine (it has a tendency toward a sharp performance drop when the screen goes to or wakes from sleep) for more than half an hour waiting for 400 runs to complete, hence the -count=10. Feel free to experiment with larger values.

The benchmark results were renamed to matching ones when comparing/summarizing with benchstat, using BenchmarkNextIndex as the baseline for each comparison.

  • The results for your original implementation of Index:


    name           old time/op  new time/op  delta
    (Next)Index     827µs ± 1%   826µs ± 1%    ~     (p=0.730 n=9+9)
    (Next)Index-2  1.04ms ± 1%  1.05ms ± 2%  +1.16%  (p=0.004 n=10+9)
    (Next)Index-4  1.36ms ± 1%  1.38ms ± 3%  +1.81%  (p=0.010 n=9+10)
    (Next)Index-8  1.61ms ± 1%  1.63ms ± 0%  +1.06%  (p=0.000 n=10+8)
    

    goos: linux
    goarch: amd64
    pkg: testing/index
    BenchmarkNextIndex          1365        822306 ns/op
    BenchmarkNextIndex          1238        827097 ns/op
    BenchmarkNextIndex          1250        826455 ns/op
    BenchmarkNextIndex          1256        825291 ns/op
    BenchmarkNextIndex          1248        829401 ns/op
    BenchmarkNextIndex          1274        831540 ns/op
    BenchmarkNextIndex          1264        824213 ns/op
    BenchmarkNextIndex          1262        828072 ns/op
    BenchmarkNextIndex          1260        827564 ns/op
    BenchmarkNextIndex          1262        848524 ns/op
    BenchmarkNextIndex-2        1072       1040714 ns/op
    BenchmarkNextIndex-2        1126       1038666 ns/op
    BenchmarkNextIndex-2        1105       1044023 ns/op
    BenchmarkNextIndex-2        1108       1024421 ns/op
    BenchmarkNextIndex-2        1119       1033484 ns/op
    BenchmarkNextIndex-2        1076       1044225 ns/op
    BenchmarkNextIndex-2        1105       1040204 ns/op
    BenchmarkNextIndex-2        1093       1029051 ns/op
    BenchmarkNextIndex-2        1122       1041408 ns/op
    BenchmarkNextIndex-2        1106       1040794 ns/op
    BenchmarkNextIndex-4         906       1372349 ns/op
    BenchmarkNextIndex-4         903       1365971 ns/op
    BenchmarkNextIndex-4         895       1349072 ns/op
    BenchmarkNextIndex-4         895       1357245 ns/op
    BenchmarkNextIndex-4         901       1359918 ns/op
    BenchmarkNextIndex-4         913       1362478 ns/op
    BenchmarkNextIndex-4         906       1358369 ns/op
    BenchmarkNextIndex-4         896       1359996 ns/op
    BenchmarkNextIndex-4         904       1403334 ns/op
    BenchmarkNextIndex-4         892       1347904 ns/op
    BenchmarkNextIndex-8         697       1618363 ns/op
    BenchmarkNextIndex-8         697       1610496 ns/op
    BenchmarkNextIndex-8         698       1610698 ns/op
    BenchmarkNextIndex-8         704       1609407 ns/op
    BenchmarkNextIndex-8         702       1595361 ns/op
    BenchmarkNextIndex-8         697       1615093 ns/op
    BenchmarkNextIndex-8         696       1618305 ns/op
    BenchmarkNextIndex-8         692       1607051 ns/op
    BenchmarkNextIndex-8         708       1613780 ns/op
    BenchmarkNextIndex-8         690       1597287 ns/op
    BenchmarkIndex              1375        816488 ns/op
    BenchmarkIndex              1239        825060 ns/op
    BenchmarkIndex              1274        822574 ns/op
    BenchmarkIndex              1218        832834 ns/op
    BenchmarkIndex              1275        823519 ns/op
    BenchmarkIndex              1270        841124 ns/op
    BenchmarkIndex              1261        828743 ns/op
    BenchmarkIndex              1270        828936 ns/op
    BenchmarkIndex              1254        823869 ns/op
    BenchmarkIndex              1263        830012 ns/op
    BenchmarkIndex-2            1094       1029340 ns/op
    BenchmarkIndex-2            1090       1018124 ns/op
    BenchmarkIndex-2            1105       1047049 ns/op
    BenchmarkIndex-2            1114       1049284 ns/op
    BenchmarkIndex-2            1129       1059698 ns/op
    BenchmarkIndex-2            1095       1044032 ns/op
    BenchmarkIndex-2            1111       1042077 ns/op
    BenchmarkIndex-2            1093       1045498 ns/op
    BenchmarkIndex-2            1138       1055870 ns/op
    BenchmarkIndex-2            1132       1074446 ns/op
    BenchmarkIndex-4             900       1408361 ns/op
    BenchmarkIndex-4             898       1378696 ns/op
    BenchmarkIndex-4             908       1397115 ns/op
    BenchmarkIndex-4             889       1373676 ns/op
    BenchmarkIndex-4             901       1404665 ns/op
    BenchmarkIndex-4             879       1344952 ns/op
    BenchmarkIndex-4             898       1402569 ns/op
    BenchmarkIndex-4             890       1358545 ns/op
    BenchmarkIndex-4             895       1372714 ns/op
    BenchmarkIndex-4             889       1397904 ns/op
    BenchmarkIndex-8             694       1631860 ns/op
    BenchmarkIndex-8             694       1627596 ns/op
    BenchmarkIndex-8             684       1622259 ns/op
    BenchmarkIndex-8             693       1627623 ns/op
    BenchmarkIndex-8             697       1624575 ns/op
    BenchmarkIndex-8             682       1627252 ns/op
    BenchmarkIndex-8             691       1608856 ns/op
    BenchmarkIndex-8             700       1626498 ns/op
    BenchmarkIndex-8             680       1637182 ns/op
    BenchmarkIndex-8             693       1625829 ns/op
    PASS
    ok      testing/index   212.816s
    

  • Running with Index reduced to a single return:


    name           old time/op  new time/op  delta
    (Next)Index     808µs ± 1%   810µs ± 1%    ~     (p=0.278 n=9+10)
    (Next)Index-2  1.02ms ± 1%  1.03ms ± 2%  +1.23%  (p=0.035 n=10+10)
    (Next)Index-4  1.38ms ± 1%  1.38ms ± 1%    ~     (p=0.315 n=10+10)
    (Next)Index-8  1.61ms ± 1%  1.61ms ± 1%    ~     (p=0.853 n=10+10)
    

    goos: linux
    goarch: amd64
    pkg: testing/index
    BenchmarkNextIndex          1185        845181 ns/op
    BenchmarkNextIndex          1378        800164 ns/op
    BenchmarkNextIndex          1309        802616 ns/op
    BenchmarkNextIndex          1274        804395 ns/op
    BenchmarkNextIndex          1286        805231 ns/op
    BenchmarkNextIndex          1258        808409 ns/op
    BenchmarkNextIndex          1296        814793 ns/op
    BenchmarkNextIndex          1244        811059 ns/op
    BenchmarkNextIndex          1298        805301 ns/op
    BenchmarkNextIndex          1246        818426 ns/op
    BenchmarkNextIndex-2        1114       1023248 ns/op
    BenchmarkNextIndex-2        1106       1027178 ns/op
    BenchmarkNextIndex-2        1118       1014118 ns/op
    BenchmarkNextIndex-2        1104       1016580 ns/op
    BenchmarkNextIndex-2        1135       1011715 ns/op
    BenchmarkNextIndex-2        1122       1011229 ns/op
    BenchmarkNextIndex-2        1108       1019155 ns/op
    BenchmarkNextIndex-2        1130       1029966 ns/op
    BenchmarkNextIndex-2        1131       1028267 ns/op
    BenchmarkNextIndex-2        1130       1013857 ns/op
    BenchmarkNextIndex-4         912       1384737 ns/op
    BenchmarkNextIndex-4         901       1374862 ns/op
    BenchmarkNextIndex-4         913       1374009 ns/op
    BenchmarkNextIndex-4         897       1366385 ns/op
    BenchmarkNextIndex-4         907       1386615 ns/op
    BenchmarkNextIndex-4         902       1378177 ns/op
    BenchmarkNextIndex-4         913       1389875 ns/op
    BenchmarkNextIndex-4         907       1399369 ns/op
    BenchmarkNextIndex-4         904       1379526 ns/op
    BenchmarkNextIndex-4         908       1387491 ns/op
    BenchmarkNextIndex-8         702       1613103 ns/op
    BenchmarkNextIndex-8         703       1597941 ns/op
    BenchmarkNextIndex-8         696       1608357 ns/op
    BenchmarkNextIndex-8         704       1605088 ns/op
    BenchmarkNextIndex-8         708       1610483 ns/op
    BenchmarkNextIndex-8         705       1607491 ns/op
    BenchmarkNextIndex-8         709       1593077 ns/op
    BenchmarkNextIndex-8         702       1611638 ns/op
    BenchmarkNextIndex-8         699       1607134 ns/op
    BenchmarkNextIndex-8         709       1601878 ns/op
    BenchmarkIndex              1281        811523 ns/op
    BenchmarkIndex              1287        807519 ns/op
    BenchmarkIndex              1292        808608 ns/op
    BenchmarkIndex              1278        812850 ns/op
    BenchmarkIndex              1257        812999 ns/op
    BenchmarkIndex              1292        811811 ns/op
    BenchmarkIndex              1268        804930 ns/op
    BenchmarkIndex              1312        807247 ns/op
    BenchmarkIndex              1262        808766 ns/op
    BenchmarkIndex              1280        808776 ns/op
    BenchmarkIndex-2            1100       1022728 ns/op
    BenchmarkIndex-2            1111       1016110 ns/op
    BenchmarkIndex-2            1144       1036095 ns/op
    BenchmarkIndex-2            1128       1013398 ns/op
    BenchmarkIndex-2            1131       1021993 ns/op
    BenchmarkIndex-2            1142       1052461 ns/op
    BenchmarkIndex-2            1137       1036276 ns/op
    BenchmarkIndex-2            1131       1050766 ns/op
    BenchmarkIndex-2            1142       1034039 ns/op
    BenchmarkIndex-2            1124       1036667 ns/op
    BenchmarkIndex-4             908       1373611 ns/op
    BenchmarkIndex-4             909       1371849 ns/op
    BenchmarkIndex-4             904       1374661 ns/op
    BenchmarkIndex-4             902       1362660 ns/op
    BenchmarkIndex-4             915       1371153 ns/op
    BenchmarkIndex-4             904       1380963 ns/op
    BenchmarkIndex-4             908       1381588 ns/op
    BenchmarkIndex-4             904       1386198 ns/op
    BenchmarkIndex-4             903       1390065 ns/op
    BenchmarkIndex-4             878       1385323 ns/op
    BenchmarkIndex-8             688       1611506 ns/op
    BenchmarkIndex-8             700       1606052 ns/op
    BenchmarkIndex-8             704       1610623 ns/op
    BenchmarkIndex-8             706       1604036 ns/op
    BenchmarkIndex-8             693       1618573 ns/op
    BenchmarkIndex-8             705       1596077 ns/op
    BenchmarkIndex-8             681       1629448 ns/op
    BenchmarkIndex-8             705       1596526 ns/op
    BenchmarkIndex-8             699       1594360 ns/op
    BenchmarkIndex-8             706       1597677 ns/op
    PASS
    ok      testing/index   215.121s
    

I have recorded the fact that there can be noticeable performance losses.

Those "noticeable performance losses" seem to be very slight in general, and practically non-existent in the case of the simplified, single value return implementation of Index.

Could you please be so kind as to run some benchmarks yourself (preferably with -count substantially greater than 1) to confirm that your original result was not just a one-time fluke/malfunction of your machine?

Otherwise, based on my experience benchmarking this issue, I believe it is only fair to assume that you cherry-picked your 32-percent-performance-loss result in bad faith and, having grown attached to the design you originally proposed, are unwilling to concede it for a potentially more sensible and versatile one.

I have a question for @rsc, though: what about the fact that the Next+Index split API requires changes to Next to increment the goroutine index, changes that are very likely to have an impact on the performance of all existing benchmarks that currently rely on it?

Are you proposing returning -1 on failure from pb.Index()? If that is the desired way Go standard APIs work...

No, not really. I believe failure, in this case, is something one would have checked for via a Next call. If that has returned false, there is no point in calling Index. You yourself explicitly stated that a call to Index on its own does not make sense; the API you originally proposed does not even provide such a function to call. And over the past few posts in this thread we have come to the conclusion that there is, after all, potential for an inefficient implementation to have a negative impact on performance.

... fine!

Da-yumn, @chabbimilind! If that ain't what they call "butthurt", I sincerely have no clue what impression you were trying to make with that. :stuck_out_tongue_winking_eye:

Just breathe, friend, relax and focus: having your proposal revised and improved by the gopher community is not the worst thing that could have happened to you today, nor is it a damning statement about your qualities as an individual and a member of society.

is there a provable superiority (programmatic or performance) of having NextIndex over Index?

There's already a Next, so adding just Index avoids having two different Next operations.

It is also obviously possible to implement these in such a way that there is no performance difference:

func (pb *PB) Next() bool {
    var ok bool
    ok, pb.index = pb.nextIndex()
    return ok
}

func (pb *PB) Index() int {
    return pb.index
}

So I'm not placing much weight in the benchmarks showing a difference.

What I don't see here is much indication that this use of the extra index is something that is generally advisable in parallel benchmarks. My understanding of the parallel benchmark functionality was to be able to measure how fast it would be to run b.N iterations in parallel goroutines, keeping all the goroutines busy even when there is obviously some variation in how quickly each executes, so that you can't just hand out b.N/#goroutines to each one.

The more I think about this, the more I think we've gotten derailed by performance and API and have failed to recognize that adding either Index or NextIndex would be a mistake. If the work done for iteration i is not interchangeable with the work done for iteration j, that's not a valid benchmark at all.

My first objection to the stencilParallel example is that it probably matters which indexes a goroutine is writing. If pb.Next is doing some kind of fine-grained iteration where it splits the work across G goroutines and hands out k, k+G, k+2G, k+3G in a single goroutine until we get close to the end, then all the different goroutines are going to have cache collisions writing to nearby entries in "out". On the other hand, if pb.Next works on blocks and hands out mostly contiguous indexes to particular goroutines, then there will be few cache collisions. The reported speed of a benchmarked function should clearly not depend on the exact iteration order returned by package testing. And of course the way to make that happen is to hide the iteration order completely, as we do now.

But my second, more serious objection is that the stencilSerial example is not even a correct single-threaded benchmark. It's really important that b.N only change the number of times the code runs, not _what_ code runs. In particular, it is a mistake to change the size of the data being processed based on b.N, because cache behavior changes non-linearly with data size, which then invalidates "divide total time by b.N to find per-operation time". When the data size is changing too, per-operation time is not actually a meaningful value - it's not constant.

I think if we were going to adopt Index as an API, we would need a compelling, correct example. We don't have one.
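
To make the second objection concrete: a valid serial stencil benchmark would keep the data size fixed and run the whole stencil b.N times, as in this sketch (the size is an arbitrary assumption):

func BenchmarkStencil(b *testing.B) {
    const size = 1 << 16 // fixed data size, independent of b.N
    in := make([]int, size)
    out := make([]int, size)
    for i := range in {
        in[i] = i
    }
    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        for i := 0; i < size; i++ {
            left, right := 0, 0
            if i > 0 {
                left = in[i-1]
            }
            if i < size-1 {
                right = in[i+1]
            }
            out[i] = left + in[i] + right
        }
    }
}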

/cc @dvyukov

There is something I don't fully understand in the problem statement: for parallel algorithms, don't we want to measure the _production_ parallelization rather than some random scheme provided by the testing package? If we get some parallel speed-up, it does not seem useful, because the actual production code does not have any parallelization. Perhaps you want a parallel-algorithm support package that can be used in production code and then measured in benchmarks.

RunParallel is not intended for parallel algorithms; it is intended for measuring N logically independent copies of an operation to check how well that scales.

The proposed NextIndex scheme also seems to imply that the data size depends on the number of iterations. That is generally the wrong thing to do because of the associated non-linear effects: e.g., a faster version may trigger a 10x larger data size and end up being slower. A better scheme is to do the whole large operation b.N times on a data set of the target size (or sizes).
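
In RunParallel form, that scheme might look like the sketch below (the data size and the summing body are illustrative assumptions; each iteration processes the whole fixed-size input):

func BenchmarkWholeOperation(b *testing.B) {
    const size = 1 << 16 // target data size, independent of b.N
    in := make([]int, size)
    b.RunParallel(func(pb *testing.PB) {
        sum := 0
        for pb.Next() {
            // The whole large operation, performed once per iteration.
            for _, v := range in {
                sum += v
            }
        }
        _ = sum // discard the illustrative result
    })
}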

But unique index assignment per se can be done with pb.Next as follows:

var index uint64
b.RunParallel(func(pb *testing.PB) {
    for pb.Next() {
        const batch = 42
        myIndex := atomic.AddUint64(&index, 1) - 1
        for i := myIndex * batch; i < (myIndex+1)*batch; i++ {
            // ...
        }
    }
})
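
A self-contained version of that pattern might look like the following sketch (the batch size and the per-index work are illustrative, and the testing and sync/atomic imports are assumed):

func BenchmarkBatchedIndex(b *testing.B) {
    var index uint64
    b.RunParallel(func(pb *testing.PB) {
        var sum uint64
        for pb.Next() {
            const batch = 42
            // Claim a contiguous batch of unique indices with one atomic op.
            myIndex := atomic.AddUint64(&index, 1) - 1
            for i := myIndex * batch; i < (myIndex+1)*batch; i++ {
                sum += i // illustrative per-index work
            }
        }
        _ = sum // discard the illustrative result
    })
}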
