What version of Go are you using (go version)?
$ go version
go version go1.15.5 darwin/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output:
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/xx/Library/Caches/go-build"
GOENV="/Users/xx/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/xx/gocode/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/xx/gocode"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.15.5/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.15.5/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/xr/hsc7mk1n5glg6vfqb4z5nlkh0000gn/T/go-build663468880=/tmp/go-build -gno-record-gcc-switches -fno-common"
testing.PB.Next() only returns a boolean indicating whether another iteration exists when running parallel benchmarks.
This is insufficient in many situations where one needs to know the current iteration index.
See the Go code in the attached stencil.go.zip file.
Consider the serial elision of the 3-point stencil example, which reads from an input slice in and writes to an output slice out, computing in[i-1] + in[i] + in[i+1] for iteration index i and storing the result in out[i].
func stencilSerial() {
    r := testing.Benchmark(func(b *testing.B) {
        in := make([]int, b.N)
        out := make([]int, b.N)
        for i := 0; i < b.N; i++ {
            in[i] = i
        }
        for i := 0; i < b.N; i++ {
            left := 0
            if i > 0 {
                left = in[i-1]
            }
            right := 0
            if i < int(b.N)-1 {
                right = in[i+1]
            }
            out[i] = left + in[i] + right
        }
    })
    fmt.Printf("\n stencilSerial %v", r)
}
Converting it to a parallel benchmark, here is a first, racy version, which is definitely incorrect.
The for i := 0; pb.Next(); i++ counter is local to each goroutine, so multiple goroutines will overwrite the same index out[i] while other indices are never read or updated; hence it is functionally incorrect.
Furthermore, cache-line ping-ponging will degrade the performance of this parallel code (shown at the end of this issue).
func stencilParallelRacy() {
    r := testing.Benchmark(func(b *testing.B) {
        in := make([]int, b.N)
        out := make([]int, b.N)
        for i := 0; i < b.N; i++ {
            in[i] = i
        }
        b.RunParallel(func(pb *testing.PB) {
            for i := 0; pb.Next(); i++ {
                left := 0
                if i > 0 {
                    left = in[i-1]
                }
                right := 0
                if i < int(b.N)-1 {
                    right = in[i+1]
                }
                // racy update, and not all out[i] are updated!
                out[i] = left + in[i] + right
            }
        })
    })
    fmt.Printf("\n stencilParallelRacy %v", r)
}
A somewhat saner parallel version creates the in and out slices inside each goroutine, as shown below.
However, this benchmark has several problems.
1) It bloats memory usage by a factor of O(GOMAXPROCS), and one cannot know what slice size to allocate in each goroutine because of dynamic load balancing.
2) It is not functionally equivalent to the original, because not every index of the input slice is read, nor is every index of the output slice written.
func stencilParallel() {
    r := testing.Benchmark(func(b *testing.B) {
        b.RunParallel(func(pb *testing.PB) {
            in := make([]int, b.N)
            out := make([]int, b.N)
            for i := 0; i < b.N; i++ {
                in[i] = i
            }
            for i := 0; pb.Next(); i++ {
                left := 0
                if i > 0 {
                    left = in[i-1]
                }
                right := 0
                if i < int(b.N)-1 {
                    right = in[i+1]
                }
                out[i] = left + in[i] + right
            }
        })
    })
    fmt.Printf("\n stencilParallel %v", r)
}
We need a func (pb *PB) NextIndex() (int, bool) API that returns the iteration index, so that each iteration inside the parallel region can take the appropriate action. See the code below, with the proposed NextIndex API in action.
func stencilParallelNextIndex() {
    r := testing.Benchmark(func(b *testing.B) {
        in := make([]int, b.N)
        out := make([]int, b.N)
        for i := 0; i < b.N; i++ {
            in[i] = i
        }
        b.RunParallel(func(pb *testing.PB) {
            for {
                i, ok := pb.NextIndex()
                if !ok {
                    break
                }
                left := 0
                if i > 0 {
                    left = in[i-1]
                }
                right := 0
                if i < b.N-1 {
                    right = in[i+1]
                }
                out[i] = left + in[i] + right
            }
        })
    })
    fmt.Printf("\n stencilParallelNextIndex %v", r)
}
It is worth observing the running time of each of these examples (I have implemented NextIndex locally) when scaling GOMAXPROCS from 1 to 8.
GOMAXPROCS=1 go run stencil.go
stencilSerial 233812798 6.19 ns/op
stencilParallelRacy 323430168 5.72 ns/op
stencilParallel 314360209 3.81 ns/op
stencilParallelNextIndex 309754171 3.89 ns/op
GOMAXPROCS=2 go run stencil.go
stencilSerial 234993747 6.34 ns/op
stencilParallelRacy 437399902 5.94 ns/op
stencilParallel 432057882 6.89 ns/op
stencilParallelNextIndex 443609341 2.66 ns/op
GOMAXPROCS=4 go run stencil.go
stencilSerial 243237640 6.39 ns/op
stencilParallelRacy 542654800 3.41 ns/op
stencilParallel 194487830 7.74 ns/op
stencilParallelNextIndex 572145619 2.10 ns/op
GOMAXPROCS=8 go run stencil.go
stencilSerial 240228440 6.31 ns/op
stencilParallelRacy 697957074 4.28 ns/op
stencilParallel 100000000 15.0 ns/op
stencilParallelNextIndex 651700406 1.87 ns/op
stencilSerial, as expected, stays put.
stencilParallelRacy scales poorly due to heavy cache-line conflicts (and of course produces incorrect results).
stencilParallel scales poorly due to its bloated memory needs and the associated initialization in each goroutine.
stencilParallelNextIndex shows higher (although not perfectly linear) throughput as GOMAXPROCS grows, which is what one wants when writing parallel benchmarks.
This is a small change, although it introduces a new API.
Can you show sample documentation for the NextIndex method? I don't understand what it returns. Thanks.
Could a particular benchmark solve this problem by using its own local variable with atomic.AddInt32? Could the testing package do that more efficiently in some way?
The documentation for (*PB).NextIndex() (int, bool) could read as follows.
NextIndex reports whether there are more iterations to execute.
If there are more iterations to execute, a unique integer index in the range [0, b.N) is additionally returned.
Due to the concurrent execution of iterations, there is no ordering guarantee for the indices returned by NextIndex().
However, it is guaranteed that every index in [0, b.N) is returned once and only once.
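Rendered as a Go doc comment, that could look roughly like the sketch below (the exact wording is only illustrative):
// NextIndex reports whether there are more iterations to execute.
// If so, it also returns a unique iteration index in the range [0, b.N).
// Indices may be returned in any order due to concurrent execution,
// but every index in [0, b.N) is returned exactly once.
func (pb *PB) NextIndex() (int, bool)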
There is definitely a performance advantage to adding the NextIndex API in the testing package. It will eliminate the atomic addition on each loop iteration; instead, there will be one atomic add per PB.grain (https://golang.org/src/testing/benchmark.go?s=20045:20070#L706).
Furthermore, there is a programmatic advantage to keeping it inside the testing package: making each parallel benchmark repeatedly manage its own atomic.AddInt32 is tedious and error-prone, and it makes the code look ugly.
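For illustration, a minimal sketch of that hand-rolled bookkeeping (the counter variable and the reuse of the in and out slices from the stencil example are assumptions; sync/atomic is required):
var counter int64 // per-benchmark shared iteration counter, managed by user code
b.RunParallel(func(pb *testing.PB) {
    for pb.Next() {
        // One extra atomic add on every iteration, in the benchmark itself.
        i := int(atomic.AddInt64(&counter, 1)) - 1
        out[i] = in[i] // stand-in for the real per-index work
    }
})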
Why is this part of Next? Why not just have a pb.Index that returns the current goroutine's index?
@rsc is there a provable superiority (programmatic or performance) of having Index over NextIndex? I see the performance advantage of NextIndex, which both advances the iterator and provides the index. If one wants to access only the index, because they call it many times from different parts of the program (although I find that less efficient), then Index would be justified.
is there a provable superiority (programmatic or performance) of having Index over NextIndex?
b.RunParallel(func(pb *testing.PB) {
    for {
        i, ok := pb.NextIndex()
        if !ok {
            break
        }
        // ...
    }
})
instead of
b.RunParallel(func(pb *testing.PB) {
    for pb.Next() {
        i := pb.Index()
        // ...
    }
})
makes me ask the inverse question: is there a provable superiority (programmatic or performance) of having NextIndex over Index?
I'm unconvinced the loss of readability is justified by having one less function call per iteration (hypothetical, considering inlining).
@antichris, Index() is an interesting proposal, but I would still like to argue in favor of NextIndex() for the following two reasons.
Usability:
pb.Index() is not a standalone API, unlike how you use it in your second example. pb.Index() is intricately coupled with what happened at pb.Next(). pb.Index() cannot return an index when pb.Next() has not been called even once by the calling goroutine, and also when pb.Next() has crossed the last iteration. Hence, the API should look more like (*PB) Index() (int, bool). With that signature, its usage does not look very different from NextIndex(); granted, you can ignore the second return value inside the same loop where pb.Next() was performed, but in full generality, calls to Index have to check it. So, in summary, the readability benefits of Index() are not significant.
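For example, with that hypothetical (*PB) Index() (int, bool) signature, the loop would still look something like the following sketch:
b.RunParallel(func(pb *testing.PB) {
    for pb.Next() {
        i, ok := pb.Index() // hypothetical two-value form
        if !ok {
            break // in full generality the second value has to be checked
        }
        // ... use i ...
    }
})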
Performance:
Let's not discount or speculate about performance by assuming that inlining will happen or that the compiler is omniscient. Look at the code below, which exercises both the NextIndex() and Index() variants. On my iMac with 8 cores (2x SMT), there is a 32% performance loss when using Next+Index compared with NextIndex. Your mileage may vary.
$ ./Index
benchIndex 267 3771623 ns/op
benchNextIndex 372 2846961 ns/op
package main

import (
    "fmt"
    "testing"
)

func benchNextIndex() {
    r := testing.Benchmark(func(b *testing.B) {
        for i := 0; i < 100000; i++ {
            b.RunParallel(func(pb *testing.PB) {
                sum := uint64(0)
                for {
                    v, ok := pb.NextIndex()
                    if !ok {
                        break
                    }
                    sum += v
                }
            })
        }
    })
    fmt.Printf("benchNextIndex %v\n", r)
}

func benchIndex() {
    r := testing.Benchmark(func(b *testing.B) {
        for i := 0; i < 100000; i++ {
            b.RunParallel(func(pb *testing.PB) {
                sum := uint64(0)
                for pb.Next() {
                    v, _ := pb.Index() // Comment this line to gain 32% performance
                    sum += v
                }
            })
        }
    })
    fmt.Printf("benchIndex %v\n", r)
}

func main() {
    benchIndex()
    benchNextIndex()
}
Of course, this raises the question of how I implemented these functions. The code snippets below should give you an idea:
// A PB is used by RunParallel for running parallel benchmarks.
type PB struct {
    globalN *uint64 // shared between all worker goroutines iteration counter
    grain   uint64  // acquire that many iterations from globalN at once
    cache   uint64  // local cache of acquired iterations
    bN      uint64  // total number of iterations to execute (b.N)
    where   uint64  // last n we got // <== added this field
}

// Next reports whether there are more iterations to execute.
func (pb *PB) Next() bool {
    if pb.cache == 0 {
        n := atomic.AddUint64(pb.globalN, pb.grain)
        if n <= pb.bN {
            pb.cache = pb.grain
        } else if n < pb.bN+pb.grain {
            pb.cache = pb.bN + pb.grain - n
        } else {
            return false
        }
        pb.where = n - pb.grain // <== added this line
    }
    pb.where++ // <== added this line
    pb.cache--
    return true
}

// Index provides the current index.
func (pb *PB) Index() (uint64, bool) {
    if pb.where == 0 {
        return 0, false
    }
    if pb.where > pb.bN {
        return 0, false
    }
    return pb.where - 1, true
}

// NextIndex provides the next index and reports whether there are more iterations to execute.
func (pb *PB) NextIndex() (uint64, bool) {
    if pb.cache == 0 {
        n := atomic.AddUint64(pb.globalN, pb.grain)
        if n <= pb.bN {
            pb.cache = pb.grain
        } else if n < pb.bN+pb.grain {
            pb.cache = pb.bN + pb.grain - n
        } else {
            return 0, false
        }
        pb.where = n - pb.grain
    }
    retVal := pb.where
    pb.where++
    pb.cache--
    return retVal, true
}
@chabbimilind, with your code unaltered, I get an average
name time/op
Index 1.64ms ± 1%
NextIndex 1.61ms ± 1%
over 50 runs, which amounts to at most a 2% performance loss on my machine. There must be something wrong with yours.
If Index is reduced to just return the goroutine's index,
func (pb *PB) Index() uint64 { return pb.where - 1 }
there is no performance loss at all, there's actually even a 0.42% gain, according to benchstat (I had to rename both benchmarks to match so that they could be compared):
name old time/op new time/op delta
(Next)Index 1.61ms ± 1% 1.61ms ± 1% -0.42% (p=0.000 n=48+50)
That might be due to elimination of an XCHGL AX, AX at 0x000f in benchNextIndex.func1.1 which is replaced by a NOP in benchIndex.func1.1, but I suspect that's a more or less architecture-specific optimization together with the inlining that happened to both functions (just as I speculated earlier).
This, I believe, is the way pb.Index() is intended (when @rsc suggested it) to be used — always exclusively after a pb.Next() call:
// ...
for pb.Next() {
    i := pb.Index()
    // use i
}
// ...
The removal of the checks (and overhead) it had in your version might make it less safe to use improperly: it would return -1 (as an unsigned wraparound) before the first call to Next, and the last index after Next has started returning false, which might not be what one expects to get. Although, I think, one should not be expecting anything great from calling it outside of the appropriate context (which is only when Next has returned true).
Even you said so yourself:
pb.Index() is not a standalone API unlike how you use in your second example. pb.Index() is intricately coupled with what happened at pb.Next().
Looking at the code sample I suppose you were referring to (which is partially reproduced just above here), I can't really understand what you mean by "unlike how I use", though.
Your mileage may vary.
Oh, it did, my friend. Indeed it did.
@antichris how many threads (CPU cores) on your machine?
Are you proposing returning -1 on failure from pb.Index()? If that is the desired way Go standard APIs work, fine!
I have recorded the fact that there can be noticeable performance losses. I understand the focus of Golang is not high-performance.
how many threads (CPU cores) on your machine?
4 cores, 8 threads, according to the manufacturer's [datasheet].
I think I know where you're headed with this, though. That's why I rewrote your benchmark code to "proper" Go benchmarks, so that they could be executed with all the wonderful features that test flags provide, including (but not limited to) ergonomic control over the run count and GOMAXPROCS.
package main_test

import "testing"

func BenchmarkNextIndex(b *testing.B) {
    for i := 0; i < 100000; i++ {
        b.RunParallel(func(pb *testing.PB) {
            sum := uint64(0)
            for {
                v, ok := pb.NextIndex()
                if !ok {
                    break
                }
                sum += v
            }
        })
    }
}

func BenchmarkIndex(b *testing.B) {
    for i := 0; i < 100000; i++ {
        b.RunParallel(func(pb *testing.PB) {
            sum := uint64(0)
            for pb.Next() {
                v, _ := pb.Index()
                // v := pb.Index() // Use this line instead of the above
                // when a single value is returned.
                sum += v
            }
        })
    }
}
The invocation I used for running the benchmarks:
go test -bench=. -cpu=1,2,4,8 -count=10|tee results
I did not want to babysit my machine (it has a tendency toward a sharp performance drop when the screen goes to sleep or wakes up) for more than half an hour waiting for 400 runs to complete, hence the -count=10. Feel free to experiment with larger values.
The benchmark results were renamed to matching ones when comparing/summarizing with benchstat, using BenchmarkNextIndex as the baseline for each comparison.
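One plausible way to do that renaming and comparison (the file names and the sed renaming here are assumptions, not the exact commands used):
grep 'BenchmarkNextIndex' results | sed 's/BenchmarkNextIndex/Benchmark(Next)Index/' > old.txt
grep 'BenchmarkIndex' results | sed 's/BenchmarkIndex/Benchmark(Next)Index/' > new.txt
benchstat old.txt new.txt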
The results for your original implementation of Index:
name old time/op new time/op delta
(Next)Index 827µs ± 1% 826µs ± 1% ~ (p=0.730 n=9+9)
(Next)Index-2 1.04ms ± 1% 1.05ms ± 2% +1.16% (p=0.004 n=10+9)
(Next)Index-4 1.36ms ± 1% 1.38ms ± 3% +1.81% (p=0.010 n=9+10)
(Next)Index-8 1.61ms ± 1% 1.63ms ± 0% +1.06% (p=0.000 n=10+8)
goos: linux
goarch: amd64
pkg: testing/index
BenchmarkNextIndex 1365 822306 ns/op
BenchmarkNextIndex 1238 827097 ns/op
BenchmarkNextIndex 1250 826455 ns/op
BenchmarkNextIndex 1256 825291 ns/op
BenchmarkNextIndex 1248 829401 ns/op
BenchmarkNextIndex 1274 831540 ns/op
BenchmarkNextIndex 1264 824213 ns/op
BenchmarkNextIndex 1262 828072 ns/op
BenchmarkNextIndex 1260 827564 ns/op
BenchmarkNextIndex 1262 848524 ns/op
BenchmarkNextIndex-2 1072 1040714 ns/op
BenchmarkNextIndex-2 1126 1038666 ns/op
BenchmarkNextIndex-2 1105 1044023 ns/op
BenchmarkNextIndex-2 1108 1024421 ns/op
BenchmarkNextIndex-2 1119 1033484 ns/op
BenchmarkNextIndex-2 1076 1044225 ns/op
BenchmarkNextIndex-2 1105 1040204 ns/op
BenchmarkNextIndex-2 1093 1029051 ns/op
BenchmarkNextIndex-2 1122 1041408 ns/op
BenchmarkNextIndex-2 1106 1040794 ns/op
BenchmarkNextIndex-4 906 1372349 ns/op
BenchmarkNextIndex-4 903 1365971 ns/op
BenchmarkNextIndex-4 895 1349072 ns/op
BenchmarkNextIndex-4 895 1357245 ns/op
BenchmarkNextIndex-4 901 1359918 ns/op
BenchmarkNextIndex-4 913 1362478 ns/op
BenchmarkNextIndex-4 906 1358369 ns/op
BenchmarkNextIndex-4 896 1359996 ns/op
BenchmarkNextIndex-4 904 1403334 ns/op
BenchmarkNextIndex-4 892 1347904 ns/op
BenchmarkNextIndex-8 697 1618363 ns/op
BenchmarkNextIndex-8 697 1610496 ns/op
BenchmarkNextIndex-8 698 1610698 ns/op
BenchmarkNextIndex-8 704 1609407 ns/op
BenchmarkNextIndex-8 702 1595361 ns/op
BenchmarkNextIndex-8 697 1615093 ns/op
BenchmarkNextIndex-8 696 1618305 ns/op
BenchmarkNextIndex-8 692 1607051 ns/op
BenchmarkNextIndex-8 708 1613780 ns/op
BenchmarkNextIndex-8 690 1597287 ns/op
BenchmarkIndex 1375 816488 ns/op
BenchmarkIndex 1239 825060 ns/op
BenchmarkIndex 1274 822574 ns/op
BenchmarkIndex 1218 832834 ns/op
BenchmarkIndex 1275 823519 ns/op
BenchmarkIndex 1270 841124 ns/op
BenchmarkIndex 1261 828743 ns/op
BenchmarkIndex 1270 828936 ns/op
BenchmarkIndex 1254 823869 ns/op
BenchmarkIndex 1263 830012 ns/op
BenchmarkIndex-2 1094 1029340 ns/op
BenchmarkIndex-2 1090 1018124 ns/op
BenchmarkIndex-2 1105 1047049 ns/op
BenchmarkIndex-2 1114 1049284 ns/op
BenchmarkIndex-2 1129 1059698 ns/op
BenchmarkIndex-2 1095 1044032 ns/op
BenchmarkIndex-2 1111 1042077 ns/op
BenchmarkIndex-2 1093 1045498 ns/op
BenchmarkIndex-2 1138 1055870 ns/op
BenchmarkIndex-2 1132 1074446 ns/op
BenchmarkIndex-4 900 1408361 ns/op
BenchmarkIndex-4 898 1378696 ns/op
BenchmarkIndex-4 908 1397115 ns/op
BenchmarkIndex-4 889 1373676 ns/op
BenchmarkIndex-4 901 1404665 ns/op
BenchmarkIndex-4 879 1344952 ns/op
BenchmarkIndex-4 898 1402569 ns/op
BenchmarkIndex-4 890 1358545 ns/op
BenchmarkIndex-4 895 1372714 ns/op
BenchmarkIndex-4 889 1397904 ns/op
BenchmarkIndex-8 694 1631860 ns/op
BenchmarkIndex-8 694 1627596 ns/op
BenchmarkIndex-8 684 1622259 ns/op
BenchmarkIndex-8 693 1627623 ns/op
BenchmarkIndex-8 697 1624575 ns/op
BenchmarkIndex-8 682 1627252 ns/op
BenchmarkIndex-8 691 1608856 ns/op
BenchmarkIndex-8 700 1626498 ns/op
BenchmarkIndex-8 680 1637182 ns/op
BenchmarkIndex-8 693 1625829 ns/op
PASS
ok testing/index 212.816s
Running with Index reduced to a single return:
name old time/op new time/op delta
(Next)Index 808µs ± 1% 810µs ± 1% ~ (p=0.278 n=9+10)
(Next)Index-2 1.02ms ± 1% 1.03ms ± 2% +1.23% (p=0.035 n=10+10)
(Next)Index-4 1.38ms ± 1% 1.38ms ± 1% ~ (p=0.315 n=10+10)
(Next)Index-8 1.61ms ± 1% 1.61ms ± 1% ~ (p=0.853 n=10+10)
goos: linux
goarch: amd64
pkg: testing/index
BenchmarkNextIndex 1185 845181 ns/op
BenchmarkNextIndex 1378 800164 ns/op
BenchmarkNextIndex 1309 802616 ns/op
BenchmarkNextIndex 1274 804395 ns/op
BenchmarkNextIndex 1286 805231 ns/op
BenchmarkNextIndex 1258 808409 ns/op
BenchmarkNextIndex 1296 814793 ns/op
BenchmarkNextIndex 1244 811059 ns/op
BenchmarkNextIndex 1298 805301 ns/op
BenchmarkNextIndex 1246 818426 ns/op
BenchmarkNextIndex-2 1114 1023248 ns/op
BenchmarkNextIndex-2 1106 1027178 ns/op
BenchmarkNextIndex-2 1118 1014118 ns/op
BenchmarkNextIndex-2 1104 1016580 ns/op
BenchmarkNextIndex-2 1135 1011715 ns/op
BenchmarkNextIndex-2 1122 1011229 ns/op
BenchmarkNextIndex-2 1108 1019155 ns/op
BenchmarkNextIndex-2 1130 1029966 ns/op
BenchmarkNextIndex-2 1131 1028267 ns/op
BenchmarkNextIndex-2 1130 1013857 ns/op
BenchmarkNextIndex-4 912 1384737 ns/op
BenchmarkNextIndex-4 901 1374862 ns/op
BenchmarkNextIndex-4 913 1374009 ns/op
BenchmarkNextIndex-4 897 1366385 ns/op
BenchmarkNextIndex-4 907 1386615 ns/op
BenchmarkNextIndex-4 902 1378177 ns/op
BenchmarkNextIndex-4 913 1389875 ns/op
BenchmarkNextIndex-4 907 1399369 ns/op
BenchmarkNextIndex-4 904 1379526 ns/op
BenchmarkNextIndex-4 908 1387491 ns/op
BenchmarkNextIndex-8 702 1613103 ns/op
BenchmarkNextIndex-8 703 1597941 ns/op
BenchmarkNextIndex-8 696 1608357 ns/op
BenchmarkNextIndex-8 704 1605088 ns/op
BenchmarkNextIndex-8 708 1610483 ns/op
BenchmarkNextIndex-8 705 1607491 ns/op
BenchmarkNextIndex-8 709 1593077 ns/op
BenchmarkNextIndex-8 702 1611638 ns/op
BenchmarkNextIndex-8 699 1607134 ns/op
BenchmarkNextIndex-8 709 1601878 ns/op
BenchmarkIndex 1281 811523 ns/op
BenchmarkIndex 1287 807519 ns/op
BenchmarkIndex 1292 808608 ns/op
BenchmarkIndex 1278 812850 ns/op
BenchmarkIndex 1257 812999 ns/op
BenchmarkIndex 1292 811811 ns/op
BenchmarkIndex 1268 804930 ns/op
BenchmarkIndex 1312 807247 ns/op
BenchmarkIndex 1262 808766 ns/op
BenchmarkIndex 1280 808776 ns/op
BenchmarkIndex-2 1100 1022728 ns/op
BenchmarkIndex-2 1111 1016110 ns/op
BenchmarkIndex-2 1144 1036095 ns/op
BenchmarkIndex-2 1128 1013398 ns/op
BenchmarkIndex-2 1131 1021993 ns/op
BenchmarkIndex-2 1142 1052461 ns/op
BenchmarkIndex-2 1137 1036276 ns/op
BenchmarkIndex-2 1131 1050766 ns/op
BenchmarkIndex-2 1142 1034039 ns/op
BenchmarkIndex-2 1124 1036667 ns/op
BenchmarkIndex-4 908 1373611 ns/op
BenchmarkIndex-4 909 1371849 ns/op
BenchmarkIndex-4 904 1374661 ns/op
BenchmarkIndex-4 902 1362660 ns/op
BenchmarkIndex-4 915 1371153 ns/op
BenchmarkIndex-4 904 1380963 ns/op
BenchmarkIndex-4 908 1381588 ns/op
BenchmarkIndex-4 904 1386198 ns/op
BenchmarkIndex-4 903 1390065 ns/op
BenchmarkIndex-4 878 1385323 ns/op
BenchmarkIndex-8 688 1611506 ns/op
BenchmarkIndex-8 700 1606052 ns/op
BenchmarkIndex-8 704 1610623 ns/op
BenchmarkIndex-8 706 1604036 ns/op
BenchmarkIndex-8 693 1618573 ns/op
BenchmarkIndex-8 705 1596077 ns/op
BenchmarkIndex-8 681 1629448 ns/op
BenchmarkIndex-8 705 1596526 ns/op
BenchmarkIndex-8 699 1594360 ns/op
BenchmarkIndex-8 706 1597677 ns/op
PASS
ok testing/index 215.121s
I have recorded the fact that there can be noticeable performance losses.
Those "noticeable performance losses" seem to be very slight in general, and practically non-existent in the case of the simplified, single value return implementation of Index.
Could you please be so kind as to run some benchmarks yourself (preferably with -count substantially more than 1) to confirm that your original result was not just a one time fluke/malfunction of your machine?
Otherwise, based on my experience with benchmarking this issue, I believe it is only a fair assumption to be made, that you cherry-picked your 32-percent-performance-loss result in bad faith, and, having grown so attached to the design you originally proposed, you are unwilling to concede it for a potentially more sensible and versatile one.
I have a question for @rsc, though: what about the fact that the Next+Index split API requires changes to Next to increment the goroutine index, changes that are very likely to have an impact on the performance of all existing benchmarks that currently rely on it?
Are you proposing returning -1 on failure from pb.Index()? If that is the desired way Go standard APIs work...
No, not really. I believe failure, in this case, is something one would have checked for via a Next call. If that has returned false, there is no point in calling Index. You yourself explicitly stated that a call to Index on its own does not make sense; the API you originally proposed does not even provide such a function to call. And over the past few posts in this thread we have come to the conclusion that there is, after all, potential for an inefficient implementation to have a negative impact on performance.
... fine!
Da-yumn, @chabbimilind! If that ain't what they call "butthurt", I sincerely have no clue what was the impression you were trying to make with that. :stuck_out_tongue_winking_eye:
Just breathe, friend, relax and focus: having your proposal revised and improved by the gopher community is not the worst thing that could have happened to you today, nor is it a damning statement about your qualities as an individual and a member of the society.
is there a provable superiority (programmatic or performance) of having NextIndex over Index?
There's already a Next, so adding just Index avoids having two different Next operations.
It is also obviously possible to implement these in such a way that there is no performance difference:
func (pb *PB) Next() bool {
    var ok bool
    ok, pb.index = pb.nextIndex()
    return ok
}

func (pb *PB) Index() int {
    return pb.index
}
So I'm not placing much weight in the benchmarks showing a difference.
What I don't see here is much indication that this use of the extra index is something that is generally advisable in parallel benchmarks. My understanding of the parallel benchmark functionality was to be able to measure how fast it would be to run b.N iterations in parallel goroutines, keeping all the goroutines busy even when there is obviously some variation in how quickly each executes, so that you can't just hand out b.N/#goroutines to each one.
The more I think about this, the more I think we've gotten derailed by performance and API and have failed to recognize that adding either Index or NextIndex would be a mistake. If the work done for iteration i is not interchangeable with the work done for iteration j, that's not a valid benchmark at all.
My first objection to the stencilParallel example is that it probably matters which indexes a goroutine is writing. If pb.Next is doing some kind of fine-grained iteration where it splits the work across G goroutines and hands out k, k+G, k+2G, k+3G in a single goroutine until we get close to the end, then all the different goroutines are going to have cache collisions writing to nearby entries in "out". On the other hand, if pb.Next works on blocks and hands out mostly contiguous indexes to particular goroutines, then there will be few cache collisions. The reported speed of a benchmarked function should clearly not depend on the exact iteration order returned by package testing. And of course the way to make that happen is to hide the iteration order completely, as we do now.
But my second, more serious objection is that the stencilSerial example is not even a correct single-threaded benchmark. It's really important that b.N only change the number of times the code runs, not _what_ code runs. In particular, it is a mistake to change the size of the data being processed based on b.N, because cache behavior changes non-linearly with data size, which then invalidates "divide total time by b.N to find per-operation time". When the data size is changing too, per-operation time is not actually a meaningful value - it's not constant.
I think if we were going to adopt Index as an API, we would need a compelling, correct example. We don't have one.
/cc @dvyukov
There is something I don't fully understand in the problem statement: for parallel algorithms, don't we want to measure the _production_ parallelization rather than some random scheme provided by the testing package? If we get some parallel speedup, that does not seem useful, because the actual production code does not have any parallelization. Perhaps you want some parallel algorithm support package that can be used in production code and then measured in benchmarks.
RunParallel is not intended for parallel algorithms, it's intended for measuring N logically independent copies of an operation to check how well that scales.
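The intended shape, along the lines of the example in the RunParallel documentation, is something like the sketch below, where every iteration is one independent, interchangeable operation (this assumes the bytes and text/template packages):
templ := template.Must(template.New("test").Parse("Hello, {{.}}!"))
b.RunParallel(func(pb *testing.PB) {
    // Each goroutine gets its own buffer; every iteration is one independent Execute.
    var buf bytes.Buffer
    for pb.Next() {
        buf.Reset()
        templ.Execute(&buf, "World")
    }
})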
The proposed NextIndex scheme also seems to imply a dependence of data size on the number of iterations. That's generally wrong because of the associated non-linear effects. E.g., a faster version may trigger a 10x larger data size and end up being slower... A better scheme is to do the whole large operation b.N times on a data set of the target size (or sizes).
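A hedged sketch of that scheme applied to the stencil, with an arbitrary fixed problem size and b.N repetitions of the whole operation (the size constant and benchmark name are assumptions):
func BenchmarkStencil(b *testing.B) {
    const size = 1 << 20 // fixed data size, independent of b.N
    in := make([]int, size)
    out := make([]int, size)
    for i := range in {
        in[i] = i
    }
    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        // One op is the whole stencil pass over the fixed-size slices.
        for i := 0; i < size; i++ {
            left, right := 0, 0
            if i > 0 {
                left = in[i-1]
            }
            if i < size-1 {
                right = in[i+1]
            }
            out[i] = left + in[i] + right
        }
    }
}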
But unique index assignment per se can be done with pb.Next as follows:
var index uint64
b.RunParallel(func(pb *testing.PB) {
    for pb.Next() {
        const batch = 42
        myIndex := atomic.AddUint64(&index, 1) - 1
        for i := myIndex * batch; i < (myIndex+1)*batch; i++ {
            // ...
        }
    }
})