@tombergan and I have been debugging a severe performance regression that Kubernetes observed when switching from Go 1.7 to Go 1.8. The problem turned out to be the addition of net/http.Server.Shutdown, which is currently implemented by closing a channel that all the open connections select on.
(Details in https://github.com/kubernetes/kubernetes/issues/45216 and https://github.com/golang/go/issues/20302)
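For context, the shape of the problematic pattern, boiled down to a rough sketch (illustrative names, not the actual net/http source): every open connection's serve loop selects on one channel shared by all connections, and Shutdown closes it.

var doneChan = make(chan struct{}) // one channel shared by every open connection

// serveConn stands in for a per-connection serve loop.
func serveConn(work <-chan func()) {
	for {
		select {
		case <-doneChan: // shared across all connections: contended
			return
		case job := <-work: // per-connection channel: uncontended
			job()
		}
	}
}

// shutdown wakes every connection at once by closing the shared channel.
func shutdown() { close(doneChan) }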
But the short summary is:
package selbench

import "testing"

func BenchmarkSelectShared(b *testing.B) {
	idleShared := make(chan struct{})
	b.RunParallel(func(pb *testing.PB) {
		ch := make(chan int, 1)
		for pb.Next() {
			select {
			case ch <- 1:
			case <-ch:
			case <-idleShared:
			}
		}
	})
}

func BenchmarkSelectPrivate(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		ch := make(chan int, 1)
		idlePrivate := make(chan struct{})
		for pb.Next() {
			select {
			case ch <- 1:
			case <-ch:
			case <-idlePrivate:
			}
		}
	})
}
Note that the idle channels above are never closed and are never selectable, but the other two cases are, and stay busy.
But when the channel is private to the goroutine (uncontended), things are fast. When it's a shared channel, things are slow.
$ go test -v -bench=Select -cpu=1,2,4,8,16,32,64
BenchmarkSelectShared 10000000 194 ns/op
BenchmarkSelectShared-2 10000000 147 ns/op
BenchmarkSelectShared-4 5000000 395 ns/op
BenchmarkSelectShared-8 3000000 449 ns/op
BenchmarkSelectShared-16 5000000 354 ns/op
BenchmarkSelectShared-32 5000000 320 ns/op
BenchmarkSelectShared-64 5000000 296 ns/op
BenchmarkSelectPrivate 10000000 192 ns/op
BenchmarkSelectPrivate-2 20000000 98.0 ns/op
BenchmarkSelectPrivate-4 30000000 49.3 ns/op
BenchmarkSelectPrivate-8 50000000 25.5 ns/op
BenchmarkSelectPrivate-16 100000000 13.8 ns/op
BenchmarkSelectPrivate-32 200000000 7.07 ns/op
BenchmarkSelectPrivate-64 200000000 6.31 ns/op
Are there any optimizations to be done here?
We'll work around it in the meantime and generally keep this in mind as an anti-pattern for performance.
/cc @aclements @randall77 @ianlancetaylor @rsc @josharian @mdempsky
The root problem is lock contention in runtime.sellock and runtime.selunlock. One idea is to replace runtime.mutex with a scalable lock, such as an MCS lock.
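For reference, here is a minimal MCS-style queue lock sketched in Go (purely illustrative; the runtime's mutex is not implemented this way): each waiter spins only on its own node, so the single shared word is touched once per acquire, which is what lets it scale under contention.

package mcs

import (
	"runtime"
	"sync/atomic"
)

// node is owned by a single locker for the duration of one Lock/Unlock pair.
type node struct {
	next   atomic.Pointer[node]
	locked atomic.Bool
}

// Lock is an MCS queue lock: the tail pointer is the only shared word.
type Lock struct {
	tail atomic.Pointer[node]
}

func (l *Lock) Lock(n *node) {
	n.next.Store(nil)
	n.locked.Store(true)
	if pred := l.tail.Swap(n); pred != nil {
		pred.next.Store(n) // enqueue behind the previous tail
		for n.locked.Load() {
			runtime.Gosched() // spin on our own node only
		}
	}
}

func (l *Lock) Unlock(n *node) {
	next := n.next.Load()
	if next == nil {
		// No visible successor; try to reset the tail to empty.
		if l.tail.CompareAndSwap(n, nil) {
			return
		}
		// A successor is mid-enqueue; wait for it to link itself in.
		for next = n.next.Load(); next == nil; next = n.next.Load() {
			runtime.Gosched()
		}
	}
	next.locked.Store(false) // hand the lock to the successor
}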
An MCS lock can scale a spinlock, but that's not what this is. Though I suppose it could be; we never hold channel locks for long anyway.
One thing that might speed up this test case would be to scan all the channels without locking them to see if they are ready. Then lock only the ones that are ready, and check them again. But that will only help the real code if there is usually some ready channel. If the select usually has to wait, it won't help, because we need to queue the goroutine on each channel. Even if we use a spinlock that queuing step is going to be a point of contention.
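A toy model of that idea, using made-up toyChan/toySelect types rather than the runtime's real data structures (here "ready" stands in for "buffer non-empty or a counterpart goroutine is waiting"):

package selbench

import (
	"sync"
	"sync/atomic"
)

type toyChan struct {
	mu    sync.Mutex
	ready atomic.Bool // racily readable readiness hint
}

// tryComplete locks a single channel and re-checks it under the lock.
func tryComplete(c *toyChan) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.ready.Load() {
		c.ready.Store(false) // "consume" the ready state
		return true
	}
	return false
}

// toySelect returns the index of a channel that was ready, or -1 if the
// caller must fall back to the slow path: lock every channel and enqueue
// the goroutine on each one, which is the contended step this
// optimization cannot avoid.
func toySelect(chans []*toyChan) int {
	for i, c := range chans {
		if c.ready.Load() { // phase 1: optimistic, lock-free scan
			if tryComplete(c) { // phase 2: lock only the likely-ready chan
				return i
			}
		}
	}
	return -1
}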
I think the usage pattern is a bit more accurately described with this benchmark:
package selbench

import (
	"fmt"
	"math/rand"
	"testing"
	"time"
)

func BenchmarkSelectContended(b *testing.B) {
	for workers := 1; workers <= 100000; workers *= 10 {
		b.Run(fmt.Sprintf("workers=%d", workers), func(b *testing.B) {
			done := make(chan struct{})
			// Spin up workers, each with their own work channel.
			wc := make([]chan struct{}, workers)
			for i := range wc {
				c := make(chan struct{})
				go func() {
					for {
						select {
						case <-done:
							return
						case <-c:
						}
					}
				}()
				wc[i] = c
			}
			b.ResetTimer()
			// Have the workers do b.N units of work.
			b.RunParallel(func(pb *testing.PB) {
				src := rand.New(rand.NewSource(time.Now().UnixNano()))
				for pb.Next() {
					wc[src.Intn(len(wc))] <- struct{}{}
				}
			})
			b.StopTimer()
			close(done)
		})
	}
}
For this, I see a total lack of scaling at small numbers of workers and a performance cliff as the number of workers grows large.
$ go test -bench=. -cpu=1,2,4,8,16,32,64 -benchtime=100ms x_test.go
goos: darwin
goarch: amd64
BenchmarkSelectContended/workers=1 500000 350 ns/op
BenchmarkSelectContended/workers=1-2 300000 417 ns/op
BenchmarkSelectContended/workers=1-4 300000 487 ns/op
BenchmarkSelectContended/workers=1-8 300000 561 ns/op
BenchmarkSelectContended/workers=1-16 200000 623 ns/op
BenchmarkSelectContended/workers=1-32 200000 628 ns/op
BenchmarkSelectContended/workers=1-64 200000 594 ns/op
BenchmarkSelectContended/workers=10 500000 376 ns/op
BenchmarkSelectContended/workers=10-2 300000 513 ns/op
BenchmarkSelectContended/workers=10-4 300000 620 ns/op
BenchmarkSelectContended/workers=10-8 300000 575 ns/op
BenchmarkSelectContended/workers=10-16 300000 533 ns/op
BenchmarkSelectContended/workers=10-32 300000 508 ns/op
BenchmarkSelectContended/workers=10-64 300000 461 ns/op
BenchmarkSelectContended/workers=100 300000 385 ns/op
BenchmarkSelectContended/workers=100-2 300000 443 ns/op
BenchmarkSelectContended/workers=100-4 300000 439 ns/op
BenchmarkSelectContended/workers=100-8 300000 539 ns/op
BenchmarkSelectContended/workers=100-16 200000 594 ns/op
BenchmarkSelectContended/workers=100-32 200000 601 ns/op
BenchmarkSelectContended/workers=100-64 200000 696 ns/op
BenchmarkSelectContended/workers=1000 300000 452 ns/op
BenchmarkSelectContended/workers=1000-2 300000 426 ns/op
BenchmarkSelectContended/workers=1000-4 300000 412 ns/op
BenchmarkSelectContended/workers=1000-8 200000 563 ns/op
BenchmarkSelectContended/workers=1000-16 200000 551 ns/op
BenchmarkSelectContended/workers=1000-32 200000 606 ns/op
BenchmarkSelectContended/workers=1000-64 300000 601 ns/op
BenchmarkSelectContended/workers=10000 200000 894 ns/op
BenchmarkSelectContended/workers=10000-2 200000 691 ns/op
BenchmarkSelectContended/workers=10000-4 200000 597 ns/op
BenchmarkSelectContended/workers=10000-8 200000 788 ns/op
BenchmarkSelectContended/workers=10000-16 100000 1174 ns/op
BenchmarkSelectContended/workers=10000-32 200000 792 ns/op
BenchmarkSelectContended/workers=10000-64 100000 1070 ns/op
BenchmarkSelectContended/workers=100000 1000 157938 ns/op
BenchmarkSelectContended/workers=100000-2 200 1044491 ns/op
BenchmarkSelectContended/workers=100000-4 50 2425451 ns/op
BenchmarkSelectContended/workers=100000-8 300 848236 ns/op
BenchmarkSelectContended/workers=100000-16 500 201122 ns/op
BenchmarkSelectContended/workers=100000-32 20 9354451 ns/op
BenchmarkSelectContended/workers=100000-64 1 187918798 ns/op
PASS
ok command-line-arguments 18.974s
@josharian, that rand.Intn call in there has a mutex (var globalRand = New(&lockedSource{src: NewSource(1).(Source64)})). I'd remove that and use a goroutine-local math/rand.Rand instead.
@bradfitz sheesh. Thanks. Updated in situ above. Results are unchanged, though.
$ go test -bench=. -cpu=1,2,4,8,16,32,64 -benchtime=100ms x_test.go
BenchmarkSelectContended/workers=1 300000 358 ns/op
BenchmarkSelectContended/workers=1-2 300000 526 ns/op
BenchmarkSelectContended/workers=1-4 300000 552 ns/op
BenchmarkSelectContended/workers=1-8 300000 600 ns/op
BenchmarkSelectContended/workers=1-16 200000 713 ns/op
BenchmarkSelectContended/workers=1-32 200000 674 ns/op
BenchmarkSelectContended/workers=1-64 200000 701 ns/op
BenchmarkSelectContended/workers=10 500000 399 ns/op
BenchmarkSelectContended/workers=10-2 300000 551 ns/op
BenchmarkSelectContended/workers=10-4 300000 585 ns/op
BenchmarkSelectContended/workers=10-8 200000 660 ns/op
BenchmarkSelectContended/workers=10-16 300000 574 ns/op
BenchmarkSelectContended/workers=10-32 300000 551 ns/op
BenchmarkSelectContended/workers=10-64 300000 498 ns/op
BenchmarkSelectContended/workers=100 500000 439 ns/op
BenchmarkSelectContended/workers=100-2 300000 481 ns/op
BenchmarkSelectContended/workers=100-4 300000 494 ns/op
BenchmarkSelectContended/workers=100-8 300000 569 ns/op
BenchmarkSelectContended/workers=100-16 200000 733 ns/op
BenchmarkSelectContended/workers=100-32 200000 650 ns/op
BenchmarkSelectContended/workers=100-64 200000 645 ns/op
BenchmarkSelectContended/workers=1000 200000 583 ns/op
BenchmarkSelectContended/workers=1000-2 300000 467 ns/op
BenchmarkSelectContended/workers=1000-4 300000 492 ns/op
BenchmarkSelectContended/workers=1000-8 200000 633 ns/op
BenchmarkSelectContended/workers=1000-16 200000 651 ns/op
BenchmarkSelectContended/workers=1000-32 200000 645 ns/op
BenchmarkSelectContended/workers=1000-64 200000 708 ns/op
BenchmarkSelectContended/workers=10000 100000 1056 ns/op
BenchmarkSelectContended/workers=10000-2 200000 756 ns/op
BenchmarkSelectContended/workers=10000-4 200000 691 ns/op
BenchmarkSelectContended/workers=10000-8 200000 795 ns/op
BenchmarkSelectContended/workers=10000-16 200000 810 ns/op
BenchmarkSelectContended/workers=10000-32 200000 804 ns/op
BenchmarkSelectContended/workers=10000-64 200000 889 ns/op
BenchmarkSelectContended/workers=100000 1000 134482 ns/op
BenchmarkSelectContended/workers=100000-2 500 208302 ns/op
BenchmarkSelectContended/workers=100000-4 100 1339142 ns/op
BenchmarkSelectContended/workers=100000-8 1 170459942 ns/op
BenchmarkSelectContended/workers=100000-16 100 1204770 ns/op
BenchmarkSelectContended/workers=100000-32 300 573343 ns/op
BenchmarkSelectContended/workers=100000-64 1 188693577 ns/op
PASS
ok command-line-arguments 19.280s
Cc @dvyukov
Running a benchmark with fast iterations but heavy setup for only 1 iteration does not make sense. The numbers at the bottom are setup cost, not the cost of the chan operations. Run them for more iterations.
@dvyukov, see also the original benchmark at top. But ignoring specific benchmarks (flawed or not), do you have any thoughts?
@josharian @dvyukov the addition of a waitgroup fixes the 100k worker case:
package selbench

import (
	"fmt"
	"math/rand"
	"sync"
	"testing"
	"time"
)

func BenchmarkSelectContended(b *testing.B) {
	for workers := 1; workers <= 100000; workers *= 10 {
		b.Run(fmt.Sprintf("workers=%d", workers), func(b *testing.B) {
			done := make(chan struct{})
			// Spin up workers, each with their own work channel.
			wc := make([]chan struct{}, workers)
			var wg sync.WaitGroup
			wg.Add(workers)
			for i := range wc {
				c := make(chan struct{})
				go func() {
					wg.Done()
					for {
						select {
						case <-done:
							return
						case <-c:
						}
					}
				}()
				wc[i] = c
			}
			wg.Wait()
			b.ResetTimer()
			// Have the workers do b.N units of work.
			b.RunParallel(func(pb *testing.PB) {
				src := rand.New(rand.NewSource(time.Now().UnixNano()))
				for pb.Next() {
					wc[src.Intn(len(wc))] <- struct{}{}
				}
			})
			b.StopTimer()
			close(done)
		})
	}
}
This gives me the following on a 4 core i7 (w/ hyper threading):
name time/op
SelectContended/workers=1 408ns ± 1%
SelectContended/workers=1-2 483ns ± 0%
SelectContended/workers=1-4 528ns ± 2%
SelectContended/workers=1-8 595ns ± 4%
SelectContended/workers=10 454ns ± 0%
SelectContended/workers=10-2 536ns ± 1%
SelectContended/workers=10-4 634ns ± 3%
SelectContended/workers=10-8 880ns ±10%
SelectContended/workers=100 490ns ± 2%
SelectContended/workers=100-2 522ns ± 2%
SelectContended/workers=100-4 596ns ± 3%
SelectContended/workers=100-8 855ns ± 6%
SelectContended/workers=1000 582ns ± 2%
SelectContended/workers=1000-2 510ns ± 5%
SelectContended/workers=1000-4 563ns ± 3%
SelectContended/workers=1000-8 838ns ± 5%
SelectContended/workers=10000 913ns ± 5%
SelectContended/workers=10000-2 542ns ± 2%
SelectContended/workers=10000-4 533ns ± 4%
SelectContended/workers=10000-8 911ns ± 4%
SelectContended/workers=100000 1.70µs ± 7%
SelectContended/workers=100000-2 915ns ± 7%
SelectContended/workers=100000-4 685ns ±12%
SelectContended/workers=100000-8 1.15µs ± 3%
Lock-free channels help to some degree (#8899):
tip:
BenchmarkSelectContended10000-64 2000000 1985 ns/op
BenchmarkSelectContended10000-64 2000000 1929 ns/op
BenchmarkSelectContended10000-64 2000000 2031 ns/op
3c3be622011747f6db4b4cf81ed3a975dfca2b51 (base for lock-free chans):
BenchmarkSelectContended10000-64 50000 110105 ns/op
BenchmarkSelectContended10000-64 50000 110315 ns/op
BenchmarkSelectContended10000-64 50000 113208 ns/op
3c3be622011747f6db4b4cf81ed3a975dfca2b51 with lock-free chans:
BenchmarkSelectContended10000-64 5000000 1141 ns/op
BenchmarkSelectContended10000-64 5000000 1150 ns/op
BenchmarkSelectContended10000-64 5000000 1208 ns/op
@bradfitz What is that shared chan in http? srv.closeDoneChanLocked? I don't see where it is checked during connection serving.
The select is slow, but it does not seem to be too slow. Also, the numbers do not seem to depend much on the number of goroutines (I guess you just mostly need enough of them to create contention from multiple CPUs).
Shouldn't Server.Shutdown call closeDoneChanLocked before closeListenersLocked? Otherwise the error handling logic in Server.Serve is broken. I am not sure whether there will be a temporary error on listener close or not, but it's bad both ways.
The shared chan is gracefulShutdownCh in x/net/http2.
How permanent are these connections? FWIW, we could create a per-conn done chan, add it to a global list of done chans once, and then select on it. Shutdown would then need to go over all registered chans and close each of them; a rough sketch follows below.
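A sketch of that registration scheme, with hypothetical names (the actual fix in x/net/http2 differs): each connection registers its own done channel once, the hot select only ever touches that uncontended channel, and Shutdown pays the O(number of connections) cost.

import "sync"

type shutdownNotifier struct {
	mu    sync.Mutex
	conns map[chan struct{}]struct{}
}

// register creates and tracks a per-connection done channel.
func (n *shutdownNotifier) register() chan struct{} {
	done := make(chan struct{})
	n.mu.Lock()
	if n.conns == nil {
		n.conns = make(map[chan struct{}]struct{})
	}
	n.conns[done] = struct{}{}
	n.mu.Unlock()
	return done
}

// unregister drops a connection's channel when the connection exits.
func (n *shutdownNotifier) unregister(done chan struct{}) {
	n.mu.Lock()
	delete(n.conns, done)
	n.mu.Unlock()
}

// shutdown closes every registered per-conn channel; each connection's
// serve loop selects only on its own (uncontended) channel.
func (n *shutdownNotifier) shutdown() {
	n.mu.Lock()
	for done := range n.conns {
		close(done)
	}
	n.conns = nil
	n.mu.Unlock()
}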
We already did that :) See the CLs in #20302. This bug is about finding a more general solution.
What is that http2/server.go file? It does not seem to be in the std checkout.
Starting a goroutine for awaitGracefulShutdown looks somewhat wasteful. But up to you.
http2/server.go is in golang.org/x/net/http2. The goroutine was a quick fix for a possible go 1.8.2 release. See https://golang.org/cl/43455 and https://golang.org/cl/43230 and feel free to comment on that last CL if you have other ideas.
We're getting off topic. The general problem this bug is trying to address is demonstrated by the benchmark in Brad's original comment.
Edit: fixed CL link. By "that last CL", I meant https://golang.org/cl/43230.
> We're getting off topic. The general problem this bug is trying to address is demonstrated by the benchmark in Brad's original comment.
Yes, let's use the right bugs for the right discussion.
This bug is about the general problem in the Go runtime.
https://github.com/golang/go/issues/20302 is about working around the problem in http/http2.
https://github.com/kubernetes/kubernetes/issues/45216 is the original Kubernetes speed regression/rollback bug.
> @dvyukov, see also the original benchmark at top. But ignoring specific benchmarks (flawed or not), do you have any thoughts?
I think this case is inherently hard. Multiple threads are hammering the same complex mutex-protected object. When memory is write-shared a simple memory load can take up to 100ns. And the mutex makes it all worse.
The numbers need to be interpreted carefully. We divide the work across multiple threads. So this:
BenchmarkSelectContended/workers=1000 200000 583 ns/op
BenchmarkSelectContended/workers=1000-64 200000 708 ns/op
rescaled to per-core cost (ns/op multiplied by the number of cores, e.g. 708 ns/op × 64 = 45312 ns, assuming the machine actually has 64 cores) is actually:
BenchmarkSelectContended/workers=1000 200000 583 ns/op
BenchmarkSelectContended/workers=1000-64 200000 45312 ns/op
so in some cases we see up to 100x slowdown.
There is no easy way to fix this entirely. That would require some monstrous distributed construct that consumes considerably more memory, plus logic to detect when a chan actually falls into this case and dynamically switch the algorithm to a different scheme, and then switch it back if we see that we guessed wrong (e.g. sends appear on what we thought was a close-notification chan). I don't think it's worth it. At least we have lots of lower-hanging fruit.
Now, what I think we should do to improve things to some degree is:
- Take the finer-grained locking in the select logic from https://codereview.appspot.com/12544043 and apply it. It's orthogonal to the lock-free stuff. It will considerably reduce the duration of the critical section.

And a more realistic benchmark for this case would be to add a bunch of local non-ready chans to the select, because that's what http2 does, and those additional chans significantly prolong the critical section in select.
For the reference: fine-grained locking for select is https://github.com/golang/go/issues/8896.
I am experimenting with generalizing the single-channel select optimizations done by the compiler to multiple channels. A quick prototype implemented in the runtime looks promising, but before spending too much time on it (there's one regression that needs work) I wanted to ask for help in understanding if the approach is, in general, ok from the perspective of the language spec. I tried to ask on gophers slack, but we could not find a good argument as to why this shouldn't be allowed.
The gist of the idea is to try to avoid locking all channels, if we detect that any of the channels is ready.
Basically, are the following transformations allowed by the spec?
// non-blocking case
select {
case <-f1():
case <-f2():
default:
	// body default
}
to
c1, c2 := f1(), f2()
if randbool() {
	c1, c2 = c2, c1
}
select {
case <-c1:
default:
	select {
	case <-c2:
	default:
		// body default
	}
}
and
// blocking case
select {
case <-f1():
case <-f2():
}
to
c1, c2 := f1(), f2()
if randbool() {
	c1, c2 = c2, c1
}
select {
case <-c1:
default:
	select {
	case <-c2:
	default:
		select {
		case <-c1:
		case <-c2:
		}
	}
}
The one I'm least confident about is the non-blocking case, as I'm specifically not sure whether picking the default case in that way is correct. This could possibly be a safer, but slower, approach:
c1, c2 := f1(), f2()
if randbool() {
	c1, c2 = c2, c1
}
select {
case <-c1:
default:
	select {
	case <-c2:
	default:
		select {
		case <-c1:
		case <-c2:
		default:
			// body default
		}
	}
}
@CAFxX This issue is about a specific problem that I don't think is addressed by your proposal. So this issue is not the best place to discuss your idea. The best place to discuss it would be the golang-dev mailing list. Thanks.
Sure, will do. It is definitely addressing the problem of lock contention, though: it reduces the number of sellock/selunlock calls, and therefore the total number of lock2/unlock2 calls (except in the blocking case where no channel is ready), and it should help with both bradfitz's and josharian's benchmarks (although I haven't run them yet).
FWIW, you actually proposed almost the same idea in your first comment: https://github.com/golang/go/issues/20351#issuecomment-301215452
update: asked on https://groups.google.com/forum/#!topic/golang-dev/jX4oQEls3uk
Honestly I don't see my suggestion as the same as yours, since I was suggesting looking at all the channels without taking any locks, which is not what your code does. I may well be missing something.