Go: runtime: select on a shared channel is slow with many Ps

Created on 13 May 2017 · 24 Comments · Source: golang/go

@tombergan and I have been debugging a severe performance regression that Kubernetes observed when switching from Go 1.7 to Go 1.8. The problem ended up being the addition of net/http.Server.Shutdown that's currently implemented by closing a channel that all the open connections select on.

(Details in https://github.com/kubernetes/kubernetes/issues/45216 and https://github.com/golang/go/issues/20302)
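
To make the shape of the problem concrete, here is a rough, simplified sketch (illustrative names only, not the actual net/http or x/net/http2 code): every connection's serve loop selects on one shared channel that Shutdown will close.

package sketch

type request struct{}

func handle(request) {}

// shutdownCh is shared by every connection; Shutdown closes it.
var shutdownCh = make(chan struct{})

// serveConn is a stand-in for a per-connection serve loop. With many
// connections, they all select on the same shutdownCh, so that channel's
// lock becomes a point of contention.
func serveConn(reqs <-chan request) {
    for {
        select {
        case <-shutdownCh:
            return
        case r := <-reqs:
            handle(r)
        }
    }
}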

But the short summary is:

package selbench

import "testing"

func BenchmarkSelectShared(b *testing.B) {
        idleShared := make(chan struct{})
        b.RunParallel(func(pb *testing.PB) {
                ch := make(chan int, 1)
                for pb.Next() {
                        select {
                        case ch <- 1:
                        case <-ch:
                        case <-idleShared:
                        }
                }
        })      
}       

func BenchmarkSelectPrivate(b *testing.B) {
        b.RunParallel(func(pb *testing.PB) {
                ch := make(chan int, 1)
                idlePrivate := make(chan struct{})
                for pb.Next() {
                        select {
                        case ch <- 1:
                        case <-ch:
                        case <-idlePrivate:
                        }
                }
        })
} 

Note that the idle channels above are never closed and are never selectable, but the other two cases are, and stay busy.

But when the channel is private to the goroutine (uncontended), things are fast. When it's a shared channel, things are slow.

$ go test -v -bench=Select -cpu=1,2,4,8,16,32,64
BenchmarkSelectShared           10000000               194 ns/op
BenchmarkSelectShared-2         10000000               147 ns/op
BenchmarkSelectShared-4          5000000               395 ns/op
BenchmarkSelectShared-8          3000000               449 ns/op
BenchmarkSelectShared-16         5000000               354 ns/op
BenchmarkSelectShared-32         5000000               320 ns/op
BenchmarkSelectShared-64         5000000               296 ns/op
BenchmarkSelectPrivate          10000000               192 ns/op
BenchmarkSelectPrivate-2        20000000                98.0 ns/op
BenchmarkSelectPrivate-4        30000000                49.3 ns/op
BenchmarkSelectPrivate-8        50000000                25.5 ns/op
BenchmarkSelectPrivate-16       100000000               13.8 ns/op
BenchmarkSelectPrivate-32       200000000                7.07 ns/op
BenchmarkSelectPrivate-64       200000000                6.31 ns/op

Are there any optimizations to be done here?

We'll work around it in the meantime and generally keep this in mind as an anti-pattern for performance.

/cc @aclements @randall77 @ianlancetaylor @rsc @josharian @mdempsky

Performance

All 24 comments

The root problem is lock contention in runtime.sellock and runtime.selunlock. One idea is to replace runtime.mutex with a scalable lock, such as an MCS lock.
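
For reference, here is a minimal, hedged sketch of an MCS queue lock written in plain Go with sync/atomic (the real runtime mutex is futex/semaphore based and looks nothing like this). Each waiter spins only on its own node, which is what lets the lock scale under contention:

package sketch

import (
    "runtime"
    "sync/atomic"
)

type mcsNode struct {
    next   atomic.Pointer[mcsNode]
    locked atomic.Bool
}

type mcsLock struct {
    tail atomic.Pointer[mcsNode]
}

func (l *mcsLock) lock(n *mcsNode) {
    n.next.Store(nil)
    n.locked.Store(true)
    pred := l.tail.Swap(n)
    if pred == nil {
        return // lock was free
    }
    pred.next.Store(n)
    for n.locked.Load() { // spin on our own node only
        runtime.Gosched()
    }
}

func (l *mcsLock) unlock(n *mcsNode) {
    next := n.next.Load()
    if next == nil {
        // No visible successor; try to swing tail back to nil.
        if l.tail.CompareAndSwap(n, nil) {
            return
        }
        // A successor is mid-enqueue; wait for it to link itself.
        for next = n.next.Load(); next == nil; next = n.next.Load() {
        }
    }
    next.locked.Store(false) // hand the lock to the successor
}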

An MCS lock can scale a spinlock, but that's not what this is. Though I suppose maybe it could be, since we never hold channel locks for long.

One thing that might speed up this test case would be to scan all the channels without locking them to see if they are ready. Then lock only the ones that are ready, and check them again. But that will only help the real code if there is usually some ready channel. If the select usually has to wait, it won't help, because we need to queue the goroutine on each channel. Even if we use a spinlock that queuing step is going to be a point of contention.
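
A hedged sketch of that two-phase idea, using a toy channel type rather than the runtime's hchan (all names here are illustrative): scan every channel without locking, then lock and re-check only a channel that looked ready.

package sketch

import (
    "sync"
    "sync/atomic"
)

type toyChan struct {
    mu     sync.Mutex
    qcount atomic.Int64 // buffered element count; read racily in the scan
}

// pickReady returns a channel that was still ready when re-checked under
// its own lock, or nil if none was (the caller would then fall back to the
// slow path that locks and enqueues on every channel).
func pickReady(chans []*toyChan) *toyChan {
    for _, c := range chans {
        if c.qcount.Load() == 0 {
            continue // looked empty during the unlocked scan
        }
        c.mu.Lock()
        if c.qcount.Load() > 0 {
            c.qcount.Add(-1) // consume one element
            c.mu.Unlock()
            return c
        }
        c.mu.Unlock() // raced: drained before we got the lock
    }
    return nil
}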

I think the usage pattern is a bit more accurately described with this benchmark:

package selbench

import (
    "fmt"
    "math/rand"
    "testing"
    "time"
)

func BenchmarkSelectContended(b *testing.B) {
    for workers := 1; workers <= 100000; workers *= 10 {
        b.Run(fmt.Sprintf("workers=%d", workers), func(b *testing.B) {
            done := make(chan struct{})
            // Spin up workers, each with their own work channel.
            wc := make([]chan struct{}, workers)
            for i := range wc {
                c := make(chan struct{})
                go func() {
                    for {
                        select {
                        case <-done:
                            return
                        case <-c:
                        }
                    }
                }()
                wc[i] = c
            }
            b.ResetTimer()
            // Have the workers do b.N units of work.
            b.RunParallel(func(pb *testing.PB) {
                src := rand.New(rand.NewSource(time.Now().UnixNano()))
                for pb.Next() {
                    wc[src.Intn(len(wc))] <- struct{}{}
                }
            })
            b.StopTimer()
            close(done)
        })
    }
}

For this, I see a total lack of scaling at small numbers of workers and a performance cliff as the number of workers grows large.

$ go test -bench=. -cpu=1,2,4,8,16,32,64 -benchtime=100ms x_test.go 
goos: darwin
goarch: amd64
BenchmarkSelectContended/workers=1                500000           350 ns/op
BenchmarkSelectContended/workers=1-2              300000           417 ns/op
BenchmarkSelectContended/workers=1-4              300000           487 ns/op
BenchmarkSelectContended/workers=1-8              300000           561 ns/op
BenchmarkSelectContended/workers=1-16             200000           623 ns/op
BenchmarkSelectContended/workers=1-32             200000           628 ns/op
BenchmarkSelectContended/workers=1-64             200000           594 ns/op
BenchmarkSelectContended/workers=10               500000           376 ns/op
BenchmarkSelectContended/workers=10-2             300000           513 ns/op
BenchmarkSelectContended/workers=10-4             300000           620 ns/op
BenchmarkSelectContended/workers=10-8             300000           575 ns/op
BenchmarkSelectContended/workers=10-16            300000           533 ns/op
BenchmarkSelectContended/workers=10-32            300000           508 ns/op
BenchmarkSelectContended/workers=10-64            300000           461 ns/op
BenchmarkSelectContended/workers=100              300000           385 ns/op
BenchmarkSelectContended/workers=100-2            300000           443 ns/op
BenchmarkSelectContended/workers=100-4            300000           439 ns/op
BenchmarkSelectContended/workers=100-8            300000           539 ns/op
BenchmarkSelectContended/workers=100-16           200000           594 ns/op
BenchmarkSelectContended/workers=100-32           200000           601 ns/op
BenchmarkSelectContended/workers=100-64           200000           696 ns/op
BenchmarkSelectContended/workers=1000             300000           452 ns/op
BenchmarkSelectContended/workers=1000-2           300000           426 ns/op
BenchmarkSelectContended/workers=1000-4           300000           412 ns/op
BenchmarkSelectContended/workers=1000-8           200000           563 ns/op
BenchmarkSelectContended/workers=1000-16          200000           551 ns/op
BenchmarkSelectContended/workers=1000-32          200000           606 ns/op
BenchmarkSelectContended/workers=1000-64          300000           601 ns/op
BenchmarkSelectContended/workers=10000            200000           894 ns/op
BenchmarkSelectContended/workers=10000-2          200000           691 ns/op
BenchmarkSelectContended/workers=10000-4          200000           597 ns/op
BenchmarkSelectContended/workers=10000-8          200000           788 ns/op
BenchmarkSelectContended/workers=10000-16         100000          1174 ns/op
BenchmarkSelectContended/workers=10000-32         200000           792 ns/op
BenchmarkSelectContended/workers=10000-64         100000          1070 ns/op
BenchmarkSelectContended/workers=100000             1000        157938 ns/op
BenchmarkSelectContended/workers=100000-2            200       1044491 ns/op
BenchmarkSelectContended/workers=100000-4             50       2425451 ns/op
BenchmarkSelectContended/workers=100000-8            300        848236 ns/op
BenchmarkSelectContended/workers=100000-16           500        201122 ns/op
BenchmarkSelectContended/workers=100000-32            20       9354451 ns/op
BenchmarkSelectContended/workers=100000-64             1     187918798 ns/op
PASS
ok      command-line-arguments  18.974s

@josharian, that rand.Intn call in there has a mutex (var globalRand = New(&lockedSource{src: NewSource(1).(Source64)})). I'd remove that and use a goroutine-local math/rand.Rand instead.

@bradfitz sheesh. Thanks. Updated in situ above. Results are unchanged, though.

$ go test -bench=. -cpu=1,2,4,8,16,32,64 -benchtime=100ms x_test.go 
BenchmarkSelectContended/workers=1                300000           358 ns/op
BenchmarkSelectContended/workers=1-2              300000           526 ns/op
BenchmarkSelectContended/workers=1-4              300000           552 ns/op
BenchmarkSelectContended/workers=1-8              300000           600 ns/op
BenchmarkSelectContended/workers=1-16             200000           713 ns/op
BenchmarkSelectContended/workers=1-32             200000           674 ns/op
BenchmarkSelectContended/workers=1-64             200000           701 ns/op
BenchmarkSelectContended/workers=10               500000           399 ns/op
BenchmarkSelectContended/workers=10-2             300000           551 ns/op
BenchmarkSelectContended/workers=10-4             300000           585 ns/op
BenchmarkSelectContended/workers=10-8             200000           660 ns/op
BenchmarkSelectContended/workers=10-16            300000           574 ns/op
BenchmarkSelectContended/workers=10-32            300000           551 ns/op
BenchmarkSelectContended/workers=10-64            300000           498 ns/op
BenchmarkSelectContended/workers=100              500000           439 ns/op
BenchmarkSelectContended/workers=100-2            300000           481 ns/op
BenchmarkSelectContended/workers=100-4            300000           494 ns/op
BenchmarkSelectContended/workers=100-8            300000           569 ns/op
BenchmarkSelectContended/workers=100-16           200000           733 ns/op
BenchmarkSelectContended/workers=100-32           200000           650 ns/op
BenchmarkSelectContended/workers=100-64           200000           645 ns/op
BenchmarkSelectContended/workers=1000             200000           583 ns/op
BenchmarkSelectContended/workers=1000-2           300000           467 ns/op
BenchmarkSelectContended/workers=1000-4           300000           492 ns/op
BenchmarkSelectContended/workers=1000-8           200000           633 ns/op
BenchmarkSelectContended/workers=1000-16          200000           651 ns/op
BenchmarkSelectContended/workers=1000-32          200000           645 ns/op
BenchmarkSelectContended/workers=1000-64          200000           708 ns/op
BenchmarkSelectContended/workers=10000            100000          1056 ns/op
BenchmarkSelectContended/workers=10000-2          200000           756 ns/op
BenchmarkSelectContended/workers=10000-4          200000           691 ns/op
BenchmarkSelectContended/workers=10000-8          200000           795 ns/op
BenchmarkSelectContended/workers=10000-16         200000           810 ns/op
BenchmarkSelectContended/workers=10000-32         200000           804 ns/op
BenchmarkSelectContended/workers=10000-64         200000           889 ns/op
BenchmarkSelectContended/workers=100000             1000        134482 ns/op
BenchmarkSelectContended/workers=100000-2            500        208302 ns/op
BenchmarkSelectContended/workers=100000-4            100       1339142 ns/op
BenchmarkSelectContended/workers=100000-8              1     170459942 ns/op
BenchmarkSelectContended/workers=100000-16           100       1204770 ns/op
BenchmarkSelectContended/workers=100000-32           300        573343 ns/op
BenchmarkSelectContended/workers=100000-64             1     188693577 ns/op
PASS
ok      command-line-arguments  19.280s

Cc @dvyukov

Running a benchmark with fast iterations and heavy setup for only 1 iteration does not make sense. The numbers at the bottom are setup cost, not the cost of the chan operation. Run them for more iterations.

@dvyukov, see also the original benchmark at top. But ignoring specific benchmarks (flawed or not), do you have any thoughts?

@josharian @dvyukov the addition of a WaitGroup (so the timer does not start until all workers are running, keeping the setup cost out of the measurement) fixes the 100k-worker case:

package selbench

import (
    "fmt"
    "math/rand"
    "sync"
    "testing"
    "time"
)

func BenchmarkSelectContended(b *testing.B) {
    for workers := 1; workers <= 100000; workers *= 10 {
        b.Run(fmt.Sprintf("workers=%d", workers), func(b *testing.B) {
            done := make(chan struct{})
            // Spin up workers, each with their own work channel.
            wc := make([]chan struct{}, workers)
            var wg sync.WaitGroup
            wg.Add(workers)
            for i := range wc {
                c := make(chan struct{})
                go func() {
                    wg.Done()
                    for {
                        select {
                        case <-done:
                            return
                        case <-c:
                        }
                    }
                }()
                wc[i] = c
            }
            wg.Wait()
            b.ResetTimer()
            // Have the workers do b.N units of work.
            b.RunParallel(func(pb *testing.PB) {
                src := rand.New(rand.NewSource(time.Now().UnixNano()))
                for pb.Next() {
                    wc[src.Intn(len(wc))] <- struct{}{}
                }
            })
            b.StopTimer()
            close(done)
        })
    }
}

This gives me the following on a 4-core i7 (with hyper-threading):

name                              time/op
SelectContended/workers=1          408ns ± 1%
SelectContended/workers=1-2        483ns ± 0%
SelectContended/workers=1-4        528ns ± 2%
SelectContended/workers=1-8        595ns ± 4%
SelectContended/workers=10         454ns ± 0%
SelectContended/workers=10-2       536ns ± 1%
SelectContended/workers=10-4       634ns ± 3%
SelectContended/workers=10-8       880ns ±10%
SelectContended/workers=100        490ns ± 2%
SelectContended/workers=100-2      522ns ± 2%
SelectContended/workers=100-4      596ns ± 3%
SelectContended/workers=100-8      855ns ± 6%
SelectContended/workers=1000       582ns ± 2%
SelectContended/workers=1000-2     510ns ± 5%
SelectContended/workers=1000-4     563ns ± 3%
SelectContended/workers=1000-8     838ns ± 5%
SelectContended/workers=10000      913ns ± 5%
SelectContended/workers=10000-2    542ns ± 2%
SelectContended/workers=10000-4    533ns ± 4%
SelectContended/workers=10000-8    911ns ± 4%
SelectContended/workers=100000    1.70µs ± 7%
SelectContended/workers=100000-2   915ns ± 7%
SelectContended/workers=100000-4   685ns ±12%
SelectContended/workers=100000-8  1.15µs ± 3%

Lock-free channels help to some degree (#8899):

tip:
BenchmarkSelectContended10000-64 2000000 1985 ns/op
BenchmarkSelectContended10000-64 2000000 1929 ns/op
BenchmarkSelectContended10000-64 2000000 2031 ns/op

3c3be622011747f6db4b4cf81ed3a975dfca2b51 (base for lock-free chans):
BenchmarkSelectContended10000-64 50000 110105 ns/op
BenchmarkSelectContended10000-64 50000 110315 ns/op
BenchmarkSelectContended10000-64 50000 113208 ns/op

3c3be622011747f6db4b4cf81ed3a975dfca2b51 with lock-free chans:
BenchmarkSelectContended10000-64 5000000 1141 ns/op
BenchmarkSelectContended10000-64 5000000 1150 ns/op
BenchmarkSelectContended10000-64 5000000 1208 ns/op

@bradfitz What is that shared chan in http? srv.closeDoneChanLocked? I don't see where it is checked during connection serving.
The select is slow, but it does not seem to be too slow. Also, the numbers do not seem to depend much on the number of goroutines (I guess you mostly just need enough of them to create contention from multiple CPUs).

Shouldn't Server.Shutdown do closeDoneChanLocked before closeListenersLocked? Otherwise the error handling logic in Server.Serve is broken. I am not sure whether there will be a temp error on listener close or not, but it's bad both ways.

The shared chan is gracefulShutdownCh in x/net/http2.

How permanent are these connections? FWIW we could create a per-conn done chan, add it to a global list of done chans once, and then select on it. Shutdown would then need to go over all registered chans and close all of them.
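
A minimal sketch of that registration scheme, with illustrative names (this is not the actual net/http or x/net/http2 code): each connection registers its own done channel once, and shutdown closes all of them.

package sketch

import "sync"

var (
    mu        sync.Mutex
    connDones = make(map[chan struct{}]struct{})
)

// registerConn gives a connection its own done channel, so its serve loop
// never selects on a channel shared with other connections.
func registerConn() chan struct{} {
    done := make(chan struct{})
    mu.Lock()
    connDones[done] = struct{}{}
    mu.Unlock()
    return done
}

func unregisterConn(done chan struct{}) {
    mu.Lock()
    delete(connDones, done)
    mu.Unlock()
}

// shutdown walks the registry and closes every per-connection channel.
func shutdown() {
    mu.Lock()
    for done := range connDones {
        close(done)
        delete(connDones, done)
    }
    mu.Unlock()
}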

We already did that :) See the CLs in #20302. This bug is about finding a more general solution.

What is that http2/server.go file? It does not seem to be in the std checkout.

Starting a goroutine for awaitGracefulShutdown looks somewhat wasteful. But up to you.

http2/server.go is in golang.org/x/net/http2. The goroutine was a quick fix for a possible go 1.8.2 release. See https://golang.org/cl/43455 and https://golang.org/cl/43230 and feel free to comment on that last CL if you have other ideas.

We're getting off topic. The general problem this bug is trying to address is demonstrated by the benchmark in Brad's original comment.

Edit: fixed CL link. By "that last CL", I meant https://golang.org/cl/43230.

We're getting off topic. The general problem this bug is trying to address is demonstrated by the benchmark in Brad's original comment.

Yes, let's use the right bugs for the right discussion.

This bug is about the general problem in the Go runtime.

https://github.com/golang/go/issues/20302 is about working around the problem in http/http2.

https://github.com/kubernetes/kubernetes/issues/45216 is the original Kubernetes speed regression/rollback bug.

@dvyukov, see also the original benchmark at top. But ignoring specific benchmarks (flawed or not), do you have any thoughts?

I think this case is inherently hard. Multiple threads are hammering the same complex mutex-protected object. When memory is write-shared a simple memory load can take up to 100ns. And the mutex makes it all worse.

The numbers need to be interpreted carefully: we divide work across multiple threads. So this:

BenchmarkSelectContended/workers=1000             200000           583 ns/op
BenchmarkSelectContended/workers=1000-64          200000           708 ns/op

rescaled to cost per op per core is actually (assuming the machine does have 64 cores):

BenchmarkSelectContended/workers=1000             200000           583 ns/op
BenchmarkSelectContended/workers=1000-64          200000         45312 ns/op

so in some cases we see up to a 100x slowdown.
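
(To spell out the arithmetic: 708 ns/op of wall time spread across 64 cores is 708 × 64 ≈ 45,312 core-ns per operation, versus 583 core-ns per operation on a single core, roughly a 78x increase in CPU cost per op.)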

There is no easy way to fix this entirely. That would require some monstrous distributed construct that consumes considerably more memory, plus some logic to detect when a chan actually falls into this case and dynamically switch the algorithm to a different scheme, and then switch it back if we see that we misguessed (e.g. sends appear on what we thought was a close-notification chan). I don't think it's worth it. At least we have lots of lower-hanging fruit.

Now, what I think we should do to improve things to some degree is:

  1. Take the finer-grained locking in select logic from https://codereview.appspot.com/12544043 and apply it. It's orthogonal to the lock-free stuff. It will considerably reduce the duration of the critical section.
  2. Optimize the runtime mutex more. Off the top of my head: apply the so-called "thundering herd" optimization; don't call futex wake when there is nobody to wake; tune the spinning logic; maybe don't wake waiters when there are spinning threads.
  3. We may try so-called combining locks (https://software.intel.com/en-us/blogs/2013/02/22/combineraggregator-synchronization-primitive). Sometimes they are useful for somewhat improving the behavior of highly contended data structures.
  4. Do some spot optimizations in select, especially inside the critical sections.

And a more realistic benchmark for this case would be to add a bunch of local non-ready chans to the select, because that's what http2 does, and these additional chans significantly prolong critical section in select.
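
A hedged sketch of that variant, written as a change to the worker goroutine in the earlier BenchmarkSelectContended code (it assumes the same done and c channels from that benchmark): each worker also selects on a few private channels that are never ready, which lengthens the locked portion of every select the way the extra cases in the http2 serve loop do.

go func() {
    // Private channels that are never sent on or closed; they only add
    // cases that select must lock and scan on every iteration.
    c1, c2, c3 := make(chan struct{}), make(chan struct{}), make(chan struct{})
    for {
        select {
        case <-done:
            return
        case <-c:
        case <-c1:
        case <-c2:
        case <-c3:
        }
    }
}()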

  1. Take the finer-grained locking in select logic from https://codereview.appspot.com/12544043 and apply it. It's orthogonal to the lock-free stuff. It will considerably reduce the duration of the critical section.

For reference: fine-grained locking for select is https://github.com/golang/go/issues/8896.

I am experimenting with generalizing the single-channel select optimizations done by the compiler to multiple channels. A quick prototype implemented in the runtime looks promising, but before spending too much time on it (there's one regression that needs work) I wanted to ask for help in understanding if the approach is, in general, ok from the perspective of the language spec. I tried to ask on gophers slack, but we could not find a good argument as to why this shouldn't be allowed.

The gist of the idea is to try to avoid locking all channels, if we detect that any of the channels is ready.

Basically, are the following transformations allowed by the spec?

// non-blocking case
select {
case <-f1():
case <-f2():
default:
  // body default
}

to

c1, c2 := f1(), f2()
if randbool() {
    c1, c2 = c2, c1
}
select {
case <-c1:
default:
  select {
  case <-c2:
  default:
    // body default
  }
}

and

// blocking case
select {
case <-f1():
case <-f2():
}

to

c1, c2 := f1(), f2()
if randbool() {
    c1, c2 = c2, c1
}
select {
case <-c1:
default:
  select {
  case <-c2:
  default:
    select {
    case <-c1:
    case <-c2:
    }
  }
}

The one I'm least confident about is the non-blocking case, as I'm specifically not sure whether picking the default case in that way is correct. This could possibly be a safer, but slower, approach:

c1, c2 := f1(), f2()
if randbool() {
    c1, c2 = c2, c1
}
select {
case <-c1:
default:
  select {
  case <-c2:
  default:
    select {
    case <-c1:
    case <-c2:
    default:
      // body default
    }
  }
}

@CAFxX This issue is about a specific problem that I don't think is addressed by your proposal. So this issue is not the best place to discuss your idea. The best place to discuss it would be the golang-dev mailing list. Thanks.

Sure, will do, although it definitely addresses the problem of lock contention (it reduces the number of sellock/selunlock calls, and therefore the total number of lock2/unlock2 calls, except in the blocking case where no channel is ready), and it should help with both bradfitz's and josharian's benchmarks (although I haven't run them yet).

FWIW, you actually proposed almost the same idea in your first comment: https://github.com/golang/go/issues/20351#issuecomment-301215452


update: asked on https://groups.google.com/forum/#!topic/golang-dev/jX4oQEls3uk

Honestly I don't see my suggestion as the same as yours, since I was suggesting looking at all the channels without taking any locks, which is not what your code does. I may well be missing something.
