Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go version devel +dfad3f8 Tue Apr 5 16:10:33 2016 +0300 linux/amd64
What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GORACE=""
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build040466407=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
What did you do?
I run an http server serving more than a million concurrent TCP keep-alive connections on a system with 32 CPU cores. Each connection has read and write deadlines set via net.Conn.SetReadDeadline and net.Conn.SetWriteDeadline.

What did you expect to see?
Server performance should scale with GOMAXPROCS up to the number of CPU cores.

What did you see instead?
The addtimer and deltimer functions from runtime/time.go are at the top of the CPU profile. iowait reaches 20% due to memory ping-pong across CPU cores inside the siftupTimer and siftdownTimer functions in runtime/time.go.

@dvyukov, could you look into this?
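For illustration only, here is a minimal sketch of the per-connection deadline pattern described above (hypothetical handler code, not the actual server; the port and timeout values are illustrative). The read deadline is re-armed before every read and the write deadline before every write, so each request/response pair adds and deletes runtime timers under the global timers lock:

```go
package main

import (
	"bufio"
	"net"
	"time"
)

// serveConn is a hypothetical sketch (not the real server code): every
// iteration re-arms both deadlines, so each request goes through the
// runtime addtimer/deltimer path twice.
func serveConn(c net.Conn) error {
	defer c.Close()
	br := bufio.NewReader(c)
	buf := make([]byte, 4096)
	for {
		// Re-arm the read deadline before waiting for the next request.
		if err := c.SetReadDeadline(time.Now().Add(90 * time.Second)); err != nil {
			return err
		}
		if _, err := br.Read(buf); err != nil {
			return err
		}
		// Request parsing omitted in this sketch.

		// Re-arm the write deadline before sending the response.
		if err := c.SetWriteDeadline(time.Now().Add(5 * time.Second)); err != nil {
			return err
		}
		resp := "HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: keep-alive\r\n\r\nOK"
		if _, err := c.Write([]byte(resp)); err != nil {
			return err
		}
	}
}

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		panic(err)
	}
	for {
		c, err := ln.Accept()
		if err != nil {
			panic(err)
		}
		go serveConn(c)
	}
}
```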
> Server performance should scale with GOMAXPROCS up to the number of CPU cores.

What did you see instead? How does performance depend on the number of cores?
> iowait reaches 20% due to memory ping-pong across CPU cores inside siftupTimer and siftdownTimer functions in runtime/time.go.

I don't understand the relation between iowait and memory ping-pong. iowait means waiting for IO, like a hard drive; memory accesses are not IO.
How exactly does the profile look?
> Each connection has read and write deadlines set via net.Conn.SetReadDeadline and net.Conn.SetWriteDeadline.

Would it be possible to change it to SetDeadline? It would set up 1 timer instead of 2.
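A hedged sketch of what that suggestion would look like in the per-connection loop sketched earlier, assuming a single combined timeout were acceptable:

```go
// Hypothetical variant of the per-connection loop: one combined deadline
// covers both the upcoming read and the following write, so each iteration
// arms a single runtime timer instead of two.
if err := c.SetDeadline(time.Now().Add(90 * time.Second)); err != nil {
	return err
}
// ...read the request and write the response as before, with no separate
// SetReadDeadline/SetWriteDeadline calls...
```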
We probably can merge this into #6239
> What did you see instead? How does performance depend on the number of cores?

iowait and system CPU shares grow as the number of CPU cores increases.
> I don't understand the relation between iowait and memory ping-pong. iowait means waiting for IO, like a hard drive; memory accesses are not IO.

I'm not an expert in iowait, but the fact is that iowait completely vanishes from 20% to 0% when deadlines on net.TCPConn connections are disabled. See the following image:
[CPU usage graph omitted]
Connection deadlines are enabled during 15:43-15:56.
Connection deadlines are disabled starting from 15:56.
> How exactly does the profile look?

Here is the CPU profile for the process with deadlines enabled:
(pprof) top20
9.85mins of 15.65mins total (62.97%)
Dropped 623 nodes (cum <= 0.08mins)
Showing top 20 nodes out of 187 (cum >= 4.98mins)
flat flat% sum% cum cum%
3.27mins 20.92% 20.92% 3.27mins 20.92% runtime.futex
1.98mins 12.66% 33.57% 2.20mins 14.04% syscall.Syscall
0.66mins 4.25% 37.82% 0.66mins 4.25% runtime.memmove
0.54mins 3.48% 41.30% 0.54mins 3.48% runtime.procyield
0.52mins 3.31% 44.61% 3.18mins 20.29% runtime.lock
0.41mins 2.61% 47.22% 0.41mins 2.61% runtime/internal/atomic.Xchg
0.34mins 2.18% 49.40% 0.34mins 2.18% runtime.epollwait
0.25mins 1.61% 51.02% 0.25mins 1.61% runtime.epollctl
0.25mins 1.60% 52.61% 0.25mins 1.60% runtime.siftdownTimer
0.24mins 1.55% 54.16% 0.24mins 1.55% runtime.siftupTimer
0.24mins 1.54% 55.70% 0.24mins 1.54% runtime.osyield
0.19mins 1.19% 56.89% 0.19mins 1.19% runtime/internal/atomic.Cas
0.15mins 0.96% 57.85% 3.03mins 19.36% runtime.deltimer
0.14mins 0.91% 58.76% 11.56mins 73.90% github.com/valyala/fasthttp.(*Server).serveConn
0.12mins 0.76% 59.52% 2mins 12.78% runtime.unlock
0.12mins 0.74% 60.26% 0.16mins 1% runtime.deferreturn
0.11mins 0.71% 60.97% 0.12mins 0.77% runtime.gopark
0.11mins 0.71% 61.68% 0.42mins 2.68% runtime.gentraceback
0.10mins 0.66% 62.35% 0.23mins 1.47% lib/metric.(*sample).Update
0.10mins 0.62% 62.97% 4.98mins 31.86% net.runtime_pollSetDeadline
(pprof) top20 -cum
387.31s of 938.84s total (41.25%)
Dropped 623 nodes (cum <= 4.69s)
Showing top 20 nodes out of 187 (cum >= 112.06s)
flat flat% sum% cum cum%
0 0% 0% 864.20s 92.05% runtime.goexit
0.01s 0.0011% 0.0011% 733.84s 78.16% github.com/valyala/fasthttp.(*workerPool).getCh.func1
0.39s 0.042% 0.043% 733.75s 78.15% github.com/valyala/fasthttp.(*workerPool).workerFunc
0.10s 0.011% 0.053% 693.90s 73.91% github.com/valyala/fasthttp.(*Server).(github.com/valyala/fasthttp.serveConn)-fm
8.53s 0.91% 0.96% 693.80s 73.90% github.com/valyala/fasthttp.(*Server).serveConn
1.28s 0.14% 1.10% 306.32s 32.63% net.setDeadlineImpl
5.86s 0.62% 1.72% 299.09s 31.86% net.runtime_pollSetDeadline
196.36s 20.92% 22.64% 196.36s 20.92% runtime.futex
31.04s 3.31% 25.94% 190.52s 20.29% runtime.lock
9s 0.96% 26.90% 181.79s 19.36% runtime.deltimer
0.80s 0.085% 26.99% 159.09s 16.95% lib/httpserver.(*customConn).SetReadDeadline
0.72s 0.077% 27.06% 158.10s 16.84% net.(*conn).SetReadDeadline
0.36s 0.038% 27.10% 157.38s 16.76% net.(*netFD).setReadDeadline
2.63s 0.28% 27.38% 154.28s 16.43% lib/httpserver.(*customConn).SetWriteDeadline
0.84s 0.089% 27.47% 150.37s 16.02% net.(*conn).SetWriteDeadline
0.23s 0.024% 27.50% 149.53s 15.93% net.(*netFD).setWriteDeadline
0.80s 0.085% 27.58% 139.24s 14.83% runtime.addtimer
118.83s 12.66% 40.24% 131.79s 14.04% syscall.Syscall
7.14s 0.76% 41.00% 120.01s 12.78% runtime.unlock
2.39s 0.25% 41.25% 112.06s 11.94% github.com/valyala/fasthttp.acquireByteReader
(pprof) list addtimer$
Total: 15.65mins
ROUTINE ======================== runtime.addtimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
800ms 2.32mins (flat, cum) 14.83% of Total
. . 80:// Ready the goroutine arg.
. . 81:func goroutineReady(arg interface{}, seq uintptr) {
. . 82: goready(arg.(*g), 0)
. . 83:}
. . 84:
360ms 360ms 85:func addtimer(t *timer) {
50ms 1.35mins 86: lock(&timers.lock)
320ms 7.42s 87: addtimerLocked(t)
30ms 50.47s 88: unlock(&timers.lock)
40ms 40ms 89:}
. . 90:
. . 91:// Add a timer to the heap and start or kick the timer proc.
. . 92:// If the new timer is earlier than any of the others.
. . 93:// Timers are locked.
. . 94:func addtimerLocked(t *timer) {
(pprof) list deltimer$
Total: 15.65mins
ROUTINE ======================== runtime.deltimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
9s 3.03mins (flat, cum) 19.36% of Total
. . 117: }
. . 118:}
. . 119:
. . 120:// Delete timer t from the heap.
. . 121:// Do not need to update the timerproc: if it wakes up early, no big deal.
200ms 200ms 122:func deltimer(t *timer) bool {
. . 123: // Dereference t so that any panic happens before the lock is held.
. . 124: // Discard result, because t might be moving in the heap.
50ms 50ms 125: _ = t.i
. . 126:
180ms 1.61mins 127: lock(&timers.lock)
. . 128: // t may not be registered anymore and may have
. . 129: // a bogus i (typically 0, if generated by Go).
. . 130: // Verify it before proceeding.
570ms 570ms 131: i := t.i
1.64s 1.64s 132: last := len(timers.t) - 1
2.64s 2.64s 133: if i < 0 || i > last || timers.t[i] != t {
. . 134: unlock(&timers.lock)
. . 135: return false
. . 136: }
. . 137: if i != last {
1.47s 1.47s 138: timers.t[i] = timers.t[last]
330ms 330ms 139: timers.t[i].i = i
. . 140: }
1.71s 1.71s 141: timers.t[last] = nil
30ms 30ms 142: timers.t = timers.t[:last]
. . 143: if i != last {
30ms 11.32s 144: siftupTimer(i)
30ms 6.23s 145: siftdownTimer(i)
. . 146: }
30ms 58.89s 147: unlock(&timers.lock)
90ms 90ms 148: return true
. . 149:}
. . 150:
. . 151:// Timerproc runs the time-driven events.
. . 152:// It sleeps until the next event in the timers heap.
. . 153:// If addtimer inserts a new earlier event, addtimer1 wakes timerproc early.
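As an aside, the issue does not show how these profiles were captured; a profile like the above is typically collected by exposing net/http/pprof in the server and then pulling a CPU profile with go tool pprof against the /debug/pprof/profile endpoint. A hypothetical hook, assuming that approach:

```go
// Hypothetical profiling hook (not shown in the original report): importing
// net/http/pprof registers the /debug/pprof/ handlers on the default mux, and
// `go tool pprof http://localhost:6060/debug/pprof/profile` then captures a
// CPU profile that can be inspected with top20 / list as above.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers
)

func main() {
	// Serve the pprof endpoints on a side port, separate from the app traffic.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // the real application work would go here
}
```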
Here is the CPU profile for the app with connection deadlines disabled:
(pprof) top20
465.38s of 1002.60s total (46.42%)
Dropped 622 nodes (cum <= 5.01s)
Showing top 20 nodes out of 192 (cum >= 15.87s)
flat flat% sum% cum cum%
54.74s 5.46% 5.46% 54.74s 5.46% runtime.memmove
54.54s 5.44% 10.90% 150.32s 14.99% runtime.lock
48.96s 4.88% 15.78% 330.74s 32.99% runtime.findrunnable
34.12s 3.40% 19.19% 34.12s 3.40% runtime.futex
32.93s 3.28% 22.47% 32.93s 3.28% runtime/internal/atomic.Xchg
27.41s 2.73% 25.20% 27.41s 2.73% runtime.procyield
26.82s 2.68% 27.88% 33.59s 3.35% runtime.runqgrab
20.20s 2.01% 29.89% 35.12s 3.50% runtime.selectgoImpl
17.45s 1.74% 31.63% 17.45s 1.74% runtime/internal/atomic.Cas
16.69s 1.66% 33.30% 21.79s 2.17% github.com/valyala/fasthttp.(*headerScanner).next
16.56s 1.65% 34.95% 16.56s 1.65% math/big.addMulVVW
16.38s 1.63% 36.58% 25.14s 2.51% github.com/valyala/fasthttp.(*RequestHeader).AppendBytes
14.24s 1.42% 38.01% 446.04s 44.49% github.com/valyala/fasthttp.(*Server).serveConn
13.88s 1.38% 39.39% 22.81s 2.28% runtime.netpollready
13.72s 1.37% 40.76% 13.72s 1.37% runtime/internal/atomic.Load
12.19s 1.22% 41.97% 32.18s 3.21% runtime.unlock
11.54s 1.15% 43.12% 351.66s 35.07% runtime.schedule
11.42s 1.14% 44.26% 11.42s 1.14% runtime.indexbytebody
11.02s 1.10% 45.36% 11.02s 1.10% runtime.osyield
10.57s 1.05% 46.42% 15.87s 1.58% github.com/valyala/fasthttp.appendArgBytes
(pprof) top20 -cum
155.46s of 1002.60s total (15.51%)
Dropped 622 nodes (cum <= 5.01s)
Showing top 20 nodes out of 192 (cum >= 61.52s)
flat flat% sum% cum cum%
0 0% 0% 636.02s 63.44% runtime.goexit
0 0% 0% 466.84s 46.56% github.com/valyala/fasthttp.(*workerPool).getCh.func1
2.85s 0.28% 0.28% 466.84s 46.56% github.com/valyala/fasthttp.(*workerPool).workerFunc
0.28s 0.028% 0.31% 446.32s 44.52% github.com/valyala/fasthttp.(*Server).(github.com/valyala/fasthttp.serveConn)-fm
14.24s 1.42% 1.73% 446.04s 44.49% github.com/valyala/fasthttp.(*Server).serveConn
3.38s 0.34% 2.07% 361.68s 36.07% runtime.mcall
1.30s 0.13% 2.20% 354.98s 35.41% runtime.park_m
11.54s 1.15% 3.35% 351.66s 35.07% runtime.schedule
48.96s 4.88% 8.23% 330.74s 32.99% runtime.findrunnable
1.74s 0.17% 8.41% 206.95s 20.64% lib/httpserver.getHTTPHandler.func1
1.21s 0.12% 8.53% 199.45s 19.89% lib/httpserver.httpHandler
0.43s 0.043% 8.57% 161.15s 16.07% main.requestHandler
0.14s 0.014% 8.58% 158.64s 15.82% main.proxyHandler
54.54s 5.44% 14.02% 150.32s 14.99% runtime.lock
2.46s 0.25% 14.27% 149.65s 14.93% main.proxyRequest
0 0% 14.27% 112.92s 11.26% github.com/valyala/fasthttp.(*PipelineClient).worker.func1
4.15s 0.41% 14.68% 112.92s 11.26% github.com/valyala/fasthttp.(*PipelineClient).writer
5.54s 0.55% 15.24% 106.39s 10.61% github.com/valyala/fasthttp.acquireByteReader
0.73s 0.073% 15.31% 67.26s 6.71% github.com/valyala/fasthttp.(*RequestHeader).SetBytesV
1.97s 0.2% 15.51% 61.52s 6.14% github.com/valyala/fasthttp.(*RequestHeader).SetCanonical
(pprof) list addtimer$
Total: 16.71mins
ROUTINE ======================== runtime.addtimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
150ms 1.60s (flat, cum) 0.16% of Total
. . 80:// Ready the goroutine arg.
. . 81:func goroutineReady(arg interface{}, seq uintptr) {
. . 82: goready(arg.(*g), 0)
. . 83:}
. . 84:
100ms 100ms 85:func addtimer(t *timer) {
40ms 780ms 86: lock(&timers.lock)
10ms 310ms 87: addtimerLocked(t)
. 410ms 88: unlock(&timers.lock)
. . 89:}
. . 90:
. . 91:// Add a timer to the heap and start or kick the timer proc.
. . 92:// If the new timer is earlier than any of the others.
. . 93:// Timers are locked.
(pprof) list deltimer$
Total: 16.71mins
ROUTINE ======================== runtime.deltimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
2.21s 4.53s (flat, cum) 0.45% of Total
. . 117: }
. . 118:}
. . 119:
. . 120:// Delete timer t from the heap.
. . 121:// Do not need to update the timerproc: if it wakes up early, no big deal.
370ms 370ms 122:func deltimer(t *timer) bool {
. . 123: // Dereference t so that any panic happens before the lock is held.
. . 124: // Discard result, because t might be moving in the heap.
10ms 10ms 125: _ = t.i
. . 126:
60ms 1.03s 127: lock(&timers.lock)
. . 128: // t may not be registered anymore and may have
. . 129: // a bogus i (typically 0, if generated by Go).
. . 130: // Verify it before proceeding.
40ms 40ms 131: i := t.i
30ms 30ms 132: last := len(timers.t) - 1
1.06s 1.06s 133: if i < 0 || i > last || timers.t[i] != t {
. . 134: unlock(&timers.lock)
. . 135: return false
. . 136: }
. . 137: if i != last {
230ms 230ms 138: timers.t[i] = timers.t[last]
70ms 70ms 139: timers.t[i].i = i
. . 140: }
250ms 250ms 141: timers.t[last] = nil
. . 142: timers.t = timers.t[:last]
. . 143: if i != last {
20ms 180ms 144: siftupTimer(i)
10ms 110ms 145: siftdownTimer(i)
. . 146: }
20ms 1.11s 147: unlock(&timers.lock)
40ms 40ms 148: return true
. . 149:}
. . 150:
. . 151:// Timerproc runs the time-driven events.
. . 152:// It sleeps until the next event in the timers heap.
. . 153:// If addtimer inserts a new earlier event, addtimer1 wakes timerproc early.
> Would it be possible to change it to SetDeadline? It would set up 1 timer instead of 2.

No, this won't work, since read and write deadlines are configured independently in our application.
FYI, this is not an OS scalability problem, since the CPU usage graph above until 15:33 corresponds to the 'app process per CPU core' mode of the app with connection deadlines enabled. We had to switch to this mode due to the timer scalability issue in the Go runtime.
Thanks for the detailed info!
This confirms that the main issue is the global timers mutex. siftdown/up consume ~3% of the time. If we distribute timers, then siftdown/up should become faster as well (smaller per-P heaps + better locality).
Unfortunately this is not trivial to do. There is #6239 for this.
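To make the "distribute timers" idea concrete, here is a rough, purely illustrative user-level sketch (hypothetical names and types; not the actual runtime implementation): timers are sharded across per-CPU buckets, each with its own lock and min-heap, so concurrent adds and deletes mostly touch different locks and cache lines instead of one global timers.lock.

```go
// Illustrative sketch of sharded timer heaps (hypothetical; not runtime code).
package main

import (
	"container/heap"
	"runtime"
	"sync"
	"time"
)

type timer struct {
	when time.Time
	f    func() // callback to run when the timer fires (firing loop omitted)
}

// timerHeap is a min-heap ordered by firing time.
type timerHeap []*timer

func (h timerHeap) Len() int            { return len(h) }
func (h timerHeap) Less(i, j int) bool  { return h[i].when.Before(h[j].when) }
func (h timerHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *timerHeap) Push(x interface{}) { *h = append(*h, x.(*timer)) }
func (h *timerHeap) Pop() interface{} {
	old := *h
	t := old[len(old)-1]
	*h = old[:len(old)-1]
	return t
}

// bucket is one shard; one bucket per logical CPU plays the role of a
// per-P timer heap (cache-line padding omitted for brevity).
type bucket struct {
	mu   sync.Mutex
	heap timerHeap
}

var buckets = make([]bucket, runtime.GOMAXPROCS(0))

// addTimer adds t to the shard chosen by the caller (e.g. the current CPU),
// locking only that shard instead of a single global timers lock.
func addTimer(shard int, t *timer) {
	b := &buckets[shard%len(buckets)]
	b.mu.Lock()
	heap.Push(&b.heap, t)
	b.mu.Unlock()
}

func main() {
	// Toy usage: arm one timer on each shard.
	for i := range buckets {
		addTimer(i, &timer{when: time.Now().Add(time.Second), f: func() {}})
	}
}
```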
> FYI, this is not an OS scalability problem, since the CPU usage graph above until 15:33 corresponds to the 'app process per CPU core' mode of the app with connection deadlines enabled. We had to switch to this mode due to the timer scalability issue in the Go runtime.
Sounds pretty bad. @rsc @aclements
CL https://golang.org/cl/34784 mentions this issue.
The CL fixes the timers scalability issue for us. A simple repro:
Hammer the following http server with more than 100K concurrent http keep-alive connections and send several requests per second over each connection. The server must run on a multi-core machine with GOMAXPROCS=NumCPU.
```go
package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"time"
)

var addr = flag.String("addr", ":80", "TCP address to listen to")

func main() {
	flag.Parse()
	s := &http.Server{
		Addr: *addr,
		// These timeouts trigger high iowait without CL 34784
		// if many requests are sent over more than 100K
		// keep-alive http connections.
		ReadTimeout:  90 * time.Second,
		WriteTimeout: 5 * time.Second,
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintf(w, "OK")
		}),
	}
	if err := s.ListenAndServe(); err != nil {
		log.Fatalf("error in ListenAndServe: %s", err)
	}
}
```
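For completeness, a hypothetical load-generator sketch (not part of the original report) matching the repro description: it opens many keep-alive connections and sends a few requests per second over each one. The flag names and defaults are illustrative.

```go
// Hypothetical load generator for the repro above: opens -conns keep-alive
// connections to -addr and sends -rps requests per second over each of them.
package main

import (
	"bufio"
	"flag"
	"fmt"
	"io"
	"log"
	"net"
	"net/http"
	"time"
)

var (
	addr  = flag.String("addr", "127.0.0.1:80", "server address to hammer")
	conns = flag.Int("conns", 100000, "number of concurrent keep-alive connections")
	rps   = flag.Int("rps", 3, "requests per second sent over each connection")
)

func main() {
	flag.Parse()
	for i := 0; i < *conns; i++ {
		go hammer(*addr, *rps)
	}
	select {} // run until killed
}

func hammer(addr string, rps int) {
	c, err := net.Dial("tcp", addr)
	if err != nil {
		log.Printf("dial error: %s", err)
		return
	}
	defer c.Close()
	br := bufio.NewReader(c)
	req := fmt.Sprintf("GET / HTTP/1.1\r\nHost: %s\r\n\r\n", addr)
	for range time.Tick(time.Second / time.Duration(rps)) {
		if _, err := c.Write([]byte(req)); err != nil {
			return
		}
		resp, err := http.ReadResponse(br, nil)
		if err != nil {
			return
		}
		// Drain the body so the next response can be parsed on the same connection.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
}
```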
CPU load should look like: [CPU load graph omitted]
iowait share: [iowait graph omitted]