(This may help bypass issue #10958 in a quite different, but probably better and cleaner, way. It comes from this thread; many thanks to the participants of that discussion.)
Hi all :)
I recently implemented a sub-option of go tool trace named "diagreedy",
which diagnoses and finds the greediest goroutines. This tool has already
helped us track down several deeply hidden problems in our Go applications,
and fixing them afterwards gave us more stable and shorter GC pause latency.
N = GOMAXPROCS
Tmax_stw = sum{top max N on-processor time slices}
Tgc = Tmax_stw + Tgcwork
t1=12 t2=11 t3=10
| | | |
__________________________________________________
| | | |
---------------------------------------------------> time line
P0 [------go 1------][-go 7-][-go 1-][-go 6-][-go 4-]
P1 [-go 3-][-go 4-][------go 2-----][-go 1-][-go 2-]
P2 [-go 4-][-go 3-][-go 8-][-go 9-][-----go 10-----]
In this diagram, N is 3 (i.e. GOMAXPROCS), and the top 3 longest on-processor
time slices are t1, t2, and t3, corresponding to goroutines 1, 2, and 10.
The maximum time cost of the pre-GC Stop-the-World phase is named Tmax_stw and is the
sum of t1, t2, and t3.
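For example, with the slice lengths in the diagram (t1 = 12, t2 = 11, t3 = 10 time units), this definition gives Tmax_stw = 12 + 11 + 10 = 33 units; the correction near the end of this proposal argues that the max, 12 units, is probably the right measure instead.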
Tgcwork is the time cost of the actually effective garbage-collection work after
the Stop-the-World phase. Tgcwork is determined by the number of allocated objects
and the complexity of the relationships among them.
Tmax_stw and Tgcwork together determine one typical GC pause cost. As for
Tgcwork, we can optimize it by reusing objects (via sync.Pool, for example) and
by many other methods, but we do not discuss Tgcwork further here; we concentrate on
Tmax_stw.
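As one illustration of the object-reuse idea, here is a minimal sync.Pool sketch; the 4KB buffer size and the handle function are only illustrative assumptions, not part of the proposal:
package main

import "sync"

// bufPool hands out reusable 4KB scratch buffers so that fewer objects are
// allocated per request, which shrinks the work the collector has to do (Tgcwork).
var bufPool = sync.Pool{
    New: func() interface{} { return make([]byte, 4096) },
}

// handle processes one request using a pooled buffer instead of a fresh allocation.
func handle(data []byte) {
    buf := bufPool.Get().([]byte)
    defer bufPool.Put(buf)
    copy(buf, data)
    // ... process buf ...
}

func main() {
    handle([]byte("example request"))
}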
We can plainly draw this conclusion:
The bigger Tmax_stw is, and the greedier and less cooperative the goroutines are,
the more thrashing and churning our Go applications suffer, and thus the worse
latency our programs show (caused by the GC pause).
For some CPU-intensive applications, the top N on-processor time slices are often
very long (unless the tasks are split very carefully), especially when our project
pulls in tons of other, heavily used open-source projects, which makes it more
complex and harder to trace.
We already have tools to sample the top-n CPU usage of functions, but a function's
accumulated CPU usage and the longest top-n time slices are not the same thing at all.
For example, suppose we have 1000 different CPU-intensive goroutines, each
something like:
go func() {
    do_cpu_staff() // time cost 1ms
    time.Sleep(time.Second)
}()
Our Tmax_stw would become GOMAXPROCS * 1ms. That is terrible, because we sometimes
set GOMAXPROCS to a pretty big value (16 or even 36, for instance).
With a top-n CPU usage sample we may not find the sources of the problem easily,
but with the 'top max N on-processor time slices' approach we can track down
the sources of the long GC pause immediately and fix them like this:
go func() {
    do_cpu_staff0() // time cost 200us
    runtime.Gosched()
    do_cpu_staff1() // time cost 200us
    runtime.Gosched()
    do_cpu_staff2() // time cost 200us
    runtime.Gosched()
    do_cpu_staff3() // time cost 200us
    runtime.Gosched()
    do_cpu_staff4() // time cost 200us
    time.Sleep(time.Second)
}()
Then our Tmax_stw turns from GOMAXPROCS * 1ms into GOMAXPROCS/5 ms :)
(If your GOMAXPROCS is very big, just split do_cpu_staff even further. You could
even implement some logic that lets do_cpu_staff split its tasks automatically,
based on the value of GOMAXPROCS and another parameter, perhaps named time_cost_wanted,
and solve this problem for good.)
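A minimal sketch of that automatic-splitting idea is below; doCPUStuffCooperative, doChunk, and the budget formula (time_cost_wanted divided by GOMAXPROCS) are only my assumptions about how such logic might look, not something the tool provides:
package main

import (
    "runtime"
    "time"
)

// doCPUStuffCooperative runs n chunks of CPU-bound work, yielding the processor
// whenever one continuous run exceeds a per-slice budget derived from GOMAXPROCS
// and the caller-chosen timeCostWanted.
func doCPUStuffCooperative(n int, timeCostWanted time.Duration, doChunk func(i int)) {
    budget := timeCostWanted / time.Duration(runtime.GOMAXPROCS(0))
    start := time.Now()
    for i := 0; i < n; i++ {
        doChunk(i)
        if time.Since(start) >= budget {
            runtime.Gosched() // give other goroutines (and the GC) a chance to run
            start = time.Now()
        }
    }
}

func main() {
    go func() {
        for {
            // 1000 small chunks per second, never holding a P longer than the budget.
            doCPUStuffCooperative(1000, time.Millisecond, func(i int) {
                // ... one small piece of the CPU work ...
            })
            time.Sleep(time.Second)
        }
    }()
    time.Sleep(3 * time.Second)
}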
Here is an example of the usage of go tool trace -diagreedy:
$ go tool trace -h
Usage of 'go tool trace':
Given a trace file produced by 'go test':
go test -trace=trace.out pkg
Open a web browser displaying trace:
go tool trace [flags] [pkg.test] trace.out
Generate a pprof-like profile from the trace:
go tool trace -pprof=TYPE [pkg.test] trace.out
[pkg.test] argument is required for traces produced by Go 1.6 and below.
Go 1.7 does not require the binary argument.
Supported profile types are:
- net: network blocking profile
- sync: synchronization blocking profile
- syscall: syscall blocking profile
- sched: scheduler latency profile
Flags:
-http=addr: HTTP service address (e.g., ':6060')
-pprof=type: print a pprof-like profile instead
-dump: dump all trace events as text to stdout <- newly added option
-diagreedy=N: dump the top N greedy goroutines <- newly added option
Sample code to diagnose:
package main

import "os"
import "fmt"
import "runtime"
import "runtime/trace"

//import "sync/atomic"
//import "math/big"
import "time"

func init() {
    //runtime.LockOSThread()
}

func main() {
    //debug.SetGCPercent(-1)
    runtime.GOMAXPROCS(2)
    go func() {
        for {
            fmt.Println("---------------")
            f, err := os.Create("trace_my_" +
                time.Now().Format("2006-01-02-15-04-05") + ".out")
            if err != nil {
                panic(err)
            }
            err = trace.Start(f)
            if err != nil {
                panic(err)
            }
            time.Sleep(time.Second / 1)
            trace.Stop()
            time.Sleep(time.Second * 2)
        }
    }()
    go func() {
        for {
            time.Sleep(time.Millisecond * 10)
            go func() {
                time.Sleep(time.Millisecond * 100)
            }()
        }
    }()
    go func() {
        for {
            fmt.Println("heartbeat... ",
                time.Now().Format("2006-01-02-15:04:05"))
            time.Sleep(time.Second)
        }
    }()
    // maybe highest cpu usage
    go func() {
        l := make([]int, 0, 8)
        l = append(l, 0)
        ct := 0
        for {
            l = append(l, ct)
            ct++
            l = l[1:]
            runtime.Gosched()
        }
    }()
    amount := 1000000
    granul := 100000
    go godeadloop0(amount, granul)
    go godeadloop1(amount, granul)
    godeadloop2(amount, granul)
    return
}
// potential greediest go functions
func godeadloop0(amount, granul int) {
    func() {
        for {
            ct := 0
            for {
                ct++
                if ct%granul == 0 {
                    runtime.Gosched()
                }
                if ct >= amount {
                    fmt.Println("--->")
                    break
                }
            }
            time.Sleep(time.Second / 10)
        }
    }()
}

func godeadloop1(amount, granul int) {
    func() {
        for {
            ct := 0
            for {
                ct++
                if ct%granul == 0 {
                    runtime.Gosched()
                }
                if ct >= amount {
                    fmt.Println("--->")
                    break
                }
            }
            time.Sleep(time.Second / 10)
        }
    }()
}

func godeadloop2(amount, granul int) {
    func() {
        for {
            ct := 0
            for {
                ct++
                if ct%granul == 0 {
                    runtime.Gosched()
                }
                if ct >= amount {
                    fmt.Println("--->")
                    break
                }
            }
            time.Sleep(time.Second / 10)
        }
    }()
}
Diagnosis for the 1st time:
$ go tool trace -diagreedy=10 /root/go/src/diaggreedy/trace_my_2017-06-25-10-01-13.out
finally the found toppest 10 greedy goroutines info
7486357ns -> 0.007486s
GoStart Ts: 739889373 P: 1 G: 9 StkID: 0 args0: 9 args1: 0 args2: 0
GoSched
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
GoSched Ts: 747375730 P: 1 G: 9 StkID: 20 args0: 0 args1: 0 args2: 0
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
3617038ns -> 0.003617s
GoStartLabel Ts: 649291006 P: 0 G: 27 StkID: 0 args0: 27 args1: 7 args2: 2
GoStart -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock
4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
GC (fractional)
GoBlock Ts: 652908044 P: 0 G: 27 StkID: 0 args0: 0 args1: 0 args2: 0
GoStart -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock -> GoStartLabel
4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
2763199ns -> 0.002763s
GoStart Ts: 925956546 P: 1 G: 10 StkID: 0 args0: 10 args1: 0 args2: 0
GoSleep -> GoUnblock
4452857 time.Sleep /usr/local/go/src/runtime/time.go 59
4746323 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 112
4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
GoSched Ts: 928719745 P: 1 G: 10 StkID: 21 args0: 0 args1: 0 args2: 0
4746333 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 105
4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
2573480ns -> 0.002573s
GoStart Ts: 750086169 P: 0 G: 9 StkID: 0 args0: 9 args1: 0 args2: 0
GoSched
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
GoSched Ts: 752659649 P: 0 G: 9 StkID: 20 args0: 0 args1: 0 args2: 0
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
2558799ns -> 0.002559s
GoStart Ts: 625418894 P: 0 G: 9 StkID: 0 args0: 9 args1: 0 args2: 0
GoSched
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
GoSched Ts: 627977693 P: 0 G: 9 StkID: 20 args0: 0 args1: 0 args2: 0
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
2462919ns -> 0.002463s
GoStart Ts: 975882408 P: 0 G: 9 StkID: 0 args0: 9 args1: 97 args2: 0
GoSched
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
GoSched Ts: 978345327 P: 0 G: 9 StkID: 20 args0: 0 args1: 0 args2: 0
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
2374319ns -> 0.002374s
GoStart Ts: 240141393 P: 1 G: 10 StkID: 0 args0: 10 args1: 0 args2: 0
GoSleep -> GoUnblock
4452857 time.Sleep /usr/local/go/src/runtime/time.go 59
4746323 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 112
4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
GoSched Ts: 242515712 P: 1 G: 10 StkID: 21 args0: 0 args1: 0 args2: 0
4746333 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 105
4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
2250439ns -> 0.002250s
GoStart Ts: 724664379 P: 0 G: 1 StkID: 0 args0: 1 args1: 0 args2: 0
GoSched
4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
4744280 main.main /root/go/src/diaggreedy/x.go 74
GoSched Ts: 726914818 P: 0 G: 1 StkID: 19 args0: 0 args1: 0 args2: 0
4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
4744280 main.main /root/go/src/diaggreedy/x.go 74
2230559ns -> 0.002231s
GoStart Ts: 382544542 P: 1 G: 9 StkID: 0 args0: 9 args1: 0 args2: 0
GoSched
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
GoSched Ts: 384775101 P: 1 G: 9 StkID: 20 args0: 0 args1: 0 args2: 0
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
2112279ns -> 0.002112s
GoStartLabel Ts: 866437288 P: 1 G: 28 StkID: 0 args0: 28 args1: 3 args2: 2
GoStart -> GoBlock -> GoUnblock
4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
GC (fractional)
GoBlock Ts: 868549567 P: 1 G: 28 StkID: 0 args0: 0 args1: 0 args2: 0
GoStart -> GoBlock -> GoUnblock -> GoStartLabel
4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
From this we can find the source of the long GC pause, godeadloop0,
and change the variable granul to 10000. Test again:
$ go tool trace -diagreedy=10 /root/go/src/diaggreedy/trace_my_2017-06-25-09-52-03.out
finally the found toppest 10 greedy goroutines info
408760ns -> 0.000409s
GoStart Ts: 564369612 P: 0 G: 8 StkID: 0 args0: 8 args1: 0 args2: 0
GoSched
4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66
GoSched Ts: 564778372 P: 0 G: 8 StkID: 21 args0: 0 args1: 0 args2: 0
4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66
393760ns -> 0.000394s
GoStart Ts: 363796356 P: 1 G: 8 StkID: 0 args0: 8 args1: 0 args2: 0
GoSched
4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66
GoSched Ts: 364190116 P: 1 G: 8 StkID: 21 args0: 0 args1: 0 args2: 0
4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66
375760ns -> 0.000376s
GoStartLabel Ts: 583805009 P: 0 G: 16 StkID: 0 args0: 16 args1: 3 args2: 2
GoStart -> GoBlock -> GoUnblock
4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
GC (fractional)
GoBlock Ts: 584180769 P: 0 G: 16 StkID: 0 args0: 0 args1: 0 args2: 0
GoStart -> GoBlock -> GoUnblock -> GoStartLabel
4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
364040ns -> 0.000364s
GoStart Ts: 211900654 P: 0 G: 10 StkID: 0 args0: 10 args1: 0 args2: 0
GoSleep -> GoUnblock
4452857 time.Sleep /usr/local/go/src/runtime/time.go 59
4746323 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 112
4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
GoSched Ts: 212264694 P: 0 G: 10 StkID: 23 args0: 0 args1: 0 args2: 0
4746333 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 105
4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
358280ns -> 0.000358s
GoStart Ts: 736130271 P: 1 G: 9 StkID: 0 args0: 9 args1: 0 args2: 0
GoSched
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
GoSched Ts: 736488551 P: 1 G: 9 StkID: 24 args0: 0 args1: 0 args2: 0
4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
357360ns -> 0.000357s
GoStart Ts: 551030053 P: 0 G: 8 StkID: 0 args0: 8 args1: 0 args2: 0
GoSched
4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66
GoSched Ts: 551387413 P: 0 G: 8 StkID: 21 args0: 0 args1: 0 args2: 0
4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66
323360ns -> 0.000323s
GoStart Ts: 103892587 P: 1 G: 1 StkID: 0 args0: 1 args1: 0 args2: 0
GoSched
4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
4744280 main.main /root/go/src/diaggreedy/x.go 74
GoSched Ts: 104215947 P: 1 G: 1 StkID: 22 args0: 0 args1: 0 args2: 0
4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
4744280 main.main /root/go/src/diaggreedy/x.go 74
320520ns -> 0.000321s
GoStart Ts: 736132431 P: 0 G: 1 StkID: 0 args0: 1 args1: 0 args2: 0
GoSched
4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
4744280 main.main /root/go/src/diaggreedy/x.go 74
GoSched Ts: 736452951 P: 0 G: 1 StkID: 22 args0: 0 args1: 0 args2: 0
4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
4744280 main.main /root/go/src/diaggreedy/x.go 74
320480ns -> 0.000320s
GoStartLabel Ts: 777280946 P: 0 G: 16 StkID: 0 args0: 16 args1: 5 args2: 2
GoStart -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock
4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
GC (fractional)
GoBlock Ts: 777601426 P: 0 G: 16 StkID: 0 args0: 0 args1: 0 args2: 0
GoStart -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock -> GoStartLabel
4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
319480ns -> 0.000319s
GoStart Ts: 940403207 P: 0 G: 10 StkID: 0 args0: 10 args1: 0 args2: 0
GoSleep -> GoUnblock
4452857 time.Sleep /usr/local/go/src/runtime/time.go 59
4746323 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 112
4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
GoSched Ts: 940722687 P: 0 G: 10 StkID: 23 args0: 0 args1: 0 args2: 0
4746333 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 105
4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
As you can see, the typical GC pause finally changed from 8ms to 0.4ms. Of course
the situation would be much more complex in a real, big project, but the key idea
of the tracking is the same and clear.
I also implemented a 'dump' option, shown below:
$ # Ts is the event's timestamp in nanoseconds
$ # read go/src/internal/trace/parser.go for more info
$ go tool trace -dump /root/go/src/diaggreedy/trace_my_2017-06-25-10-03-15.out
0xc42001c120 Off: 24 type: 13 GoCreate Ts: 0 P: 1 G: 0 StkID: 1 args0: 1 args1: 2 args2: 0 link: 0xc42001d050
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001c1b0 Off: 30 type: 13 GoCreate Ts: 3879 P: 1 G: 0 StkID: 1 args0: 2 args1: 3 args2: 0 link: 0x0
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001c240 Off: 36 type: 31 GoWaiting Ts: 4119 P: 1 G: 2 StkID: 0 args0: 2 args1: 0 args2: 0 link: 0x0
0xc42001c2d0 Off: 39 type: 13 GoCreate Ts: 8079 P: 1 G: 0 StkID: 1 args0: 3 args1: 4 args2: 0 link: 0x0
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001c360 Off: 45 type: 31 GoWaiting Ts: 8199 P: 1 G: 3 StkID: 0 args0: 3 args1: 0 args2: 0 link: 0xc427e29e60
0xc42001c3f0 Off: 48 type: 13 GoCreate Ts: 13639 P: 1 G: 0 StkID: 1 args0: 4 args1: 5 args2: 0 link: 0x0
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001c480 Off: 55 type: 31 GoWaiting Ts: 13759 P: 1 G: 4 StkID: 0 args0: 4 args1: 0 args2: 0 link: 0x0
0xc42001c510 Off: 58 type: 13 GoCreate Ts: 14159 P: 1 G: 0 StkID: 1 args0: 5 args1: 6 args2: 0 link: 0xc42001cb40
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001c5a0 Off: 64 type: 13 GoCreate Ts: 14399 P: 1 G: 0 StkID: 1 args0: 6 args1: 7 args2: 0 link: 0x0
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001c630 Off: 70 type: 31 GoWaiting Ts: 14439 P: 1 G: 6 StkID: 0 args0: 6 args1: 0 args2: 0 link: 0xc4223ee2d0
0xc42001c6c0 Off: 73 type: 13 GoCreate Ts: 18039 P: 1 G: 0 StkID: 1 args0: 7 args1: 8 args2: 0 link: 0x0
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001c750 Off: 79 type: 31 GoWaiting Ts: 18119 P: 1 G: 7 StkID: 0 args0: 7 args1: 0 args2: 0 link: 0xc44016d710
0xc42001c7e0 Off: 82 type: 13 GoCreate Ts: 21479 P: 1 G: 0 StkID: 1 args0: 8 args1: 9 args2: 0 link: 0xc42001d290
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001c870 Off: 88 type: 13 GoCreate Ts: 25199 P: 1 G: 0 StkID: 1 args0: 9 args1: 10 args2: 0 link: 0xc42592f440
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001c900 Off: 94 type: 13 GoCreate Ts: 29159 P: 1 G: 0 StkID: 1 args0: 10 args1: 11 args2: 0 link: 0xc42001d170
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001c990 Off: 100 type: 13 GoCreate Ts: 33279 P: 1 G: 0 StkID: 1 args0: 17 args1: 12 args2: 0 link: 0x0
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001ca20 Off: 106 type: 32 GoInSyscall Ts: 33359 P: 1 G: 17 StkID: 0 args0: 17 args1: 0 args2: 0 link: 0x0
0xc42001cab0 Off: 109 type: 5 ProcStart Ts: 33439 P: 1 G: 0 StkID: 0 args0: 3 args1: 0 args2: 0 link: 0x0
0xc42001cb40 Off: 112 type: 14 GoStart Ts: 33719 P: 1 G: 5 StkID: 6 args0: 5 args1: 0 args2: 0 link: 0xc42001ce10
4744545 main.main.func1 /root/go/src/diaggreedy/x.go 22
0xc42001cbd0 Off: 115 type: 33 HeapAlloc Ts: 36559 P: 1 G: 5 StkID: 0 args0: 253952 args1: 0 args2: 0 link: 0x0
0xc42001cc60 Off: 120 type: 33 HeapAlloc Ts: 36879 P: 1 G: 5 StkID: 0 args0: 262144 args1: 0 args2: 0 link: 0x0
0xc42001ccf0 Off: 172 type: 4 Gomaxprocs Ts: 42519 P: 1 G: 5 StkID: 13 args0: 2 args1: 0 args2: 0 link: 0x0
4366993 runtime.startTheWorld /usr/local/go/src/runtime/proc.go 951
4456480 runtime.StartTrace /usr/local/go/src/runtime/trace.go 256
4743634 runtime/trace.Start /usr/local/go/src/runtime/trace/trace.go 23
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001cd80 Off: 177 type: 13 GoCreate Ts: 64239 P: 1 G: 5 StkID: 15 args0: 18 args1: 14 args2: 0 link: 0xc42001cea0
4743712 runtime/trace.Start /usr/local/go/src/runtime/trace/trace.go 34
4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
0xc42001ce10 Off: 184 type: 19 GoSleep Ts: 69599 P: 1 G: 5 StkID: 16 args0: 0 args1: 0 args2: 0 link: 0xc44016e2d0
4452857 time.Sleep /usr/local/go/src/runtime/time.go 59
4744919 main.main.func1 /root/go/src/diaggreedy/x.go 34
0xc42001cea0 Off: 188 type: 14 GoStart Ts: 72679 P: 1 G: 18 StkID: 14 args0: 18 args1: 0 args2: 0 link: 0xc42001cfc0
4743809 runtime/trace.Start.func1 /usr/local/go/src/runtime/trace/trace.go 26
0xc42001cf30 Off: 191 type: 28 GoSysCall Ts: 73959 P: 1 G: 18 StkID: 17 args0: 0 args1: 0 args2: 0 link: 0x0
4552069 syscall.write /usr/local/go/src/syscall/zsyscall_linux_amd64.go 1064
4550393 syscall.Write /usr/local/go/src/syscall/syscall_unix.go 181
4591004 os.(*File).write /usr/local/go/src/os/file_unix.go 186
4587980 os.(*File).Write /usr/local/go/src/os/file.go 142
# ......
I would be very glad to push it to the golang project so it can benefit more people, if this proposal is accepted.
Many thanks for your time and patience.
If you are interested in this little tool, here is the repository.
Sorry, I might have made a mistake:
Tmax_stw = sum{top max N on-processor time slices}
It should probably be:
Tmax_stw = max{top max N on-processor time slices}
Although I haven't dived into the Go scheduler's code yet, I guess the second is
most likely the right answer, because the first algorithm is just too inefficient
and unlikely to be used by the Go scheduler.
But no matter which one is right, the first or the second, it does not
fundamentally change the conclusion made before.
/cc @aclements @RLH
/cc @dvyukov
Thanks a lot, rsc :)
Nice tool.
Moving forward the Go compiler needs to be taught how to preempt tight
loops so, while valuable, the usefulness of this tool is short term. The
unfortunate downside is that once the tool is in the source base it would
need to be supported even though its usefulness will be minimal once the
compiler adds loop preemption.
Thanks Richard for your reply :)
Honestly, I don't think this tool will become useless even after our compiler gains a mature, default-on or even always-on loop-preemption option, and there are several reasons.
The approach of loop preemption cannot eliminate all CPU-greedy problems; there are just too many difficult and sticky corner cases that loop preemption cannot handle efficiently right now (or it would pay a very dear price to list them all and fix them all efficiently; it is just too expensive even if you could enumerate every corner case).
Even if loop preemption could eliminate all CPU-greedy problems and all the difficult and sticky corner cases, and even once loop preemption is always enabled in the Go compiler, wouldn't we still want a tool that lets users verify, out of the box, that their Go binary behaves as the compiler promised ('all CPU-greedy problems are solved by loop preemption') and as they intended? And if they still find a CPU-greedy hot path, they can optimize it themselves if they want to, or report it as an issue; isn't that even better?
In my personal view, although loop preemption is a pretty good feature, it should not be always 'on' or even default 'on'; it should default to 'off' even when it is 'mature' (and to what extent can you believe it is mature?). And because loop preemption is a little counter-intuitive and radical, it should not even be a global option; it should be an option with per-source-file granularity (I do not know whether it already is or not).
Go already has tons of open-source projects and libraries that were developed under the assumption of 'preempt only at function calls'. Something may go wrong when the compiler performs the loop-preemption transformation on all of them; that is a little dangerous and risky, since many Go applications play a very important role in their systems.
Under the assumption of 'preempt only at function calls', users can run high-priority batch jobs without being blocked by the scheduler, in a very fast and low-latency manner. With loop preemption 'on', such jobs become less efficient and their latency gets worse.
So the trade-off of loop preemption's overhead should be chosen and decided by the users themselves: have it or not. (As far as I know, about the current implementation of insert_resched_checks: 'The simpler the loop, the worse the overhead, that is why this phase is not normally enabled.' -- David Chase. It may not be an easy job at all to bring an efficient and complete loop-preemption feature to our users in a short time, within one year, for instance.)
Above all, I think it is better to provide a diagnosis tool that lets users see the status of their goroutines' CPU time slices and optimize their code themselves if they really want to, no matter whether loop preemption is default 'off' now or becomes default 'on' in the future. It is often better to give our users more choices than to limit them to a single one.
The approach of loop preemption cannot eliminate all CPU-greedy problems; there are just too many difficult and sticky corner cases that loop preemption cannot handle efficiently right now (or it would pay a very dear price to list them all and fix them all efficiently; it is just too expensive even if you could enumerate every corner case).
I'm not sure I understand your claim, or perhaps I don't understand your definition of "cpu greedy problems". Can you give an example of such a problem that loop preemption doesn't solve?
I also wouldn't call the cost "very dear". Our current prototype of loop preemption shows a ~2.5% geomean slow-down across the go1 benchmarks, which are CPU-intensive and have several tight loops. Each release generally speeds up code by more than that just through code generation improvements.
Hi Austin, I think you are underestimating the distance and the difficulty of the gap between a 'local optimization' and a 'perfect and optimal solution' to the 'cpu greedy problems' in Go.
The definition of 'cpu greedy problems' is:
A CPU-greedy goroutine is a goroutine that very rarely does anything that would let itself be scheduled away from the CPU it already occupies. 'Rarely' here means rarely enough that other goroutines starve a little, or even to death.
(This thread has already discussed such effects.)
The 'greediest goroutine' example might be something like this:
go func() {
    for {
    }
}()
But the 'loop' is only one big, ordinary kind of such problems; how about this one:
_adder := func(c *uint32, depth uint32) uint32 {
    atomic.CompareAndSwapUint32(c, 0, 333)
    atomic.CompareAndSwapUint32(c, 0, 333)
    atomic.CompareAndSwapUint32(c, 0, 333)
    // ....
    /* omitted, because there are */
    /* 100,000 of them :( */
    atomic.CompareAndSwapUint32(c, 0, 333)
    atomic.CompareAndSwapUint32(c, 0, 333)
    atomic.CompareAndSwapUint32(c, 0, 333)
    return atomic.LoadUint32(c)
}
godeadloop := func() {
    var ct uint32
    for {
        ct++
        ct = _adder(&ct, 1)
        ct = _adder(&ct, 1)
        ct = _adder(&ct, 1)
        // ....
        /* omitted, because there are */
        /* 100,000 of them :( */
        ct = _adder(&ct, 1)
        ct = _adder(&ct, 1)
        ct = _adder(&ct, 1)
    }
}
go godeadloop()
The intact version of the source code is here.
In this example the greediness comes from huge stretches of straight-line code rather than from a tight loop, right? Yet it is pretty CPU-greedy, and this kind of code can let other goroutines starve or even die (the whole Go process would probably get stuck for several minutes, I think, if an STW happened at the same time).
Of course you could solve this particular problem with a new 'insert_more_resched_checks' pass over the Go AST after our discussion, but that is not the point. The point is that the loop-preemption solution we currently rely on for the CPU-greedy problems has already overlooked this corner case by mistake; how can we know there are not more corner cases hidden that we have not found yet?
If you really think the loop-preemption solution could be the 'perfect and optimal solution', could you write a formal and strict mathematical proof of it? If you cannot, we can only treat it as a suboptimal solution, and if we use a suboptimal solution as if it were a perfect one, terrible things may happen.
So this is why I still insist on adding a diagnosis tool like 'diagreedy' to Go. You could use it as a verification tool or whatever; it just gives more information to our users and lets them know more about what is happening deep inside their Go processes. It is a win-win, not only for the users but also for the developers.
I also wouldn't call the cost "very dear". Our current prototype of loop preemption shows a ~2.5% geomean slow-down across the go1 benchmarks, which are CPU-intensive and have several tight loops. Each release generally speeds up code by more than that just through code generation improvements.
If you have a large computer cluster worth 100,000,000 dollars, then a 2.5% performance improvement could save you 2,500,000 dollars, and 2,500,000 dollars can do many valuable things :D
For people who prefer performance, please give them other choices; do not close the door on them. All they need to do is analyze the trace.out that the Go process has dumped on the fly.
Loop preemption is a good tool. I think a good way to use it is to enable it for open-source projects we do not fully trust yet and disable it for our own reliable code, to get better performance and controllability.
(Sorry, I get the impression that you may want to do everything for the users, but that would be a kind of over-optimization if you do too much of it :p)
The compiler has already been taught to identify problematic loops. I'd suggest that a good next step would be to print out that information when requested, perhaps with -m or -m -m or with a new flag (regardless of whether the code generation will be altered). That is likely to be helpful long term anyway, and it is very likely a small, unobtrusive change. If it is small and safe enough, maybe (but just maybe) it could even be snuck into 1.9.
cc @dr2chase @bradfitz
A very good idea, and I agree with you 👍
The more tools and options there are to diagnose possibly problematic code, the better.
(One problem here is that maybe nearly every for-loop in the sources will be reported as 'problematic'. I don't know the implementation details of the current loop preemption in the compiler, but optimization is often one of the most difficult parts of compiler engineering, because the compiler cannot read the programmer's mind, and nearly every optimization is a suboptimal solution; there is hardly any perfect one. This is why I still insist on also providing external diagnosis tools to help programmers improve their code if they want to.)
I also suspect that some code written in Go assembly, such as crc32, is very CPU-intensive and contains loops that the current insert_resched_checks implementation may not handle yet. (Not quite sure about that. Also, preemption inside Go assembly code might break the Go assembler's semantics.)
/cc @rsc @aclements @RLH @josharian
It has been nearly 10 days since our first discussion. Maybe we should reach a conclusion about it?
@hnes proposals are reviewed on a schedule. And 10 days is not very long; many decisions take much longer to be made.
Thanks, Josh, for letting me know :)
proposals are reviewed on a schedule. And 10 days is not very long; many decisions take much longer to be made.
That is pretty reasonable.
Certainly loop preemption in Go 1.10 should make this less necessary, but as noted not completely unnecessary. In addition to straight line code there is also the question of large memmoves, which are in non-preemptible assembly, and maybe other operations like those.
The question is how to expose this in the tool. "go tool trace" is really about the web view. This new flag turns it into a command-line program with a new output format. Can we instead somehow highlight "greedy" goroutines in the web view? Then no new option is needed. Or maybe it should be a third-party separate tool anyway?
@dvyukov, as owner of go tool trace, any thoughts?
Perhaps if we trace when we requested STW/foreachP and when we actually get STW/foreachP, it will give enough info in the web view to understand the bad cases (what's running at the time of the request and for how long).
Perhaps if we trace when we requested STW/foreachP and when we actually get STW/foreachP, it will give enough info in the web view to understand the bad cases (what's running at the time of the request and for how long).
But the sampling could be imprecise, and users would still "need some good luck" to find the real top-N greedy goroutines.
Traversing the whole events[] list and using a min-heap to find the precise top-N greedy goroutines has a worst-case time complexity of only about O(n*lg n) (in practice this is much faster than parsing the trace.out binary file itself).
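A minimal sketch of that min-heap top-N selection is below; the Slice type and its fields are hypothetical stand-ins for the parsed trace events, not the actual diagreedy implementation:
package main

import (
    "container/heap"
    "fmt"
)

// Slice is a hypothetical on-processor time slice reconstructed from the trace.
type Slice struct {
    G   uint64 // goroutine ID
    Dur int64  // on-processor duration in nanoseconds
}

// minHeap keeps the N longest slices seen so far, with the shortest of them on top.
type minHeap []Slice

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].Dur < h[j].Dur }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(Slice)) }
func (h *minHeap) Pop() interface{} {
    old := *h
    s := old[len(old)-1]
    *h = old[:len(old)-1]
    return s
}

// topN scans all slices once and returns the n with the longest durations,
// in O(len(slices) * lg n) worst case.
func topN(slices []Slice, n int) []Slice {
    h := &minHeap{}
    for _, s := range slices {
        if h.Len() < n {
            heap.Push(h, s)
        } else if s.Dur > (*h)[0].Dur {
            (*h)[0] = s    // replace the shortest of the current top n
            heap.Fix(h, 0) // restore the heap property
        }
    }
    return *h
}

func main() {
    // Durations borrowed from the first diagnosis output above.
    slices := []Slice{{9, 7486357}, {27, 3617038}, {10, 2763199}, {9, 2573480}}
    for _, s := range topN(slices, 2) {
        fmt.Printf("G%d ran for %dns\n", s.G, s.Dur)
    }
}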
How do you propose to detect non-cooperative goroutines? The fact that it runs on CPU for a long time does not mean anything, it still can be perfectly cooperative if/when we try to preempt it.
You understand that this is a wrong way to write Go code, right?
do_cpu_staff0() // time cost 200us
runtime.Gosched()
do_cpu_staff1() // time cost 200us
runtime.Gosched()
do_cpu_staff2() // time cost 200us
runtime.Gosched()
Goroutines are preemptive since Go 1.1 and soon they become even more preemptive.
How do you propose to detect non-cooperative goroutines?
You are right and I quite agree with you.
STW/foreachP should also record information when it finds non-cooperative goroutines that take too long (say, more than 1ms, or a user-configurable threshold) to yield the CPU.
The fact that it runs on CPU for a long time does not mean anything
The more CPU-greedy goroutines there are, the worse the application's overall latency will be.
it still can be perfectly cooperative if/when we try to preempt it.
Yes, but before you choose to "preempt" it, you probably have to know where exactly the problem is, and that is why we may still need the diagnosis of the top-N greedy goroutines with precise source line numbers (as we already discussed in this thread, the for-loop preemption approach cannot perfectly solve all problems of this kind; think about assembly, for one instance).
You understand that this is a wrong way to write Go code, right?
do_cpu_staff0() // time cost 200us
runtime.Gosched()
do_cpu_staff1() // time cost 200us
runtime.Gosched()
do_cpu_staff2() // time cost 200us
runtime.Gosched()
Goroutines are preemptive since Go 1.1 and soon they become even more preemptive.
You may have a look at this; although it is somewhat hacky, it still means something. With this piece of code, on Go 1.8.3 the STW gets stuck nearly forever.
Preemption is only relative.
So, how about we combine both of them?
STW/foreachP records information when it finds non-cooperative goroutines that take too long (say, more than 1ms, or a user-configurable threshold) to yield the CPU when STW/foreachP wants to stop the world. (Maybe it could even dump a warning to stdout or to a log file while the application is running? That would be very useful in a production environment.)
Plus the diagnosis of the top-N greedy goroutines that we already discussed.
The more CPU-greedy goroutines there are, the worse the application's overall latency will be.
No. What matters is only if a goroutine is not cooperative.
Yes, but before you choose to "preempt" it, you probably have to know where exactly the problem is
No. Runtime preempts all currently running goroutines.
No. What matters is only if a goroutine is not cooperative.
I agree with you. Maybe this is just a terminology ambiguity: the phrase "cpu greedy goroutines" I used here does imply that a goroutine is less cooperative or completely non-cooperative (an example is the function godeadloop0 defined in this proposal's code example).
No. Runtime preempts all currently running goroutines.
To be more precise: "The Go runtime always tries to preempt all currently running goroutines."
But if the goroutines never give the runtime a single chance, the runtime will never successfully preempt even one of them, and that is why we want this diagnosis tool to help optimize our code :)
the phrase "cpu greedy goroutines" I used here does imply that a goroutine is less cooperative or completely non-cooperative (an example is the function godeadloop0 defined in this proposal's code example).
So, how do you propose to detect these 'completely non-cooperative' goroutines?
So, how do you propose to detect these 'completely non-cooperative' goroutines?
In fact I have thought about this before, and here is perhaps an appropriate solution:
t1=12 t2=11 t3=10
| | | |
__________________________________________________
| | | |
---------------------------------------------------> time line
P0 [------go 1------][-go 7-][-----------------][-go 6-][-go 4-]
P1 [-go 3-][-go 4-][------------go 2-----------][-go 1-][-go 2-]
P2 [-go 4-][-go 3-][-go 8-][-------------------][-----go 10-----]
|
|---->---->---->---->---->|
| |
STW start release
STW:
    timeout := 1ms
    max_total_time := 2s
    total_time := 0
    set try_to_stw_flag
    for _, p = range [P0, P1, P2] {
        locked, timecost := mutex_lock_timeout(p, timeout)
        total_time += timecost
        for locked is nil {
            scan [P0,P1,P2]'s currently running user goroutines' info
            dump it out // to trace struct || some log fd
            if total_time >= max_total_time {
                panic("stw take too long")
            }
            locked, timecost = mutex_lock_timeout(p, timeout)
            total_time += timecost
        }
    }
A little correction.
STW:
    timeout := 1ms
    max_total_time := 2s
    total_time := 0
    // ++ do not allocate the idle Processors to ready goroutines anymore
    set try_to_stw_flag
    for _, p = range [P0, P1, P2] {
        locked, timecost := mutex_lock_timeout(p, timeout)
        total_time += timecost
        for locked is nil {
            // ++ maybe it is hard to get the details (including the func call stack)
            // ++ of a running goroutine dynamically?
            scan [P0,P1,P2]'s currently running user goroutines' info
            dump it out // to trace struct || some log fd
            if total_time >= max_total_time {
                panic("stw take too long")
            }
            locked, timecost = mutex_lock_timeout(p, timeout)
            total_time += timecost
        }
    }
    // ++
    unset try_to_stw_flag
    // stw successfully
It seems that we just need to trace when we requested STW/foreachP and when we actually get STW/foreachP. The rest of info is already in the trace and can be inferred in the trace command (namely, what goroutines were running at the time of the request).
It seems that we just need to trace when we requested STW/foreachP and when we actually get STW/foreachP. The rest of info is already in the trace and can be inferred in the trace command (namely, what goroutines were running at the time of the request).
Bravo! That would make it much more convenient to implement our new tool :D
OK, so it sounds like we should add that trace info to the trace dump file. But I still don't see what we're going to do to expose it, unless the STW is a different clearly marked bar already. What will it look like? A special command-line mode people need to know about is not discoverable enough.
We currently record mark termination STW, but somehow forgot about sweep termination STW. I've got a CL to add this for computing MMUs, but it would help with this analysis, too.
Change https://golang.org/cl/55411 mentions this issue: runtime,cmd/trace: trace GC STW events
OK, so after CL 55411, @aclements, is there anything else for this issue? Should we at least mark it proposal-accepted?
OK, so after CL 55411, @aclements, is there anything else for this issue?
In terms of infrastructure support, I believe CL 55411 is sufficient.
Should we at least mark it proposal-accepted?
I certainly hope we can make this analysis simply unnecessary, but it could be useful in the interim and in flushing out bugs. Presentation questions aside, it should be relatively straightforward to implement a robust version on top of the events added by CL 55411.
Marking Proposal-Hold until there's a clearer explanation of what the exposed UI is.