go version
)?$ go version go version go1.14 darwin/amd64
Yes
go env
)?go env
Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/matt/Library/Caches/go-build"
GOENV="/Users/matt/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/matt/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/matt/Dev/caddyserver/caddy/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/97/r6t5z7zj0sd8pgh13bnwjh2w0000gn/T/go-build879531905=/tmp/go-build -gno-record-gcc-switches -fno-common"
Ran go run main.go
, in my case, specifically in this folder.
The program should run.
Sometimes, the CPU spins at 100% and the entry point of the program is never reached reached after anywhere from 5-50s (but there are probably no hard bounds on this, it's just what I've observed).
This is part of my typical development cycle where I write some code, then run the program, write more code, run the program, etc. Sometimes it happens, sometimes it doesn't. Even without changes, re-running the command will work. I can't pin it down, unfortunately, nor make a simpler test case.
I verified that the entry point main()
is never reached, by changing main()
so it is now:
func main() {
log.Println("Got here")
caddycmd.Main()
}
and can verify that "Got here" is never printed. Instead, the CPU spins:
In the past, @whitestrake has reported a similar phenomenon independently, but I only started noticing it recently myself. So, I know it's not isolated to just my machine.
Let me know what more would be useful to debug this. Thanks!
How often does this happen? Is it possible to reproduce by trying it some number of times?
Does it happen with go build
too, or only go run main.go
?
All imported packages get initialized before main
is executed (see https://golang.org/ref/spec#Package_initialization for details). Is it possible some of your imported packages have init()
that sometimes take too long? Or top level variable declarations that call functions?
In the screenshot you posted, the process name is main
, which I believe means what's running and using 100% CPU is the binary that was built, not the go
command.
That's a good point about init()
, I could put the log line into init()
but I guess that won't help since inits are called in a nondeterminstic (or at least unknown) order, so any offending init() might be called first.
You're probably right... although I haven't seen it happen with go build
or any non-temporary binaries. I'll get more info and reopen if I can pin it down. Thanks
To clarify, package initialization is done in a deterministic order, as defined in the Go specification.
I agree it might be hard to know what init
functions are being called when you are importing many packages transitively. Maybe a debugger or some other tool can help when investigating an issue like this.
To clarify, package initialization is done in a deterministic order,
Yes, but it's pretty opaque in practice :) i.e. it's not trivial to know in a large project the exact order of all instances of init()
. If there's a tool that can print that out, that could be helpful in this case. It's deterministic to the compiler, but not to the program being compiled, since the program being compiled doesn't have the scope of perspective of the compiler AFAIK. Anyway, thanks!
@dmitshur Do you know of a way I can run a profile that starts before the init phase? Based on your feedback, I'm 99% sure this is a bad init() function or bad var foo = ...
declaration somewhere. So I need to start a profile _before_ all those are run.
Starting the profiler early would be hard. Can you run under a debugger, then just ctrl-C when it hangs and see where the code is executing?
Nevermind! Gopher Slack is awesome. Someone gave me a tip: pkill -ABRT main
and then you get a stack trace and registers. Worked like a charm. Thanks everyone!
@dmitshur @randall77 The plot thickens, as I can only reproduce the bug in Go 1.14 and not Go 1.13: https://github.com/census-instrumentation/opencensus-go/issues/1200
I suspect this might be a bug in the Go scheduler? One or more goroutines are blocking on select
, but the stack trace of the active goroutine is still "unavailable". That linked issue has more details.
Any advice on where to go from here? I wish I could make the reproduction more minimal, but I just haven't figured it out as it seems dependent on importing (and using) these packages (even though the "using" code doesn't even get reached).
Also /cc @aclements per owners.
As per https://github.com/census-instrumentation/opencensus-go/issues/1200#issuecomment-597328486 it seems pretty clear that this is likely a bug in the Go 1.14 preemptive scheduler. When I disable it, the bug does not manifest.
I cannot reproduce:
$ mkdir issue37741
$ cd issue37741
$ git clone https://github.com/caddyserver/caddy
$ cd caddy
$ git checkout 6011ce120a4b2df9314a800fc93a61f852b766c1
$ cd cmd/caddy
$ ~/go1.14/bin/go build main.go
$ repeat 1000 (./main; sleep 0.5; date)
Runs 1000 times without fail.
I'm on OSX 10.15.2.
Anything I'm missing from the repro instructions?
@randall77 Thanks for trying! That's a pretty old commit, that doesn't add the dependencies which trigger the bug. Please try the linked branch, currently at commit f0f5490b9c0d07ae3acc35d559c5813f09c8a219
. Thanks!
Ok, now I can reproduce. Thanks.
Also reproduces at tip.
Seems to be stuck in an endless attempt-to-preempt loop.
With this patch:
--- a/src/runtime/signal_unix.go
+++ b/src/runtime/signal_unix.go
@@ -329,12 +329,21 @@ func doSigPreempt(gp *g, ctxt *sigctxt) {
if wantAsyncPreempt(gp) && isAsyncSafePoint(gp, ctxt.sigpc(), ctxt.sigsp(), ctxt.siglr()) {
// Inject a call to asyncPreempt.
ctxt.pushCall(funcPC(asyncPreempt))
+ nfail = 0
+ }
+ println("not at safepoint", nfail)
+ nfail++
+ if nfail == 1000 {
+ traceback(ctxt.sigpc(), ctxt.sigsp(), ctxt.siglr(), gp)
+ throw("BAD")
}
// Acknowledge the preemption.
atomic.Xadd(&gp.m.preemptGen, 1)
}
+var nfail int64
+
It will occasionally fail (and without the throw
, increment nfail forever).
The stack trace it prints is simple:
github.com/yuin/goldmark/util.init()
/Users/khr/gopath/pkg/mod/github.com/yuin/[email protected]/util/html5entities.go:747 +0xcb886
The line number fluctuates from run to run. It seems to always be in a big map initializer, that looks like this:
var html5entities = map[string]*HTML5Entity{
"AElig": {Name: "AElig", CodePoints: []int{198}, Characters: []byte{0xc3, 0x86}},
"AMP": {Name: "AMP", CodePoints: []int{38}, Characters: []byte{0x26}},
"Aacute": {Name: "Aacute", CodePoints: []int{193}, Characters: []byte{0xc3, 0x81}},
...
"zscr": {Name: "zscr", CodePoints: []int{120015}, Characters: []byte{0xf0, 0x9d, 0x93, 0x8f}},
"zwj": {Name: "zwj", CodePoints: []int{8205}, Characters: []byte{0xe2, 0x80, 0x8d}},
"zwnj": {Name: "zwnj", CodePoints: []int{8204}, Characters: []byte{0xe2, 0x80, 0x8c}},
}
With over 2000+ lines.
Each entry is inserted separately, so there are ~2000 chunks of assembly like this:
html5entities.go:20 0x1b06407 488d05d2064600 LEAQ _runtime.types+1834016(SB), AX
html5entities.go:20 0x1b0640e 48890424 MOVQ AX, 0(SP)
html5entities.go:20 0x1b06412 e8a99a50ff CALL _runtime.newobject(SB)
html5entities.go:20 0x1b06417 488b442408 MOVQ 0x8(SP), AX
html5entities.go:20 0x1b0641c 48898424a82e0000 MOVQ AX, 0x2ea8(SP)
html5entities.go:20 0x1b06424 48c7400806000000 MOVQ $0x6, 0x8(AX)
html5entities.go:20 0x1b0642c 488d0dd81e5600 LEAQ _go.string.*+29427(SB), CX
html5entities.go:20 0x1b06433 488908 MOVQ CX, 0(AX)
html5entities.go:20 0x1b06436 488d15235e3600 LEAQ _runtime.types+807840(SB), DX
html5entities.go:20 0x1b0643d 48891424 MOVQ DX, 0(SP)
html5entities.go:20 0x1b06441 e87a9a50ff CALL _runtime.newobject(SB)
html5entities.go:20 0x1b06446 488b442408 MOVQ 0x8(SP), AX
html5entities.go:20 0x1b0644b 48c700c1000000 MOVQ $0xc1, 0(AX)
html5entities.go:20 0x1b06452 488b8c24a82e0000 MOVQ 0x2ea8(SP), CX
html5entities.go:20 0x1b0645a 48c7411801000000 MOVQ $0x1, 0x18(CX)
html5entities.go:20 0x1b06462 48c7412001000000 MOVQ $0x1, 0x20(CX)
html5entities.go:20 0x1b0646a 833d0f832c0100 CMPL $0x0, _runtime.writeBarrier(SB)
html5entities.go:20 0x1b06471 0f8551380d00 JNE 0x1bd9cc8
html5entities.go:20 0x1b06477 48894110 MOVQ AX, 0x10(CX)
html5entities.go:20 0x1b0647b 488d051ea83600 LEAQ _runtime.types+826848(SB), AX
html5entities.go:20 0x1b06482 48890424 MOVQ AX, 0(SP)
html5entities.go:20 0x1b06486 e8359a50ff CALL _runtime.newobject(SB)
html5entities.go:20 0x1b0648b 488b442408 MOVQ 0x8(SP), AX
html5entities.go:20 0x1b06490 66c700c381 MOVW $-0x7e3d, 0(AX)
html5entities.go:20 0x1b06495 488b8c24a82e0000 MOVQ 0x2ea8(SP), CX
html5entities.go:20 0x1b0649d 48c7413002000000 MOVQ $0x2, 0x30(CX)
html5entities.go:20 0x1b064a5 48c7413802000000 MOVQ $0x2, 0x38(CX)
html5entities.go:20 0x1b064ad 833dcc822c0100 CMPL $0x0, _runtime.writeBarrier(SB)
html5entities.go:20 0x1b064b4 0f8500380d00 JNE 0x1bd9cba
html5entities.go:20 0x1b064ba 48894128 MOVQ AX, 0x28(CX)
html5entities.go:17 0x1b064be 488d055bfc3b00 LEAQ _runtime.types+1176160(SB), AX
html5entities.go:17 0x1b064c5 48890424 MOVQ AX, 0(SP)
html5entities.go:17 0x1b064c9 488b8c24b02e0000 MOVQ 0x2eb0(SP), CX
html5entities.go:17 0x1b064d1 48894c2408 MOVQ CX, 0x8(SP)
html5entities.go:17 0x1b064d6 488d152e1e5600 LEAQ _go.string.*+29427(SB), DX
html5entities.go:17 0x1b064dd 4889542410 MOVQ DX, 0x10(SP)
html5entities.go:17 0x1b064e2 48c744241806000000 MOVQ $0x6, 0x18(SP)
html5entities.go:17 0x1b064eb e830fd50ff CALL _runtime.mapassign_faststr(SB)
html5entities.go:17 0x1b064f0 488b7c2420 MOVQ 0x20(SP), DI
html5entities.go:17 0x1b064f5 8407 TESTB AL, 0(DI)
html5entities.go:17 0x1b064f7 833d82822c0100 CMPL $0x0, _runtime.writeBarrier(SB)
html5entities.go:17 0x1b064fe 0f85a4370d00 JNE 0x1bd9ca8
html5entities.go:17 0x1b06504 488b8424a82e0000 MOVQ 0x2ea8(SP), AX
html5entities.go:17 0x1b0650c 488907 MOVQ AX, 0(DI)
@cherryyz, how do I tell whether the code here is marked as preemptable?
Total guess, but the runtime is probably trying to start a GC. It's trying to interrupt this code to do so. The code is always sent a signal at a bad place, so it resumes and runs a bit more. Eventually it hits a write barrier or an allocation that the runtime stops it at, saying "no more! I need to GC!". Of course that's at a non-preemptable location. Deadlock ensues.
Kind of a half-baked theory. I'll look some more.
@cherrymui
If I look at the places where preempt is attempted (change the println
in my patch above to println("not at safepoint", nfail, hex(ctxt.sigpc()))
, I get something like this:
not at safepoint 0 0x1045b79
not at safepoint 1 0x7fff68a01d36
not at safepoint 2 0x10225d7
not at safepoint 3 0x10225d7
not at safepoint 4 0x10225d7
not at safepoint 5 0x1022896
not at safepoint 6 0x10228cd
not at safepoint 7 0x7fff68ac1ace
not at safepoint 8 0x1022095
not at safepoint 9 0x1017ad9
not at safepoint 10 0x7fff68a00ce6
not at safepoint 11 0x1022f8e
not at safepoint 12 0x7fff68a047fa
not at safepoint 13 0x7fff68a047fa
not at safepoint 14 0x7fff68a00ce6
not at safepoint 15 0x7fff68a047fa
not at safepoint 16 0x10229a2
not at safepoint 17 0x10229a2
not at safepoint 0 0x106ade0
not at safepoint 1 0x106ade0
not at safepoint 2 0x1068be6
not at safepoint 3 0x1bd0ff8
not at safepoint 4 0x1bd0ff8
Then lots of that same address repeating, with an occasional one of these:
not at safepoint 46 0x7fff68a00ce6
not at safepoint 454 0x100e172
The common line is a call to the write barrier code, for a check in that big map constructor of the gc enabled bit. That is indeed a non-preemptable location.
html5entities.go:17 0x1bd0ff8 488b8424a82e0000 MOVQ 0x2ea8(SP), AX
html5entities.go:17 0x1bd1000 e87b8549ff CALL _runtime.gcWriteBarrier(SB)
html5entities.go:17 0x1bd1005 e9a986f6ff JMP 0x1b396b3
I don't see how, if we resumed at 0x1bd0ff8, we'd ever get back to that same address again. As long as we made at least one instruction of progress.
0x100e172
is in the runtime, in notetsleep_internal
.
0x7fff68a00ce6
is presumably in vdso somewhere, also not preemptable.
My hunch is now that the thread is somehow not making any progress from 0x1bd0ff8
by the time we interrupt it again. My earlier theory, that the init thread was blocking on the runtime to do a GC, looks wrong.
(haven't read the whole thread, just answer this first)
how do I tell whether the code here is marked as preemptable?
Compiler's -d=pctab=pctopcdata
flag will do. Look for -2. It is not preemptible from a -2 (inclusive) to the next number that is not -2 (exclusive)
This patch seems to fix the bug:
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -4680,6 +4680,7 @@ func preemptone(_p_ *p) bool {
if preemptMSupported && debug.asyncpreemptoff == 0 {
_p_.preempt = true
preemptM(mp)
+ usleep(1000)
}
return true
I suspect that we're issuing preemptone
requests too quickly, and the target M never has the chance to make progress. The sleep gives the OS time to schedule it.
We should probably rate limit preemptM
requests to any (all?) targets. We do such a thing in suspendG
in runtime/preempt.go
, just not preemptone
in runtime/proc.go
.
This definitely seems to be a fundamental problem with async preempt + give up and retry...
It would be really nice to ask the OS to "interrupt the target thread only after it has been given at least a bit of time slice".
Attached is a single-file repro. It hangs pretty reliably when run 100 times.
Rename it from test.txt to test.go. For some reason you can't attach *.go files in github. Isn't this a programming website?
Here's a pretty simple fix.
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -4667,6 +4667,15 @@ func preemptone(_p_ *p) bool {
if gp == nil || gp == mp.g0 {
return false
}
+ if gp.preempt {
+ // Someone is already trying to preempt the goroutine
+ // running on this P. Don't try again, that can lead
+ // to lots of signals heading to the preempted thread,
+ // causing it not to make any progress on its actual code.
+ // That causes live-lock if the actual code is not
+ // at a preemptible point. See issue 37741.
+ return false
+ }
gp.preempt = true
Now I just need a decent test. The repro above is fine for one-offs, but it's too expensive for a checked-in test.
Edit:
The build finally passed if waiting longer, multiple times slower.
time -p ./make.bash
w./o the fix:
real 81.00
user 349.97
sys 96.57
w. the fix:
real 698.99
user 617.18
sys 1615.17
=============================================================
Hi @randall77 , building on an arm64 machine hangs with the fix, the machine used to experience long running time when testing 'reflect' package with "internal linking of -buildmode=pie", as mentioned in #37779 .
// command that hangs
`
format=, a= []interface {} = {...}, ~r2=...) at /home/ent-user/go/src/fmt/print.go:218
`
Change https://golang.org/cl/223240 mentions this issue: runtime: throttle signal senders when receivers aren't making progress
@shawn-xdji , could you try the CL above on your example?
@gopherbot please open a backport issue to 1.14.1.
Backport issue(s) opened: #37833 (for 1.14).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
Hi @randall77 thanks, build time backs to normal and is even better now, 'sys' is below 80s consistently, compared with 100s or so without the fix.
And the long running time issue of testing 'reflect' package is addressed as well.
Change https://golang.org/cl/223737 mentions this issue: runtime: don't send preemption signal if there is a signal pending
Change https://golang.org/cl/223939 mentions this issue: [release-branch.go1.14] runtime: don't send preemption signal if there is a signal pending
Most helpful comment
Attached is a single-file repro. It hangs pretty reliably when run 100 times.
Rename it from test.txt to test.go. For some reason you can't attach *.go files in github. Isn't this a programming website?
test.txt