Please answer these questions before submitting your issue. Thanks!
go version
go version go1.6.2 freebsd/amd64
go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GO15VENDOREXPERIMENT="1"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="clang++"
```
package main

/* stdlib includes */
import (
	"fmt"
	"os/exec"
)

func run(done chan struct{}) {
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		goto finished
	}
	cmd.Wait()
finished:
	done <- struct{}{}
	return
}

func main() {
	fmt.Println("Starting a bunch of goroutines...")

	// 8 & 16 are arbitrary
	done := make(chan struct{}, 16)

	for i := 0; i < 8; i++ {
		go run(done)
	}

	for {
		select {
		case <-done:
			go run(done)
		}
	}
}
```
I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.
Various types of panics, caused by what looks to be corruption within the finalizer lists, which I assume is the result of a race condition. These panics can happen as quickly as 2 minutes in, or take much longer; 10 minutes seems a good round number.
Occasionally addspecial gets stuck in an infinite loop holding the lock, and the process wedges. This is illustrated in log 1462933614, with x.next pointing to x. This appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.
I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.
Here are some sample messages:
fatal error: runtime.SetFinalizer: finalizer already set
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty
1462926841-SetFinalizer-ex1.txt
1462926969-SetFinalizer-ex2.txt
1462933295-nonempty-check-fails.txt
1462933614-wedged.txt
This was run on an 8-core processor and on a 4-core, 8-thread processor with ECC RAM, with similar results.
Additionally, while this example is extreme, it also represents the core functionality of a project I've been working on part-time for many months. I'm happy to provide any further assistance diagnosing this issue - I'm very invested!
Here's another panic experienced in mallocgc by the same sample code:
@derekmarcotte, can you also reproduce this at master? (which will become Go 1.7)
And do you only see it on FreeBSD, or other operating systems as well?
/cc @aclements @ianlancetaylor
@RLH, weren't you seeing "finalizer already set" failures a little while ago? Did you track that down? Could it have been related?
On closer inspection of the failures you posted (thanks for collecting several, BTW), this smells like memory corruption. Just guessing, there are a few likely culprits. It may be the finalizer code, but I actually think that's less likely. More likely is that it's the fork/exec code: that code is really subtle, mucks with the address space, and contains system-specific parts (which would explain why it's showing up on FreeBSD, but I haven't been able to reproduce it on Linux yet).
@derekmarcotte, can you try commenting out the runtime.SetFinalizer call in newProcess in os/exec.go (your test doesn't need that finalizer) and see if you can still reproduce it? If you can, that will rule out finalizers.
Note that FreeBSD runs via gomote, if this is that easily reproducible. I haven't yet tried.
Just got a golang/go dev environment set up on my machine (was from FreeBSD packages). Will report back soon.
Here are the heads of a bunch of logs, with the epoch at the start of the process so you can see the interval. I suspected a race rather than memory corruption because, by and large, it is "finalizer already set" that crashes the process. I thought maybe the GC was setting these as free (or otherwise touching them) before SetFinalizer had a chance to set their value.
I didn't include too many of them in my initial report, because I thought they were largely redundant.
@bradfitz: these logs are against master:
1463168442
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set
runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420059f48 sp=0xc420059f30
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420059f80 sp=0xc420059f48
runtime.systemstack(0xc420019500)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420059f88 sp=0xc420059f80
1463170469
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set
runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0x7fffffffea80 sp=0x7fffffffea68
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0x7fffffffeab8 sp=0x7fffffffea80
runtime.systemstack(0x52ad00)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0x7fffffffeac0 sp=0x7fffffffeab8
1463170494
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set
runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420067f48 sp=0xc420067f30
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420067f80 sp=0xc420067f48
runtime.systemstack(0xc420019500)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420067f88 sp=0xc420067f80
1463171133
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set
runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc4202cbf48 sp=0xc4202cbf30
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc4202cbf80 sp=0xc4202cbf48
runtime.systemstack(0xc42001c000)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc4202cbf88 sp=0xc4202cbf80
@aclements: I will try your patch next. One variable at a time.
The "fatal error: runtime.SetFinalizer: finalizer already set" bug I was
seeing a few days ago were on the 1.8 dev.garbage branch and the result of
a TOC write barrier not marking an object as being published, TOC doing a
rollback, and reallocating a new object over an old one. Both objects had
finalizers and the bug was tripped.
None of the TOC code is in 1.7 or for that matter on TIP. I can force
myself to imagine that the 1.7 allocCache code could cause a similar
situation if the allocCache and bitmaps were not coherent. but if that were
the case it would expect lots of failures all over the place and across all
platforms, not just freeBSD.
@aclements: I've run it with your patch, although I haven't been able to babysit it too much.
The first time, all threads were idle after a number of hours (i.e. 0% cpu across the board). Connecting gdb to that process gave me trouble, and I couldn't get any logging out of it.
This morning, I was able to connect to a different process that looks a lot like 1462933614-wedged.txt. I've attached a log from gdb there:
Will keep trying to come up with more info.
@aclements: Here's some more logs from a binary build with the patch:
1463315708-finalizer-already-set.txt
1463349807-finalizer-already-set.txt
1463352601-workbuf-not-empty.txt
1463362849-workbuf-empty.txt
1463378745-wedged-gdb.txt
Please let me know if I can be of further assistance.
Thanks for the logs! I hadn't realized there were two finalizers involved here. Could you also comment out the SetFinalizer in NewFile in os/file_unix.go and see if it's still reproducible? (Your test also doesn't need that finalizer.)
I suspected a race vs. memory corruption because by and large it is the finalizer already set that crashes the process.
I didn't mean to say that it isn't necessarily a race. It's actually quite likely a race, but it's resulting in corruption of internal runtime structures, which suggests that the race is happening on freed memory. The "workbuf is empty" failure mode especially points at memory corruption, which is why my initial guess is that the finalizers (and the specials queue in general) may be victims rather than perpetrators. It's also easy to get finalizers out of the picture, while it's harder to get fork/exec out of the picture without completely changing the program. :)
Thanks @aclements !
One crash, ~4 hours into running, since removing the SetFinalizer in NewFile.
I have a second process running for almost 11 hours, with 4 of the threads wedged, but it is still doing work.
After ~11 hours + 5 minutes, the process panic'd:
Thanks for the new logs. I suspect that's actually not the same failure, which suggests that the original problem is in fact related to finalizers.
My pleasure, thanks for the feedback. Are there any next steps for me? (I'm likely to poke around this bottom issue, although I'm neither a go runtime guy, nor a FreeBSD systems-level guy - yet. Would like to be as helpful as possible.)
Thanks again!
I'm going to post a few more here. This'll be my last batch, unless requested. :smile: 1463584079 is a new message.
==> 1463485780 <==
Starting a bunch of goroutines...
fatal error: workbuf is not empty
runtime stack:
runtime.throw(0x4bff02, 0x14)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*workbuf).checkempty(0x8007dae00)
/home/derek/go/src/github.com/golang/go/src/runtime/mgcwork.go:301 +0x3f
runtime.getempty(0x8007dae00)
==> 1463584079 <==
Starting a bunch of goroutines...
fatal error: MHeap_AllocLocked - MSpan not free
runtime stack:
runtime.throw(0x4c2528, 0x22)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*mheap).allocSpanLocked(0x52dbe0, 0x1, 0xc4200ae1a0)
/home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:637 +0x498
runtime.(*mheap).alloc_m(0x52dbe0, 0x1, 0x12, 0xc420447fe0)
/home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:510 +0xd6
==> 1463603516 <==
Starting a bunch of goroutines...
acquirep: p->m=842351485952(10) p->status=1
acquirep: p->m=842350813184(4) p->status=1
fatal error: acquirep: invalid p state
fatal error: acquirep: invalid p state
runtime stack:
runtime.throw(0x4c0ab3, 0x19)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc420015500)
==> 1463642257 <==
Starting a bunch of goroutines...
acquirep: p->m=0(0) p->status=2
fatal error: acquirep: invalid p state
acquirep: p->m=0(0) p->status=3
fatal error: acquirep: invalid p state
runtime stack:
runtime.throw(0x4c0ab3, 0x19)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc42001c000)
Thanks! I assume these last few failures are also with master and with the two SetFinalizer calls commented out?
That's correct! Thanks. Anyone else able to reproduce?
@derekmarcotte, what version of FreeBSD and how many CPUs are you running with? I haven't had any luck yet reproducing on FreeBSD 10.1 with 2 CPUs.
@aclements: both machines were 4 core, 8 thread CPUs:
Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (32GB/ECC) - none of the logs included were from this machine, but I could not keep my process (of my project, not the listing in this issue) running for very long
AMD FX(tm)-8350 Eight-Core Processor (8GB) - this is my dev box where all the logs are included
The Xeon is running 10.3-RELEASE, and the AMD was running 10.1-RELEASE at the time of the logs (has since been upgraded to 10.3-RELEASE).
I suspect these hosts chew through many more invocations in the same amount of time than a 2-core machine would, which additionally increases the probability of contention/collisions in any given instant.
The Xeon has since moved to production, so I don't have that hardware at my disposal for the time being, although I might be able to arrange something if it's required.
I can get dmesgs/kldstats for the Xeon and the AMD if helpful (I would rather post them out of band).
@derekmarcotte, thanks for the extra details (I see now you already gave the CPU configuration; sorry I missed that).
Two more experiments to try:
@aclements, Thanks for your suggestions. I'm currently exploring a different option. I was using gb for building my project (sorry! forgot to mention), and additionally for this test case.
I certainly didn't expect wildly differing binaries in a project with no external dependencies, as gb uses the go compiler internally. I've got more research to do here to account for this, so I apologize for that.
I've built using go directly and am in the process of testing. So far it has been running for 12 hours, without problem (with the SetFinalizers disabled). I have had previous test runs last this long, so I'm not ready to call it a success just yet. I'll be out of town for the next few days, so I can leave it running for a while and see where it ends up.
I think this is a very promising lead, based on the objdump of the two artefacts. It might be interesting to ask about the build tool in the issue report template, given the build-tool ecosystem that is currently out there (or the fact that there is an ecosystem at all).
@aclements, rebuilding gb from source using the go built from master, and then rebuilding the test with the new gb, creates nearly identical binaries (minus file-path TEXT entries); this is to be expected.
Perhaps there's something to this. Will keep you posted.
cc @davecheney
@aclements, the go-only build of the binary did eventually crash... somewhere around 24 hours (~153M goroutines later), so I don't think it's gb-related. I'm going to re-build with the following, per your suggestion:
```
func run(done chan struct{}) {
	cmd := exec.Command("doesnotexist")
	cmd.Wait()
	done <- struct{}{}
	return
}
```
@aclements, after the above change, it ran for 4 days without a panic.
@derekmarcotte, argh, I'm sorry. That wasn't a useful test. :( That eliminates the fork, the exec, _and_ the finalizer, so there isn't really anything left. It needs to be something like:
```
func run(done chan struct{}) {
	cmd := exec.Command("/doesnotexist")
	cmd.Start()
	cmd.Wait()
	done <- struct{}{}
	return
}
```
It's important that the path be absolute. Otherwise, Go will try to resolve the path itself, fail at that point, and not do anything for the subsequent method calls. It's also important that the cmd.Start() still happen (even though it will fail). Again, without that, the Wait just returns nil immediately.
I'm sorry I didn't catch that earlier.
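To make that distinction concrete, here is a minimal standalone sketch (illustrative only, not part of the original reproducer) of the behaviour described above: with a relative name the stored lookup error comes straight back from Start without forking, while the absolute path goes all the way to fork/exec before failing.

```
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Relative name: exec.Command resolves it via PATH up front and records
	// the lookup failure, so Start returns that error without ever forking.
	rel := exec.Command("doesnotexist")
	fmt.Println("relative:", rel.Start())

	// Absolute path: no lookup is done, so Start really forks and execs;
	// the exec fails in the child and the error is reported back to Start.
	abs := exec.Command("/doesnotexist")
	fmt.Println("absolute:", abs.Start())
}
```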
@aclements, at least we know that Go can spin for 4 days without crashing on my dev machine - so not a total waste.
I've recompiled with the above, and crashed after ~32 hours. Log attached:
@derekmarcotte, thanks! Was that with go or gb? With or without the SetFinalizer calls?
Also, 1.6.2 or master?
@aclements great questions. Was thinking we ought to recap all the variables at this point anyway.
Trying to narrow things down as much as possible. So, this is with go build using master (as of 7af2ce3), with this diff applied:
diff --git a/src/os/exec.go b/src/os/exec.go
index 239fd92..6a8eed5 100644
--- a/src/os/exec.go
+++ b/src/os/exec.go
@@ -5,7 +5,6 @@
package os
import (
- "runtime"
"sync/atomic"
"syscall"
)
@@ -19,7 +18,6 @@ type Process struct {
func newProcess(pid int, handle uintptr) *Process {
p := &Process{Pid: pid, handle: handle}
- runtime.SetFinalizer(p, (*Process).Release)
return p
}
diff --git a/src/os/file_unix.go b/src/os/file_unix.go
index 9b64f21..a997e9e 100644
--- a/src/os/file_unix.go
+++ b/src/os/file_unix.go
@@ -54,7 +54,6 @@ func NewFile(fd uintptr, name string) *File {
return nil
}
f := &File{&file{fd: fdi, name: name}}
- runtime.SetFinalizer(f.file, (*file).close)
return f
}
with source:
```
package main

/* stdlib includes */
import (
	"fmt"
	"os/exec"
)

func run(done chan struct{}) {
	cmd := exec.Command("/doesnotexist")
	cmd.Start()
	cmd.Wait()
	done <- struct{}{}
	return
}

func main() {
	fmt.Println("Starting a bunch of goroutines...")

	// 8 & 16 are arbitrary
	done := make(chan struct{}, 16)

	for i := 0; i < 8; i++ {
		go run(done)
	}

	for {
		select {
		case <-done:
			go run(done)
		}
	}
}
```
It sounds like this is FreeBSD-specific, so marking as such. Please correct me if I am wrong.
@ianlancetaylor : thanks for this. I'm just building a linux box dedicated to testing tonight. I haven't yet been able to test on linux. I'll update the thread as I find more information out.
Looks like I'm having the same issue with github.com/influxdb/telegraf. Panic log attached.
acquirep: p->m=859531515904(14) p->status=1
fatal error: acquirep: invalid p state
acquirep: p->m=859535287296(8) p->status=1
fatal error: acquirep: invalid p state
% uname -a
FreeBSD 9.3-RELEASE-p5 FreeBSD 9.3-RELEASE-p5 #0: Mon Nov 3 22:38:58 UTC 2014 [email protected]:/usr/obj/usr/src/sys/GENERIC amd64
% cat /var/run/dmesg.boot | grep -i cpu
CPU: Intel(R) Xeon(R) CPU E3-1270 V2 @ 3.50GHz (3492.14-MHz K8-class CPU)
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
go version go1.6.2 freebsd/amd64
I'll try to rebuild with go tip.
@derekmarcotte, are your test machines metal or VMs?
I've just kicked off a test on a 10.2-RELEASE box here to see if I can repro locally.
It took many hours but I did get a panic in the end.
This is from go 1.6.2 (no patches) on 10.2-RELEASE amd64 (metal), no virtualisation:
./test
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set
runtime stack:
runtime.SetFinalizer.func2()
/usr/local/go/src/runtime/mfinal.go:372 +0x6f
goroutine 23221055 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc8202eca58 sp=0xc8202eca50
runtime.SetFinalizer(0x4f2f00, 0xc8204a19e0, 0x4d68a0, 0x54f550)
/usr/local/go/src/runtime/mfinal.go:374 +0x4b5 fp=0xc8202ecbd8 sp=0xc8202eca58
os.NewFile(0x3, 0x51efd0, 0x9, 0x323200000000)
/usr/local/go/src/os/file_unix.go:57 +0xfc fp=0xc8202ecc30 sp=0xc8202ecbd8
os.OpenFile(0x51efd0, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/os/file_unix.go:123 +0x1bd fp=0xc8202ecca8 sp=0xc8202ecc30
os.Open(0x51efd0, 0x9, 0xdeaddeaddeaddead, 0x0, 0x0)
/usr/local/go/src/os/file.go:244 +0x48 fp=0xc8202ecce8 sp=0xc8202ecca8
os/exec.(*Cmd).stdin(0xc8203eac80, 0x0, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:171 +0x6e fp=0xc8202ecd80 sp=0xc8202ecce8
os/exec.(*Cmd).Start(0xc8203eac80, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:316 +0x2f4 fp=0xc8202ecf68 sp=0xc8202ecd80
main.run(0xc82006c060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x50 fp=0xc8202ecfa8 sp=0xc8202ecf68
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc8202ecfb0 sp=0xc8202ecfa8
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:31 +0x19b
are your test machines metal or VM's?
For me, telegraf crashes on both vmware VMs and baremetal machines.
@stevenh: So it's random, and I think the probability of crashing is a function of the number of CPUs being used; while I say 8 is arbitrary, it's also the number of CPUs I had on hand. If you have more, by all means set it higher.
Try it a few times and you'll see. I just stuck it in a loop like `while true; do date; ./ex-from-comment-225006179; done`, and then you can see them over a day period.
Anecdotally, I've been able to run a longer-lived process with GOMAXPROCS=2, but will be doing more thorough testing.
Most of those dumps are from a jail running on bare metal. The Xeon was straight on the jail host (although there were jails running).
@stevenh: Oh, and the one that uses /doesnotexist crashes after 32 hours, but the one that uses true crashes much sooner (because it appears there are multiple issues going on).
With GOMAXPROCS=1 and "true" it's been running for 20+ hours so far, so it does indeed seem to be affected by the number of procs available.
I've been running /doesnotexist with stock go 1.6.2 for just shy of a week with GOMAXPROCS="2". Interesting to note: there are 11 OS threads, which is surprising to me, but may be expected?
I've been able to cause the panic much quicker being more aggressive with the garbage collector. I've been using this script to test:
```
while true; do
	echo "==== NEW RUN $(date) ===="
	echo
	time sh -c 'export GOGC=5; export GODEBUG=gctrace=2,schedtrace=100; ./15658-doesnotexit'
	echo
	echo
done >> ../logs/15658-doesnotexit-logs-gogc5-gtrace2 2>> ../logs/15658-doesnotexit-logs-gogc5-gtrace2
```
Here's a sample of the crashing interval:
==== NEW RUN Tue Aug 30 16:36:11 EDT 2016 ====
==== NEW RUN Tue Aug 30 16:37:18 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:13:02 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:28:30 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:54:22 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:04:29 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:36:36 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:57:48 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:09:12 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:37:33 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:42:37 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:52:31 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:03:22 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:07:05 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:10:31 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:25:37 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:31:00 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:31:28 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:34:30 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:38:16 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:40:09 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:56:57 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:31:54 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:39:05 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:51:06 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:28:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:32:35 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:53:50 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:55:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:57:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:58:08 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:59:58 EDT 2016 ====
==== NEW RUN Tue Aug 30 23:52:15 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:28:37 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:46:25 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:53:03 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:57:34 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:06:02 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:13:20 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:21:55 EDT 2016 ====
==== NEW RUN Wed Aug 31 02:17:49 EDT 2016 ====
==== NEW RUN Wed Aug 31 02:55:35 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:43:36 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:47:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:48:14 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:50:32 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:59:05 EDT 2016 ====
==== NEW RUN Wed Aug 31 04:43:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 04:55:07 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:04:41 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:34:44 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:50:45 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:53:55 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:58:08 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:19:07 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:34:13 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:41:30 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:52:08 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:54:32 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:56:02 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:59:04 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:59:48 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:01:46 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:06:30 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:23:03 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:29:57 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:41:58 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:43:14 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:45:00 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:59:22 EDT 2016 ====
==== NEW RUN Wed Aug 31 09:24:11 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:01:22 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:05:45 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:09:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:12:51 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:23:35 EDT 2016 ====
^^^^ process was idle/wedged, sent kill manually
==== NEW RUN Thu Sep 1 06:47:27 EDT 2016 ====
==== NEW RUN Thu Sep 1 06:49:27 EDT 2016 ====
$ go version
go version go1.7 freebsd/amd64
$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH="/home/derek/dev/gopath"
GORACE=""
GOROOT="/home/derek/go"
GOTOOLDIR="/home/derek/go/pkg/tool/freebsd_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -gno-record-gcc-switches"
CXX="clang++"
CGO_ENABLED="1"
I am trying to reproduce this on DragonFly BSD. So far I was able to reproduce the idle state:
localhost% uname -a
DragonFly localhost.localdomain 4.6-RELEASE DragonFly v4.6.0-RELEASE #0: Mon Aug 1 12:46:25 EDT 2016 [email protected]:/usr/obj/build/sys/X86_64_GENERIC x86_64
localhost% go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="dragonfly"
GOOS="dragonfly"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/dragonfly_amd64"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build276771230=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
localhost% go version
go version go1.7 dragonfly/amd64
SCHED 783590ms: gomaxprocs=2 idleprocs=0 threads=8 spinningthreads=0 idlethreads=5 runqueue=8 [0 1]
I have been running this test inside Virtualbox, with 2 virtual CPUs.
On DragonFly it's actually causing a kernel panic. Therefore I created a bug report there; it includes the dump.
Cannot reproduce on freebsd-amd64-gce101 gomote running FreeBSD 10.1.
The Dragonfly panic is suggestive that there may be a similar kernel problem in FreeBSD.
Unlikely we are going to solve this.
Thanks very much for taking a look. I'm very grateful for your time.
Are you able to elaborate more fully on what you mean by "Unlikely we are going to solve this"? Are you referring to the issue in general, or the DragonFly kernel crash? Is the "we" the Go team?
I believe the implication of a kernel problem in FreeBSD, because of a panic in DragonFly, simply doesn't follow. I can give a lot more detail about this, but I fear it's redundant, unless requested.
For what it's worth, this issue is reliably reproducible enough that I've been able to bisect down to a commit that makes sense (https://github.com/golang/go/commit/e6d8bfe218a5f387d7aceddcaee5067a59181838), in my first go. Additionally, the commits leading up to the implicated commit in the bisect are documentation-only, and don't reproduce the behaviour. I have very high confidence in the bisect.
I haven't published this work yet, as I've been asking some new questions based on the bisect. I was hoping to be able to speak intelligently about this when I reported results, but I feel now the pressure has increased. I've been making progress slowly, but I am making progress (this isn't my daytime gig). I will post a repo later today with the work and the bisect log, so that others can reference it.
Here's my thoughts:
New lines of thought are:
I plan to investigate these questions, but again, this isn't my daytime gig.
As I mentioned earlier, I'm very happy to work with others on this.
I'd be very interested in more details about the environment you were unable to reproduce in. Specifically, how many CPUs? Is the gomote a virtualized environment, or do you have direct access to hardware somewhere? Perhaps we can compare build artefacts? i.e. does your artefact crash in my environment, and vice versa?
Thanks again for taking a look.
@derekmarcotte, thanks for investigating. gomote is our interface to freshly-created FreeBSD VMs on GCE.
See
https://github.com/golang/build/blob/master/env/freebsd-amd64/make.bash
https://github.com/golang/build/blob/master/dashboard/builders.go#L81
They're both n1-highcpu-4, which is https://cloud.google.com/compute/docs/machine-types#standard_machine_types ... "Standard 4 CPU machine type with 4 virtual CPUs and 15 GB of memory. ... For the n1 series of machine types, a virtual CPU is implemented as a single hardware hyper-thread on a 2.6 GHz Intel Xeon E5 (Sandy Bridge), 2.5 GHz Intel Xeon E5 v2 (Ivy Bridge), 2.3 GHz Intel Xeon E5 v3 (Haswell), or 2.2 GHz Intel Xeon E5 v4 (Broadwell)"
So it's at least 2 cores.
@derekmarcotte Thanks for continuing to investigate. I'm fairly convinced that this is a kernel problem, and your identification of that CL as the cause just makes that seems more likely to me. I have never been able to recreate the problem myself. I suspect that it requires real hardware, not a VM.
@rsc Actually the kernel bug causing the panic existed only for a small amount of time on DragonFly. It most likely was introduced with DragonFly 4.6.0. Using 4.6.1 I fail to reproduce it. So I am sorry for laying a false trail there.
I will keep it running though, to make sure nothing happens, but I already tested it far longer than originally.
Also as a side note, since the topic of VMs came up: All my tests have been done using Virtualbox.
Here's the repo, as promised: https://github.com/derekmarcotte/goruntime-testing
@ianlancetaylor: Thanks for checking it out. Would you be interested in running 15658-doesnotexist.go to completion? Should be on the order of 8-12 hours, maybe more? If I could suggest pulling the command line from the bisect-15658, to get the environment a bit more hostile. Check the logs directory for the bisect.
If there were a kernel problem, I would expect others to have problems in the wild with, say, inetd (because it is a glorified fork/exec machine); this would be a pretty hot code path for many production FreeBSD uses. Additionally, if this were a kernel problem, I would expect the kernel data structures themselves to become corrupt (and the kernel to panic, as DragonFly did), but I see no evidence of this. My main testing machine has a current uptime of 105 days, with probably hundreds of runtime panics, yet no other services have faltered and everything behaves flawlessly.
I've been running this on 1.7.1 for 12h+ now, on a 24-core E5-2640 running 8.2-RELEASE, with no hang or panic:
```
package main

/* stdlib includes */
import (
	"fmt"
	"os/exec"
	"runtime"
)

func run(done chan struct{}) {
	cmd := exec.Command("true")
	cmd.Start()
	cmd.Wait()
	done <- struct{}{}
	return
}

func main() {
	fmt.Println("Starting a bunch of goroutines...")
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// 8 & 16 are arbitrary
	done := make(chan struct{}, 16)

	for i := 0; i < 8; i++ {
		go run(done)
	}

	for range done {
		go run(done)
	}
}
```
It's not uncommon for Go programs to behave significantly differently from C programs, to the extent that they expose problems that have been latent in kernels for many years. I'm not saying I'm 100% sure that's what is going on here, but it's definitely possible, especially given that fork/exec seems to be required.
Using stevenh's code I received this seemingly new panic.
OS: FreeBSD 11.0
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz (2394.61-MHz K8-class CPU)
GOGC=5 ./test
Starting a bunch of goroutines...
GOMAXPROCS: 8
runtime: failed MSpanList_Remove 0x800528320 0x8005c7d20 0x53b670 0x53b660
fatal error: MSpanList_Remove
runtime stack:
runtime.throw(0x4c973e, 0x10)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*mSpanList).remove(0x53b660, 0x800528320)
/usr/local/go/src/runtime/mheap.go:1001 +0x19d
runtime.(*mcentral).cacheSpan(0x53b650, 0xc42001b228)
/usr/local/go/src/runtime/mcentral.go:55 +0x3d0
runtime.(*mcache).refill(0x800524e10, 0xc40000001e, 0xc4200b82e0)
/usr/local/go/src/runtime/mcache.go:121 +0xae
runtime.(*mcache).nextFree.func1()
/usr/local/go/src/runtime/malloc.go:505 +0x33
runtime.systemstack(0xc42007cee0)
/usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1079
goroutine 643553 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc4201e68e0 sp=0xc4201e68d8
runtime.(*mcache).nextFree(0x800524e10, 0x1e, 0x0, 0x40f354, 0x80064429b)
/usr/local/go/src/runtime/malloc.go:506 +0xb2 fp=0xc4201e6938 sp=0xc4201e68e0
runtime.mallocgc(0x380, 0x4a8020, 0x1, 0xc4200b82e0)
/usr/local/go/src/runtime/malloc.go:658 +0x809 fp=0xc4201e69d8 sp=0xc4201e6938
runtime.makeslice(0x4a8020, 0x338, 0x338, 0xc4200b82e0, 0x1f, 0x40e996)
/usr/local/go/src/runtime/slice.go:57 +0x7b fp=0xc4201e6a30 sp=0xc4201e69d8
syscall.ByteSliceFromString(0xc420010000, 0x337, 0xc4201064d8, 0xc4200b82e0, 0x1f, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:53 +0x64 fp=0xc4201e6a90 sp=0xc4201e6a30
syscall.BytePtrFromString(0xc420010000, 0x337, 0xc4200b82e0, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:69 +0x39 fp=0xc4201e6ad8 sp=0xc4201e6a90
syscall.SlicePtrFromStrings(0xc42000c840, 0x15, 0x15, 0xc4200881f0, 0x2, 0x2, 0x0, 0x0)
/usr/local/go/src/syscall/exec_unix.go:87 +0x9a fp=0xc4201e6b48 sp=0xc4201e6ad8
syscall.forkExec(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6d60, 0xc420162018, 0x40e996, 0xc4201e6cc8)
/usr/local/go/src/syscall/exec_unix.go:158 +0x16e fp=0xc4201e6c78 sp=0xc4201e6b48
syscall.StartProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6d60, 0x2, 0x4, 0x44ab20, 0xc4200f0280)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc4201e6cd0 sp=0xc4201e6c78
os.startProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6f08, 0xc42000c840, 0x15, 0x15)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3 fp=0xc4201e6db8 sp=0xc4201e6cd0
os.StartProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64 fp=0xc4201e6e10 sp=0xc4201e6db8
os/exec.(*Cmd).Start(0xc42000c2c0, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9 fp=0xc4201e6f60 sp=0xc4201e6e10
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68 fp=0xc4201e6fa8 sp=0xc4201e6f60
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc4201e6fb0 sp=0xc4201e6fa8
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 1 [runnable]:
main.main()
/home/reezer/test.go:30 +0x222
goroutine 643497 [runnable]:
sync.(*RWMutex).RUnlock(0x550950)
/usr/local/go/src/sync/rwmutex.go:55
syscall.Environ(0xc4200fc000, 0x15, 0x15)
/usr/local/go/src/syscall/env_unix.go:148 +0x210
os.Environ(0x0, 0x0, 0x0)
/usr/local/go/src/os/env.go:116 +0x22
os/exec.(*Cmd).envv(0xc420076160, 0xc420126080, 0x0, 0x1)
/usr/local/go/src/os/exec/exec.go:171 +0x38
os/exec.(*Cmd).Start(0xc420076160, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:356 +0x2d9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 643540 [runnable]:
syscall.Syscall(0x3, 0x1c, 0xc4201fabc8, 0x8, 0x0, 0x8, 0x0)
/usr/local/go/src/syscall/asm_freebsd_amd64.s:21 +0x5
syscall.readlen(0x1c, 0xc4201fabc8, 0x8, 0x2, 0xc42010e000, 0x16)
/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1303 +0x56
syscall.forkExec(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201fad60, 0xc4200ea020, 0x40e996, 0xc4201facc8)
/usr/local/go/src/syscall/exec_unix.go:202 +0x39f
syscall.StartProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201fad60, 0x2, 0x4, 0x44ab20, 0xc4200f0140)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201faf08, 0xc420076420, 0x15, 0x15)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201faf08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42000c580, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 643472 [runnable]:
syscall.Syscall6(0x214, 0x0, 0xeb54, 0x0, 0x18, 0x0, 0x0, 0xeb54, 0x0, 0x0)
/usr/local/go/src/syscall/asm_freebsd_amd64.s:44 +0x5
os.(*Process).blockUntilWaitable(0xc420166090, 0xc420014060, 0xc420014060, 0x4)
/usr/local/go/src/os/wait_wait6.go:29 +0x6b
os.(*Process).wait(0xc420166090, 0x0, 0xc4201ea160, 0x0)
/usr/local/go/src/os/exec_unix.go:22 +0xab
os.(*Process).Wait(0xc420166090, 0xc4201e8f08, 0x0, 0x0)
/usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc4201ea160, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc420072060)
/home/reezer/test.go:13 +0x76
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 643498 [semacquire]:
sync.runtime_Semacquire(0x550934)
/usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*Mutex).Lock(0x550930)
/usr/local/go/src/sync/mutex.go:85 +0xd0
sync.(*RWMutex).Lock(0x550930)
/usr/local/go/src/sync/rwmutex.go:86 +0x31
syscall.forkExec(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7d60, 0xc420080030, 0x40e996, 0xc4201f7cc8)
/usr/local/go/src/syscall/exec_unix.go:185 +0x25e
syscall.StartProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7d60, 0x2, 0x4, 0x44ab20, 0xc420108100)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7f08, 0xc420132000, 0x15, 0x15)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420044420, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 643440 [runnable]:
syscall.Syscall(0x3, 0xc, 0xc4201e7bc8, 0x8, 0x0, 0x8, 0x0)
/usr/local/go/src/syscall/asm_freebsd_amd64.s:21 +0x5
syscall.readlen(0xc, 0xc4201e7bc8, 0x8, 0x2, 0xc420106160, 0x16)
/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1303 +0x56
syscall.forkExec(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7d60, 0xc420080018, 0x40e996, 0xc4201e7cc8)
/usr/local/go/src/syscall/exec_unix.go:202 +0x39f
syscall.StartProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7d60, 0x2, 0x4, 0x44ab20, 0xc4200f0140)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7f08, 0xc420132420, 0x15, 0x15)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42000c420, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 643499 [runnable]:
syscall.fcntl(0x4, 0x2, 0x1, 0x0, 0x0, 0xc4c0000000)
/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:255
syscall.forkExecPipe(0xc4201fbc18, 0x2, 0x2, 0xc42018e8f0, 0x16)
/usr/local/go/src/syscall/exec_bsd.go:264 +0x87
syscall.forkExec(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbd60, 0xc420148040, 0x40e996, 0xc4201fbcc8)
/usr/local/go/src/syscall/exec_unix.go:188 +0x281
syscall.StartProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbd60, 0x2, 0x4, 0x44ab20, 0xc4200f00c0)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbf08, 0xc4200446e0, 0x15, 0x15)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbf08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420044580, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
@reezer Which version of Go?
go1.7.1 from the FreeBSD package
Go 1.7.1 has two notable bugs you could be hitting. See https://github.com/golang/go/issues?q=milestone%3AGo1.7.3
Can you try Go 1.7.3 or tip instead?
I will still test it longer, but so far it appears to be fixed.
Again, I am sorry for the noise. It was a test system I quickly set up and should have taken greater care to make sure it runs the latest version.
Okay, it takes a while on go1.7.3 freebsd/amd64 (FreeBSD 11.0), but I again receive one of the original errors. For people who want to try to reproduce it. GOGC=5 makes a big difference in the time it takes until the error appears.
```
GOGC=5 ./test &> out.log
cat out.log
Starting a bunch of goroutines...
GOMAXPROCS: 8
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
runtime stack:
runtime.throw(0x4ca671, 0x14)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x80052a260)
/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.handoff(0x800585000, 0xc41fff49b0)
/usr/local/go/src/runtime/mgcwork.go:419 +0x26
runtime.(*gcWork).balance(0xc420019c28)
/usr/local/go/src/runtime/mgcwork.go:263 +0x54
runtime.gcDrainN(0xc420019c28, 0x10000, 0xc4fffffffe)
/usr/local/go/src/runtime/mgcmark.go:1042 +0x195
runtime.gcAssistAlloc.func1()
/usr/local/go/src/runtime/mgcmark.go:462 +0xe5
runtime.systemstack(0xc42017c380)
/usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1079
goroutine 681244 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc42014b8b0 sp=0xc42014b8a8
runtime.gcAssistAlloc(0xc420129520)
/usr/local/go/src/runtime/mgcmark.go:499 +0x194 fp=0xc42014b938 sp=0xc42014b8b0
runtime.mallocgc(0x7, 0x4a80a0, 0x1, 0xc42000e220)
/usr/local/go/src/runtime/malloc.go:563 +0x9a4 fp=0xc42014b9d8 sp=0xc42014b938
runtime.makeslice(0x4a80a0, 0x7, 0x7, 0xc42000e220, 0x16, 0x40e996)
/usr/local/go/src/runtime/slice.go:57 +0x7b fp=0xc42014ba30 sp=0xc42014b9d8
syscall.ByteSliceFromString(0xc42000a010, 0x6, 0xc4200a01a0, 0xc42000e220, 0x16, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:53 +0x64 fp=0xc42014ba90 sp=0xc42014ba30
syscall.BytePtrFromString(0xc42000a010, 0x6, 0xc42000e220, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:69 +0x39 fp=0xc42014bad8 sp=0xc42014ba90
syscall.SlicePtrFromStrings(0xc42012c300, 0x18, 0x18, 0xc42013c050, 0x2, 0x2, 0x0, 0x0)
/usr/local/go/src/syscall/exec_unix.go:87 +0x9a fp=0xc42014bb48 sp=0xc42014bad8
syscall.forkExec(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bd60, 0xc42010a008, 0x40e996, 0xc42014bcc8)
/usr/local/go/src/syscall/exec_unix.go:158 +0x16e fp=0xc42014bc78 sp=0xc42014bb48
syscall.StartProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bd60, 0x2, 0x4, 0x44ab30, 0xc4201ce280)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc42014bcd0 sp=0xc42014bc78
os.startProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bf08, 0xc42012c300, 0x18, 0x18)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3 fp=0xc42014bdb8 sp=0xc42014bcd0
os.StartProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bf08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64 fp=0xc42014be10 sp=0xc42014bdb8
os/exec.(*Cmd).Start(0xc4201962c0, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9 fp=0xc42014bf60 sp=0xc42014be10
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68 fp=0xc42014bfa8 sp=0xc42014bf60
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42014bfb0 sp=0xc42014bfa8
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 1 [chan receive]:
main.main()
/home/reezer/test.go:30 +0x222
goroutine 681182 [semacquire]:
syscall.forkAndExecInChild(0xc42012e150, 0xc42012e160, 0x2, 0x2, 0xc420194000, 0x19, 0x19, 0x0, 0x0, 0xc4201c8d60, ...)
/usr/local/go/src/syscall/exec_bsd.go:54 +0x5e
syscall.forkExec(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8d60, 0xc42013a010, 0x40e996, 0xc4201c8cc8)
/usr/local/go/src/syscall/exec_unix.go:193 +0x334
syscall.StartProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8d60, 0x2, 0x4, 0x44ab30, 0xc420134000)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8f08, 0xc420076180, 0x18, 0x18)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc4201b2420, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681184 [semacquire]:
os/exec.(*Cmd).Start(0xc4201d2000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:367 +0x466
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681213 [semacquire]:
os.(*Process).wait(0xc42010e150, 0x0, 0xc4201b22c0, 0x0)
/usr/local/go/src/os/exec_unix.go:48 +0x17a
os.(*Process).Wait(0xc42010e150, 0xc4201adf08, 0x0, 0x0)
/usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc4201b22c0, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc420072060)
/home/reezer/test.go:13 +0x76
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681183 [semacquire]:
os.newProcess(0x14bf2, 0x0, 0xc4200ec010)
/usr/local/go/src/os/exec.go:23 +0x31
os.startProcess(0xc42000a4d0, 0xd, 0xc4200ec010, 0x1, 0x1, 0xc4201abf08, 0xc42000c780, 0x18, 0x18)
/usr/local/go/src/os/exec_posix.go:49 +0x2c0
os.StartProcess(0xc42000a4d0, 0xd, 0xc4200ec010, 0x1, 0x1, 0xc4201abf08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420196160, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681161 [running]:
goroutine running on other thread; stack unavailable
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681214 [semacquire]:
syscall.ByteSliceFromString(0xc42000a010, 0x6, 0xc4200bc080, 0x16, 0x16, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc42000a010, 0x6, 0xc4200bc080, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.SlicePtrFromStrings(0xc420180180, 0x18, 0x18, 0xc420198290, 0x2, 0x2, 0x0, 0x0)
/usr/local/go/src/syscall/exec_unix.go:87 +0x9a
syscall.forkExec(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7d60, 0xc4201b8068, 0x40e996, 0xc4201c7cc8)
/usr/local/go/src/syscall/exec_unix.go:158 +0x16e
syscall.StartProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7d60, 0x2, 0x4, 0x44ab30, 0xc42004e4c0)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7f08, 0xc420180180, 0x18, 0x18)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc4201b2160, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681162 [running]:
goroutine running on other thread; stack unavailable
created by main.main
/home/reezer/test.go:31 +0x24d
runtime stack:
runtime.throw(0x4ca671, 0x14)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc420017228)
/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc420017228, 0xc42017d380)
/usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc42017d380, 0x0, 0x0, 0xc41fff4163, 0xc400000000, 0x800765860, 0xc420017228, 0xc)
/usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc42017d380)
/usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc4200944a0, 0xc42017d380)
/usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
/usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc42003de70)
/usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc4200944a0, 0xc42017d380)
/usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc4200944a0, 0xc42017d380)
/usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc42017d380, 0x0)
/usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420129040)
/usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0xc4200001a0)
/usr/local/go/src/runtime/asm_amd64.s:240 +0x5b
runtime stack:
runtime.throw(0x4ca671, 0x14)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc420018728)
/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).tryGet(0xc420018728, 0x5364d0)
/usr/local/go/src/runtime/mgcwork.go:144 +0xd9
runtime.gcDrainN(0xc420018728, 0x10000, 0xfffffffd)
/usr/local/go/src/runtime/mgcmark.go:1052 +0x145
runtime.gcAssistAlloc.func1()
/usr/local/go/src/runtime/mgcmark.go:462 +0xe5
runtime.systemstack(0xc420018a00)
/usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1079
runtime stack:
runtime.throw(0x4ca671, 0x14)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc42001b228)
/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc42001b228, 0xc420153860)
/usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc420153860, 0x0, 0x0, 0xc41fff563c, 0xc400000000, 0x80052b2a0, 0xc42001b228, 0xf)
/usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc420153860)
/usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc4200950a0, 0xc420153860)
/usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
/usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc4200ade70)
/usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc4200950a0, 0xc420153860)
/usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc4200950a0, 0xc420153860)
/usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc420153860, 0x0)
/usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420001380)
/usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0x0)
/usr/local/go/src/runtime/asm_amd64.s:240 +0x5b
runtime stack:
runtime.throw(0x4ca671, 0x14)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc42001c728)
/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc42001c728, 0xc42017d040)
/usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc42017d040, 0x0, 0x0, 0xc41fff417d, 0xc400000000, 0x800765860, 0xc42001c728, 0xa)
/usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc42017d040)
/usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc42002c8a0, 0xc42017d040)
/usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
/usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc420039e70)
/usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc42002c8a0, 0xc42017d040)
/usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc42002c8a0, 0xc42017d040)
/usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc42017d040, 0x0)
/usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420045040)
/usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0x0)
/usr/local/go/src/runtime/asm_amd64.s:240 +0x5b
```
@reezer how long did it take to reproduce for you with GOGC=5?
It varied quite a bit: sometimes minutes, but up to about two hours. It really does need the forking. What also speeds it up (very slightly) is using an absolute path for the binary; Go otherwise will iterate over all of the PATH directories, trying to execute true, so you have a few unnecessary syscalls. Of course you should also make sure that you really have GOMAXPROCS set to something > 1.
See also the timings that derekmarcotte posted. Mine have always been higher, but not that much.
It looks like this bug doesn't appear in virtual environments. I agree that there is quite a bit of evidence that it's a bug somewhere outside of Go (i.e. the kernel). Another hint here is that it apparently takes longest the first time you run this. However, take this statement with a grain of salt, because I haven't really sampled this enough yet.
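For reference, the PATH iteration mentioned above is the lookup that exec.Command performs for a relative name. A small illustrative sketch (not part of the thread) of what gets skipped when an absolute path is used:

```
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// For a relative name, exec.Command calls LookPath, which checks each
	// $PATH directory in turn for an executable "true"; an absolute path
	// such as /bin/true bypasses this search entirely.
	path, err := exec.LookPath("true")
	fmt.Println(path, err)
}
```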
Thanks, reproduced now :)
Here's a possibly interesting one:-
```
Starting a bunch of goroutines...
GOMAXPROCS: 24
fatal error: all goroutines are asleep - deadlock!
goroutine 1 [chan receive]:
main.main()
/data/go/src/github.com/multiplay/go/apps/test/main.go:29 +0x222
goroutine 3560607 [semacquire]:
syscall.ByteSliceFromString(0x4c8a13, 0x9, 0x41aaf7, 0x800574199, 0x800574120, 0x4d, 0x4d)
/usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0x4c8a13, 0x9, 0x10, 0x4b0f60, 0x4aedc0)
/usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.Open(0x4c8a13, 0x9, 0x100001, 0xc400000000, 0x40d8de, 0xc420326000, 0x8)
/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:820 +0x3f
os.OpenFile(0x4c8a13, 0x9, 0x1, 0x800000000, 0x0, 0x0, 0xc42029ee00)
/usr/local/go/src/os/file_unix.go:97 +0xa7
os/exec.(*Cmd).writerDescriptor(0xc420258000, 0x0, 0x0, 0x1, 0xc42026a4d0, 0xc420347e48)
/usr/local/go/src/os/exec/exec.go:233 +0x569
os/exec.(*Cmd).stdout(0xc420258000, 0x0, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:221 +0x3d
os/exec.(*Cmd).Start(0xc420258000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560699 [semacquire]:
os/exec.(*Cmd).Start(0xc42031a160, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:348 +0x7ac
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560559 [semacquire]:
os.newProcess(0xfc37, 0x0, 0xc4202ce450)
/usr/local/go/src/os/exec.go:23 +0x31
os.startProcess(0xc4202ce500, 0xd, 0xc4202ce450, 0x1, 0x1, 0xc42030cf08, 0xc4201843c0, 0x1d, 0x1d)
/usr/local/go/src/os/exec_posix.go:49 +0x2c0
os.StartProcess(0xc4202ce500, 0xd, 0xc4202ce450, 0x1, 0x1, 0xc42030cf08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420240000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560649 [semacquire]:
os/exec.(*Cmd).writerDescriptor(0xc4202c4000, 0x0, 0x0, 0x1, 0xc4201188f0, 0xc42003b648)
/usr/local/go/src/os/exec/exec.go:237 +0x665
os/exec.(*Cmd).stdout(0xc4202c4000, 0xc4202a6030, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:221 +0x3d
os/exec.(*Cmd).Start(0xc4202c4000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560772 [semacquire]:
syscall.ByteSliceFromString(0xc420012000, 0x2d, 0xc4201d0040, 0x11, 0x11, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc420012000, 0x2d, 0xc4201d0040, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.SlicePtrFromStrings(0xc4202565a0, 0x1d, 0x1d, 0xc4202e4390, 0x2, 0x2, 0x0, 0x0)
/usr/local/go/src/syscall/exec_unix.go:87 +0x9a
syscall.forkExec(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1d60, 0xc420278028, 0x40e996, 0xc4202a1cc8)
/usr/local/go/src/syscall/exec_unix.go:158 +0x16e
syscall.StartProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1d60, 0x2, 0x4, 0x44ab30, 0xc42017a180)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1f08, 0xc4202565a0, 0x1d, 0x1d)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42024a000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560475 [semacquire]:
os.(*Process).wait(0xc42028c000, 0x0, 0xc42031a000, 0x0)
/usr/local/go/src/os/exec_unix.go:48 +0x17a
os.(*Process).Wait(0xc42028c000, 0xc420310f08, 0x0, 0x0)
/usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc42031a000, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:12 +0x76
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560701 [semacquire]:
syscall.ByteSliceFromString(0xc4203321e0, 0xa, 0x4bb0c0, 0xc420332101, 0x100000000000a, 0x38, 0xb0)
/usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc4203321e0, 0xa, 0x8005260d0, 0xc420159d40, 0xa)
/usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.Stat(0xc4203321e0, 0xa, 0xc420328038, 0xc4203321e0, 0xa)
/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1130 +0x3f
os.Stat(0xc4203321e0, 0xa, 0x2, 0x2, 0xc4203321e0, 0xa)
/usr/local/go/src/os/file_unix.go:169 +0x7d
os/exec.findExecutable(0xc4203321e0, 0xa, 0x2, 0xc4203321e0)
/usr/local/go/src/os/exec/lp_unix.go:20 +0x39
os/exec.LookPath(0x4c8239, 0x4, 0x4c8239, 0x4, 0x1, 0x0)
/usr/local/go/src/os/exec/lp_unix.go:53 +0x188
os/exec.Command(0x4c8239, 0x4, 0x0, 0x0, 0x0, 0xc420232420)
/usr/local/go/src/os/exec/exec.go:135 +0x1ff
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:10 +0x55
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560650 [semacquire]:
os.NewFile(0x12, 0x4c8a13, 0x9, 0x0)
/usr/local/go/src/os/file_unix.go:56 +0x3f
os.OpenFile(0x4c8a13, 0x9, 0x1, 0xc400000000, 0xc420374d70, 0x40eb85, 0xc420254230)
/usr/local/go/src/os/file_unix.go:123 +0xff
os/exec.(*Cmd).writerDescriptor(0xc420178000, 0x0, 0x0, 0xc420254238, 0xc420268020, 0xc420268020)
/usr/local/go/src/os/exec/exec.go:233 +0x569
os/exec.(*Cmd).stderr(0xc420178000, 0xc420268020, 0x1, 0x1)
/usr/local/go/src/os/exec/exec.go:228 +0x4e
os/exec.(*Cmd).Start(0xc420178000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
4006.16 real 3508.55 user 29207.58 sys
```
Here's the selection of errors it triggered:
fatal error: all goroutines are asleep - deadlock!
fatal error: workbuf is empty
runtime: nelems=256 nfree=233 nalloc=23 previous allocCount=18 nfreed=65531
fatal error: sweep increased allocation count
runtime: failed MSpanList_Remove 0x800698500 0x800b46d40 0x53adb0 0x53ada0
fatal error: MSpanList_Remove
These are results from 1.7.3 on FreeBSD 11.0.
Had it hang on a run last night; here's a full trace with GOTRACEBACK=crash, so it includes the runtime routines. I also have the crash dump if it's of use:
https://gist.github.com/stevenh/4b6d5bde0b81503c1a4bca461a5322ae
Just wanted to give a quick status-update:
As has been mentioned, the fork appears to be critical. In testing, I've removed all the Pipes plumbing in the parent, and exited the child immediately after fork (using SYS_EXIT), for my /doesnotexist example. This reliably crashes the process with similar panics. With the fork removed, Everything Is Fine. This implies that the exec syscall is not implicated.
What is surprising to me is the nature of the call: it is surprising that a child might affect a parent process, but if the syscall itself is to blame, well, that is also surprising.
I can make this work available if anyone is interested.
I feel that my next step is to circumvent the regular runtime path that /doesnotexist takes, and write a second test using, say the RawSyscall interface alone.
I'd add that it also seems to need GC: with GOGC=100 it always takes many hours here, with GOGC=2-5 it's usually between 20 mins and 1 hour, and I've never had it happen with GOGC=off.
I'm going to leave a GOGC=off test running over the weekend to try and confirm that.
Perhaps the next step is to stop garbage collection as part of ForkLock? Ima play in here for a bit.
I thought the same so I'm running the forks with GC disabled using GOGC=off and then manually calling GC via debug.FreeOSMemory().
No results yet, but here's the code for reference.
This adds additional synchronisation due to the use of constant goroutines, which might also have an impact but we'll have to see. It will be an interesting data point either way.
package main
import (
"fmt"
"os/exec"
"runtime"
"runtime/debug"
"time"
)
var (
gcPeriod = time.Second * 10
forkRoutines = 16
)
func run(work, done chan struct{}) {
for range work {
cmd := exec.Command("/usr/bin/true")
cmd.Start()
cmd.Wait()
done <- struct{}{}
}
}
func main() {
fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
done := make(chan struct{}, forkRoutines)
work := make(chan struct{}, forkRoutines)
for i := 0; i < forkRoutines; i++ {
work <- struct{}{}
go run(work, done)
}
for {
start := time.Now()
forks := 0
active := forkRoutines
forking:
for range done {
forks++
if time.Since(start) > gcPeriod {
active--
if active == 0 {
break forking
}
} else {
work <- struct{}{}
}
}
fmt.Println("forks:", forks)
debug.FreeOSMemory()
for i := 0; i < forkRoutines; i++ {
work <- struct{}{}
}
}
}
Here's an alternative version which is as close to the known bad version as I think we can get without editing the runtime:
package main
import (
"fmt"
"os/exec"
"runtime"
"time"
)
var (
gcPeriod = time.Second * 10
forkRoutines = 16
)
func run(done chan struct{}) {
cmd := exec.Command("/usr/bin/true")
cmd.Start()
cmd.Wait()
done <- struct{}{}
}
func main() {
fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
done := make(chan struct{}, forkRoutines*2)
for i := 0; i < forkRoutines; i++ {
go run(done)
}
for {
start := time.Now()
active := forkRoutines
forking:
for range done {
if time.Since(start) > gcPeriod {
active--
if active == 0 {
break forking
}
} else {
go run(done)
}
}
runtime.GC()
for i := 0; i < forkRoutines; i++ {
go run(done)
}
}
}
I've been running these here for a few hours now with no errors; I usually see an error within 20m - 1 hour with GOGC=5.
This seems to indicate the issue may be centered around GC kicking in while there is a fork in progress.
Both sets of code have now been running for 12 hours with no errors, so it's looking more like this is indeed down to the GC interacting with active forks, causing corruption.
@stevenh,
and then manually calling GC via debug.FreeOSMemory()
That doesn't do what you think it does. That's not how you manually call a GC. That would be runtime.GC(), but calling runtime.GC() can be super slow. (It does a full GC while blocking the process.)
Just disable & reenable with:
old := debug.SetGCPercent(0) // now disabled.
// forklock..fork..exec...unlock
debug.SetGCPercent(old) // now GC is re-enabled.
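For what it's worth, here is a minimal sketch of wrapping a single fork/exec that way, with /usr/bin/true as in the tests above. Note the runtime/debug docs say a negative percentage is what disables collection outright, so -1 is used here; treat this as illustrative, not as the form that was actually run in this thread:

```go
package main

import (
	"os/exec"
	"runtime/debug"
)

// runOnce disables the collector around one fork/exec/wait and then
// restores the previous GOGC setting.
func runOnce() error {
	old := debug.SetGCPercent(-1) // negative value disables GC
	defer debug.SetGCPercent(old) // restore; may itself trigger a collection

	cmd := exec.Command("/usr/bin/true")
	if err := cmd.Start(); err != nil {
		return err
	}
	return cmd.Wait()
}

func main() {
	for {
		if err := runOnce(); err != nil {
			panic(err)
		}
	}
}
```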
I did actually check that and saw debug.FreeOSMemory() is freeOSMemory which in turn is:
func runtime_debug_freeOSMemory() {
gcStart(gcForceBlockMode, false)
systemstack(func() { mheap_.scavenge(-1, ^uint64(0), 0) })
}
Whereas runtime.GC() is:
```go
func GC() {
	gcStart(gcForceBlockMode, false)
}
```
So it looks like FreeOSMemory does do a full GC plus an attempt to get the OS to clean up. This is reinforced by the docs for FreeOSMemory:
FreeOSMemory forces a garbage collection followed by an attempt to return as much memory to the operating system as possible. (Even if this is not called, the runtime gradually returns memory to the operating system in a background task.)
I've also confirmed that memory usage is maintained which is not the case with just GOGC=off.
Am I really mistaken?
I could use
```go
old := debug.SetGCPercent(0)
...
// Force GC
debug.SetGCPercent(old)
debug.SetGCPercent(0)
```
It's also worth noting that SetGCPercent directly calls runtime.GC(), so it would still result in a stop-the-world GC?
Also worth noting @bradfitz that while the first used debug.FreeOSMemory(), the second uses runtime.GC(), as I was interested in the difference.
I guess I missed or don't understand why you want to force a GC. Or why you'd want to return memory to the operating system.
I thought the whole point was to stop the GC so it's not running during a fork. That's much lighter weight than running a GC.
The idea of the tests is to confirm that we can fork without error if the GC was disabled.
As the original test used multiple routines doing the forking, to force the issue quicker, I wanted to maintain that. Given this, it seemed easier to just do a stop-the-world GC than to try and coordinate ensuring that GC was stopped across all the routines as they fork.
The first version used debug.FreeOSMemory() in case freeing the memory to the OS was a contributing factor to the issue.
The second version I switched to runtime.GC() to see if removing that factor and allowing background free made any difference.
I did consider using debug.SetGCPercent(X), but after looking at it and noticing it actually calls runtime.GC() anyway, I concluded it would be almost identical to calling runtime.GC() while all the forking routines were blocked, which, as mentioned previously, was easier to coordinate.
Both of the test versions have now been running for 24 hours, so it seems highly likely that GC while forking is the cause of this issue.
I'll leave them running for now, but I need a way to confirm for sure and also to narrow down the exact piece of code which is causing the issue.
Any ideas gratefully received.
Given that we believe that this only occurs on FreeBSD, and only occurs when running on real hardware, I would guess that the problem is not the GC as such, but rather the fact that another thread in the parent process is doing a lot of reads and writes to memory pages that are also being read and written by the child process.
Hmm, why do you say it only occurs on real hardware @ianlancetaylor?
Reading back through the thread I see @kaey mention that telegraph crashes for him on vmware too, so I'm interested to know if there's another piece of the puzzle there?
I'm wondering if PCID may be at play here...
If it's PCID related, does "vm.pmap.pcid_enabled=0" help?
That's exactly what I just tried; unfortunately not :(
@stevenh I said "real hardware" based on @reezer 's comment in https://github.com/golang/go/issues/15658#issuecomment-264165583. My apologies if I misunderstood.
In an attempt to produce a minimal version I've tried creating a C version but currently no joy.
Capturing here for reference.
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/resource.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
#include <errno.h>
#include <pthread.h>
#define MEMSIZE 10240
#define RANDMAX 4
#define THREADS 16
#define handle_error_en(en, msg) \
do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
struct thread_info {
pthread_t thread_id;
int thread_num;
};
void
check(int val, int *i, int *mem)
{
int j;
if (*i != val) {
printf("i: %d\n", *i);
exit(EXIT_FAILURE);
}
for (j = 0; j < MEMSIZE; j++) {
if (mem[j] != val) {
printf("mem: %d = %d\n", j, mem[j]);
exit(EXIT_FAILURE);
}
}
}
void
randomise(int val, int *i, int *mem)
{
struct timespec ts;
ts.tv_sec = 0;
srandom((unsigned long)val);
for (; val < RANDMAX; val+=2) {
*i = val;
for (int j = 0; j < MEMSIZE; j++) {
mem[j] = val;
}
check(val, i, mem);
}
}
void
dofork()
{
pid_t pid;
int i;
int mem[MEMSIZE];
i = 0;
for (int j = 0; j < MEMSIZE; j++) {
mem[j] = 0;
}
pid = fork();
switch (pid) {
case -1:
handle_error("fork");
case 0: {
// child
randomise(1, &i, mem);
exit(execv("/usr/bin/true", NULL));
}
default: {
int status;
struct rusage ru;
// parent
randomise(2, &i, mem);
// wait6 followed by wait4 as that's what golang does
wait6(P_PID, pid, 0, WNOWAIT|WEXITED, 0, 0);
wait4(pid, &status, 0, &ru);
}}
}
static void *
thread_start(void *arg)
{
struct thread_info *tinfo = arg;
char info[1024], *ret;
sprintf(info, "thread %d\n", tinfo->thread_num);
ret = strdup(info);
if (ret == NULL) {
handle_error("strdup");
}
while (1==1) {
dofork();
}
return ret;
}
int
main(int argc, char**argv)
{
int s, tnum;
struct thread_info *tinfo;
pthread_attr_t attr;
void *res;
s = pthread_attr_init(&attr);
if (s != 0) {
handle_error_en(s, "pthread_attr_init");
}
tinfo = calloc(THREADS, sizeof(struct thread_info));
if (tinfo == NULL) {
handle_error("calloc");
}
for (tnum = 0; tnum < THREADS; tnum++) {
tinfo[tnum].thread_num = tnum + 1;
s = pthread_create(&tinfo[tnum].thread_id, &attr, &thread_start, &tinfo[tnum]);
if (s != 0) {
handle_error_en(s, "pthread_create");
}
}
s = pthread_attr_destroy(&attr);
if (s != 0) {
handle_error_en(s, "pthread_attr_destroy");
}
for (tnum = 0; tnum < THREADS; tnum++) {
s = pthread_join(tinfo[tnum].thread_id, &res);
if (s != 0) {
handle_error_en(s, "pthread_join");
}
printf("joined with thread %d; returned value was %s\n", tinfo[tnum].thread_num, (char *)res);
free(res);
}
free(tinfo);
exit(EXIT_SUCCESS);
}
I've been looking through the FreeBSD specific code in this area and I noticed that newosproc uses raw sigprocmask.
If it's possible that newosproc can be called from multiple threads at the same time, then it's possible that signals which are meant to be blocked during thr_new could be re-enabled unexpectedly, causing undefined behaviour.
This is why man SIGPROCMASK(2) states:
In threaded applications, pthread_sigmask(3) must be used instead of sigprocmask().
This may be unrelated but...
I looked at the FreeBSD library sources, and it looks like pthread_sigmask just calls sigprocmask after updating some user space information. There is no separate pthread_sigmask system call. So I think the Go code is doing the right thing in making the sigprocmask system call.
libthr maintains internal state to ensure that signals don't get unblocked when they shouldn't (it keeps a blocked counter and only unblocks when the counter drops to zero). This seems a bit quirky to me at first glance, so I'd have to spend more time looking to see if there are other things it does to ensure the correct handling of signals it provides to threaded apps.
It struck me that, as Go can run different goroutines on a given kernel thread, then when combined with the direct use of sigprocmask it may be possible for a signal blocked by one to be unintentionally unblocked by another without additional protection in the runtime.
I'm not saying this is possible or that this could be the issue, but just highlighting something I noticed while investigating.
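As a sketch (not something from the thread) of one way to probe that from Go, based on the run() helper in the second test program above: pin each forking goroutine to its OS thread, so the fork, the wait, and any per-thread signal-mask changes cannot migrate between kernel threads, and see whether the crash profile changes.

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
)

const forkRoutines = 16

// Same shape as run() above, but pinned to a single OS thread so the fork,
// the wait, and any per-thread signal-mask manipulation stay put.
func runLocked(done chan struct{}) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	cmd := exec.Command("/usr/bin/true")
	if err := cmd.Start(); err == nil {
		cmd.Wait()
	}
	done <- struct{}{}
}

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	done := make(chan struct{}, forkRoutines)
	for i := 0; i < forkRoutines; i++ {
		go runLocked(done)
	}
	for range done {
		go runLocked(done)
	}
}
```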
Something else that I've just noticed is that when we get the failed MSpanList_Remove panic, it always seems to be due to span.list != list, with list off by 10 every time, which doesn't seem random.
Does this ring any bells?
Some progress: I was able to add a temporary method to the runtime to test only the fork portion of the program. This work is available here:
https://github.com/derekmarcotte/go/commit/9a3667bc261085793f7386064699eb3f020d1a4b
Additionally, a test case that exercises that method is available here:
https://raw.githubusercontent.com/derekmarcotte/goruntime-testing/master/src/15658-forkonly.go
This test case has wedged once (with 3 threads at 100%, and the rest idle), and crashed once, out of two attempts:
acquirep: p->m=842351068160(8) p->status=1
fatal error: acquirep: invalid p state
This is definitely some interaction with the fork system call. I'll next disable garbage collection and see where it goes. Will post results.
EDIT: Whoop, thought I based my branch off of master. Have rebased and will be testing with GOGC=5, and then GOGC=off. New links to come in a few days.
Just wanted to add, using the 15658-forkonly.go merged against master, of three tests:
I'll be posting updates to my branch later today.
Interesting to note: all of the panics now seem to be in acquirep.
Interesting, do you have the full trace from the GOGC=off case?
I don't suppose you have the output from procstat -k -k do you?
Given how simple that is now, it really smells like this could be either signals or something more fundamental like _umtx_op interaction.
@stevenh:
Trace: https://github.com/derekmarcotte/goruntime-testing/blob/master/logs/15658-forkonly-log-head-2
No procstat, but I can try again.
Here's the test function integrated with head: https://github.com/derekmarcotte/go/commit/1deb3ec41080ce69dc4e980716fcf07db34825ac
Been running your new code with ktrace since Wed Dec 14 10:06:02 UTC 2016; no crashes yet :(
Still no crashes with your test case under ktrace after 7 days @derekmarcotte :(
I'll leave it till tomorrow, then try to see if I can trigger it without ktrace.
Still unable to reproduce the issue after 8 days running with GOGC=OFF and no ktrace.
Are you sure that you managed to reproduce the issue with GOGC=OFF @derekmarcotte?
@stevenh: FWIW, I have not been able to reproduce the crashes with GOGC=OFF, on an Intel Xeon (with ECC RAM), only on my AMD FX-8350. I'm going to (start to) verify GOGC=5 on the Xeon today. If you have an AMD (+FreeBSD 10.3 amd64) machine at your disposal, that would be interesting.
I'm afraid I don't have access to any AMD HW. I'm currently running with GOGC=2 + ktrace on an E5-2640 since the 1st; no crashes yet, but it does seem to take quite a while with ktrace enabled.
Here's a GOGC=5, wedged, with one thread at 100% and the rest at 0%:
procstat -k -k 62812
PID TID COMM TDNAME KSTACK
62812 100817 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_timedwait_sig+0x10 _sleep+0x238 umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 100928 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 100937 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 101090 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 102891 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 102896 15658-forkonly-+5b6 15658-forkonly-+
62812 102897 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 102898 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 102899 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_timedwait_sig+0x10 _sleep+0x238 umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 102900 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 103024 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
I'm guessing they are blocked on the ForkLock, but I don't know how to interpret this output. I ran ktrace for a bit (~30s), and got the following (first number is counts):
1716125 CALL clock_gettime(0x4,0xc420063e88)
44278 CALL clock_gettime(0x4,0xc4201adee0)
2 CALL clock_gettime(0x4,0xc420039eb0)
1 CALL clock_gettime(0x4,0xc420039f10)
1 CALL clock_gettime(0,0xc420039ef0)
1760407 RET clock_gettime 0
858063 CALL sched_yield
858062 RET sched_yield 0
22139 CALL _umtx_op(0x4f62c8,UMTX_OP_WAIT_UINT_PRIVATE,0,0x18,0xc4201ade78)
1 CALL _umtx_op(0x4f62d8,UMTX_OP_WAIT_UINT_PRIVATE,0,0x18,0xc420039e48)
22140 RET _umtx_op -1 errno 60 Operation timed out
1 CALL nanosleep(0xc420039f18,0)
1 RET nanosleep 0
Interesting to note: I'm running ntp on this machine. I think there was another issue (https://github.com/golang/go/issues/17168) that depended on a MONOTONIC clock on FreeBSD. Are you running NTP?
For those that cannot reproduce, what if you run NTP?
Interesting find, I am running NTP here.
It doesn't seem like this could be the cause of the corruption given the description but I may be wrong.
I've been unable to get anything other than the single thread stuck @ 100% CPU while exercising https://github.com/derekmarcotte/go/commit/1deb3ec41080ce69dc4e980716fcf07db34825ac
I'm going to try running with 1.7.4 + https://github.com/golang/go/commit/360f2e43b78a3ea119ea8dce9649f7c1227d793b to see if this behaviour changes.
Scratch that as https://github.com/golang/go/commit/360f2e43b78a3ea119ea8dce9649f7c1227d793b was already included in 1.7 by https://github.com/golang/go/commit/a2f37b7fe130e1a5da78b605f4191547e9868944
Testing on my machines right now, got this (already mentioned in this thread):
runtime: failed MSpanList_Remove 0x800528280 0x800727480 0x53ab10 0x53ab00
fatal error: MSpanList_Remove
on the original true test fairly quickly (GOTRACEBACK=crash GOMAXPROCS=10 GOGC=5, FreeBSD/amd64 11.0-RELEASE-p6, go1.7.4, Intel Core i5 Skylake @ 4.5GHz, VirtualBox VM)
UPDATE on my laptop (same variables; FreeBSD/amd64 12-CURRENT, go1.7.4, Intel Core i3 Haswell @ 1.8 GHz):
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
@stevenh : Testing on a Xeon w/ ECC, and GOGC=2, I haven't yet panic'd. I have seen all threads idle a few times though, and then I just break and restart. Challenge is that this machine is FreeBSD 10.1.
I'd like to use the same usb stick/test binary/environment on both hardware platforms, and see what the observed differences are. It's hard to dedicate hardware resources like this for any period of time though. At least then, it should be easily repeatable.
@myfreeweb: are you interested/able to run the fork-only tests, using:
This runtime:
This test application:
This command line:
GOMAXPROCS=8 GOGC=5 ./15658-forkonly
Thanks!
Using a fresh FreeBSD 10.3-RELEASE memstick-mini "live cd" mode (from the FTP release site), and the above, I got an acquirep panic, first try, on identical hardware to this machine.
I'm going to use the GOGC=off and see if I still panic next.
At this point though, this means that I'm able to consistently repeat it on:
FreeBSD 10.3-RELEASE amd64 (AMD+non-ecc, Xeon+ecc), with a fresh memstick-mini image from FreeBSD
To see what happens when we eliminate new goroutines and channels, I've been testing with the following:
package main
import (
"log"
"net/http"
_ "net/http/pprof"
"runtime"
"syscall"
)
var (
forkRoutines = 16
)
func run() {
for {
if err := syscall.ForkOnlyBSDTest(); err != nil {
log.Fatal(err)
}
}
}
func main() {
log.Printf("Starting %v forking goroutines...\n", forkRoutines)
log.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
for i := 0; i < forkRoutines; i++ {
go run()
}
log.Println(http.ListenAndServe(":8080", nil))
}
After 4 days of running it wedged using very little CPU.
ktrace showed it was waiting periodically but not much else. The /debug/pprof/ was unresponsive and there was one goroutine waiting to be run in Wait4:
goroutine 71 [runnable]:
syscall.Wait4(0x13e31, 0xc42008d73c, 0x0, 0x0, 0x13e31, 0x0, 0x0)
/usr/local/go/src/syscall/syscall_bsd.go:127 fp=0xc42008d700 sp=0xc42008d6f8
syscall.ForkOnlyBSDTest(0x0, 0x0)
/usr/local/go/src/syscall/exec_bsd_debug.go:35 +0xf1 fp=0xc42008d780 sp=0xc42008d700
main.run()
/data/go/src/github.com/multiplay/go/apps/test/main.go:17 +0x2e fp=0xc42008d7c0 sp=0xc42008d780
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42008d7c8 sp=0xc42008d7c0
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:28 +0x1c4
It seems like the runtime scheduler is failing to run either the one goroutine which holds the fork lock (above) or the pprof handler, so it is somehow wedged.
This may be irrelevant and just another symptom of memory corruption but I wanted to share as I've not seen it before.
I have the full goroutine trace after -ABRT and the gdb threads trace before that in case it might shed some additional light.
I'm seeing a similar memory corruption issue in my app, which I was able to narrow down slightly and reproduce as follows. My repro case is not as basic as the one already being used here, but I figured I'd document it anyway:
package main
import (
"log"
"os"
"syscall"
"time"
"github.com/fancybits/gopsutil/cpu"
)
func doit() {
time.Sleep(50 * time.Millisecond)
log.Printf("running exec...")
syscall.Exec(os.Args[0], os.Args, os.Environ())
log.Printf("exec failed!")
}
func main() {
log.SetFlags(log.LstdFlags)
log.Printf("starting...")
cpu.Info()
go doit()
handler := make(chan os.Signal, 1)
for range handler {
break
}
}
I'm building this with golang 1.8 (cd6b6202dd1559b3ac63179b45f1833fcfbe7eca) on amd64 freebsd in a virtualbox VM running FreeNAS 9.10 (which uses the FreeBSD 10.3 kernel). The vagrant box I used is https://atlas.hashicorp.com/drajen/boxes/freenas9
Running the test program reliably crashes within a few seconds, varying between SIGBUS, SIGSEGV, SIGILL, etc:
vagrant@freenas:~ % uname -a
FreeBSD freenas.local 10.3-STABLE FreeBSD 10.3-STABLE #0 455d13d(9.10-STABLE): Sun Jun 26 22:47:03 PDT 2016 [email protected]:/tank/home/nightlies/build-freenas9/_BE/objs/tank/home/nightlies/build-freenas9/_BE/trueos/sys/FreeNAS.amd64 amd64
vagrant@freenas:~ % ./spawntest
2017/03/07 19:56:21 starting...
2017/03/07 19:56:21 running exec...
2017/03/07 19:56:21 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
panic during panic
[signal SIGBUS: bus error code=0x3 addr=0x451730 pc=0x451730]
runtime stack:
runtime.startpanic_m()
/home/vagrant/go/src/runtime/panic.go:653 +0x19a
runtime.systemstack(0x525fc8)
/home/vagrant/go/src/runtime/asm_amd64.s:343 +0xab
runtime.startpanic()
/home/vagrant/go/src/runtime/panic.go:569 +0x1e
runtime.throw(0x5248de, 0x2a)
/home/vagrant/go/src/runtime/panic.go:595 +0x88
runtime.sigpanic()
/home/vagrant/go/src/runtime/signal_unix.go:274 +0x2db
runtime.usleep(0x333520322e322e30, 0x302e303120343430, 0x32322035312e322e, 0x6e69622f3a6e6962, 0x6f6c2f7273752f3a, 0x3a6e69622f6c6163, 0x636f6c2f7273752f, 0x3a6e6962732f6c61, 0x6e69622f7273752f, 0x2f3a6e6962732f3a, ...)
/home/vagrant/go/src/runtime/sys_freebsd_amd64.s:276 +0x40
goroutine 1 [runnable, locked to thread]:
syscall.Syscall6(0x214, 0x0, 0x6018, 0x0, 0x18, 0x0, 0x0, 0x6018, 0x0, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:42 +0x5
os.(*Process).blockUntilWaitable(0xc420070090, 0xc42009e000, 0xc420092140, 0x3)
/home/vagrant/go/src/os/wait_wait6.go:29 +0x6b
os.(*Process).wait(0xc420070090, 0x40e042, 0xc420094090, 0xc420070030)
/home/vagrant/go/src/os/exec_unix.go:22 +0x4d
os.(*Process).Wait(0xc420070090, 0x0, 0x0, 0x525c08)
/home/vagrant/go/src/os/exec.go:115 +0x2b
os/exec.(*Cmd).Wait(0xc42009e000, 0x0, 0x0)
/home/vagrant/go/src/os/exec/exec.go:435 +0x62
os/exec.(*Cmd).Run(0xc42009e000, 0xc420096050, 0x2)
/home/vagrant/go/src/os/exec/exec.go:280 +0x5c
os/exec.(*Cmd).Output(0xc42009e000, 0xc, 0xc420035db8, 0x2, 0x2, 0xc42009e000)
/home/vagrant/go/src/os/exec/exec.go:474 +0x11c
github.com/fancybits/gopsutil/internal/common.DoSysctrl(0x51f5cd, 0xd, 0x0, 0x0, 0x1, 0xc420098000, 0x0)
/home/vagrant/.go/src/github.com/fancybits/gopsutil/internal/common/common_freebsd.go:22 +0x120
github.com/fancybits/gopsutil/cpu.Times(0x5ad601, 0xc420098000, 0x1, 0x1, 0x0, 0x0)
/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu_freebsd.go:53 +0x6a
github.com/fancybits/gopsutil/cpu.init.1()
/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu.go:59 +0xb1
github.com/fancybits/gopsutil/cpu.init()
/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu_freebsd.go:155 +0x76
main.init()
/home/vagrant/spawntest/main.go:31 +0x58
goroutine 17 [runnable]:
syscall.Syscall(0x3, 0x4, 0xc4200a8000, 0x200, 0x4, 0x200, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5
syscall.read(0x4, 0xc4200a8000, 0x200, 0x200, 0xc420024800, 0x800598000, 0x0)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:893 +0x55
syscall.Read(0x4, 0xc4200a8000, 0x200, 0x200, 0x42561f, 0x5259f0, 0xc420021dc0)
/home/vagrant/go/src/syscall/syscall_unix.go:162 +0x49
os.(*File).read(0xc4200a0010, 0xc4200a8000, 0x200, 0x200, 0x4a4feb, 0x4f45e0, 0x200)
/home/vagrant/go/src/os/file_unix.go:165 +0x6a
os.(*File).Read(0xc4200a0010, 0xc4200a8000, 0x200, 0x200, 0x42e5be, 0xc420001a00, 0xc420021e60)
/home/vagrant/go/src/os/file.go:101 +0x76
bytes.(*Buffer).ReadFrom(0xc420098070, 0x5994a0, 0xc4200a0010, 0x800638028, 0xc420098070, 0xc420021f01)
/home/vagrant/go/src/bytes/buffer.go:179 +0x160
io.copyBuffer(0x5992a0, 0xc420098070, 0x5994a0, 0xc4200a0010, 0x0, 0x0, 0x0, 0xc420012178, 0x0, 0x0)
/home/vagrant/go/src/io/io.go:384 +0x2cb
io.Copy(0x5992a0, 0xc420098070, 0x5994a0, 0xc4200a0010, 0x4f3660, 0xc420012120, 0xc420021fb0)
/home/vagrant/go/src/io/io.go:360 +0x68
os/exec.(*Cmd).writerDescriptor.func1(0x4f3660, 0xc420012120)
/home/vagrant/go/src/os/exec/exec.go:254 +0x4d
os/exec.(*Cmd).Start.func1(0xc42009e000, 0xc4200920c0)
/home/vagrant/go/src/os/exec/exec.go:371 +0x27
created by os/exec.(*Cmd).Start
/home/vagrant/go/src/os/exec/exec.go:372 +0x4e4
goroutine 18 [runnable]:
syscall.Syscall(0x3, 0x6, 0xc4200ac000, 0x8000, 0x0, 0x8000, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5
syscall.read(0x6, 0xc4200ac000, 0x8000, 0x8000, 0x80059dc80, 0x1, 0x101000000000000)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:893 +0x55
syscall.Read(0x6, 0xc4200ac000, 0x8000, 0x8000, 0x8005984b0, 0x0, 0x0)
/home/vagrant/go/src/syscall/syscall_unix.go:162 +0x49
os.(*File).read(0xc4200a0028, 0xc4200ac000, 0x8000, 0x8000, 0x4f45e0, 0xc42001c601, 0xc4200ac000)
/home/vagrant/go/src/os/file_unix.go:165 +0x6a
os.(*File).Read(0xc4200a0028, 0xc4200ac000, 0x8000, 0x8000, 0x8000, 0x8000, 0x0)
/home/vagrant/go/src/os/file.go:101 +0x76
io.copyBuffer(0x599620, 0xc420096050, 0x5994a0, 0xc4200a0028, 0xc4200ac000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
/home/vagrant/go/src/io/io.go:390 +0x100
io.Copy(0x599620, 0xc420096050, 0x5994a0, 0xc4200a0028, 0x0, 0x0, 0x0)
/home/vagrant/go/src/io/io.go:360 +0x68
os/exec.(*Cmd).writerDescriptor.func1(0x0, 0x0)
/home/vagrant/go/src/os/exec/exec.go:254 +0x4d
os/exec.(*Cmd).Start.func1(0xc42009e000, 0xc420092140)
/home/vagrant/go/src/os/exec/exec.go:371 +0x27
created by os/exec.(*Cmd).Start
/home/vagrant/go/src/os/exec/exec.go:372 +0x4e4
You're actually hitting the fork issue using gopsutil, as it uses the sysctl command instead of just the syscall, which is silly; so for your case switching to raw syscalls would likely fix your issue.
Well, hold on. The call to cpu.Info() does fork/exec 3 times, but cpu.Info() isn't being called from the goroutine doit(); it's only being used before the goroutine is spawned.
@tmm1 does removing the call to cpu.Info() before doit change the abort behavior in your environment? My suspicion is it will, and the program will take a while to crash.
What @tmm1 is reporting is interesting to me because his test case is reproducible within seconds versus days. Without digging in, what stands out to me is this test case has two concurrent fork calls happening at the start of the program and it's calling a real program.
I'm not sure how this is different than @stevenh's snippet, but there's probably something in syscall.ForkOnlyBSDTest() that's different than the fork/exec call to /sbin/sysctl.
The trace indicates the issue is in the init() of https://github.com/fancybits/gopsutil/blob/arm-mhz-fix/cpu/cpu.go#L59
The frequency of the panic is interesting.
@tmm1 could you post your spawntest script please so we can try to repro.
Ignore the above, I've reproduced it and it is very quick, so definitely very interesting.
Your actually hitting the fork issue using gopsutil as it uses the cmd sysctl instead of just the syscall, which is silly, so for your case switching to use raw syscalls would likely fix your issue.
Indeed, I'm not sure why it's using fork/exec instead of the syscall. Switching to the syscall should be an effective workaround.
@tmm1 does removing the call to cpu.Info() before doit change the abort behavior in your environment? My suspicion it will and the program will take a while to crash.
Removing the cpu.Info() makes the program take longer to crash, to a few minutes instead of seconds.
The trace indicates the issue is in the init() of https://github.com/fancybits/gopsutil/blob/arm-mhz-fix/cpu/cpu.go#L59
Indeed, it almost always happens in cpu.init()
@tmm1 could post your spawntest script please so we can try to repo.
spawntest is what I called the compiled golang binary given the source posted above.
I also tried this morning to run the same binary on a fresh freebsd VM, using the freebsd/FreeBSD-10.3-RELEASE vagrant box. I cannot seem to reproduce the crash there (at least not in the same short amount of time that I tried). The kernel there is:
FreeBSD fbsd 10.3-RELEASE FreeBSD 10.3-RELEASE #0 r297264: Fri Mar 25 02:10:02 UTC 2016 [email protected]:/usr/obj/usr/src/sys/GENERIC amd64
However on the FreeNAS 9.10 VM (running on the same macOS host) crashes readily:
FreeBSD freenas.local 10.3-STABLE FreeBSD 10.3-STABLE #0 455d13d(9.10-STABLE): Sun Jun 26 22:47:03 PDT 2016 [email protected]:/tank/home/nightlies/build-freenas9/_BE/objs/tank/home/nightlies/build-freenas9/_BE/trueos/sys/FreeNAS.amd64 amd64
My golang app which I originally saw this memory corruption bug in is designed to be run on FreeNAS appliances, and at least half a dozen users have reported the same crash on various pieces of hardware all running FreeNAS 9.10. (The golang program auto-updates itself by downloading a new version, then syscall.Exec to replace itself. The crash always happens during cpu.init() after the new process takes over. Here's a gdb trace from a production crash, which usually exhibits as a SIGILL: https://gist.github.com/tmm1/86cd869d1bfdae66e4058698dd30aee9)
Anyone who can make it crash reliably, please try changing go/src/os/wait_wait6.go's func (p *Process) blockUntilWaitable() (bool, error) to have as its very first line:
return false, nil
That will fall back to not using the wait6 system call at all. If the crash still happens, then we know that wait6 is not the problem. If the crash stops, then we have somewhere to look. Thanks.
(And then of course 'go install os')
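For anyone else applying that change, this is roughly what the edited function ends up looking like; it is only a sketch of the suggested experiment, not a fix. On FreeBSD, blockUntilWaitable normally issues wait6 with WEXITED|WNOWAIT so the child can still be reaped by the later wait4; returning early disables that path.

```go
// go/src/os/wait_wait6.go (experiment only)
func (p *Process) blockUntilWaitable() (bool, error) {
	// Skip the wait6(WEXITED|WNOWAIT) call entirely so Process.wait falls
	// straight through to wait4; if crashes persist, wait6 is not to blame.
	return false, nil
	// ... the original wait6-based body below becomes unreachable ...
}
```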
Anyone who can make it crash reliably, please try changing go/src/os/wait_wait6.go's func (p *Process) blockUntilWaitable() (bool, error) to have as its very first line:
Made this change and my repro still crashed quickly:
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
panic during panic
[signal SIGBUS: bus error code=0x3 addr=0x451730 pc=0x451730]
runtime stack:
runtime.startpanic_m()
/home/vagrant/go/src/runtime/panic.go:653 +0x19a
runtime.systemstack(0x525fc0)
/home/vagrant/go/src/runtime/asm_amd64.s:343 +0xab
runtime.startpanic()
/home/vagrant/go/src/runtime/panic.go:569 +0x1e
runtime.throw(0x5248d9, 0x2a)
/home/vagrant/go/src/runtime/panic.go:595 +0x88
runtime.sigpanic()
/home/vagrant/go/src/runtime/signal_unix.go:274 +0x2db
runtime.usleep(0x6e69622f7273752f, 0x2f3a6e6962732f3a, 0x6e6962732f727375, 0xc420070750, 0x2d, 0xc4200707b0, 0x2d, 0xc420070810, 0x21, 0xc420094820, ...)
/home/vagrant/go/src/runtime/sys_freebsd_amd64.s:276 +0x40
goroutine 1 [runnable]:
syscall.Syscall6(0x7, 0x13612, 0xc420035a7c, 0x0, 0xc420074090, 0x0, 0x0, 0x13612, 0x0, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:42 +0x5
syscall.wait4(0x13612, 0xc420035a7c, 0x0, 0xc420074090, 0x90, 0x516b00, 0x4bc401)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:34 +0x88
syscall.Wait4(0x13612, 0xc420035acc, 0x0, 0xc420074090, 0xc420035b80, 0xc420014360, 0x0)
/home/vagrant/go/src/syscall/syscall_bsd.go:129 +0x51
os.(*Process).wait(0xc420014360, 0x40e042, 0xc42000e660, 0xc420014300)
/home/vagrant/go/src/os/exec_unix.go:38 +0xbb
os.(*Process).Wait(0xc420014360, 0x0, 0x0, 0x525c00)
/home/vagrant/go/src/os/exec.go:115 +0x2b
os/exec.(*Cmd).Wait(0xc4200686e0, 0x0, 0x0)
/home/vagrant/go/src/os/exec/exec.go:435 +0x62
os/exec.(*Cmd).Run(0xc4200686e0, 0xc42004a0f0, 0x2)
/home/vagrant/go/src/os/exec/exec.go:280 +0x5c
os/exec.(*Cmd).Output(0xc4200686e0, 0xc, 0xc420035cb8, 0x2, 0x2, 0xc4200686e0)
/home/vagrant/go/src/os/exec/exec.go:474 +0x11c
github.com/fancybits/gopsutil/internal/common.DoSysctrl(0x51e5bc, 0x7, 0x40, 0x40a1ec0000000000, 0x0, 0x0, 0x0)
/home/vagrant/.go/src/github.com/fancybits/gopsutil/internal/common/common_freebsd.go:22 +0x120
github.com/fancybits/gopsutil/cpu.Info(0x51f162, 0xb, 0x0, 0x0, 0x0)
/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu_freebsd.go:118 +0x1e4
main.main()
/home/vagrant/spawntest/main.go:28 +0x67
goroutine 8 [runnable]:
syscall.Syscall(0x3, 0x6, 0xc4200da000, 0x8000, 0x0, 0x8000, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5
syscall.read(0x6, 0xc4200da000, 0x8000, 0x8000, 0x80059e3a0, 0x402001, 0x10100c42001de08)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:893 +0x55
syscall.Read(0x6, 0xc4200da000, 0x8000, 0x8000, 0x800598000, 0x0, 0xc42001de28)
/home/vagrant/go/src/syscall/syscall_unix.go:162 +0x49
os.(*File).read(0xc42000c088, 0xc4200da000, 0x8000, 0x8000, 0x4f45e0, 0xc42001de01, 0xc4200da000)
/home/vagrant/go/src/os/file_unix.go:165 +0x6a
os.(*File).Read(0xc42000c088, 0xc4200da000, 0x8000, 0x8000, 0x8000, 0x8000, 0x403fc0)
/home/vagrant/go/src/os/file.go:101 +0x76
io.copyBuffer(0x599620, 0xc42004a0f0, 0x5994a0, 0xc42000c088, 0xc4200da000, 0x8000, 0x8000, 0xc4200a44d8, 0x0, 0x0)
/home/vagrant/go/src/io/io.go:390 +0x100
io.Copy(0x599620, 0xc42004a0f0, 0x5994a0, 0xc42000c088, 0x4f3660, 0xc4200a4480, 0xc42001dfb0)
/home/vagrant/go/src/io/io.go:360 +0x68
os/exec.(*Cmd).writerDescriptor.func1(0x4f3660, 0xc4200a4480)
/home/vagrant/go/src/os/exec/exec.go:254 +0x4d
os/exec.(*Cmd).Start.func1(0xc4200686e0, 0xc42000a600)
/home/vagrant/go/src/os/exec/exec.go:371 +0x27
created by os/exec.(*Cmd).Start
/home/vagrant/go/src/os/exec/exec.go:372 +0x4e4
goroutine 7 [runnable]:
syscall.Syscall(0x3, 0x4, 0xc420072400, 0x200, 0x1, 0x200, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5
syscall.read(0x4, 0xc420072400, 0x200, 0x200, 0xc420024400, 0x8005984b0, 0x0)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:893 +0x55
syscall.Read(0x4, 0xc420072400, 0x200, 0x200, 0x42561f, 0x5259e8, 0xc42001e5c0)
/home/vagrant/go/src/syscall/syscall_unix.go:162 +0x49
os.(*File).read(0xc42000c070, 0xc420072400, 0x200, 0x200, 0x4a4e2b, 0x4f45e0, 0x200)
/home/vagrant/go/src/os/file_unix.go:165 +0x6a
os.(*File).Read(0xc42000c070, 0xc420072400, 0x200, 0x200, 0x42e5be, 0xc4200a8680, 0xc42001e660)
/home/vagrant/go/src/os/file.go:101 +0x76
bytes.(*Buffer).ReadFrom(0xc420048230, 0x5994a0, 0xc42000c070, 0x800638028, 0xc420048230, 0xc42001e701)
/home/vagrant/go/src/bytes/buffer.go:179 +0x160
io.copyBuffer(0x5992a0, 0xc420048230, 0x5994a0, 0xc42000c070, 0x0, 0x0, 0x0, 0xc4200a44d8, 0x0, 0x0)
/home/vagrant/go/src/io/io.go:384 +0x2cb
io.Copy(0x5992a0, 0xc420048230, 0x5994a0, 0xc42000c070, 0x4f3660, 0xc4200a4480, 0xc42001e7b0)
/home/vagrant/go/src/io/io.go:360 +0x68
os/exec.(*Cmd).writerDescriptor.func1(0x4f3660, 0xc4200a4480)
/home/vagrant/go/src/os/exec/exec.go:254 +0x4d
os/exec.(*Cmd).Start.func1(0xc4200686e0, 0xc42000a580)
/home/vagrant/go/src/os/exec/exec.go:371 +0x27
created by os/exec.(*Cmd).Start
/home/vagrant/go/src/os/exec/exec.go:372 +0x4e4
OK, that's good progress. Thanks. @tmm1, in the trace it says pc=0x451730. What is that pc in the binary? (for example, go tool objdump yourprog | grep -C 10 0x451730)
$ go tool objdump spawntest | grep -C 10 0x451730
sys_freebsd_amd64.s:267 0x45170b 48890424 MOVQ AX, 0(SP)
sys_freebsd_amd64.s:268 0x45170f b8e8030000 MOVL $0x3e8, AX
sys_freebsd_amd64.s:269 0x451714 f7e2 MULL DX
sys_freebsd_amd64.s:270 0x451716 4889442408 MOVQ AX, 0x8(SP)
sys_freebsd_amd64.s:272 0x45171b 4889e7 MOVQ SP, DI
sys_freebsd_amd64.s:273 0x45171e 31f6 XORL SI, SI
sys_freebsd_amd64.s:274 0x451720 b8f0000000 MOVL $0xf0, AX
sys_freebsd_amd64.s:275 0x451725 0f05 SYSCALL
sys_freebsd_amd64.s:276 0x451727 488b6c2410 MOVQ 0x10(SP), BP
sys_freebsd_amd64.s:276 0x45172c 4883c418 ADDQ $0x18, SP
sys_freebsd_amd64.s:276 0x451730 c3 RET
TEXT runtime.settls(SB) /home/vagrant/go/src/runtime/sys_freebsd_amd64.s
sys_freebsd_amd64.s:279 0x451740 4883ec10 SUBQ $0x10, SP
sys_freebsd_amd64.s:279 0x451744 48896c2408 MOVQ BP, 0x8(SP)
sys_freebsd_amd64.s:279 0x451749 488d6c2408 LEAQ 0x8(SP), BP
sys_freebsd_amd64.s:280 0x45174e 4883c708 ADDQ $0x8, DI
sys_freebsd_amd64.s:281 0x451752 48893c24 MOVQ DI, 0(SP)
sys_freebsd_amd64.s:282 0x451756 4889e6 MOVQ SP, SI
sys_freebsd_amd64.s:283 0x451759 48c7c781000000 MOVQ $0x81, DI
sys_freebsd_amd64.s:284 0x451760 48c7c0a5000000 MOVQ $0xa5, AX
Ignore the above I've produced it and its is very quick, so definitely very interesting.
What version of FreeBSD did you repro on?
I'm still confused as to why my FreeBSD VM does not reproduce, but my FreeNAS VM does.
I noticed there was an update available to FreeNAS (9.10 -> 9.10.2). I updated, and the crash still happens.
before:
FreeBSD freenas.local 10.3-STABLE FreeBSD 10.3-STABLE #0 455d13d(9.10-STABLE): Sun Jun 26 22:47:03 PDT 2016 [email protected]:/tank/home/nightlies/build-freenas9/_BE/objs/tank/home/nightlies/build-freenas9/_BE/trueos/sys/FreeNAS.amd64 amd64
after:
FreeBSD freenas.local 10.3-STABLE FreeBSD 10.3-STABLE #0 41eb257(9.10.2-STABLE): Mon Mar 6 17:03:14 UTC 2017 root@gauntlet:/freenas-9.10-releng/_BE/objs/freenas-9.10-releng/_BE/os/sys/FreeNAS.amd64 amd64
@tmm1 FreeBSD 11.0-RELEASE no VM.
Here's a different trace I got just now. Clearly some nasty memory corruption going on. It's almost as if the RawSyscall wrapper is wrong or we're using vfork instead of fork (neither of which appears to be the case).
runtime: nelems=6 nfree=4 nalloc=2 previous allocCount=1 nfreed=65535
fatal error: sweep increased allocation count
runtime stack:
runtime.throw(0x5231bf, 0x20)
/home/vagrant/go/src/runtime/panic.go:596 +0x95
runtime.(*mspan).sweep(0x80059cda8, 0x80059cd00, 0xc420000b60)
/home/vagrant/go/src/runtime/mgcsweep.go:288 +0x786
runtime.sweepone(0x42d876)
/home/vagrant/go/src/runtime/mgcsweep.go:110 +0x1ad
runtime.gosweepone.func1()
/home/vagrant/go/src/runtime/mgcsweep.go:125 +0x2b
runtime.systemstack(0xc420016000)
/home/vagrant/go/src/runtime/asm_amd64.s:327 +0x79
runtime.mstart()
/home/vagrant/go/src/runtime/proc.go:1132
goroutine 3 [running]:
runtime.systemstack_switch()
/home/vagrant/go/src/runtime/asm_amd64.s:281 fp=0xc420020f68 sp=0xc420020f60
runtime.gosweepone(0x0)
/home/vagrant/go/src/runtime/mgcsweep.go:126 +0x53 fp=0xc420020f98 sp=0xc420020f68
runtime.bgsweep(0xc420048000)
/home/vagrant/go/src/runtime/mgcsweep.go:59 +0xbb fp=0xc420020fd8 sp=0xc420020f98
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420020fe0 sp=0xc420020fd8
created by runtime.gcenable
/home/vagrant/go/src/runtime/mgc.go:212 +0x61
goroutine 1 [runnable]:
regexp.(*bitState).reset(0xc4203737a0, 0x38, 0xc)
/home/vagrant/go/src/regexp/backtrack.go:95 +0xff
regexp.(*machine).backtrack(0xc420101680, 0x59a460, 0xc420101758, 0x0, 0x38, 0xc, 0x4f)
/home/vagrant/go/src/regexp/backtrack.go:321 +0xa1
regexp.(*Regexp).doExecute(0xc42036fcc0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc420098440, 0x38, 0x0, 0xc, ...)
/home/vagrant/go/src/regexp/exec.go:439 +0x33a
regexp.(*Regexp).FindStringSubmatch(0xc42036fcc0, 0xc420098440, 0x38, 0x0, 0x0, 0x0)
/home/vagrant/go/src/regexp/regexp.go:919 +0xec
github.com/fancybits/gopsutil/cpu.Info(0x51f162, 0xb, 0x0, 0x0, 0x0)
/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu_freebsd.go:133 +0x3ef
main.main()
/home/vagrant/spawntest/main.go:28 +0x67
OK, this is interesting: I just got a crash even with GOGC=off and ktrace, e.g.
time sh -c 'export GOGC=off; export GOTRACEBACK=crash; ktrace -i -f ktrace-test.out ./test '
And?
Here's an updated version of the reproduction program, now with no external dependencies or goroutines:
package main
import (
"bufio"
"log"
"os"
"os/exec"
"regexp"
"syscall"
"time"
)
func init() {
exec.Command("true").Run()
// do some work after the fork/exec, to expose memory corruption if it occurred
f, _ := os.Open("/var/run/dmesg.boot")
defer f.Close()
scanner := bufio.NewScanner(f)
for scanner.Scan() {
line := scanner.Text()
regexp.MustCompile(`Origin\s*=\s*"(.+)"\s+Id\s*=\s*(.+)\s+Family\s*=\s*(.+)\s+Model\s*=\s*(.+)\s+Stepping\s*=\s*(.+)`).FindStringSubmatch(line)
}
}
func main() {
log.SetFlags(log.LstdFlags)
log.Printf("started process")
time.Sleep(1 * time.Millisecond)
log.Printf("running exec...")
syscall.Exec(os.Args[0], os.Args, os.Environ())
log.Printf("exec failed!")
}
This is interesting:
If I comment out the time.Sleep(50 * time.Millisecond) (or use time.Sleep(0)), I can no longer reproduce the crash.
If I change it to time.Sleep(1 * time.Millisecond), the crash happens reliably within a couple seconds of running.
@tmm1, nice. Sadly, still no crash for me with FreeBSD 10.1-RELEASE on a Google Cloud VM; tried both 1ms and 50ms sleeps.
I'm starting to wonder about threading behavior, especially since it seems like init is important. (I assume if it wasn't, you'd have simplified further already.)
Some things to try. First rename init to xinit, main to xmain. Then:
(1) Make sure this still crashes:
func init() { xinit() }
func main() { xmain() }
(2) Confirm that this doesn't crash:
func main() { xinit(); xmain() }
(3) See if this crashes:
func init() { runtime.LockOSThread() }
func main() { xinit(); xmain() }
(4) See if this crashes:
func init() { runtime.LockOSThread() }
func main() { xinit(); runtime.UnlockOSThread(); xmain() }
(5) See if this crashes:
func init() { runtime.LockOSThread() }
func main() { xinit(); go xmain(); select{} }
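For anyone assembling these, here is how variant (5) fits together once init and main from the repro above are renamed to xinit and xmain; it's just a sketch that copies the bodies from the earlier reproduction program.

```go
package main

import (
	"bufio"
	"log"
	"os"
	"os/exec"
	"regexp"
	"runtime"
	"syscall"
	"time"
)

// xinit: fork/exec a child, then do some work to expose corruption.
func xinit() {
	exec.Command("true").Run()

	f, _ := os.Open("/var/run/dmesg.boot")
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		regexp.MustCompile(`Origin\s*=\s*"(.+)"\s+Id\s*=\s*(.+)\s+Family\s*=\s*(.+)\s+Model\s*=\s*(.+)\s+Stepping\s*=\s*(.+)`).FindStringSubmatch(line)
	}
}

// xmain: re-exec the current binary.
func xmain() {
	log.SetFlags(log.LstdFlags)
	log.Printf("started process")
	time.Sleep(1 * time.Millisecond)
	log.Printf("running exec...")
	syscall.Exec(os.Args[0], os.Args, os.Environ())
	log.Printf("exec failed!")
}

// Variant (5): the fork happens locked to the main thread, the exec
// happens on some other, non-main thread.
func init() { runtime.LockOSThread() }

func main() {
	xinit()
	go xmain()
	select {}
}
```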
@tmm1, nice. Sadly, still no crash for me with FreeBSD 10.1-RELEASE on a Google Cloud VM; tried both 1ms and 50ms sleeps.
Strange. One thing I just remembered is that I had trouble booting the FreeNAS VM in virtualbox and had to change some settings before it worked (https://github.com/drajen/freenas-vagrant/issues/1). I will try making that same change on the FreeBSD VM to see if it causes the repro to work there.
I'm starting to wonder about threading behavior, especially since it seems like init is important. (I assume if it wasn't, you'd have simplified further already.)
Looks like the init is irrelevant, I just didn't get that far in the simplification process.
Here are the results from your tests:
Very interesting that (3) does not crash. That variant makes sure the fork and the final exec happen in the same thread (the main thread, it turns out).
Does (5) crash more reliably than the others? It forces the fork to happen in the main thread and then forces the exec to happen in a non-main thread.
Also interesting would be the reverse (fork forced in non-main thread, final exec forced in main thread):
(6)
func init() { runtime.LockOSThread() }
func main() {
ch := make(chan int)
go func() {
xinit();
ch <- 1
}()
<-ch
xmain()
}
It's really tempting to think there is some kind of race in exec where the OS doesn't quite manage to stop all the threads in the old process before creating the new one, hence the memory fault or SIGILL trying to execute a return instruction. But I don't see where the race could be: the final exec doesn't happen until the first fork+exec is over and cleaned up. It's almost like a thread in the new process starts running and then accidentally gets torn down as if it were part of the old process.
(6) does not crash.
(4) and (5) appear to crash more quickly than the rest.
Random sampling of timings (of course these are all highly variable):
[vagrant@freenas ~]$ time ./rsctest1
Illegal instruction
real 0m0.685s
user 0m0.213s
sys 0m0.427s
[vagrant@freenas ~]$ time ./rsctest2
Illegal instruction
real 0m8.332s
user 0m3.093s
sys 0m6.164s
[vagrant@freenas ~]$ time ./rsctest3
^C
real 0m25.811s
user 0m10.192s
sys 0m18.873s
[vagrant@freenas ~]$ time ./rsctest4
Illegal instruction
real 0m0.026s
user 0m0.010s
sys 0m0.020s
[vagrant@freenas ~]$ time ./rsctest5
Illegal instruction
real 0m0.146s
user 0m0.051s
sys 0m0.111s
[vagrant@freenas ~]$ time ./rsctest6
^C
real 0m33.222s
user 0m12.663s
sys 0m26.550s
If you can, please try one more:
(7)
func init() { runtime.LockOSThread() }
func main() {
xinit()
ch := make(chan int)
go func() {
runtime.LockOSThread()
ch <- 1
<-ch
xmain()
}()
<-ch
runtime.UnlockOSThread()
ch <- 1
select {}
}
| run | xinit | xmain | outcome |
| --- | --- | --- | --- |
| (1) | locked to main thread | on any thread | crash |
| (2) | on any thread | on any thread | crash |
| (3) | locked to main thread | locked to main thread | success |
| (4) | locked to main thread | on any thread | crash |
| (5) | locked to main thread | on any non-main thread | crash |
| (6) | on any non-main thread | locked to main thread | success |
| (7) | locked to main thread | locked to non-main thread | ??? |
Clearly the problem goes away when xmain is locked to the main OS thread. But what's important? The fact that it's locked to a thread or the fact that it's the main OS thread? The new program (7) is halfway between (3) and (5): it runs xmain locked to a thread, just definitely not the main OS thread.
While Go certainly treats locked vs unlocked thread execution differently, it otherwise makes no distinctions or special cases for the main OS thread. If (7) succeeds, then the important detail is the Go runtime's thread locking, not the specific operating system thread, which would seem to implicate Go. If (7) crashes, then the important detail would appear to be the specific operating system thread used for the syscall.Exec, which would seem to implicate FreeBSD.
Also, if you remove the "do some work", does that affect crash frequency? It's not involved in any of these theories, nor in the SIGBUS/SIGILL above, but maybe it is important for getting the timing right.
If (7) crashes the next step is probably to try to convert it into a C program that does the fork+wait of /bin/true from the main thread and then starts a new pthread for the execve, leaving the main thread to block (maybe pthread_join on the new thread).
(7) crashes.
Without the "do work" block, the crashes take longer to reproduce. For example, (2) took 46s and 90s to crash, and (4) took 27s and 21s to crash.
I wrote this, but haven't been able to reproduce the crash with it yet:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h> /* for waitpid() */
#include <pthread.h>
#include <errno.h>
extern char **environ;
void *do_exec(void *arg) {
fprintf(stdout, "running exec...\n");
char **argv = (char**)arg;
execve(argv[0], argv, environ);
fprintf(stdout, "exec failed: %d!\n", errno);
exit(1);
return NULL;
}
int main(int argc, char *argv[]) {
fprintf(stdout, "started process.\n");
int pid = fork();
if (pid == 0) { // child
if (execve("/usr/bin/true", argv, environ) < 0) {
fprintf(stderr, "execve(true) failed: %d\n", errno);
}
exit(0);
}
waitpid(pid, NULL, 0);
usleep(1000);
pthread_t thread;
pthread_create(&thread, NULL, do_exec, argv);
pthread_join(thread, NULL);
return 0;
}
I suspect that even if memory corruption were occurring in the C program, there's just not enough going on for the program to notice.
FWIW, this crash at least is not memory corruption, at least not in the "allocated memory / program heap" sense.
[signal SIGBUS: bus error code=0x3 addr=0x451730 pc=0x451730]
The fault here is trying to read from memory the instruction to be executed (addr = pc), which objdump identified as a valid PC that should have had a RET instruction mapped there. This is very strongly suggestive of the operating system doing process management wrong.
It's definitely possible that the C binary is so tiny compared to the Go binary that there are many fewer pages and less opportunity to get things wrong. Compiling with -static might be worth trying.
I tried upgrading my 10.3-RELEASE VM to 10.3-RELEASE-p11 and still couldn't reproduce any crashes.
Then I started up a fresh new 11.0-RELEASE-p8 VM (based on @stevenh's comment earlier), and sure enough it crashes there.
FreeBSD 11.0-RELEASE-p8 FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017 [email protected]:/usr/obj/usr/src/sys/GENERIC amd64
@rsc Maybe you can reproduce on GCE with --image freebsd-11-0-release-p1-amd64 --image-project=freebsd-org-cloud-dev
@rsc, we have FreeBSD 11 gomote-accessible builders, too.
Behaviors on FreeBSD 11 are slightly different from my FreeNAS VM, so it's probably worth testing all the cases again.
For instance, now it crashes with and without the time.Sleep(), but _only if_ the program uses stdout as well. I commented out the log.Printf lines and it wouldn't crash (quickly?) anymore.
Similarly, if I remove the "do work" section I can't get it to crash anymore on 11.0 either.
My original simplified repro still crashes quickly though. Definitely curious to hear if you see the same on GCE.
Here's the crash I'm seeing on FreeBSD 11:
runtime: nelems=6 nfree=4 nalloc=2 previous allocCount=1 nfreed=65535
fatal error: sweep increased allocation count
runtime stack:
runtime.throw(0x4fa91e, 0x20)
/home/vagrant/go/src/runtime/panic.go:596 +0x95 fp=0xc42002fe58 sp=0xc42002fe38
runtime.(*mspan).sweep(0x800563b48, 0x800563b00, 0xc420000820)
/home/vagrant/go/src/runtime/mgcsweep.go:288 +0x786 fp=0xc42002ff28 sp=0xc42002fe58
runtime.sweepone(0x42d276)
/home/vagrant/go/src/runtime/mgcsweep.go:110 +0x1ad fp=0xc42002ff90 sp=0xc42002ff28
runtime.gosweepone.func1()
/home/vagrant/go/src/runtime/mgcsweep.go:125 +0x2b fp=0xc42002ffb0 sp=0xc42002ff90
runtime.systemstack(0xc420016000)
/home/vagrant/go/src/runtime/asm_amd64.s:327 +0x79 fp=0xc42002ffb8 sp=0xc42002ffb0
runtime.mstart()
/home/vagrant/go/src/runtime/proc.go:1132 fp=0xc42002ffc0 sp=0xc42002ffb8
goroutine 3 [running]:
runtime.systemstack_switch()
/home/vagrant/go/src/runtime/asm_amd64.s:281 fp=0xc420020f68 sp=0xc420020f60
runtime.gosweepone(0x0)
/home/vagrant/go/src/runtime/mgcsweep.go:126 +0x53 fp=0xc420020f98 sp=0xc420020f68
runtime.bgsweep(0xc420014070)
/home/vagrant/go/src/runtime/mgcsweep.go:59 +0xbb fp=0xc420020fd8 sp=0xc420020f98
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420020fe0 sp=0xc420020fd8
created by runtime.gcenable
/home/vagrant/go/src/runtime/mgc.go:212 +0x61
goroutine 1 [runnable]:
syscall.Syscall(0x4, 0x2, 0xc4203cae40, 0x24, 0x24, 0x24, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5 fp=0xc420035cb8 sp=0xc420035cb0
syscall.write(0x2, 0xc4203cae40, 0x24, 0x40, 0x20, 0x13, 0x100000000000011)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:1271 +0x55 fp=0xc420035d10 sp=0xc420035cb8
syscall.Write(0x2, 0xc4203cae40, 0x24, 0x40, 0x80055f000, 0x0, 0x14)
/home/vagrant/go/src/syscall/syscall_unix.go:181 +0x49 fp=0xc420035d58 sp=0xc420035d10
os.(*File).write(0xc42000c020, 0xc4203cae40, 0x24, 0x40, 0x40, 0x14, 0x40)
/home/vagrant/go/src/os/file_unix.go:186 +0x95 fp=0xc420035dc8 sp=0xc420035d58
os.(*File).Write(0xc42000c020, 0xc4203cae40, 0x24, 0x40, 0x23, 0xc4203cae40, 0x14)
/home/vagrant/go/src/os/file.go:142 +0x7c fp=0xc420035e40 sp=0xc420035dc8
log.(*Logger).Output(0xc420040050, 0x2, 0xc42030ffd0, 0xf, 0x0, 0x0)
/home/vagrant/go/src/log/log.go:168 +0x210 fp=0xc420035ee0 sp=0xc420035e40
log.Printf(0x4f7579, 0xf, 0x0, 0x0, 0x0)
/home/vagrant/go/src/log/log.go:291 +0x80 fp=0xc420035f28 sp=0xc420035ee0
main.main()
/home/vagrant/spawntest/main.go:28 +0x62 fp=0xc420035f88 sp=0xc420035f28
runtime.main()
/home/vagrant/go/src/runtime/proc.go:185 +0x20a fp=0xc420035fe0 sp=0xc420035f88
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420035fe8 sp=0xc420035fe0
goroutine 2 [force gc (idle)]:
runtime.gopark(0x4fd138, 0x573a30, 0x4f74a7, 0xf, 0x4fd014, 0x1)
/home/vagrant/go/src/runtime/proc.go:271 +0x13a fp=0xc420020768 sp=0xc420020738
runtime.goparkunlock(0x573a30, 0x4f74a7, 0xf, 0xc420000114, 0x1)
/home/vagrant/go/src/runtime/proc.go:277 +0x5e fp=0xc4200207a8 sp=0xc420020768
runtime.forcegchelper()
/home/vagrant/go/src/runtime/proc.go:226 +0x9e fp=0xc4200207e0 sp=0xc4200207a8
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4200207e8 sp=0xc4200207e0
created by runtime.init.4
/home/vagrant/go/src/runtime/proc.go:215 +0x35
goroutine 4 [finalizer wait]:
runtime.gopark(0x4fd138, 0x58e1f0, 0x4f72c1, 0xe, 0x14, 0x1)
/home/vagrant/go/src/runtime/proc.go:271 +0x13a fp=0xc420021718 sp=0xc4200216e8
runtime.goparkunlock(0x58e1f0, 0x4f72c1, 0xe, 0x2000000014, 0x1)
/home/vagrant/go/src/runtime/proc.go:277 +0x5e fp=0xc420021758 sp=0xc420021718
runtime.runfinq()
/home/vagrant/go/src/runtime/mfinal.go:161 +0xb2 fp=0xc4200217e0 sp=0xc420021758
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4200217e8 sp=0xc4200217e0
created by runtime.createfing
/home/vagrant/go/src/runtime/mfinal.go:142 +0x62
goroutine 5 [GC worker (idle)]:
runtime.gopark(0x4fcfd8, 0xc42030f860, 0x4f7654, 0x10, 0x14, 0x0)
/home/vagrant/go/src/runtime/proc.go:271 +0x13a fp=0xc420021f50 sp=0xc420021f20
runtime.gcBgMarkWorker(0xc420016000)
/home/vagrant/go/src/runtime/mgc.go:1491 +0x138 fp=0xc420021fd8 sp=0xc420021f50
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420021fe0 sp=0xc420021fd8
created by runtime.gcBgMarkStartWorkers
/home/vagrant/go/src/runtime/mgc.go:1412 +0x98
Abort trap (core dumped)
If I run with GOGC=off, I eventually see the program quit with "Illegal instruction" and a coredump.
Reading symbols from spawntest...done.
[New LWP 100878]
[New LWP 100139]
[New LWP 100146]
Core was generated by `spawntest'.
Program terminated with signal SIGILL, Illegal instruction.
#0 syscall.Syscall6 () at /home/vagrant/go/src/syscall/asm_unix_amd64.s:50
50 SYSCALL
[Current thread is 1 (LWP 100878)]
Loading Go Runtime support.
(gdb) info threads
Id Target Id Frame
* 1 LWP 100878 syscall.Syscall6 () at /home/vagrant/go/src/syscall/asm_unix_amd64.s:50
2 LWP 100139 runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21
3 LWP 100146 runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21
(gdb) thread apply all bt
Thread 3 (LWP 100146):
#0 runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21
#1 0x0000000000423b3d in runtime.futexsleep1 (addr=0xc420024518, val=0, ns=-1) at /home/vagrant/go/src/runtime/os_freebsd.go:92
#2 0x000000000044b5ea in runtime.futexsleep.func1 () at /home/vagrant/go/src/runtime/os_freebsd.go:80
#3 0x000000000044cf3b in runtime.systemstack () at /home/vagrant/go/src/runtime/asm_amd64.s:343
#4 0x0000000000423acd in runtime.futexsleep (addr=0xc420024518, val=0, ns=-1) at /home/vagrant/go/src/runtime/os_freebsd.go:81
#5 0x000000000040c68b in runtime.notesleep (n=0xc420024518) at /home/vagrant/go/src/runtime/lock_futex.go:145
#6 0x000000000042b2cd in runtime.stopm () at /home/vagrant/go/src/runtime/proc.go:1650
#7 0x000000000042c0e4 in runtime.findrunnable (gp#10=0xc420016000, inheritTime=false) at /home/vagrant/go/src/runtime/proc.go:2102
#8 0x000000000042cc7c in runtime.schedule () at /home/vagrant/go/src/runtime/proc.go:2222
#9 0x000000000042cf7b in runtime.park_m (gp=0xc420000680) at /home/vagrant/go/src/runtime/proc.go:2285
#10 0x000000000044ce6b in runtime.mcall () at /home/vagrant/go/src/runtime/asm_amd64.s:269
#11 0x000000c420016000 in ?? ()
#12 0x0000000000000000 in ?? ()
Thread 2 (LWP 100139):
#0 runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21
#1 0x0000000000423cae in runtime.futexwakeup (addr=0xc420024518, cnt=1) at /home/vagrant/go/src/runtime/os_freebsd.go:102
#2 0x000000000040c5a6 in runtime.notewakeup (n=0xc420024518) at /home/vagrant/go/src/runtime/lock_futex.go:135
#3 0x000000000042b4bd in runtime.startm (_p_=0xc420016000, spinning=false) at /home/vagrant/go/src/runtime/proc.go:1713
#4 0x000000000042b635 in runtime.handoffp (_p_=0xc420016000) at /home/vagrant/go/src/runtime/proc.go:1725
#5 0x000000000043151d in runtime.retake (now=4278381200320, ~r1=0) at /home/vagrant/go/src/runtime/proc.go:3909
#6 0x000000000043113a in runtime.sysmon () at /home/vagrant/go/src/runtime/proc.go:3836
#7 0x000000000042a39e in runtime.mstart1 () at /home/vagrant/go/src/runtime/proc.go:1179
#8 0x000000000042a274 in runtime.mstart () at /home/vagrant/go/src/runtime/proc.go:1149
#9 0x0000000000450927 in runtime.thr_start () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:45
#10 0x0000000000000000 in ?? ()
Thread 1 (LWP 100878):
#0 syscall.Syscall6 () at /home/vagrant/go/src/syscall/asm_unix_amd64.s:50
#1 0x0000000000478abb in os.(*Process).blockUntilWaitable (p=0xc420012330, ~r0=96, ~r1=...) at /home/vagrant/go/src/os/wait_wait6.go:29
#2 0x00000000004763bd in os.(*Process).wait (p=0xc420012330, ps=0x8, err=...) at /home/vagrant/go/src/os/exec_unix.go:22
#3 0x000000000047594b in os.(*Process).Wait (p=0xc420012330, ~r0=0x0, ~r1=...) at /home/vagrant/go/src/os/exec.go:115
#4 0x000000000049fc52 in os/exec.(*Cmd).Wait (c=0xc420064000, ~r0=...) at /home/vagrant/go/src/os/exec/exec.go:435
#5 0x000000000049f08c in os/exec.(*Cmd).Run (c=0xc420064000, ~r0=...) at /home/vagrant/go/src/os/exec/exec.go:280
#6 0x00000000004be011 in main.init.1 () at /home/vagrant/spawntest/main.go:14
#7 0x00000000004be347 in main.init ()
#8 0x000000000042747a in runtime.main () at /home/vagrant/go/src/runtime/proc.go:173
#9 0x000000000044f9d1 in runtime.goexit () at /home/vagrant/go/src/runtime/asm_amd64.s:2197
#10 0x0000000000000000 in ?? ()
(gdb) l
45 MOVQ a2+16(FP), SI
46 MOVQ a3+24(FP), DX
47 MOVQ a4+32(FP), R10
48 MOVQ a5+40(FP), R8
49 MOVQ a6+48(FP), R9
50 SYSCALL
51 JCC ok6
52 MOVQ $-1, r1+56(FP) // r1
53 MOVQ $0, r2+64(FP) // r2
54 MOVQ AX, err+72(FP) // errno
(gdb) disas
Dump of assembler code for function syscall.Syscall6:
0x000000000046c810 <+0>: callq 0x42d9d0 <runtime.entersyscall>
0x000000000046c815 <+5>: mov 0x8(%rsp),%rax
0x000000000046c81a <+10>: mov 0x10(%rsp),%rdi
0x000000000046c81f <+15>: mov 0x18(%rsp),%rsi
0x000000000046c824 <+20>: mov 0x20(%rsp),%rdx
0x000000000046c829 <+25>: mov 0x28(%rsp),%r10
0x000000000046c82e <+30>: mov 0x30(%rsp),%r8
0x000000000046c833 <+35>: mov 0x38(%rsp),%r9
=> 0x000000000046c838 <+40>: syscall
0x000000000046c83a <+42>: jae 0x46c859 <syscall.Syscall6+73>
0x000000000046c83c <+44>: movq $0xffffffffffffffff,0x40(%rsp)
0x000000000046c845 <+53>: movq $0x0,0x48(%rsp)
0x000000000046c84e <+62>: mov %rax,0x50(%rsp)
0x000000000046c853 <+67>: callq 0x42ddf0 <runtime.exitsyscall>
0x000000000046c858 <+72>: retq
0x000000000046c859 <+73>: mov %rax,0x40(%rsp)
0x000000000046c85e <+78>: mov %rdx,0x48(%rsp)
0x000000000046c863 <+83>: movq $0x0,0x50(%rsp)
0x000000000046c86c <+92>: callq 0x42ddf0 <runtime.exitsyscall>
0x000000000046c871 <+97>: retq
End of assembler dump.
Ran with ktrace and it crashed after a few minutes. Here's the last iteration worth of events: https://gist.github.com/tmm1/f10330c43fe6dfebeabcf5188e3ff535
Not particularly interesting from the kernel's point of view.
Thanks @bradfitz. I only just realized that the numbers in freebsd-amd64-gce93 and freebsd-amd64-gce101 are FreeBSD versions. I thought they were GCE versions somehow.
Anyway, reproduced reliably on freebsd-amd64-110 gomote (FreeBSD 11.0). That's progress.
Yeah, I've started dropping "gce" in the suffixes due to the confusion. Also, the freebsd-amd64-110 gomote is quite new (9 days ago in #19097). They weren't available when y'all started debugging this.
Anyway, reproduced reliably on freebsd-amd64-110 gomote (FreeBSD 11.0). That's progress.
Great! Let me know if I can do anything else to help.
Booted up a fresh FreeBSD-10.3-STABLE VM this morning, and I'm not able to repro my crash there either. So far, here's where my crash can be reproduced:
| version | outcome |
|---------|---------|
| 10.1-RELEASE (GCE) | not crashing |
| 10.3-RELEASE | not crashing |
| 10.3-RELEASE-p11 | not crashing |
| 10.3-STABLE (r314494) | not crashing |
| 10.3-STABLE (freenas-9.10 455d13d) | crashing |
| 10.3-STABLE (freenas-9.10.2 41eb257) | crashing |
| 11.0-RELEASE | crashing |
| 11.0-RELEASE-p8 | crashing |
| 11.0-STABLE (r314493) | crashing |
Interestingly the earlier repro was crashing for @derekmarcotte on 10.3-RELEASE, so I wonder if my issue is a different bug. I guess the 10.3 repro was also taking days, so possible it's just not triggering as readily on the older kernels.
Hiya,
In case this is architecture specific, can people include their actual CPU
details too?
-adrian
E5-2640 here
All my testing has been on VirtualBox VMs, on a darwin 15.6.0 host with a i5-2415M
Oh, and has anyone tried reproducing this on a single-core box, versus SMP?
(you can use cpuset to push the process+descendents onto a single core.)
-adrian
Booted up a 11.0-STABLE (r314493) VM and that crashes too.
(you can use cpuset to push the process+descendents onto a single core.)
I tried cpuset -l 0 ./spawntest and it still crashed.
Hiya,
Well, doing it in a VM doesn't mean the host OS won't drift it across CPU
cores. :-)
-adrian
Tried it here: still crashes, but takes much longer, and that's on the metal.
In case this is architecture specific, can people include their actual CPU
details too?
FreeBSD 11.0-RELEASE-p2 #0: Mon Oct 24 06:55:27 UTC 2016
[email protected]:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.8.0 (tags/RELEASE_380/final 262564) (based on LLVM 3.8.0)
VT(vga): resolution 640x480
CPU: AMD G-T40E Processor (1000.02-MHz K8-class CPU)
Origin="AuthenticAMD" Id=0x500f20 Family=0x14 Model=0x2 Stepping=0
Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x802209<SSE3,MON,SSSE3,CX16,POPCNT>
AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
AMD Features2=0x35ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,IBS,SKINIT,WDT>
SVM: NP,NRIP,NAsids=8
TSC: P-state invariant, performance statistics
Sorry if I'm asking a dumb question, but could this be as simple as a stack overflow? By default pthreads have smaller stacks than main threads, and a SIGBUS is the normal symptom of overflowing a thread's stack. So that's compatible with the observation that the crash only happens when the thread that calls execve is not the main thread of the process.
The failure signal is pretty random, I've seen SIGILL, SIGBUS and SIGSEGV :(
@asomers All the stacks here are goroutine stacks which never overflow.
@ianlancetaylor goroutines never SIGILL either, except when they do. There are real stacks down there somewhere.
@asomers These crashes are happening with pure Go programs. There are no pthreads involved anywhere. The stacks grow automatically: the function prologue ensures there is enough stack space before the function begins. I don't know what the problem is, but I am confident that it is not "as simple as a stack overflow."
Hi,
So it's a single-threaded process? (i.e., I'm asking if there are any threads at all for your process, versus specifically pthreads using our libthr.)
Does your Go runtime hook into any of the standard C runtime / linker
stuff? initialisers, destructors, etc?
-adrian
@erikarn it's a multithreaded process. As @tmm1's ktrace output shows, there appear to be 3 threads.
@ianlancetaylor the bug is unlikely to lie in pure Go code. It's probably either in the OS or in the implementation of Go. Remember, everything is assembly in the end. I don't know if a stack overflow is the problem, but I do know that when investigating a bug like this, we can't assume that Go's high-level safety guarantees are working.
@asomers For background I've been working on Go since 2008 and I'm very familiar with Go's implementation. I agree that Go's high level safety guarantees are not relevant here. I am simply asserting with high confidence that this is not a stack overflow.
(Personally I've believed for months now that this is a bug in the FreeBSD kernel, but I've never been able to recreate the problem myself, so I can't prove it. I could certainly be wrong--it could be a bug in Go. If it is a bug in Go it is most likely due to some error in the way that Go invokes the fork system call in syscall/exec_bsd.go. Perhaps the system call needs special handling on FreeBSD in some way, as is true of, for example, vfork on GNU/Linux.)
@erikarn
Does your Go runtime hook into any of the standard C runtime / linker stuff? initialisers, destructors, etc?
No.
Another data point for this is that after 8 hours of testing the program from https://github.com/golang/go/issues/15658#issuecomment-284940367 under go 1.4 it has still yet to fail.
The same under go 1.8 fails within a matter of minutes at the most.
A key difference between Go 1.4 and Go 1.8 is that Go 1.4 always defaults to a GOMAXPROCS value of 1, whereas in 1.8 GOMAXPROCS defaults to the number of CPUs on the system. So when testing with Go 1.4, set the environment variable GOMAXPROCS=4 or however many CPUs you have.
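A small sketch (not from the original tests) of forcing the same setting from inside the program rather than via the environment variable:
package main
import (
	"log"
	"runtime"
)
func main() {
	// On Go 1.4 this overrides the old default of 1; on 1.5 and later it is
	// effectively a no-op, since the default is already the number of CPUs.
	runtime.GOMAXPROCS(runtime.NumCPU())
	log.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}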
Good point, this is subtly different from using a single core in the cpuset, as IIRC there's an outstanding issue to make Go set GOMAXPROCS based on the number of cores available to the process rather than the total cores in the system; I still got an error on 1.8 in that configuration.
I've set GOMAXPROCS=24 and am retesting.
If go 1.4 still holds up, what do we think the chances are that there was a bug introduced when moving from C to pure Go?
@stevenh I think the issue about the number of cores available to the process is #15206, which was recently fixed on tip. The fix will be in 1.9.
There isn't a significant amount of C code involved here, so while anything is possible I think it is unlikely to be related to the conversion from C to Go. If it works with 1.4 and fails with 1.8 then a far more likely immediate cause is the concurrent garbage collector. Perhaps there is some problem when one thread reads or writes memory pages while a different thread is calling fork.
@ianlancetaylor GC should be out of the picture as we've had failures with GOGC=off
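(For completeness, GOGC=off and disabling the collector from inside the program are equivalent; a minimal sketch, not something the tests above relied on:)
package main
import "runtime/debug"
func init() {
	// Equivalent to running with GOGC=off: disables the garbage collector.
	debug.SetGCPercent(-1)
}
func main() {
	// the real test body would go here
}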
@stevenh Hmmm, right.
Since we are only seeing reports on FreeBSD I think any problem with the C to Go conversion would be in runtime/os_freebsd.c converting to runtime/os_freebsd.go. I don't see any problems there in a quick look, but who knows.
@ianlancetaylor Yeah, I had a quick look too the other day and couldn't spot anything either.
However, on cursory review libthr (FreeBSD's native thread library) has a lot more code in its fork processing than the go runtime seems to, so I'm wondering if some needed protection is missing, which would allow additional threads to perform actions during fork and result in the memory corruption we're seeing.
libthr fork can be seen here:
https://svnweb.freebsd.org/base/head/lib/libthr/thread/thr_fork.c?view=markup#l135
One test that might help backup that premise, is if we could ensure that no other threads ran while a fork was in process.
One test that might help backup that premise, is if we could ensure that no other threads ran while a fork was in process.
Any ideas on the best way to hack this in place to try it out?
@tmm1 Have beforefork in runtime/proc.go call stopTheWorld, and have afterfork call startTheWorld.
@ianlancetaylor Thanks!
I applied this patch and built a new binary with it, but the resulting binary still crashed quickly.
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f41672de73..55950bf64d 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -2793,6 +2793,7 @@ func beforefork() {
//go:linkname syscall_runtime_BeforeFork syscall.runtime_BeforeFork
//go:nosplit
func syscall_runtime_BeforeFork() {
+ stopTheWorld("beforefork")
systemstack(beforefork)
}
@@ -2814,6 +2815,7 @@ func afterfork() {
//go:nosplit
func syscall_runtime_AfterFork() {
systemstack(afterfork)
+ startTheWorld()
}
// Allocate a new g, with a stack big enough for stacksize bytes.
Update on the go 1.4 test: it's been running for over 20 hours now, no failures.
Ok so I may have a lead on this.
Given go 1.4 appeared to be unaffected I tried go 1.5, which also appears to be unaffected; however switching to go 1.6 it crashed within seconds.
After some bisecting I found that reverting https://github.com/golang/go/commit/a7cad52e04dd1890420452b38563997930edb2ca from go 1.6 made it stable again.
Digging some more, I see that after execve sigaltstack is returning the details of a signal stack set before the execve, which should never happen.
Ok reproduced it in C:
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
extern char **environ;
static void
die(const char *s)
{
perror(s);
exit(EXIT_FAILURE);
}
static void
printsigstack(const char *prefix)
{
stack_t ss;
if (sigaltstack(NULL, &ss) < 0)
die("sigaltstack get");
printf("%s - ss_sp=%p ss_size=%lu ss_flags=0x%x\n", prefix, ss.ss_sp,
(unsigned long)ss.ss_size, (unsigned int)ss.ss_flags);
}
static void
setstack(void *arg)
{
stack_t ss;
ss.ss_sp = malloc(SIGSTKSZ);
if (ss.ss_sp == NULL)
die("malloc");
ss.ss_size = SIGSTKSZ;
ss.ss_flags = 0;
if (sigaltstack(&ss, NULL) < 0)
die("sigaltstack set");
}
static void *
thread(void *arg)
{
char *argv[] = { "./test-sigs", "no-more", 0 };
printsigstack("thread - before setstack");
setstack(arg);
printsigstack("thread - after setstack + before execve");
execve(argv[0], &argv[0], environ);
die("exec failed");
return NULL;
}
int
main(int argc, char** argv)
{
int i, j;
int alloc_size = 32 * 1024;
pthread_t tid;
char buf[1024];
void *arg;
printf("argc: %d, %s\n", argc, argv[0]);
printsigstack("main");
if (argc != 1)
return 0;
j = pthread_create(&tid, NULL, thread, NULL);
if (j != 0) {
errno = j;
die("pthread_create");
}
j = pthread_join(tid, NULL);
if (j != 0) {
errno = j;
die("pthread_join");
}
return 0;
}
Produces:
argc: 1, ./test-sigs
main - ss_sp=0x0 ss_size=0 ss_flags=0x4
thread - before setstack - ss_sp=0x0 ss_size=0 ss_flags=0x4
thread - after setstack + before execve - ss_sp=0x8014106c0 ss_size=34816 ss_flags=0x0
argc: 2, ./test-sigs
main - ss_sp=0x8014106c0 ss_size=34816 ss_flags=0x0
As you can see, the second main output, which is straight after the execve, already has a signal stack set, which it shouldn't have.
This is corroborated by execve(2):
The signal stack is reset to be undefined (see sigaction(2) for more information).
So now to go kernel bug hunting....
Just to be clear: if the execve is done from the main thread, even if it has an alternative signal stack set, it's not present after the execve. So doing the execve from another thread is key to reproducing this, which may explain why it's not been identified in the past.
Thanks to @kostikbel who knew exactly where to look for this, here's the kernel patch for people to test:
Index: sys/kern/kern_sig.c
===================================================================
--- sys/kern/kern_sig.c (revision 315303)
+++ sys/kern/kern_sig.c (working copy)
@@ -976,7 +976,6 @@ execsigs(struct proc *p)
* and are now ignored by default).
*/
PROC_LOCK_ASSERT(p, MA_OWNED);
- td = FIRST_THREAD_IN_PROC(p);
ps = p->p_sigacts;
mtx_lock(&ps->ps_mtx);
while (SIGNOTEMPTY(ps->ps_sigcatch)) {
@@ -1007,6 +1006,8 @@ execsigs(struct proc *p)
* Reset stack state to the user stack.
* Clear set of signals caught on the signal stack.
*/
+ td = curthread;
+ MPASS(td->td_proc == p);
td->td_sigstk.ss_flags = SS_DISABLE;
td->td_sigstk.ss_size = 0;
td->td_sigstk.ss_sp = 0;
My test here, which usually crashes in a few seconds, has been running for 20 minutes now with no issues, so I'm hopeful this is the fix.
@stevenh Awesome work! Thanks!
Wow, incredible work @stevenh.
Looks like that patch was committed this morning: https://github.com/freebsd/freebsd/commit/0350bb5e92a7d5e3867a00def3b358d1983a4b2a
Does it make sense to workaround this bug in go until the upstream fix propagates out? (By reverting the os1_freebsd.go changes in a7cad52e04dd1890420452b38563997930edb2ca).
A huge thank you to @stevenh and everyone involved in tracking down this issue!
The FreeBSD change will make it into the stable/10 and stable/11 branches next week (by end of Mar 2017), but it will take more time to appear in new minor or errata releases; I suspect that it will be around August when we can count on fixed FreeBSD releases being widely available.
Is it possible to perform a run-time check, with the change in a7cad52 applying only to known fixed versions?
(Comment updated to clarify that it's the FreeBSD change that's making it into branches next week, and to provide a non-relative "next week".)
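For what it's worth, a rough sketch of what such a run-time check could look like, reading kern.osreldate via the syscall package; the cutoff constant below is a placeholder, not the real fixed revision (and, as the next comment notes, the eventual workaround avoided a version check entirely):
package main
import (
	"fmt"
	"syscall"
)
// fixedOSRelDate is a hypothetical placeholder for the first kern.osreldate
// value containing the kernel fix; it is not the actual number.
const fixedOSRelDate = 1100513
func kernelHasSigaltstackFix() (bool, error) {
	v, err := syscall.SysctlUint32("kern.osreldate")
	if err != nil {
		return false, err
	}
	return v >= fixedOSRelDate, nil
}
func main() {
	ok, err := kernelHasSigaltstackFix()
	fmt.Println("kernel fixed:", ok, "err:", err)
}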
Note that Aug is when Go 1.9 will be released: https://github.com/golang/go/wiki/Go-Release-Cycle
I think we can work around the problem without requiring a version check. Can someone with an affected system please test whether https://golang.org/cl/38325 fixes the problem? Thanks.
CL https://golang.org/cl/38325 mentions this issue.
I think we can work around the problem without requiring a version check. Can someone with an affected system please test whether https://golang.org/cl/38325 fixes the problem? Thanks.
Confirmed that patch fixes the crash on FreeBSD 11.0-RELEASE-p8
Steven can we get this added to the green ad test suite?
After speaking to @kostikbel this afternoon we may yet still have an outstanding issue.
It's possible that the repro case identified by @tmm1 was independent of the original issue.
I'm rebuilding the original fork only test case and retesting.
green ad test suite?
sorry, phone typos. The freebsd test suite that we maintain. This seems
like a good set of test cases to poke.
-adrian
It's possible that the repro case identified by @tmm1 was independent of the original issue.
😲 If that is the case, I apologize profusely for hijacking this thread.
As I noted to @stevenh, I'm definitely able to reproduce the original memory corruption issue on the patched system. Also, somebody mentioned that the test code was stripped to only use fork() and still caused the issue. In other words, an issue is still there.
@kostikbel could you share the exact setup of the test case that failed for you, as the three fork-only tests I've been running here haven't failed yet after over 10 hours of testing.
For the record, the three tests I've been running are:
Test 1 - No channels (with ktrace)
ktrace -tfw+ -i -f ktrace-test.out ./test
Test 2 - No channels (without ktrace)
./test2
package main
import (
"log"
"net/http"
_ "net/http/pprof"
"runtime"
"syscall"
)
var (
forkRoutines = 24
)
func run() {
for {
if err := syscall.ForkOnlyBSDTest(); err != nil {
log.Fatal(err)
}
}
}
func main() {
log.Printf("Starting %v forking goroutines...\n", forkRoutines)
log.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
for i := 0; i < forkRoutines; i++ {
go run()
}
log.Println(http.ListenAndServe(":8081", nil))
}
Test 3 - With Channels (without ktrace)
./test3
package main
import (
"log"
"syscall"
)
func run(done chan struct{}) {
if err := syscall.ForkOnlyBSDTest(); err != nil {
log.Fatal(err)
}
done <- struct{}{}
return
}
func main() {
log.Println("Starting a bunch of goroutines...")
done := make(chan struct{}, 24)
for i := 0; i < 24; i++ {
go run(done)
}
for {
select {
case <-done:
go run(done)
}
}
}
Fork Only:
package syscall
func ForkOnlyBSDTest() (err error) {
var r1 uintptr
var pid int
var err1 Errno
var wstatus WaitStatus
ForkLock.Lock()
runtime_BeforeFork()
r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
if err1 != 0 {
runtime_AfterFork()
ForkLock.Unlock()
return err1
}
if r1 == 0 {
for {
RawSyscall(SYS_EXIT, 253, 0, 0)
}
}
runtime_AfterFork()
ForkLock.Unlock()
pid = int(r1)
_, err = Wait4(pid, &wstatus, 0, nil)
for err == EINTR {
_, err = Wait4(pid, &wstatus, 0, nil)
}
return
}
All test have the following env:
export GOGC=2
export GOTRACEBACK=crash
export GOMAXPROCS=24
Hardware:
Dec 6 18:08:23 head kernel: CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz (2500.05-MHz K8-class CPU)
Dec 6 18:08:23 head kernel: Origin="GenuineIntel" Id=0x206d7 Family=0x6 Model=0x2d Stepping=7
Dec 6 18:08:23 head kernel: Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Dec 6 18:08:23 head kernel: Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
Dec 6 18:08:23 head kernel: AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
Dec 6 18:08:23 head kernel: AMD Features2=0x1<LAHF>
Dec 6 18:08:23 head kernel: XSAVE Features=0x1<XSAVEOPT>
Dec 6 18:08:23 head kernel: VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
Dec 6 18:08:23 head kernel: TSC: P-state invariant, performance statistics
Dec 6 18:08:23 head kernel: real memory = 137438953472 (131072 MB)
Dec 6 18:08:23 head kernel: avail memory = 133429284864 (127248 MB)
OS:
FreeBSD 12.0-CURRENT #39 r315303M:
Go:
go version go1.8 freebsd/amd64
I use some binary, which was compiled from the source provided in the original message of this bug report. I am not sure which go version was used, but not older than 1/2 year, judging by dates.
I've not been able to repro a crash yet, but I have had the "100% of a core and everything sitting in locks at 0%" issue even with the patch, though only after ~24 hours.
I'm now retesting with the stop-the-world while holding the fork lock, to see if that helps narrow down whether this is a kernel or go runtime issue.
If anyone has any quicker reproduction cases I'd like to hear them, as 24 hours between tests is going to hamper investigation :(
For reference, the following will detect if your kernel has the signal stack issue.
It's different from the original, which would only detect the problem on FreeBSD 11.0+ due to the fact that the order of thread inserts has changed; to detect it in all cases, three threads are needed.
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
extern char **environ;
static void
die(const char *s)
{
perror(s);
exit(EXIT_FAILURE);
}
static void
setstack(void *arg)
{
stack_t ss;
ss.ss_sp = malloc(SIGSTKSZ);
if (ss.ss_sp == NULL)
die("malloc");
ss.ss_size = SIGSTKSZ;
ss.ss_flags = 0;
if (sigaltstack(&ss, NULL) < 0)
die("sigaltstack set");
}
static void *
thread_exec(void *arg)
{
struct timespec ts = {0, 1000};
char *argv[] = {"./test-sigs", "no-more", 0};
setstack(arg);
nanosleep(&ts, NULL);
execve(argv[0], &argv[0], environ);
die("exec failed");
return NULL;
}
static void *
thread_sleep(void *arg)
{
sleep(10);
return NULL;
}
int
main(int argc, char** argv)
{
int i, j;
int alloc_size = 32 * 1024;
pthread_t tid1, tid2;
char buf[1024];
void *arg;
if (argc != 1) {
stack_t ss;
if (sigaltstack(NULL, &ss) < 0)
die("sigaltstack get");
if (ss.ss_sp != NULL || ss.ss_flags != SS_DISABLE ||
ss.ss_size != 0) {
fprintf(stderr, "invalid signal stack after execve: "
"ss_sp=%p ss_size=%lu ss_flags=0x%x\n", ss.ss_sp,
(unsigned long)ss.ss_size,
(unsigned int)ss.ss_flags);
return 1;
}
printf("valid signal stack is valid after execve\n");
return 0;
}
// We have to use two threads to ensure that we can detect the
// issue when new threads are added to the head (pre 269095)
// and the tail of the process thread list.
j = pthread_create(&tid1, NULL, thread_exec, NULL);
if (j != 0) {
errno = j;
die("pthread_create");
}
j = pthread_create(&tid2, NULL, thread_sleep, NULL);
if (j != 0) {
errno = j;
die("pthread_create");
}
j = pthread_join(tid1, NULL);
if (j != 0) {
errno = j;
die("pthread_join");
}
j = pthread_join(tid2, NULL);
if (j != 0) {
errno = j;
}
return 0;
}
@stevenh I've quite a few FreeBSD machines in various configurations (virtualized, hardware, jailed, 11.0, 12-CURRENT and so on, lots of CPU's, lots of RAM) and I would love to help testing your fixes - let me know if there's any way I could help.
I've been able to reproduce the same crashes as before using the same methods with Ian's fixes added to the tree. Of three attempts, it wedged twice, panic'd once.
This strongly supports the case that the original issue stands.
Here's a log.
Here's a reference to the branch used to build the executable that produced the log.
Yes indeed @derekmarcotte, the fix added by Ian was to address the issue raised by @tmm1, which unfortunately was a separate issue from the original.
Continuing to debug this, the evidence is building up that the issue is related to a running child during a GC pass.
Using the following code I'm reliably able to trigger a failure in ~10mins on FreeBSD 12.0 r315303M (I don't believe the FreeBSD version is important).
syscall/exec_bsd_debug.go
package syscall

// Declared here without a body; the implementation is provided by the
// runtime via go:linkname (see runtime/exec_debug.go below).
func runtime_GC()

// forks counts successful forks for the debug prints below.
var forks int

func ForkOnlyBSDTest() (err error) {
var r1 uintptr
var pid int
var err1 Errno
var wstatus WaitStatus
ForkLock.Lock()
print("forking: ", forks+1, "\n")
runtime_BeforeFork()
r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
if err1 != 0 {
runtime_AfterFork()
ForkLock.Unlock()
return err1
}
if r1 == 0 {
for {
RawSyscall(SYS_EXIT, 253, 0, 0)
}
}
forks++
runtime_AfterFork()
print("forked: ", forks, "\n")
runtime_GC()
ForkLock.Unlock()
pid = int(r1)
_, err = Wait4(pid, &wstatus, 0, nil)
for err == EINTR {
print("wait4: EINTR\n")
_, err = Wait4(pid, &wstatus, 0, nil)
}
return err
}
runtime/exec_debug.go
package runtime
//go:linkname syscall_runtime_GC syscall.runtime_GC
//go:nosplit
func syscall_runtime_GC() {
GC()
}
main.go
package main
import (
"log"
"runtime"
"syscall"
)
var (
forkRoutines = 1
gc = true
)
func run() {
for {
if err := syscall.ForkOnlyBSDTest(); err != nil {
log.Fatal(err)
}
}
}
func main() {
log.Printf("Starting %v forking goroutines...\n", forkRoutines)
log.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0), "version:", runtime.Version()
for i := 0; i < forkRoutines; i++ {
go run()
}
done := make(chan struct{})
<-done
}
env settings
export GOGC=off
export GOTRACEBACK=crash
export GOMAXPROCS=16
I've discovered a few interesting facts:
I believe 1 rules out signal delivery during the GC pass as a potential culprit.
So the issue appears to be triggered by having a __running__ child present during a GC pass.
The crash typically looks like:
fatal error: workbuf is not empty
runtime stack:
runtime.throw(0x7d62ad, 0x14)
/usr/local/go/src/runtime/panic.go:596 +0x95 fp=0x7fffde9f2b58 sp=0x7fffde9f2b38
runtime.(*workbuf).checkempty(0x801166800)
/usr/local/go/src/runtime/mgcwork.go:320 +0x4e fp=0x7fffde9f2b78 sp=0x7fffde9f2b58
runtime.getempty(0x7fffde9f2bd0)
/usr/local/go/src/runtime/mgcwork.go:332 +0x96 fp=0x7fffde9f2bb0 sp=0x7fffde9f2b78
runtime.handoff(0x801166800, 0xc41fffa1b2)
/usr/local/go/src/runtime/mgcwork.go:432 +0x26 fp=0x7fffde9f2be8 sp=0x7fffde9f2bb0
runtime.(*gcWork).balance(0xc42001f228)
/usr/local/go/src/runtime/mgcwork.go:275 +0x4e fp=0x7fffde9f2c08 sp=0x7fffde9f2be8
runtime.gcDrain(0xc42001f228, 0x0)
/usr/local/go/src/runtime/mgcmark.go:1066 +0x36c fp=0x7fffde9f2c40 sp=0x7fffde9f2c08
runtime.gchelper()
/usr/local/go/src/runtime/mgc.go:1910 +0x94 fp=0x7fffde9f2c70 sp=0x7fffde9f2c40
runtime.stopm()
/usr/local/go/src/runtime/proc.go:1653 +0xdd fp=0x7fffde9f2c98 sp=0x7fffde9f2c70
runtime.findrunnable(0xc420018000, 0x0)
/usr/local/go/src/runtime/proc.go:2102 +0x2e4 fp=0x7fffde9f2d30 sp=0x7fffde9f2c98
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2222 +0x14c fp=0x7fffde9f2d70 sp=0x7fffde9f2d30
runtime.goschedImpl(0xc42014a000)
/usr/local/go/src/runtime/proc.go:2300 +0x106 fp=0x7fffde9f2d90 sp=0x7fffde9f2d70
runtime.gopreempt_m(0xc42014a000)
/usr/local/go/src/runtime/proc.go:2315 +0x36 fp=0x7fffde9f2da8 sp=0x7fffde9f2d90
runtime.newstack(0x0)
/usr/local/go/src/runtime/stack.go:1081 +0x2f0 fp=0x7fffde9f2f20 sp=0x7fffde9f2da8
runtime.morestack()
/usr/local/go/src/runtime/asm_amd64.s:398 +0x86 fp=0x7fffde9f2f30 sp=0x7fffde9f2f20
I also needed the following patch to ensure I got a core even with GOTRACEBACK=crash, as it seems that in this case, where m.traceback is set to 1, it prevents the user preference taking effect, so it was impossible to trigger a crash on panic.
--- ./runtime/runtime1.go.orig 2017-03-21 16:27:26.000000000 +0000
+++ ./runtime/runtime1.go 2017-03-21 16:27:26.000000000 +0000
@@ -36,14 +36,16 @@ var traceback_env uint32
func gotraceback() (level int32, all, crash bool) {
_g_ := getg()
all = _g_.m.throwing > 0
- if _g_.m.traceback != 0 {
- level = int32(_g_.m.traceback)
- return
- }
t := atomic.Load(&traceback_cache)
crash = t&tracebackCrash != 0
all = all || t&tracebackAll != 0
- level = int32(t >> tracebackShift)
+ if _g_.m.traceback != 0 {
+ level = int32(_g_.m.traceback)
+ }
+
+ if int32(t>>tracebackShift) > level {
+ level = int32(t >> tracebackShift)
+ }
return
}
OK, very interesting data point: it appears that if I compile with CGO_ENABLED=0 then the problem goes away. This would explain why we've never seen this in production, as all of our binaries are compiled with CGO disabled.
Scratch that, after a few test runs it happened :(
@stevenh: FWIW - I'm still getting panics with GOGC=off, and no explicit calls to the GC (as in your example). I don't believe that garbage collection is implicated here, unless it runs sometimes despite GOGC. Last trawl through the runtime, I felt like GOGC=off kills the GC dead.
I've used a stripped down version of your example main.go, and my ForkOnlyBSDTest.
@derekmarcotte I don't think GC is the cause, however I do think forcing GC between the fork and wait produces enough meaningful work in the go runtime to detect the memory corruption.
Out of interest, could you post your main.go and ForkOnlyBSDTest? Also, forgot to ask: with GC disabled, what crash did you get?
@stevenh:
In terms of extra work, I think the /bin/true case is pretty quick to fail. I think it's helpful to eliminate as much as possible, to narrow down on the involved systems. The only things left AFAIK in play are the scheduler, stack expansion?, and the worker routines (e.g. fork/wait).
I get fatal error: acquirep: invalid p state; here's a recent example:
Mon Mar 27 06:47:55 EDT 2017
acquirep: p->m=4895104(0) p->status=1
fatal error: acquirep: invalid p state
... stack traces
Mon Mar 27 11:54:57 EDT 2017
So that one was about 5 hours.
I haven't been meticulous about logging the timing any more though. If it's helpful, I'm sure I could script up something like this to detect the wedge, and start recording more times? I don't feel that would help much at this point though.
Thanks @derekmarcotte that's why I've been using the force GC test recently as it blows up pretty reliably in between 1-10mins, which makes iterations much quicker.
@stevenh: I ended up scripting the thing, here's a sample log (with backtraces omitted):
=== Wed Apr 5 16:48:27 EDT 2017 - Starting new instance
=== Wed Apr 5 18:23:40 EDT 2017 - Killing wedged 42369
=== Wed Apr 5 18:23:45 EDT 2017 - Starting new instance
=== Thu Apr 6 02:45:19 EDT 2017 - Starting new instance
=== Thu Apr 6 12:52:40 EDT 2017 - Starting new instance
=== Thu Apr 6 18:24:49 EDT 2017 - Killing wedged 98836
=== Thu Apr 6 18:24:54 EDT 2017 - Starting new instance
=== Thu Apr 6 20:47:11 EDT 2017 - Starting new instance
=== Fri Apr 7 02:25:20 EDT 2017 - Killing wedged 53688
=== Fri Apr 7 02:25:25 EDT 2017 - Starting new instance
=== Fri Apr 7 02:49:28 EDT 2017 - Starting new instance
=== Fri Apr 7 03:16:05 EDT 2017 - Starting new instance
=== Fri Apr 7 06:42:34 EDT 2017 - Killing wedged 38753
=== Fri Apr 7 06:42:39 EDT 2017 - Starting new instance
=== Fri Apr 7 06:42:54 EDT 2017 - Killing wedged 24324
=== Fri Apr 7 06:42:59 EDT 2017 - Starting new instance
=== Fri Apr 7 06:43:04 EDT 2017 - Killing wedged 97972
=== Fri Apr 7 06:43:09 EDT 2017 - Starting new instance
=== Fri Apr 7 06:43:34 EDT 2017 - Killing wedged 23425
=== Fri Apr 7 06:43:39 EDT 2017 - Starting new instance
=== Fri Apr 7 06:44:29 EDT 2017 - Killing wedged 45400
=== Fri Apr 7 06:44:34 EDT 2017 - Starting new instance
=== Fri Apr 7 06:45:19 EDT 2017 - Killing wedged 86970
=== Fri Apr 7 06:45:24 EDT 2017 - Starting new instance
=== Fri Apr 7 06:45:39 EDT 2017 - Killing wedged 9083
=== Fri Apr 7 06:45:44 EDT 2017 - Starting new instance
=== Fri Apr 7 06:46:04 EDT 2017 - Killing wedged 78764
=== Fri Apr 7 06:46:09 EDT 2017 - Starting new instance
=== Fri Apr 7 06:46:14 EDT 2017 - Killing wedged 77503
=== Fri Apr 7 06:46:19 EDT 2017 - Starting new instance
=== Fri Apr 7 06:46:24 EDT 2017 - Killing wedged 2314
=== Fri Apr 7 06:46:29 EDT 2017 - Starting new instance
=== Fri Apr 7 06:46:59 EDT 2017 - Killing wedged 27186
=== Fri Apr 7 06:47:04 EDT 2017 - Starting new instance
=== Fri Apr 7 06:47:25 EDT 2017 - Killing wedged 73453
=== Fri Apr 7 06:47:30 EDT 2017 - Starting new instance
=== Fri Apr 7 06:47:35 EDT 2017 - Killing wedged 72424
=== Fri Apr 7 06:47:40 EDT 2017 - Starting new instance
=== Fri Apr 7 06:48:15 EDT 2017 - Killing wedged 96875
=== Fri Apr 7 06:48:20 EDT 2017 - Starting new instance
=== Fri Apr 7 06:49:15 EDT 2017 - Killing wedged 78286
=== Fri Apr 7 06:49:20 EDT 2017 - Starting new instance
=== Fri Apr 7 06:49:35 EDT 2017 - Killing wedged 57532
=== Fri Apr 7 06:49:40 EDT 2017 - Starting new instance
=== Fri Apr 7 06:50:40 EDT 2017 - Killing wedged 29647
=== Fri Apr 7 06:50:45 EDT 2017 - Starting new instance
=== Fri Apr 7 06:51:15 EDT 2017 - Killing wedged 41134
=== Fri Apr 7 06:51:20 EDT 2017 - Starting new instance
=== Fri Apr 7 06:51:40 EDT 2017 - Killing wedged 93192
=== Fri Apr 7 06:51:45 EDT 2017 - Starting new instance
=== Fri Apr 7 06:52:15 EDT 2017 - Killing wedged 86941
=== Fri Apr 7 06:52:20 EDT 2017 - Starting new instance
=== Fri Apr 7 06:52:35 EDT 2017 - Killing wedged 27716
=== Fri Apr 7 06:52:41 EDT 2017 - Starting new instance
=== Fri Apr 7 10:25:00 EDT 2017 - Killing wedged 6064
=== Fri Apr 7 10:25:05 EDT 2017 - Starting new instance
=== Fri Apr 7 16:12:26 EDT 2017 - Starting new instance
=== Fri Apr 7 18:25:48 EDT 2017 - Killing wedged 31596
=== Fri Apr 7 18:25:53 EDT 2017 - Starting new instance
=== Fri Apr 7 18:25:53 EDT 2017 - Killing wedged 53833
=== Fri Apr 7 18:25:58 EDT 2017 - Starting new instance
=== Fri Apr 7 19:03:51 EDT 2017 - Starting new instance
=== Sat Apr 8 04:17:57 EDT 2017 - Killing wedged 13586
=== Sat Apr 8 04:18:02 EDT 2017 - Starting new instance
=== Sat Apr 8 10:26:05 EDT 2017 - Killing wedged 44051
=== Sat Apr 8 10:26:10 EDT 2017 - Starting new instance
=== Sat Apr 8 18:25:48 EDT 2017 - Killing wedged 54803
=== Sat Apr 8 18:25:53 EDT 2017 - Starting new instance
=== Sat Apr 8 18:26:43 EDT 2017 - Killing wedged 6268
=== Sat Apr 8 18:26:48 EDT 2017 - Starting new instance
=== Sun Apr 9 00:12:15 EDT 2017 - Starting new instance
=== Sun Apr 9 01:02:04 EDT 2017 - Starting new instance
=== Sun Apr 9 02:26:56 EDT 2017 - Killing wedged 29694
=== Sun Apr 9 02:27:02 EDT 2017 - Starting new instance
=== Sun Apr 9 05:43:05 EDT 2017 - Starting new instance
=== Sun Apr 9 10:27:05 EDT 2017 - Killing wedged 90493
=== Sun Apr 9 10:27:10 EDT 2017 - Starting new instance
=== Sun Apr 9 18:27:27 EDT 2017 - Killing wedged 3961
=== Sun Apr 9 18:27:32 EDT 2017 - Starting new instance
=== Mon Apr 10 02:28:55 EDT 2017 - Killing wedged 57180
=== Mon Apr 10 02:29:00 EDT 2017 - Starting new instance
=== Mon Apr 10 02:29:00 EDT 2017 - Killing wedged 30622
=== Mon Apr 10 02:29:06 EDT 2017 - Starting new instance
=== Mon Apr 10 05:20:56 EDT 2017 - Starting new instance
=== Mon Apr 10 10:37:45 EDT 2017 - Killing wedged 39613
=== Mon Apr 10 10:37:50 EDT 2017 - Starting new instance
=== Mon Apr 10 11:30:20 EDT 2017 - Starting new instance
=== Mon Apr 10 18:28:54 EDT 2017 - Killing wedged 40438
=== Mon Apr 10 18:28:59 EDT 2017 - Starting new instance
=== Mon Apr 10 18:29:19 EDT 2017 - Killing wedged 97568
=== Mon Apr 10 18:29:24 EDT 2017 - Starting new instance
=== Mon Apr 10 22:43:26 EDT 2017 - Starting new instance
=== Tue Apr 11 00:32:26 EDT 2017 - Starting new instance
=== Tue Apr 11 01:53:53 EDT 2017 - Starting new instance
=== Tue Apr 11 02:29:11 EDT 2017 - Killing wedged 48237
=== Tue Apr 11 02:29:16 EDT 2017 - Starting new instance
=== Tue Apr 11 10:29:54 EDT 2017 - Killing wedged 28162
=== Tue Apr 11 10:29:59 EDT 2017 - Starting new instance
=== Tue Apr 11 13:44:17 EDT 2017 - Starting new instance
=== Tue Apr 11 13:46:22 EDT 2017 - Starting new instance
=== Tue Apr 11 16:25:46 EDT 2017 - Starting new instance
=== Tue Apr 11 18:30:02 EDT 2017 - Killing wedged 1645
=== Tue Apr 11 18:30:07 EDT 2017 - Starting new instance
=== Tue Apr 11 18:32:17 EDT 2017 - Killing wedged 25645
=== Tue Apr 11 18:32:22 EDT 2017 - Starting new instance
=== Tue Apr 11 20:44:39 EDT 2017 - Starting new instance
=== Tue Apr 11 20:44:59 EDT 2017 - Starting new instance
=== Wed Apr 12 02:30:15 EDT 2017 - Killing wedged 45351
If you see consecutive Starting new instance, this means that the first is a panic. So out of 64 runs, 44 wedged, and 20 panicked.
Hrm, on second thought, the wedged detection should be inverted, i.e. counting active threads, and killing when they dip below, say, 4. Having idle threads is no big deal. Sorry for the noise.
Any updates here?
I'm afraid I've not had any time recently to continue investigating this.
I do have a reliable test case which triggers the issue relatively quickly, but I've been unable to make any progress due to lack of time.
The last notable status I have is from March. As far as I know, this is still an issue. I'm at the limits of my understanding/ability and have directed my efforts to places where I can be more helpful. If someone wanted to point me at an instance, I would be happy to collaborate, and get it set-up and repeatable?
Confirming this is still an issue on 10.3-RELEASE, against master (a9d8d4d):
acquirep: p->m=842351116288(6) p->status=1
fatal error: acquirep: invalid p state
Wouldn't using vfork or posix_spawn as suggested in #5838 be a workaround to this issue?
The kernel code underneath fork is the same as vfork, so unlikely.
Do you know what the issue is? Do you know that the issue is in the FreeBSD kernel and not in the go runtime?
Also, even if the issue is apparently in the kernel, fork vs. vfork difference is exactly where VM is involved. So, although not definitive, the experiment might give some hints where to look next.
No, we still don't know if it's kernel or runtime. I've tried lots of kernel options on the raw kernel forks but none made any difference.
I've tried replicating with c code and have been unable to do so.
Even with the best reproduction case I have it always takes quite a number of attempts.
Given this it seems it requires complex threading interaction during the actual fork kernel call.
Based on the fact that the same symptoms are reproducible under linux, does it make sense to no longer call this a FreeBSD issue?
Let's not merge them until we know it's the same underlying cause. Same symptoms can be misleading.
(Sorry, wrong button.)
@rsc: Thanks. Agreed that there may be other things going on here.
Having said that, the issue I'm referencing is having the same crashes (fatal error: workbuf is not empty, fatal error: acquirep: invalid p state), in the same places, under Linux running the same test as in FreeBSD. I suspect this should be enough to invalidate the change to FreeBSD-only, yet perhaps not enough to merge the two issues.
I haven't seen the fatal error: unexpected signal during runtime execution referenced there, for example, and I've seen a lot of crashes.
Are the folks experiencing this running on Skylake or Kaby Lake hardware? If so, this might be relevant: https://lists.debian.org/debian-devel/2017/06/msg00308.html . Please try updating your CPU microcode to the latest release.
Nope Nehalem EP here.
I guess we still don't know what is happening. Punting to 1.10.
I agree @derekmarcotte. I think it doesn't make sense to state something is FreeBSD related when something very similar happens on at least one other OS too.
If this gets changed it should be mentioned in the release notes as well. However there it might be enough to simply mention Linux and reference the corresponding issue.
Might this change to FreeBSD be related? It doesn't seem likely to me, but others here know more about how go works than I do.
It looks like go only uses LDT on 386 and not amd64 so I suspect this wouldn't have any impact @dankm
Just wanted to confirm this is still an issue with 1.9.1 which includes the signal handling fixes:
2017/10/10 22:57:08 Starting 8 forking goroutines...
2017/10/10 22:57:08 GOMAXPROCS: 24 version: go1.9.1
acquirep: p->m=842351920128(9) p->status=1
fatal error: acquirep: invalid p state
runtime stack:
runtime.throw(0x7a8101, 0x19)
/usr/local/go/src/runtime/panic.go:605 +0x95
runtime.acquirep1(0xc42001ec00)
/usr/local/go/src/runtime/proc.go:3689 +0x16d
runtime.acquirep(0xc42001ec00)
/usr/local/go/src/runtime/proc.go:3661 +0x2b
runtime.findrunnable(0xc420021300, 0x0)
/usr/local/go/src/runtime/proc.go:2068 +0x74f
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2245 +0x12c
runtime.park_m(0xc420152900)
/usr/local/go/src/runtime/proc.go:2308 +0xb6
runtime.mcall(0x800a0b040)
/usr/local/go/src/runtime/asm_amd64.s:286 +0x5b
...
1 runs completed, 1 failures, over 6h47m54s
I was running the original test case posted above on DragonFlyBSD out of curiosity mostly and although I don't see any crashes/corruption in the short time I let it run, on 5.0.2 after running the testcase I consistently get this in syslog -
td 0xffffff81dd9a6900 test: timeout race
td 0xffffff81e01e7400 test: timeout race
This is coming from sys/kern/kern_synch.c:716 - which is tsleep() -
/*
* Cleanup the timeout. If the timeout has already occured thandle
* has already been stopped, otherwise stop thandle. If the timeout
* is running (the callout thread must be blocked trying to get
* lwp_token) then wait for us to get scheduled.
*/
if (timo) {
while (td->td_flags & TDF_TIMEOUT_RUNNING) {
/* else we won't get rescheduled! */
if (lp->lwp_stat != LSSTOP)
lp->lwp_stat = LSSLEEP;
lwkt_deschedule_self(td);
td->td_wmesg = "tsrace";
lwkt_switch();
kprintf("td %p %s: timeout race\n", td, td->td_comm);
}
I am not sure this is even relevant - I haven't looked deeper, but thought I would throw this out just in case.
I haven't read the details of this issue, but even so, it could be fixed by this recent change in the FreeBSD VM system:
https://svnweb.freebsd.org/base?view=revision&revision=329254
If you want to test that change, you'll also need this one:
https://svnweb.freebsd.org/base?view=revision&revision=329252
It's not clear if these two kernel fixes will help fix the issues identified here but they are definitely in the right area, particularly the second one.
Do not call pmap_enter() with invalid protection mode.
Ensure memory consistency on COW
I'm trying to confirm either way if the above do have an impact, tests in progress. I'll report back when I have more, but if anyone else has a good reproduction environment which they can apply the above patches and test that would be great.
Hehe, I guess I should have refreshed my page before replying, thanks @vangyzen :)
Having applied the two patches I've unfortunately managed to confirm it hasn't solved the issue, as I had a crash:
acquirep: p->m=5491744(0) p->status=1
fatal error: acquirep: invalid p state
@stevenh could you pull down https://svnweb.freebsd.org/base?view=revision&revision=335171 and try again? It seems like this is a different manifestation of the bug described in https://reviews.freebsd.org/D15293
Just seen that and thought that it could be related.
I've asked @kostikbel for clarification about how that may exhibit, with the thought that it may be related to this.
If the patch is compatible with 11.x I'll grab it and run some more tests.
From the description I got from alc of the way the issue this fix addresses exhibits, it sounds like it's not our cause, but I'm running a test nonetheless.
The way this bug exhibits, it sounds like having some form of system-wide reverse / time-travel debugger running would be handy, e.g. the bug exhibits, so hit "stop" then track it backwards.
Not sure if such a thing exists for FreeBSD though. :angel:
No, go still crashes:
```
==== NEW RUN Fri Feb 16 17:34:25 GMT 2018 ====
Starting 16 forking goroutines...
GOMAXPROCS: 24
acquirep: p->m=5491744(0) p->status=1
fatal error: acquirep: invalid p state
runtime stack:
runtime.throw(0x4c6fe1, 0x19)
/usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc420077e00 sp=0xc420077de0 pc=0x426fc5
runtime.acquirep1(0xc42002a600)
/usr/local/go/src/runtime/proc.go:3699 +0x16d fp=0xc420077e38 sp=0xc420077e00 pc=0x43205d
runtime.acquirep(0xc42002a600)
/usr/local/go/src/runtime/proc.go:3671 +0x2b fp=0xc420077e50 sp=0xc420077e38 pc=0x431eab
runtime.stopm()
/usr/local/go/src/runtime/proc.go:1689 +0x124 fp=0xc420077e78 sp=0xc420077e50 pc=0x42c924
runtime.findrunnable(0xc420020c00, 0x0)
/usr/local/go/src/runtime/proc.go:2135 +0x4d2 fp=0xc420077f10 sp=0xc420077e78 pc=0x42dac2
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2255 +0x12c fp=0xc420077f58 sp=0xc420077f10 pc=0x42e57c
runtime.goexit0(0xc42013bc80)
/usr/local/go/src/runtime/proc.go:2406 +0x236 fp=0xc420077fb0 sp=0xc420077f58 pc=0x42efd6
runtime.mcall(0x0)
/usr/local/go/src/runtime/asm_amd64.s:286 +0x5b fp=0xc420077fc0 sp=0xc420077fb0 pc=0x44dfab
```
@justinclift Wouldn't DTrace be enough to investigate this problem?
Not sure how, as this appears to be random memory corruption of the forking process. I've tried running it through truss before but it adds enough delay that it doesn't reproduce.
I can see if tracing the fork syscall with dtrace shows anything
This issue is a sprawling 2 year epic adventure. The fact that multiple bugs with different reproducers have been subsumed under one issue is unhelpful. Can someone write the Cliff's Notes including the current fastest reproducer? Thanks in advance.
Hi Matt, my current reproduction case is:
package main
import (
"fmt"
"log"
"runtime"
"syscall"
)
var (
forkRoutines = 16
)
func run(done chan struct{}) {
if err := syscall.ForkOnlyBSDTest(); err != nil {
log.Fatal(err)
}
done <- struct{}{}
}
func main() {
fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
done := make(chan struct{}, forkRoutines*2)
for i := 0; i < forkRoutines; i++ {
go run(done)
}
for range done {
go run(done)
}
}
Then this in the lib:
// +build darwin dragonfly freebsd netbsd openbsd
package syscall
func ForkOnlyBSDTest() (err error) {
var r1 uintptr
var pid int
var err1 Errno
var wstatus WaitStatus
ForkLock.Lock()
runtime_BeforeFork()
r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
if err1 != 0 {
runtime_AfterFork()
ForkLock.Unlock()
return err1
}
if r1 == 0 {
// in child, die die die my darling
for {
RawSyscall(SYS_EXIT, 253, 0, 0)
}
}
runtime_AfterFork()
ForkLock.Unlock()
pid = int(r1)
// Prime directive, exterminate
// Whatever stands left
_, err = Wait4(pid, &wstatus, 0, nil)
for err == EINTR {
_, err = Wait4(pid, &wstatus, 0, nil)
}
return
}
It's not quick and not guaranteed.
There is also a thread on the freebsd ports mailing list, "lang/go failes to build with poudriere, since 2018-04-05", which has a related reproduction case running the lang/go port build.
@stevenh is there any correlation with the amount of memory in the system or the number of cores? For example, would lowering hw.physmem on a 48-way help?
I’ve not seen one
I've been running the usual tests on 11.2-RELEASE-p3 with go1.11.1 and have not managed to reproduce after a weeks run.
Has anyone else seen a crash with these updated versions?
Ok, still no issues triggered on my test box, and it's been running forks in a loop since the 18th and 20th of November:
root 50215 273.0 0.0 104384 4988 9 S+J 20Nov18 134201:59.00 ./test
root 46993 259.4 0.0 104640 3204 8 S+J 18Nov18 151748:35.09 ./test2
Given this I'm tempted to say that the issue has been fixed either in FreeBSD or golang runtime.
Thanks very much for following up on this. I'm going to close the issue in the hope that it is indeed fixed. Anyone, please comment if you find future failures.
While updating our test box to 12.0 from 11.2 I noticed that this latest test was running with a potentially relevant kernel patch on top of 11.2-RELEASE.
If anyone can reproduce this issue on 11.2 then they should try applying the following to see if was indeed the fix:
https://svnweb.freebsd.org/base?view=revision&revision=335171
This is already included in 12.0-RELEASE and was MFC'ed to 11-stable as of:
https://svnweb.freebsd.org/base?view=revision&revision=335508