Please answer these questions before submitting your issue. Thanks!
go version
go version go1.6.2 freebsd/amd64
go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GO15VENDOREXPERIMENT="1"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="clang++"
```
package main

/* stdlib includes */
import (
	"fmt"
	"os/exec"
)

func run(done chan struct{}) {
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		goto finished
	}
	cmd.Wait()
finished:
	done <- struct{}{}
	return
}

func main() {
	fmt.Println("Starting a bunch of goroutines...")

	// 8 & 16 are arbitrary
	done := make(chan struct{}, 16)

	for i := 0; i < 8; i++ {
		go run(done)
	}

	for {
		select {
		case <-done:
			go run(done)
		}
	}
}
```
I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.
Various types of panics, caused by what looks to be corruption within the finalizer lists, which I assume is the result of a race condition. These panics can happen as quickly as 2 minutes in, or take much longer; 10 minutes seems a good round number.
Occasionally addspecial gets stuck in an infinite loop holding the lock, and the process wedges. This is illustrated in log 1462933614, with x.next pointing to x. This appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.
I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.
Here are some sample messages:
fatal error: runtime.SetFinalizer: finalizer already set
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty
1462926841-SetFinalizer-ex1.txt
1462926969-SetFinalizer-ex2.txt
1462933295-nonempty-check-fails.txt
1462933614-wedged.txt
This was run on an 8-core processor and on a 4-core, 8-thread processor with ECC RAM, with similar results.
Additionally, while this example is extreme, it also represents the core functionality of a project I've been working on part-time for many months. I'm happy to provide any further assistance diagnosing this issue - I'm very invested!
Here's another panic experienced in mallocgc by the same sample code:
@derekmarcotte, can you also reproduce this at master? (which will become Go 1.7)
And do you only see it on FreeBSD, or other operating systems as well?
/cc @aclements @ianlancetaylor
@RLH, weren't you seeing "finalizer already set" failures a little while ago? Did you track that down? Could it have been related?
On closer inspection of the failures you posted (thanks for collecting several, BTW), this smells like memory corruption. Just guessing, there are a few likely culprits. It may be the finalizer code, but I actually think that's less likely. More likely is that it's the fork/exec code: that code is really subtle, mucks with the address space, and contains system-specific parts (which would explain why it's showing up on FreeBSD, but I haven't been able to reproduce it on Linux yet).
@derekmarcotte, can you try commenting out the runtime.SetFinalizer call in newProcess in os/exec.go (your test doesn't need that finalizer) and see if you can still reproduce it? If you can, that will rule out finalizers.
Note that FreeBSD runs via gomote, if this is that easily reproducible. I haven't yet tried.
Just got a golang/go dev environment set up on my machine (was from FreeBSD packages). Will report back soon.
Here are the heads of a bunch of logs, with the epoch at the start of the process so you can see the interval. I suspected a race rather than memory corruption because, by and large, it is "finalizer already set" that crashes the process. I thought maybe the GC was setting these as free (or otherwise touching them) before SetFinalizer had a chance to set their value.
I didn't include too many of them in my initial report, because I thought they were largely redundant.
@bradfitz: these logs are against master:
1463168442
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set
runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420059f48 sp=0xc420059f30
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420059f80 sp=0xc420059f48
runtime.systemstack(0xc420019500)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420059f88 sp=0xc420059f80
1463170469
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set
runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0x7fffffffea80 sp=0x7fffffffea68
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0x7fffffffeab8 sp=0x7fffffffea80
runtime.systemstack(0x52ad00)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0x7fffffffeac0 sp=0x7fffffffeab8
1463170494
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set
runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420067f48 sp=0xc420067f30
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420067f80 sp=0xc420067f48
runtime.systemstack(0xc420019500)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420067f88 sp=0xc420067f80
1463171133
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set
runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc4202cbf48 sp=0xc4202cbf30
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc4202cbf80 sp=0xc4202cbf48
runtime.systemstack(0xc42001c000)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc4202cbf88 sp=0xc4202cbf80
@aclements: I will try your patch next. One variable at a time.
The "fatal error: runtime.SetFinalizer: finalizer already set" bug I was
seeing a few days ago were on the 1.8 dev.garbage branch and the result of
a TOC write barrier not marking an object as being published, TOC doing a
rollback, and reallocating a new object over an old one. Both objects had
finalizers and the bug was tripped.
None of the TOC code is in 1.7 or for that matter on TIP. I can force
myself to imagine that the 1.7 allocCache code could cause a similar
situation if the allocCache and bitmaps were not coherent. but if that were
the case it would expect lots of failures all over the place and across all
platforms, not just freeBSD.
@aclements: I've run it with your patch, although I haven't been able to babysit it too much.
The first time, all threads were idle after a number of hours (i.e. 0% cpu across the board). Connecting gdb to that process gave me trouble, and I couldn't get any logging out of it.
This morning, I was able to connect to a different process that looks a lot like 1462933614-wedged.txt. I've attached a log from gdb there:
Will keep trying to come up with more info.
@aclements: Here's some more logs from a binary build with the patch:
1463315708-finalizer-already-set.txt
1463349807-finalizer-already-set.txt
1463352601-workbuf-not-empty.txt
1463362849-workbuf-empty.txt
1463378745-wedged-gdb.txt
Please let me know if I can be of further assistance.
Thanks for the logs! I hadn't realized there were two finalizers involved here. Could you also comment out the SetFinalizer in NewFile in os/file_unix.go and see if it's still reproducible? (Your test also doesn't need that finalizer.)
I suspected a race vs. memory corruption because by and large it is the finalizer already set that crashes the process.
I didn't mean to say that it isn't necessarily a race. It's actually quite likely a race, but it's resulting in corruption of internal runtime structures, which suggests that the race is happening on freed memory. The "workbuf is empty" failure mode especially points at memory corruption, which is why my initial guess is that the finalizers (and the specials queue in general) may be victims rather than perpetrators. It's also easy to get finalizers out of the picture, while it's harder to get fork/exec out of the picture without completely changing the program. :)
Thanks @aclements !
One crash, ~4 hours into running, since removing the SetFinalizer in NewFile.
I have a second process running for almost 11 hours, with 4 of the threads wedged, but it is still doing work.
After ~11 hours + 5 minutes, the process panic'd:
Thanks for the new logs. I suspect that's actually not the same failure, which suggests that the original problem is in fact related to finalizers.
My pleasure, thanks for the feedback. Are there any next steps for me? (I'm likely to poke around this bottom issue, although I'm neither a go runtime guy, nor a FreeBSD systems-level guy - yet. Would like to be as helpful as possible.)
Thanks again!
I'm going to post a few more here. This'll be my last batch, unless requested. :smile: 1463584079 is a new message.
==> 1463485780 <==
Starting a bunch of goroutines...
fatal error: workbuf is not empty
runtime stack:
runtime.throw(0x4bff02, 0x14)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*workbuf).checkempty(0x8007dae00)
/home/derek/go/src/github.com/golang/go/src/runtime/mgcwork.go:301 +0x3f
runtime.getempty(0x8007dae00)
==> 1463584079 <==
Starting a bunch of goroutines...
fatal error: MHeap_AllocLocked - MSpan not free
runtime stack:
runtime.throw(0x4c2528, 0x22)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*mheap).allocSpanLocked(0x52dbe0, 0x1, 0xc4200ae1a0)
/home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:637 +0x498
runtime.(*mheap).alloc_m(0x52dbe0, 0x1, 0x12, 0xc420447fe0)
/home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:510 +0xd6
==> 1463603516 <==
Starting a bunch of goroutines...
acquirep: p->m=842351485952(10) p->status=1
acquirep: p->m=842350813184(4) p->status=1
fatal error: acquirep: invalid p state
fatal error: acquirep: invalid p state
runtime stack:
runtime.throw(0x4c0ab3, 0x19)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc420015500)
==> 1463642257 <==
Starting a bunch of goroutines...
acquirep: p->m=0(0) p->status=2
fatal error: acquirep: invalid p state
acquirep: p->m=0(0) p->status=3
fatal error: acquirep: invalid p state
runtime stack:
runtime.throw(0x4c0ab3, 0x19)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc42001c000)
Thanks! I assume these last few failures are also with master and with the two SetFinalizer calls commented out?
That's correct! Thanks. Anyone else able to reproduce?
@derekmarcotte, what version of FreeBSD and how many CPUs are you running with? I haven't had any luck yet reproducing on FreeBSD 10.1 with 2 CPUs.
@aclements: both machines were 4 core, 8 thread CPUs:
Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (32GB/ECC) - none of the logs included were from this machine, but I could not keep my process (of my project, not the listing in this issue) running for very long
AMD FX(tm)-8350 Eight-Core Processor (8GB) - this is my dev box where all the logs are included
The Xeon is running 10.3-RELEASE, and the AMD was running 10.1-RELEASE at the time of the logs (has since been upgraded to 10.3-RELEASE).
I suspect these hosts chew through many more invocations in the same amount of time than a 2-core machine would, which additionally increases the probability of contention/collisions in any given instant.
The Xeon has since moved to production, so I don't have that hardware at my disposal for the time being, although I might be able to arrange something if it's required.
I can get dmesgs/kldstats for the Xeon and the AMD if helpful (I would rather post them out of band).
@derekmarcotte, thanks for the extra details (I see now you already gave the CPU configuration; sorry I missed that).
Two more experiments to try:
@aclements, Thanks for your suggestions. I'm currently exploring a different option. I was using gb for building my project (sorry! forgot to mention), and additionally for this test case.
I certainly didn't expect wildly differing binaries in a project with no external dependencies, as gb uses the go compiler internally. I've got more research to do here to account for this, so I apologize for that.
I've built using go directly and am in the process of testing. So far it has been running for 12 hours, without problem (with the SetFinalizers disabled). I have had previous test runs last this long, so I'm not ready to call it a success just yet. I'll be out of town for the next few days, so I can leave it running for a while and see where it ends up.
I think this is a very promising lead, based on the objdump of the two artefacts. It might be interesting to ask about the build tool in the issue report template, given the build-tool ecosystem that is currently out there (or the fact that there is an ecosystem at all).
@aclements, rebuilding gb from source using the go built from master, and then rebuilding the test with the new gb, creates nearly identical binaries (minus file-path TEXT entries); this is to be expected.
Perhaps there's something to this. Will keep you posted.
cc @davecheney
@aclements, the go-only build of the binary did eventually crash... somewhere around 24 hours (~153M goroutines later), so I don't think it's gb-related. I'm going to re-build with the following, per your suggestion:
```
func run(done chan struct{}) {
	cmd := exec.Command("doesnotexist")
	cmd.Wait()
	done <- struct{}{}
	return
}
```
@aclements, after the above change, it ran for 4 days without a panic.
@derekmarcotte, argh, I'm sorry. That wasn't a useful test. :( That eliminates the fork, the exec, _and_ the finalizer, so there isn't really anything left. It needs to be something like:
```
func run(done chan struct{}) {
	cmd := exec.Command("/doesnotexist")
	cmd.Start()
	cmd.Wait()
	done <- struct{}{}
	return
}
```
It's important that the path be absolute. Otherwise, Go will try to resolve the path itself, fail at that point, and not do anything for the subsequent method calls. It's also important that the cmd.Start() still happen (even though it will fail). Again, without that, the Wait just returns nil immediately.
I'm sorry I didn't catch that earlier.
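To make that distinction concrete, here is a minimal standalone sketch (illustrative only, not part of the original reproducer) of the behaviour described above: with a relative name the stored lookup error comes straight back from Start without forking, while the absolute path goes all the way to fork/exec before failing.

```
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Relative name: exec.Command resolves it via PATH up front and records
	// the lookup failure, so Start returns that error without ever forking.
	rel := exec.Command("doesnotexist")
	fmt.Println("relative:", rel.Start())

	// Absolute path: no lookup is done, so Start really forks and execs;
	// the exec fails in the child and the error is reported back to Start.
	abs := exec.Command("/doesnotexist")
	fmt.Println("absolute:", abs.Start())
}
```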
@aclements, at least we know that Go can spin for 4 days without crashing on my dev machine - so not a total waste.
I've recompiled with the above, and crashed after ~32 hours. Log attached:
@derekmarcotte, thanks! Was that with go or gb? With or without the SetFinalizer calls?
Also, 1.6.2 or master?
@aclements great questions. Was thinking we ought to recap all the variables at this point anyway.
Trying to narrow things down as much as possible. So, this is with go build using master (as of 7af2ce3), with this diff applied:
diff --git a/src/os/exec.go b/src/os/exec.go
index 239fd92..6a8eed5 100644
--- a/src/os/exec.go
+++ b/src/os/exec.go
@@ -5,7 +5,6 @@
package os
import (
- "runtime"
"sync/atomic"
"syscall"
)
@@ -19,7 +18,6 @@ type Process struct {
func newProcess(pid int, handle uintptr) *Process {
p := &Process{Pid: pid, handle: handle}
- runtime.SetFinalizer(p, (*Process).Release)
return p
}
diff --git a/src/os/file_unix.go b/src/os/file_unix.go
index 9b64f21..a997e9e 100644
--- a/src/os/file_unix.go
+++ b/src/os/file_unix.go
@@ -54,7 +54,6 @@ func NewFile(fd uintptr, name string) *File {
return nil
}
f := &File{&file{fd: fdi, name: name}}
- runtime.SetFinalizer(f.file, (*file).close)
return f
}
with source:
```
package main

/* stdlib includes */
import (
	"fmt"
	"os/exec"
)

func run(done chan struct{}) {
	cmd := exec.Command("/doesnotexist")
	cmd.Start()
	cmd.Wait()
	done <- struct{}{}
	return
}

func main() {
	fmt.Println("Starting a bunch of goroutines...")

	// 8 & 16 are arbitrary
	done := make(chan struct{}, 16)

	for i := 0; i < 8; i++ {
		go run(done)
	}

	for {
		select {
		case <-done:
			go run(done)
		}
	}
}
```
It sounds like this is FreeBSD-specific, so marking as such. Please correct me if I am wrong.
@ianlancetaylor : thanks for this. I'm just building a linux box dedicated to testing tonight. I haven't yet been able to test on linux. I'll update the thread as I find more information out.
Looks like I'm having the same issue with github.com/influxdb/telegraf. Panic log attached.
acquirep: p->m=859531515904(14) p->status=1
fatal error: acquirep: invalid p state
acquirep: p->m=859535287296(8) p->status=1
fatal error: acquirep: invalid p state
% uname -a
FreeBSD 9.3-RELEASE-p5 FreeBSD 9.3-RELEASE-p5 #0: Mon Nov 3 22:38:58 UTC 2014 [email protected]:/usr/obj/usr/src/sys/GENERIC amd64
% cat /var/run/dmesg.boot | grep -i cpu
CPU: Intel(R) Xeon(R) CPU E3-1270 V2 @ 3.50GHz (3492.14-MHz K8-class CPU)
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
go version go1.6.2 freebsd/amd64
I'll try to rebuild with go tip.
@derekmarcotte, are your test machines metal or VMs?
I've just kicked off a test on a 10.2-RELEASE box here to see if I can repro locally.
It took many hours but I did get a panic in the end.
This is from go 1.6.2 (no patches) on 10.2-RELEASE amd64 (metal), no virtualisation:
./test
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set
runtime stack:
runtime.SetFinalizer.func2()
/usr/local/go/src/runtime/mfinal.go:372 +0x6f
goroutine 23221055 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc8202eca58 sp=0xc8202eca50
runtime.SetFinalizer(0x4f2f00, 0xc8204a19e0, 0x4d68a0, 0x54f550)
/usr/local/go/src/runtime/mfinal.go:374 +0x4b5 fp=0xc8202ecbd8 sp=0xc8202eca58
os.NewFile(0x3, 0x51efd0, 0x9, 0x323200000000)
/usr/local/go/src/os/file_unix.go:57 +0xfc fp=0xc8202ecc30 sp=0xc8202ecbd8
os.OpenFile(0x51efd0, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/os/file_unix.go:123 +0x1bd fp=0xc8202ecca8 sp=0xc8202ecc30
os.Open(0x51efd0, 0x9, 0xdeaddeaddeaddead, 0x0, 0x0)
/usr/local/go/src/os/file.go:244 +0x48 fp=0xc8202ecce8 sp=0xc8202ecca8
os/exec.(*Cmd).stdin(0xc8203eac80, 0x0, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:171 +0x6e fp=0xc8202ecd80 sp=0xc8202ecce8
os/exec.(*Cmd).Start(0xc8203eac80, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:316 +0x2f4 fp=0xc8202ecf68 sp=0xc8202ecd80
main.run(0xc82006c060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x50 fp=0xc8202ecfa8 sp=0xc8202ecf68
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc8202ecfb0 sp=0xc8202ecfa8
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:31 +0x19b
are your test machines metal or VM's?
For me, telegraf crashes on both vmware VMs and baremetal machines.
@stevenh: So it's random, and I think the probability of crashing is a function of the number of CPUs being used; while I say 8 is arbitrary, it's also the number of CPUs I had on hand. If you have more, by all means set it higher.
Try it a few times and you'll see. I just stuck it in a loop like `while true; do date; ./ex-from-comment-225006179; done`, and then you can see them over a day period.
Anecdotally, I've been able to run a longer-lived process with GOMAXPROCS=2, but will be doing more thorough testing.
Most of those dumps are from a jail running on bare metal. The Xeon was straight on the jail host (although there were jails running).
@stevenh: Oh, and the one that uses /doesnotexist crashes after 32 hours, but the one that uses true crashes much sooner (because it appears there are multiple issues going on).
With GOMAXPROCS=1 and "true" it's been running for 20+ hours so far, so it does indeed seem to be affected by the number of procs available.
I've been running /doesnotexist with stock go 1.6.2 for just shy of a week with GOMAXPROCS="2". Interesting to note: there are 11 OS threads, which is surprising to me, but may be expected?
I've been able to cause the panic much quicker being more aggressive with the garbage collector. I've been using this script to test:
```
while true; do
	echo "==== NEW RUN $(date) ===="
	echo
	time sh -c 'export GOGC=5; export GODEBUG=gctrace=2,schedtrace=100; ./15658-doesnotexit'
	echo
	echo
done >> ../logs/15658-doesnotexit-logs-gogc5-gtrace2 2>> ../logs/15658-doesnotexit-logs-gogc5-gtrace2
```
Here's a sample of the crashing interval:
==== NEW RUN Tue Aug 30 16:36:11 EDT 2016 ====
==== NEW RUN Tue Aug 30 16:37:18 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:13:02 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:28:30 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:54:22 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:04:29 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:36:36 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:57:48 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:09:12 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:37:33 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:42:37 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:52:31 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:03:22 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:07:05 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:10:31 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:25:37 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:31:00 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:31:28 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:34:30 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:38:16 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:40:09 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:56:57 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:31:54 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:39:05 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:51:06 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:28:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:32:35 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:53:50 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:55:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:57:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:58:08 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:59:58 EDT 2016 ====
==== NEW RUN Tue Aug 30 23:52:15 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:28:37 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:46:25 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:53:03 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:57:34 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:06:02 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:13:20 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:21:55 EDT 2016 ====
==== NEW RUN Wed Aug 31 02:17:49 EDT 2016 ====
==== NEW RUN Wed Aug 31 02:55:35 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:43:36 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:47:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:48:14 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:50:32 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:59:05 EDT 2016 ====
==== NEW RUN Wed Aug 31 04:43:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 04:55:07 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:04:41 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:34:44 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:50:45 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:53:55 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:58:08 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:19:07 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:34:13 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:41:30 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:52:08 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:54:32 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:56:02 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:59:04 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:59:48 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:01:46 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:06:30 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:23:03 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:29:57 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:41:58 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:43:14 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:45:00 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:59:22 EDT 2016 ====
==== NEW RUN Wed Aug 31 09:24:11 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:01:22 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:05:45 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:09:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:12:51 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:23:35 EDT 2016 ====
^^^^ process was idle/wedged, sent kill manually
==== NEW RUN Thu Sep 1 06:47:27 EDT 2016 ====
==== NEW RUN Thu Sep 1 06:49:27 EDT 2016 ====
$ go version
go version go1.7 freebsd/amd64
$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH="/home/derek/dev/gopath"
GORACE=""
GOROOT="/home/derek/go"
GOTOOLDIR="/home/derek/go/pkg/tool/freebsd_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -gno-record-gcc-switches"
CXX="clang++"
CGO_ENABLED="1"
I am trying to reproduce this on DragonFly BSD. So far I was able to reproduce the idle state:
localhost% uname -a
DragonFly localhost.localdomain 4.6-RELEASE DragonFly v4.6.0-RELEASE #0: Mon Aug 1 12:46:25 EDT 2016 [email protected]:/usr/obj/build/sys/X86_64_GENERIC x86_64
localhost% go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="dragonfly"
GOOS="dragonfly"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/dragonfly_amd64"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build276771230=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
localhost% go version
go version go1.7 dragonfly/amd64
SCHED 783590ms: gomaxprocs=2 idleprocs=0 threads=8 spinningthreads=0 idlethreads=5 runqueue=8 [0 1]
I have been running this test inside Virtualbox, with 2 virtual CPUs.
On DragonFly it's actually causing a kernel panic. Therefore I created a bug report there; it includes the dump.
Cannot reproduce on freebsd-amd64-gce101 gomote running FreeBSD 10.1.
The Dragonfly panic is suggestive that there may be a similar kernel problem in FreeBSD.
Unlikely we are going to solve this.
Thanks very much for taking a look. I'm very grateful for your time.
Are you able to elaborate more fully on what you mean by "Unlikely we are going to solve this"? Are you referring to the issue in general, or the DragonFly kernel crash? Is the "we" the Go team?
I believe the implication of a kernel problem in FreeBSD, because of a panic in DragonFly, simply doesn't follow. I can give a lot more detail about this, but I fear it's redundant, unless requested.
For what it's worth, this issue is reliably reproducible enough that I've been able to bisect down to a commit that makes sense (https://github.com/golang/go/commit/e6d8bfe218a5f387d7aceddcaee5067a59181838), in my first go. Additionally, the commits leading up to the implicated commit in the bisect are documentation-only, and don't reproduce the behaviour. I have very high confidence in the bisect.
I haven't published this work yet, as I've been asking some new questions based on the bisect. I was hoping to be able to speak intelligently about this when I reported results, but I feel now the pressure has increased. I've been making progress slowly, but I am making progress (this isn't my daytime gig). I will post a repo later today with the work and the bisect log, so that others can reference it.
Here's my thoughts:
New lines of thought are:
I plan to investigate these questions, but again, this isn't my daytime gig.
As I mentioned earlier, I'm very happy to work with others on this.
I'd be very interested in more details about the environment you were unable to reproduce in. Specifically, how many CPUs? Is the gomote a virtualized environment, or do you have direct access to hardware somewhere? Perhaps we can compare build artefacts? i.e. does your artefact crash in my environment, and vice versa?
Thanks again for taking a look.
@derekmarcotte, thanks for investigating. gomote is our interface to freshly-created FreeBSD VMs on GCE.
See
https://github.com/golang/build/blob/master/env/freebsd-amd64/make.bash
https://github.com/golang/build/blob/master/dashboard/builders.go#L81
They're both n1-highcpu-4, which is https://cloud.google.com/compute/docs/machine-types#standard_machine_types ... "Standard 4 CPU machine type with 4 virtual CPUs and 15 GB of memory. ... For the n1 series of machine types, a virtual CPU is implemented as a single hardware hyper-thread on a 2.6 GHz Intel Xeon E5 (Sandy Bridge), 2.5 GHz Intel Xeon E5 v2 (Ivy Bridge), 2.3 GHz Intel Xeon E5 v3 (Haswell), or 2.2 GHz Intel Xeon E5 v4 (Broadwell)"
So it's at least 2 cores.
@derekmarcotte Thanks for continuing to investigate. I'm fairly convinced that this is a kernel problem, and your identification of that CL as the cause just makes that seems more likely to me. I have never been able to recreate the problem myself. I suspect that it requires real hardware, not a VM.
@rsc Actually the kernel bug causing the panic existed only for a small amount of time on DragonFly. It most likely was introduced with DragonFly 4.6.0. Using 4.6.1 I fail to reproduce it. So I am sorry for laying a false trail there.
I will keep it running though, to make sure nothing happens, but I already tested it far longer than originally.
Also as a side note, since the topic of VMs came up: All my tests have been done using Virtualbox.
Here's the repo, as promised: https://github.com/derekmarcotte/goruntime-testing
@ianlancetaylor: Thanks for checking it out. Would you be interested in running 15658-doesnotexist.go to completion? Should be on the order of 8-12 hours, maybe more? If I could suggest pulling the command line from the bisect-15658, to get the environment a bit more hostile. Check the logs directory for the bisect.
If there were a kernel problem, I would expect others to have problems in the wild with, say, inetd (because it is a glorified fork/exec machine); this would be a pretty hot code path for many production FreeBSD uses. Additionally, if this were a kernel problem, I would expect the kernel data structures themselves to become corrupt (and the kernel to panic, as DragonFly did), but I see no evidence of this. My main testing machine has a current uptime of 105 days, with probably hundreds of runtime panics, yet no other services have faltered and everything behaves flawlessly.
I've been running this on 1.7.1 for 12h+ now, on a 24-core E5-2640 running 8.2-RELEASE, with no hang or panic:
```
package main

/* stdlib includes */
import (
	"fmt"
	"os/exec"
	"runtime"
)

func run(done chan struct{}) {
	cmd := exec.Command("true")
	cmd.Start()
	cmd.Wait()
	done <- struct{}{}
	return
}

func main() {
	fmt.Println("Starting a bunch of goroutines...")
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// 8 & 16 are arbitrary
	done := make(chan struct{}, 16)

	for i := 0; i < 8; i++ {
		go run(done)
	}

	for range done {
		go run(done)
	}
}
```
It's not uncommon for Go programs to behave significantly differently from C programs, to the extent that they expose problems that have been latent in kernels for many years. I'm not saying I'm 100% sure that's what is going on here, but it's definitely possible, especially given that fork/exec seems to be required.
Using stevenh's code I received this seemingly new panic.
OS: FreeBSD 11.0
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz (2394.61-MHz K8-class CPU)
GOGC=5 ./test
Starting a bunch of goroutines...
GOMAXPROCS: 8
runtime: failed MSpanList_Remove 0x800528320 0x8005c7d20 0x53b670 0x53b660
fatal error: MSpanList_Remove
runtime stack:
runtime.throw(0x4c973e, 0x10)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*mSpanList).remove(0x53b660, 0x800528320)
/usr/local/go/src/runtime/mheap.go:1001 +0x19d
runtime.(*mcentral).cacheSpan(0x53b650, 0xc42001b228)
/usr/local/go/src/runtime/mcentral.go:55 +0x3d0
runtime.(*mcache).refill(0x800524e10, 0xc40000001e, 0xc4200b82e0)
/usr/local/go/src/runtime/mcache.go:121 +0xae
runtime.(*mcache).nextFree.func1()
/usr/local/go/src/runtime/malloc.go:505 +0x33
runtime.systemstack(0xc42007cee0)
/usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1079
goroutine 643553 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc4201e68e0 sp=0xc4201e68d8
runtime.(*mcache).nextFree(0x800524e10, 0x1e, 0x0, 0x40f354, 0x80064429b)
/usr/local/go/src/runtime/malloc.go:506 +0xb2 fp=0xc4201e6938 sp=0xc4201e68e0
runtime.mallocgc(0x380, 0x4a8020, 0x1, 0xc4200b82e0)
/usr/local/go/src/runtime/malloc.go:658 +0x809 fp=0xc4201e69d8 sp=0xc4201e6938
runtime.makeslice(0x4a8020, 0x338, 0x338, 0xc4200b82e0, 0x1f, 0x40e996)
/usr/local/go/src/runtime/slice.go:57 +0x7b fp=0xc4201e6a30 sp=0xc4201e69d8
syscall.ByteSliceFromString(0xc420010000, 0x337, 0xc4201064d8, 0xc4200b82e0, 0x1f, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:53 +0x64 fp=0xc4201e6a90 sp=0xc4201e6a30
syscall.BytePtrFromString(0xc420010000, 0x337, 0xc4200b82e0, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:69 +0x39 fp=0xc4201e6ad8 sp=0xc4201e6a90
syscall.SlicePtrFromStrings(0xc42000c840, 0x15, 0x15, 0xc4200881f0, 0x2, 0x2, 0x0, 0x0)
/usr/local/go/src/syscall/exec_unix.go:87 +0x9a fp=0xc4201e6b48 sp=0xc4201e6ad8
syscall.forkExec(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6d60, 0xc420162018, 0x40e996, 0xc4201e6cc8)
/usr/local/go/src/syscall/exec_unix.go:158 +0x16e fp=0xc4201e6c78 sp=0xc4201e6b48
syscall.StartProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6d60, 0x2, 0x4, 0x44ab20, 0xc4200f0280)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc4201e6cd0 sp=0xc4201e6c78
os.startProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6f08, 0xc42000c840, 0x15, 0x15)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3 fp=0xc4201e6db8 sp=0xc4201e6cd0
os.StartProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64 fp=0xc4201e6e10 sp=0xc4201e6db8
os/exec.(*Cmd).Start(0xc42000c2c0, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9 fp=0xc4201e6f60 sp=0xc4201e6e10
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68 fp=0xc4201e6fa8 sp=0xc4201e6f60
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc4201e6fb0 sp=0xc4201e6fa8
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 1 [runnable]:
main.main()
/home/reezer/test.go:30 +0x222
goroutine 643497 [runnable]:
sync.(*RWMutex).RUnlock(0x550950)
/usr/local/go/src/sync/rwmutex.go:55
syscall.Environ(0xc4200fc000, 0x15, 0x15)
/usr/local/go/src/syscall/env_unix.go:148 +0x210
os.Environ(0x0, 0x0, 0x0)
/usr/local/go/src/os/env.go:116 +0x22
os/exec.(*Cmd).envv(0xc420076160, 0xc420126080, 0x0, 0x1)
/usr/local/go/src/os/exec/exec.go:171 +0x38
os/exec.(*Cmd).Start(0xc420076160, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:356 +0x2d9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 643540 [runnable]:
syscall.Syscall(0x3, 0x1c, 0xc4201fabc8, 0x8, 0x0, 0x8, 0x0)
/usr/local/go/src/syscall/asm_freebsd_amd64.s:21 +0x5
syscall.readlen(0x1c, 0xc4201fabc8, 0x8, 0x2, 0xc42010e000, 0x16)
/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1303 +0x56
syscall.forkExec(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201fad60, 0xc4200ea020, 0x40e996, 0xc4201facc8)
/usr/local/go/src/syscall/exec_unix.go:202 +0x39f
syscall.StartProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201fad60, 0x2, 0x4, 0x44ab20, 0xc4200f0140)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201faf08, 0xc420076420, 0x15, 0x15)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201faf08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42000c580, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 643472 [runnable]:
syscall.Syscall6(0x214, 0x0, 0xeb54, 0x0, 0x18, 0x0, 0x0, 0xeb54, 0x0, 0x0)
/usr/local/go/src/syscall/asm_freebsd_amd64.s:44 +0x5
os.(*Process).blockUntilWaitable(0xc420166090, 0xc420014060, 0xc420014060, 0x4)
/usr/local/go/src/os/wait_wait6.go:29 +0x6b
os.(*Process).wait(0xc420166090, 0x0, 0xc4201ea160, 0x0)
/usr/local/go/src/os/exec_unix.go:22 +0xab
os.(*Process).Wait(0xc420166090, 0xc4201e8f08, 0x0, 0x0)
/usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc4201ea160, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc420072060)
/home/reezer/test.go:13 +0x76
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 643498 [semacquire]:
sync.runtime_Semacquire(0x550934)
/usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*Mutex).Lock(0x550930)
/usr/local/go/src/sync/mutex.go:85 +0xd0
sync.(*RWMutex).Lock(0x550930)
/usr/local/go/src/sync/rwmutex.go:86 +0x31
syscall.forkExec(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7d60, 0xc420080030, 0x40e996, 0xc4201f7cc8)
/usr/local/go/src/syscall/exec_unix.go:185 +0x25e
syscall.StartProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7d60, 0x2, 0x4, 0x44ab20, 0xc420108100)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7f08, 0xc420132000, 0x15, 0x15)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420044420, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 643440 [runnable]:
syscall.Syscall(0x3, 0xc, 0xc4201e7bc8, 0x8, 0x0, 0x8, 0x0)
/usr/local/go/src/syscall/asm_freebsd_amd64.s:21 +0x5
syscall.readlen(0xc, 0xc4201e7bc8, 0x8, 0x2, 0xc420106160, 0x16)
/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1303 +0x56
syscall.forkExec(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7d60, 0xc420080018, 0x40e996, 0xc4201e7cc8)
/usr/local/go/src/syscall/exec_unix.go:202 +0x39f
syscall.StartProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7d60, 0x2, 0x4, 0x44ab20, 0xc4200f0140)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7f08, 0xc420132420, 0x15, 0x15)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42000c420, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 643499 [runnable]:
syscall.fcntl(0x4, 0x2, 0x1, 0x0, 0x0, 0xc4c0000000)
/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:255
syscall.forkExecPipe(0xc4201fbc18, 0x2, 0x2, 0xc42018e8f0, 0x16)
/usr/local/go/src/syscall/exec_bsd.go:264 +0x87
syscall.forkExec(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbd60, 0xc420148040, 0x40e996, 0xc4201fbcc8)
/usr/local/go/src/syscall/exec_unix.go:188 +0x281
syscall.StartProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbd60, 0x2, 0x4, 0x44ab20, 0xc4200f00c0)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbf08, 0xc4200446e0, 0x15, 0x15)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbf08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420044580, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
@reezer Which version of Go?
go1.7.1 from the FreeBSD package
Go 1.7.1 has two notable bugs you could be hitting. See https://github.com/golang/go/issues?q=milestone%3AGo1.7.3
Can you try Go 1.7.3 or tip instead?
I will still test it longer, but so far it appears to be fixed.
Again, I am sorry for the noise. It was a test system I quickly set up and should have taken greater care to make sure it runs the latest version.
Okay, it takes a while on go1.7.3 freebsd/amd64 (FreeBSD 11.0), but I again receive one of the original errors. For people who want to try to reproduce it. GOGC=5 makes a big difference in the time it takes until the error appears.
```
GOGC=5 ./test &> out.log
cat out.log
Starting a bunch of goroutines...
GOMAXPROCS: 8
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
runtime stack:
runtime.throw(0x4ca671, 0x14)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x80052a260)
/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.handoff(0x800585000, 0xc41fff49b0)
/usr/local/go/src/runtime/mgcwork.go:419 +0x26
runtime.(*gcWork).balance(0xc420019c28)
/usr/local/go/src/runtime/mgcwork.go:263 +0x54
runtime.gcDrainN(0xc420019c28, 0x10000, 0xc4fffffffe)
/usr/local/go/src/runtime/mgcmark.go:1042 +0x195
runtime.gcAssistAlloc.func1()
/usr/local/go/src/runtime/mgcmark.go:462 +0xe5
runtime.systemstack(0xc42017c380)
/usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1079
goroutine 681244 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc42014b8b0 sp=0xc42014b8a8
runtime.gcAssistAlloc(0xc420129520)
/usr/local/go/src/runtime/mgcmark.go:499 +0x194 fp=0xc42014b938 sp=0xc42014b8b0
runtime.mallocgc(0x7, 0x4a80a0, 0x1, 0xc42000e220)
/usr/local/go/src/runtime/malloc.go:563 +0x9a4 fp=0xc42014b9d8 sp=0xc42014b938
runtime.makeslice(0x4a80a0, 0x7, 0x7, 0xc42000e220, 0x16, 0x40e996)
/usr/local/go/src/runtime/slice.go:57 +0x7b fp=0xc42014ba30 sp=0xc42014b9d8
syscall.ByteSliceFromString(0xc42000a010, 0x6, 0xc4200a01a0, 0xc42000e220, 0x16, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:53 +0x64 fp=0xc42014ba90 sp=0xc42014ba30
syscall.BytePtrFromString(0xc42000a010, 0x6, 0xc42000e220, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:69 +0x39 fp=0xc42014bad8 sp=0xc42014ba90
syscall.SlicePtrFromStrings(0xc42012c300, 0x18, 0x18, 0xc42013c050, 0x2, 0x2, 0x0, 0x0)
/usr/local/go/src/syscall/exec_unix.go:87 +0x9a fp=0xc42014bb48 sp=0xc42014bad8
syscall.forkExec(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bd60, 0xc42010a008, 0x40e996, 0xc42014bcc8)
/usr/local/go/src/syscall/exec_unix.go:158 +0x16e fp=0xc42014bc78 sp=0xc42014bb48
syscall.StartProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bd60, 0x2, 0x4, 0x44ab30, 0xc4201ce280)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc42014bcd0 sp=0xc42014bc78
os.startProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bf08, 0xc42012c300, 0x18, 0x18)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3 fp=0xc42014bdb8 sp=0xc42014bcd0
os.StartProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bf08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64 fp=0xc42014be10 sp=0xc42014bdb8
os/exec.(*Cmd).Start(0xc4201962c0, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9 fp=0xc42014bf60 sp=0xc42014be10
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68 fp=0xc42014bfa8 sp=0xc42014bf60
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42014bfb0 sp=0xc42014bfa8
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 1 [chan receive]:
main.main()
/home/reezer/test.go:30 +0x222
goroutine 681182 [semacquire]:
syscall.forkAndExecInChild(0xc42012e150, 0xc42012e160, 0x2, 0x2, 0xc420194000, 0x19, 0x19, 0x0, 0x0, 0xc4201c8d60, ...)
/usr/local/go/src/syscall/exec_bsd.go:54 +0x5e
syscall.forkExec(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8d60, 0xc42013a010, 0x40e996, 0xc4201c8cc8)
/usr/local/go/src/syscall/exec_unix.go:193 +0x334
syscall.StartProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8d60, 0x2, 0x4, 0x44ab30, 0xc420134000)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8f08, 0xc420076180, 0x18, 0x18)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc4201b2420, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681184 [semacquire]:
os/exec.(*Cmd).Start(0xc4201d2000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:367 +0x466
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681213 [semacquire]:
os.(*Process).wait(0xc42010e150, 0x0, 0xc4201b22c0, 0x0)
/usr/local/go/src/os/exec_unix.go:48 +0x17a
os.(*Process).Wait(0xc42010e150, 0xc4201adf08, 0x0, 0x0)
/usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc4201b22c0, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc420072060)
/home/reezer/test.go:13 +0x76
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681183 [semacquire]:
os.newProcess(0x14bf2, 0x0, 0xc4200ec010)
/usr/local/go/src/os/exec.go:23 +0x31
os.startProcess(0xc42000a4d0, 0xd, 0xc4200ec010, 0x1, 0x1, 0xc4201abf08, 0xc42000c780, 0x18, 0x18)
/usr/local/go/src/os/exec_posix.go:49 +0x2c0
os.StartProcess(0xc42000a4d0, 0xd, 0xc4200ec010, 0x1, 0x1, 0xc4201abf08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420196160, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681161 [running]:
goroutine running on other thread; stack unavailable
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681214 [semacquire]:
syscall.ByteSliceFromString(0xc42000a010, 0x6, 0xc4200bc080, 0x16, 0x16, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc42000a010, 0x6, 0xc4200bc080, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.SlicePtrFromStrings(0xc420180180, 0x18, 0x18, 0xc420198290, 0x2, 0x2, 0x0, 0x0)
/usr/local/go/src/syscall/exec_unix.go:87 +0x9a
syscall.forkExec(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7d60, 0xc4201b8068, 0x40e996, 0xc4201c7cc8)
/usr/local/go/src/syscall/exec_unix.go:158 +0x16e
syscall.StartProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7d60, 0x2, 0x4, 0x44ab30, 0xc42004e4c0)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7f08, 0xc420180180, 0x18, 0x18)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc4201b2160, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
/home/reezer/test.go:12 +0x68
created by main.main
/home/reezer/test.go:31 +0x24d
goroutine 681162 [running]:
goroutine running on other thread; stack unavailable
created by main.main
/home/reezer/test.go:31 +0x24d
runtime stack:
runtime.throw(0x4ca671, 0x14)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc420017228)
/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc420017228, 0xc42017d380)
/usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc42017d380, 0x0, 0x0, 0xc41fff4163, 0xc400000000, 0x800765860, 0xc420017228, 0xc)
/usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc42017d380)
/usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc4200944a0, 0xc42017d380)
/usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
/usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc42003de70)
/usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc4200944a0, 0xc42017d380)
/usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc4200944a0, 0xc42017d380)
/usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc42017d380, 0x0)
/usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420129040)
/usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0xc4200001a0)
/usr/local/go/src/runtime/asm_amd64.s:240 +0x5b
runtime stack:
runtime.throw(0x4ca671, 0x14)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc420018728)
/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).tryGet(0xc420018728, 0x5364d0)
/usr/local/go/src/runtime/mgcwork.go:144 +0xd9
runtime.gcDrainN(0xc420018728, 0x10000, 0xfffffffd)
/usr/local/go/src/runtime/mgcmark.go:1052 +0x145
runtime.gcAssistAlloc.func1()
/usr/local/go/src/runtime/mgcmark.go:462 +0xe5
runtime.systemstack(0xc420018a00)
/usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1079
runtime stack:
runtime.throw(0x4ca671, 0x14)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc42001b228)
/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc42001b228, 0xc420153860)
/usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc420153860, 0x0, 0x0, 0xc41fff563c, 0xc400000000, 0x80052b2a0, 0xc42001b228, 0xf)
/usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc420153860)
/usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc4200950a0, 0xc420153860)
/usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
/usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc4200ade70)
/usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc4200950a0, 0xc420153860)
/usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc4200950a0, 0xc420153860)
/usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc420153860, 0x0)
/usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420001380)
/usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0x0)
/usr/local/go/src/runtime/asm_amd64.s:240 +0x5b
runtime stack:
runtime.throw(0x4ca671, 0x14)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc42001c728)
/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc42001c728, 0xc42017d040)
/usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc42017d040, 0x0, 0x0, 0xc41fff417d, 0xc400000000, 0x800765860, 0xc42001c728, 0xa)
/usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc42017d040)
/usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc42002c8a0, 0xc42017d040)
/usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
/usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc420039e70)
/usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc42002c8a0, 0xc42017d040)
/usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc42002c8a0, 0xc42017d040)
/usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc42017d040, 0x0)
/usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420045040)
/usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0x0)
/usr/local/go/src/runtime/asm_amd64.s:240 +0x5b
```
@reezer how long did it take to reproduce for you with GOGC=5?
It varied quite a bit: sometimes minutes, but up to about two hours. It really does need the forking. What also speeds it up (very slightly) is using an absolute path for the binary; Go otherwise will iterate over all of the PATH directories, trying to execute true, so you have a few unnecessary syscalls. Of course you should also make sure that you really have GOMAXPROCS set to something > 1.
See also the timings that derekmarcotte posted. Mine have always been higher, but not that much.
It looks like this bug doesn't appear in virtual environments. I agree that there is quite a bit of evidence that it's a bug somewhere outside of Go (i.e. the kernel). Another hint here is that it apparently takes longest the first time you run this. However, take this statement with a grain of salt, because I haven't really sampled this enough yet.
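For reference, the PATH iteration mentioned above is the lookup that exec.Command performs for a relative name. A small illustrative sketch (not part of the thread) of what gets skipped when an absolute path is used:

```
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// For a relative name, exec.Command calls LookPath, which checks each
	// $PATH directory in turn for an executable "true"; an absolute path
	// such as /bin/true bypasses this search entirely.
	path, err := exec.LookPath("true")
	fmt.Println(path, err)
}
```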
Thanks, reproduced now :)
Here's a possibly interesting one:-
```
Starting a bunch of goroutines...
GOMAXPROCS: 24
fatal error: all goroutines are asleep - deadlock!
goroutine 1 [chan receive]:
main.main()
/data/go/src/github.com/multiplay/go/apps/test/main.go:29 +0x222
goroutine 3560607 [semacquire]:
syscall.ByteSliceFromString(0x4c8a13, 0x9, 0x41aaf7, 0x800574199, 0x800574120, 0x4d, 0x4d)
/usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0x4c8a13, 0x9, 0x10, 0x4b0f60, 0x4aedc0)
/usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.Open(0x4c8a13, 0x9, 0x100001, 0xc400000000, 0x40d8de, 0xc420326000, 0x8)
/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:820 +0x3f
os.OpenFile(0x4c8a13, 0x9, 0x1, 0x800000000, 0x0, 0x0, 0xc42029ee00)
/usr/local/go/src/os/file_unix.go:97 +0xa7
os/exec.(*Cmd).writerDescriptor(0xc420258000, 0x0, 0x0, 0x1, 0xc42026a4d0, 0xc420347e48)
/usr/local/go/src/os/exec/exec.go:233 +0x569
os/exec.(*Cmd).stdout(0xc420258000, 0x0, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:221 +0x3d
os/exec.(*Cmd).Start(0xc420258000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560699 [semacquire]:
os/exec.(*Cmd).Start(0xc42031a160, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:348 +0x7ac
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560559 [semacquire]:
os.newProcess(0xfc37, 0x0, 0xc4202ce450)
/usr/local/go/src/os/exec.go:23 +0x31
os.startProcess(0xc4202ce500, 0xd, 0xc4202ce450, 0x1, 0x1, 0xc42030cf08, 0xc4201843c0, 0x1d, 0x1d)
/usr/local/go/src/os/exec_posix.go:49 +0x2c0
os.StartProcess(0xc4202ce500, 0xd, 0xc4202ce450, 0x1, 0x1, 0xc42030cf08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420240000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560649 [semacquire]:
os/exec.(*Cmd).writerDescriptor(0xc4202c4000, 0x0, 0x0, 0x1, 0xc4201188f0, 0xc42003b648)
/usr/local/go/src/os/exec/exec.go:237 +0x665
os/exec.(*Cmd).stdout(0xc4202c4000, 0xc4202a6030, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:221 +0x3d
os/exec.(*Cmd).Start(0xc4202c4000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560772 [semacquire]:
syscall.ByteSliceFromString(0xc420012000, 0x2d, 0xc4201d0040, 0x11, 0x11, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc420012000, 0x2d, 0xc4201d0040, 0x0, 0x0)
/usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.SlicePtrFromStrings(0xc4202565a0, 0x1d, 0x1d, 0xc4202e4390, 0x2, 0x2, 0x0, 0x0)
/usr/local/go/src/syscall/exec_unix.go:87 +0x9a
syscall.forkExec(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1d60, 0xc420278028, 0x40e996, 0xc4202a1cc8)
/usr/local/go/src/syscall/exec_unix.go:158 +0x16e
syscall.StartProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1d60, 0x2, 0x4, 0x44ab30, 0xc42017a180)
/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1f08, 0xc4202565a0, 0x1d, 0x1d)
/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1f08, 0x0, 0x0, 0xd)
/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42024a000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560475 [semacquire]:
os.(*Process).wait(0xc42028c000, 0x0, 0xc42031a000, 0x0)
/usr/local/go/src/os/exec_unix.go:48 +0x17a
os.(*Process).Wait(0xc42028c000, 0xc420310f08, 0x0, 0x0)
/usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc42031a000, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:12 +0x76
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560701 [semacquire]:
syscall.ByteSliceFromString(0xc4203321e0, 0xa, 0x4bb0c0, 0xc420332101, 0x100000000000a, 0x38, 0xb0)
/usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc4203321e0, 0xa, 0x8005260d0, 0xc420159d40, 0xa)
/usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.Stat(0xc4203321e0, 0xa, 0xc420328038, 0xc4203321e0, 0xa)
/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1130 +0x3f
os.Stat(0xc4203321e0, 0xa, 0x2, 0x2, 0xc4203321e0, 0xa)
/usr/local/go/src/os/file_unix.go:169 +0x7d
os/exec.findExecutable(0xc4203321e0, 0xa, 0x2, 0xc4203321e0)
/usr/local/go/src/os/exec/lp_unix.go:20 +0x39
os/exec.LookPath(0x4c8239, 0x4, 0x4c8239, 0x4, 0x1, 0x0)
/usr/local/go/src/os/exec/lp_unix.go:53 +0x188
os/exec.Command(0x4c8239, 0x4, 0x0, 0x0, 0x0, 0xc420232420)
/usr/local/go/src/os/exec/exec.go:135 +0x1ff
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:10 +0x55
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
goroutine 3560650 [semacquire]:
os.NewFile(0x12, 0x4c8a13, 0x9, 0x0)
/usr/local/go/src/os/file_unix.go:56 +0x3f
os.OpenFile(0x4c8a13, 0x9, 0x1, 0xc400000000, 0xc420374d70, 0x40eb85, 0xc420254230)
/usr/local/go/src/os/file_unix.go:123 +0xff
os/exec.(*Cmd).writerDescriptor(0xc420178000, 0x0, 0x0, 0xc420254238, 0xc420268020, 0xc420268020)
/usr/local/go/src/os/exec/exec.go:233 +0x569
os/exec.(*Cmd).stderr(0xc420178000, 0xc420268020, 0x1, 0x1)
/usr/local/go/src/os/exec/exec.go:228 +0x4e
os/exec.(*Cmd).Start(0xc420178000, 0x4, 0x0)
/usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
/data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
4006.16 real 3508.55 user 29207.58 sys
```
Here's the selection of errors it triggered:
fatal error: all goroutines are asleep - deadlock!
fatal error: workbuf is empty
runtime: nelems=256 nfree=233 nalloc=23 previous allocCount=18 nfreed=65531
fatal error: sweep increased allocation count
runtime: failed MSpanList_Remove 0x800698500 0x800b46d40 0x53adb0 0x53ada0
fatal error: MSpanList_Remove
These are results from 1.7.3 on FreeBSD 11.0.
Had it hang on a run last night; here's a full trace with GOTRACEBACK=crash, so it includes the runtime routines. I also have the crash dump if it's of use:
https://gist.github.com/stevenh/4b6d5bde0b81503c1a4bca461a5322ae
Just wanted to give a quick status-update:
As has been mentioned, the fork appears to be critical. In testing, I've removed all the Pipes plumbing in the parent, and exited the child immediately after fork (using SYS_EXIT), for my /doesnotexist example. This reliably crashes the process with similar panics. With the fork removed, Everything Is Fine. This implies that the exec syscall is not implicated.
What is surprising to me is the nature of the call: it is surprising that a child might affect a parent process, but if the syscall itself is to blame, well, that is also surprising.
I can make this work available if anyone is interested.
I feel that my next step is to circumvent the regular runtime path that /doesnotexist takes, and write a second test using, say the RawSyscall interface alone.
I'd add that it also seems to need GC: with GOGC=100 it always takes many hours here, with GOGC=2-5 it's usually between 20 mins and 1 hour, and I've never had it happen with GOGC=off.
I'm going to leave a GOGC=off test running over the weekend to try and confirm that.
Perhaps the next step is to stop garbage collection as part of ForkLock? Ima play in here for a bit.
I thought the same so I'm running the forks with GC disabled using GOGC=off and then manually calling GC via debug.FreeOSMemory().
No results yet, but here's the code for reference.
This adds additional synchronisation due to the use of constant goroutines, which might also have an impact but we'll have to see. It will be an interesting data point either way.
package main
import (
"fmt"
"os/exec"
"runtime"
"runtime/debug"
"time"
)
var (
gcPeriod = time.Second * 10
forkRoutines = 16
)
func run(work, done chan struct{}) {
for range work {
cmd := exec.Command("/usr/bin/true")
cmd.Start()
cmd.Wait()
done <- struct{}{}
}
}
func main() {
fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
done := make(chan struct{}, forkRoutines)
work := make(chan struct{}, forkRoutines)
for i := 0; i < forkRoutines; i++ {
work <- struct{}{}
go run(work, done)
}
for {
start := time.Now()
forks := 0
active := forkRoutines
forking:
for range done {
forks++
if time.Since(start) > gcPeriod {
active--
if active == 0 {
break forking
}
} else {
work <- struct{}{}
}
}
fmt.Println("forks:", forks)
debug.FreeOSMemory()
for i := 0; i < forkRoutines; i++ {
work <- struct{}{}
}
}
}
Here's an alternative version which is as close to the known bad version as I think we can get without editing the runtime:
package main
import (
"fmt"
"os/exec"
"runtime"
"time"
)
var (
gcPeriod = time.Second * 10
forkRoutines = 16
)
func run(done chan struct{}) {
cmd := exec.Command("/usr/bin/true")
cmd.Start()
cmd.Wait()
done <- struct{}{}
}
func main() {
fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
done := make(chan struct{}, forkRoutines*2)
for i := 0; i < forkRoutines; i++ {
go run(done)
}
for {
start := time.Now()
active := forkRoutines
forking:
for range done {
if time.Since(start) > gcPeriod {
active--
if active == 0 {
break forking
}
} else {
go run(done)
}
}
runtime.GC()
for i := 0; i < forkRoutines; i++ {
go run(done)
}
}
}
I've been running these here for a few hours now with no errors; I usually see an error within 20m - 1 hour with GOGC=5.
This seems to indicate the issue may be centered around GC kicking in while there is a fork in progress.
Both sets of code have now been running for 12 hours with no errors, so it's looking more like this is indeed down to the GC interacting with active forks, causing corruption.
@stevenh,
and then manually calling GC via debug.FreeOSMemory()
That doesn't do what you think it does. That's not how you manually call a GC. That would be runtime.GC(), but calling runtime.GC() can be super slow. (It does a full GC while blocking the process.)
Just disable & reenable with:
old := debug.SetGCPercent(0) // now disabled.
// forklock..fork..exec...unlock
debug.SetGCPercent(old) // now GC is re-enabled.
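For what it's worth, here is a minimal sketch of wrapping a single fork/exec that way, with /usr/bin/true as in the tests above. Note the runtime/debug docs say a negative percentage is what disables collection outright, so -1 is used here; treat this as illustrative, not as the form that was actually run in this thread:

```go
package main

import (
	"os/exec"
	"runtime/debug"
)

// runOnce disables the collector around one fork/exec/wait and then
// restores the previous GOGC setting.
func runOnce() error {
	old := debug.SetGCPercent(-1) // negative value disables GC
	defer debug.SetGCPercent(old) // restore; may itself trigger a collection

	cmd := exec.Command("/usr/bin/true")
	if err := cmd.Start(); err != nil {
		return err
	}
	return cmd.Wait()
}

func main() {
	for {
		if err := runOnce(); err != nil {
			panic(err)
		}
	}
}
```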
I did actually check that and saw debug.FreeOSMemory() is freeOSMemory which in turn is:
func runtime_debug_freeOSMemory() {
gcStart(gcForceBlockMode, false)
systemstack(func() { mheap_.scavenge(-1, ^uint64(0), 0) })
}
Whereas runtime.GC() is:
```go
func GC() {
	gcStart(gcForceBlockMode, false)
}
```
So it looks like FreeOSMemory does do a full GC plus an attempt to get the OS to clean up. This is reinforced by the docs for FreeOSMemory:
FreeOSMemory forces a garbage collection followed by an attempt to return as much memory to the operating system as possible. (Even if this is not called, the runtime gradually returns memory to the operating system in a background task.)
I've also confirmed that memory usage is maintained which is not the case with just GOGC=off.
Am I really mistaken?
I could use
```go
old := debug.SetGCPercent(0)
...
// Force GC
debug.SetGCPercent(old)
debug.SetGCPercent(0)
```
It's also worth noting that SetGCPercent directly calls runtime.GC(), so it would still result in a stop-the-world GC?
Also worth noting @bradfitz that while the first used debug.FreeOSMemory(), the second uses runtime.GC(), as I was interested in the difference.
I guess I missed or don't understand why you want to force a GC. Or why you'd want to return memory to the operating system.
I thought the whole point was to stop the GC so it's not running during a fork. That's much lighter weight than running a GC.
The idea of the tests is to confirm that we can fork without error if the GC was disabled.
As the original test used multiple routines doing the forking, to force the issue quicker, I wanted to maintain that. Given this, it seemed easier to just do a stop-the-world GC than to try and coordinate ensuring that GC was stopped across all the routines as they fork.
The first version used debug.FreeOSMemory() in case freeing the memory to the OS was a contributing factor to the issue.
The second version I switched to runtime.GC() to see if removing that factor and allowing background free made any difference.
I did consider using debug.SetGCPercent(X), but after looking at it and noticing it actually calls runtime.GC() anyway, I concluded it would be almost identical to calling runtime.GC() while all the forking routines were blocked, which, as mentioned previously, was easier to coordinate.
Both of the test versions have now been running for 24 hours, so it seems highly likely that GC while forking is the cause of this issue.
I'll leave them running for now, but I need a way to confirm for sure and also to narrow down the exact piece of code which is causing the issue.
Any ideas gratefully received.
Given that we believe that this only occurs on FreeBSD, and only occurs when running on real hardware, I would guess that the problem is not the GC as such, but rather the fact that another thread in the parent process is doing a lot of reads and writes to memory pages that are also being read and written by the child process.
Hmm, why do you say it only occurs on real hardware @ianlancetaylor?
Reading back through the thread I see @kaey mention that telegraph crashes for him on vmware too, so I'm interested to know if there's another piece of the puzzle there?
I'm wondering if PCID may be at play here...
If it's PCID related, does "vm.pmap.pcid_enabled=0" help?
That's exactly what I just tried; unfortunately not :(
@stevenh I said "real hardware" based on @reezer 's comment in https://github.com/golang/go/issues/15658#issuecomment-264165583. My apologies if I misunderstood.
In an attempt to produce a minimal version I've tried creating a C version but currently no joy.
Capturing here for reference.
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/resource.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
#include <errno.h>
#include <pthread.h>
#define MEMSIZE 10240
#define RANDMAX 4
#define THREADS 16
#define handle_error_en(en, msg) \
do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
struct thread_info {
pthread_t thread_id;
int thread_num;
};
void
check(int val, int *i, int *mem)
{
int j;
if (*i != val) {
printf("i: %d\n", *i);
exit(EXIT_FAILURE);
}
for (j = 0; j < MEMSIZE; j++) {
if (mem[j] != val) {
printf("mem: %d = %d\n", j, mem[j]);
exit(EXIT_FAILURE);
}
}
}
void
randomise(int val, int *i, int *mem)
{
struct timespec ts;
ts.tv_sec = 0;
srandom((unsigned long)val);
for (; val < RANDMAX; val+=2) {
*i = val;
for (int j = 0; j < MEMSIZE; j++) {
mem[j] = val;
}
check(val, i, mem);
}
}
void
dofork()
{
pid_t pid;
int i;
int mem[MEMSIZE];
i = 0;
for (int j = 0; j < MEMSIZE; j++) {
mem[j] = 0;
}
pid = fork();
switch (pid) {
case -1:
handle_error("fork");
case 0: {
// child
randomise(1, &i, mem);
exit(execv("/usr/bin/true", NULL));
}
default: {
int status;
struct rusage ru;
// parent
randomise(2, &i, mem);
// wait6 followed by wait4 as that's what golang does
wait6(P_PID, pid, 0, WNOWAIT|WEXITED, 0, 0);
wait4(pid, &status, 0, &ru);
}}
}
static void *
thread_start(void *arg)
{
struct thread_info *tinfo = arg;
char info[1024], *ret;
sprintf(info, "thread %d\n", tinfo->thread_num);
ret = strdup(info);
if (ret == NULL) {
handle_error("strdup");
}
while (1==1) {
dofork();
}
return ret;
}
int
main(int argc, char**argv)
{
int s, tnum;
struct thread_info *tinfo;
pthread_attr_t attr;
void *res;
s = pthread_attr_init(&attr);
if (s != 0) {
handle_error_en(s, "pthread_attr_init");
}
tinfo = calloc(THREADS, sizeof(struct thread_info));
if (tinfo == NULL) {
handle_error("calloc");
}
for (tnum = 0; tnum < THREADS; tnum++) {
tinfo[tnum].thread_num = tnum + 1;
s = pthread_create(&tinfo[tnum].thread_id, &attr, &thread_start, &tinfo[tnum]);
if (s != 0) {
handle_error_en(s, "pthread_create");
}
}
s = pthread_attr_destroy(&attr);
if (s != 0) {
handle_error_en(s, "pthread_attr_destroy");
}
for (tnum = 0; tnum < THREADS; tnum++) {
s = pthread_join(tinfo[tnum].thread_id, &res);
if (s != 0) {
handle_error_en(s, "pthread_join");
}
printf("joined with thread %d; returned value was %s\n", tinfo[tnum].thread_num, (char *)res);
free(res);
}
free(tinfo);
exit(EXIT_SUCCESS);
}
I've been looking through the FreeBSD specific code in this area and I noticed that newosproc uses raw sigprocmask.
If it's possible that newosproc can be called from multiple threads at the same time, then it's possible that signals which are meant to be blocked during thr_new could be re-enabled unexpectedly, causing undefined behaviour.
This is why man SIGPROCMASK(2) states:
In threaded applications, pthread_sigmask(3) must be used instead of sigprocmask().
This may be unrelated but...
I looked at the FreeBSD library sources, and it looks like pthread_sigmask just calls sigprocmask after updating some user space information. There is no separate pthread_sigmask system call. So I think the Go code is doing the right thing in making the sigprocmask system call.
libthr maintains internal state to ensure that signals don't get unblocked when they shouldn't (it keeps a blocked counter and only unblocks when the counter drops to zero). This seems a bit quirky to me at first glance, so I'd have to spend more time looking to see if there are other things it does to ensure the correct handling of signals it provides to threaded apps.
It struck me that, as Go can run different goroutines on a given kernel thread, then when combined with the direct use of sigprocmask it may be possible for a signal blocked by one to be unintentionally unblocked by another without additional protection in the runtime.
I'm not saying this is possible or that this could be the issue, but just highlighting something I noticed while investigating.
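As a sketch (not something from the thread) of one way to probe that from Go, based on the run() helper in the second test program above: pin each forking goroutine to its OS thread, so the fork, the wait, and any per-thread signal-mask changes cannot migrate between kernel threads, and see whether the crash profile changes.

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
)

const forkRoutines = 16

// Same shape as run() above, but pinned to a single OS thread so the fork,
// the wait, and any per-thread signal-mask manipulation stay put.
func runLocked(done chan struct{}) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	cmd := exec.Command("/usr/bin/true")
	if err := cmd.Start(); err == nil {
		cmd.Wait()
	}
	done <- struct{}{}
}

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	done := make(chan struct{}, forkRoutines)
	for i := 0; i < forkRoutines; i++ {
		go runLocked(done)
	}
	for range done {
		go runLocked(done)
	}
}
```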
Something else that I've just noticed is that when we get the failed MSpanList_Remove panic, it always seems to be due to span.list != list, with list off by 10 every time, which doesn't seem random.
Does this ring any bells?
Some progress: I was able to add a temporary method to the runtime to test only the fork portion of the program. This work is available here:
https://github.com/derekmarcotte/go/commit/9a3667bc261085793f7386064699eb3f020d1a4b
Additionally, a test case that exercises that method is available here:
https://raw.githubusercontent.com/derekmarcotte/goruntime-testing/master/src/15658-forkonly.go
This test case has wedged once (with 3 threads at 100%, and the rest idle), and crashed once, out of two attempts:
acquirep: p->m=842351068160(8) p->status=1
fatal error: acquirep: invalid p state
This is definitely some interaction with the fork system call. I'll next disable garbage collection and see where it goes. Will post results.
EDIT: Whoop, thought I based my branch off of master. Have rebased and will be testing with GOGC=5, and then GOGC=off. New links to come in a few days.
Just wanted to add, using the 15658-forkonly.go merged against master, of three tests:
I'll be posting updates to my branch later today.
Interesting to note: all of the panics now seem to be in acquirep.
Interesting, do you have the full trace from the GOGC=off case?
I don't suppose you have the output from procstat -k -k do you?
Given how simple that is now, it really smells like this could be either signals or something more fundamental like _umtx_op interaction.
@stevenh:
Trace: https://github.com/derekmarcotte/goruntime-testing/blob/master/logs/15658-forkonly-log-head-2
No procstat, but I can try again.
Here's the test function integrated with head: https://github.com/derekmarcotte/go/commit/1deb3ec41080ce69dc4e980716fcf07db34825ac
Been running your new code with ktrace since Wed Dec 14 10:06:02 UTC 2016; no crashes yet :(
Still no crashes with your test case under ktrace after 7 days @derekmarcotte :(
I'll leave it till tomorrow, then try to see if I can trigger it without ktrace.
Still unable to reproduce the issue after 8 days running with GOGC=OFF and no ktrace.
Are you sure that you managed to reproduce the issue with GOGC=OFF @derekmarcotte?
@stevenh: FWIW, I have not been able to reproduce the crashes with GOGC=OFF, on an Intel Xeon (with ECC RAM), only on my AMD FX-8350. I'm going to (start to) verify GOGC=5 on the Xeon today. If you have an AMD (+FreeBSD 10.3 amd64) machine at your disposal, that would be interesting.
I'm afraid I don't have access to any AMD HW. I'm currently running with GOGC=2 + ktrace on an E5-2640 since the 1st; no crashes yet, but it does seem to take quite a while with ktrace enabled.
Here's a GOGC=5, wedged, with one thread at 100% and the rest at 0%:
procstat -k -k 62812
PID TID COMM TDNAME KSTACK
62812 100817 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_timedwait_sig+0x10 _sleep+0x238 umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 100928 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 100937 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 101090 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 102891 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 102896 15658-forkonly-+5b6 15658-forkonly-+
62812 102897 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 102898 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 102899 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_timedwait_sig+0x10 _sleep+0x238 umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 102900 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
62812 103024 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb
I'm guessing they are blocked on the ForkLock, but I don't know how to interpret this output. I ran ktrace for a bit (~30s), and got the following (first number is counts):
1716125 CALL clock_gettime(0x4,0xc420063e88)
44278 CALL clock_gettime(0x4,0xc4201adee0)
2 CALL clock_gettime(0x4,0xc420039eb0)
1 CALL clock_gettime(0x4,0xc420039f10)
1 CALL clock_gettime(0,0xc420039ef0)
1760407 RET clock_gettime 0
858063 CALL sched_yield
858062 RET sched_yield 0
22139 CALL _umtx_op(0x4f62c8,UMTX_OP_WAIT_UINT_PRIVATE,0,0x18,0xc4201ade78)
1 CALL _umtx_op(0x4f62d8,UMTX_OP_WAIT_UINT_PRIVATE,0,0x18,0xc420039e48)
22140 RET _umtx_op -1 errno 60 Operation timed out
1 CALL nanosleep(0xc420039f18,0)
1 RET nanosleep 0
Interesting to note: I'm running ntp on this machine. I think there was another issue (https://github.com/golang/go/issues/17168) that depended on a MONOTONIC clock on FreeBSD. Are you running NTP?
For those that cannot reproduce, what if you run NTP?
Interesting find, I am running NTP here.
It doesn't seem like this could be the cause of the corruption given the description but I may be wrong.
I've been unable to get anything other than the single thread stuck @ 100% CPU while exercising https://github.com/derekmarcotte/go/commit/1deb3ec41080ce69dc4e980716fcf07db34825ac
I'm going to try running with 1.7.4 + https://github.com/golang/go/commit/360f2e43b78a3ea119ea8dce9649f7c1227d793b to see if this behaviour changes.
Scratch that as https://github.com/golang/go/commit/360f2e43b78a3ea119ea8dce9649f7c1227d793b was already included in 1.7 by https://github.com/golang/go/commit/a2f37b7fe130e1a5da78b605f4191547e9868944
Testing on my machines right now, got this (already mentioned in this thread):
runtime: failed MSpanList_Remove 0x800528280 0x800727480 0x53ab10 0x53ab00
fatal error: MSpanList_Remove
on the original true test fairly quickly (GOTRACEBACK=crash GOMAXPROCS=10 GOGC=5, FreeBSD/amd64 11.0-RELEASE-p6, go1.7.4, Intel Core i5 Skylake @ 4.5GHz, VirtualBox VM)
UPDATE on my laptop (same variables; FreeBSD/amd64 12-CURRENT, go1.7.4, Intel Core i3 Haswell @ 1.8 GHz):
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
@stevenh : Testing on a Xeon w/ ECC, and GOGC=2, I haven't yet panic'd. I have seen all threads idle a few times though, and then I just break and restart. Challenge is that this machine is FreeBSD 10.1.
I'd like to use the same usb stick/test binary/environment on both hardware platforms, and see what the observed differences are. It's hard to dedicate hardware resources like this for any period of time though. At least then, it should be easily repeatable.
@myfreeweb: are you interested/able to run the fork-only tests, using:
This runtime:
This test application:
This command line:
GOMAXPROCS=8 GOGC=5 ./15658-forkonly
Thanks!
Using a fresh FreeBSD 10.3-RELEASE memstick-mini "live cd" mode (from the FTP release site), and the above, I got an acquirep panic, first try, on identical hardware to this machine.
I'm going to use the GOGC=off and see if I still panic next.
At this point though, this means that I'm able to consistently repeat it on:
FreeBSD 10.3-RELEASE amd64 (AMD+non-ecc, Xeon+ecc), with a fresh memstick-mini image from FreeBSD
To see what happens when we eliminate new goroutines and channels, I've been testing with the following:
package main
import (
"log"
"net/http"
_ "net/http/pprof"
"runtime"
"syscall"
)
var (
forkRoutines = 16
)
func run() {
for {
if err := syscall.ForkOnlyBSDTest(); err != nil {
log.Fatal(err)
}
}
}
func main() {
log.Printf("Starting %v forking goroutines...\n", forkRoutines)
log.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
for i := 0; i < forkRoutines; i++ {
go run()
}
log.Println(http.ListenAndServe(":8080", nil))
}
After 4 days of running it wedged using very little CPU.
ktrace showed it was waiting periodically but not much else. The /debug/pprof/ was unresponsive and there was one goroutine waiting to be run in Wait4:
goroutine 71 [runnable]:
syscall.Wait4(0x13e31, 0xc42008d73c, 0x0, 0x0, 0x13e31, 0x0, 0x0)
/usr/local/go/src/syscall/syscall_bsd.go:127 fp=0xc42008d700 sp=0xc42008d6f8
syscall.ForkOnlyBSDTest(0x0, 0x0)
/usr/local/go/src/syscall/exec_bsd_debug.go:35 +0xf1 fp=0xc42008d780 sp=0xc42008d700
main.run()
/data/go/src/github.com/multiplay/go/apps/test/main.go:17 +0x2e fp=0xc42008d7c0 sp=0xc42008d780
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42008d7c8 sp=0xc42008d7c0
created by main.main
/data/go/src/github.com/multiplay/go/apps/test/main.go:28 +0x1c4
It seems like the runtime scheduler is failing to run either the one goroutine which holds the fork lock (above) or the pprof handler, so it is somehow wedged.
This may be irrelevant and just another symptom of memory corruption but I wanted to share as I've not seen it before.
I have the full goroutine trace after -ABRT and the gdb threads trace before that in case it might shed some additional light.
I'm seeing a similar memory corruption issue in my app, which I was able to narrow down slightly and reproduce as follows. My repro case is not as basic as the one already being used here, but I figured I'd document it anyway:
package main
import (
"log"
"os"
"syscall"
"time"
"github.com/fancybits/gopsutil/cpu"
)
func doit() {
time.Sleep(50 * time.Millisecond)
log.Printf("running exec...")
syscall.Exec(os.Args[0], os.Args, os.Environ())
log.Printf("exec failed!")
}
func main() {
log.SetFlags(log.LstdFlags)
log.Printf("starting...")
cpu.Info()
go doit()
handler := make(chan os.Signal, 1)
for range handler {
break
}
}
I'm building this with golang 1.8 (cd6b6202dd1559b3ac63179b45f1833fcfbe7eca) on amd64 freebsd in a virtualbox VM running FreeNAS 9.10 (which uses the FreeBSD 10.3 kernel). The vagrant box I used is https://atlas.hashicorp.com/drajen/boxes/freenas9
Running the test program reliably crashes within a few seconds, varying between SIGBUS, SIGSEGV, SIGILL, etc:
vagrant@freenas:~ % uname -a
FreeBSD freenas.local 10.3-STABLE FreeBSD 10.3-STABLE #0 455d13d(9.10-STABLE): Sun Jun 26 22:47:03 PDT 2016 [email protected]:/tank/home/nightlies/build-freenas9/_BE/objs/tank/home/nightlies/build-freenas9/_BE/trueos/sys/FreeNAS.amd64 amd64
vagrant@freenas:~ % ./spawntest
2017/03/07 19:56:21 starting...
2017/03/07 19:56:21 running exec...
2017/03/07 19:56:21 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
2017/03/07 19:56:22 starting...
2017/03/07 19:56:22 running exec...
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
panic during panic
[signal SIGBUS: bus error code=0x3 addr=0x451730 pc=0x451730]
runtime stack:
runtime.startpanic_m()
/home/vagrant/go/src/runtime/panic.go:653 +0x19a
runtime.systemstack(0x525fc8)
/home/vagrant/go/src/runtime/asm_amd64.s:343 +0xab
runtime.startpanic()
/home/vagrant/go/src/runtime/panic.go:569 +0x1e
runtime.throw(0x5248de, 0x2a)
/home/vagrant/go/src/runtime/panic.go:595 +0x88
runtime.sigpanic()
/home/vagrant/go/src/runtime/signal_unix.go:274 +0x2db
runtime.usleep(0x333520322e322e30, 0x302e303120343430, 0x32322035312e322e, 0x6e69622f3a6e6962, 0x6f6c2f7273752f3a, 0x3a6e69622f6c6163, 0x636f6c2f7273752f, 0x3a6e6962732f6c61, 0x6e69622f7273752f, 0x2f3a6e6962732f3a, ...)
/home/vagrant/go/src/runtime/sys_freebsd_amd64.s:276 +0x40
goroutine 1 [runnable, locked to thread]:
syscall.Syscall6(0x214, 0x0, 0x6018, 0x0, 0x18, 0x0, 0x0, 0x6018, 0x0, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:42 +0x5
os.(*Process).blockUntilWaitable(0xc420070090, 0xc42009e000, 0xc420092140, 0x3)
/home/vagrant/go/src/os/wait_wait6.go:29 +0x6b
os.(*Process).wait(0xc420070090, 0x40e042, 0xc420094090, 0xc420070030)
/home/vagrant/go/src/os/exec_unix.go:22 +0x4d
os.(*Process).Wait(0xc420070090, 0x0, 0x0, 0x525c08)
/home/vagrant/go/src/os/exec.go:115 +0x2b
os/exec.(*Cmd).Wait(0xc42009e000, 0x0, 0x0)
/home/vagrant/go/src/os/exec/exec.go:435 +0x62
os/exec.(*Cmd).Run(0xc42009e000, 0xc420096050, 0x2)
/home/vagrant/go/src/os/exec/exec.go:280 +0x5c
os/exec.(*Cmd).Output(0xc42009e000, 0xc, 0xc420035db8, 0x2, 0x2, 0xc42009e000)
/home/vagrant/go/src/os/exec/exec.go:474 +0x11c
github.com/fancybits/gopsutil/internal/common.DoSysctrl(0x51f5cd, 0xd, 0x0, 0x0, 0x1, 0xc420098000, 0x0)
/home/vagrant/.go/src/github.com/fancybits/gopsutil/internal/common/common_freebsd.go:22 +0x120
github.com/fancybits/gopsutil/cpu.Times(0x5ad601, 0xc420098000, 0x1, 0x1, 0x0, 0x0)
/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu_freebsd.go:53 +0x6a
github.com/fancybits/gopsutil/cpu.init.1()
/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu.go:59 +0xb1
github.com/fancybits/gopsutil/cpu.init()
/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu_freebsd.go:155 +0x76
main.init()
/home/vagrant/spawntest/main.go:31 +0x58
goroutine 17 [runnable]:
syscall.Syscall(0x3, 0x4, 0xc4200a8000, 0x200, 0x4, 0x200, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5
syscall.read(0x4, 0xc4200a8000, 0x200, 0x200, 0xc420024800, 0x800598000, 0x0)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:893 +0x55
syscall.Read(0x4, 0xc4200a8000, 0x200, 0x200, 0x42561f, 0x5259f0, 0xc420021dc0)
/home/vagrant/go/src/syscall/syscall_unix.go:162 +0x49
os.(*File).read(0xc4200a0010, 0xc4200a8000, 0x200, 0x200, 0x4a4feb, 0x4f45e0, 0x200)
/home/vagrant/go/src/os/file_unix.go:165 +0x6a
os.(*File).Read(0xc4200a0010, 0xc4200a8000, 0x200, 0x200, 0x42e5be, 0xc420001a00, 0xc420021e60)
/home/vagrant/go/src/os/file.go:101 +0x76
bytes.(*Buffer).ReadFrom(0xc420098070, 0x5994a0, 0xc4200a0010, 0x800638028, 0xc420098070, 0xc420021f01)
/home/vagrant/go/src/bytes/buffer.go:179 +0x160
io.copyBuffer(0x5992a0, 0xc420098070, 0x5994a0, 0xc4200a0010, 0x0, 0x0, 0x0, 0xc420012178, 0x0, 0x0)
/home/vagrant/go/src/io/io.go:384 +0x2cb
io.Copy(0x5992a0, 0xc420098070, 0x5994a0, 0xc4200a0010, 0x4f3660, 0xc420012120, 0xc420021fb0)
/home/vagrant/go/src/io/io.go:360 +0x68
os/exec.(*Cmd).writerDescriptor.func1(0x4f3660, 0xc420012120)
/home/vagrant/go/src/os/exec/exec.go:254 +0x4d
os/exec.(*Cmd).Start.func1(0xc42009e000, 0xc4200920c0)
/home/vagrant/go/src/os/exec/exec.go:371 +0x27
created by os/exec.(*Cmd).Start
/home/vagrant/go/src/os/exec/exec.go:372 +0x4e4
goroutine 18 [runnable]:
syscall.Syscall(0x3, 0x6, 0xc4200ac000, 0x8000, 0x0, 0x8000, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5
syscall.read(0x6, 0xc4200ac000, 0x8000, 0x8000, 0x80059dc80, 0x1, 0x101000000000000)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:893 +0x55
syscall.Read(0x6, 0xc4200ac000, 0x8000, 0x8000, 0x8005984b0, 0x0, 0x0)
/home/vagrant/go/src/syscall/syscall_unix.go:162 +0x49
os.(*File).read(0xc4200a0028, 0xc4200ac000, 0x8000, 0x8000, 0x4f45e0, 0xc42001c601, 0xc4200ac000)
/home/vagrant/go/src/os/file_unix.go:165 +0x6a
os.(*File).Read(0xc4200a0028, 0xc4200ac000, 0x8000, 0x8000, 0x8000, 0x8000, 0x0)
/home/vagrant/go/src/os/file.go:101 +0x76
io.copyBuffer(0x599620, 0xc420096050, 0x5994a0, 0xc4200a0028, 0xc4200ac000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
/home/vagrant/go/src/io/io.go:390 +0x100
io.Copy(0x599620, 0xc420096050, 0x5994a0, 0xc4200a0028, 0x0, 0x0, 0x0)
/home/vagrant/go/src/io/io.go:360 +0x68
os/exec.(*Cmd).writerDescriptor.func1(0x0, 0x0)
/home/vagrant/go/src/os/exec/exec.go:254 +0x4d
os/exec.(*Cmd).Start.func1(0xc42009e000, 0xc420092140)
/home/vagrant/go/src/os/exec/exec.go:371 +0x27
created by os/exec.(*Cmd).Start
/home/vagrant/go/src/os/exec/exec.go:372 +0x4e4
You're actually hitting the fork issue using gopsutil, as it uses the sysctl command instead of just the syscall, which is silly; so for your case switching to raw syscalls would likely fix your issue.
Well, hold on. The call to cpu.Info() does fork/exec 3 times, but cpu.Info() isn't being called from the goroutine doit(); it's only being used before the goroutine is spawned.
@tmm1 does removing the call to cpu.Info() before doit change the abort behavior in your environment? My suspicion is it will, and the program will take a while to crash.
What @tmm1 is reporting is interesting to me because his test case is reproducible within seconds versus days. Without digging in, what stands out to me is this test case has two concurrent fork calls happening at the start of the program and it's calling a real program.
I'm not sure how this is different than @stevenh's snippet, but there's probably something in syscall.ForkOnlyBSDTest() that's different than the fork/exec call to /sbin/sysctl.
The trace indicates the issue is in the init() of https://github.com/fancybits/gopsutil/blob/arm-mhz-fix/cpu/cpu.go#L59
The frequency of the panic is interesting.
@tmm1 could you post your spawntest script please so we can try to repro.
Ignore the above, I've reproduced it and it is very quick, so definitely very interesting.
Your actually hitting the fork issue using gopsutil as it uses the cmd sysctl instead of just the syscall, which is silly, so for your case switching to use raw syscalls would likely fix your issue.
Indeed, I'm not sure why it's using fork/exec instead of the syscall. Switching to the syscall should be an effective workaround.
@tmm1 does removing the call to cpu.Info() before doit change the abort behavior in your environment? My suspicion it will and the program will take a while to crash.
Removing the cpu.Info() makes the program take longer to crash, to a few minutes instead of seconds.
The trace indicates the issue is in the init() of https://github.com/fancybits/gopsutil/blob/arm-mhz-fix/cpu/cpu.go#L59
Indeed, it almost always happens in cpu.init()
@tmm1 could post your spawntest script please so we can try to repo.
spawntest is what I called the compiled golang binary given the source posted above.
I also tried this morning to run the same binary on a fresh freebsd VM, using the freebsd/FreeBSD-10.3-RELEASE vagrant box. I cannot seem to reproduce the crash there (at least not in the same short amount of time that I tried). The kernel there is:
FreeBSD fbsd 10.3-RELEASE FreeBSD 10.3-RELEASE #0 r297264: Fri Mar 25 02:10:02 UTC 2016 [email protected]:/usr/obj/usr/src/sys/GENERIC amd64
However on the FreeNAS 9.10 VM (running on the same macOS host) crashes readily:
FreeBSD freenas.local 10.3-STABLE FreeBSD 10.3-STABLE #0 455d13d(9.10-STABLE): Sun Jun 26 22:47:03 PDT 2016 [email protected]:/tank/home/nightlies/build-freenas9/_BE/objs/tank/home/nightlies/build-freenas9/_BE/trueos/sys/FreeNAS.amd64 amd64
My golang app which I originally saw this memory corruption bug in is designed to be run on FreeNAS appliances, and at least half a dozen users have reported the same crash on various pieces of hardware all running FreeNAS 9.10. (The golang program auto-updates itself by downloading a new version, then syscall.Exec to replace itself. The crash always happens during cpu.init() after the new process takes over. Here's a gdb trace from a production crash, which usually exhibits as a SIGILL: https://gist.github.com/tmm1/86cd869d1bfdae66e4058698dd30aee9)
Anyone who can make it crash reliably, please try changing go/src/os/wait_wait6.go's func (p *Process) blockUntilWaitable() (bool, error) to have as its very first line:
return false, nil
That will fall back to not using the wait6 system call at all. If the crash still happens, then we know that wait6 is not the problem. If the crash stops, then we have somewhere to look. Thanks.
(And then of course 'go install os')
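For anyone else applying that change, this is roughly what the edited function ends up looking like; it is only a sketch of the suggested experiment, not a fix. On FreeBSD, blockUntilWaitable normally issues wait6 with WEXITED|WNOWAIT so the child can still be reaped by the later wait4; returning early disables that path.

```go
// go/src/os/wait_wait6.go (experiment only)
func (p *Process) blockUntilWaitable() (bool, error) {
	// Skip the wait6(WEXITED|WNOWAIT) call entirely so Process.wait falls
	// straight through to wait4; if crashes persist, wait6 is not to blame.
	return false, nil
	// ... the original wait6-based body below becomes unreachable ...
}
```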
Anyone who can make it crash reliably, please try changing go/src/os/wait_wait6.go's func (p *Process) blockUntilWaitable() (bool, error) to have as its very first line:
Made this change and my repro still crashed quickly:
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
panic during panic
[signal SIGBUS: bus error code=0x3 addr=0x451730 pc=0x451730]
runtime stack:
runtime.startpanic_m()
/home/vagrant/go/src/runtime/panic.go:653 +0x19a
runtime.systemstack(0x525fc0)
/home/vagrant/go/src/runtime/asm_amd64.s:343 +0xab
runtime.startpanic()
/home/vagrant/go/src/runtime/panic.go:569 +0x1e
runtime.throw(0x5248d9, 0x2a)
/home/vagrant/go/src/runtime/panic.go:595 +0x88
runtime.sigpanic()
/home/vagrant/go/src/runtime/signal_unix.go:274 +0x2db
runtime.usleep(0x6e69622f7273752f, 0x2f3a6e6962732f3a, 0x6e6962732f727375, 0xc420070750, 0x2d, 0xc4200707b0, 0x2d, 0xc420070810, 0x21, 0xc420094820, ...)
/home/vagrant/go/src/runtime/sys_freebsd_amd64.s:276 +0x40
goroutine 1 [runnable]:
syscall.Syscall6(0x7, 0x13612, 0xc420035a7c, 0x0, 0xc420074090, 0x0, 0x0, 0x13612, 0x0, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:42 +0x5
syscall.wait4(0x13612, 0xc420035a7c, 0x0, 0xc420074090, 0x90, 0x516b00, 0x4bc401)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:34 +0x88
syscall.Wait4(0x13612, 0xc420035acc, 0x0, 0xc420074090, 0xc420035b80, 0xc420014360, 0x0)
/home/vagrant/go/src/syscall/syscall_bsd.go:129 +0x51
os.(*Process).wait(0xc420014360, 0x40e042, 0xc42000e660, 0xc420014300)
/home/vagrant/go/src/os/exec_unix.go:38 +0xbb
os.(*Process).Wait(0xc420014360, 0x0, 0x0, 0x525c00)
/home/vagrant/go/src/os/exec.go:115 +0x2b
os/exec.(*Cmd).Wait(0xc4200686e0, 0x0, 0x0)
/home/vagrant/go/src/os/exec/exec.go:435 +0x62
os/exec.(*Cmd).Run(0xc4200686e0, 0xc42004a0f0, 0x2)
/home/vagrant/go/src/os/exec/exec.go:280 +0x5c
os/exec.(*Cmd).Output(0xc4200686e0, 0xc, 0xc420035cb8, 0x2, 0x2, 0xc4200686e0)
/home/vagrant/go/src/os/exec/exec.go:474 +0x11c
github.com/fancybits/gopsutil/internal/common.DoSysctrl(0x51e5bc, 0x7, 0x40, 0x40a1ec0000000000, 0x0, 0x0, 0x0)
/home/vagrant/.go/src/github.com/fancybits/gopsutil/internal/common/common_freebsd.go:22 +0x120
github.com/fancybits/gopsutil/cpu.Info(0x51f162, 0xb, 0x0, 0x0, 0x0)
/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu_freebsd.go:118 +0x1e4
main.main()
/home/vagrant/spawntest/main.go:28 +0x67
goroutine 8 [runnable]:
syscall.Syscall(0x3, 0x6, 0xc4200da000, 0x8000, 0x0, 0x8000, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5
syscall.read(0x6, 0xc4200da000, 0x8000, 0x8000, 0x80059e3a0, 0x402001, 0x10100c42001de08)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:893 +0x55
syscall.Read(0x6, 0xc4200da000, 0x8000, 0x8000, 0x800598000, 0x0, 0xc42001de28)
/home/vagrant/go/src/syscall/syscall_unix.go:162 +0x49
os.(*File).read(0xc42000c088, 0xc4200da000, 0x8000, 0x8000, 0x4f45e0, 0xc42001de01, 0xc4200da000)
/home/vagrant/go/src/os/file_unix.go:165 +0x6a
os.(*File).Read(0xc42000c088, 0xc4200da000, 0x8000, 0x8000, 0x8000, 0x8000, 0x403fc0)
/home/vagrant/go/src/os/file.go:101 +0x76
io.copyBuffer(0x599620, 0xc42004a0f0, 0x5994a0, 0xc42000c088, 0xc4200da000, 0x8000, 0x8000, 0xc4200a44d8, 0x0, 0x0)
/home/vagrant/go/src/io/io.go:390 +0x100
io.Copy(0x599620, 0xc42004a0f0, 0x5994a0, 0xc42000c088, 0x4f3660, 0xc4200a4480, 0xc42001dfb0)
/home/vagrant/go/src/io/io.go:360 +0x68
os/exec.(*Cmd).writerDescriptor.func1(0x4f3660, 0xc4200a4480)
/home/vagrant/go/src/os/exec/exec.go:254 +0x4d
os/exec.(*Cmd).Start.func1(0xc4200686e0, 0xc42000a600)
/home/vagrant/go/src/os/exec/exec.go:371 +0x27
created by os/exec.(*Cmd).Start
/home/vagrant/go/src/os/exec/exec.go:372 +0x4e4
goroutine 7 [runnable]:
syscall.Syscall(0x3, 0x4, 0xc420072400, 0x200, 0x1, 0x200, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5
syscall.read(0x4, 0xc420072400, 0x200, 0x200, 0xc420024400, 0x8005984b0, 0x0)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:893 +0x55
syscall.Read(0x4, 0xc420072400, 0x200, 0x200, 0x42561f, 0x5259e8, 0xc42001e5c0)
/home/vagrant/go/src/syscall/syscall_unix.go:162 +0x49
os.(*File).read(0xc42000c070, 0xc420072400, 0x200, 0x200, 0x4a4e2b, 0x4f45e0, 0x200)
/home/vagrant/go/src/os/file_unix.go:165 +0x6a
os.(*File).Read(0xc42000c070, 0xc420072400, 0x200, 0x200, 0x42e5be, 0xc4200a8680, 0xc42001e660)
/home/vagrant/go/src/os/file.go:101 +0x76
bytes.(*Buffer).ReadFrom(0xc420048230, 0x5994a0, 0xc42000c070, 0x800638028, 0xc420048230, 0xc42001e701)
/home/vagrant/go/src/bytes/buffer.go:179 +0x160
io.copyBuffer(0x5992a0, 0xc420048230, 0x5994a0, 0xc42000c070, 0x0, 0x0, 0x0, 0xc4200a44d8, 0x0, 0x0)
/home/vagrant/go/src/io/io.go:384 +0x2cb
io.Copy(0x5992a0, 0xc420048230, 0x5994a0, 0xc42000c070, 0x4f3660, 0xc4200a4480, 0xc42001e7b0)
/home/vagrant/go/src/io/io.go:360 +0x68
os/exec.(*Cmd).writerDescriptor.func1(0x4f3660, 0xc4200a4480)
/home/vagrant/go/src/os/exec/exec.go:254 +0x4d
os/exec.(*Cmd).Start.func1(0xc4200686e0, 0xc42000a580)
/home/vagrant/go/src/os/exec/exec.go:371 +0x27
created by os/exec.(*Cmd).Start
/home/vagrant/go/src/os/exec/exec.go:372 +0x4e4
OK, that's good progress. Thanks. @tmm1, in the trace it says pc=0x451730. What is that pc in the binary? (for example, go tool objdump yourprog | grep -C 10 0x451730)
$ go tool objdump spawntest | grep -C 10 0x451730
sys_freebsd_amd64.s:267 0x45170b 48890424 MOVQ AX, 0(SP)
sys_freebsd_amd64.s:268 0x45170f b8e8030000 MOVL $0x3e8, AX
sys_freebsd_amd64.s:269 0x451714 f7e2 MULL DX
sys_freebsd_amd64.s:270 0x451716 4889442408 MOVQ AX, 0x8(SP)
sys_freebsd_amd64.s:272 0x45171b 4889e7 MOVQ SP, DI
sys_freebsd_amd64.s:273 0x45171e 31f6 XORL SI, SI
sys_freebsd_amd64.s:274 0x451720 b8f0000000 MOVL $0xf0, AX
sys_freebsd_amd64.s:275 0x451725 0f05 SYSCALL
sys_freebsd_amd64.s:276 0x451727 488b6c2410 MOVQ 0x10(SP), BP
sys_freebsd_amd64.s:276 0x45172c 4883c418 ADDQ $0x18, SP
sys_freebsd_amd64.s:276 0x451730 c3 RET
TEXT runtime.settls(SB) /home/vagrant/go/src/runtime/sys_freebsd_amd64.s
sys_freebsd_amd64.s:279 0x451740 4883ec10 SUBQ $0x10, SP
sys_freebsd_amd64.s:279 0x451744 48896c2408 MOVQ BP, 0x8(SP)
sys_freebsd_amd64.s:279 0x451749 488d6c2408 LEAQ 0x8(SP), BP
sys_freebsd_amd64.s:280 0x45174e 4883c708 ADDQ $0x8, DI
sys_freebsd_amd64.s:281 0x451752 48893c24 MOVQ DI, 0(SP)
sys_freebsd_amd64.s:282 0x451756 4889e6 MOVQ SP, SI
sys_freebsd_amd64.s:283 0x451759 48c7c781000000 MOVQ $0x81, DI
sys_freebsd_amd64.s:284 0x451760 48c7c0a5000000 MOVQ $0xa5, AX
Ignore the above I've produced it and its is very quick, so definitely very interesting.
What version of FreeBSD did you repro on?
I'm still confused as to why my FreeBSD VM does not reproduce, but my FreeNAS VM does.
I noticed there was an update available to FreeNAS (9.10 -> 9.10.2). I updated, and the crash still happens.
before:
FreeBSD freenas.local 10.3-STABLE FreeBSD 10.3-STABLE #0 455d13d(9.10-STABLE): Sun Jun 26 22:47:03 PDT 2016 [email protected]:/tank/home/nightlies/build-freenas9/_BE/objs/tank/home/nightlies/build-freenas9/_BE/trueos/sys/FreeNAS.amd64 amd64
after:
FreeBSD freenas.local 10.3-STABLE FreeBSD 10.3-STABLE #0 41eb257(9.10.2-STABLE): Mon Mar 6 17:03:14 UTC 2017 root@gauntlet:/freenas-9.10-releng/_BE/objs/freenas-9.10-releng/_BE/os/sys/FreeNAS.amd64 amd64
@tmm1 FreeBSD 11.0-RELEASE no VM.
Here's a different trace I got just now. Clearly some nasty memory corruption going on. It's almost as if the RawSyscall wrapper is wrong or we're using vfork instead of fork (neither of which appears to be the case).
runtime: nelems=6 nfree=4 nalloc=2 previous allocCount=1 nfreed=65535
fatal error: sweep increased allocation count
runtime stack:
runtime.throw(0x5231bf, 0x20)
/home/vagrant/go/src/runtime/panic.go:596 +0x95
runtime.(*mspan).sweep(0x80059cda8, 0x80059cd00, 0xc420000b60)
/home/vagrant/go/src/runtime/mgcsweep.go:288 +0x786
runtime.sweepone(0x42d876)
/home/vagrant/go/src/runtime/mgcsweep.go:110 +0x1ad
runtime.gosweepone.func1()
/home/vagrant/go/src/runtime/mgcsweep.go:125 +0x2b
runtime.systemstack(0xc420016000)
/home/vagrant/go/src/runtime/asm_amd64.s:327 +0x79
runtime.mstart()
/home/vagrant/go/src/runtime/proc.go:1132
goroutine 3 [running]:
runtime.systemstack_switch()
/home/vagrant/go/src/runtime/asm_amd64.s:281 fp=0xc420020f68 sp=0xc420020f60
runtime.gosweepone(0x0)
/home/vagrant/go/src/runtime/mgcsweep.go:126 +0x53 fp=0xc420020f98 sp=0xc420020f68
runtime.bgsweep(0xc420048000)
/home/vagrant/go/src/runtime/mgcsweep.go:59 +0xbb fp=0xc420020fd8 sp=0xc420020f98
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420020fe0 sp=0xc420020fd8
created by runtime.gcenable
/home/vagrant/go/src/runtime/mgc.go:212 +0x61
goroutine 1 [runnable]:
regexp.(*bitState).reset(0xc4203737a0, 0x38, 0xc)
/home/vagrant/go/src/regexp/backtrack.go:95 +0xff
regexp.(*machine).backtrack(0xc420101680, 0x59a460, 0xc420101758, 0x0, 0x38, 0xc, 0x4f)
/home/vagrant/go/src/regexp/backtrack.go:321 +0xa1
regexp.(*Regexp).doExecute(0xc42036fcc0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc420098440, 0x38, 0x0, 0xc, ...)
/home/vagrant/go/src/regexp/exec.go:439 +0x33a
regexp.(*Regexp).FindStringSubmatch(0xc42036fcc0, 0xc420098440, 0x38, 0x0, 0x0, 0x0)
/home/vagrant/go/src/regexp/regexp.go:919 +0xec
github.com/fancybits/gopsutil/cpu.Info(0x51f162, 0xb, 0x0, 0x0, 0x0)
/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu_freebsd.go:133 +0x3ef
main.main()
/home/vagrant/spawntest/main.go:28 +0x67
OK, this is interesting: I just got a crash even with GOGC=off and ktrace, e.g.
time sh -c 'export GOGC=off; export GOTRACEBACK=crash; ktrace -i -f ktrace-test.out ./test '
And?
Here's an updated version of the reproduction program, now with no external dependencies or goroutines:
package main
import (
"bufio"
"log"
"os"
"os/exec"
"regexp"
"syscall"
"time"
)
func init() {
exec.Command("true").Run()
// do some work after the fork/exec, to expose memory corruption if it occurred
f, _ := os.Open("/var/run/dmesg.boot")
defer f.Close()
scanner := bufio.NewScanner(f)
for scanner.Scan() {
line := scanner.Text()
regexp.MustCompile(`Origin\s*=\s*"(.+)"\s+Id\s*=\s*(.+)\s+Family\s*=\s*(.+)\s+Model\s*=\s*(.+)\s+Stepping\s*=\s*(.+)`).FindStringSubmatch(line)
}
}
func main() {
log.SetFlags(log.LstdFlags)
log.Printf("started process")
time.Sleep(1 * time.Millisecond)
log.Printf("running exec...")
syscall.Exec(os.Args[0], os.Args, os.Environ())
log.Printf("exec failed!")
}
This is interesting:
If I comment out the time.Sleep(50 * time.Millisecond) (or use time.Sleep(0)), I can no longer reproduce the crash.
If I change it to time.Sleep(1 * time.Millisecond), the crash happens reliably within a couple seconds of running.
@tmm1, nice. Sadly, still no crash for me with FreeBSD 10.1-RELEASE on a Google Cloud VM; tried both 1ms and 50ms sleeps.
I'm starting to wonder about threading behavior, especially since it seems like init is important. (I assume if it wasn't, you'd have simplified further already.)
Some things to try. First rename init to xinit, main to xmain. Then:
(1) Make sure this still crashes:
func init() { xinit() }
func main() { xmain() }
(2) Confirm that this doesn't crash:
func main() { xinit(); xmain() }
(3) See if this crashes:
func init() { runtime.LockOSThread() }
func main() { xinit(); xmain() }
(4) See if this crashes:
func init() { runtime.LockOSThread() }
func main() { xinit(); runtime.UnlockOSThread(); xmain() }
(5) See if this crashes:
func init() { runtime.LockOSThread() }
func main() { xinit(); go xmain(); select{} }
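For anyone assembling these, here is how variant (5) fits together once init and main from the repro above are renamed to xinit and xmain; it's just a sketch that copies the bodies from the earlier reproduction program.

```go
package main

import (
	"bufio"
	"log"
	"os"
	"os/exec"
	"regexp"
	"runtime"
	"syscall"
	"time"
)

// xinit: fork/exec a child, then do some work to expose corruption.
func xinit() {
	exec.Command("true").Run()

	f, _ := os.Open("/var/run/dmesg.boot")
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		regexp.MustCompile(`Origin\s*=\s*"(.+)"\s+Id\s*=\s*(.+)\s+Family\s*=\s*(.+)\s+Model\s*=\s*(.+)\s+Stepping\s*=\s*(.+)`).FindStringSubmatch(line)
	}
}

// xmain: re-exec the current binary.
func xmain() {
	log.SetFlags(log.LstdFlags)
	log.Printf("started process")
	time.Sleep(1 * time.Millisecond)
	log.Printf("running exec...")
	syscall.Exec(os.Args[0], os.Args, os.Environ())
	log.Printf("exec failed!")
}

// Variant (5): the fork happens locked to the main thread, the exec
// happens on some other, non-main thread.
func init() { runtime.LockOSThread() }

func main() {
	xinit()
	go xmain()
	select {}
}
```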
@tmm1, nice. Sadly, still no crash for me with FreeBSD 10.1-RELEASE on a Google Cloud VM; tried both 1ms and 50ms sleeps.
Strange. One thing I just remembered is that I had trouble booting the FreeNAS VM in virtualbox and had to change some settings before it worked (https://github.com/drajen/freenas-vagrant/issues/1). I will try making that same change on the FreeBSD VM to see if it causes the repro to work there.
I'm starting to wonder about threading behavior, especially since it seems like init is important. (I assume if it wasn't, you'd have simplified further already.)
Looks like the init is irrelevant, I just didn't get that far in the simplification process.
Here are the results from your tests:
Very interesting that (3) does not crash. That variant makes sure the fork and the final exec happen in the same thread (the main thread, it turns out).
Does (5) crash more reliably than the others? It forces the fork to happen in the main thread and then forces the exec to happen in a non-main thread.
Also interesting would be the reverse (fork forced in non-main thread, final exec forced in main thread):
(6)
func init() { runtime.LockOSThread() }
func main() {
ch := make(chan int)
go func() {
xinit();
ch <- 1
}()
<-ch
xmain()
}
It's really tempting to think there is some kind of race in exec where the OS doesn't quite manage to stop all the threads in the old process before creating the new one, hence the memory fault or SIGILL trying to execute a return instruction. But I don't see where the race could be: the final exec doesn't happen until the first fork+exec is over and cleaned up. It's almost like a thread in the new process starts running and then accidentally gets torn down as if it were part of the old process.
(6) does not crash.
(4) and (5) appear to crash more quickly than the rest.
Random sampling of timings (of course these are all highly variable):
[vagrant@freenas ~]$ time ./rsctest1
Illegal instruction
real 0m0.685s
user 0m0.213s
sys 0m0.427s
[vagrant@freenas ~]$ time ./rsctest2
Illegal instruction
real 0m8.332s
user 0m3.093s
sys 0m6.164s
[vagrant@freenas ~]$ time ./rsctest3
^C
real 0m25.811s
user 0m10.192s
sys 0m18.873s
[vagrant@freenas ~]$ time ./rsctest4
Illegal instruction
real 0m0.026s
user 0m0.010s
sys 0m0.020s
[vagrant@freenas ~]$ time ./rsctest5
Illegal instruction
real 0m0.146s
user 0m0.051s
sys 0m0.111s
[vagrant@freenas ~]$ time ./rsctest6
^C
real 0m33.222s
user 0m12.663s
sys 0m26.550s
If you can, please try one more:
(7)
func init() { runtime.LockOSThread() }
func main() {
xinit()
ch := make(chan int)
go func() {
runtime.LockOSThread()
ch <- 1
<-ch
xmain()
}()
<-ch
runtime.UnlockOSThread()
ch <- 1
select {}
}
| run | xinit | xmain | outcome |
| --- | --- | --- | --- |
| (1) | locked to main thread | on any thread | crash |
| (2) | on any thread | on any thread | crash |
| (3) | locked to main thread | locked to main thread | success |
| (4) | locked to main thread | on any thread | crash |
| (5) | locked to main thread | on any non-main thread | crash |
| (6) | on any non-main thread | locked to main thread | success |
| (7) | locked to main thread | locked to non-main thread | ??? |
Clearly the problem goes away when xmain is locked to the main OS thread. But what's important? The fact that it's locked to a thread or the fact that it's the main OS thread? The new program (7) is halfway between (3) and (5): it runs xmain locked to a thread, just definitely not the main OS thread.
While Go certainly treats locked vs unlocked thread execution differently, it otherwise makes no distinctions or special cases for the main OS thread. If (7) succeeds, then the important detail is the Go runtime's thread locking, not the specific operating system thread, which would seem to implicate Go. If (7) crashes, then the important detail would appear to be the specific operating system thread used for the syscall.Exec, which would seem to implicate FreeBSD.
Also, if you remove the "do some work", does that affect crash frequency? It's not involved in any of these theories, nor in the SIGBUS/SIGILL above, but maybe it is important for getting the timing right.
If (7) crashes the next step is probably to try to convert it into a C program that does the fork+wait of /bin/true from the main thread and then starts a new pthread for the execve, leaving the main thread to block (maybe pthread_join on the new thread).
(7) crashes.
Without the "do work" block, the crashes take longer to reproduce. For example, (2) took 46s and 90s to crash, and (4) took 27s and 21s to crash.
I wrote this, but haven't been able to reproduce the crash with it yet:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h> /* for waitpid() */
#include <pthread.h>
#include <errno.h>
extern char **environ;
void *do_exec(void *arg) {
fprintf(stdout, "running exec...\n");
char **argv = (char**)arg;
execve(argv[0], argv, environ);
fprintf(stdout, "exec failed: %d!\n", errno);
exit(1);
return NULL;
}
int main(int argc, char *argv[]) {
fprintf(stdout, "started process.\n");
int pid = fork();
if (pid == 0) { // child
if (execve("/usr/bin/true", argv, environ) < 0) {
fprintf(stderr, "execve(true) failed: %d\n", errno);
}
exit(0);
}
waitpid(pid, NULL, 0);
usleep(1000);
pthread_t thread;
pthread_create(&thread, NULL, do_exec, argv);
pthread_join(thread, NULL);
return 0;
}
I suspect that even if memory corruption were occurring in the C program, there's just not enough going on for the program to notice.
FWIW, this crash at least is not memory corruption, at least not in the "allocated memory / program heap" sense.
[signal SIGBUS: bus error code=0x3 addr=0x451730 pc=0x451730]
The fault here is trying to read from memory the instruction to be executed (addr = pc), which objdump identified as a valid PC that should have had a RET instruction mapped there. This is very strongly suggestive of the operating system doing process management wrong.
It's definitely possible that the C binary is so tiny compared to the Go binary that there are many fewer pages and less opportunity to get things wrong. Compiling with -static might be worth trying.
I tried upgrading my 10.3-RELEASE VM to 10.3-RELEASE-p11 and still couldn't reproduce any crashes.
Then I started up a fresh new 11.0-RELEASE-p8 VM (based on @stevenh's comment earlier), and sure enough it crashes there.
FreeBSD 11.0-RELEASE-p8 FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017 [email protected]:/usr/obj/usr/src/sys/GENERIC amd64
@rsc Maybe you can reproduce on GCE with --image freebsd-11-0-release-p1-amd64 --image-project=freebsd-org-cloud-dev
@rsc, we have FreeBSD 11 gomote-accessible builders, too.
Behaviors on FreeBSD 11 are slightly different from my FreeNAS VM, so it's probably worth testing all the cases again.
For instance, now it crashes with and without the time.Sleep(), but _only if_ the program uses stdout as well. I commented out the log.Printf lines and it wouldn't crash (quickly?) anymore.
Similarly, if I remove the "do work" section I can't get it to crash anymore on 11.0 either.
My original simplified repro still crashes quickly though. Definitely curious to hear if you see the same on GCE.
Here's the crash I'm seeing on FreeBSD 11:
runtime: nelems=6 nfree=4 nalloc=2 previous allocCount=1 nfreed=65535
fatal error: sweep increased allocation count
runtime stack:
runtime.throw(0x4fa91e, 0x20)
/home/vagrant/go/src/runtime/panic.go:596 +0x95 fp=0xc42002fe58 sp=0xc42002fe38
runtime.(*mspan).sweep(0x800563b48, 0x800563b00, 0xc420000820)
/home/vagrant/go/src/runtime/mgcsweep.go:288 +0x786 fp=0xc42002ff28 sp=0xc42002fe58
runtime.sweepone(0x42d276)
/home/vagrant/go/src/runtime/mgcsweep.go:110 +0x1ad fp=0xc42002ff90 sp=0xc42002ff28
runtime.gosweepone.func1()
/home/vagrant/go/src/runtime/mgcsweep.go:125 +0x2b fp=0xc42002ffb0 sp=0xc42002ff90
runtime.systemstack(0xc420016000)
/home/vagrant/go/src/runtime/asm_amd64.s:327 +0x79 fp=0xc42002ffb8 sp=0xc42002ffb0
runtime.mstart()
/home/vagrant/go/src/runtime/proc.go:1132 fp=0xc42002ffc0 sp=0xc42002ffb8
goroutine 3 [running]:
runtime.systemstack_switch()
/home/vagrant/go/src/runtime/asm_amd64.s:281 fp=0xc420020f68 sp=0xc420020f60
runtime.gosweepone(0x0)
/home/vagrant/go/src/runtime/mgcsweep.go:126 +0x53 fp=0xc420020f98 sp=0xc420020f68
runtime.bgsweep(0xc420014070)
/home/vagrant/go/src/runtime/mgcsweep.go:59 +0xbb fp=0xc420020fd8 sp=0xc420020f98
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420020fe0 sp=0xc420020fd8
created by runtime.gcenable
/home/vagrant/go/src/runtime/mgc.go:212 +0x61
goroutine 1 [runnable]:
syscall.Syscall(0x4, 0x2, 0xc4203cae40, 0x24, 0x24, 0x24, 0x0)
/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5 fp=0xc420035cb8 sp=0xc420035cb0
syscall.write(0x2, 0xc4203cae40, 0x24, 0x40, 0x20, 0x13, 0x100000000000011)
/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:1271 +0x55 fp=0xc420035d10 sp=0xc420035cb8
syscall.Write(0x2, 0xc4203cae40, 0x24, 0x40, 0x80055f000, 0x0, 0x14)
/home/vagrant/go/src/syscall/syscall_unix.go:181 +0x49 fp=0xc420035d58 sp=0xc420035d10
os.(*File).write(0xc42000c020, 0xc4203cae40, 0x24, 0x40, 0x40, 0x14, 0x40)
/home/vagrant/go/src/os/file_unix.go:186 +0x95 fp=0xc420035dc8 sp=0xc420035d58
os.(*File).Write(0xc42000c020, 0xc4203cae40, 0x24, 0x40, 0x23, 0xc4203cae40, 0x14)
/home/vagrant/go/src/os/file.go:142 +0x7c fp=0xc420035e40 sp=0xc420035dc8
log.(*Logger).Output(0xc420040050, 0x2, 0xc42030ffd0, 0xf, 0x0, 0x0)
/home/vagrant/go/src/log/log.go:168 +0x210 fp=0xc420035ee0 sp=0xc420035e40
log.Printf(0x4f7579, 0xf, 0x0, 0x0, 0x0)
/home/vagrant/go/src/log/log.go:291 +0x80 fp=0xc420035f28 sp=0xc420035ee0
main.main()
/home/vagrant/spawntest/main.go:28 +0x62 fp=0xc420035f88 sp=0xc420035f28
runtime.main()
/home/vagrant/go/src/runtime/proc.go:185 +0x20a fp=0xc420035fe0 sp=0xc420035f88
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420035fe8 sp=0xc420035fe0
goroutine 2 [force gc (idle)]:
runtime.gopark(0x4fd138, 0x573a30, 0x4f74a7, 0xf, 0x4fd014, 0x1)
/home/vagrant/go/src/runtime/proc.go:271 +0x13a fp=0xc420020768 sp=0xc420020738
runtime.goparkunlock(0x573a30, 0x4f74a7, 0xf, 0xc420000114, 0x1)
/home/vagrant/go/src/runtime/proc.go:277 +0x5e fp=0xc4200207a8 sp=0xc420020768
runtime.forcegchelper()
/home/vagrant/go/src/runtime/proc.go:226 +0x9e fp=0xc4200207e0 sp=0xc4200207a8
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4200207e8 sp=0xc4200207e0
created by runtime.init.4
/home/vagrant/go/src/runtime/proc.go:215 +0x35
goroutine 4 [finalizer wait]:
runtime.gopark(0x4fd138, 0x58e1f0, 0x4f72c1, 0xe, 0x14, 0x1)
/home/vagrant/go/src/runtime/proc.go:271 +0x13a fp=0xc420021718 sp=0xc4200216e8
runtime.goparkunlock(0x58e1f0, 0x4f72c1, 0xe, 0x2000000014, 0x1)
/home/vagrant/go/src/runtime/proc.go:277 +0x5e fp=0xc420021758 sp=0xc420021718
runtime.runfinq()
/home/vagrant/go/src/runtime/mfinal.go:161 +0xb2 fp=0xc4200217e0 sp=0xc420021758
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4200217e8 sp=0xc4200217e0
created by runtime.createfing
/home/vagrant/go/src/runtime/mfinal.go:142 +0x62
goroutine 5 [GC worker (idle)]:
runtime.gopark(0x4fcfd8, 0xc42030f860, 0x4f7654, 0x10, 0x14, 0x0)
/home/vagrant/go/src/runtime/proc.go:271 +0x13a fp=0xc420021f50 sp=0xc420021f20
runtime.gcBgMarkWorker(0xc420016000)
/home/vagrant/go/src/runtime/mgc.go:1491 +0x138 fp=0xc420021fd8 sp=0xc420021f50
runtime.goexit()
/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420021fe0 sp=0xc420021fd8
created by runtime.gcBgMarkStartWorkers
/home/vagrant/go/src/runtime/mgc.go:1412 +0x98
Abort trap (core dumped)
If I run with GOGC=off, I eventually see the program quit with "Illegal instruction" and a coredump.
Reading symbols from spawntest...done.
[New LWP 100878]
[New LWP 100139]
[New LWP 100146]
Core was generated by `spawntest'.
Program terminated with signal SIGILL, Illegal instruction.
#0 syscall.Syscall6 () at /home/vagrant/go/src/syscall/asm_unix_amd64.s:50
50 SYSCALL
[Current thread is 1 (LWP 100878)]
Loading Go Runtime support.
(gdb) info threads
Id Target Id Frame
* 1 LWP 100878 syscall.Syscall6 () at /home/vagrant/go/src/syscall/asm_unix_amd64.s:50
2 LWP 100139 runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21
3 LWP 100146 runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21
(gdb) thread apply all bt
Thread 3 (LWP 100146):
#0 runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21
#1 0x0000000000423b3d in runtime.futexsleep1 (addr=0xc420024518, val=0, ns=-1) at /home/vagrant/go/src/runtime/os_freebsd.go:92
#2 0x000000000044b5ea in runtime.futexsleep.func1 () at /home/vagrant/go/src/runtime/os_freebsd.go:80
#3 0x000000000044cf3b in runtime.systemstack () at /home/vagrant/go/src/runtime/asm_amd64.s:343
#4 0x0000000000423acd in runtime.futexsleep (addr=0xc420024518, val=0, ns=-1) at /home/vagrant/go/src/runtime/os_freebsd.go:81
#5 0x000000000040c68b in runtime.notesleep (n=0xc420024518) at /home/vagrant/go/src/runtime/lock_futex.go:145
#6 0x000000000042b2cd in runtime.stopm () at /home/vagrant/go/src/runtime/proc.go:1650
#7 0x000000000042c0e4 in runtime.findrunnable (gp#10=0xc420016000, inheritTime=false) at /home/vagrant/go/src/runtime/proc.go:2102
#8 0x000000000042cc7c in runtime.schedule () at /home/vagrant/go/src/runtime/proc.go:2222
#9 0x000000000042cf7b in runtime.park_m (gp=0xc420000680) at /home/vagrant/go/src/runtime/proc.go:2285
#10 0x000000000044ce6b in runtime.mcall () at /home/vagrant/go/src/runtime/asm_amd64.s:269
#11 0x000000c420016000 in ?? ()
#12 0x0000000000000000 in ?? ()
Thread 2 (LWP 100139):
#0 runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21
#1 0x0000000000423cae in runtime.futexwakeup (addr=0xc420024518, cnt=1) at /home/vagrant/go/src/runtime/os_freebsd.go:102
#2 0x000000000040c5a6 in runtime.notewakeup (n=0xc420024518) at /home/vagrant/go/src/runtime/lock_futex.go:135
#3 0x000000000042b4bd in runtime.startm (_p_=0xc420016000, spinning=false) at /home/vagrant/go/src/runtime/proc.go:1713
#4 0x000000000042b635 in runtime.handoffp (_p_=0xc420016000) at /home/vagrant/go/src/runtime/proc.go:1725
#5 0x000000000043151d in runtime.retake (now=4278381200320, ~r1=0) at /home/vagrant/go/src/runtime/proc.go:3909
#6 0x000000000043113a in runtime.sysmon () at /home/vagrant/go/src/runtime/proc.go:3836
#7 0x000000000042a39e in runtime.mstart1 () at /home/vagrant/go/src/runtime/proc.go:1179
#8 0x000000000042a274 in runtime.mstart () at /home/vagrant/go/src/runtime/proc.go:1149
#9 0x0000000000450927 in runtime.thr_start () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:45
#10 0x0000000000000000 in ?? ()
Thread 1 (LWP 100878):
#0 syscall.Syscall6 () at /home/vagrant/go/src/syscall/asm_unix_amd64.s:50
#1 0x0000000000478abb in os.(*Process).blockUntilWaitable (p=0xc420012330, ~r0=96, ~r1=...) at /home/vagrant/go/src/os/wait_wait6.go:29
#2 0x00000000004763bd in os.(*Process).wait (p=0xc420012330, ps=0x8, err=...) at /home/vagrant/go/src/os/exec_unix.go:22
#3 0x000000000047594b in os.(*Process).Wait (p=0xc420012330, ~r0=0x0, ~r1=...) at /home/vagrant/go/src/os/exec.go:115
#4 0x000000000049fc52 in os/exec.(*Cmd).Wait (c=0xc420064000, ~r0=...) at /home/vagrant/go/src/os/exec/exec.go:435
#5 0x000000000049f08c in os/exec.(*Cmd).Run (c=0xc420064000, ~r0=...) at /home/vagrant/go/src/os/exec/exec.go:280
#6 0x00000000004be011 in main.init.1 () at /home/vagrant/spawntest/main.go:14
#7 0x00000000004be347 in main.init ()
#8 0x000000000042747a in runtime.main () at /home/vagrant/go/src/runtime/proc.go:173
#9 0x000000000044f9d1 in runtime.goexit () at /home/vagrant/go/src/runtime/asm_amd64.s:2197
#10 0x0000000000000000 in ?? ()
(gdb) l
45 MOVQ a2+16(FP), SI
46 MOVQ a3+24(FP), DX
47 MOVQ a4+32(FP), R10
48 MOVQ a5+40(FP), R8
49 MOVQ a6+48(FP), R9
50 SYSCALL
51 JCC ok6
52 MOVQ $-1, r1+56(FP) // r1
53 MOVQ $0, r2+64(FP) // r2
54 MOVQ AX, err+72(FP) // errno
(gdb) disas
Dump of assembler code for function syscall.Syscall6:
0x000000000046c810 <+0>: callq 0x42d9d0 <runtime.entersyscall>
0x000000000046c815 <+5>: mov 0x8(%rsp),%rax
0x000000000046c81a <+10>: mov 0x10(%rsp),%rdi
0x000000000046c81f <+15>: mov 0x18(%rsp),%rsi
0x000000000046c824 <+20>: mov 0x20(%rsp),%rdx
0x000000000046c829 <+25>: mov 0x28(%rsp),%r10
0x000000000046c82e <+30>: mov 0x30(%rsp),%r8
0x000000000046c833 <+35>: mov 0x38(%rsp),%r9
=> 0x000000000046c838 <+40>: syscall
0x000000000046c83a <+42>: jae 0x46c859 <syscall.Syscall6+73>
0x000000000046c83c <+44>: movq $0xffffffffffffffff,0x40(%rsp)
0x000000000046c845 <+53>: movq $0x0,0x48(%rsp)
0x000000000046c84e <+62>: mov %rax,0x50(%rsp)
0x000000000046c853 <+67>: callq 0x42ddf0 <runtime.exitsyscall>
0x000000000046c858 <+72>: retq
0x000000000046c859 <+73>: mov %rax,0x40(%rsp)
0x000000000046c85e <+78>: mov %rdx,0x48(%rsp)
0x000000000046c863 <+83>: movq $0x0,0x50(%rsp)
0x000000000046c86c <+92>: callq 0x42ddf0 <runtime.exitsyscall>
0x000000000046c871 <+97>: retq
End of assembler dump.
Ran with ktrace and it crashed after a few minutes. Here's the last iteration worth of events: https://gist.github.com/tmm1/f10330c43fe6dfebeabcf5188e3ff535
Not particularly interesting from the kernel's point of view.
Thanks @bradfitz. I only just realized that the numbers in freebsd-amd64-gce93 and freebsd-amd64-gce101 are FreeBSD versions. I thought they were GCE versions somehow.
Anyway, reproduced reliably on freebsd-amd64-110 gomote (FreeBSD 11.0). That's progress.
Yeah, I've started dropping "gce" in the suffixes due to the confusion. Also, the freebsd-amd64-110 gomote is quite new (9 days ago in #19097). They weren't available when y'all started debugging this.
Anyway, reproduced reliably on freebsd-amd64-110 gomote (FreeBSD 11.0). That's progress.
Great! Let me know if I can do anything else to help.
Booted up a fresh FreeBSD-10.3-STABLE VM this morning, and I'm not able to repro my crash there either. So far, here's where my crash can be reproduced:
| version | outcome |
|---------|---------|
| 10.1-RELEASE (GCE) | not crashing |
| 10.3-RELEASE | not crashing |
| 10.3-RELEASE-p11 | not crashing |
| 10.3-STABLE (r314494) | not crashing |
| 10.3-STABLE (freenas-9.10 455d13d) | crashing |
| 10.3-STABLE (freenas-9.10.2 41eb257) | crashing |
| 11.0-RELEASE | crashing |
| 11.0-RELEASE-p8 | crashing |
| 11.0-STABLE (r314493) | crashing |
Interestingly the earlier repro was crashing for @derekmarcotte on 10.3-RELEASE, so I wonder if my issue is a different bug. I guess the 10.3 repro was also taking days, so possible it's just not triggering as readily on the older kernels.
Hiya,
In case this is architecture specific, can people include their actual CPU
details too?
-adrian
E5-2640 here
All my testing has been on VirtualBox VMs, on a darwin 15.6.0 host with a i5-2415M
Oh, and has anyone tried reproducing this on a single-core box, versus SMP?
(you can use cpuset to push the process+descendents onto a single core.)
-adrian
Booted up a 11.0-STABLE (r314493) VM and that crashes too.
(you can use cpuset to push the process+descendents onto a single core.)
I tried cpuset -l 0 ./spawntest and it still crashed.
Hiya,
Well, doing it in a VM doesn't mean the host OS won't drift it across CPU
cores. :-)
-adrian
Tried it here: still crashes, but takes much longer, and that's on the metal.
In case this is architecture specific, can people include their actual CPU
details too?
FreeBSD 11.0-RELEASE-p2 #0: Mon Oct 24 06:55:27 UTC 2016
[email protected]:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.8.0 (tags/RELEASE_380/final 262564) (based on LLVM 3.8.0)
VT(vga): resolution 640x480
CPU: AMD G-T40E Processor (1000.02-MHz K8-class CPU)
Origin="AuthenticAMD" Id=0x500f20 Family=0x14 Model=0x2 Stepping=0
Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x802209<SSE3,MON,SSSE3,CX16,POPCNT>
AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
AMD Features2=0x35ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,IBS,SKINIT,WDT>
SVM: NP,NRIP,NAsids=8
TSC: P-state invariant, performance statistics
Sorry if I'm asking a dumb question, but could this be as simple as a stack overflow? By default pthreads have smaller stacks than main threads, and a SIGBUS is the normal symptom of overflowing a thread's stack. So that's compatible with the observation that the crash only happens when the thread that calls execve is not the main thread of the process.
The failure signal is pretty random, I've seen SIGILL, SIGBUS and SIGSEGV :(
@asomers All the stacks here are goroutine stacks which never overflow.
@ianlancetaylor goroutines never SIGILL either, except when they do. There are real stacks down there somewhere.
@asomers These crashes are happening with pure Go programs. There are no pthreads involved anywhere. The stacks grow automatically: the function prologue ensures there is enough stack space before the function begins. I don't know what the problem is, but I am confident that it is not "as simple as a stack overflow."
Hi,
So it's a single-threaded process? (i.e., I'm asking if there are any threads at all for your process, versus specifically pthreads using our libthr.)
Does your Go runtime hook into any of the standard C runtime / linker
stuff? initialisers, destructors, etc?
-adrian
@erikarn it's a multithreaded process. As @tmm1's ktrace output shows, there appear to be 3 threads.
@ianlancetaylor the bug is unlikely to lie in pure Go code. It's probably either in the OS or in the implementation of Go. Remember, everything is assembly in the end. I don't know if a stack overflow is the problem, but I do know that when investigating a bug like this, we can't assume that Go's high-level safety guarantees are working.
@asomers For background I've been working on Go since 2008 and I'm very familiar with Go's implementation. I agree that Go's high level safety guarantees are not relevant here. I am simply asserting with high confidence that this is not a stack overflow.
(Personally I've believed for months now that this is a bug in the FreeBSD kernel, but I've never been able to recreate the problem myself, so I can't prove it. I could certainly be wrong--it could be a bug in Go. If it is a bug in Go it is most likely due to some error in the way that Go invokes the fork system call in syscall/exec_bsd.go. Perhaps the system call needs special handling on FreeBSD in some way, as is true of, for example, vfork on GNU/Linux.)
@erikarn
Does your Go runtime hook into any of the standard C runtime / linker stuff? initialisers, destructors, etc?
No.
Another data point for this is that after 8 hours of testing the program from https://github.com/golang/go/issues/15658#issuecomment-284940367 under go 1.4 it has still yet to fail.
The same under go 1.8 fails within a matter of minutes at the most.
A key difference between Go 1.4 and Go 1.8 is that Go 1.4 always defaults to a GOMAXPROCS value of 1, whereas in 1.8 GOMAXPROCS defaults to the number of CPUs on the system. So when testing with Go 1.4, set the environment variable GOMAXPROCS=4 or however many CPUs you have.
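A small sketch (not from the original tests) of forcing the same setting from inside the program rather than via the environment variable:
package main
import (
	"log"
	"runtime"
)
func main() {
	// On Go 1.4 this overrides the old default of 1; on 1.5 and later it is
	// effectively a no-op, since the default is already the number of CPUs.
	runtime.GOMAXPROCS(runtime.NumCPU())
	log.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}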
Good point, this is subtly different from using a single core in the cpuset, as IIRC there's an outstanding issue to make Go set GOMAXPROCS based on the number of cores available to the process rather than the total cores in the system; I still got an error on 1.8 in that configuration.
I've set GOMAXPROCS=24 and am retesting.
If go 1.4 still holds up, what do we think the chances are that there was a bug introduced when moving from C to pure Go?
@stevenh I think the issue about the number of cores available to the process is #15206, which was recently fixed on tip. The fix will be in 1.9.
There isn't a significant amount of C code involved here, so while anything is possible I think it is unlikely to be related to the conversion from C to Go. If it works with 1.4 and fails with 1.8 then a far more likely immediate cause is the concurrent garbage collector. Perhaps there is some problem when one thread reads or writes memory pages while a different thread is calling fork.
@ianlancetaylor GC should be out of the picture as we've had failures with GOGC=off
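(For completeness, GOGC=off and disabling the collector from inside the program are equivalent; a minimal sketch, not something the tests above relied on:)
package main
import "runtime/debug"
func init() {
	// Equivalent to running with GOGC=off: disables the garbage collector.
	debug.SetGCPercent(-1)
}
func main() {
	// the real test body would go here
}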
@stevenh Hmmm, right.
Since we are only seeing reports on FreeBSD I think any problem with the C to Go conversion would be in runtime/os_freebsd.c converting to runtime/os_freebsd.go. I don't see any problems there in a quick look, but who knows.
@ianlancetaylor Yeah, I had a quick look too the other day and couldn't spot anything either.
However, on cursory review libthr (FreeBSD's native thread library) has a lot more code in its fork processing than the go runtime seems to, so I'm wondering if some needed protection is missing, which would allow additional threads to perform actions during fork and result in the memory corruption we're seeing.
libthr fork can be seen here:
https://svnweb.freebsd.org/base/head/lib/libthr/thread/thr_fork.c?view=markup#l135
One test that might help backup that premise, is if we could ensure that no other threads ran while a fork was in process.
One test that might help backup that premise, is if we could ensure that no other threads ran while a fork was in process.
Any ideas on the best way to hack this in place to try it out?
@tmm1 Have beforefork in runtime/proc.go call stopTheWorld, and have afterfork call startTheWorld.
@ianlancetaylor Thanks!
I applied this patch and built a new binary with it, but the resulting binary still crashed quickly.
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f41672de73..55950bf64d 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -2793,6 +2793,7 @@ func beforefork() {
//go:linkname syscall_runtime_BeforeFork syscall.runtime_BeforeFork
//go:nosplit
func syscall_runtime_BeforeFork() {
+ stopTheWorld("beforefork")
systemstack(beforefork)
}
@@ -2814,6 +2815,7 @@ func afterfork() {
//go:nosplit
func syscall_runtime_AfterFork() {
systemstack(afterfork)
+ startTheWorld()
}
// Allocate a new g, with a stack big enough for stacksize bytes.
Update on the go 1.4 test: it's been running for over 20 hours now, no failures.
Ok so I may have a lead on this.
Given go 1.4 appeared to be unaffected I tried go 1.5, which also appears to be unaffected; however switching to go 1.6 it crashed within seconds.
After some bisecting I found that reverting https://github.com/golang/go/commit/a7cad52e04dd1890420452b38563997930edb2ca from go 1.6 made it stable again.
Digging some more, I see that after execve sigaltstack is returning the details of a signal stack set before the execve, which should never happen.
Ok reproduced it in C:
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
extern char **environ;
static void
die(const char *s)
{
perror(s);
exit(EXIT_FAILURE);
}
static void
printsigstack(const char *prefix)
{
stack_t ss;
if (sigaltstack(NULL, &ss) < 0)
die("sigaltstack get");
printf("%s - ss_sp=%p ss_size=%lu ss_flags=0x%x\n", prefix, ss.ss_sp,
(unsigned long)ss.ss_size, (unsigned int)ss.ss_flags);
}
static void
setstack(void *arg)
{
stack_t ss;
ss.ss_sp = malloc(SIGSTKSZ);
if (ss.ss_sp == NULL)
die("malloc");
ss.ss_size = SIGSTKSZ;
ss.ss_flags = 0;
if (sigaltstack(&ss, NULL) < 0)
die("sigaltstack set");
}
static void *
thread(void *arg)
{
char *argv[] = { "./test-sigs", "no-more", 0 };
printsigstack("thread - before setstack");
setstack(arg);
printsigstack("thread - after setstack + before execve");
execve(argv[0], &argv[0], environ);
die("exec failed");
return NULL;
}
int
main(int argc, char** argv)
{
int i, j;
int alloc_size = 32 * 1024;
pthread_t tid;
char buf[1024];
void *arg;
printf("argc: %d, %s\n", argc, argv[0]);
printsigstack("main");
if (argc != 1)
return 0;
j = pthread_create(&tid, NULL, thread, NULL);
if (j != 0) {
errno = j;
die("pthread_create");
}
j = pthread_join(tid, NULL);
if (j != 0) {
errno = j;
die("pthread_join");
}
return 0;
}
Produces:
argc: 1, ./test-sigs
main - ss_sp=0x0 ss_size=0 ss_flags=0x4
thread - before setstack - ss_sp=0x0 ss_size=0 ss_flags=0x4
thread - after setstack + before execve - ss_sp=0x8014106c0 ss_size=34816 ss_flags=0x0
argc: 2, ./test-sigs
main - ss_sp=0x8014106c0 ss_size=34816 ss_flags=0x0
As you can see, the second main output, which is straight after the execve, already has a signal stack set, which it shouldn't have.
This is corroborated by execve(2):
The signal stack is reset to be undefined (see sigaction(2) for more information).
So now to go kernel bug hunting....
Just to be clear: if the execve is done from the main thread, even if it has an alternative signal stack set, it's not present after the execve. So doing the execve from another thread is key to reproducing this, which may explain why it's not been identified in the past.
Thanks to @kostikbel who knew exactly where to look for this, here's the kernel patch for people to test:
Index: sys/kern/kern_sig.c
===================================================================
--- sys/kern/kern_sig.c (revision 315303)
+++ sys/kern/kern_sig.c (working copy)
@@ -976,7 +976,6 @@ execsigs(struct proc *p)
* and are now ignored by default).
*/
PROC_LOCK_ASSERT(p, MA_OWNED);
- td = FIRST_THREAD_IN_PROC(p);
ps = p->p_sigacts;
mtx_lock(&ps->ps_mtx);
while (SIGNOTEMPTY(ps->ps_sigcatch)) {
@@ -1007,6 +1006,8 @@ execsigs(struct proc *p)
* Reset stack state to the user stack.
* Clear set of signals caught on the signal stack.
*/
+ td = curthread;
+ MPASS(td->td_proc == p);
td->td_sigstk.ss_flags = SS_DISABLE;
td->td_sigstk.ss_size = 0;
td->td_sigstk.ss_sp = 0;
My test here, which usually crashes in a few seconds, has been running for 20 minutes now with no issues, so I'm hopeful this is the fix.
@stevenh Awesome work! Thanks!
Wow, incredible work @stevenh.
Looks like that patch was committed this morning: https://github.com/freebsd/freebsd/commit/0350bb5e92a7d5e3867a00def3b358d1983a4b2a
Does it make sense to workaround this bug in go until the upstream fix propagates out? (By reverting the os1_freebsd.go changes in a7cad52e04dd1890420452b38563997930edb2ca).
A huge thank you to @stevenh and everyone involved in tracking down this issue!
The FreeBSD change will make it into the stable/10 and stable/11 branches next week (by end of Mar 2017), but it will take more time to appear in new minor or errata releases; I suspect that it will be around August when we can count on fixed FreeBSD releases being widely available.
Is it possible to perform a run-time check, with the change in a7cad52 applying only to known fixed versions?
(Comment updated to clarify that it's the FreeBSD change that's making it into branches next week, and to provide a non-relative "next week".)
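For what it's worth, a rough sketch of what such a run-time check could look like, reading kern.osreldate via the syscall package; the cutoff constant below is a placeholder, not the real fixed revision (and, as the next comment notes, the eventual workaround avoided a version check entirely):
package main
import (
	"fmt"
	"syscall"
)
// fixedOSRelDate is a hypothetical placeholder for the first kern.osreldate
// value containing the kernel fix; it is not the actual number.
const fixedOSRelDate = 1100513
func kernelHasSigaltstackFix() (bool, error) {
	v, err := syscall.SysctlUint32("kern.osreldate")
	if err != nil {
		return false, err
	}
	return v >= fixedOSRelDate, nil
}
func main() {
	ok, err := kernelHasSigaltstackFix()
	fmt.Println("kernel fixed:", ok, "err:", err)
}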
Note that Aug is when Go 1.9 will be released: https://github.com/golang/go/wiki/Go-Release-Cycle
I think we can work around the problem without requiring a version check. Can someone with an affected system please test whether https://golang.org/cl/38325 fixes the problem? Thanks.
CL https://golang.org/cl/38325 mentions this issue.
I think we can work around the problem without requiring a version check. Can someone with an affected system please test whether https://golang.org/cl/38325 fixes the problem? Thanks.
Confirmed that patch fixes the crash on FreeBSD 11.0-RELEASE-p8
Steven can we get this added to the green ad test suite?
After speaking to @kostikbel this afternoon we may yet still have an outstanding issue.
It's possible that the repro case identified by @tmm1 was independent of the original issue.
I'm rebuilding the original fork only test case and retesting.
green ad test suite?
sorry, phone typos. The freebsd test suite that we maintain. This seems
like a good set of test cases to poke.
-adrian
It's possible that the repro case identified by @tmm1 was independent of the original issue.
😲 If that is the case, I apologize profusely for hijacking this thread.
As I noted to @stevenh, I'm definitely able to reproduce the original memory corruption issue on the patched system. Also, somebody mentioned that the test code was stripped to only use fork() and still caused the issue. In other words, an issue is still there.
@kostikbel could you share the exact setup of the test case that failed for you, as the three fork-only tests I've been running here haven't failed yet after over 10 hours of testing.
For the record, the three tests I've been running are:
Test 1 - No channels (with ktrace)
ktrace -tfw+ -i -f ktrace-test.out ./test
Test 2 - No channels (without ktrace)
./test2
package main
import (
"log"
"net/http"
_ "net/http/pprof"
"runtime"
"syscall"
)
var (
forkRoutines = 24
)
func run() {
for {
if err := syscall.ForkOnlyBSDTest(); err != nil {
log.Fatal(err)
}
}
}
func main() {
log.Printf("Starting %v forking goroutines...\n", forkRoutines)
log.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
for i := 0; i < forkRoutines; i++ {
go run()
}
log.Println(http.ListenAndServe(":8081", nil))
}
Test 3 - With Channels (without ktrace)
./test3
package main
import (
"log"
"syscall"
)
func run(done chan struct{}) {
if err := syscall.ForkOnlyBSDTest(); err != nil {
log.Fatal(err)
}
done <- struct{}{}
return
}
func main() {
log.Println("Starting a bunch of goroutines...")
done := make(chan struct{}, 24)
for i := 0; i < 24; i++ {
go run(done)
}
for {
select {
case <-done:
go run(done)
}
}
}
Fork Only:
package syscall
func ForkOnlyBSDTest() (err error) {
var r1 uintptr
var pid int
var err1 Errno
var wstatus WaitStatus
ForkLock.Lock()
runtime_BeforeFork()
r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
if err1 != 0 {
runtime_AfterFork()
ForkLock.Unlock()
return err1
}
if r1 == 0 {
for {
RawSyscall(SYS_EXIT, 253, 0, 0)
}
}
runtime_AfterFork()
ForkLock.Unlock()
pid = int(r1)
_, err = Wait4(pid, &wstatus, 0, nil)
for err == EINTR {
_, err = Wait4(pid, &wstatus, 0, nil)
}
return
}
All test have the following env:
export GOGC=2
export GOTRACEBACK=crash
export GOMAXPROCS=24
Hardware:
Dec 6 18:08:23 head kernel: CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz (2500.05-MHz K8-class CPU)
Dec 6 18:08:23 head kernel: Origin="GenuineIntel" Id=0x206d7 Family=0x6 Model=0x2d Stepping=7
Dec 6 18:08:23 head kernel: Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Dec 6 18:08:23 head kernel: Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
Dec 6 18:08:23 head kernel: AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
Dec 6 18:08:23 head kernel: AMD Features2=0x1<LAHF>
Dec 6 18:08:23 head kernel: XSAVE Features=0x1<XSAVEOPT>
Dec 6 18:08:23 head kernel: VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
Dec 6 18:08:23 head kernel: TSC: P-state invariant, performance statistics
Dec 6 18:08:23 head kernel: real memory = 137438953472 (131072 MB)
Dec 6 18:08:23 head kernel: avail memory = 133429284864 (127248 MB)
OS:
FreeBSD 12.0-CURRENT #39 r315303M:
Go:
go version go1.8 freebsd/amd64
I use some binary, which was compiled from the source provided in the original message of this bug report. I am not sure which go version was used, but not older than 1/2 year, judging by dates.
I've not been able to repro a crash yet, but I have had the "100% of a core and everything sitting in locks at 0%" issue even with the patch, though only after ~24 hours.
I'm now retesting with the stop-the-world while holding the fork lock, to see if that helps narrow down whether this is a kernel or go runtime issue.
If anyone has any quicker reproduction cases I'd like to hear them, as 24 hours between tests is going to hamper investigation :(
For reference, the following will detect if your kernel has the signal stack issue.
It's different from the original, which would only detect the problem on FreeBSD 11.0+ due to the fact that the order of thread inserts has changed; to detect it in all cases, three threads are needed.
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
extern char **environ;
static void
die(const char *s)
{
perror(s);
exit(EXIT_FAILURE);
}
static void
setstack(void *arg)
{
stack_t ss;
ss.ss_sp = malloc(SIGSTKSZ);
if (ss.ss_sp == NULL)
die("malloc");
ss.ss_size = SIGSTKSZ;
ss.ss_flags = 0;
if (sigaltstack(&ss, NULL) < 0)
die("sigaltstack set");
}
static void *
thread_exec(void *arg)
{
struct timespec ts = {0, 1000};
char *argv[] = {"./test-sigs", "no-more", 0};
setstack(arg);
nanosleep(&ts, NULL);
execve(argv[0], &argv[0], environ);
die("exec failed");
return NULL;
}
static void *
thread_sleep(void *arg)
{
sleep(10);
return NULL;
}
int
main(int argc, char** argv)
{
int i, j;
int alloc_size = 32 * 1024;
pthread_t tid1, tid2;
char buf[1024];
void *arg;
if (argc != 1) {
stack_t ss;
if (sigaltstack(NULL, &ss) < 0)
die("sigaltstack get");
if (ss.ss_sp != NULL || ss.ss_flags != SS_DISABLE ||
ss.ss_size != 0) {
fprintf(stderr, "invalid signal stack after execve: "
"ss_sp=%p ss_size=%lu ss_flags=0x%x\n", ss.ss_sp,
(unsigned long)ss.ss_size,
(unsigned int)ss.ss_flags);
return 1;
}
printf("valid signal stack is valid after execve\n");
return 0;
}
// We have to use two threads to ensure that we can detect the
// issue when new threads are added to the head (pre 269095)
// and the tail of the process thread list.
j = pthread_create(&tid1, NULL, thread_exec, NULL);
if (j != 0) {
errno = j;
die("pthread_create");
}
j = pthread_create(&tid2, NULL, thread_sleep, NULL);
if (j != 0) {
errno = j;
die("pthread_create");
}
j = pthread_join(tid1, NULL);
if (j != 0) {
errno = j;
die("pthread_join");
}
j = pthread_join(tid2, NULL);
if (j != 0) {
errno = j;
}
return 0;
}
@stevenh I've quite a few FreeBSD machines in various configurations (virtualized, hardware, jailed, 11.0, 12-CURRENT and so on, lots of CPU's, lots of RAM) and I would love to help testing your fixes - let me know if there's any way I could help.
I've been able to reproduce the same crashes as before using the same methods with Ian's fixes added to the tree. Of three attempts, it wedged twice, panic'd once.
This strongly supports the case that the original issue stands.
Here's a log.
Here's a reference to the branch used to build the executable that produced the log.
Yes indeed @derekmarcotte, the fix added by Ian was to address the issue raised by @tmm1, which unfortunately was a separate issue from the original.
Continuing to debug this, the evidence is building up that the issue is related to a running child during a GC pass.
Using the following code I'm reliably able to trigger a failure in ~10mins on FreeBSD 12.0 r315303M (I don't believe the FreeBSD version is important).
syscall/exec_bsd_debug.go
package syscall

// Declared here without a body; the implementation is provided by the
// runtime via go:linkname (see runtime/exec_debug.go below).
func runtime_GC()

// forks counts successful forks for the debug prints below.
var forks int

func ForkOnlyBSDTest() (err error) {
var r1 uintptr
var pid int
var err1 Errno
var wstatus WaitStatus
ForkLock.Lock()
print("forking: ", forks+1, "\n")
runtime_BeforeFork()
r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
if err1 != 0 {
runtime_AfterFork()
ForkLock.Unlock()
return err1
}
if r1 == 0 {
for {
RawSyscall(SYS_EXIT, 253, 0, 0)
}
}
forks++
runtime_AfterFork()
print("forked: ", forks, "\n")
runtime_GC()
ForkLock.Unlock()
pid = int(r1)
_, err = Wait4(pid, &wstatus, 0, nil)
for err == EINTR {
print("wait4: EINTR\n")
_, err = Wait4(pid, &wstatus, 0, nil)
}
return err
}
runtime/exec_debug.go
package runtime
//go:linkname syscall_runtime_GC syscall.runtime_GC
//go:nosplit
func syscall_runtime_GC() {
GC()
}
main.go
package main
import (
"log"
"runtime"
"syscall"
)
var (
forkRoutines = 1
gc = true
)
func run() {
for {
if err := syscall.ForkOnlyBSDTest(); err != nil {
log.Fatal(err)
}
}
}
func main() {
log.Printf("Starting %v forking goroutines...\n", forkRoutines)
log.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0), "version:", runtime.Version()
for i := 0; i < forkRoutines; i++ {
go run()
}
done := make(chan struct{})
<-done
}
env settings
export GOGC=off
export GOTRACEBACK=crash
export GOMAXPROCS=16
I've discovered a few interesting facts:
I believe 1 rules out signal delivery during the GC pass as a potential culprit.
So the issue appears to be triggered by having a __running__ child present during a GC pass.
The crash typically looks like:
fatal error: workbuf is not empty
runtime stack:
runtime.throw(0x7d62ad, 0x14)
/usr/local/go/src/runtime/panic.go:596 +0x95 fp=0x7fffde9f2b58 sp=0x7fffde9f2b38
runtime.(*workbuf).checkempty(0x801166800)
/usr/local/go/src/runtime/mgcwork.go:320 +0x4e fp=0x7fffde9f2b78 sp=0x7fffde9f2b58
runtime.getempty(0x7fffde9f2bd0)
/usr/local/go/src/runtime/mgcwork.go:332 +0x96 fp=0x7fffde9f2bb0 sp=0x7fffde9f2b78
runtime.handoff(0x801166800, 0xc41fffa1b2)
/usr/local/go/src/runtime/mgcwork.go:432 +0x26 fp=0x7fffde9f2be8 sp=0x7fffde9f2bb0
runtime.(*gcWork).balance(0xc42001f228)
/usr/local/go/src/runtime/mgcwork.go:275 +0x4e fp=0x7fffde9f2c08 sp=0x7fffde9f2be8
runtime.gcDrain(0xc42001f228, 0x0)
/usr/local/go/src/runtime/mgcmark.go:1066 +0x36c fp=0x7fffde9f2c40 sp=0x7fffde9f2c08
runtime.gchelper()
/usr/local/go/src/runtime/mgc.go:1910 +0x94 fp=0x7fffde9f2c70 sp=0x7fffde9f2c40
runtime.stopm()
/usr/local/go/src/runtime/proc.go:1653 +0xdd fp=0x7fffde9f2c98 sp=0x7fffde9f2c70
runtime.findrunnable(0xc420018000, 0x0)
/usr/local/go/src/runtime/proc.go:2102 +0x2e4 fp=0x7fffde9f2d30 sp=0x7fffde9f2c98
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2222 +0x14c fp=0x7fffde9f2d70 sp=0x7fffde9f2d30
runtime.goschedImpl(0xc42014a000)
/usr/local/go/src/runtime/proc.go:2300 +0x106 fp=0x7fffde9f2d90 sp=0x7fffde9f2d70
runtime.gopreempt_m(0xc42014a000)
/usr/local/go/src/runtime/proc.go:2315 +0x36 fp=0x7fffde9f2da8 sp=0x7fffde9f2d90
runtime.newstack(0x0)
/usr/local/go/src/runtime/stack.go:1081 +0x2f0 fp=0x7fffde9f2f20 sp=0x7fffde9f2da8
runtime.morestack()
/usr/local/go/src/runtime/asm_amd64.s:398 +0x86 fp=0x7fffde9f2f30 sp=0x7fffde9f2f20
I also needed the following patch to ensure I got a core even with GOTRACEBACK=crash, as it seems that in this case, where m.traceback is set to 1, it prevents the user preference taking effect, so it was impossible to trigger a crash on panic.
--- ./runtime/runtime1.go.orig 2017-03-21 16:27:26.000000000 +0000
+++ ./runtime/runtime1.go 2017-03-21 16:27:26.000000000 +0000
@@ -36,14 +36,16 @@ var traceback_env uint32
func gotraceback() (level int32, all, crash bool) {
_g_ := getg()
all = _g_.m.throwing > 0
- if _g_.m.traceback != 0 {
- level = int32(_g_.m.traceback)
- return
- }
t := atomic.Load(&traceback_cache)
crash = t&tracebackCrash != 0
all = all || t&tracebackAll != 0
- level = int32(t >> tracebackShift)
+ if _g_.m.traceback != 0 {
+ level = int32(_g_.m.traceback)
+ }
+
+ if int32(t>>tracebackShift) > level {
+ level = int32(t >> tracebackShift)
+ }
return
}
OK, very interesting data point: it appears that if I compile with CGO_ENABLED=0 then the problem goes away. This would explain why we've never seen this in production, as all of our binaries are compiled with CGO disabled.
Scratch that, after a few test runs it happened :(
@stevenh: FWIW - I'm still getting panics with GOGC=off, and no explicit calls to the GC (as in your example). I don't believe that garbage collection is implicated here, unless it runs sometimes despite GOGC. Last trawl through the runtime, I felt like GOGC=off kills the GC dead.
I've used a stripped down version of your example main.go, and my ForkOnlyBSDTest.
@derekmarcotte I don't think GC is the cause, however I do think forcing GC between the fork and wait produces enough meaningful work in the go runtime to detect the memory corruption.
Out of interest, could you post your main.go and ForkOnlyBSDTest? Also, forgot to ask: with GC disabled, what crash did you get?
@stevenh:
In terms of extra work, I think the /bin/true case is pretty quick to fail. I think it's helpful to eliminate as much as possible, to narrow down on the involved systems. The only things left AFAIK in play are the scheduler, stack expansion?, and the worker routines (e.g. fork/wait).
I get fatal error: acquirep: invalid p state; here's a recent example:
Mon Mar 27 06:47:55 EDT 2017
acquirep: p->m=4895104(0) p->status=1
fatal error: acquirep: invalid p state
... stack traces
Mon Mar 27 11:54:57 EDT 2017
So that one was about 5 hours.
I haven't been meticulous about logging the timing any more though. If it's helpful, I'm sure I could script up something like this to detect the wedge, and start recording more times? I don't feel that would help much at this point though.
Thanks @derekmarcotte that's why I've been using the force GC test recently as it blows up pretty reliably in between 1-10mins, which makes iterations much quicker.
@stevenh: I ended up scripting the thing, here's a sample log (with backtraces omitted):
=== Wed Apr 5 16:48:27 EDT 2017 - Starting new instance
=== Wed Apr 5 18:23:40 EDT 2017 - Killing wedged 42369
=== Wed Apr 5 18:23:45 EDT 2017 - Starting new instance
=== Thu Apr 6 02:45:19 EDT 2017 - Starting new instance
=== Thu Apr 6 12:52:40 EDT 2017 - Starting new instance
=== Thu Apr 6 18:24:49 EDT 2017 - Killing wedged 98836
=== Thu Apr 6 18:24:54 EDT 2017 - Starting new instance
=== Thu Apr 6 20:47:11 EDT 2017 - Starting new instance
=== Fri Apr 7 02:25:20 EDT 2017 - Killing wedged 53688
=== Fri Apr 7 02:25:25 EDT 2017 - Starting new instance
=== Fri Apr 7 02:49:28 EDT 2017 - Starting new instance
=== Fri Apr 7 03:16:05 EDT 2017 - Starting new instance
=== Fri Apr 7 06:42:34 EDT 2017 - Killing wedged 38753
=== Fri Apr 7 06:42:39 EDT 2017 - Starting new instance
=== Fri Apr 7 06:42:54 EDT 2017 - Killing wedged 24324
=== Fri Apr 7 06:42:59 EDT 2017 - Starting new instance
=== Fri Apr 7 06:43:04 EDT 2017 - Killing wedged 97972
=== Fri Apr 7 06:43:09 EDT 2017 - Starting new instance
=== Fri Apr 7 06:43:34 EDT 2017 - Killing wedged 23425
=== Fri Apr 7 06:43:39 EDT 2017 - Starting new instance
=== Fri Apr 7 06:44:29 EDT 2017 - Killing wedged 45400
=== Fri Apr 7 06:44:34 EDT 2017 - Starting new instance
=== Fri Apr 7 06:45:19 EDT 2017 - Killing wedged 86970
=== Fri Apr 7 06:45:24 EDT 2017 - Starting new instance
=== Fri Apr 7 06:45:39 EDT 2017 - Killing wedged 9083
=== Fri Apr 7 06:45:44 EDT 2017 - Starting new instance
=== Fri Apr 7 06:46:04 EDT 2017 - Killing wedged 78764
=== Fri Apr 7 06:46:09 EDT 2017 - Starting new instance
=== Fri Apr 7 06:46:14 EDT 2017 - Killing wedged 77503
=== Fri Apr 7 06:46:19 EDT 2017 - Starting new instance
=== Fri Apr 7 06:46:24 EDT 2017 - Killing wedged 2314
=== Fri Apr 7 06:46:29 EDT 2017 - Starting new instance
=== Fri Apr 7 06:46:59 EDT 2017 - Killing wedged 27186
=== Fri Apr 7 06:47:04 EDT 2017 - Starting new instance
=== Fri Apr 7 06:47:25 EDT 2017 - Killing wedged 73453
=== Fri Apr 7 06:47:30 EDT 2017 - Starting new instance
=== Fri Apr 7 06:47:35 EDT 2017 - Killing wedged 72424
=== Fri Apr 7 06:47:40 EDT 2017 - Starting new instance
=== Fri Apr 7 06:48:15 EDT 2017 - Killing wedged 96875
=== Fri Apr 7 06:48:20 EDT 2017 - Starting new instance
=== Fri Apr 7 06:49:15 EDT 2017 - Killing wedged 78286
=== Fri Apr 7 06:49:20 EDT 2017 - Starting new instance
=== Fri Apr 7 06:49:35 EDT 2017 - Killing wedged 57532
=== Fri Apr 7 06:49:40 EDT 2017 - Starting new instance
=== Fri Apr 7 06:50:40 EDT 2017 - Killing wedged 29647
=== Fri Apr 7 06:50:45 EDT 2017 - Starting new instance
=== Fri Apr 7 06:51:15 EDT 2017 - Killing wedged 41134
=== Fri Apr 7 06:51:20 EDT 2017 - Starting new instance
=== Fri Apr 7 06:51:40 EDT 2017 - Killing wedged 93192
=== Fri Apr 7 06:51:45 EDT 2017 - Starting new instance
=== Fri Apr 7 06:52:15 EDT 2017 - Killing wedged 86941
=== Fri Apr 7 06:52:20 EDT 2017 - Starting new instance
=== Fri Apr 7 06:52:35 EDT 2017 - Killing wedged 27716
=== Fri Apr 7 06:52:41 EDT 2017 - Starting new instance
=== Fri Apr 7 10:25:00 EDT 2017 - Killing wedged 6064
=== Fri Apr 7 10:25:05 EDT 2017 - Starting new instance
=== Fri Apr 7 16:12:26 EDT 2017 - Starting new instance
=== Fri Apr 7 18:25:48 EDT 2017 - Killing wedged 31596
=== Fri Apr 7 18:25:53 EDT 2017 - Starting new instance
=== Fri Apr 7 18:25:53 EDT 2017 - Killing wedged 53833
=== Fri Apr 7 18:25:58 EDT 2017 - Starting new instance
=== Fri Apr 7 19:03:51 EDT 2017 - Starting new instance
=== Sat Apr 8 04:17:57 EDT 2017 - Killing wedged 13586
=== Sat Apr 8 04:18:02 EDT 2017 - Starting new instance
=== Sat Apr 8 10:26:05 EDT 2017 - Killing wedged 44051
=== Sat Apr 8 10:26:10 EDT 2017 - Starting new instance
=== Sat Apr 8 18:25:48 EDT 2017 - Killing wedged 54803
=== Sat Apr 8 18:25:53 EDT 2017 - Starting new instance
=== Sat Apr 8 18:26:43 EDT 2017 - Killing wedged 6268
=== Sat Apr 8 18:26:48 EDT 2017 - Starting new instance
=== Sun Apr 9 00:12:15 EDT 2017 - Starting new instance
=== Sun Apr 9 01:02:04 EDT 2017 - Starting new instance
=== Sun Apr 9 02:26:56 EDT 2017 - Killing wedged 29694
=== Sun Apr 9 02:27:02 EDT 2017 - Starting new instance
=== Sun Apr 9 05:43:05 EDT 2017 - Starting new instance
=== Sun Apr 9 10:27:05 EDT 2017 - Killing wedged 90493
=== Sun Apr 9 10:27:10 EDT 2017 - Starting new instance
=== Sun Apr 9 18:27:27 EDT 2017 - Killing wedged 3961
=== Sun Apr 9 18:27:32 EDT 2017 - Starting new instance
=== Mon Apr 10 02:28:55 EDT 2017 - Killing wedged 57180
=== Mon Apr 10 02:29:00 EDT 2017 - Starting new instance
=== Mon Apr 10 02:29:00 EDT 2017 - Killing wedged 30622
=== Mon Apr 10 02:29:06 EDT 2017 - Starting new instance
=== Mon Apr 10 05:20:56 EDT 2017 - Starting new instance
=== Mon Apr 10 10:37:45 EDT 2017 - Killing wedged 39613
=== Mon Apr 10 10:37:50 EDT 2017 - Starting new instance
=== Mon Apr 10 11:30:20 EDT 2017 - Starting new instance
=== Mon Apr 10 18:28:54 EDT 2017 - Killing wedged 40438
=== Mon Apr 10 18:28:59 EDT 2017 - Starting new instance
=== Mon Apr 10 18:29:19 EDT 2017 - Killing wedged 97568
=== Mon Apr 10 18:29:24 EDT 2017 - Starting new instance
=== Mon Apr 10 22:43:26 EDT 2017 - Starting new instance
=== Tue Apr 11 00:32:26 EDT 2017 - Starting new instance
=== Tue Apr 11 01:53:53 EDT 2017 - Starting new instance
=== Tue Apr 11 02:29:11 EDT 2017 - Killing wedged 48237
=== Tue Apr 11 02:29:16 EDT 2017 - Starting new instance
=== Tue Apr 11 10:29:54 EDT 2017 - Killing wedged 28162
=== Tue Apr 11 10:29:59 EDT 2017 - Starting new instance
=== Tue Apr 11 13:44:17 EDT 2017 - Starting new instance
=== Tue Apr 11 13:46:22 EDT 2017 - Starting new instance
=== Tue Apr 11 16:25:46 EDT 2017 - Starting new instance
=== Tue Apr 11 18:30:02 EDT 2017 - Killing wedged 1645
=== Tue Apr 11 18:30:07 EDT 2017 - Starting new instance
=== Tue Apr 11 18:32:17 EDT 2017 - Killing wedged 25645
=== Tue Apr 11 18:32:22 EDT 2017 - Starting new instance
=== Tue Apr 11 20:44:39 EDT 2017 - Starting new instance
=== Tue Apr 11 20:44:59 EDT 2017 - Starting new instance
=== Wed Apr 12 02:30:15 EDT 2017 - Killing wedged 45351
If you see consecutive Starting new instance, this means that the first is a panic. So out of 64 runs, 44 wedged, and 20 panicked.
Hrm, on second thought, the wedged detection should be inverted, i.e. counting active threads, and killing when they dip below, say, 4. Having idle threads is no big deal. Sorry for the noise.
Any updates here?
I'm afraid I've not had any time recently to continue investigating this.
I do have a reliable test case which triggers the issue relatively quickly, but I've been unable to make any progress due to lack of time.
The last notable status I have is from March. As far as I know, this is still an issue. I'm at the limits of my understanding/ability and have directed my efforts to places where I can be more helpful. If someone wanted to point me at an instance, I would be happy to collaborate, and get it set-up and repeatable?
Confirming this is still an issue on 10.3-RELEASE, against master (a9d8d4d):
acquirep: p->m=842351116288(6) p->status=1
fatal error: acquirep: invalid p state
Wouldn't using vfork or posix_spawn as suggested in #5838 be a workaround to this issue?
The kernel code underneath fork is the same as vfork, so unlikely.
Do you know what the issue is? Do you know that the issue is in the FreeBSD kernel and not in the go runtime?
Also, even if the issue is apparently in the kernel, fork vs. vfork difference is exactly where VM is involved. So, although not definitive, the experiment might give some hints where to look next.
No, we still don't know if it's kernel or runtime. I've tried lots of kernel options on the raw kernel forks but none made any difference.
I've tried replicating with c code and have been unable to do so.
Even with the best reproduction case I have it always takes quite a number of attempts.
Given this it seems it requires complex threading interaction during the actual fork kernel call.
Based on the fact that the same symptoms are reproducible under linux, does it make sense to no longer call this a FreeBSD issue?
Let's not merge them until we know it's the same underlying cause. Same symptoms can be misleading.
(Sorry, wrong button.)
@rsc: Thanks. Agreed that there may be other things going on here.
Having said that, the issue I'm referencing is having the same crashes (fatal error: workbuf is not empty, fatal error: acquirep: invalid p state), in the same places, under Linux running the same test as in FreeBSD. I suspect this should be enough to invalidate the change to FreeBSD-only, yet perhaps not enough to merge the two issues.
I haven't seen the fatal error: unexpected signal during runtime execution referenced there, for example, and I've seen a lot of crashes.
Are the folks experiencing this running on Skylake or Kaby Lake hardware? If so, this might be relevant: https://lists.debian.org/debian-devel/2017/06/msg00308.html . Please try updating your CPU microcode to the latest release.
Nope Nehalem EP here.
I guess we still don't know what is happening. Punting to 1.10.
I agree @derekmarcotte. I think it doesn't make sense to state something is FreeBSD related when something very similar happens on at least one other OS too.
If this gets changed it should be mentioned in the release notes as well. However there it might be enough to simply mention Linux and reference the corresponding issue.
Might this change to FreeBSD be related? It doesn't seem likely to me, but others here know more about how go works than I do.
It looks like go only uses LDT on 386 and not amd64 so I suspect this wouldn't have any impact @dankm
Just wanted to confirm this is still an issue with 1.9.1 which includes the signal handling fixes:
2017/10/10 22:57:08 Starting 8 forking goroutines...
2017/10/10 22:57:08 GOMAXPROCS: 24 version: go1.9.1
acquirep: p->m=842351920128(9) p->status=1
fatal error: acquirep: invalid p state
runtime stack:
runtime.throw(0x7a8101, 0x19)
/usr/local/go/src/runtime/panic.go:605 +0x95
runtime.acquirep1(0xc42001ec00)
/usr/local/go/src/runtime/proc.go:3689 +0x16d
runtime.acquirep(0xc42001ec00)
/usr/local/go/src/runtime/proc.go:3661 +0x2b
runtime.findrunnable(0xc420021300, 0x0)
/usr/local/go/src/runtime/proc.go:2068 +0x74f
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2245 +0x12c
runtime.park_m(0xc420152900)
/usr/local/go/src/runtime/proc.go:2308 +0xb6
runtime.mcall(0x800a0b040)
/usr/local/go/src/runtime/asm_amd64.s:286 +0x5b
...
1 runs completed, 1 failures, over 6h47m54s
I was running the original test case posted above on DragonFlyBSD out of curiosity mostly and although I don't see any crashes/corruption in the short time I let it run, on 5.0.2 after running the testcase I consistently get this in syslog -
td 0xffffff81dd9a6900 test: timeout race
td 0xffffff81e01e7400 test: timeout race
This is coming from sys/kern/kern_synch.c:716 - which is tsleep() -
/*
* Cleanup the timeout. If the timeout has already occured thandle
* has already been stopped, otherwise stop thandle. If the timeout
* is running (the callout thread must be blocked trying to get
* lwp_token) then wait for us to get scheduled.
*/
if (timo) {
while (td->td_flags & TDF_TIMEOUT_RUNNING) {
/* else we won't get rescheduled! */
if (lp->lwp_stat != LSSTOP)
lp->lwp_stat = LSSLEEP;
lwkt_deschedule_self(td);
td->td_wmesg = "tsrace";
lwkt_switch();
kprintf("td %p %s: timeout race\n", td, td->td_comm);
}
I am not sure this is even relevant - I haven't looked deeper, but thought I would throw this out just in case.
I haven't read the details of this issue, but even so, it could be fixed by this recent change in the FreeBSD VM system:
https://svnweb.freebsd.org/base?view=revision&revision=329254
If you want to test that change, you'll also need this one:
https://svnweb.freebsd.org/base?view=revision&revision=329252
It's not clear if these two kernel fixes will help fix the issues identified here but they are definitely in the right area, particularly the second one.
Do not call pmap_enter() with invalid protection mode.
Ensure memory consistency on COW
I'm trying to confirm either way if the above do have an impact, tests in progress. I'll report back when I have more, but if anyone else has a good reproduction environment which they can apply the above patches and test that would be great.
Hehe, I guess I should have refreshed my page before replying, thanks @vangyzen :)
Having applied the two patches I've unfortunately managed to confirm it hasn't solved the issue, as I had a crash:
acquirep: p->m=5491744(0) p->status=1
fatal error: acquirep: invalid p state
@stevenh could you pull down https://svnweb.freebsd.org/base?view=revision&revision=335171 and try again? It seems like this is a different manifestation of the bug described in https://reviews.freebsd.org/D15293
Just seen that and thought that it could be related.
I've asked @kostikbel for clarification about how that may exhibit, with the thought that it may be related to this.
If the patch is compatible with 11.x I'll grab it and run some more tests.
From the description I got from alc of the way the issue this fix addresses exhibits, it sounds like it's not our cause, but I'm running a test nonetheless.
The way this bug exhibits, it sounds like having some form of system-wide reverse / time-travel debugger running would be handy, e.g. the bug exhibits, so hit "stop" then track it backwards.
Not sure if such a thing exists for FreeBSD though. :angel:
No, go still crashes:
```
==== NEW RUN Fri Feb 16 17:34:25 GMT 2018 ====
Starting 16 forking goroutines...
GOMAXPROCS: 24
acquirep: p->m=5491744(0) p->status=1
fatal error: acquirep: invalid p state
runtime stack:
runtime.throw(0x4c6fe1, 0x19)
/usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc420077e00 sp=0xc420077de0 pc=0x426fc5
runtime.acquirep1(0xc42002a600)
/usr/local/go/src/runtime/proc.go:3699 +0x16d fp=0xc420077e38 sp=0xc420077e00 pc=0x43205d
runtime.acquirep(0xc42002a600)
/usr/local/go/src/runtime/proc.go:3671 +0x2b fp=0xc420077e50 sp=0xc420077e38 pc=0x431eab
runtime.stopm()
/usr/local/go/src/runtime/proc.go:1689 +0x124 fp=0xc420077e78 sp=0xc420077e50 pc=0x42c924
runtime.findrunnable(0xc420020c00, 0x0)
/usr/local/go/src/runtime/proc.go:2135 +0x4d2 fp=0xc420077f10 sp=0xc420077e78 pc=0x42dac2
runtime.schedule()
/usr/local/go/src/runtime/proc.go:2255 +0x12c fp=0xc420077f58 sp=0xc420077f10 pc=0x42e57c
runtime.goexit0(0xc42013bc80)
/usr/local/go/src/runtime/proc.go:2406 +0x236 fp=0xc420077fb0 sp=0xc420077f58 pc=0x42efd6
runtime.mcall(0x0)
/usr/local/go/src/runtime/asm_amd64.s:286 +0x5b fp=0xc420077fc0 sp=0xc420077fb0 pc=0x44dfab
```
@justinclift Wouldn't DTrace be enough to investigate this problem?
Not sure how, as this appears to be random memory corruption of the forking process. I've tried running it through truss before but it adds enough delay that it doesn't reproduce.
I can see if tracing the fork syscall with dtrace shows anything
This issue is a sprawling 2 year epic adventure. The fact that multiple bugs with different reproducers have been subsumed under one issue is unhelpful. Can someone write the Cliff's Notes including the current fastest reproducer? Thanks in advance.
Hi Matt, my current reproduction case is:
package main
import (
"fmt"
"log"
"runtime"
"syscall"
)
var (
forkRoutines = 16
)
func run(done chan struct{}) {
if err := syscall.ForkOnlyBSDTest(); err != nil {
log.Fatal(err)
}
done <- struct{}{}
}
func main() {
fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
done := make(chan struct{}, forkRoutines*2)
for i := 0; i < forkRoutines; i++ {
go run(done)
}
for range done {
go run(done)
}
}
Then this in the lib:
// +build darwin dragonfly freebsd netbsd openbsd
package syscall
func ForkOnlyBSDTest() (err error) {
var r1 uintptr
var pid int
var err1 Errno
var wstatus WaitStatus
ForkLock.Lock()
runtime_BeforeFork()
r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
if err1 != 0 {
runtime_AfterFork()
ForkLock.Unlock()
return err1
}
if r1 == 0 {
// in child, die die die my darling
for {
RawSyscall(SYS_EXIT, 253, 0, 0)
}
}
runtime_AfterFork()
ForkLock.Unlock()
pid = int(r1)
// Prime directive, exterminate
// Whatever stands left
_, err = Wait4(pid, &wstatus, 0, nil)
for err == EINTR {
_, err = Wait4(pid, &wstatus, 0, nil)
}
return
}
It's not quick and not guaranteed.
There is also a thread on the freebsd ports mailing list, "lang/go failes to build with poudriere, since 2018-04-05", which has a related reproduction case running the lang/go port build.
@stevenh is there any correlation with the amount of memory in the system or the number of cores? For example, would lowering hw.physmem on a 48-way help?
I’ve not seen one
I've been running the usual tests on 11.2-RELEASE-p3 with go1.11.1 and have not managed to reproduce after a weeks run.
Has anyone else seen a crash with these updated versions?
Ok, still no issues triggered on my test box, and it's been running forks in a loop since the 18th and 20th of November:
root 50215 273.0 0.0 104384 4988 9 S+J 20Nov18 134201:59.00 ./test
root 46993 259.4 0.0 104640 3204 8 S+J 18Nov18 151748:35.09 ./test2
Given this I'm tempted to say that the issue has been fixed either in FreeBSD or golang runtime.
Thanks very much for following up on this. I'm going to close the issue in the hope that it is indeed fixed. Anyone, please comment if you find future failures.
While updating our test box to 12.0 from 11.2 I noticed that this latest test was running with a potentially relevant kernel patch on top of 11.2-RELEASE.
If anyone can reproduce this issue on 11.2 then they should try applying the following to see if was indeed the fix:
https://svnweb.freebsd.org/base?view=revision&revision=335171
This is already included in 12.0-RELEASE and was MFC'ed to 11-stable as of:
https://svnweb.freebsd.org/base?view=revision&revision=335508