This is probably a duplicate of some other recent runtime issues, but it happens to be very easily reproduced using Go 1.10rc1. I'm sure it can be minimized further, but I am almost out of dev time for several days, so I wanted to get it up here ASAP in case it was helpful for @aclements.
Repro: Grab https://gist.github.com/josharian/cec3dbe91b2cffaf7419f53757cce3d3 and run it from inside any git repo. (See https://github.com/libgit2/git2go if there are questions about how to install git2go.) Pasted code:
package main
import (
"os"
git "gopkg.in/libgit2/git2go.v26"
)
func main() {
pwd, err := os.Getwd()
check(err)
repo, err := git.OpenRepositoryExtended(pwd, 0, "")
check(err)
rw, err := repo.Walk()
check(err)
rw.Push(nil)
}
func check(err error) {
if err != nil {
panic(err)
}
}
The resulting crash is obviously programmer error (rw.Push expects a non-nil parameter), but the "unexpected return pc" part shouldn't be present.
Result:
$ go run x.go
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x42cafcd]
runtime stack:
runtime: unexpected return pc for runtime.sigpanic called from 0x42cafcd
stack: frame={sp:0x7ffeefbff510, fp:0x7ffeefbff560} stack=[0x7ffeefb80350,0x7ffeefbff7d0)
00007ffeefbff410: 00007ffeefbff480 000000000402aeec <runtime.dopanic_m+540>
00007ffeefbff420: 000000000402a991 <runtime.throw+129> 00007ffeefbff4f0
00007ffeefbff430: 0000000000000000 0000000004214f80
00007ffeefbff440: 010000000402b448 0000000000000004
00007ffeefbff450: 000000000000001f 00000000042cafcd
00007ffeefbff460: 0000000000000000 0000000000000001
00007ffeefbff470: 000000000415e881 000000c420000180
00007ffeefbff480: 00007ffeefbff4a8 000000000404f7bc <runtime.dopanic.func1+60>
00007ffeefbff490: 0000000004214f80 000000000402a991 <runtime.throw+129>
00007ffeefbff4a0: 00007ffeefbff4f0 00007ffeefbff4e0
00007ffeefbff4b0: 000000000402a8ba <runtime.dopanic+74> 00007ffeefbff4c0
00007ffeefbff4c0: 000000000404f780 <runtime.dopanic.func1+0> 0000000004214f80
00007ffeefbff4d0: 000000000402a991 <runtime.throw+129> 00007ffeefbff4f0
00007ffeefbff4e0: 00007ffeefbff500 000000000402a991 <runtime.throw+129>
00007ffeefbff4f0: 0000000000000000 000000000000002a
00007ffeefbff500: 00007ffeefbff550 000000000403d39e <runtime.sigpanic+654>
00007ffeefbff510: <0000000004160a85 000000000000002a
00007ffeefbff520: 0000000000000000 0000000000000000
00007ffeefbff530: 0000000000000000 0000000004214f80
00007ffeefbff540: 0000000000000000 0000000000000000
00007ffeefbff550: 00007ffeefbff560 !00000000042cafcd
00007ffeefbff560: >00007ffeefbff600 00000000042c7287
00007ffeefbff570: 0000000000000000 0000000000000000
00007ffeefbff580: 0000000000000000 0000000000000000
00007ffeefbff590: 0000000000000000 0000000000000000
00007ffeefbff5a0: 0000000000000000 0000000004611750
00007ffeefbff5b0: 0000000000000000 0000020100000280
00007ffeefbff5c0: 0000000000000000 0000000004611758
00007ffeefbff5d0: 909597a7fa2e00c2 00000000046369a0
00007ffeefbff5e0: 0000000004611730 0000000000000000
00007ffeefbff5f0: 0000000000000000 00007ffeefbff658
00007ffeefbff600: 00007ffeefbff630 00000000042c71fb
00007ffeefbff610: 0000000000000000 0000000000000028
00007ffeefbff620: 0000000000000000 00007ffeefbff6d8
00007ffeefbff630: 00007ffeefbff6b0 00000000042c568e
00007ffeefbff640: 00000000ffffffff 00000000046369a0
00007ffeefbff650: fffffffeefbff6a0 0000000000000000
runtime.throw(0x4160a85, 0x2a)
/Users/josh/go/1.10/src/runtime/panic.go:619 +0x81
runtime: unexpected return pc for runtime.sigpanic called from 0x42cafcd
stack: frame={sp:0x7ffeefbff510, fp:0x7ffeefbff560} stack=[0x7ffeefb80350,0x7ffeefbff7d0)
00007ffeefbff410: 00007ffeefbff480 000000000402aeec <runtime.dopanic_m+540>
00007ffeefbff420: 000000000402a991 <runtime.throw+129> 00007ffeefbff4f0
00007ffeefbff430: 0000000000000000 0000000004214f80
00007ffeefbff440: 010000000402b448 0000000000000004
00007ffeefbff450: 000000000000001f 00000000042cafcd
00007ffeefbff460: 0000000000000000 0000000000000001
00007ffeefbff470: 000000000415e881 000000c420000180
00007ffeefbff480: 00007ffeefbff4a8 000000000404f7bc <runtime.dopanic.func1+60>
00007ffeefbff490: 0000000004214f80 000000000402a991 <runtime.throw+129>
00007ffeefbff4a0: 00007ffeefbff4f0 00007ffeefbff4e0
00007ffeefbff4b0: 000000000402a8ba <runtime.dopanic+74> 00007ffeefbff4c0
00007ffeefbff4c0: 000000000404f780 <runtime.dopanic.func1+0> 0000000004214f80
00007ffeefbff4d0: 000000000402a991 <runtime.throw+129> 00007ffeefbff4f0
00007ffeefbff4e0: 00007ffeefbff500 000000000402a991 <runtime.throw+129>
00007ffeefbff4f0: 0000000000000000 000000000000002a
00007ffeefbff500: 00007ffeefbff550 000000000403d39e <runtime.sigpanic+654>
00007ffeefbff510: <0000000004160a85 000000000000002a
00007ffeefbff520: 0000000000000000 0000000000000000
00007ffeefbff530: 0000000000000000 0000000004214f80
00007ffeefbff540: 0000000000000000 0000000000000000
00007ffeefbff550: 00007ffeefbff560 !00000000042cafcd
00007ffeefbff560: >00007ffeefbff600 00000000042c7287
00007ffeefbff570: 0000000000000000 0000000000000000
00007ffeefbff580: 0000000000000000 0000000000000000
00007ffeefbff590: 0000000000000000 0000000000000000
00007ffeefbff5a0: 0000000000000000 0000000004611750
00007ffeefbff5b0: 0000000000000000 0000020100000280
00007ffeefbff5c0: 0000000000000000 0000000004611758
00007ffeefbff5d0: 909597a7fa2e00c2 00000000046369a0
00007ffeefbff5e0: 0000000004611730 0000000000000000
00007ffeefbff5f0: 0000000000000000 00007ffeefbff658
00007ffeefbff600: 00007ffeefbff630 00000000042c71fb
00007ffeefbff610: 0000000000000000 0000000000000028
00007ffeefbff620: 0000000000000000 00007ffeefbff6d8
00007ffeefbff630: 00007ffeefbff6b0 00000000042c568e
00007ffeefbff640: 00000000ffffffff 00000000046369a0
00007ffeefbff650: fffffffeefbff6a0 0000000000000000
runtime.sigpanic()
/Users/josh/go/1.10/src/runtime/signal_unix.go:372 +0x28e
goroutine 1 [syscall, locked to thread]:
runtime.cgocall(0x410adb0, 0xc420067eb8, 0x4162d70)
/Users/josh/go/1.10/src/runtime/cgocall.go:128 +0x64 fp=0xc420067e88 sp=0xc420067e50 pc=0x4004b64
gopkg.in/libgit2/git2go%2ev26._Cfunc_git_revwalk_push(0x4636760, 0x0, 0x0)
_cgo_gotypes.go:6303 +0x4d fp=0xc420067eb8 sp=0xc420067e88 pc=0x40fcd3d
gopkg.in/libgit2/git2go%2ev26.(*RevWalk).Push.func1(0x4636760, 0x0, 0xc42000e030)
/Users/josh/src/gopkg.in/libgit2/git2go.v26/walk.go:44 +0x60 fp=0xc420067ef0 sp=0xc420067eb8 pc=0x4102dd0
gopkg.in/libgit2/git2go%2ev26.(*RevWalk).Push(0xc4200105a0, 0x0, 0x0, 0x0)
/Users/josh/src/gopkg.in/libgit2/git2go.v26/walk.go:44 +0x65 fp=0xc420067f18 sp=0xc420067ef0 pc=0x4101ee5
main.main()
/Users/josh/go/tip/src/x.go:19 +0xe3 fp=0xc420067f88 sp=0xc420067f18 pc=0x4104023
runtime.main()
/Users/josh/go/1.10/src/runtime/proc.go:198 +0x212 fp=0xc420067fe0 sp=0xc420067f88 pc=0x402c202
runtime.goexit()
/Users/josh/go/1.10/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420067fe8 sp=0xc420067fe0 pc=0x4053491
exit status 2
go version go1.10rc1 darwin/amd64
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/josh/Library/Caches/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/josh"
GORACE=""
GOROOT="/Users/josh/go/1.10"
GOTMPDIR=""
GOTOOLDIR="/Users/josh/go/1.10/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/1t/n61cbvls5bl293bbb0zyypqw0000gn/T/go-build301881037=/tmp/go-build -gno-record-gcc-switches -fno-common"
GOROOT/bin/go version: go version go1.10rc1 darwin/amd64
GOROOT/bin/go tool compile -V: compile version go1.10rc1
uname -v: Darwin Kernel Version 17.3.0: Thu Nov 9 18:09:22 PST 2017; root:xnu-4570.31.3~1/RELEASE_X86_64
ProductName: Mac OS X
ProductVersion: 10.13.2
BuildVersion: 17C205
lldb --version: lldb-900.0.64
Swift-4.0
gdb --version: GNU gdb (GDB) 8.0.1
Interesting. Thanks for the repro. This happened recently on linux/arm on the build dashboard, too: https://build.golang.org/log/4552a24f91dc893759cce93184c415f3986fcbc6
I haven't dug into this very deeply, but it sort of looks like we're injecting a sigpanic call into C code and then attempting to do a stack walk, which of course fails when it tries to unwind beyond the sigpanic. I thought we were supposed to just forward signals that came in during C execution, but perhaps that broke.
/cc @ianlancetaylor @bcmills
Thank you for the repro @josharian and my apologies for totally forgetting to work on diagnosing and reproducing this bug last year, but on the bright side no one could have done it better than you :)
@odeke-em want to bisect? :)
@josharian gladly, you've done the heavy lifting already :)
Obviously commit https://github.com/golang/go/commit/dbd8f3d739fe4ec34dd48f655edc15443c23a580 aka CL https://go-review.googlesource.com/#/c/89016/ exposes this problem, now my next step is to bisect
where we are perhaps mishandling the panic signal from C back to Go.
And also good news, here is a standalone repro
package main
/*
#include <stdio.h>
void printIt(void *ptr) {
char *s = (char *)ptr;
printf(":%c\n", *s);
}
*/
import "C"
func main() {
C.printIt(nil)
}
@odeke-em, thanks for looking into this! I'm not positive that your reproducer is quite the same since it unwinds to asmcgocall, while Josh's only gets to the sigpanic. If it is the same, then I think we've had the underlying issue for a long time. Go 1.8 also cuts off the traceback at asmcgocall and presumably was just silently failing to unwind past there. It would make sense to cut off the traceback at asmcgocall since that performs a system stack switch and thus there's nowhere to unwind to from it most of the time. (But it wouldn't make sense to cut it off at sigpanic, which is what makes me think your reproducer might be different from Josh's in an important way, though it still highlights a bug.)
We only forward signals if there is a C signal handler to which to forward them. If there is no C signal handler (the normal case) and if the Go runtime expects to handle the signal itself (the normal case) we don't forward. So, yes, we can definitely insert a call to sigpanic in C code.
Perhaps we should undo the hex dump for 1.10, or skip it when we think we may be executing C code.
@odeke-em, thanks for looking into this! I'm not positive that your reproducer is quite the same since it unwinds to asmcgocall, while Josh's only gets to the sigpanic.
Thank you @aclements, I'll take a deeper look sometime when I get more bandwidth, it's an interesting issue but I am finding myself having to tear up many CLs and commits :)
We only forward signals if there is a C signal handler to which to forward them. If there is no C signal handler (the normal case) and if the Go runtime expects to handle the signal itself (the normal case) we don't forward.
@ianlancetaylor nice, in deed :)
package main
/*
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
void sigHandler(const int sig) {
printf("got this signal: %d\n", sig);
exit(0);
}
void printIt(void *ptr, int handleSig) {
if (handleSig) {
struct sigaction sa;
sa.sa_handler = sigHandler;
sigaction(SIGSEGV, &sa, NULL);
}
char *s = (char *)ptr;
printf(":%c\n", *s);
}
*/
import "C"
func main() {
C.printIt(nil, 1)
}
For the original reproducer, the problem is indeed that we're injecting a sigpanic call directly into C, so gentraceback is seeing a return PC from sigpanic that's in C code. For that, I think Ian's right that we should just suppress the hex dump if there might be C code on the stack. I'm not sure what else we could do. Maybe we should only do that if we just saw a sigpanic frame since I think that's the only time this should happen.
For the second reproducer, after switching to the system stack, asmcgocall aligns the stack before calling C code, so gentraceback is reading an uninitialized stack slot and treating it as a return PC. That's not so great and I think we should add asmcgocall to the set of "top of stack" functions since it does the system stack switch.
The difference between the reproducers is rather amusing. In general we should see things like in the original reproducer. However, two GCC optimizations combine to make it appear like asmcgocall called sigpanic in the traceback of the second reproducer: because printIt returns void, GCC optimizes the the generated cgo wrapper into a tail call to printIt, so the wrapper doesn't appear on the stack; and because printIt has a zero-sized frame, printIt's own return PC (which is asmcgocall) appears in the position of sigpanic's return PC, so we also lose printIt from the stack. At that point, all of the actual C code is gone from the stack, so we don't wind up trying to unwind into it. Even changing printIt's return type to int is enough to inhibit the tail-call optimization and make the second reproducer fail like the first one.
I think we should add asmcgocall to the set of "top of stack" functions since it does the system stack switch.
It turns out this doesn't work. If the C code calls back in to Go, we switch back to the regular G stack and if we try to unwind from there, we'll see the call to asmcgocall from before it did the system stack switch. We want to be able to unwind through that to the rest of the regular Go stack, so we can't just treat asmcgocall as a top-of-stack function all the time.
Change https://golang.org/cl/90896 mentions this issue: runtime: fail silently if we unwind over sigpanic into C code
Change https://golang.org/cl/90895 mentions this issue: runtime: don't unwind past asmcgocall
asmcgocallaligns the stack before calling C code, sogentracebackis reading an uninitialized stack slot and treating it as a return PC.
That's surprising: doesn't asmcgocall leave a valid frame (with a valid saved base-pointer and return address) at the root of the system stack?
I guess the base-pointer and return address can't be saved there because they might be moved during a callback from C. But why does the callback from C need to run on the same G at all? (I think we've talked before about treating C calls as implicitly noescape and picking up a new goroutine stack instead of resuming on the old one.)
It seems like we could save the G in the system stack frame and zero it out in TLS, so that if we get a callback from C we run it on a fresh goroutine. What would that break? (LockOSThread, perhaps?)
At any rate, I think all the things I'm speculating about here are too invasive for a fix yet in 1.10...
That's surprising: doesn't asmcgocall leave a valid frame (with a valid saved base-pointer and return address) at the root of the system stack?
It does not and, unfortunately, I'm not sure it even can. The function itself is declared with 0 stack size, but it opens up a 64 byte frame and then aligns it on the system stack. I think traceback is still using the 0 stack size (I would expect the pcsp table to include the 64 byte change, but empirically it may not be; I haven't dug in to that), so we'd have to put the return PC at 0(SP). But that 64 byte frame is some spill location for the Windows fastcall calling convention (though I wasn't able to find documentation on why it needed that), so 0(SP) may be clobbered (if that's indeed what's going on with the 64 byte frame).
I guess the base-pointer and return address can't be saved there because they might be moved during a callback from C.
They're on the system stack, so they'll never get moved.
But why does the callback from C need to run on the same G at all? ... t seems like we could save the G in the system stack frame and zero it out in TLS, so that if we get a callback from C we run it on a fresh goroutine.
I'm not sure if there's much point in running the callback on a new G, but it certainly could run on a new stack, as I think we've talked about. We may not want to pay the cost of allocating a new stack at every C callback, so perhaps a cleverer thing would be to keep using the original G stack, but if we need to grow it, only copy back as far as the last C callback and keep both segments around. Of course, then we're sort of back in segmented stack land and Russ really didn't like the sound of that. :)
I guess the base-pointer and return address can't be saved there because they might be moved during a callback from C.
They're on the system stack, so they'll never get moved.
No, I mean that the stack to which they point might get moved, and the saved register values wouldn't be updated to match (since they would be on the system stack, instead of the goroutine stack being moved).
I think traceback is still using the 0 stack size (I would expect the pcsp table to include the 64 byte change, but empirically it may not be; I haven't dug in to that), so we'd have to put the return PC at 0(SP).
Ah, I was thinking we would construct something that looks like a C stack frame rather than a Go stack frame.
If I understand correctly, today the system stack in asmcgocall looks like this before the CALL instruction:
SP+64: (possible alignment padding)
SP+56: (unused?)
SP+48: g
SP+40: goroutine SP - g.stack.hi (offset into goroutine stack)
SP+32: (unused?)
SP+24: Windows spill slot for R9 / XMM3
SP+16: Windows spill slot for R8 / XMM2
SP+8: Windows spill slot for RDX / XMM1
SP: Windows spill slot for RCX / XMM0
If a frame pointer is present, it still points to the goroutine stack at that point, so the root of the stack appears to be an invalid frame. (We know how to restore the caller frame from the current stack frame and the TLS register, but that process doesn't follow the usual C or Go ABIs.)
We could instead make that look like a 64-byte standard (Windows-and-SysV-compatible) call frame:
SP+64: (possible alignment padding)
SP+56: return address within asmcgocall
BP, SP+48: saved asmcgocall BP
SP+40: g
SP+32: goroutine SP - g.stack.hi (offset into goroutine stack)
SP+24: Windows spill slot for R9 / XMM3
SP+16: Windows spill slot for R8 / XMM2
SP+8: Windows spill slot for RDX / XMM1
SP: Windows spill slot for RCX / XMM0
(BP)+8: return address from asmcgocall
(BP): saved caller BP
(BP)-8: (unused)
…
(BP)-48: (unused)
Then we mark asmcgocall as having a 64-byte frame rather than a 0-byte frame. If you walk frame-pointer chains assuming the usual function epilogue, all of the frame pointers are valid, but you end up with one weird discontinuity where the frame pointer jumps to a different stack. (Fortunately, that discontinuity is almost exactly the same as the one that occurs at the return from a signal stack in ordinary C code: it just happens to trampoline through asmcgocall rather than the usual signal trampoline.)
As far as I can tell, the main problem with that approach is that the saved BP value at SP+48 can become incorrect if a callback from C to Go moves the underlying goroutine stack.
Most helpful comment
For the original reproducer, the problem is indeed that we're injecting a
sigpaniccall directly into C, sogentracebackis seeing a return PC fromsigpanicthat's in C code. For that, I think Ian's right that we should just suppress the hex dump if there might be C code on the stack. I'm not sure what else we could do. Maybe we should only do that if we just saw asigpanicframe since I think that's the only time this should happen.For the second reproducer, after switching to the system stack,
asmcgocallaligns the stack before calling C code, sogentracebackis reading an uninitialized stack slot and treating it as a return PC. That's not so great and I think we should addasmcgocallto the set of "top of stack" functions since it does the system stack switch.The difference between the reproducers is rather amusing. In general we should see things like in the original reproducer. However, two GCC optimizations combine to make it appear like
asmcgocallcalledsigpanicin the traceback of the second reproducer: becauseprintItreturns void, GCC optimizes the the generated cgo wrapper into a tail call toprintIt, so the wrapper doesn't appear on the stack; and becauseprintIthas a zero-sized frame,printIt's own return PC (which isasmcgocall) appears in the position ofsigpanic's return PC, so we also loseprintItfrom the stack. At that point, all of the actual C code is gone from the stack, so we don't wind up trying to unwind into it. Even changingprintIt's return type tointis enough to inhibit the tail-call optimization and make the second reproducer fail like the first one.