Go: cmd/compile: BoltDB with SSA crashes with: unexpected fault address 0xc420a00000

Created on 17 Aug 2016  ·  17Comments  ·  Source: golang/go

  1. What version of Go are you using (go version)?
    go version go1.7 darwin/amd64
    go version go1.7 linux/amd64
  2. What operating system and processor architecture are you using (go env)?
    GOARCH="amd64"
    GOBIN=""
    GOEXE=""
    GOHOSTARCH="amd64"
    GOHOSTOS="darwin"
    GOOS="darwin"
    GOPATH=""
    GORACE=""
    GOROOT="/usr/local/Cellar/go/1.7/libexec"
    GOTOOLDIR="/usr/local/Cellar/go/1.7/libexec/pkg/tool/darwin_amd64"
    CC="clang"
    GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/lm/g9hbhwd10gj8g_t64zzm63pm0000gn/T/go-build065044530=/tmp/go-build -gno-record-gcc-switches -fno-common"
    CXX="clang++"
    CGO_ENABLED="1"

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build785786852=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

  1. What did you do?
    When creating a number of nested buckets, my app will intermittently crash. Here is the relevant part of the stack trace. The location of the failure is quite consistent under both Mac OS and Linux.

unexpected fault address 0xc420a00000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0xc420a00000 pc=0xef856]

goroutine 7 [running]:
runtime.throw(0x27f950, 0x5)
/usr/local/Cellar/go/1.7/libexec/src/runtime/panic.go:566 +0x95 fp=0xc42004f210 sp=0xc42004f1f0
runtime.sigpanic()
/usr/local/Cellar/go/1.7/libexec/src/runtime/sigpanic_unix.go:21 +0x1d0 fp=0xc42004f268 sp=0xc42004f210
github.com/boltdb/bolt.(_node).write(0xc420098070, 0xc4209ffff0)
/Users/ebusto/Perforce/mq-watch/src/github.com/boltdb/bolt/node.go:205 +0x86 fp=0xc42004f3a8 sp=0xc42004f268
github.com/boltdb/bolt.(_Bucket).write(0xc42004f4c0, 0xc420098070, 0x9, 0x0)
/Users/ebusto/Perforce/mq-watch/src/github.com/boltdb/bolt/bucket.go:598 +0xb1 fp=0xc42004f408 sp=0xc42004f3a8
github.com/boltdb/bolt.(_Bucket).CreateBucket(0xc4200bec40, 0xc420910600, 0x9, 0x9, 0x9, 0x9, 0x0)
/Users/ebusto/Perforce/mq-watch/src/github.com/boltdb/bolt/bucket.go:181 +0x2bf fp=0xc42004f508 sp=0xc42004f408
pkg/journal.(_TX).set(0xc420152028, 0xc4200bec40, 0xc4209105f6, 0x9, 0x9, 0x237f80, 0xc4205a85a0, 0x15)

  1. What did you expect to see?
    Under Go 1.6, and Go 1.7 with SSA disabled (go build -gcflags="-ssa=0"), Bolt will happily create buckets without fail.
  2. What did you see instead?
    SIGBUS on Mac OS, and SIGSEGV on Linux.
FrozenDueToAge

Most helpful comment

This program demonstrates what the problem is:

package main

import "unsafe"

func main() {
    b := (*[10]byte)(unsafe.Pointer(uintptr(0xdeadbeef)))[0:3]
    _ = b
}

All 17 comments

A helpful fellow in the Gopher slack channel, @zeebo, produced this test case.

Looking at this. I did notice an appearance of the u-word on github.com/boltdb/bolt/node.go:205

Changing the depth from rand.Intn(10) to 100 seems to cause it to more consistently and quickly crash for me. Since I'm basically shooting blind, YMMV. :)

After playing around for a bit, I believe I've found the problem.

BoltDB allocates byte slices for some structs, and uses a marker field on the end of the page type to indicate where the key/value pairs for that page start being stored. In the case that the page has no key/value pairs, it allocates less space than for the entire struct. Later, we construct a slice pointing at that last marker field. Since there were no key/value pairs, the slice is unused, but under the SSA code, something attempts to do a read anyway. When the allocated memory happens to end right at some unmapped memory or something, a SIGBUS happens.

It seems like the main take away is that before it wasn't a problem to create a slice pointing past some allocation, but now it is. No idea if that's intended behavior, or whatever.

Can you provide a minimal test case, or at least a minimal representative sketch of what you think the problem is? (I think I understand from your description, but it'll be less ambiguous if we have actual Go code to discuss.)

The playground link above is fairly minimal. Most of it is to exercise making as many page writes as possible. The slice is constructed and read from every time, but it will only crash when the bytes after the allocation will cause a fault. If there's a debug flag that puts guard pages around memory allocations that would make this much more deterministic.

I'll do my best with an ascii diagram.

|Full allocation                        |
+--------------------+------------------+------|
|Some bucket header  |The page struct   |p.ptr |
|--------------------|------------------|------|
|blah fields here    |fields here       |      |
+--------------------+------------------+------+

We construct a slice who's backing array starts at p.ptr, which is outside of the allocation. The slice ends up unused because the for loop has zero iterations, but for some reason, the SSA code dereferences the first byte of the slice (I think. asm output for both here: https://play.golang.org/p/QNYzVjU1ls)

I've added code to Bolt to avoid constructing these slices when it's known the size of the slice will be zero and the crashes go away.

This program demonstrates what the problem is:

package main

import "unsafe"

func main() {
    b := (*[10]byte)(unsafe.Pointer(uintptr(0xdeadbeef)))[0:3]
    _ = b
}

I see, thanks! (Sorry, I thought the playground link was just a sample application that demonstrated an issue within BoltDB itself.)

So when slicing a pointer-to-array like that, we're supposed to check for and panic on nil. We used to implement that with an explicit nil comparison; and if we found nil, we would conditional store to it to trigger a SIGSEGV.

It looks like with SSA we've switched to a more implicit nil check by unconditionally trying to load from the pointed-to address, which causes the SIGSEGV now also on non-nil-but-still-invalid pointers.

/cc @randall77

@mdempsky: Yes, that is my understanding as well.

16727 is essentially the same problem. So I guess the question is, can we do a nil check by doing a test load from the pointer to-be-nilchecked? This is a change in behavior since 1.6, but as you can only trigger it by using unsafe, maybe that's ok. But nevertheless we might want to consider going back to the old way. Opinions?

Upon reading the rules outlined on https://golang.org/pkg/unsafe/ section 3 for pointer arithemetic states that: "the result must continue to point into the original allocated object. Unlike in C, it is not valid to advance a pointer just beyond the end of its original allocation".

That section is specifically talking about doing pointer arithmetic with unintptr conversions, but it seems that creating these slice headers is implicitly creating a pointer that is not pointing into any allocated object, and so I'd argue that the section still applies and code is invalid.

@zeebo That section starts with "If p points into an allocated object", which is referring to GC managed Go variables. It's meant to allow pointer arithmetic within a variable, but not to limit us from adopting a moving garbage collector in the future. (For example, you can't assume the distance between two different variables will remain the same.) I don't think it's applicable here because in this case p points to unmapped memory.

That said, I do agree that this appears to be unspecified behavior. I think the first option we should consider is how hard it is for applications to avoid relying on it. It seems like slicing unmapped memory should be a relatively uncommon operation, whereas a simpler nil check benefits much more code.

Could BoltDB be reasonably modified to avoid creating these problematic slices?

@benbjohnson

@mdempsky I was chatting with @zeebo yesterday and I'm going to make a patch today. I'm happy to fix it on my side. Thanks for all the investigation everybody! 👍

In this case, we are starting with GC managed Go variables. A byte slice is created and we use unsafe to create a pointer to a struct inside of it, but the struct extends past the end of the allocation. We then use field access to essentially do pointer arithmetic to find where we should create the slice, and it is created past the end of the allocation.

On Aug 18, 2016, at 1:10 AM, Matthew Dempsky [email protected] wrote:

@zeebo That section starts with "If p points into an allocated object", which is referring to GC managed Go variables. It's meant to allow pointer arithmetic within a variable, but not to limit us from adopting a moving garbage collector in the future. (For example, you can't assume the distance between two different variables will remain the same.) I don't think it's applicable here because in this case p points to unmapped memory.

That said, I do agree that this appears to be unspecified behavior. I think the first option we should consider is how hard it is for applications to avoid relying on it. It seems like slicing unmapped memory should be a relatively uncommon operation, whereas a simpler nil check benefits much more code.

Could BoltDB be reasonably modified to avoid creating these problematic slices?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

I posted a fix for BoltDB here: https://github.com/boltdb/bolt/pull/584

It now checks for pages which don't extend past the end of the header before building the slice header.

I'm going to close this issue. It is fixed on the BoltDB side.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dominikh picture dominikh  ·  3Comments

rakyll picture rakyll  ·  3Comments

mingrammer picture mingrammer  ·  3Comments

myitcv picture myitcv  ·  3Comments

natefinch picture natefinch  ·  3Comments