Go: reflect: duplicate reflect.Type values under dynamic linking

Created on 20 Jan 2017  路  23Comments  路  Source: golang/go

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.8rc2 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/fsenart/projects"
GORACE=""
GOROOT="/home/fsenart/tools/go1.8rc2"
GOTOOLDIR="/home/fsenart/tools/go1.8rc2/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build840231100=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

// main.go

package main

import "plugin"

func main() {
    p, err := plugin.Open("plugin.so")
    if err != nil {
        panic(err)
    }
    f, err := p.Lookup("F")
    if err != nil {
        panic(err)
    }

    f.(func())()
}
// plugin.go

package main

import "C"
import "net/http"

func F() {
    _, err := http.Get("https://amazon.com")
    if err != nil {
        panic(err)
    }

    _, err = http.Get("https://google.com")
    if err != nil {
        panic(err)
    }
}
test: build
    @./main

build: clean
    @go build -buildmode=plugin -o plugin.so plugin.go
    @go build -o main main.go

clean:
    @rm -rf main

What did you expect to see?

I don't expect any panic to happen.

What did you see instead?

A panic happens when requesting https://google.com while all works as expected when requesting https://amazon.com.

panic: Get https://google.com: remote error: tls: bad record MAC

goroutine 1 [running]:
plugin/unnamed-7da7c3830d2c48526e010430dd8f9f30a6c7f139.F()
    /home/fsenart/projects/src/issues/issue_badmac/plugin.go:14 +0xa7
main.main()
    /home/fsenart/projects/src/issues/issue_badmac/main.go:15 +0xb6
Makefile:2: recipe for target 'test' failed
make: *** [test] Error 2

Observations

  • The issue is here since go1.8beta1
  • The issue happens only when using plugins.
  • The issue is different from #18584.
FrozenDueToAge

Most helpful comment

I had forgotten about that bug. Presumably the fix for #18252 wasn't enough (which has been my working assumption so far, that there's a reflect/type-info bug). I'll dig around.

All 23 comments

/cc @crawshaw @randall77

This is probably a reflect bug of some sort meaning a message is being encoded incorrectly.

While it's a nice short test case (thanks!) it's tricky to pull apart because the "bad record MAC" is an error coming back from the server. I'm going to see if I can get some of the cyrpto/tls unit tests running inside a plugin and hope those narrow it down.

If they don't, does anyone have a TLS server designed for debugging? (A TLS-aware equivalent of tcpdump?)

The same error occurs with -buildmode=shared, which gives us some (not exactly scrutable) unit test failures.

$ go test -linkshared -short crypto/tls
--- FAIL: TestClientResumption (0.02s)
    handshake_client_test.go:687: DifferentCipherSuite: handshake failed: local error: tls: bad record MAC
--- FAIL: TestVersion (0.01s)
    handshake_server_test.go:340: handshake failed: local error: tls: bad record MAC
--- FAIL: TestSCTHandshake (0.01s)
    handshake_server_test.go:390: handshake failed: local error: tls: bad record MAC
--- FAIL: TestHandshakeClientECDHERSAChaCha20 (0.00s)
    handshake_client_test.go:396: TLSv12-ECDHE-RSA-CHACHA20-POLY1305 #2: mismatch on read: got:16030300251000002120459a0798b78ac5dc1220f0798323cb75f8ce32eb80fbdeb886eaf25de1b57f541403030001011603030020a879516ad07f72e38cc4ec2a6a947bd25db2acce314e1101365e995f9ef0f643 want:160303002510000021202fe57da347cd62431528daac5fbb290730fff684afc4cfc2ed90995f58cb3b7414030300010116030300209d2fa6b72156ad38a831200b2edc3f8a3464de810ed3a5b1c1fc0518d93e7735
--- FAIL: TestHandshakClientSCTs (0.00s)
    handshake_client_test.go:396: TLSv12-SCT #2: mismatch on read: got:16030300251000002120459a0798b78ac5dc1220f0798323cb75f8ce32eb80fbdeb886eaf25de1b57f54140303000101160303002098378699d1c39a3dd65fd0e3575bfd63e7101d88cd3e47f79aa9ccdeda303546 want:160303002510000021202fe57da347cd62431528daac5fbb290730fff684afc4cfc2ed90995f58cb3b7414030300010116030300207a58e133d4ceca57efeab99df24decce864be9c2b564dd0f32f066654274d859
...

After many more similar failures it locks up. Building with -c and sending it a SIGQUIT show it stuck in a way that suggests one end of the client/server has locked up. EDIT: I'm omitting the stack dump but it's available if anyone wants it.

I suspect going after the unit test failures will be easier than the lockup, so I'll spend some time today trying to simplify one of them.

cc @mwhudson as this now affects shared libraries. Would be nice if we had a -linkshared builder to have caught this earlier.

Ah ha, I'm going to start here:

$ go install -buildmode=shared runtime sync runtime/cgo time
$ go test -short -linkshared encoding/...
?       encoding    [no test files]
ok      encoding/ascii85    0.041s
--- FAIL: TestUnmarshal (0.00s)
    asn1_test.go:506: #3:
        have &[]int{1, 2, 3}
        want &[]int{1, 2, 3}
FAIL
FAIL    encoding/asn1   0.033s
ok      encoding/base32 0.035s
ok      encoding/base64 0.037s
ok      encoding/binary 0.028s
ok      encoding/csv    0.030s
ok      encoding/gob    0.067s
ok      encoding/hex    0.028s
ok      encoding/json   0.404s
ok      encoding/pem    0.054s
ok      encoding/xml    0.060s

Oddly if I try to run go test -short -linkshared encoding/asn1 directly, it hangs, but in this form, the failure is consistent.

@crawshaw do you remember the other issue I've opened about encoding/asn1 golang/go#18584. The two may be related.

I had forgotten about that bug. Presumably the fix for #18252 wasn't enough (which has been my working assumption so far, that there's a reflect/type-info bug). I'll dig around.

Simplified test case:

package main // prg.go

import "reflect"

var d interface{} = &[]int{}

func main() {
    pv := reflect.New(reflect.TypeOf(d).Elem())
    v := pv.Interface()

    println(reflect.TypeOf(v) == reflect.TypeOf(d))
}

At tip:

$ go run prg.go 
true
$ go run -linkshared prg.go 
false
$

In the above example, both the binary and libstd.so contain type symbols for *[]int, as you would expect. But the program does not contain any dynamic relocations for it:

$ go tool nm libstd.so | grep "type.\*\[]int$"
 147e240 D type.*[]int
$ go tool nm prg | grep "type.\*\[]int$"
  602960 D type.*[]int
$ readelf -r prg | grep type
0000006029d0  000400000001 R_X86_64_64       0000000000000000 type.int + 0
000000602a10  000400000001 R_X86_64_64       0000000000000000 type.int + 0
000000603040  001400000001 R_X86_64_64       0000000000000000 type.8b7X1see + 0
000000603048  001100000001 R_X86_64_64       0000000000000000 type.H0+h5nWH + 0

@mwhudson I think this is basically the eface version of the itab problem from #18252. Shouldn't there be a dynamic relocation for the *rtype in the binary prg?

Well no, because prg is an executable so the system linker can assume no other type.*[]int symbol will be interposed in front of the one in prg.

Adding

    println("d", reflect.TypeOf(d))
    println("v", reflect.TypeOf(v))

to prg.go gives as output:

d (0x603040,0x602940)
v (0x603040,0x7f84580f66a0)

d looks correct to my level of understanding, 0x602940 is the address of type.[]int in the executable. 0x7f84580f66a0 is the address of type.[]int in the library though, which is wrong. I don't understand right now how v got constructed, but I guess if this address was computed by adding some offsets this might have happened. (The bug is present in 1.7 but not 1.6 which lends support to this theory).

Have we gotten @ianlancetaylor to look at this bug yet?

Ah of course, thanks, I was inverting the libstd.so and prg modules in my mind. That gives me a new place to hunt.

The reflect.New function takes a type []int and returns a value wrapping a *[]int. It gets to the pointer using the ptrToThis field. This is resolved via the moduledata typemap, and it resolves to the wrong module. That is, it points to the *[]int symbol that comes from libstd.so, not the program binary.

This is because libstd.so is listed first the list of modules. I instrumented typelinksinit and firstmoduledata is libstd.so, with firstmoduledata.next being the program binary module.

That's wrong. We want the linked list of modules to have same load order the dynamic linker uses. But the runtime package is defined in libstd.so, and that's where firstmoduledata is defined and first used.

Problematically it's backwards in plugins. The runtime package is always in the initial binary, so firstmoduledata is the initial program.

(I feel like I've run into this before, and I thought module load order was solved. Clearly it's not.)

We actually have a slice of active modules now, independent from the linked list. So if we can come up with a way to order the modules correctly, we can reorder in the slice before typelinksinit.

One hack I'm considering: sort the modules from lowest to highest values of md.text. From what I've seen in my printing on linux/glibc and macOS that matches what the dynamic linker does. (But I don't know it for a fact.)

On 25 January 2017 at 16:25, David Crawshaw notifications@github.com
wrote:

The reflect.New function takes a type []int and returns a value wrapping
a *[]int. It gets to the pointer using the ptrToThis field. This is
resolved via the moduledata typemap, and it resolves to the wrong module.
That is, it points to the *[]int symbol that comes from libstd.so, not
the program binary.

This is because libstd.so is listed first the list of modules. I
instrumented typelinksinit and firstmoduledata is libstd.so, with
firstmoduledata.next being the program binary module.

That's wrong. We want the linked list of modules to have same load order
the dynamic linker uses.

Ah.

But the runtime package is defined in libstd.so, and that's where
firstmoduledata is defined and first used.

Problematically it's backwards in plugins. The runtime package is always in
the initial binary, so firstmoduledata is the initial program.

Starting with firstmoduledata and walking next pointers is almost right,
isn't it? The only problem is that in the linkshared case libstd.so is at
the front when it should be .. uh, not at the front. (probably it will
always be second?).

Could we make it right by having the moduledata for the module containing
the runtime added by an initarray entry as well? I guess you'd need to do
something different in a static binary which is kinda horrid.

(I feel like I've run into this before, and I thought module load order
was solved. Clearly it's not.)

We actually have a slice of active modules now, independent from the
linked list. So if we can come up with a way to order the modules
correctly, we can reorder in the slice before typelinksinit.

Having the module containing runtime do _something_ in an initarray feels
most reliable here (maybe something separate to maintaining the moduledata
linked list) somehow.

One hack I'm considering: sort the modules from lowest to highest values
of md.text. From what I've seen in my printing on linux/glibc and macOS
that matches what the dynamic linker does. (But I don't know it for a fact.)

Yikes! Please don't do that without getting written confirmation from LKML
that it's safe first :)

A less evil hack would be to swap the position of firstmoduledata with whatever module contains the main_main symbol. That would be correct in all practical situations we will see with 1.8 (libstd.so, and plugins). It will eventually have to be revisited, if someone builds a complex enough loading scenario. WDYT?

(At this point I'm looking for something small enough to go into 1.8. I agree going after initarray is more principled, it just sounds a bit hairy to me for a close-to-release change, because of that asm blob in the linker.)

Of course, this bug was originally about a plugin failure. Runtime module ordering does not explain why the type would be wrong in a plugin. Either this diagnosis is wrong, or the failing test case I extracted is for a different bug.

CL https://golang.org/cl/35644 mentions this issue.

go test -linkshared seems to be pretty broken right now (sadface) but doing things by hand suggests that go test -linkshared crypto/tls is still broken with that CL (but encoding/asn1 is indeed fixed).

Now there are a couple of bugs here, and I'm not sure how to separate them into issues. I'll put a bit of time tomorrow into poking at it all tomorrow.

I hacked up the crypto/tls, encoding/asn1, and reflect unit tests to run as a plugin. The tls tests indeed fail just as this example fails. Unfortunately they are not easy things to debug.

The asn1 and reflect tests all pass except for this interesting one:

reflect_all.go:4800: typelinks not sorted: "struct {}" [1537] > "***int" [1538]

I'll work on fixing that, then try hacking up the other encoding tests to run under a plugin.

Actually that typelinks error is just a bad unit test, it cannot handle multiple modules. Onward.

More interesting, from vendor/golang_org/x/crypto/curve25519:

curve25519_test.go:27: incorrect result: got 469423c8d7cc27bb36c7096de54d38d285740efa69594c1f98f49b113dcc7a19, want 89161fde887b2b53de549af483940106ecc114d6982daa98256de23bdf77661a

I'm guessing some of the amd64 asm in this package is trying to hold CX across a load, and when dynamic linking for -buildmode=plugin or -buildmode=shared, it doesn't work. If so, this could be the encoding/crypto hang @mwhudson mentioned above.

@crawshaw Are you still looking at this?

I believe the only bug remaining here is #18820, which has been split out.

There also needs to be better testing, which I split out as #18814.

As this issue ranged across several problems, and I believe everything left is split out into those two, I'm going to close this issue. @fsenart, please follow #18820 for a resolution to your original issue.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

gopherbot picture gopherbot  路  3Comments

enoodle picture enoodle  路  3Comments

natefinch picture natefinch  路  3Comments

stub42 picture stub42  路  3Comments

ashb picture ashb  路  3Comments