Longhorn: Stack trace happened in instance manager - engine

Created on 4 Oct 2019  路  7Comments  路  Source: longhorn/longhorn

Copied from https://github.com/longhorn/longhorn/issues/428#issuecomment-538292044 by @ffilippopoulos


We observe the same behaviour with longhorn v0.6.1. Our prometheus pods that use longhorn volumes fail randomly because the mount becomes ReadOnly (not during mount time but while pod is running and the volume is already mounted). Some errors I collected form instance-manager-e instance that is running on the same node around the time that this happens:

[pvc-69ef44b0-1b94-426b-a065-cb7f5094a8b8-e-0cc6736f-e79ab6f3] fatal error: concurrent map read and map write
receive_msg: fail to read magic version
response_process: Receive response returned error
lh_client_close_conn: Closing connection
lh_client_close_conn: Cancel request 44534 due to disconnection
tgtd: bs_longhorn_request(91) fail to write at 10db62000 for 4096
tgtd: bs_longhorn_request(150) io error 0x1ff5f80 2a -14 4096 4525006848, Success
[pvc-69ef44b0-1b94-426b-a065-cb7f5094a8b8-e-0cc6736f-e79ab6f3] 

then we get a bunch of stack traces:

goroutine 506743 [running]:
runtime.throw(0xc52f20, 0x21)
        /usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc422f1e4b0 sp=0xc422f1e490 pc=0x42ad11
runtime.mapaccess2(0xb587e0, 0xc4202c2f60, 0xc423037a00, 0xc423037a00, 0x10)
        /usr/local/go/src/runtime/hashmap.go:409 +0x225 fp=0xc422f1e4f8 sp=0xc422f1e4b0 pc=0x408b95
reflect.mapaccess(0xb587e0, 0xc4202c2f60, 0xc423037a00, 0xc4202c2f60)
        /usr/local/go/src/runtime/hashmap.go:1195 +0x3f fp=0xc422f1e530 sp=0xc422f1e4f8 pc=0x40ae4f
reflect.Value.MapIndex(0xb587e0, 0xc42054ac60, 0x195, 0xb18240, 0xc423037a00, 0x98, 0xb18240, 0xc4204428b0, 0x98)
        /usr/local/go/src/reflect/value.go:1081 +0x126 fp=0xc422f1e5b8 sp=0xc422f1e530 pc=0x4b5026
github.com/longhorn/longhorn-engine/vendor/github.com/golang/protobuf/proto.makeMapMarshaler.func2(0xc42054ac60, 0x1, 0x0)
        /go/src/github.com/longhorn/longhorn-engine/vendor/github.com/golang/protobuf/proto/table_marshal.go:2321 +0x22a fp=0xc422f1e698 sp=0xc422f1e5b8 pc=0x79cbda
github.com/longhorn/longhorn-engine/vendor/github.com/golang/protobuf/proto.(*marshalInfo).size(0xc420214a80, 0xc42054ac60, 0x131c200)
        /go/src/github.com/longhorn/longhorn-engine/vendor/github.com/golang/protobuf/proto/table_marshal.go:183 +0xaa fp=0xc422f1e700 sp=0xc422f1e698 pc=0x77481a
github.com/longhorn/longhorn-engine/vendor/github.com/golang/protobuf/proto.(*InternalMessageInfo).Size(0x131c200, 0xcc2080, 0xc42054ac60, 0x7fc5b8599378)
        /go/src/github.com/longhorn/longhorn-engine/vendor/github.com/golang/protobuf/proto/table_marshal.go:125 +0x65 fp=0xc422f1e730 sp=0xc422f1e700 pc=0x774425
github.com/longhorn/longhorn-engine/controller/rpc/pb.(*BackupReplicaMap).XXX_Size(0xc42054ac60, 0xcc2080)
        /go/src/github.com/longhorn/longhorn-engine/controller/rpc/pb/controller.pb.go:958 +0x43 fp=0xc422f1e760 sp=0xc422f1e730 pc=0x91b0e3
github.com/longhorn/longhorn-engine/vendor/github.com/golang/protobuf/proto.(*Buffer).Marshal(0xc42054a548, 0xcc2080, 0xc42054ac60, 0xc420694b40, 0x0)
        /go/src/github.com/longhorn/longhorn-engine/vendor/github.com/golang/protobuf/proto/table_marshal.go:2740 +0x8a fp=0xc422f1e830 sp=0xc422f1e760 pc=0x78759a
github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/encoding/proto.marshal(0xbdab20, 0xc42054ac60, 0xc42054a540, 0x0, 0x0, 0x0, 0x0, 0xc420154080)
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/encoding/proto/proto.go:59 +0xda fp=0xc422f1e880 sp=0xc422f1e830 pc=0x7f69fa
github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/encoding/proto.codec.Marshal(0xbdab20, 0xc42054ac60, 0x453f60, 0xc4205c1080, 0x40e55d, 0xc4203aa000, 0xb749e0)
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/encoding/proto/proto.go:74 +0xaf fp=0xc422f1e900 sp=0xc422f1e880 pc=0x7f6b4f
github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/encoding/proto.(*codec).Marshal(0x13440b0, 0xbdab20, 0xc42054ac60, 0x5, 0xc42029a420, 0xc422f1ea10, 0x411439, 0xc42054ac60)
        <autogenerated>:1 +0x48 fp=0xc422f1e948 sp=0xc422f1e900 pc=0x7f73b8
github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc.encode(0x7fc5c9833210, 0x13440b0, 0xbdab20, 0xc42054ac60, 0x13440b0, 0xc42024a5a0, 0xc420330f00, 0x46e5be, 0x411439)
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/rpc_util.go:543 +0x61 fp=0xc422f1e9d0 sp=0xc422f1e948 pc=0x833511
github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc.(*Server).sendResponse(0xc4204b2d80, 0xcc8180, 0xc42032f500, 0xc420142600, 0xbdab20, 0xc42054ac60, 0x0, 0x0, 0xc420352f20, 0x0, ...)
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:837 +0x89 fp=0xc422f1eae8 sp=0xc422f1e9d0 pc=0x838569
github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc4204b2d80, 0xcc8180, 0xc42032f500, 0xc420142600, 0xc4203077d0, 0x12d66a8, 0x0, 0x0, 0x0)
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:1033 +0x5ad fp=0xc422f1ede0 sp=0xc422f1eae8 pc=0x83932d
github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc.(*Server).handleStream(0xc4204b2d80, 0xcc8180, 0xc42032f500, 0xc420142600, 0x0)
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:1278 +0xe02 fp=0xc422f1ef80 sp=0xc422f1ede0 pc=0x83ce72
github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc4203ee830, 0xc4204b2d80, 0xcc8180, 0xc42032f500, 0xc420142600)
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:717 +0x9f fp=0xc422f1efb8 sp=0xc422f1ef80 pc=0x84865f
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc422f1efc0 sp=0xc422f1efb8 pc=0x458051
created by github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:715 +0xa1

and after it all calms we constantly see:

goroutine 506744 [select]:
github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc4202ec9c0, 0x1, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:317 +0x133
github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc4203edda0, 0x0, 0x0)
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:435[pvc-69ef44b0-1b94-426b-a065-cb7f5094a8b8-e-0cc6736f-e79ab6f3]  +0x1a8
github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/internal/transport.newHTTP2Server.func2(0xc42032f800)
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/internal/transport/http2_server.go:276 +0xda
created by github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/internal/transport.newHTTP2Server
        /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/internal/transport/http2_server.go:273 +0xf0f
[longhorn-instance-manager] time="2019-10-04T01:50:39Z" level=info msg="Process Manager: process pvc-69ef44b0-1b94-426b-a065-cb7f5094a8b8-e-0cc6736f-e79ab6f3 error out, error msg: exit status 2"
[longhorn-instance-manager] time="2019-10-04T01:50:39Z" level=debug msg="Process update: pvc-69ef44b0-1b94-426b-a065-cb7f5094a8b8-e-0cc6736f-e79ab6f3: state error: Error: exit status 2"
[longhorn-instance-manager] time="2019-10-04T01:50:39Z" level=debug msg="Process update: pvc-69ef44b0-1b94-426b-a065-cb7f5094a8b8-e-0cc6736f-e79ab6f3: state error: Error: exit status 2"
[longhorn-instance-manager] time="2019-10-04T01:50:39Z" level=debug msg="Process update: pvc-69ef44b0-1b94-426b-a065-cb7f5094a8b8-e-0cc6736f-e79ab6f3: state error: Error: exit status 2"

and the volume stays in ReadOnly mode. I was able to recover by deleting the pod and re attaching volumes through the ui

areengine bug

All 7 comments

Thanks for the report! @ffilippopoulos Is it possible to send a support bundle to us at [email protected] ? We want to look into it ASAP.

@yasker I've forwarded all the collected logs to the email above! thanks for looking into this

@ffilippopoulos Received, thanks!

The direct reason is the same as the race condition in https://github.com/golang/protobuf/issues/838

But I am not sure if it's caused by the protobuf bug or the engine bug.
Two possible reasons:

  1. Due to https://github.com/golang/protobuf/issues/838 , if protobuf structure definition has subfield, it may triggered the issue. We do have subfields for engine e.g. https://github.com/longhorn/longhorn-engine/blob/master/controller/rpc/pb/controller.proto#L89
  2. The wrong usage of BackupReplicaMap due to returning a pointer to the existing field in use instead of a copy, which is already fixed after v0.6.1 release in https://github.com/longhorn/longhorn-engine/commit/8943734dbc82619fd72d483a2a7c3df8f509d4b9.

There is no way to know which one triggered the issue without looking into the manager's log though.

This won't be easy to reproduce too. We will update the protobuf vendor with the fix for now.

The fix will be released with v0.6.2 release in a couple of days.

Feel free to reopen if it happens again with v0.6.2 release or later with the same call trace.

The fix will be released with v0.6.2 release in a couple of days.

Feel free to reopen if it happens again with v0.6.2 release or later with the same call trace.

can you explain how to fix this bug? thanks very much.

@awinlei the fix is already in v0.6.2, which version of Longhorn you're using?

Was this page helpful?
0 / 5 - 0 ratings