Grpc-go: all SubConns are in TransientFailure, latest connection error: <nil>

Created on 12 Feb 2019 · 24Comments · Source: grpc/grpc-go

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.18.0

What version of Go are you using (`go version`)?

1.11.2

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

Ran gRPC requests in production through a local Envoy sidecar. This means that all connections are to a process over a local unix socket. WaitForReady was not set to true for these calls

What did you expect to see?

I'd expect it to be exceedingly rare for connections to a local unix sidecar to be in the TRANSIENT_FAILURE state, and if they were i'd expect to see a non-nil error message. We use whatever the default balancer implementation is, but we do pass in a custom dialer implementation:

    envoySocketDialer := func(address string, timeout time.Duration) (net.Conn, error) {
        return net.DialUnix("unix", nil, &net.UnixAddr{
            Name: envoySocketPath,
            Net:  "unix",
        })
    }
    options := []grpc.DialOption{
                ...
        grpc.WithDialer(envoySocketDialer), // Override the dialer as a workaround for https://github.com/grpc/grpc-go/issues/2510
    }

What did you see instead?

A small percentage (<1%) of RPCs failed with a "all SubConns are in TransientFailure, latest connection error: \

P2 Bug

Source

mpuncel

👍23

Most helpful comment

Run in to the same problem when upgrading to 1.18.0.
1.17.0 works fine.

zhyon404 on 15 Feb 2019

👍4

All 24 comments

I suspect that setting WaitForReady(true) in the call options will effectively prevent this issue for me, but I'm curious if there is a way to get more information about why all SubConns would be in the failure state and wondering if there's a lurking bug because there seems to not be any connection errors

mpuncel on 12 Feb 2019

Run in to the same problem when upgrading to 1.18.0.
1.17.0 works fine.

zhyon404 on 15 Feb 2019

👍4

I am currently using 1.19.0 and same issue described here was experienced. I added WaitForReady(true) to call options and worked as stated @mpuncel Ty
There definitely seems to be a bug here. I shall add this as a workaround for now and keep digging to see what I come up with.

jdejesus07 on 28 Feb 2019

Thanks for reporting. This is indeed a bug.
I suspect we either set the SubConn in TransientFailure when it isn't, or we forgot to set the connection. Anyway, IMO the last connection error shouldn't be nil here.

Is there any other logs that you got to tell whether there's an actual error (like connection broken) when this happens?

menghanl on 28 Feb 2019

Also, it could be related to https://github.com/grpc/grpc-go/pull/2565 (see https://github.com/grpc/grpc-go/issues/2406 for more information).

Can you try setting GRPC_GO_REQUIRE_HANDSHAKE=off and see if that changes anything?

menghanl on 28 Feb 2019

2019/03/01 14:17:38 http2: Framer 0xc4206b7260: wrote HEADERS flags=END_HEADERS stream=1 len=107

2019/03/01 14:17:38 http2: Framer 0xc4206b7260: wrote DATA flags=END_STREAM stream=1 len=1285 data="\x00\x00\x00\x05\x00\t\xbb'\x0f\v\xb5v:@\x11\r\xe0-\x90\xa0\x04T\xc0\x18\x012\n\b(\x12\x06\b\xb1\xa3\xdd\xe3\x052\n\br\x12\x06\b\x84\xcd\xe3\xe2\x052\v\b\xb0\x01\x12\x06\b\xa7ڵ\xe1\x052\v\b\xd5\f\x12\x06\b\x96\x8b\xed\xe2\x052\v\b\xd8 \x12\x06\b\xbd\x95\xdd\xe3\x052\v\b\xa1-\x12\x06\b\xa7\xa5\xdc\xe3\x052\v\b\xf6>\x12\x06\b\x96\x8b\xed\xe2\x052\n\b\x0f\x12\x06\b\x96\x8b\xed\xe2\x052\n\b\x12\x12\x06\b\x9b\xc0\xc9\xe2\x052\v\b\x8c\x01\x12\x06\b\xa3\xa9\xd1\xe0\x052\v\b\xc7\x01\x12\x06\b\xda\xd1\xdb\xe3\x052\v\b\xff\x02\x12\x06\bރ\x93\xe3\x052\v\b\x99\x1c\x12\x06\b\xb0\x89\xdc\xe3\x052\v\b\xb8#\x12\x06\b\x96\x8b\xed\xe2\x052\v\b\xc8,\x12\x06\b\xa7\xbe\xd2\xe3\x052\v\b\xf04\x12\x06\b\xb5\xe5\xd2\xe2\x052\v\b\xf6\x01\x12\x06\b\xe8\xa9\xd6\xe1\x052\v\b\xf9\x01\x12\x06\b\xbd\xe7\xb6\xe3\x052" (1029 bytes omitted)

INFO: 2019/03/01 14:17:38 pickfirstBalancer: HandleSubConnStateChange: 0xc4203ccee0, TRANSIENT_FAILURE

INFO: 2019/03/01 14:17:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"

I removed the WaitForReady and turned off handshake but experienced the same behavior.

jdejesus007 on 4 Mar 2019

@menghanl that fixes the issue for me.

aanm on 5 Mar 2019

There are more than one things happening, but the root cause is failing to reset reconnecting backoff, which is fixed in #2669. Please try the fix and see if it helps.

Because of the bug, reconnecting doesn't happen in time, resulting in longer than expected time in transient failure.

About latest connection error: nil. This error message was introduced for errors like TLS misconfig, so when the connection fails, the RPC error will contain useful information.
It can still potentially be nil when connection is closed by servers. It shouldn't be noticeable because reconnecting will happen immediately. But because of the backoff bug, it's much more easily encountered.

One possible fix is to set the error whenever connection goes into transient failure, so it will never be nil. I will save that in another fix.

menghanl on 6 Mar 2019

@menghanl patch #2669 does not solve the issue. it only works if GRPC_GO_REQUIRE_HANDSHAKE=off is set

aanm on 7 Mar 2019

@aanm
If GRPC_GO_REQUIRE_HANDSHAKE=off is required, it sounds like that the server didn't finish the handshake in time.
How is your server setup? Is it a proxy, a gRPC server, or server behind a mux (so port is shared by gRPC and http)?

menghanl on 7 Mar 2019

@menghanl #2669 does not fix the issue for me either, but GRPC_GO_REQUIRE_HANDSHAKE=off does work for me.

bbassingthwaite on 12 Mar 2019

We are multiplexing gRPC and HTTP traffic on a single port using cmux. Removing cmux from the equation fixes the problem.

import "github.com/cockroachdb/cmux"

server := grpc.NewServer()

lis, err := net.Listen("tcp", ":8081")
if err != nil {
    log.Fatal(err)
}
mux := cmux.New(lis)
grpcL := mux.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))

server.Serve(grpcL)

bbassingthwaite on 12 Mar 2019

@bbassingthwaite-va
That is caused by #2565 (also see #2406).
With that, gRPC-Go clients have the same behavior as gRPC-Java, thus you need to set the same match with writers as mentioned: https://github.com/soheilhy/cmux#limitations

An issue (https://github.com/soheilhy/cmux/issues/64) was filed before the gRPC-Go changes were checked in. Will send a PR to update their README as well.

menghanl on 12 Mar 2019

@aanm
If GRPC_GO_REQUIRE_HANDSHAKE=off is required, it sounds like that the server didn't finish the handshake in time.
How is your server setup? Is it a proxy, a gRPC server, or server behind a mux (so port is shared by gRPC and http)?

@menghanl sorry for the late reply, the server is a binary already compiled with grpc 1.7.5 https://github.com/grpc/grpc-go/commit/5b3c4e850e90a4cf6a20ebd46c8b32a0a3afcb9e, it's an program called cri-o. In their vendor directory they have that commit that I was mentioning https://github.com/kubernetes-sigs/cri-o/blob/v1.13.1/vendor.conf#L24. They do have a cmux library in there https://github.com/kubernetes-sigs/cri-o/blob/v1.13.1/vendor.conf#L115.

If you want to install it in your computer, these instructions should be enough

   sudo apt-key adv --recv-key --keyserver keyserver.ubuntu.com 8BECF1637AD8C79D

   cat <<EOF > /etc/apt/sources.list.d/projectatomic-ubuntu-ppa-artful.list
deb http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main
deb-src http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main
EOF
   sudo apt-get update
   sudo apt-get remove cri-o-1.* -y || true
   sudo apt-get install cri-o-1.13 -y

The client that I'm using to test is an "hello world" type of code, which is compiled with grpc 1.18.0, that checks the status of the server

import (
    criRuntime "k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2"
    "context"
    "fmt"
    "google.golang.org/grpc"
    "k8s.io/kubernetes/pkg/kubelet/util"
    "runtime"
    "sync"
    "time"
)

func getGRPCCLient(ctx context.Context) (*grpc.ClientConn, error) {
    addr, dialer, err := util.GetAddressAndDialer("unix:///var/run/crio/crio.sock")
    if err != nil {
        return nil, err
    }

    c, cancel := context.WithTimeout(ctx, time.Duration(5*time.Second))
    defer cancel()

    conn, err := grpc.DialContext(c, addr, grpc.WithDialer(dialer), grpc.WithInsecure(), grpc.WithBlock(), grpc.WithBackoffMaxDelay(15*time.Second))
    if err != nil {
        return nil, fmt.Errorf("failed to connect: %s", err)
    }
    return conn, nil
}

func newCRIClient(ctx context.Context) (*criClient, error) {
    cc, err := getGRPCCLient(ctx)
    if err != nil {
        return nil, err
    }
    rsc := criRuntime.NewRuntimeServiceClient(cc)
    return &criClient{rsc}, nil
}

func testCRI() {
    c, err := newCRIClient(context.Background())
    if err != nil {
        panic(err)
    }
    sreq := &criRuntime.StatusRequest{
        Verbose: false,
    }
    sresp, err := c.RuntimeServiceClient.Status(context.Background(), sreq)
    fmt.Println(sresp)
    if err != nil {
        panic(err)
    }
}

If this client program is executed with GRPC_GO_REQUIRE_HANDSHAKE=off everything works fine.

aanm on 13 Mar 2019

@aanm That is caused by #2565 (also see #2406).
cri-o will need to set match with writers as mentioned in https://github.com/soheilhy/cmux#limitations for gRPC-Java clients.
For now, you can set GRPC_GO_REQUIRE_HANDSHAKE=off before it's done.

menghanl on 13 Mar 2019

Lowering priority, since the main issue is resolved by eliminating the backoff after a successful connection is established. Ideally the error message would be populated with "server handshake not received" (or similar). When it's due to the handshake, this will happen naturally when we make transport creation function blocking. When it's due to a subsequent connection attempt, we need to plumb the error from the transport to the addrConn.

dfawley on 14 Mar 2019

Hi, guys!

Lowering priority, since the main issue is resolved by eliminating the backoff after a successful connection is established.

Do I understand your position correctly? Upgrading client to 1.18.1 should help for the majority, and GRPC_GO_REQUIRE_HANDSHAKE=off should be used for some specific cases (like cmux).

We've got the same problem with rpc error: code = Unknown desc = rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil> and I checked - only passing GRPC_GO_REQUIRE_HANDSHAKE=off worked for me.

Simple upgrade to 1.18.1 didn't change much. The only thing we observe - this errors became a little bit more rare, but not gone.

We don't have cmux, but we use nginx as k8s ingress balancer. Could it possibly lead to such errors? Is there some known issue in nginx tracker for this case?

mwf on 2 Apr 2019

Kindly ping, @menghanl @dfawley

mwf on 4 Apr 2019

@mwf Sorry for the late reply

There are 3 related/unrelated issues, but they (somehow) cause the same symptom.

The backoff reset problem, fixed in #2669
lastest connection error: <nil> still happens
- when connection is closed by servers.
- This needs a broader change to propagate errors from transport layer to gRPC layer.
- We kept the issue open to track this problem, but lowered the priority to p2 because it shouldn't be noticeable when reconnecting happens immediately (after #2669).
GRPC_GO_REQUIRE_HANDSHAKE as in #2406

In your case

only passing GRPC_GO_REQUIRE_HANDSHAKE=off worked for me.

This seems like the root cause is 3.
This usually happens when there's a dispatcher in front of the server, waiting for gRPC requests. While in the new behavior where HANDSHAKE is required, the client won't sent the request until the HANDSHAKE is done. This results in a deadlock.
cmux is one application that we know this happens. nginx might be another one.

Can you try something to get a better understanding of the root cause

run client with gRPC-Go from HEAD. We changed the reconnect logic a bit, and should have better logging now.
- Turn on logging with env variables GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info
- What I would look for is a warning didn't receive server preface in time.
gRPC-Java clients should also not work because of the HANDSHAKE. If you have those and give it a try, it should help to confirm the problem.

menghanl on 4 Apr 2019

This issue is labeled as requiring an update from the reporter, and no update has been received after 7 days. If no update is provided in the next 7 days, this issue will be automatically closed.

stale[bot] on 6 Sep 2019

I've noticed this problem too, but found that it only seems to happen when I have left a program being debugged and then hibernated the system. On waking up I typically see that there is still a listener as expected, but there is no accept on any incoming requests and eventually the process locks up. Is there some kind of time keeping inside the go implementation of the grpc server which gets buggered up when time freezes and is then later unfrozen when the systems is woken up again?

perolausson on 2 Dec 2019

@perolausson there's no server side time keeping. Could it be that the listener doesn't work well after the system is hibernated? And that in this case, all connections are blocked on connecting, but don't error either.

menghanl on 5 Dec 2019

I don't really know what is going on, but I noticed that a few reported seeing this in the context of containers (wherepausing the container is quite common), and from what I have seen it has been a fairly consistent observation in my own work as I hibernate my development machine all the time.

perolausson on 5 Dec 2019

still facing this issue