Please answer these questions before submitting your issue.
1.18.0
go version)?1.11.2
Linux
Ran gRPC requests in production through a local Envoy sidecar. This means that all connections are to a process over a local unix socket. WaitForReady was not set to true for these calls
I'd expect it to be exceedingly rare for connections to a local unix sidecar to be in the TRANSIENT_FAILURE state, and if they were i'd expect to see a non-nil error message. We use whatever the default balancer implementation is, but we do pass in a custom dialer implementation:
envoySocketDialer := func(address string, timeout time.Duration) (net.Conn, error) {
return net.DialUnix("unix", nil, &net.UnixAddr{
Name: envoySocketPath,
Net: "unix",
})
}
options := []grpc.DialOption{
...
grpc.WithDialer(envoySocketDialer), // Override the dialer as a workaround for https://github.com/grpc/grpc-go/issues/2510
}
A small percentage (<1%) of RPCs failed with a "all SubConns are in TransientFailure, latest connection error: \
I suspect that setting WaitForReady(true) in the call options will effectively prevent this issue for me, but I'm curious if there is a way to get more information about why all SubConns would be in the failure state and wondering if there's a lurking bug because there seems to not be any connection errors
Run in to the same problem when upgrading to 1.18.0.
1.17.0 works fine.
I am currently using 1.19.0 and same issue described here was experienced. I added WaitForReady(true) to call options and worked as stated @mpuncel Ty
There definitely seems to be a bug here. I shall add this as a workaround for now and keep digging to see what I come up with.
Thanks for reporting. This is indeed a bug.
I suspect we either set the SubConn in TransientFailure when it isn't, or we forgot to set the connection. Anyway, IMO the last connection error shouldn't be nil here.
Is there any other logs that you got to tell whether there's an actual error (like connection broken) when this happens?
Also, it could be related to https://github.com/grpc/grpc-go/pull/2565 (see https://github.com/grpc/grpc-go/issues/2406 for more information).
Can you try setting GRPC_GO_REQUIRE_HANDSHAKE=off and see if that changes anything?
2019/03/01 14:17:38 http2: Framer 0xc4206b7260: wrote HEADERS flags=END_HEADERS stream=1 len=107
2019/03/01 14:17:38 http2: Framer 0xc4206b7260: wrote DATA flags=END_STREAM stream=1 len=1285 data="\x00\x00\x00\x05\x00\t\xbb'\x0f\v\xb5v:@\x11\r\xe0-\x90\xa0\x04T\xc0\x18\x012\n\b(\x12\x06\b\xb1\xa3\xdd\xe3\x052\n\br\x12\x06\b\x84\xcd\xe3\xe2\x052\v\b\xb0\x01\x12\x06\b\xa7ڵ\xe1\x052\v\b\xd5\f\x12\x06\b\x96\x8b\xed\xe2\x052\v\b\xd8 \x12\x06\b\xbd\x95\xdd\xe3\x052\v\b\xa1-\x12\x06\b\xa7\xa5\xdc\xe3\x052\v\b\xf6>\x12\x06\b\x96\x8b\xed\xe2\x052\n\b\x0f\x12\x06\b\x96\x8b\xed\xe2\x052\n\b\x12\x12\x06\b\x9b\xc0\xc9\xe2\x052\v\b\x8c\x01\x12\x06\b\xa3\xa9\xd1\xe0\x052\v\b\xc7\x01\x12\x06\b\xda\xd1\xdb\xe3\x052\v\b\xff\x02\x12\x06\bރ\x93\xe3\x052\v\b\x99\x1c\x12\x06\b\xb0\x89\xdc\xe3\x052\v\b\xb8#\x12\x06\b\x96\x8b\xed\xe2\x052\v\b\xc8,\x12\x06\b\xa7\xbe\xd2\xe3\x052\v\b\xf04\x12\x06\b\xb5\xe5\xd2\xe2\x052\v\b\xf6\x01\x12\x06\b\xe8\xa9\xd6\xe1\x052\v\b\xf9\x01\x12\x06\b\xbd\xe7\xb6\xe3\x052" (1029 bytes omitted)
INFO: 2019/03/01 14:17:38 pickfirstBalancer: HandleSubConnStateChange: 0xc4203ccee0, TRANSIENT_FAILURE
INFO: 2019/03/01 14:17:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I removed the WaitForReady and turned off handshake but experienced the same behavior.
@menghanl that fixes the issue for me.
There are more than one things happening, but the root cause is failing to reset reconnecting backoff, which is fixed in #2669. Please try the fix and see if it helps.
Because of the bug, reconnecting doesn't happen in time, resulting in longer than expected time in transient failure.
About latest connection error: nil. This error message was introduced for errors like TLS misconfig, so when the connection fails, the RPC error will contain useful information.
It can still potentially be nil when connection is closed by servers. It shouldn't be noticeable because reconnecting will happen immediately. But because of the backoff bug, it's much more easily encountered.
One possible fix is to set the error whenever connection goes into transient failure, so it will never be nil. I will save that in another fix.
@menghanl patch #2669 does not solve the issue. it only works if GRPC_GO_REQUIRE_HANDSHAKE=off is set
@aanm
If GRPC_GO_REQUIRE_HANDSHAKE=off is required, it sounds like that the server didn't finish the handshake in time.
How is your server setup? Is it a proxy, a gRPC server, or server behind a mux (so port is shared by gRPC and http)?
@menghanl #2669 does not fix the issue for me either, but GRPC_GO_REQUIRE_HANDSHAKE=off does work for me.
We are multiplexing gRPC and HTTP traffic on a single port using cmux. Removing cmux from the equation fixes the problem.
import "github.com/cockroachdb/cmux"
server := grpc.NewServer()
lis, err := net.Listen("tcp", ":8081")
if err != nil {
log.Fatal(err)
}
mux := cmux.New(lis)
grpcL := mux.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))
server.Serve(grpcL)
@bbassingthwaite-va
That is caused by #2565 (also see #2406).
With that, gRPC-Go clients have the same behavior as gRPC-Java, thus you need to set the same match with writers as mentioned: https://github.com/soheilhy/cmux#limitations
An issue (https://github.com/soheilhy/cmux/issues/64) was filed before the gRPC-Go changes were checked in. Will send a PR to update their README as well.
@aanm
IfGRPC_GO_REQUIRE_HANDSHAKE=offis required, it sounds like that the server didn't finish the handshake in time.
How is your server setup? Is it a proxy, a gRPC server, or server behind a mux (so port is shared by gRPC and http)?
@menghanl sorry for the late reply, the server is a binary already compiled with grpc 1.7.5 https://github.com/grpc/grpc-go/commit/5b3c4e850e90a4cf6a20ebd46c8b32a0a3afcb9e, it's an program called cri-o. In their vendor directory they have that commit that I was mentioning https://github.com/kubernetes-sigs/cri-o/blob/v1.13.1/vendor.conf#L24. They do have a cmux library in there https://github.com/kubernetes-sigs/cri-o/blob/v1.13.1/vendor.conf#L115.
If you want to install it in your computer, these instructions should be enough
sudo apt-key adv --recv-key --keyserver keyserver.ubuntu.com 8BECF1637AD8C79D
cat <<EOF > /etc/apt/sources.list.d/projectatomic-ubuntu-ppa-artful.list
deb http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main
deb-src http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main
EOF
sudo apt-get update
sudo apt-get remove cri-o-1.* -y || true
sudo apt-get install cri-o-1.13 -y
The client that I'm using to test is an "hello world" type of code, which is compiled with grpc 1.18.0, that checks the status of the server
import (
criRuntime "k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2"
"context"
"fmt"
"google.golang.org/grpc"
"k8s.io/kubernetes/pkg/kubelet/util"
"runtime"
"sync"
"time"
)
func getGRPCCLient(ctx context.Context) (*grpc.ClientConn, error) {
addr, dialer, err := util.GetAddressAndDialer("unix:///var/run/crio/crio.sock")
if err != nil {
return nil, err
}
c, cancel := context.WithTimeout(ctx, time.Duration(5*time.Second))
defer cancel()
conn, err := grpc.DialContext(c, addr, grpc.WithDialer(dialer), grpc.WithInsecure(), grpc.WithBlock(), grpc.WithBackoffMaxDelay(15*time.Second))
if err != nil {
return nil, fmt.Errorf("failed to connect: %s", err)
}
return conn, nil
}
func newCRIClient(ctx context.Context) (*criClient, error) {
cc, err := getGRPCCLient(ctx)
if err != nil {
return nil, err
}
rsc := criRuntime.NewRuntimeServiceClient(cc)
return &criClient{rsc}, nil
}
func testCRI() {
c, err := newCRIClient(context.Background())
if err != nil {
panic(err)
}
sreq := &criRuntime.StatusRequest{
Verbose: false,
}
sresp, err := c.RuntimeServiceClient.Status(context.Background(), sreq)
fmt.Println(sresp)
if err != nil {
panic(err)
}
}
If this client program is executed with GRPC_GO_REQUIRE_HANDSHAKE=off everything works fine.
@aanm That is caused by #2565 (also see #2406).
cri-o will need to set match with writers as mentioned in https://github.com/soheilhy/cmux#limitations for gRPC-Java clients.
For now, you can set GRPC_GO_REQUIRE_HANDSHAKE=off before it's done.
Lowering priority, since the main issue is resolved by eliminating the backoff after a successful connection is established. Ideally the error message would be populated with "server handshake not received" (or similar). When it's due to the handshake, this will happen naturally when we make transport creation function blocking. When it's due to a subsequent connection attempt, we need to plumb the error from the transport to the addrConn.
Hi, guys!
Lowering priority, since the main issue is resolved by eliminating the backoff after a successful connection is established.
Do I understand your position correctly? Upgrading client to 1.18.1 should help for the majority, and GRPC_GO_REQUIRE_HANDSHAKE=off should be used for some specific cases (like cmux).
We've got the same problem with rpc error: code = Unknown desc = rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil> and I checked - only passing GRPC_GO_REQUIRE_HANDSHAKE=off worked for me.
Simple upgrade to 1.18.1 didn't change much. The only thing we observe - this errors became a little bit more rare, but not gone.
We don't have cmux, but we use nginx as k8s ingress balancer. Could it possibly lead to such errors? Is there some known issue in nginx tracker for this case?
Kindly ping, @menghanl @dfawley
@mwf Sorry for the late reply
There are 3 related/unrelated issues, but they (somehow) cause the same symptom.
lastest connection error: <nil> still happensGRPC_GO_REQUIRE_HANDSHAKE as in #2406In your case
only passing GRPC_GO_REQUIRE_HANDSHAKE=off worked for me.
This seems like the root cause is 3.
This usually happens when there's a dispatcher in front of the server, waiting for gRPC requests. While in the new behavior where HANDSHAKE is required, the client won't sent the request until the HANDSHAKE is done. This results in a deadlock.
cmux is one application that we know this happens. nginx might be another one.
Can you try something to get a better understanding of the root cause
GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=infodidn't receive server preface in time.This issue is labeled as requiring an update from the reporter, and no update has been received after 7 days. If no update is provided in the next 7 days, this issue will be automatically closed.
I've noticed this problem too, but found that it only seems to happen when I have left a program being debugged and then hibernated the system. On waking up I typically see that there is still a listener as expected, but there is no accept on any incoming requests and eventually the process locks up. Is there some kind of time keeping inside the go implementation of the grpc server which gets buggered up when time freezes and is then later unfrozen when the systems is woken up again?
@perolausson there's no server side time keeping. Could it be that the listener doesn't work well after the system is hibernated? And that in this case, all connections are blocked on connecting, but don't error either.
I don't really know what is going on, but I noticed that a few reported seeing this in the context of containers (wherepausing the container is quite common), and from what I have seen it has been a fairly consistent observation in my own work as I hibernate my development machine all the time.
still facing this issue
Most helpful comment
Run in to the same problem when upgrading to 1.18.0.
1.17.0 works fine.