Please answer these questions before submitting your issue.
1.18.0
go version)?1.11.5
Alpine Linux 3.9 running on Google Kubernetes Engine
Since upgrading, I've noticed a large increase in grpc errors leading to failed requests and service disruptions.
We don't use any connection settings on the clients, nor do we have any special settings on the server (keepalives, fastfail, etc). The only thing we have on the connections is Mutual TLS. It's roughly configured like this:
Client:
creds := grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{
RootCAs: certPool,
Certificates: []tls.Certificate{cert},
MinVersion: tls.VersionTLS12,
}))
grpc.Dial("foo.foo.svc.cluster.local:80", creds)
Server:
creds := grpc.Creds(credentials.NewTLS(&tls.Config{
Certificates: []tls.Certificate{cert},
ClientCAs: certPool,
ClientAuth: tls.RequireAndVerifyClientCert,
MinVersion: tls.VersionTLS12,
NextProtos: []string{"h2"},
PreferServerCipherSuites: true,
CurvePreferences: []tls.CurveID{
tls.CurveP256,
tls.X25519,
},
CipherSuites: []uint16{
tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,
tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,
tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
},
}))
grpc.NewServer(creds)
This may be related to one of these other two issues
https://github.com/grpc/grpc-go/issues/2653
https://github.com/grpc/grpc-go/issues/2636
Based on the feedback in one of the above linked issues, I set these ENV vars on one service that had a lot of failures: GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info
I added (__GRPC CALL HERE__) at the end of the log lines below where the error was logged as a failed grpc call that we made.
WARNING: 2019/03/01 04:41:42 grpc: addrConn.createTransport failed to connect to {foo.foo.svc.cluster.local:80 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting...
INFO: 2019/03/01 04:41:42 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:41:42 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, TRANSIENT_FAILURE
timestamp=2019-03-01T04:41:42.887562062Z level=error error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\"" (__GRPC CALL HERE))
INFO: 2019/03/01 04:41:43 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:41:43 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, CONNECTING
INFO: 2019/03/01 04:41:44 Subchannel Connectivity change to READY
INFO: 2019/03/01 04:41:44 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, READY
INFO: 2019/03/01 04:45:22 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 04:45:22 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:45:22 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 04:45:22 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:45:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:45:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:45:28 Subchannel Connectivity change to READY
INFO: 2019/03/01 04:45:28 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:34:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:34:23 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:35:25 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:35:25 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:35:27 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:35:32 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:35:59 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:35:59 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:36:01 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:36:01 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
timestamp=2019-03-01T05:38:19.972966567Z level=error error="rpc error: code = Unavailable desc = transport is closing" (__GRPC CALL HERE__)
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:38:21 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:13:05 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:13:05 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, TRANSIENT_FAILURE
INFO: 2019/03/01 16:13:05 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
timestamp=2019-03-01T16:13:05.92908361Z level=error error="rpc error: code = Unavailable desc = transport is closing" (__GRPC CALL HERE__)
INFO: 2019/03/01 16:13:07 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, CONNECTING
INFO: 2019/03/01 16:13:07 Subchannel Connectivity change to READY
INFO: 2019/03/01 16:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, READY
INFO: 2019/03/01 16:20:17 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:20:17 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:20:18 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:20:23 Subchannel Connectivity change to READY
INFO: 2019/03/01 16:42:24 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:42:24 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:42:25 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:42:25 Subchannel Connectivity change to READY
INFO: 2019/03/01 17:20:47 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 17:20:47 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 17:20:49 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 17:20:54 Subchannel Connectivity change to READY
INFO: 2019/03/01 17:43:01 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 17:43:01 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 17:43:02 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 17:43:07 Subchannel Connectivity change to READY
Can you also get server side logs to see if it contains more information?
The handshake deadline exceeded error looks very suspicious. It would help if we know why it happened.
One more thing to try: can you set GRPC_GO_REQUIRE_HANDSHAKE=off on client side and see if that changes anything? (#2406)
@menghanl I have experienced same issue here. Here is a reproduce procedure.
package main
import (
"context"
"log"
"time"
"github.com/golang/glog"
"github.com/kazegusuri/grpc-test/pb"
"google.golang.org/grpc"
_ "google.golang.org/grpc/grpclog/glogger"
)
func main() {
ctx := context.Background()
conn1, err := grpc.DialContext(ctx, "127.0.0.1:9000", grpc.WithInsecure())
if err != nil {
log.Fatalf("grpc.DialContext failed: %v", err)
}
defer func() { _ = conn1.Close() }()
cli1 := pb.NewEchoClient(conn1)
for {
time.Sleep(1 * time.Millisecond)
_, err := cli1.Echo(ctx, &pb.EchoMessage{Message: "hello"})
if err != nil {
glog.Infof("goroutine: echo failed: %v", err)
} else {
glog.Infof("goroutine: echo succeeded")
}
}
}
func main() {
server := newServer()
s := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
MaxConnectionAge: 10 * time.Second, // this kills connection 10s after connected
}))
pb.RegisterEchoServer(s, server)
lis, err := net.Listen("tcp", ":9000")
if err != nil {
log.Fatalf("failed to listen: %v", err)
}
if err := s.Serve(lis); err != nil {
log.Fatalf("err %v\n", err)
}
select {}
}
With above programs, I run the server in v1.19.0 and the client with v1.17, 1.18, 1.19.
The server kills the connection from the client 10 second after connected.
The client should reconnect to the server but its behavior is different in client versions.
With 1.18 and 1.19 the client takes over 1 second to recconect the server.
v1.17.0 (default)
I0303 02:22:59.745529 6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, TRANSIENT_FAILURE
I0303 02:22:59.745566 6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, CONNECTING
I0303 02:22:59.745732 6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, READY
v1.18.0 (default)
I0303 02:20:58.820793 5169 clientconn.go:941] Subchannel Connectivity change to TRANSIENT_FAILURE
I0303 02:20:58.820892 5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, TRANSIENT_FAILURE
I0303 02:20:59.821028 5169 clientconn.go:941] Subchannel Connectivity change to CONNECTING
I0303 02:20:59.821189 5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, CONNECTING
I0303 02:20:59.822058 5169 clientconn.go:941] Subchannel Connectivity change to READY
I0303 02:20:59.822155 5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, READY
v1.18.0 (GRPC_GO_REQUIRE_HANDSHAKE=off)
I0303 02:25:59.393033 8568 clientconn.go:941] Subchannel Connectivity change to TRANSIENT_FAILURE
I0303 02:25:59.393096 8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, TRANSIENT_FAILURE
I0303 02:26:00.393385 8568 clientconn.go:941] Subchannel Connectivity change to CONNECTING
I0303 02:26:00.393510 8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, CONNECTING
I0303 02:26:00.393937 8568 clientconn.go:941] Subchannel Connectivity change to READY
I0303 02:26:00.393979 8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, READY
v1.19.0 (default)
I0303 02:28:57.655932 11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, TRANSIENT_FAILURE
I0303 02:28:58.656125 11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, CONNECTING
I0303 02:28:58.656489 11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, READY
v1.19.0 (GRPC_GO_REQUIRE_HANDSHAKE=off)
I0303 02:30:30.642447 11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, TRANSIENT_FAILURE
I0303 02:30:31.642680 11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, CONNECTING
I0303 02:30:31.643004 11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, READY
This causes performance regression on the server which provides single subconn like Google Spanner.
Spanner seems to kill connections in the server-side in an hour, then the connection becomes transient failure. For a second in transient failure until becoming ready, the connection keeps unavaiable.
If it's okay to try to create transport immediately after transient failure for first retry, how about adding code like this here?
if ac.backoffIdx == 0 {
backoffFor = 0
}
I have a question.
I found baseDelay = 1.0 * time.Second in here.
I think this value is little bit big. Why we use 1 sec by default?
1 sec is a too big from user-facing frontend side.
I encountered the same issue using the gc bigtable go client (grpc based), bigtable grpc server seems to have a connection max age of one hour, which means that every hour, I litteraly stop handling my traffic for one second.
I agree with @vvakame this default value is too high for a lot of use cases and should be easily configurable in the other google clients based on this repo.
There's a bug in the reconnect logic, the backoff time is not reset after creating a connection.
The fix is in #2669. Can you guys try it and see if it fixes the problems you are seeing? Thanks!
Closing. Please comment if you are still experiencing problems.
Does the fix in https://github.com/grpc/grpc-go/pull/2669 still require GRPC_GO_REQUIRE_HANDSHAKE=off to be set?
I hope not, since I intend to remove that option soon (this or next week). Please let me know if you are experiencing any problems with it unset.
@dfawley
It seems to still be an issue for us. I reported it to Google Cloud support which pointed me to this thread and suggested the workaround was to set GRPC_GO_REQUIRE_HANDSHAKE=off. It appears that is no longer the case.
We running a Golang Google App Engine flexible app with Go v1.11 and grpc-gio v1.20.0.
With Spanner, we see context timeouts:
spanner: code = “Canceled”, desc = “context canceled, lastErr is <<nil>>”
With Cloud Tasks, we see these RPC errors:
all SubConns are in TransientFailure, latest connection error:
These errors are not occuring consistently, only intermittently and in bursts.
These errors are not occuring consistently, only intermittently and in bursts.
How long do these bursts last and how often do they occur?
Does setting GRPC_GO_REQUIRE_HANDSHAKE=off help for you?
These errors are not occuring consistently, only intermittently and in bursts.
How long do these bursts last and how often do they occur?
Currently, we do not have a production environment with any meaningful load, but in a trial production run that we ran for 3 days, we had one burst of errors for ~20 minutes. Otherwise, we see it less than once a day, for a minute.
I turned on GRPC logging on one of our deployments and it appears I get the TRANSIENT_FAILURE errors continuously regardless if GRPC_GO_REQUIRE_HANDSHAKE=off is set.
grpc v1.21.1 and GRPC_GO_REQUIRE_HANDSHAKE=off
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239980, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239980, CONNECTING
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239980, READY
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239770, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239770, CONNECTING
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239770, READY
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
2019-06-14 20:50:14 default[20190614t163806] INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY
The reconnects seems to happening immediately and I haven't experienced any RPC failures with all SubConns are in TransientFailure, so I guess this is acceptable behavior?
We do configure http.DefaultTransport.(*http.Transport).MaxIdleConnsPerHost = 500, would that be a problem?
@mikesun
I guess this is acceptable behavior?
It's not normal for your connection to be dying so frequently (several per second). Is this happening continuously or are these entries the only ones? Unfortunately, there isn't enough logging in place to see why the connection is being closed. But it does go into READY, which means at least TLS was successful.
We do configure http.DefaultTransport.(*http.Transport).MaxIdleConnsPerHost = 500, would that be a problem?
Go HTTP server settings should not affect grpc-go - we have our own server implementation.
It's not normal for your connection to be dying so frequently (several per second). Is this happening continuously or are these entries the only ones? Unfortunately, there isn't enough logging in place to see why the connection is being closed. But it does go into READY, which means at least TLS was successful.
It seems to happen in bunches (~3/sec) every few minutes (1-4 min). This is on an GAE flexible instance with basically no load:
A INFO: 2019/06/17 16:10:15 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239900, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239900, CONNECTING
A INFO: 2019/06/17 16:10:15 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239870, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239870, CONNECTING
A INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239900, READY
A INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239870, READY
A INFO: 2019/06/17 16:10:37 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:10:37 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:10:37 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A INFO: 2019/06/17 16:10:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A INFO: 2019/06/17 16:10:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY
A INFO: 2019/06/17 16:14:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A INFO: 2019/06/17 16:14:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A INFO: 2019/06/17 16:14:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY
A INFO: 2019/06/17 16:18:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A INFO: 2019/06/17 16:18:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A INFO: 2019/06/17 16:18:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY
A INFO: 2019/06/17 16:22:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A INFO: 2019/06/17 16:22:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A INFO: 2019/06/17 16:22:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY
@mikesun - one last thing I can think of: make sure you aren't running a grpc-go version between #2740 (82fdf27 - 21d ago) and #2862 (a1d4c28 - 10d ago). We had a bug in that range that could potentially cause problems like this.
If it's not that, then most likely the problem is not on the client side. Logs from the server/proxy side would be needed for further debugging. Also, since your problems are not related to this issue, could you file a new issue if you suspect a problem with grpc-go?