Grpc-go: Hard to differentiate between expected dial errors and fatal ones

Created on 26 Apr 2018  Â·  3Comments  Â·  Source: grpc/grpc-go

Please answer these questions before submitting your issue.

Downstream etcd issue: https://github.com/coreos/etcd/pull/9623

What version of gRPC are you using?

v1.7.5, but behavior is the same on latest

What version of Go are you using (go version)?

go version go1.10 linux/amd64

What operating system (Linux, Windows, …) and version?

Linux 4.15.17-300.fc27.x86_64

What did you do?

After mis-configuring a gRPC client with the wrong CA, I noticed the error is logged at a WARNING level here. After relaxing our logger to print WARNING messages, we started seeing routine disconnects, apparently caused by the internal gRPC load balancer:

$ etcdctl user list                                                  
WARNING: 2018/04/25 13:06:05 Failed to dial etcd-2.local:22379: grpc: the connection is closing; please retry.
WARNING: 2018/04/25 13:06:05 Failed to dial etcd-1.local:12379: grpc: the connection is closing; please retry.
$ etcdctl user list                                                  
WARNING: 2018/04/25 13:06:05 Failed to dial etcd-1.local:12379: grpc: the connection is closing; please retry.
$ etcdctl user list                                                  
WARNING: 2018/04/25 13:06:06 Failed to dial etcd-3.local:32379: grpc: the connection is closing; please retry.
WARNING: 2018/04/25 13:06:06 Failed to dial etcd-2.local:22379: grpc: the connection is closing; please retry.

What did you expect to see?

I expected these kind of errors logged at different levels.

What did you see instead?

No difference between fatal errors (such as mis-configured TLS), and routine ones.

Maybe something like:

diff --git a/clientconn.go b/clientconn.go
index 9745484..6263b30 100644
--- a/clientconn.go
+++ b/clientconn.go
@@ -925,10 +925,12 @@ func (ac *addrConn) connect() error {
        // Start a goroutine connecting to the server asynchronously.
        go func() {
                if err := ac.resetTransport(); err != nil {
-                       grpclog.Warningf("Failed to dial %s: %v; please retry.", ac.addrs[0].Addr, err)
                        if err != errConnClosing {
+                               grpclog.Errorf("Failed to dial %s: %v; please retry.", ac.addrs[0].Addr, err)
                                // Keep this ac in cc.conns, to get the reason it's torn down.
                                ac.tearDown(err)
+                       } else {
+                               grpclog.Warningf("Failed to dial %s: %v; please retry.", ac.addrs[0].Addr, err)
                        }
                        return
                }
P2 Bug

Most helpful comment

@gyuho Thanks for clarifying!

@ericchiang
In the newer releases, we include the last connection error in RPC errors, (#1855, part of release 1.11). So if you have logs for RPC errors, you will see information about connections there, which could help you tell if the error was fatal or not.

The proposal about log levels in this issue sounds good. We have plan to define our logging policy (how to pick log levels), we will make sure this issue gets covered.

All 3 comments

@menghanl @ericchiang I believe this happens when etcd client manually triggers down(errConnClosing) on unused connections (from old balancer interface in v1.7). We manually close connections after pinning the first connection out of multiple endpoints. We are upgrading gRPC with balancer rewrite. In our new implementation, we are getting rid of such manual connection close.

@gyuho Thanks for clarifying!

@ericchiang
In the newer releases, we include the last connection error in RPC errors, (#1855, part of release 1.11). So if you have logs for RPC errors, you will see information about connections there, which could help you tell if the error was fatal or not.

The proposal about log levels in this issue sounds good. We have plan to define our logging policy (how to pick log levels), we will make sure this issue gets covered.

The logging doc was submitted in #2033, and sweeping through our logging is tracked in #2146.
Closing this issue now.

Was this page helpful?
0 / 5 - 0 ratings