Etcd: The etcd v3 API isn't returning helpful TLS error messages.

Created on 12 Sep 2018 · 16 comments · Source: etcd-io/etcd

Issue

The etcd v3 API isn't returning helpful TLS error messages. Instead, it simply returns Error: context deadline exceeded. It should return TLS errors more like the ones returned by the v2 API.

etcdctl version: 3.3.6

This V3 API command:

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/cert.pem --key=/etc/etcd/ssl/key.pem member list

Returns:

Error: context deadline exceeded

While the corresponding command on the V2 API:

ETCDCTL_API=2 etcdctl --endpoints=https://127.0.0.1:2379 --ca-file=/etc/etcd/ssl/ca.pem --cert-file=/etc/etcd/ssl/cert.pem --key-file=/etc/etcd/ssl/key.pem member list

Returns:

client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is valid for 10.0.0.26, 10.100.0.1, not 127.0.0.1

With the vague error returned by the v3 API, it was impossible to solve the problem. The v2 error message led me to the issue immediately.
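For context, the v2 error comes straight from Go's standard-library hostname verification. Here is a stdlib-only sketch (not etcd code) that reproduces the exact error text by building a self-signed cert whose IP SANs match the ones in the message above and verifying it against 127.0.0.1:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// mismatchError builds a self-signed cert whose IP SANs omit 127.0.0.1 and
// returns the hostname-verification error a TLS client would hit.
func mismatchError() error {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "etcd"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
		// SANs from the v2 error message above; 127.0.0.1 is missing.
		IPAddresses: []net.IP{net.ParseIP("10.0.0.26"), net.ParseIP("10.100.0.1")},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return err
	}
	// The same check crypto/tls runs during the handshake.
	return cert.VerifyHostname("127.0.0.1")
}

func main() {
	fmt.Println(mismatchError())
	// x509: certificate is valid for 10.0.0.26, 10.100.0.1, not 127.0.0.1
}
```

So the useful error text already exists at the TLS layer; the question is only why the v3 client path swallows it.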

Steps to Reproduce

Download and Build etcd

Follow the instructions here to download and build etcd.

Create Some Test Certs

cd hack/tls-setup

Edit the config/req-csr.json file to have the below JSON (note that the JSON is just the example from hack/tls-setup/README.md with the loopback interface (127.0.0.1) removed).

{
  "CN": "etcd",
  "hosts": [
    "3.8.121.201",
    "46.4.19.20"
  ],
  "key": {
    "algo": "ecdsa",
    "size": 384
  },
  "names": [
    {
      "O": "autogenerated",
      "OU": "etcd cluster",
      "L": "the internet"
    }
  ]
}

Now generate the certs

make

Start etcd

Change directories back to the root of the project and run the following command to start etcd:

./bin/etcd --listen-client-urls=https://127.0.0.1:2379 --advertise-client-urls=https://127.0.0.1:2379 --client-cert-auth=true --cert-file=hack/tls-setup/certs/etcd1.pem --key-file=hack/tls-setup/certs/etcd1-key.pem --trusted-ca-file=hack/tls-setup/certs/ca.pem

Attempt to Connect via V3 API

From a separate terminal window, attempt to connect via the V3 API using the following command:

ETCDCTL_API=3 ./bin/etcdctl --endpoints=https://127.0.0.1:2379 --cert=hack/tls-setup/certs/etcd2.pem --key=hack/tls-setup/certs/etcd2-key.pem --cacert=hack/tls-setup/certs/ca.pem member list

This will return the rather unhelpful:

Error: context deadline exceeded

Attempt to Connect via V2 API

Now attempt to connect via the V2 API using the following command:

ETCDCTL_API=2 ./bin/etcdctl --endpoints=https://127.0.0.1:2379 --cert-file=hack/tls-setup/certs/etcd2.pem --key-file=hack/tls-setup/certs/etcd2-key.pem --ca-file=hack/tls-setup/certs/ca.pem member list

This will return the much more helpful:

client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is valid for 3.8.121.201, 46.4.19.20, not 127.0.0.1

Related

Help Wanted · area/bug · area/tls

All 16 comments

@jingyih @wenjiaswe

@collinkrawll Thanks for reporting this error. I took a quick look at the code: the etcdctl v2 and etcdctl v3 command code are quite different, and the way they handle the error differs as well. In etcdctl v2, ErrClusterUnavailable is plumbed through isConnectionError when the member list command calls mustNewMembersAPI:

https://github.com/etcd-io/etcd/blob/27fc7e2296f506182f58ce846e48f36b34fe6842/etcdctl/ctlv2/command/util.go#L243
https://github.com/etcd-io/etcd/blob/8c80efb886ec798645c1aebf07689a8a733e663b/etcdctl/ctlv3/command/member_command.go#L223

This part of code is wrapped in etcdctl v3, and should be handled here:

https://github.com/etcd-io/etcd/blob/27fc7e2296f506182f58ce846e48f36b34fe6842/etcdctl/ctlv2/command/member_commands.go#L63

But I need to try to reproduce this error and find out where the ErrClusterUnavailable message gets dropped in etcdctl v3. Also, I am not very familiar with this part of the code, so I am worried the useful error message may be dropped elsewhere in etcdctl v3 as well, not just in member add with TLS.

However, I probably won't be able to get back to this for the next two weeks. You are more than welcome to jump in if this is urgent for you, or anyone else, I will be happy to review the PR. Otherwise I will get back to this later.

@wenjiaswe If you haven't gotten back to this, I'd like to dig into it. I haven't contributed here before, but it seems like a good first issue.

@dahc no I haven't and you are more than welcome to take this! Thank you very much! I would be more than happy to review when it's done:)
/assign @dahc

Thanks @dahc and @wenjiaswe for taking a look at this. I've added steps to reproduce the issue at the top.

@wenjiaswe @dahc @collinkrawll
I found that this error is caused by the gRPC FailFastCallOption value.
In etcd/clientv3/options.go:25:

     // client-side handling retrying of request failures where data was not written to the wire or
     // where server indicates it did not process the data. gRPC default is "FailFast(true)"
     // but for etcd we default to "FailFast(false)" to minimize client request error responses due to
     // transient failures.
     defaultFailFast = grpc.FailFast(false)

We can change the defaultFailFast to true.

The gRPC client connects to the server asynchronously. The balancer has a slice which acts as a connection pool, and the balancer.Picker can pick one transport from the pool. The balancer.Picker also has these fields:

     // The latest connection error that happened.
     connErrMu sync.Mutex
     connErr   error

When the connection goroutine catches an error, it updates the connErr field.

 // FailFast configures the action to take when an RPC is attempted on broken
 // connections or unreachable servers.  If failFast is true, the RPC will fail
 // immediately. Otherwise, the RPC client will block the call until a
 // connection is available (or the call is canceled or times out) and will
 // retry the call if it fails due to a transient error.  gRPC will not retry if
 // data was written to the wire unless the server indicates it did not process
 // the data.  Please refer to
 // https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md.
 //
 // By default, RPCs are "Fail Fast".

If FailFast is false, pick can't return an error that includes connErr; it returns only an unclear error like Error: context deadline exceeded.

Also, we could consider opening an issue in gRPC to change the error returned when the call is canceled or times out.
In picker_wrapper.go:119:

     if ch == bp.blockingCh {
             // This could happen when either:
             // - bp.picker is nil (the previous if condition), or
             // - has called pick on the current picker.
             bp.mu.Unlock()
             select {
             case <-ctx.Done():
                     return nil, nil, ctx.Err()
             case <-ch:
             }
             continue
     }

By the way, can the 'test' script help me confirm whether changing FailFast to true affects other code? Or is there something like CI?

@xiang90

@yuqitao we should open a PR in gRPC.


The PR in gRPC is merged.

Let's wait for the next gRPC release and try to bump up the version in etcd repo.

@jingyih have we bumped the gRPC version in etcd/clientv3?

@xiang90 Not yet.

@spzala Do you have bandwidth to help bump grpc version? The fix we need is included in 1.21 and above. Otherwise I can do it sometime next week (I'm currently out of country on vacation).

@jingyih aha... out-of-country vacation :-), that sounds cool. Enjoy!! I will be out for the July 4th holidays from tomorrow afternoon for the rest of the week, but yes, I can start looking at it in earnest over the weekend. Meanwhile, a qq - isn't this already WIP in this PR: https://github.com/etcd-io/etcd/pull/10624, except that we should be moving to the latest release, 1.22? /cc @xiang90

FYI. The fix for this issue is https://github.com/grpc/grpc-go/pull/2777, which is in 1.21 and above.

Since there is already a WIP PR for bumping grpc version, maybe we should continue the effort there, instead of opening another PR.

I did not realize it is a US holiday. Enjoy the long weekend:) @spzala


Yup, agree. Thanks @jingyih !
:) thanks @jingyih
