The etcd v3 API isn't returning helpful TLS error messages. Instead, it simply returns Error: context deadline exceeded. It should return TLS errors that are more like the ones returned by the v2 API.
etcdctl version: 3.3.6
This V3 API command:
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/cert.pem --key=/etc/etcd/ssl/key.pem member list
Returns:
Error: context deadline exceeded
While the corresponding command on the V2 API:
ETCDCTL_API=2 etcdctl --endpoints=https://127.0.0.1:2379 --ca-file=/etc/etcd/ssl/ca.pem --cert-file=/etc/etcd/ssl/cert.pem --key-file=/etc/etcd/ssl/key.pem member list
Returns:
client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is valid for 10.0.0.26, 10.100.0.1, not 127.0.0.1
With the vague error returned by the V3 API, it was impossible to solve the problem. The V2 error message led me to the issue immediately.
Follow the instructions here to download and build etcd.
cd hack/tls-setup
Edit the config/req-csr.json file to have the below JSON (note that the JSON is just the example from hack/tls-setup/README.md with the loopback interface (127.0.0.1) removed).
{
"CN": "etcd",
"hosts": [
"3.8.121.201",
"46.4.19.20"
],
"key": {
"algo": "ecdsa",
"size": 384
},
"names": [
{
"O": "autogenerated",
"OU": "etcd cluster",
"L": "the internet"
}
]
}
Now generate the certs
make
Change directories back to the root of the project and run the following command to start etcd:
./bin/etcd --listen-client-urls=https://127.0.0.1:2379 --advertise-client-urls=https://127.0.0.1:2379 --client-cert-auth=true --cert-file=hack/tls-setup/certs/etcd1.pem --key-file=hack/tls-setup/certs/etcd1-key.pem --trusted-ca-file=hack/tls-setup/certs/ca.pem
From a separate terminal window, attempt to connect via the V3 API using the following command:
ETCDCTL_API=3 ./bin/etcdctl --endpoints=https://127.0.0.1:2379 --cert=hack/tls-setup/certs/etcd2.pem --key=hack/tls-setup/certs/etcd2-key.pem --cacert=hack/tls-setup/certs/ca.pem member list
This will return the rather unhelpful:
Error: context deadline exceeded
Now attempt to connect via the V2 API using the following command:
ETCDCTL_API=2 ./bin/etcdctl --endpoints=https://127.0.0.1:2379 --cert-file=hack/tls-setup/certs/etcd2.pem --key-file=hack/tls-setup/certs/etcd2-key.pem --ca-file=hack/tls-setup/certs/ca.pem member list
This will return the much more helpful:
client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is valid for 3.8.121.201, 46.4.19.20, not 127.0.0.1
@jingyih @wenjiaswe
@collinkrawll Thanks for reporting this error. I took a quick look at the code: the etcdctl v2 and etcdctl v3 command code is quite different, and the two handle errors differently as well. In etcdctl v2, ErrClusterUnavailable is plumbed through isConnectionError when the member list command calls mustNewMembersAPI:
https://github.com/etcd-io/etcd/blob/27fc7e2296f506182f58ce846e48f36b34fe6842/etcdctl/ctlv2/command/util.go#L243
In etcdctl v3 this part of the code is wrapped, and the error should be handled here:
https://github.com/etcd-io/etcd/blob/8c80efb886ec798645c1aebf07689a8a733e663b/etcdctl/ctlv3/command/member_command.go#L223
But I need to try to reproduce this error and find out where the ErrClusterUnavailable message gets dropped in etcdctl v3. I am also not very familiar with this part of the code, so I am worried that the useful error message may be dropped elsewhere in etcdctl v3 too, not just in member add with TLS.
However, I probably won't be able to get back to this for the next two weeks. You are more than welcome to jump in if this is urgent for you, or anyone else, I will be happy to review the PR. Otherwise I will get back to this later.
@wenjiaswe If you haven't gotten back to this, I'd like to dig into it. I haven't contributed here before, but it seems like a good first issue.
@dahc no I haven't and you are more than welcome to take this! Thank you very much! I would be more than happy to review when it's done:)
/assign @dahc
Thanks @dahc and @wenjiaswe for taking a look at this. I've added steps to reproduce the issue at the top.
@wenjiaswe @dahc @collinkrawll
I found that this error is caused by the value of the gRPC FailFast call option.
In etcd/clientv3/options.go:25:
// client-side handling retrying of request failures where data was not written to the wire or
// where server indicates it did not process the data. gRPC default is default is "FailFast(true)"
// but for etcd we default to "FailFast(false)" to minimize client request error responses due to
// transient failures.
defaultFailFast = grpc.FailFast(false)
We could change defaultFailFast to true.
The gRPC client connects to the server asynchronously. The balancer keeps a slice that acts as a connection pool, and balancer.Picker picks one transport from that pool. The picker also has a field that stores the latest connection error:
// The latest connection happened.
connErrMu sync.Mutex
connErr error
When the connecting goroutine hits an error, it updates the connErr field. The gRPC documentation for FailFast reads:
// FailFast configures the action to take when an RPC is attempted on broken
// connections or unreachable servers. If failFast is true, the RPC will fail
// immediately. Otherwise, the RPC client will block the call until a
// connection is available (or the call is canceled or times out) and will
// retry the call if it fails due to a transient error. gRPC will not retry if
// data was written to the wire unless the server indicates it did not process
// the data. Please refer to
// https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md.
//
// By default, RPCs are "Fail Fast".
If FailFast is false, the pick cannot return an error that includes connErr; it returns only an unclear error such as Error: context deadline exceeded.
We could also open an issue in gRPC to change the error that is returned when the call is canceled or times out.
In picker_wrapper.go:119:
if ch == bp.blockingCh {
// This could happen when either:
// - bp.picker is nil (the previous if condition), or
// - has called pick on the current picker.
bp.mu.Unlock()
select {
case <-ctx.Done():
return nil, nil, ctx.Err()
case <-ch:
}
continue
}
By the way, can the 'test' script help me confirm whether changing FailFast to true affects other code? Or is there something like CI?
@xiang90
@yuqitao we should open a PR in gRPC.
The PR in gRPC is merged.
Let's wait for the next gRPC release and try to bump up the version in etcd repo.
@jingyih have we bumped the gRPC version in etcd/clientv3?
@xiang90 Not yet.
@spzala Do you have bandwidth to help bump grpc version? The fix we need is included in 1.21 and above. Otherwise I can do it sometime next week (I'm currently out of country on vacation).
@jingyih aha...out of country vacation :-), that sounds cool. Enjoy!! I will be out for July 4th holidays from afternoon tomorrow for rest of the week but yes I can start looking at it with real work from weekend. Meanwhile a qq - isn't this a WIP in this PR? - https://github.com/etcd-io/etcd/pull/10624 except that we should be moving to the latest release 1.22? /cc @xiang90
FYI. The fix for this issue is https://github.com/grpc/grpc-go/pull/2777, which is in 1.21 and above.
Since there is already a WIP PR for bumping grpc version, maybe we should continue the effort there, instead of opening another PR.
I did not realize it is a US holiday. Enjoy the long weekend:) @spzala
Yup, agree. Thanks @jingyih !
:) thanks @jingyih
Fixed in master branch by https://github.com/etcd-io/etcd/pull/11029/commits/02b27798147444d2ff8defc91caaa20d0ccf40ba
Fixed in v3.3 branch by https://github.com/etcd-io/etcd/commit/830bba337fb3b9a3aab98e8def19c01e356106c1