I'm having issues using domain names to communicate with an existing cluster via etcdctl.
The problem seems related to #10430, which appeared to be fixed by #10428.
Some info:
$ brew info etcd # provides etcdctl command
etcd: stable 3.3.12 (bottled), HEAD
Key value store for shared configuration and service discovery
https://github.com/etcd-io/etcd
/usr/local/Cellar/etcd/3.3.12 (9 files, 51.6MB) *
Poured from bottle on 2019-02-16 at 13:41:00
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/etcd.rb
$ env | grep -i etcd
ETCDCTL_API=3
$ etcdctl version
etcdctl version: 3.3.12
API version: 3.3
etcd is currently running in a single Docker container; the host machine has 4 Ethernet ports, two of which are bonded. The issue is the following:
$ etcdctl --endpoints=http://10.0.0.161:2379,http://10.0.0.162:2379,http://10.0.0.166:2379,http://br0.sagittarius.<lan.domain.com>:2379,http://eno1.sagittarius.<lan.domain.com>:2379,http://eno2.sagittarius.<lan.domain.com>:2379,http://etcd:2379 endpoint status
Failed to get the status of endpoint http://br0.sagittarius.<lan.domain.com>:2379 (context deadline exceeded)
Failed to get the status of endpoint http://eno2.sagittarius.<lan.domain.com>:2379 (context deadline exceeded)
Failed to get the status of endpoint http://etcd:2379 (context deadline exceeded)
http://10.0.0.161:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10350
http://10.0.0.162:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10350
http://10.0.0.166:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10350
http://eno1.sagittarius.<lan.domain.com>:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10350
Please ignore http://etcd:2379, which exists only for Docker networking purposes.
However, when testing a failing endpoint with a plain HTTP request (via httpie), it seems to work:
$ http POST http://br0.sagittarius.<lan.domain.com>:2379/v3beta/cluster/member/list cluster=default
Content-Length: 475
Content-Type: application/json
Date: Sat, 13 Apr 2019 15:55:47 GMT
{
  "header": {
    "cluster_id": "13381000697838399546",
    "member_id": "12455110354436832465",
    "raft_term": "14"
  },
  "members": [
    {
      "ID": "12455110354436832465",
      "name": "sagittarius",
      "peerURLs": [
        "http://10.0.0.166:2379",
        "http://<domain.com>:2379"
      ],
      "clientURLs": [
        "http://10.0.0.161:2379",
        "http://10.0.0.162:2379",
        "http://10.0.0.166:2379",
        "http://br0.sagittarius.<lan.domain.com>:2379",
        "http://eno1.sagittarius.<lan.domain.com>:2379",
        "http://eno2.sagittarius.<lan.domain.com>:2379",
        "http://etcd:2379"
      ]
    }
  ]
}
DNS resolution for lan.domain.com is handled by a local router running dnsmasq.
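To separate DNS problems from gRPC problems, it can help to confirm that the client machine resolves the name and can open a raw TCP connection at all, independent of etcdctl. Below is a minimal sketch; the hostname is the same placeholder used in this thread and must be substituted, and `check_endpoint` is a hypothetical helper, not an etcd tool.

```shell
#!/usr/bin/env bash
# Sanity check: can this machine resolve the name and open a TCP
# connection to the port, independent of etcdctl/gRPC?

check_endpoint() {
  local host=$1 port=$2
  # DNS resolution check; getent uses the system resolver, i.e. the
  # same dnsmasq path that etcdctl would go through.
  if getent hosts "$host" >/dev/null; then
    echo "resolve ok: $host"
  else
    echo "resolve FAILED: $host"
    return 1
  fi
  # TCP reachability check with a short timeout, using bash's /dev/tcp.
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "tcp ok: $host:$port"
  else
    echo "tcp FAILED: $host:$port"
    return 1
  fi
}

# Placeholder host -- expected to fail until substituted:
check_endpoint "br0.sagittarius.<lan.domain.com>" 2379 || true
```

If resolution succeeds but the TCP check fails, the problem is routing/firewalling rather than DNS; if both succeed while etcdctl still times out, the issue is likely inside the gRPC layer.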
@devster31 context deadline exceeded is an unclear error returned by the gRPC client when it can't establish a connection. You can try setting ETCDCTL_API=2; that might give you a more useful error message.
You can also change some code in etcd to debug this error; see #10087.
@yuqitao Unfortunately the endpoint command wasn't present in v2.
I tried using ETCDCTL_API=2 as you suggested, but unfortunately it didn't really help (fish shell below):
$ env ETCDCTL_API=2 etcdctl --endpoints=http://br0.sagittarius.<lan.domain.com>:2379,http://eno1.sagittarius.<lan.domain.com>:2379,http://eno2.sagittarius.<lan.domain.com>:2379 ls
Error: context deadline exceeded
$ env ETCDCTL_API=2 etcdctl --endpoints=http://br0.sagittarius.<lan.domain.com>:2379,http://eno1.sagittarius.<lan.domain.com>:2379,http://eno2.sagittarius.<lan.domain.com>:2379 cluster-health
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://eno1.sagittarius.<lan.domain.com>:2379 exceeded header timeout
; error #1: client: endpoint http://br0.sagittarius.<lan.domain.com>:2379 exceeded header timeout
; error #2: client: endpoint http://eno2.sagittarius.<lan.domain.com>:2379 exceeded header timeout
error #0: client: endpoint http://eno1.sagittarius.<lan.domain.com>:2379 exceeded header timeout
error #1: client: endpoint http://br0.sagittarius.<lan.domain.com>:2379 exceeded header timeout
error #2: client: endpoint http://eno2.sagittarius.<lan.domain.com>:2379 exceeded header timeout
In addition, as shown above, a plain HTTP request to the same endpoints works without issues.
Similar issue here (context deadline exceeded when using ETCDCTL_API=3). We cannot use API v2, because for that version etcdctl does not provide an --insecure-skip-tls-verify flag (which is the sole reason why we switched to ETCDCTL_API=3).
/cc @illuhad
@urzds How can this be reproduced?
context deadline exceeded is just an unclear error message emitted when the gRPC client fails to establish a network connection before the context times out.
You can try this:
https://github.com/etcd-io/etcd/blob/8146e1ebdf1f54791edf33e85e0c816619c7d9cd/clientv3/options.go#L29
Change false to true; you may then find a more helpful error message about the connection.
diff --git a/clientv3/options.go b/clientv3/options.go
index 4660acea0..af0ed0528 100644
--- a/clientv3/options.go
+++ b/clientv3/options.go
@@ -26,7 +26,7 @@ var (
// where server indicates it did not process the data. gRPC default is "FailFast(true)"
// but for etcd we default to "FailFast(false)" to minimize client request error responses due to
// transient failures.
- defaultFailFast = grpc.FailFast(false)
+ defaultFailFast = grpc.FailFast(true)
// client-side request send limit, gRPC default is math.MaxInt32
// Make sure that "client-side send limit < server-side default send/recv limit"
New output with the above option set to true:
{"level":"warn","ts":"2019-04-24T02:00:23.825+0200","caller":"clientv3/retry_interceptor.go:60","msg":"retrying of unary invoker failed","target":"passthrough:///http://br0.sagittarius.<lan.domain.com>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://br0.sagittarius.<lan.domain.com>:2379 (context deadline exceeded)
{"level":"warn","ts":"2019-04-24T02:00:33.715+0200","caller":"clientv3/retry_interceptor.go:60","msg":"retrying of unary invoker failed","target":"passthrough:///http://eno2.sagittarius.<lan.domain.com>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://eno2.sagittarius.<lan.domain.com>:2379 (context deadline exceeded)
{"level":"warn","ts":"2019-04-24T02:00:38.717+0200","caller":"clientv3/retry_interceptor.go:60","msg":"retrying of unary invoker failed","target":"passthrough:///http://etcd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
http://10.0.0.161:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10394, 0,
http://10.0.0.162:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10394, 0,
http://10.0.0.166:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10394, 0,
http://eno1.sagittarius.<lan.domain.com>:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10394, 0,
I am having the same issue with 3.3.11. In my case auth is enabled, and the Error: context deadline exceeded appears intermittently. When I run etcdctl with --user root and provide the password, I get the error 3 out of 5 times. If auth is disabled, I don't see the issue. I also see the error more often when a complex password is used, something like 10 characters with special characters and mixed case; with simple passwords I see the error less often. I am using only v3 in my case; v2 is disabled.
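Since the failures described above are intermittent, one low-effort check is whether the default client timeouts are simply too tight for an auth-enabled round trip. A hedged sketch, assuming a reachable endpoint (the address and user are placeholders; --dial-timeout and --command-timeout are standard etcdctl v3 flags, defaulting to 2s and 5s):

```shell
# Raise the client-side timeouts well above the defaults to see
# whether the request eventually succeeds or truly never connects.
ETCDCTL_API=3 etcdctl \
  --endpoints=http://10.0.0.161:2379 \
  --dial-timeout=10s \
  --command-timeout=30s \
  --user root \
  endpoint status
```

If the command succeeds with generous timeouts but fails with the defaults, the cluster is slow (auth adds a round trip) rather than unreachable.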
Hi All,
I am having the same issue with etcd 3.2.24, in a 3-node cluster used by a Kubernetes cluster, with about 25 Raft proposals committed per second.
I noticed an interesting correlation that always holds: every time the etcd_disk_backend_commit_duration p99 metric rises above 100 ms, the etcdctl client times out.
Sometimes a leader election happens when the same metric is above 100 ms.
To avoid these issues, keep the etcd_disk_backend_commit_duration p99 below 20 ms.
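For anyone wanting to check this correlation on their own cluster, the commit-duration histogram can be read straight off etcd's /metrics endpoint. A minimal sketch (the endpoint address is a placeholder):

```shell
# Dump the backend commit duration histogram buckets, sum, and count
# directly from etcd's Prometheus-format metrics endpoint.
curl -s http://10.0.0.161:2379/metrics \
  | grep '^etcd_disk_backend_commit_duration_seconds'
```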
Best Regards
Stefano Gristina
Another consideration.
The etcd FAQ (https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md) says:
To rule out a slow disk from causing this warning, monitor backend_commit_duration_seconds (p99 duration should be less than 25ms) to confirm the disk is reasonably fast.
I observed several leader switches with p99 < 10 ms while, at the same time, the p999 was greater than 300 ms.
So what does this mean? etcd is very sensitive to latency: a p99 below 10 ms for the backend and WAL is not enough; the p999 should be considered too.
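If etcd is scraped by Prometheus, the p99-vs-p999 comparison can be sketched with histogram_quantile over the commit-duration buckets. This assumes a Prometheus server at a placeholder address and uses its standard HTTP query API:

```shell
# Query both quantiles of the backend commit duration over a 5m window.
PROM=http://prometheus:9090
for q in 0.99 0.999; do
  curl -sG "$PROM/api/v1/query" --data-urlencode \
    "query=histogram_quantile($q, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (le))"
  echo
done
```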
Best Regards
Stefano Gristina
If you are running on OpenStack or any other cloud, please make sure to allow the etcd ports in your security groups.
In my case this was the cause of the error.
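As a hedged sketch with the OpenStack CLI (the security group name "etcd-sg" and the CIDR are placeholders, not taken from this thread), allowing the etcd client and peer ports might look like:

```shell
# Allow TCP 2379 (client) and 2380 (peer) from the local subnet
# into the security group that the etcd hosts belong to.
openstack security group rule create --protocol tcp \
  --dst-port 2379:2380 --remote-ip 10.0.0.0/24 etcd-sg
```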
I had the same error.
Changing the endpoints from http:// to https:// solved it for me.
I solved this issue only when I applied the right cert/key pair.
Assuming we're using kubeadm to spin up the cluster, there should be several cert/key pairs under this folder:
# ls -l /etc/kubernetes/pki/etcd/
total 32
-rw-r--r-- 1 root root 1017 Nov 12 15:32 ca.crt
-rw------- 1 root root 1679 Nov 12 15:32 ca.key
-rw-r--r-- 1 root root 1094 Nov 12 15:32 healthcheck-client.crt
-rw------- 1 root root 1675 Nov 12 15:32 healthcheck-client.key
-rw-r--r-- 1 root root 1180 Nov 12 15:32 peer.crt
-rw------- 1 root root 1675 Nov 12 15:32 peer.key
-rw-r--r-- 1 root root 1180 Nov 12 15:32 server.crt
-rw------- 1 root root 1679 Nov 12 15:32 server.key
# etcdctl --version
etcdctl version: 3.3.1
API version: 2
# ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
--cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key
Snapshot saved at snapshot.db
# ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshot.db
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| b9d500f7 | 72966 | 1194 | 4.9 MB |
+----------+----------+------------+------------+
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@devster31 Hi,
was this problem ever solved? I have encountered the same problem; can you tell me how you solved it?

@Abandonsun Not OP, but this error has always meant for me "in your command you tried to run, you messed up something related to the network connection". In your case, try making your endpoint "https" instead of "http". IIRC, I've gotten that error when I didn't have the right certs, when I used 2380 instead of 2379, when I used http instead of https, etc. It's just a vague "check everything in your command related to the initial network connection to the etcd server".
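To make that checklist concrete, here is a rough script-form sketch (the host, cert paths, and file names are placeholders, not taken from this thread); each step isolates one of the usual suspects:

```shell
HOST=10.0.0.161

# 1. Right port? 2379 is the client port; 2380 is peer traffic only.
nc -z -w 2 "$HOST" 2379 && echo "client port open"

# 2. TLS or not? If plain http hangs but this answers, use https://
#    in your --endpoints (-k skips verification for this probe only).
curl -sk "https://$HOST:2379/health" && echo

# 3. Right certs? Hit the health endpoint with explicit client certs.
curl -s --cacert ca.crt --cert client.crt --key client.key \
  "https://$HOST:2379/health" && echo
```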