I'm having issues using domain names to communicate with an existing cluster via etcdctl.
The problem seems related to #10430, which appeared to be fixed by #10428.
Some info:
$ brew info etcd # provides etcdctl command
etcd: stable 3.3.12 (bottled), HEAD
Key value store for shared configuration and service discovery
https://github.com/etcd-io/etcd
/usr/local/Cellar/etcd/3.3.12 (9 files, 51.6MB) *
Poured from bottle on 2019-02-16 at 13:41:00
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/etcd.rb
$ env | grep -i etcd
ETCDCTL_API=3
$ etcdctl version
etcdctl version: 3.3.12
API version: 3.3
etcd is currently running in a single Docker container; the host machine has 4 Ethernet ports, two of which are bonded. The issue is the following:
$ etcdctl --endpoints=http://10.0.0.161:2379,http://10.0.0.162:2379,http://10.0.0.166:2379,http://br0.sagittarius.<lan.domain.com>:2379,http://eno1.sagittarius.<lan.domain.com>:2379,http://eno2.sagittarius.<lan.domain.com>:2379,http://etcd:2379 endpoint status
Failed to get the status of endpoint http://br0.sagittarius.<lan.domain.com>:2379 (context deadline exceeded)
Failed to get the status of endpoint http://eno2.sagittarius.<lan.domain.com>:2379 (context deadline exceeded)
Failed to get the status of endpoint http://etcd:2379 (context deadline exceeded)
http://10.0.0.161:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10350
http://10.0.0.162:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10350
http://10.0.0.166:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10350
http://eno1.sagittarius.<lan.domain.com>:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10350
Please ignore http://etcd:2379, which exists only for Docker networking purposes.
However, when testing a failing endpoint with a plain HTTP request (via httpie), it seems to work:
$ http POST http://br0.sagittarius.<lan.domain.com>:2379/v3beta/cluster/member/list cluster=default
Content-Length: 475
Content-Type: application/json
Date: Sat, 13 Apr 2019 15:55:47 GMT
{
  "header": {
    "cluster_id": "13381000697838399546",
    "member_id": "12455110354436832465",
    "raft_term": "14"
  },
  "members": [
    {
      "ID": "12455110354436832465",
      "name": "sagittarius",
      "peerURLs": [
        "http://10.0.0.166:2379",
        "http://<domain.com>:2379"
      ],
      "clientURLs": [
        "http://10.0.0.161:2379",
        "http://10.0.0.162:2379",
        "http://10.0.0.166:2379",
        "http://br0.sagittarius.<lan.domain.com>:2379",
        "http://eno1.sagittarius.<lan.domain.com>:2379",
        "http://eno2.sagittarius.<lan.domain.com>:2379",
        "http://etcd:2379"
      ]
    }
  ]
}
DNS resolution for lan.domain.com is handled by a local router running dnsmasq.
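To separate DNS problems from gRPC problems, it can help to confirm that the client machine resolves the name and can open a raw TCP connection at all, independent of etcdctl. Below is a minimal sketch; the hostname is the same placeholder used in this thread and must be substituted, and `check_endpoint` is a hypothetical helper, not an etcd tool.

```shell
#!/usr/bin/env bash
# Sanity check: can this machine resolve the name and open a TCP
# connection to the port, independent of etcdctl/gRPC?

check_endpoint() {
  local host=$1 port=$2
  # DNS resolution check; getent uses the system resolver, i.e. the
  # same dnsmasq path that etcdctl would go through.
  if getent hosts "$host" >/dev/null; then
    echo "resolve ok: $host"
  else
    echo "resolve FAILED: $host"
    return 1
  fi
  # TCP reachability check with a short timeout, using bash's /dev/tcp.
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "tcp ok: $host:$port"
  else
    echo "tcp FAILED: $host:$port"
    return 1
  fi
}

# Placeholder host -- expected to fail until substituted:
check_endpoint "br0.sagittarius.<lan.domain.com>" 2379 || true
```

If resolution succeeds but the TCP check fails, the problem is routing/firewalling rather than DNS; if both succeed while etcdctl still times out, the issue is likely inside the gRPC layer.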
@devster31 context deadline exceeded is an unclear error returned by the gRPC client when it can't establish a connection. You can try setting ETCDCTL_API=2; that might give you a more useful error message.
You can also change some code in etcd to debug this error; see #10087.
@yuqitao Unfortunately the endpoint command wasn't present in v2.
I tried using ETCDCTL_API=2 as you suggested, but unfortunately it didn't really help (fish shell below):
$ env ETCDCTL_API=2 etcdctl --endpoints=http://br0.sagittarius.<lan.domain.com>:2379,http://eno1.sagittarius.<lan.domain.com>:2379,http://eno2.sagittarius.<lan.domain.com>:2379 ls
Error: context deadline exceeded
$ env ETCDCTL_API=2 etcdctl --endpoints=http://br0.sagittarius.<lan.domain.com>:2379,http://eno1.sagittarius.<lan.domain.com>:2379,http://eno2.sagittarius.<lan.domain.com>:2379 cluster-health
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://eno1.sagittarius.<lan.domain.com>:2379 exceeded header timeout
; error #1: client: endpoint http://br0.sagittarius.<lan.domain.com>:2379 exceeded header timeout
; error #2: client: endpoint http://eno2.sagittarius.<lan.domain.com>:2379 exceeded header timeout
error #0: client: endpoint http://eno1.sagittarius.<lan.domain.com>:2379 exceeded header timeout
error #1: client: endpoint http://br0.sagittarius.<lan.domain.com>:2379 exceeded header timeout
error #2: client: endpoint http://eno2.sagittarius.<lan.domain.com>:2379 exceeded header timeout
In addition, as shown above, a plain HTTP request to the same endpoints works without issues.
Similar issue here (context deadline exceeded when using ETCDCTL_API=3). We cannot use API v2, because for that version etcdctl does not provide an --insecure-skip-tls-verify flag (which is the sole reason why we switched to ETCDCTL_API=3).
/cc @illuhad
@urzds How can this be reproduced?
context deadline exceeded is just an unclear error message emitted when the gRPC client fails to establish a network connection before the context times out.
You can try this:
https://github.com/etcd-io/etcd/blob/8146e1ebdf1f54791edf33e85e0c816619c7d9cd/clientv3/options.go#L29
Change false to true; you may then find a more helpful error message about the connection.
diff --git a/clientv3/options.go b/clientv3/options.go
index 4660acea0..af0ed0528 100644
--- a/clientv3/options.go
+++ b/clientv3/options.go
@@ -26,7 +26,7 @@ var (
// where server indicates it did not process the data. gRPC default is "FailFast(true)"
// but for etcd we default to "FailFast(false)" to minimize client request error responses due to
// transient failures.
- defaultFailFast = grpc.FailFast(false)
+ defaultFailFast = grpc.FailFast(true)
// client-side request send limit, gRPC default is math.MaxInt32
// Make sure that "client-side send limit < server-side default send/recv limit"
New output with the above option set to true:
{"level":"warn","ts":"2019-04-24T02:00:23.825+0200","caller":"clientv3/retry_interceptor.go:60","msg":"retrying of unary invoker failed","target":"passthrough:///http://br0.sagittarius.<lan.domain.com>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://br0.sagittarius.<lan.domain.com>:2379 (context deadline exceeded)
{"level":"warn","ts":"2019-04-24T02:00:33.715+0200","caller":"clientv3/retry_interceptor.go:60","msg":"retrying of unary invoker failed","target":"passthrough:///http://eno2.sagittarius.<lan.domain.com>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://eno2.sagittarius.<lan.domain.com>:2379 (context deadline exceeded)
{"level":"warn","ts":"2019-04-24T02:00:38.717+0200","caller":"clientv3/retry_interceptor.go:60","msg":"retrying of unary invoker failed","target":"passthrough:///http://etcd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
http://10.0.0.161:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10394, 0,
http://10.0.0.162:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10394, 0,
http://10.0.0.166:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10394, 0,
http://eno1.sagittarius.<lan.domain.com>:2379, acd970e09a7f3cd1, 3.3.10, 23 MB, true, 14, 10394, 0,
I am having the same issue with 3.3.11. In my case auth is enabled, and the Error: context deadline exceeded appears intermittently. When I run etcdctl with --user root and provide the password, I get the error 3 out of 5 times. If auth is disabled, I don't see the issue. I also see the error more often when a complex password is used, something like 10 characters with special characters and mixed case; with simple passwords I see the error less often. I am using only v3 in my case; v2 is disabled.
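Since the failures described above are intermittent, one low-effort check is whether the default client timeouts are simply too tight for an auth-enabled round trip. A hedged sketch, assuming a reachable endpoint (the address and user are placeholders; --dial-timeout and --command-timeout are standard etcdctl v3 flags, defaulting to 2s and 5s):

```shell
# Raise the client-side timeouts well above the defaults to see
# whether the request eventually succeeds or truly never connects.
ETCDCTL_API=3 etcdctl \
  --endpoints=http://10.0.0.161:2379 \
  --dial-timeout=10s \
  --command-timeout=30s \
  --user root \
  endpoint status
```

If the command succeeds with generous timeouts but fails with the defaults, the cluster is slow (auth adds a round trip) rather than unreachable.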
Hi All,
I am having the same issue with etcd 3.2.24, in a 3-node cluster used by a Kubernetes cluster, with about 25 Raft proposals committed per second.
I noticed an interesting correlation that always holds: every time the etcd_disk_backend_commit_duration p99 metric rises above 100 ms, the etcdctl client times out.
Sometimes a leader election happens when the same metric is above 100 ms.
To avoid these issues, keep the etcd_disk_backend_commit_duration p99 below 20 ms.
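For anyone wanting to check this correlation on their own cluster, the commit-duration histogram can be read straight off etcd's /metrics endpoint. A minimal sketch (the endpoint address is a placeholder):

```shell
# Dump the backend commit duration histogram buckets, sum, and count
# directly from etcd's Prometheus-format metrics endpoint.
curl -s http://10.0.0.161:2379/metrics \
  | grep '^etcd_disk_backend_commit_duration_seconds'
```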
Best Regards
Stefano Gristina
Another consideration.
The etcd FAQ (https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md) says:
To rule out a slow disk from causing this warning, monitor backend_commit_duration_seconds (p99 duration should be less than 25ms) to confirm the disk is reasonably fast.
I observed several leader switches with p99 < 10 ms while, at the same time, the p999 was greater than 300 ms.
So what does this mean? etcd is very sensitive to latency: a p99 below 10 ms for the backend and WAL is not enough; the p999 should be considered too.
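If etcd is scraped by Prometheus, the p99-vs-p999 comparison can be sketched with histogram_quantile over the commit-duration buckets. This assumes a Prometheus server at a placeholder address and uses its standard HTTP query API:

```shell
# Query both quantiles of the backend commit duration over a 5m window.
PROM=http://prometheus:9090
for q in 0.99 0.999; do
  curl -sG "$PROM/api/v1/query" --data-urlencode \
    "query=histogram_quantile($q, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (le))"
  echo
done
```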
Best Regards
Stefano Gristina
If you are running on OpenStack or any other cloud, please make sure to allow the etcd ports in your security groups.
In my case this was the cause of the error.
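As a hedged sketch with the OpenStack CLI (the security group name "etcd-sg" and the CIDR are placeholders, not taken from this thread), allowing the etcd client and peer ports might look like:

```shell
# Allow TCP 2379 (client) and 2380 (peer) from the local subnet
# into the security group that the etcd hosts belong to.
openstack security group rule create --protocol tcp \
  --dst-port 2379:2380 --remote-ip 10.0.0.0/24 etcd-sg
```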
I had the same error.
Changing the endpoints from http:// to https:// solved it for me.
I solved this issue only when I applied the right cert/key pair.
Assuming we're using kubeadm to spin up the cluster, there should be several cert/key pairs under this folder:
# ls -l /etc/kubernetes/pki/etcd/
total 32
-rw-r--r-- 1 root root 1017 Nov 12 15:32 ca.crt
-rw------- 1 root root 1679 Nov 12 15:32 ca.key
-rw-r--r-- 1 root root 1094 Nov 12 15:32 healthcheck-client.crt
-rw------- 1 root root 1675 Nov 12 15:32 healthcheck-client.key
-rw-r--r-- 1 root root 1180 Nov 12 15:32 peer.crt
-rw------- 1 root root 1675 Nov 12 15:32 peer.key
-rw-r--r-- 1 root root 1180 Nov 12 15:32 server.crt
-rw------- 1 root root 1679 Nov 12 15:32 server.key
# etcdctl --version
etcdctl version: 3.3.1
API version: 2
# ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
--cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key
Snapshot saved at snapshot.db
# ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshot.db
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| b9d500f7 | 72966 | 1194 | 4.9 MB |
+----------+----------+------------+------------+
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@devster31 Hi,
was this problem ever solved? I have encountered the same problem; can you tell me how you solved it?

@Abandonsun Not OP, but this error has always meant for me "in your command you tried to run, you messed up something related to the network connection". In your case, try making your endpoint "https" instead of "http". IIRC, I've gotten that error when I didn't have the right certs, when I used 2380 instead of 2379, when I used http instead of https, etc. It's just a vague "check everything in your command related to the initial network connection to the etcd server".
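To make that checklist concrete, here is a rough script-form sketch (the host, cert paths, and file names are placeholders, not taken from this thread); each step isolates one of the usual suspects:

```shell
HOST=10.0.0.161

# 1. Right port? 2379 is the client port; 2380 is peer traffic only.
nc -z -w 2 "$HOST" 2379 && echo "client port open"

# 2. TLS or not? If plain http hangs but this answers, use https://
#    in your --endpoints (-k skips verification for this probe only).
curl -sk "https://$HOST:2379/health" && echo

# 3. Right certs? Hit the health endpoint with explicit client certs.
curl -s --cacert ca.crt --cert client.crt --key client.key \
  "https://$HOST:2379/health" && echo
```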