We are running a three-node etcd cluster. The etcd members run inside Docker containers.
We are using the quay.io/coreos/etcd:v3.1.7 Docker image, and SSL is enabled on our etcd members.
One day the etcd3 node/container went down. We brought it back, and since then it has never become healthy and is leaking memory. The etcd container runs on a 32 GB host, bloats all the way up to 32 GB, and crashes. Within 34 minutes it had bloated to roughly 10 GB.
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742         559       30318           9         863       30743
Swap:          2047         203        1843
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742         670       30187           9         884       30633
Swap:          2047         203        1843
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742        1616       29100          10        1024       29684
Swap:          2047         201        1846
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742        3524       27056          10        1161       27773
Swap:          2047         199        1847
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742        7022       23537          11        1182       24274
Swap:          2047         198        1848
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742       10335       20228          12        1178       20960
Swap:          2047         198        1849
[root@etcd3 dharjain]# docker ps
CONTAINER ID        IMAGE                        COMMAND                  CREATED             STATUS              PORTS               NAMES
986e31c9ec50        quay.io/coreos/etcd:v3.1.7   "/usr/local/bin/etcd "   34 minutes ago      Up 34 minutes                           etcd
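To watch the container's growth directly rather than sampling free -m by hand, a simple loop over docker stats works (a sketch; --no-stream takes a one-shot sample and is available in this Docker version):
while true; do
  date                             # timestamp each sample
  docker stats --no-stream etcd   # one-shot CPU/memory sample for the etcd container
  sleep 60
done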
Host OS and kernel info:
[root@etcd3 dharjain]# cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
[root@etcd3 dharjain]# uname -a
Linux etcd3.bluemix.net 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Docker info:
[root@etcd3 dharjain]# docker info
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 5
Server Version: 1.12.6
Storage Driver: devicemapper
Pool Name: docker-202:2-467209-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 2.006 GB
Data Space Total: 107.4 GB
Data Space Available: 85.22 GB
Metadata Space Used: 3.269 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.144 GB
Thin Pool Minimum Free Space: 10.74 GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
Volume: local
Network: null host bridge overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.26.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 16
Total Memory: 31 GiB
Name: etcd3.bi.services.us-south.bluemix.net
ID: OK2J:L4PH:22PR:JFIP:6AAJ:DTOF:D4FY:7CJX:RSQI:KMUU:POSM:BL32
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
127.0.0.0/8
Registries: docker.io (secure)
[root@etcd3 dharjain]#
Hi @djain13, just for context, what was your process for bringing the node back into the cluster? Thanks!
@hexfusion
Here is what we follow:
1. Remove the bad member via the v2 members DELETE API:
curl --cacert /certs/ca.cert --key /certs/node.key --cert /certs/node.cert https://<private-ip-of-leader>:2379/v2/members/<member-id-of-bad-node> -XDELETE
2. Clean up the etcd data from the Docker host. We keep the etcd data on the host and mount it into the container.
3. Add a new member using etcdctl member add:
docker exec -it etcd etcdctl --ca-file /certs/ca.cert --key-file /certs/node.key --cert-file /certs/node.cert --endpoint https://<private-ip-of-leader>:2379 member add etcd2 https://<peer-ip>:2380
4. Start the replacement member:
docker run -d --restart=always --net=host \
  -v /opt/etcd/backup:/etcd-bkp -v /opt/etcd/certs:/certs \
  --volume=/opt/etcd/data:/etcd2.etcd --name etcd quay.io/coreos/etcd:v3.1.7 \
  /usr/local/bin/etcd --data-dir=etcd2.etcd --name etcd2 \
  --initial-advertise-peer-urls https://<peer-ip>:2380 --listen-peer-urls https://<peer-ip>:2380 \
  --advertise-client-urls https://<peer-ip>:2379 --listen-client-urls https://<peer-ip>:2379 \
  --initial-cluster etcd2=https://<peer-ip2>:2380,etcd0=https://<peer-ip0>:2380,etcd1=https://<peer-ip1>:2380 \
  --initial-cluster-state existing \
  --client-cert-auth --cert-file=/certs/node.cert --key-file=/certs/node.key --trusted-ca-file=/certs/ca.cert \
  --peer-client-cert-auth --peer-cert-file=/certs/node.cert --peer-key-file=/certs/node.key --peer-trusted-ca-file=/certs/ca.cert
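A quick sanity check after step 4, to confirm the new member registered (a sketch, reusing the same cert flags as above):
docker exec -it etcd etcdctl --ca-file /certs/ca.cert --key-file /certs/node.key --cert-file /certs/node.cert --endpoint https://<private-ip-of-leader>:2379 member list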
One more thing: on the healthy members we have close to 400 MB of etcd data, so every time we add the member back, the newly added node has to sync up ~400 MB. Could that be a concern?
@djain13 great, thanks. I will try to play with this tonight; to confirm, you're using SSD drives? Could you also post the logs from node 3, please? That should shed a bit of light. Also, have you tried anything more recent from 3.1.x?
Thanks for looking into this issue @hexfusion. Yes, the etcd data is stored on SSD drives.
Some more context on how we use etcd: we have a pool of close to 400 hosts running the Docker daemon. Containers scheduled on these hosts are networked using Docker overlay networks, and we use etcd to store the data related to those overlay networks.
Here are the logs from the etcd container that is leaking memory:
etcd02-12132017.log
Let me know if you need more info.
@djain13, sorry for the delay; I will have time this weekend to review. Thanks!
@hexfusion
I enabled swap on the host using the swapon -a command, and since then the newly added etcd container/member has not been leaking memory. For the last 24 hours the container has been up and has stayed at 1.5 GB.
But the node is still unhealthy, even though the overall cluster is healthy.
I made several passes through the log but couldn't find anything other than the occasional httpv2 call timing out, and I see similar random errors on a healthy node as well.
I wanted to understand: what does it mean for a node to be unhealthy, and for what reasons does a node qualify as unhealthy? I couldn't find that in the etcd documentation. I also ran the node in debug mode but didn't get many logs at the debug level.
Some more updates:
[root@etcd3 dharjain]# docker ps
CONTAINER ID        IMAGE                        COMMAND                  CREATED             STATUS              PORTS               NAMES
90df93922c8f        quay.io/coreos/etcd:v3.1.7   "/usr/local/bin/etcd "   23 hours ago        Up 23 hours                             etcd
[root@etcd3 dharjain]#
[root@etcd3 dharjain]#
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742        1551       27603         201        2587       29493
Swap:          2047         183        1864
[root@etcd3 dharjain]#
[root@etcd3 dharjain]# docker exec -it etcd etcdctl -C https://<ip-of-cluster-member>:2379 --cert-file /certs/node.cert --key-file /certs/node.key --ca-file /certs/ca.cert cluster-health
2017-12-15 13:23:33.313064 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-12-15 13:23:33.316380 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
member 61f9be4275db04aa is unhealthy: got unhealthy result from https://<member3>:2379
member a2ce6c1e04b4a127 is healthy: got healthy result from https://<member1>:2379
member b169a1692c599ddd is healthy: got healthy result from https://<member2>:2379
cluster is healthy
[root@etcd3 dharjain]#
@hexfusion
Some more observations:
The unhealthy node reports a leader for the cluster, and with etcdctl ls I could fetch keys from the unhealthy node. Now I am confused: is the node really unhealthy, or is the etcdctl cluster-health command reporting health incorrectly?
@hexfusion
Have you had a chance to look into this issue? Also, can you respond to the question below, or point me to the appropriate documentation?
I wanted to understand: what does it mean for a node to be unhealthy, and for what reasons does a node qualify as unhealthy? I couldn't find that in the etcd documentation. I also ran the node in debug mode but didn't get many logs at the debug level.
@djain13 taking a look today, probably tonight; I have a loop creating a large data dir for testing now.
@djain13 sorry this is taking so long, but perhaps you can help me while I get my environment set up to handle these larger tests. One metric used to label a node unhealthy is proposals_committed_total. My theory is that because you are basically starting from scratch with the new node and a decent-sized dataset, the replaced member never quite catches up. Let's test that by getting some metrics on the cluster. Can you add Prometheus into the mix? Here is a link to a setup UI; I used it to test another issue using v2 with compose, but you should be able to extract the Prometheus config and get that working. Then let's review that data and see what is going on. I agree a writeup should be done on what "healthy" means and how the code defines it; the metrics will help paint that picture.
Also, you can curl http://127.0.0.1:2379/metrics | grep $metric without setting anything up, but the graphing is kind of nice.
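For reference, a minimal Prometheus scrape config for a TLS-enabled cluster like this one (a sketch; the cert paths and <member-ip> placeholders are assumptions carried over from the commands above):
cat > prometheus.yml <<'EOF'
# Scrape all three etcd members over HTTPS using the cluster's client certs.
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: etcd
    scheme: https
    tls_config:
      ca_file: /certs/ca.cert
      cert_file: /certs/node.cert
      key_file: /certs/node.key
    static_configs:
      - targets: ['<member1-ip>:2379', '<member2-ip>:2379', '<member3-ip>:2379']
EOF
./prometheus --config.file=prometheus.yml   # flag style for Prometheus 2.x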
@hexfusion
Thanks, I will set up Prometheus. Until then, some updates:
Output from the /metrics API:
# HELP etcd_server_proposals_applied_total The total number of consensus proposals applied.
# TYPE etcd_server_proposals_applied_total gauge
etcd_server_proposals_applied_total 1.1814133734e+10
# HELP etcd_server_proposals_committed_total The total number of consensus proposals committed.
# TYPE etcd_server_proposals_committed_total gauge
etcd_server_proposals_committed_total 1.1814136396e+10
# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
etcd_server_proposals_failed_total 1967
# HELP etcd_server_proposals_pending The current number of pending proposals to commit.
# TYPE etcd_server_proposals_pending gauge
etcd_server_proposals_pending 0
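In the meantime, here is a quick way to watch the committed/applied gap straight from the metrics endpoint (a sketch; the cert paths and member IP are placeholders as above):
curl -s --cacert /certs/ca.cert --key /certs/node.key --cert /certs/node.cert \
    https://<member3-ip>:2379/metrics \
  | awk '/^etcd_server_proposals_applied_total/   {a = $2}
         /^etcd_server_proposals_committed_total/ {c = $2}
         END {printf "commit/apply lag: %.0f proposals\n", c - a}'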
@djain13 The cluster-health command is the same as curl -L http://localhost:2379/health; it just pings your client URL. Do you have server logs from the moment you got the unhealthy status?
@djain13 does this issue exist on the latest version of etcd, such as v3.2.12?
@gyuho
That was the first thing I tried. I don't see any specific log lines when I run etcdctl cluster-health or hit /health.
Here is the output from /health: {"health": "false"}
@djain13 Maybe try a raw curl command (without Docker)? Also try out the latest release, as @fanminshi suggested.
@djain13 does this issue exist on the latest version of etcd, such as v3.2.12?
We haven't tried the latest version yet; we are on {"etcdserver":"3.1.7","etcdcluster":"3.1.0"}. Do you think moving to the latest version will help? If the version were the issue, all the nodes in the cluster should have shown the same problem.
@djain13 Maybe try a raw curl command (without Docker)?
Output from raw curl:
[root@etcd2 dharjain]# curl --cacert /opt/bicloud/etcd/certs/ca.cert --key /opt/bicloud/etcd/certs/node.key --cert /opt/bicloud/etcd/certs/node.cert https://<unhealthy-node-ip>:2379/health
{"health": "false"}
@djain13 That's strange... I would not suggest upgrading just because of this. It would be great if you could reproduce this behavior without affecting the production cluster.
@gyuho One way to reproduce this issue is to first create a new etcd cluster from a snapshot of the production cluster, then try to add a new member to the test cluster and see if the memory leak still occurs.
We can test the new cluster with both v3.1.7 and v3.2.12 without affecting the production cluster. What do you think?
@fanminshi Agree. Just make sure to read the upgrade guide https://github.com/coreos/etcd/blob/master/Documentation/upgrades/upgrade_3_2.md. Most breaking changes are in the v3 client; nothing should break with v2.
@djain13 it would be best if you can reproduce this issue without affecting the existing production cluster.
Thanks @gyuho and @fanminshi.
Can you point me to the steps to create a new cluster from a snapshot of the production cluster?
@djain13 read this guide https://github.com/coreos/etcd/blob/master/Documentation/op-guide/recovery.md
@djain13 If you have only v2 data, also see https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery.
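Following those guides, a minimal sketch of standing up a test cluster from production data (all paths and IPs are placeholders; which path applies depends on whether the data is v3 or v2):
# v3 data: snapshot a healthy member, then restore the snapshot for a test node
ETCDCTL_API=3 etcdctl --endpoints https://<healthy-member-ip>:2379 \
  --cacert /certs/ca.cert --cert /certs/node.cert --key /certs/node.key \
  snapshot save /tmp/prod.snap.db
ETCDCTL_API=3 etcdctl snapshot restore /tmp/prod.snap.db \
  --name test0 \
  --initial-cluster test0=https://<test-ip>:2380 \
  --initial-cluster-token etcd-test \
  --initial-advertise-peer-urls https://<test-ip>:2380
# v2 data: copy the keyspace out of the data dir instead (per the v2 admin guide),
# then start a single test member on the backup dir with --force-new-cluster
etcdctl backup --data-dir /opt/etcd/data --backup-dir /tmp/etcd-v2-backup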
@hexfusion @gyuho @fanminshi
While I bring up a parallel test cluster, here are some metrics from the bad node:
# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 1
# HELP etcd_server_proposals_applied_total The total number of consensus proposals applied.
# TYPE etcd_server_proposals_applied_total gauge
etcd_server_proposals_applied_total 1.1834524999e+10
# HELP etcd_server_proposals_committed_total The total number of consensus proposals committed.
# TYPE etcd_server_proposals_committed_total gauge
etcd_server_proposals_committed_total 1.1834530489e+10
# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
etcd_server_proposals_failed_total 2435
# HELP etcd_server_proposals_pending The current number of pending proposals to commit.
# TYPE etcd_server_proposals_pending gauge
etcd_server_proposals_pending 0
I see a steady rise in the etcd_server_proposals_failed_total metric.
I also see a huge difference between etcd_server_proposals_committed_total and etcd_server_proposals_applied_total. On a good node this difference is negligible; in fact, on the leader the two metrics are identical.
Is there any specific reason for proposals to fail? Can this be because of a slow disk or network? We store the etcd data on SSDs.
@djain13 There are many ways for raft proposals to fail: leader elections, server downtime, slow disks, etc. The best way to troubleshoot is to set up Prometheus monitoring (preferably with a dashboard) and look at the server warnings at the moments the spikes happen.
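For example, once Prometheus is scraping the members, the failure rate over time can be pulled straight from its HTTP API (a sketch; the Prometheus host and instance label are placeholders):
curl -s 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=rate(etcd_server_proposals_failed_total{instance="<member3-ip>:2379"}[5m])'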
@djain13 is there anything else we can help with on this issue?
Closing due to inactivity. Please create a new issue with reproducible steps.