We are running a three-node etcd cluster. The etcd members run inside Docker containers.
We are using the quay.io/coreos/etcd:v3.1.7 Docker image, and SSL is enabled on our etcd members.
One day the etcd3 node/container went down. We brought it back, and since then it has never become healthy and is leaking memory. The etcd container runs on a 32 GB host, bloats all the way up to 32 GB, and crashes. Within 34 minutes it had bloated to roughly 10 GB.
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742         559       30318           9         863       30743
Swap:          2047         203        1843
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742         670       30187           9         884       30633
Swap:          2047         203        1843
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742        1616       29100          10        1024       29684
Swap:          2047         201        1846
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742        3524       27056          10        1161       27773
Swap:          2047         199        1847
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742        7022       23537          11        1182       24274
Swap:          2047         198        1848
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742       10335       20228          12        1178       20960
Swap:          2047         198        1849
[root@etcd3 dharjain]# docker ps
CONTAINER ID        IMAGE                        COMMAND                  CREATED             STATUS              PORTS               NAMES
986e31c9ec50        quay.io/coreos/etcd:v3.1.7   "/usr/local/bin/etcd "   34 minutes ago      Up 34 minutes                           etcd
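To watch the container's growth directly rather than sampling free -m by hand, a simple loop over docker stats works (a sketch; --no-stream takes a one-shot sample and is available in this Docker version):
while true; do
  date                             # timestamp each sample
  docker stats --no-stream etcd   # one-shot CPU/memory sample for the etcd container
  sleep 60
done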
Host OS and kernel info:
[root@etcd3 dharjain]# cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
[root@etcd3 dharjain]# uname -a
Linux etcd3.bluemix.net 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Docker info:
[root@etcd3 dharjain]# docker info
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 5
Server Version: 1.12.6
Storage Driver: devicemapper
Pool Name: docker-202:2-467209-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 2.006 GB
Data Space Total: 107.4 GB
Data Space Available: 85.22 GB
Metadata Space Used: 3.269 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.144 GB
Thin Pool Minimum Free Space: 10.74 GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
Volume: local
Network: null host bridge overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.26.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 16
Total Memory: 31 GiB
Name: etcd3.bi.services.us-south.bluemix.net
ID: OK2J:L4PH:22PR:JFIP:6AAJ:DTOF:D4FY:7CJX:RSQI:KMUU:POSM:BL32
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
127.0.0.0/8
Registries: docker.io (secure)
[root@etcd3 dharjain]#
Hi @djain13, just for context, what was your process for bringing the node back into the cluster? Thanks!
@hexfusion
Here is what we follow:
1. Remove the bad member via the v2 members DELETE API:
curl --cacert /certs/ca.cert --key /certs/node.key --cert /certs/node.cert https://<private-ip-of-leader>:2379/v2/members/<member-id-of-bad-node> -XDELETE
2. Clean up the etcd data from the Docker host. We keep the etcd data on the host and mount it into the container.
3. Add a new member using etcdctl member add:
docker exec -it etcd etcdctl --ca-file /certs/ca.cert --key-file /certs/node.key --cert-file /certs/node.cert --endpoint https://<private-ip-of-leader>:2379 member add etcd2 https://<peer-ip>:2380
4. Start the replacement member:
docker run -d --restart=always --net=host \
  -v /opt/etcd/backup:/etcd-bkp -v /opt/etcd/certs:/certs \
  --volume=/opt/etcd/data:/etcd2.etcd --name etcd quay.io/coreos/etcd:v3.1.7 \
  /usr/local/bin/etcd --data-dir=etcd2.etcd --name etcd2 \
  --initial-advertise-peer-urls https://<peer-ip>:2380 --listen-peer-urls https://<peer-ip>:2380 \
  --advertise-client-urls https://<peer-ip>:2379 --listen-client-urls https://<peer-ip>:2379 \
  --initial-cluster etcd2=https://<peer-ip2>:2380,etcd0=https://<peer-ip0>:2380,etcd1=https://<peer-ip1>:2380 \
  --initial-cluster-state existing \
  --client-cert-auth --cert-file=/certs/node.cert --key-file=/certs/node.key --trusted-ca-file=/certs/ca.cert \
  --peer-client-cert-auth --peer-cert-file=/certs/node.cert --peer-key-file=/certs/node.key --peer-trusted-ca-file=/certs/ca.cert
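A quick sanity check after step 4, to confirm the new member registered (a sketch, reusing the same cert flags as above):
docker exec -it etcd etcdctl --ca-file /certs/ca.cert --key-file /certs/node.key --cert-file /certs/node.cert --endpoint https://<private-ip-of-leader>:2379 member list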
One more thing: on the healthy members we have close to 400 MB of etcd data, so every time we add the member back, the newly added node has to sync up ~400 MB. Could that be a concern?
@djain13 great, thanks. I will try to play with this tonight; to confirm, you're using SSD drives? Could you also post the logs from node 3, please? That should shed a bit of light. Also, have you tried anything more recent from 3.1.x?
Thanks for looking into this issue @hexfusion. Yes, the etcd data is stored on SSD drives.
Some more context on how we use etcd: we have a pool of close to 400 hosts running the Docker daemon. Containers scheduled on these hosts are networked using Docker overlay networks, and we use etcd to store the data related to those overlay networks.
Here are the logs from the etcd container that is leaking memory:
etcd02-12132017.log
Let me know if you need more info.
@djain13, sorry for the delay; I will have time this weekend to review. Thanks!
@hexfusion
I enabled swap on the host using the swapon -a command, and since then the newly added etcd container/member has not been leaking memory. For the last 24 hours the container has been up and has stayed at 1.5 GB.
But the node is still unhealthy, even though the overall cluster is healthy.
I made several passes through the log but couldn't find anything other than the occasional httpv2 call timing out, and I see similar random errors on a healthy node as well.
I wanted to understand: what does it mean for a node to be unhealthy, and for what reasons does a node qualify as unhealthy? I couldn't find that in the etcd documentation. I also ran the node in debug mode but didn't get many logs at the debug level.
Some more updates:
[root@etcd3 dharjain]# docker ps
CONTAINER ID        IMAGE                        COMMAND                  CREATED             STATUS              PORTS               NAMES
90df93922c8f        quay.io/coreos/etcd:v3.1.7   "/usr/local/bin/etcd "   23 hours ago        Up 23 hours                             etcd
[root@etcd3 dharjain]#
[root@etcd3 dharjain]#
[root@etcd3 dharjain]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31742        1551       27603         201        2587       29493
Swap:          2047         183        1864
[root@etcd3 dharjain]#
[root@etcd3 dharjain]# docker exec -it etcd etcdctl -C https://<ip-of-cluster-member>:2379 --cert-file /certs/node.cert --key-file /certs/node.key --ca-file /certs/ca.cert cluster-health
2017-12-15 13:23:33.313064 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-12-15 13:23:33.316380 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
member 61f9be4275db04aa is unhealthy: got unhealthy result from https://<member3>:2379
member a2ce6c1e04b4a127 is healthy: got healthy result from https://<member1>:2379
member b169a1692c599ddd is healthy: got healthy result from https://<member2>:2379
cluster is healthy
[root@etcd3 dharjain]#
@hexfusion
Some more observations:
The unhealthy node reports a leader for the cluster, and with etcdctl ls I could fetch keys from the unhealthy node. Now I am confused: is the node really unhealthy, or is the etcdctl cluster-health command reporting health incorrectly?
@hexfusion
Have you had a chance to look into this issue? Also, can you respond to the question below, or point me to the appropriate documentation?
I wanted to understand: what does it mean for a node to be unhealthy, and for what reasons does a node qualify as unhealthy? I couldn't find that in the etcd documentation. I also ran the node in debug mode but didn't get many logs at the debug level.
@djain13 taking a look today, probably tonight; I have a loop creating a large data dir for testing now.
@djain13 sorry this is taking so long, but perhaps you can help me while I get my environment set up to handle these larger tests. One metric used to label a node unhealthy is proposals_committed_total. My theory is that because you are basically starting from scratch with the new node and a decent-sized dataset, the replaced member never quite catches up. Let's test that by getting some metrics on the cluster. Can you add Prometheus into the mix? Here is a link to a setup UI; I used it to test another issue using v2 with compose, but you should be able to extract the Prometheus config and get that working. Then let's review that data and see what is going on. I agree a writeup should be done on what "healthy" means and how the code defines it; the metrics will help paint that picture.
Also, you can curl http://127.0.0.1:2379/metrics | grep $metric without setting anything up, but the graphing is kind of nice.
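For reference, a minimal Prometheus scrape config for a TLS-enabled cluster like this one (a sketch; the cert paths and <member-ip> placeholders are assumptions carried over from the commands above):
cat > prometheus.yml <<'EOF'
# Scrape all three etcd members over HTTPS using the cluster's client certs.
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: etcd
    scheme: https
    tls_config:
      ca_file: /certs/ca.cert
      cert_file: /certs/node.cert
      key_file: /certs/node.key
    static_configs:
      - targets: ['<member1-ip>:2379', '<member2-ip>:2379', '<member3-ip>:2379']
EOF
./prometheus --config.file=prometheus.yml   # flag style for Prometheus 2.x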
@hexfusion
Thanks, I will set up Prometheus. Until then, some updates:
Output from the /metrics API:
# HELP etcd_server_proposals_applied_total The total number of consensus proposals applied.
# TYPE etcd_server_proposals_applied_total gauge
etcd_server_proposals_applied_total 1.1814133734e+10
# HELP etcd_server_proposals_committed_total The total number of consensus proposals committed.
# TYPE etcd_server_proposals_committed_total gauge
etcd_server_proposals_committed_total 1.1814136396e+10
# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
etcd_server_proposals_failed_total 1967
# HELP etcd_server_proposals_pending The current number of pending proposals to commit.
# TYPE etcd_server_proposals_pending gauge
etcd_server_proposals_pending 0
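In the meantime, here is a quick way to watch the committed/applied gap straight from the metrics endpoint (a sketch; the cert paths and member IP are placeholders as above):
curl -s --cacert /certs/ca.cert --key /certs/node.key --cert /certs/node.cert \
    https://<member3-ip>:2379/metrics \
  | awk '/^etcd_server_proposals_applied_total/   {a = $2}
         /^etcd_server_proposals_committed_total/ {c = $2}
         END {printf "commit/apply lag: %.0f proposals\n", c - a}'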
@djain13 The cluster-health command is the same as curl -L http://localhost:2379/health; it just pings your client URL. Do you have server logs from the moment you got the unhealthy status?
@djain13 does this issue exist on the latest version of etcd, such as v3.2.12?
@gyuho
That was the first thing I tried. I don't see any specific log lines when I run etcdctl cluster-health or hit /health.
Here is the output from /health: {"health": "false"}
@djain13 Maybe try a raw curl command (without Docker)? Also try out the latest release, as @fanminshi suggested.
@djain13 does this issue exist on the latest version of etcd, such as v3.2.12?
We haven't tried the latest version yet; we are on {"etcdserver":"3.1.7","etcdcluster":"3.1.0"}. Do you think moving to the latest version will help? If the version were the issue, all the nodes in the cluster should have shown the same problem.
@djain13 Maybe try a raw curl command (without Docker)?
Output from raw curl:
[root@etcd2 dharjain]# curl --cacert /opt/bicloud/etcd/certs/ca.cert --key /opt/bicloud/etcd/certs/node.key --cert /opt/bicloud/etcd/certs/node.cert https://<unhealthy-node-ip>:2379/health
{"health": "false"}
@djain13 That's strange... I would not suggest upgrading just because of this. It would be great if you could reproduce this behavior without affecting the production cluster.
@gyuho One way to reproduce this issue is to first create a new etcd cluster from a snapshot of the production cluster, then try to add a new member to the test cluster and see if the memory leak still occurs.
We can test the new cluster with both v3.1.7 and v3.2.12 without affecting the production cluster. What do you think?
@fanminshi Agree. Just make sure to read the upgrade guide https://github.com/coreos/etcd/blob/master/Documentation/upgrades/upgrade_3_2.md. Most breaking changes are in the v3 client; nothing should break with v2.
@djain13 it would be best if you can reproduce this issue without affecting the existing production cluster.
Thanks @gyuho and @fanminshi.
Can you point me to the steps to create a new cluster from a snapshot of the production cluster?
@djain13 read this guide https://github.com/coreos/etcd/blob/master/Documentation/op-guide/recovery.md
@djain13 If you have only v2 data, also see https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery.
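Following those guides, a minimal sketch of standing up a test cluster from production data (all paths and IPs are placeholders; which path applies depends on whether the data is v3 or v2):
# v3 data: snapshot a healthy member, then restore the snapshot for a test node
ETCDCTL_API=3 etcdctl --endpoints https://<healthy-member-ip>:2379 \
  --cacert /certs/ca.cert --cert /certs/node.cert --key /certs/node.key \
  snapshot save /tmp/prod.snap.db
ETCDCTL_API=3 etcdctl snapshot restore /tmp/prod.snap.db \
  --name test0 \
  --initial-cluster test0=https://<test-ip>:2380 \
  --initial-cluster-token etcd-test \
  --initial-advertise-peer-urls https://<test-ip>:2380
# v2 data: copy the keyspace out of the data dir instead (per the v2 admin guide),
# then start a single test member on the backup dir with --force-new-cluster
etcdctl backup --data-dir /opt/etcd/data --backup-dir /tmp/etcd-v2-backup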
@hexfusion @gyuho @fanminshi
While I bring up a parallel test cluster, here are some metrics from the bad node:
# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 1
# HELP etcd_server_proposals_applied_total The total number of consensus proposals applied.
# TYPE etcd_server_proposals_applied_total gauge
etcd_server_proposals_applied_total 1.1834524999e+10
# HELP etcd_server_proposals_committed_total The total number of consensus proposals committed.
# TYPE etcd_server_proposals_committed_total gauge
etcd_server_proposals_committed_total 1.1834530489e+10
# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
etcd_server_proposals_failed_total 2435
# HELP etcd_server_proposals_pending The current number of pending proposals to commit.
# TYPE etcd_server_proposals_pending gauge
etcd_server_proposals_pending 0
I see a steady rise in the etcd_server_proposals_failed_total metric.
I also see a huge difference between etcd_server_proposals_committed_total and etcd_server_proposals_applied_total. On a good node this difference is negligible; in fact, on the leader the two metrics are identical.
Is there any specific reason for proposals to fail? Can this be because of a slow disk or network? We store the etcd data on SSDs.
@djain13 There are many ways for raft proposals to fail: leader elections, server downtime, slow disks, etc. The best way to troubleshoot is to set up Prometheus monitoring (preferably with a dashboard) and look at the server warnings at the moments the spikes happen.
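For example, once Prometheus is scraping the members, the failure rate over time can be pulled straight from its HTTP API (a sketch; the Prometheus host and instance label are placeholders):
curl -s 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=rate(etcd_server_proposals_failed_total{instance="<member3-ip>:2379"}[5m])'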
@djain13 is there anything else we can help with on this issue?
Closing due to inactivity. Please create a new issue with reproducible steps.