Kops: ETCD database space / quota exceeded, goes into maintenance mode

Created on 4 Dec 2017 · 12 Comments · Source: kubernetes/kops

Kops Version: kops v1.8.0-beta.2
Kubernetes Version: kubernetes v1.8.2
ETCD Version: v3.0.17 (TLS enabled)
Cloud Provider: AWS

Steps to recreate (will take time):

Create a Kubernetes cluster on the versions specified above, using ETCD v3 with a config similar to the one below (I had 5 members configured; the spec is trimmed here to keep it short).
Give the cluster some operation time under load (creating lots of deployments, events, etc.).

  etcdClusters:
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master0-az0
      name: a-1
    - encryptedVolume: true
      instanceGroup: master1-az0
      name: a-2
    name: main
    version: 3.0.17
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master0-az0
      name: a-1
    - encryptedVolume: true
      instanceGroup: master1-az0
      name: a-2
    name: events
    version: 3.0.17

After some operation time, you may begin to see warnings such as below in the logs:

kubelet[1495]: W1204 11:17:02.533588    1495 status_manager.go:446] Failed to update status for pod "custom-pod-A": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.542113    1495 status_manager.go:446] Failed to update status for pod "custom-pod-B": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.551753    1495 status_manager.go:446] Failed to update status for pod "canal-hcldk_kube-system(C)": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.557246    1495 status_manager.go:446] Failed to update status for pod "custom-pod-D": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.565505    1495 status_manager.go:446] Failed to update status for pod "custom-pod-E": etcdserver: mvcc: database space exceeded
kubelet[1495]: \"sizeBytes\":746888}]}}" for node "ip-1-2-3-4.aws-region.compute.internal": etcdserver: mvcc: database space exceeded

Check ETCD Status:

~ # ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT} alarm list
memberID:A alarm:NOSPACE
memberID:B alarm:NOSPACE
memberID:C alarm:NOSPACE

~ # ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT} --write-out=table endpoint status
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://localhost:4001 | 670630e06d36fd3c |  3.0.17 |  140 MB |      true |       358 |  120256658 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+

~ # df -h | grep "master-vol"
/dev/xvdu         20G  442M   19G   3% /mnt/master-vol-A
/dev/xvdv         20G  419M   19G   3% /mnt/master-vol-B
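The DB SIZE column above can also be read programmatically, which is handy for comparing it against the backend quota in a monitoring script. A minimal sketch, assuming the JSON status output exposes the size as a "dbSize" field (in bytes); the function name and the sample JSON are made up for illustration and are not real cluster output:

```shell
# Print the "dbSize" value (bytes) found in etcdctl's JSON status output.
# Reads JSON on stdin; prints nothing if the field is absent.
etcd_db_size_bytes() {
  grep -Eo '"dbSize":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# Hand-written sample input, only to show the shape the function expects.
sample='[{"Endpoint":"https://localhost:4001","Status":{"header":{"revision":1234},"version":"3.0.17","dbSize":140000000}}]'
printf '%s\n' "$sample" | etcd_db_size_bytes
```

Against a live member this would be fed from `${ETCD_CMD} --endpoints ${ETCD_ENDPOINT} endpoint status --write-out=json`.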

According to the ETCD Maintenance Docs the cluster has gone into a limited operation maintenance mode, meaning that it will only accept key reads and deletes.

Recovery: History compaction needs to occur (possibly followed by defragmentation to release the freed storage space for use) before the cluster is operational again. The steps for this are in the docs linked above.

There are possible options we could supply to etcd via kops which will hopefully mitigate this issue and reduce manual user maintenance required (although I don't know much about etcd to be sure):

EtcdClusterSpec: Allow ETCD_QUOTA_BACKEND_BYTES to be configurable, so a higher value can be set rather than the default of 0 (0 means etcd falls back to its low default space quota).
EtcdClusterSpec: Allow ETCD_AUTO_COMPACTION_RETENTION to be configurable, so compaction can trigger automatically without user intervention.
- Could have some performance implications?
- If we were to support this, should we default it to be enabled for new clusters?
- Does periodic defragmentation still need to occur?

EDIT: 1 of the 5 nodes had etcd volume maxed out at 100%, due to a dodgy deployment. The other 4 were only 3% utilised as shown in the above log snippets.

Ping @gambol99 @justinsb @chrislovecnm

lifecycle/rotten

KashifSaadat

All 12 comments

So the apiserver issues a compaction every 5 minutes (IIRC). I don't understand exactly the cause, but it looks like an etcd bug. Related:

https://github.com/kubernetes/kubernetes/issues/45037
https://github.com/coreos/etcd/issues/8009
https://github.com/coreos/etcd/issues/7116

It sounds like an etcd bug, @lavalamp asked for a backport and was told "no", but the fix will be in etcd 3.3.

justinsb on 4 Dec 2017

👍4

Cheers, that's probably the cause of it then!

Regarding the apiserver doing the compaction every 5 minutes: shouldn't this mean that the other 4 nodes, which had disk space remaining, should have stayed operational? Or maybe we still needed to do the defrag to reclaim the free space on the members and clear the alarms that had triggered?

KashifSaadat on 4 Dec 2017

If anyone runs into the above issue, you can attempt to follow the below very rough recovery steps that I took (tested on CoreOS).

Run this on each affected member that still has available space on its etcd volume:

export ETCDCTL_API=3
ETCD_ENDPOINT_MAIN="https://localhost:4001"
ETCD_ENDPOINT_EVENTS="https://localhost:4002"
CA_FILE="/srv/kubernetes/ca.crt"
# Left unquoted at the call sites on purpose, so it splits into command + flags.
ETCD_CMD="etcdctl --cacert ${CA_FILE}"
# Read the current revision from the JSON status output.
rev=$(${ETCD_CMD} --endpoints "${ETCD_ENDPOINT_MAIN}" endpoint status --write-out="json" | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9]*')
# Compact history up to that revision, defragment, then clear the NOSPACE alarm.
${ETCD_CMD} --endpoints "${ETCD_ENDPOINT_MAIN}" compact "${rev}"
${ETCD_CMD} --endpoints "${ETCD_ENDPOINT_MAIN}" defrag
${ETCD_CMD} --endpoints "${ETCD_ENDPOINT_MAIN}" alarm disarm
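
If the status call fails or returns unexpected output, the revision lookup above yields an empty string and compact is then invoked without a usable argument. A rough sketch of a guard for that case; the function name is hypothetical and the parsing mirrors the grep approach used above:

```shell
# Extract the current revision from `endpoint status --write-out=json`
# output supplied on stdin. Returns non-zero if no numeric revision is
# found, so the caller can skip the compact step instead of running it blind.
extract_revision() {
  rev=$(grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1)
  case "$rev" in
    ''|*[!0-9]*) echo "could not determine revision" >&2; return 1 ;;
  esac
  printf '%s\n' "$rev"
}

# Intended use (needs a reachable member, so it is left commented out here):
# rev=$(${ETCD_CMD} --endpoints "${ETCD_ENDPOINT_MAIN}" endpoint status --write-out=json | extract_revision) \
#   && ${ETCD_CMD} --endpoints "${ETCD_ENDPOINT_MAIN}" compact "$rev"
```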

For any members that have hit the bug mentioned above, where the volume is at 100% (I'm not entirely sure whether steps 2-5 are necessary in all cases):

1. Find the affected member in AWS, terminate the associated ASG and the 2 attached EBS Volumes (etcd, etcd-events).
2. On one of the healthy-ish members, get the etcd member list: ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member list
3. Remove the dead member (it should have the same tag name as the ASG / instance you deleted): ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member remove <member-id-from-above-command>
4. Add the member back in; it will be in an un-started state: ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member add <etcd-member-name> --peer-urls="https://<etcd-member-name>.internal.${KOPS_CLUSTER_NAME}:2380"
5. Repeat steps 2-4 for ${ETCD_ENDPOINT_EVENTS}. <etcd-member-name> will differ and the port will be 2381 rather than 2380.
6. kops update cluster ${KOPS_CLUSTER_NAME} --yes (this will re-create the ASG and volumes)
7. Once the new master has started, ssh into the instance.
8. Run as root: systemctl stop kubelet && systemctl stop protokube
9. Edit both /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest, changing the ETCD_INITIAL_CLUSTER_STATE value to existing.
10. Drop the docker containers: docker kill $(docker ps | grep "etcd" | awk '{print $1}')
11. For both the etcd volumes, remove the member dirs:
- rm -rf /mnt/master-vol-<vol-id-main>/var/etcd/data/member
- rm -rf /mnt/master-vol-<vol-id-events>/var/etcd/data-events/member
12. Start kubelet: systemctl start kubelet. Wait for the cluster to report healthy again (check etcd member list, kops validate cluster etc).
13. Start protokube again: systemctl start protokube
14. Once the cluster is all healthy, slowly terminate the masters one by one (giving time for the cluster to recover), to ensure they are all in a clean state.

The above steps were modified slightly from following this guide: https://github.com/kubernetes/kops/blob/master/docs/single-to-multi-master.md#4---add-the-third-master

KashifSaadat on 4 Dec 2017

👍3

v3.3.0 has officially been released. The following PR should correct issues with logging and pick up version changes for a rolling update: https://github.com/kubernetes/kops/pull/4371

I'll be testing this out and will see how it goes!

KashifSaadat on 2 Feb 2018

Tempted to close this issue now.. ETCD v3.3.0 appears to resolve this issue. I'm running a cluster on the newer version and haven't noticed any problems so far (including the PR referenced above).

Just a note, with kops you'll need to define the new version as follows in your kops spec:

  etcdClusters:
  - etcdMembers:
     ...
    enableEtcdTLS: true
    image: gcr.io/etcd-development/etcd:v3.3.0
    name: main
    version: 3.3.0

The version field doesn't need to be identical to the image, so long as it's 3.x.x.

@justinsb anything more you think we need to do here, or happy to close this?

KashifSaadat on 2 Mar 2018

Sorry to resurrect this old issue. I just fell into it.

When I ran etcdctl [...] defrag I always got this error:
Failed to defragment etcd member[https://127.0.0.1:4001] (context deadline exceeded)

Setting the flag --command-timeout=120s solved this issue for me.
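
For reference, a small sketch of the defrag call with that timeout applied. The wrapper name is made up; ETCD_CMD is assumed to be set as in the recovery snippet earlier in the thread (falling back to plain etcdctl), and 120s is simply the value that worked above, not a recommended setting:

```shell
# Defragment one endpoint with a longer client-side timeout.
# $1: endpoint URL, $2: timeout (optional, defaults to 120s).
# ETCD_CMD is expanded unquoted on purpose so that its flags
# (e.g. --cacert) are passed through as separate arguments.
defrag_with_timeout() {
  endpoint="$1"
  timeout="${2:-120s}"
  ETCDCTL_API=3 ${ETCD_CMD:-etcdctl} --endpoints "$endpoint" --command-timeout="$timeout" defrag
}

# Example call (commented out because it needs a reachable etcd member):
# defrag_with_timeout "https://localhost:4001"
```

Larger databases may need a longer timeout still, which can be passed as the second argument.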

Hope this saves someone some time.

voigt on 14 Nov 2019

Does kops support the --quota-backend-bytes param for etcd?

jsonmp-k8 on 14 Oct 2020

You can specify which ENV vars to pass on to etcd: https://kops.sigs.k8s.io/cluster_spec/#etcdclusters
So you just have to set ETCD_QUOTA_BACKEND_BYTES there.
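
Note that the quota is expressed in bytes, so the value needs converting before it goes into the spec. A tiny helper for that, as a sketch; the function name is hypothetical and the 4 GiB figure is only an example, not a recommendation:

```shell
# Convert a size in GiB to bytes, the unit ETCD_QUOTA_BACKEND_BYTES expects.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 4   # prints 4294967296
```

The exact place to put the resulting name/value pair in the cluster spec is described in the kops docs linked above.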