Kops Version: kops v1.8.0-beta.2
Kubernetes Version: kubernetes v1.8.2
ETCD Version: v3.0.17 (TLS enabled)
Cloud Provider: AWS
Steps to recreate (will take time):
etcdClusters:
- enableEtcdTLS: true
  etcdMembers:
  - encryptedVolume: true
    instanceGroup: master0-az0
    name: a-1
  - encryptedVolume: true
    instanceGroup: master1-az0
    name: a-2
  name: main
  version: 3.0.17
- enableEtcdTLS: true
  etcdMembers:
  - encryptedVolume: true
    instanceGroup: master0-az0
    name: a-1
  - encryptedVolume: true
    instanceGroup: master1-az0
    name: a-2
  name: events
  version: 3.0.17
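To apply a spec like the above, the usual kops flow is roughly as follows (assuming ${KOPS_CLUSTER_NAME} is set to your cluster name):

kops edit cluster ${KOPS_CLUSTER_NAME}          # paste in the etcdClusters section above
kops update cluster ${KOPS_CLUSTER_NAME} --yes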
After some operation time, you may begin to see warnings like the following in the logs:
kubelet[1495]: W1204 11:17:02.533588 1495 status_manager.go:446] Failed to update status for pod "custom-pod-A": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.542113 1495 status_manager.go:446] Failed to update status for pod "custom-pod-B": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.551753 1495 status_manager.go:446] Failed to update status for pod "canal-hcldk_kube-system(C)": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.557246 1495 status_manager.go:446] Failed to update status for pod "custom-pod-D": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.565505 1495 status_manager.go:446] Failed to update status for pod "custom-pod-E": etcdserver: mvcc: database space exceeded
kubelet[1495]: \"sizeBytes\":746888}]}}" for node "ip-1-2-3-4.aws-region.compute.internal": etcdserver: mvcc: database space exceeded
Check ETCD Status:
~ # ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT} alarm list
memberID:A alarm:NOSPACE
memberID:B alarm:NOSPACE
memberID:C alarm:NOSPACE
~ # ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT} --write-out=table endpoint status
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://localhost:4001 | 670630e06d36fd3c | 3.0.17 | 140 MB | true | 358 | 120256658 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
~ # df -h | grep "master-vol"
/dev/xvdu 20G 442M 19G 3% /mnt/master-vol-A
/dev/xvdv 20G 419M 19G 3% /mnt/master-vol-B
According to the ETCD Maintenance Docs the cluster has gone into a limited-operation maintenance mode, meaning that it will only accept key reads and deletes.
Recovery: a history compaction needs to occur (followed by a defragmentation, if needed, to release the freed storage space for reuse) before the cluster is operational again; the steps for this are in the docs linked above.
There are possible options we could supply to etcd via kops which would hopefully mitigate this issue and reduce the manual user maintenance required (although I don't know enough about etcd to be sure):
- ETCD_QUOTA_BACKEND_BYTES to be configurable, so a higher value can be set rather than the default of 0 (0 falls back to etcd's low default space quota)
- ETCD_AUTO_COMPACTION_RETENTION to be configurable, so compaction can trigger automatically without user intervention

EDIT: 1 of the 5 nodes had its etcd volume maxed out at 100%, due to a dodgy deployment. The other 4 were only 3% utilised, as shown in the log snippets above.
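For reference, both env vars map directly onto etcd flags (ETCD_QUOTA_BACKEND_BYTES -> --quota-backend-bytes, ETCD_AUTO_COMPACTION_RETENTION -> --auto-compaction-retention); a sketch with purely illustrative values:

# Illustrative values only, not kops defaults: an 8 GiB backend quota
# and 1 hour of auto-compaction retention.
etcd --quota-backend-bytes=8589934592 --auto-compaction-retention=1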
Ping @gambol99 @justinsb @chrislovecnm
So the apiserver issues a compaction every 5 minutes (IIRC). I don't understand exactly the cause, but it looks like an etcd bug. Related:
https://github.com/kubernetes/kubernetes/issues/45037
https://github.com/coreos/etcd/issues/8009
https://github.com/coreos/etcd/issues/7116
It sounds like an etcd bug, @lavalamp asked for a backport and was told "no", but the fix will be in etcd 3.3.
Cheers, that's probably the cause of it then!
In regards to the apiserver doing the compaction every 5 minutes, shouldn't the other 4 nodes with disk space remaining have stayed operational? Or did we still need to do the defrag to reclaim the free space on the members / clear the alarms that had triggered?
If anyone runs into the above issue, you can attempt to follow the below very rough recovery steps that I took (tested on CoreOS).
Run this on each affected member that still has available space on the etcd volume:
export ETCDCTL_API=3
ETCD_ENDPOINT_MAIN="https://localhost:4001"
ETCD_ENDPOINT_EVENTS="https://localhost:4002"
CA_FILE="/srv/kubernetes/ca.crt"
ETCD_CMD="etcdctl --cacert ${CA_FILE}"

# Grab the current revision, compact history up to it, defragment to
# reclaim the freed space, then clear the NOSPACE alarm. Repeat with
# ${ETCD_ENDPOINT_EVENTS} for the events cluster.
rev=$(${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*')
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} compact $rev
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} defrag
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} alarm disarm
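As a sanity check afterwards, the alarm list should come back empty and the DB SIZE column should have shrunk (same commands as the status check earlier in this issue):

${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} alarm list
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} --write-out=table endpoint status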
For any members that have experienced the above-mentioned bug, where the volume is at 100% (not entirely sure whether steps 2-5 are necessary in all cases; a consolidated sketch of the on-node steps follows the list):
1. ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member list
2. ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member remove <member-id-from-above-command>
3. ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member add <etcd-member-name> --peer-urls="https://<etcd-member-name>.internal.${KOPS_CLUSTER_NAME}:2380"
4. Repeat the above for ${ETCD_ENDPOINT_EVENTS}; <etcd-member-name> will differ and the port will be 2381 rather than 2380.
5. kops update cluster ${KOPS_CLUSTER_NAME} --yes (this will re-create the ASG and volumes)
6. systemctl stop kubelet && systemctl stop protokube
7. In /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest, change the ETCD_INITIAL_CLUSTER_STATE value to existing
8. docker kill $(docker ps | grep "etcd" | awk '{print $1}')
9. rm -rf /mnt/master-vol-<vol-id-main>/var/etcd/data/member
10. rm -rf /mnt/master-vol-<vol-id-events>/var/etcd/data-events/member
11. systemctl start kubelet. Wait for the cluster to report healthy again (check etcd member list, kops validate cluster etc).
12. systemctl start protokube

The above steps were modified slightly from this guide: https://github.com/kubernetes/kops/blob/master/docs/single-to-multi-master.md#4---add-the-third-master
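For convenience, here is the on-node portion (steps 6-12) as a rough script; the placeholders are left as-is and it assumes the kops mount paths shown above, so treat it as a sketch rather than something to run blindly:

# Steps 6-12 from the list above, run on the replacement master.
systemctl stop kubelet && systemctl stop protokube

# Step 7 is a manual edit: in /etc/kubernetes/manifests/etcd.manifest and
# /etc/kubernetes/manifests/etcd-events.manifest, set
# ETCD_INITIAL_CLUSTER_STATE to "existing" so etcd joins the existing
# cluster rather than bootstrapping a new one.

docker kill $(docker ps | grep "etcd" | awk '{print $1}')
rm -rf /mnt/master-vol-<vol-id-main>/var/etcd/data/member
rm -rf /mnt/master-vol-<vol-id-events>/var/etcd/data-events/member

systemctl start kubelet
# Wait for the cluster to report healthy (etcd member list, kops validate
# cluster), then:
systemctl start protokube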
v3.3.0 has officially been released. The following PR should correct issues with logging and pick up version changes for a rolling update: https://github.com/kubernetes/kops/pull/4371
I'll be testing this out and will see how it goes!
Tempted to close this issue now. ETCD v3.3.0 appears to resolve it; I'm running a cluster on the newer version and haven't noticed any problems so far (including with the PR referenced above).
Just a note, with kops you'll need to define the new version as follows in your kops spec:
etcdClusters:
- etcdMembers:
  ...
  enableEtcdTLS: true
  image: gcr.io/etcd-development/etcd:v3.3.0
  name: main
  version: 3.3.0
The version field doesn't need to be identical to the image, so long as it's 3.x.x.
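To roll the new image out, the usual apply/rolling-update pair should do it, roughly:

kops update cluster ${KOPS_CLUSTER_NAME} --yes
kops rolling-update cluster ${KOPS_CLUSTER_NAME} --yes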
@justinsb anything more you think we need to do here, or happy to close this?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Sorry to resurrect this old issue. I just fell into it.
When I ran the etcdctl [...] defrag, I always got this error:
Failed to defragment etcd member[https://127.0.0.1:4001] (context deadline exceeded)
Setting the flag --command-timeout=120s solved this issue for me.
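i.e. reusing the vars from the recovery snippet above:

${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} --command-timeout=120s defrag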
Hope that I could save someone some time.
Does KOPS support this --quota-backend-bytes param for etcd?
You can specify which ENV vars to pass on to etcd: https://kops.sigs.k8s.io/cluster_spec/#etcdclusters
So you just have to set ETCD_QUOTA_BACKEND_BYTES there.
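Per that page, the env vars go under the etcd cluster's manager section; a minimal sketch with illustrative values (an 8 GiB quota and 1 hour of retention, not recommendations):

etcdClusters:
- etcdMembers:
  ...
  name: main
  manager:
    env:
    - name: ETCD_QUOTA_BACKEND_BYTES
      value: "8589934592"
    - name: ETCD_AUTO_COMPACTION_RETENTION
      value: "1"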
Thanks @olemarkus