Kops: Support upgrading from etcd2 to etcd3 for existing clusters

Created on 19 Feb 2018 · 36 comments · Source: kubernetes/kops

Are there plans to support upgrading an existing cluster from etcd2 to etcd3 via kops?

I see the doc lists the etcd version as something that can be configured in the cluster YAML, but it states that this is only supported for new clusters.

It would be very useful if kops could perform the etcd upgrade. Alternatively, any information about manually migrating the cluster in a kops-compatible way would be appreciated. (kops-compatible = the next time I do a kops upgrade, it won't try to undo my changes.)

I see this comment does touch on migrating existing clusters, but the issue was subsequently closed: https://github.com/kubernetes/kops/issues/2842#issuecomment-333987054

Labels: 1.10, blocks-next, lifecycle/rotten

All 36 comments

@alanbyr5 yes @justinsb is working on an Etcd manager, which should provide this functionality.

@justinsb For a manual upgrade, if we follow the official CoreOS tutorial for upgrading etcd, will that be enough? Are there any other steps we should take in addition?

I would be super-careful following any upgrade instructions - most of them fail in subtle ways. Or you end up running etcd3, but in "etcd2 mode".

The etcd plan of record is here: https://github.com/kubernetes/kops/blob/master/docs/etcd/roadmap.md

This is going to become more urgent: Kubernetes 1.10 deprecates etcd2 storage, and it will be removed in 1.13.

That was actually deprecated in 1.9. It was repeated in the 1.10 release notes for visibility.

@justinsb does kops 1.9 take care of the etcd3 upgrade?

Etcd upgrades are not in kops 1.9, but we're on track for the plan, which is to opt-in to etcd-manager (which will allow upgrades) in kops 1.10: https://github.com/kubernetes/kops/blob/master/docs/etcd/roadmap.md

It's possible today to opt in to etcd3 for new clusters, using `KOPS_FEATURE_FLAGS=SpecOverrideFlag kops create cluster --override=cluster.spec.etcdClusters[*].version=3.1.11`, and that'll be easier in kops 1.10.

With the release of kops 1.10 is it possible now to upgrade to etcd3, by enabling etcd-manager?

You don't have to use etcd-manager. But if you want to you can follow this documentation: https://github.com/kubernetes/kops/blob/master/docs/etcd/manager.md
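For reference, opting in via the cluster spec mirrors the `kops set cluster "cluster.spec.etcdClusters[*].manager.image=..."` override used later in this thread. A minimal sketch of the relevant spec fragment (the instance-group name and image tag here are illustrative, not prescriptive — check the manager.md doc for the recommended image):

```
etcdClusters:
- etcdMembers:
  - instanceGroup: master-eu-central-1a
    name: a
  manager:
    image: kopeio/etcd-manager:latest
  name: main
- etcdMembers:
  - instanceGroup: master-eu-central-1a
    name: a
  manager:
    image: kopeio/etcd-manager:latest
  name: events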

With or without the manager, you can set the etcd version using something like:

```
etcdClusters:
- etcdMembers:
  - instanceGroup: master-eu-central-1a
    name: a
  - instanceGroup: master-eu-central-1b
    name: b
  - instanceGroup: master-eu-central-1c
    name: c
  name: main
  version: 3.1.11
- etcdMembers:
  - instanceGroup: master-eu-central-1a
    name: a
  - instanceGroup: master-eu-central-1b
    name: b
  - instanceGroup: master-eu-central-1c
    name: c
  name: events
  version: 3.1.11
```

Thank you @olemarkus - the Cluster Spec doc clearly states that it is not possible to upgrade that way. I guess that doc is outdated?

With etcd-manager it definitely works to upgrade. Done that a few times. Setting version without manager is something I have only done on brand new clusters.

OK, so the recommended path to upgrade a pre-1.10 etcd2 cluster to etcd3 is to upgrade kops to 1.10 and use etcd-manager as explained in the doc?

I tried upgrading a HA cluster running etcd v2 to v3 following the etcd-manager docs and I failed miserably. I'm curious if anyone has gotten it to work?

I can also confirm that you don't need etcd-manager to upgrade from etcd v3.0 to v3.1. Why this works is explained in the etcd docs.

I have been following this since I would like to get to v3 soon. I just upgraded to kops 1.10 and also k8s 1.10.3, and have installed etcd-manager. After reading @bismarck's comment, I did a bit more reading; the etcd 3.0 release notes mention upgrading to 2.3 before 3.0:

https://github.com/coreos/etcd/blob/master/Documentation/upgrades/upgrade_3_0.md

Should we be following this guide to get to etcd3? The reason I ask is because I'm still on 2.2.1.

I tried enabling etcd-manager on a kops (1.10) cluster (with etcd 2.2.1) running Kubernetes 1.10.7. When the new master came up, I saw some backups written to S3, but the cluster didn't boot up.
I saw these error logs in api-server:

```
Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
```

And there are no etcd containers running on the master instance.

We faced something similar when trying to use the manager with CoreOS instances https://github.com/kopeio/etcd-manager/issues/125

I followed these steps and the upgrade worked for me.

  • I had my kops 1.9 based K8s cluster running 1.9.7 with etcd version 2.2.1.
  • I enabled backups on the above cluster by setting this in my kops config:

```
- backups:
    backupStore: s3://<kops-bucket>/<cluster-name>/backups/etcd/main/
  etcdMembers:
  - instanceGroup: <some-instance-group>
    name: a
  name: main
```

  • After `kops upgrade cluster` and `kops rolling-update cluster`, backups started working on the above cluster.
  • I then upgraded the kops binary to 1.10, changed the Kubernetes version in the cluster spec to 1.10.7, and ran these to enable etcd-manager and upgrade etcd to v3:

```
export KOPS_FEATURE_FLAGS=SpecOverrideFlag
kops set cluster "cluster.spec.etcdClusters[*].manager.image=kopeio/etcd-manager:latest"
kops set cluster "cluster.spec.etcdClusters[*].version=3.1.12"
```

  • After `kops upgrade cluster` and `kops rolling-update cluster`, etcd-manager kicked in: it started the 2.2.1 cluster, took a backup of it, switched the etcd binary to 3.1.12, started the new etcd cluster, and restored the old backup.
  • Looks like there are limited etcd versions that etcd-manager supports today. More details on this can be found here and here.
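For reference, after the two `kops set cluster` overrides above, the relevant part of the cluster spec should end up looking roughly like this (same placeholder instance group as above; the image tag follows the override, not necessarily the recommended pin):

```
etcdClusters:
- etcdMembers:
  - instanceGroup: <some-instance-group>
    name: a
  manager:
    image: kopeio/etcd-manager:latest
  name: main
  version: 3.1.12
```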

Even though the cluster is healthy now, the DNS records for etcd (etcd-a.internal and etcd-events-a.internal) are still pointing to the old IPs. Not sure if this is a bug.

@kumudt Thanks for the heads up about etcd-manager only supporting certain versions of etcd. However, I am still having trouble upgrading my cluster.

Could you share some info about your cluster? Are you running HA? DNS or Gossip based cluster?

@bismarck I am running a DNS based cluster. It's not running in HA.

If you are using a Gossip-based cluster and have etcd encryption enabled, then you need to check this out - there appear to be issues surrounding that:
https://github.com/kopeio/etcd-manager/issues/64

Also, as etcd-manager hasn't updated my DNS records, I need to check further on how etcd discovery is happening now. If someone can help me out here it would be great; I just want to be sure this is the expected behavior.

I just used @kumudt's method to upgrade our kops 1.10 / k8s 1.10.3 / etcd 2.2.1 cluster, and after a few issues I got it upgraded. Some of my pods would go into a CrashLoopBackOff until I rebooted the nodes again, which resolved them. Also, when upgrading the nodes I did one instance group at a time - kops didn't like that, since the cluster wouldn't validate, so I had to update the rest of the master nodes (2/3) with --cloudonly.

I also noticed the DNS is not updated for the etcd hosts, so I'm wondering how etcd discovery is happening as well.

As an FYI, I didn't need to enable backups on my cluster. After enabling etcd-manager with etcd 2.2.1, it turned on backups automatically - I saw the backups running in the backups/etcd/main directories on S3. I did this before upgrading to 3.1.12.

Interestingly, when I'm in the container I can ping the etcd-[a,b,c] DNS names, but externally I am unable to do so (because the DNS entries haven't been updated). I'm wondering where these entries are now...

@zivagolee backups have to be enabled if you are running a kops 1.9 cluster; etcd-manager wouldn't work there otherwise (as per the etcd roadmap). In your case it's fine, as you have a kops 1.10 cluster.

I believe those DNS entries for etcd are supposed to be handled by the dns-controller pod running on your cluster. Kops pre-creates those DNS entries when you create your cluster and then dns-controller updates them with the appropriate value.

Actually, this was handled by protokube for etcd (before kops 1.10). Now that etcd-manager is split from protokube, it's not clear whether it's the responsibility of protokube or etcd-manager (neither of them is doing it as of now).
Filed an issue for this in etcd-manager as well -> https://github.com/kopeio/etcd-manager/issues/126
All other DNS records are managed by dns-controller, including the api-server's internal/external record updates (which is working fine).

@kumudt thanks for filing the issue!

WRT the upgrade, you are correct. I did:

  1. kops 1.9 -> kops 1.10
  2. k8s 1.9.x -> k8s 1.10.3
  3. etcd 2.2.1 -> etcd-manager 2.2.1

And, then I noticed the backup directory after that.

@zivagolee not sure if you have observed this, but if I upgrade using kops 1.10 with custom API servers registered and deployed (metrics-server in my case), the controller goes into CrashLoopBackOff until the DNS records of the master are updated (but dns-controller will not be available until the controller manager is healthy).
I have filed more details about the issue here.
https://github.com/kubernetes/kops/issues/5756
This basically happens whenever a new master is created and the cluster has any custom API servers.

@kumudt I haven't encountered this during my upgrade. The kops upgrade seemed fairly painless to me, but we aren't doing a whole lot on our k8s cluster yet (soon!).

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Working with multi-master clusters, I've been able to enable etcd-manager without downtime so far. However, I've not managed to go from etcd2 -> etcd3 without downtime; I had to manually adjust the DNS records and use --cloudonly during the rolling update.

Yes, I haven't figured out how to go from etcd 2.2.1 to etcd 3.1.12 without an outage. It makes sense though: your etcd cluster will always need to go from version 2 to version 3, and that switchover will be an outage, as a new cluster is formed (with version 3) and the old cluster (version 2) is left behind. There's a moment in time when the old cluster loses quorum and the new cluster establishes quorum. That window is usually the time it takes your "second" upgraded master (if you have an etcd cluster of size 3) to take a backup, update the data, join the new cluster with the "first" rebooted master, and finally update the local /etc/hosts file.

Also, I don't use kops to do this rolling update. We use terraform to apply new launch configurations for the masters, and we use the AWS console to manually terminate the instances, one by one, slowly. We don't move to the second master until we see the first master come up with version 3 and have its /etc/hosts file updated correctly.
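Since etcd-manager maintains the etcd name-to-IP mappings in each master's /etc/hosts rather than in Route 53, one quick sanity check before terminating the next master is to grep that file for etcd entries. The sketch below runs against a hypothetical copy of the file (the hostnames and IPs are illustrative, not from a real cluster):

```shell
# Hypothetical snapshot of a master's /etc/hosts after etcd-manager takes over.
cat > /tmp/hosts.example <<'EOF'
127.0.0.1 localhost
10.0.1.12 etcd-a.internal.cluster.example.com
10.0.2.34 etcd-events-a.internal.cluster.example.com
EOF

# Count the etcd entries; on a real master you would grep /etc/hosts itself
# and compare the IPs against the instance's current private IP.
grep -c 'etcd-' /tmp/hosts.example
```

On a real master you would want the listed IPs to match the rebooted instance's private IP before moving on to the "second" master.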

NOTE: You need to manually delete the old etcd Route 53 entries if they still exist from an older cluster. We did that while our "second" master was being terminated.

Your API servers on the masters will be set up to use the version 3 backend when they are upgraded, which will cause some of your pods to act abnormally (weave-npc, kube2iam, etc.). This is because after you reboot the "first" master, the API server will look for a cluster with v3 as the backend, which won't exist until your "second" master joins the "first" master and forms a new v3 cluster.

What we have done is setup two clusters, and control API traffic to move away from the cluster that we are upgrading towards the alternative cluster. As an added benefit, this can give you cross region HA for your APIs. We use weighted records in Route53 to control API traffic.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

I've performed the following update path https://github.com/kubernetes/kops/issues/6736:

  1. start with k8s-1.10.6/etcd-2.2.1 created with kops-1.10.0
  2. switch to kops-1.11.1, update to k8s-1.11.9/etcd-2.2.1
  3. opt-in etcd-manager, still with kops-1.11.1. This resulted in a broken etcd, repaired by switching to kops-1.12.0-beta.1 and kops update, kops rolling-update (see https://github.com/kubernetes/kops/issues/6736)
  4. etcd2 -> etcd3 update using kops-1.12.0-beta.1 and etcd-manager-3.0.20190328:

```
kops set cluster cluster.spec.etcdClusters[*].version=3.2.24
kops update cluster
kops rolling-update cluster
```

This resulted in many kube-system pods with Unknown status, kube-dns in CrashLoopBackOff, and weave-net on two masters with NodeLost. Basically only one master was functional. Rebooting the two remaining masters seems to have restored a working cluster.

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
