Kops: Route53 Records are not updated when etcd-manager is enabled

Created on 21 Aug 2018 · 13 comments · Source: kubernetes/kops

1. What kops version are you running? The command kops version will display
this information.
1.10.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-08T16:31:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:04:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
Enabled etcd-manager in a test cluster (a rough sketch of how is included below, after this template).

5. What happened after the commands executed?
As etcd-manager disables protokube's management of etcd, the master node no longer creates the Route53 records for the etcd cluster (https://github.com/kubernetes/kops/blob/master/protokube/pkg/protokube/kube_boot.go#L133), and etcd-manager does not appear to update the Route53 records either. As a result, etcd-related deployments fail (the calico-node DaemonSet goes into CrashLoopBackOff). Once I manually set the DNS records for etcd (second sketch below the template), calico-node became healthy.

6. What did you expect to happen?
When etcd-manager is enabled, the Route53 records for the etcd cluster should still be updated.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?
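
For context on step 4: on the kops releases I have used, etcd-manager is toggled per etcd cluster through the provider field in the cluster spec. The field name and values below are from memory and may differ on 1.10, so treat this as a rough sketch rather than the exact commands that were run:

    kops edit cluster my.example.com
    # under spec.etcdClusters, for both the "main" and "events" clusters:
    #   etcdMembers:
    #   - instanceGroup: master-us-east-1a
    #     name: a
    #   name: main
    #   provider: Manager   # hands etcd management to etcd-manager; "Legacy" (or
    #                       # omitting the field) keeps the old protokube-managed flow
    kops update cluster my.example.com --yes
    kops rolling-update cluster my.example.com --yes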
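
And for the manual DNS workaround mentioned in step 5, a minimal sketch using the AWS CLI. The hosted zone ID, IP address, and TTL are placeholders, and the etcd-a.internal.<cluster> name just follows the usual kops internal naming convention:

    # Upsert an A record for one etcd member (repeat for each member and for etcd-events)
    aws route53 change-resource-record-sets \
      --hosted-zone-id ZXXXXXXXXXXXXX \
      --change-batch '{
        "Changes": [{
          "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "etcd-a.internal.my.example.com",
            "Type": "A",
            "TTL": 60,
            "ResourceRecords": [{"Value": "172.20.32.10"}]
          }
        }]
      }'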

lifecycle/rotten

Most helpful comment

etcd-manager doesn't update /etc/hosts on the worker nodes, which causes CrashLoopBackOff for the calico-node pods running there.

All 13 comments

Is this related to #3502?

We are seeing the same issue here while experimenting with etcd-manager and kops. This is a non-starter, a really big issue: it forces us to manually update the Route53 entries for etcd main and events whenever a master is recreated.

In an HA cluster, the etcd members were using the static entries in Route53 to reach the other members of the cluster. Of course, that times out, since the IP addresses are placeholders.

When I remove those entries from Route53, the unavailable members of the cluster stop resolving to the static IP. The workaround is to just remove the placeholder records in Route53 for the etcd hosts.
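
A sketch of that cleanup with the AWS CLI, again with a placeholder zone ID; a DELETE change has to copy the existing record's name, type, TTL, and value exactly:

    # Find the placeholder etcd records that were created
    aws route53 list-resource-record-sets \
      --hosted-zone-id ZXXXXXXXXXXXXX \
      --query "ResourceRecordSets[?starts_with(Name, 'etcd')]"
    # Removing one is the same change-resource-record-sets call as in the sketch above,
    # but with "Action": "DELETE" and the record's current values copied verbatim.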

Seeing the same thing as the original reporter: moving to etcd-manager with Calico networking (or any pods that run outside the masters and require access to etcd) causes issues. The behaviour of etcd-manager is expected, so I'm at a bit of a loss as to what the correct solution is.

@MFAnderson Thanks!
I found that etcd-manager updates /etc/hosts correctly.

etcd-manager doesn't update /etc/hosts on the worker nodes, which causes CrashLoopBackOff for the calico-node pods running there.
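
A quick way to see the difference; the names below follow the usual kops internal naming and the address is illustrative:

    # On a master, etcd-manager keeps /etc/hosts entries for the etcd names:
    grep etcd /etc/hosts
    # e.g.
    # 172.20.32.10 etcd-a.internal.my.example.com
    # 172.20.32.10 etcd-events-a.internal.my.example.com
    #
    # On a worker node the same grep comes back empty, so pods there that talk to
    # etcd directly (calico-node, cilium, ...) cannot resolve the etcd endpoints.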

Not sure if this is at the root of it, but just wanted to confirm that I do have to disable etcd-manager in order for any of the etcd-using CNI providers (Calico, Cilium, etc.) to work at all.
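
In case it helps anyone else, disabling it again was just the inverse of the spec change sketched above (same caveat that the exact field may vary by kops version):

    kops edit cluster my.example.com
    #   provider: Legacy   # or drop the provider line to fall back to protokube-managed etcd
    kops update cluster my.example.com --yes
    kops rolling-update cluster my.example.com --yes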

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale.

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
