Kops: Route53 Records are not updated when etcd-manager is enabled

Created on 21 Aug 2018 · 13 comments · Source: kubernetes/kops

1. What kops version are you running? The command kops version will display
this information.
1.10.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-08T16:31:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:04:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
Enabled etcd-manager in a test cluster (a rough sketch of how is included below, after this template).

5. What happened after the commands executed?
As etcd-manager disables protokube's management of etcd, the master node no longer creates the Route53 records for the etcd cluster (https://github.com/kubernetes/kops/blob/master/protokube/pkg/protokube/kube_boot.go#L133), and etcd-manager does not appear to update the Route53 records either. As a result, etcd-related deployments fail (the calico-node DaemonSet goes into CrashLoopBackOff). Once I manually set the DNS records for etcd (second sketch below the template), calico-node became healthy.

6. What did you expect to happen?
When etcd-manager is enabled, the Route53 records for the etcd cluster should still be updated.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?
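
For context on step 4: on the kops releases I have used, etcd-manager is toggled per etcd cluster through the provider field in the cluster spec. The field name and values below are from memory and may differ on 1.10, so treat this as a rough sketch rather than the exact commands that were run:

    kops edit cluster my.example.com
    # under spec.etcdClusters, for both the "main" and "events" clusters:
    #   etcdMembers:
    #   - instanceGroup: master-us-east-1a
    #     name: a
    #   name: main
    #   provider: Manager   # hands etcd management to etcd-manager; "Legacy" (or
    #                       # omitting the field) keeps the old protokube-managed flow
    kops update cluster my.example.com --yes
    kops rolling-update cluster my.example.com --yes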
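
And for the manual DNS workaround mentioned in step 5, a minimal sketch using the AWS CLI. The hosted zone ID, IP address, and TTL are placeholders, and the etcd-a.internal.<cluster> name just follows the usual kops internal naming convention:

    # Upsert an A record for one etcd member (repeat for each member and for etcd-events)
    aws route53 change-resource-record-sets \
      --hosted-zone-id ZXXXXXXXXXXXXX \
      --change-batch '{
        "Changes": [{
          "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "etcd-a.internal.my.example.com",
            "Type": "A",
            "TTL": 60,
            "ResourceRecords": [{"Value": "172.20.32.10"}]
          }
        }]
      }'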

lifecycle/rotten

Most helpful comment

etcd-manager doesn't update /etc/hosts on the worker nodes, which causes CrashLoopBackOff for the calico-node pods running there.

All 13 comments

Is this related to #3502?

We are seeing the same issue here while experimenting with etcd-manager and kops. This is a non-starter, a really big issue: it forces us to manually update the Route53 entries for etcd main and events whenever a master is recreated.

In an HA cluster, the etcd members were using the static entries in Route53 to reach the other members of the cluster. Of course, that times out, since the IP addresses are placeholders.

When I remove those entries from Route53, the unavailable members of the cluster stop resolving to the static IP. The workaround is to just remove the placeholder records in Route53 for the etcd hosts.
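
A sketch of that cleanup with the AWS CLI, again with a placeholder zone ID; a DELETE change has to copy the existing record's name, type, TTL, and value exactly:

    # Find the placeholder etcd records that were created
    aws route53 list-resource-record-sets \
      --hosted-zone-id ZXXXXXXXXXXXXX \
      --query "ResourceRecordSets[?starts_with(Name, 'etcd')]"
    # Removing one is the same change-resource-record-sets call as in the sketch above,
    # but with "Action": "DELETE" and the record's current values copied verbatim.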

Seeing the same thing as the original reporter: moving to etcd-manager with Calico networking (or any pods that run outside the masters and require access to etcd) causes issues. The behaviour of etcd-manager is expected, so I'm at a bit of a loss as to what the correct solution is.

@MFAnderson Thanks!
I found that etcd-manager updates /etc/hosts correctly.

etcd-manager doesn't update /etc/hosts on the worker nodes, which causes CrashLoopBackOff for the calico-node pods running there.
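
A quick way to see the difference; the names below follow the usual kops internal naming and the address is illustrative:

    # On a master, etcd-manager keeps /etc/hosts entries for the etcd names:
    grep etcd /etc/hosts
    # e.g.
    # 172.20.32.10 etcd-a.internal.my.example.com
    # 172.20.32.10 etcd-events-a.internal.my.example.com
    #
    # On a worker node the same grep comes back empty, so pods there that talk to
    # etcd directly (calico-node, cilium, ...) cannot resolve the etcd endpoints.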

Not sure if this is at the root of it, but just wanted to confirm that I do have to disable etcd-manager in order for any of the etcd-using CNI providers (Calico, Cilium, etc.) to work at all.
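
In case it helps anyone else, disabling it again was just the inverse of the spec change sketched above (same caveat that the exact field may vary by kops version):

    kops edit cluster my.example.com
    #   provider: Legacy   # or drop the provider line to fall back to protokube-managed etcd
    kops update cluster my.example.com --yes
    kops rolling-update cluster my.example.com --yes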

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale.

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
