Kops: etcd is not started on CoreOS-based masters since 1688.4.0

Created on 29 Mar 2018 · 18 Comments · Source: kubernetes/kops

What kops version are you running? The command kops version will display
this information.

Version 1.9.0-alpha.3 (git-ad210dc4b)

What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.9.3

What cloud provider are you using?

AWS

What commands did you run? What is the simplest way to reproduce this issue?

Build two clusters:

Cluster A (CoreOS-stable 1632.3.0)

kops create cluster \
     --ssh-public-key=~/.ssh/my_ssh_key.pub \
     --authorization RBAC \
     --node-count 3 \
     --zones "us-east-1a,us-east-1b,us-east-1c" \
     --master-zones "us-east-1a,us-east-1b,us-east-1c" \
     --node-size t2.large \
     --master-size t2.medium \
     --topology public \
     --network-cidr=10.25.0.0/16 \
     --networking canal \
     --name coreos-1632-3-0.us-east-1.kube.redacted.com

Cluster B (CoreOS-stable 1688.4.0)

kops create cluster \
     --ssh-public-key=~/.ssh/my_ssh_key.pub \
     --authorization RBAC \
     --node-count 3 \
     --zones "us-east-1a,us-east-1b,us-east-1c" \
     --master-zones "us-east-1a,us-east-1b,us-east-1c" \
     --node-size t2.large \
     --master-size t2.medium \
     --topology public \
     --network-cidr=10.25.0.0/16 \
     --networking canal \
     --name coreos-1688-4-0.us-east-1.kube.redacted.com

Edit the IGs for cluster A's masters to use the 595879546273/CoreOS-stable-1632.3.0-hvm image:
kops edit ig --name=coreos-1632-3-0.us-east-1.kube.redacted.com master-us-east-1a
kops edit ig --name=coreos-1632-3-0.us-east-1.kube.redacted.com master-us-east-1c
kops edit ig --name=coreos-1632-3-0.us-east-1.kube.redacted.com master-us-east-1b

Edit the IGs for cluster B's masters to use the 595879546273/CoreOS-stable-1688.4.0-hvm image:
kops edit ig --name=coreos-1688-4-0.us-east-1.kube.redacted.com master-us-east-1a
kops edit ig --name=coreos-1688-4-0.us-east-1.kube.redacted.com master-us-east-1c
kops edit ig --name=coreos-1688-4-0.us-east-1.kube.redacted.com master-us-east-1b
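
Since kops edit ig opens an interactive editor, a scripted variant of the image change may be easier to repeat across the three master IGs. The following is a minimal sketch, assuming the exported InstanceGroup manifest carries the image on a single `image:` line; `set_ig_image` is a hypothetical helper, not part of kops, and the file names in the usage comments are placeholders.

```shell
# Sketch: print an InstanceGroup manifest with its "image:" line replaced.
# set_ig_image is a hypothetical helper, not a kops command.
set_ig_image() {
  local manifest="$1" image="$2"
  # Rewrite only the "image:" line and keep its indentation.
  # "|" is used as the sed delimiter because image names contain "/".
  sed -E "s|^([[:space:]]*image:).*|\1 ${image}|" "$manifest"
}

# Possible usage (needs kops and access to the state store, so it is
# left commented out here):
#   CLUSTER=coreos-1688-4-0.us-east-1.kube.redacted.com
#   kops get ig master-us-east-1a --name "$CLUSTER" -o yaml > ig.yaml
#   set_ig_image ig.yaml 595879546273/CoreOS-stable-1688.4.0-hvm > ig-new.yaml
#   kops replace -f ig-new.yaml
```

The helper only touches the manifest text; the change still has to be applied with kops update cluster as in the next step.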

Instantiate the clusters:
kops update cluster coreos-1632-3-0.us-east-1.kube.redacted.com --yes
kops update cluster coreos-1688-4-0.us-east-1.kube.redacted.com --yes

After about ten minutes, validate both clusters and observe that the 1632.3.0 cluster is healthy but the 1688.4.0 cluster never validates. This is because etcd is not started under 1688.4.0.
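
The validation step can be scripted so both clusters are checked the same way. This is a rough sketch, assuming kops validate cluster exits non-zero when a cluster does not validate; `validate_clusters` and the `KOPS_BIN` override are hypothetical names introduced here so the loop can be tried without a real cluster.

```shell
# Sketch: run "kops validate cluster" for each name and summarize the result.
# KOPS_BIN is a hypothetical override; it defaults to the real kops binary.
validate_clusters() {
  local kops_bin="${KOPS_BIN:-kops}" name
  for name in "$@"; do
    # Only the exit status is used; kops output is discarded.
    if "$kops_bin" validate cluster --name "$name" >/dev/null 2>&1; then
      echo "$name: validated"
    else
      echo "$name: NOT validated"
    fi
  done
}

# Possible usage:
#   validate_clusters \
#     coreos-1632-3-0.us-east-1.kube.redacted.com \
#     coreos-1688-4-0.us-east-1.kube.redacted.com
```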

What happened after the commands executed?

The 1688.4.0 cluster never validates; the 1632.3.0 comes up and is healthy.

What did you expect to happen?

1688.4.0 comes alive just like previous versions did.

Please provide your cluster manifest.

Available upon request.

blocks-next

chrissnell

👍7

All 18 comments

@chrissnell thanks for beating me to it. I ran into the same issue.

sstarcher on 29 Mar 2018

👍5

I am facing the exact same problem.

JayBee6 on 29 Mar 2018

Rolling update using new CoreOS image triggered this bug for our clusters as well.

azuretek on 29 Mar 2018

👍1

/assign @KashifSaadat @gambol99

chrislovecnm on 29 Mar 2018

Anyone have logs from protokube or etcd on why it is not starting?

chrislovecnm on 29 Mar 2018

Looks like 1688.4.0 is broken and auto-updates on the stable channel are paused: https://groups.google.com/forum/#!topic/coreos-user/5ihE2cKuYck/discussion

johanneswuerbach on 29 Mar 2018

@johanneswuerbach that certainly looks bad but I'm not sure if it's the same bug. On my cluster, 1688.4.0 does boot...it's just that etcd doesn't start.

chrissnell on 29 Mar 2018

Is kops attempting to use etcd2?

1688.4.0 is the first stable version since etcd2 was removed (first version was alpha 1675.0.1). More information @ https://coreos.com/blog/toward-etcd-v3-in-container-linux.html & https://groups.google.com/forum/#!topic/coreos-user/x89Az5blhFw/discussion

arithx on 29 Mar 2018

My cluster was configured to use etcd3 when I ran into this issue.

sstarcher on 29 Mar 2018

@arithx In my case, I was just using whatever kops chose by default, which I think is still etcd2.

chrissnell on 29 Mar 2018

OK, I can confirm @sstarcher's report: specifying etcd 3.2.15 does not fix this issue. etcd still does not start.

chrissnell on 29 Mar 2018

Last night during troubleshooting we found this difference in the protokube log:

working node:

I0329 00:47:38.308599    1276 mount_linux.go:408] Disk successfully formatted (mkfs): ext4 - /dev/xvdu /mnt/master-vol-0628600e8446e50cf
I0329 00:47:38.333788    1276 mount_linux.go:168] Detected OS with systemd
I0329 00:47:38.334423    1276 volume_mounter.go:194] mounting inside container: /rootfs/dev/xvdu -> /rootfs/mnt/master-vol-0628600e8446e50cf

broken node:

I0328 22:48:14.272611    1229 mount_linux.go:408] Disk successfully formatted (mkfs): ext4 - /dev/xvdv /mnt/master-vol-06f8ca421bf463dba
I0328 22:48:14.311179    1229 mount_linux.go:168] Detected OS with systemd
W0328 22:48:14.312184    1229 volume_mounter.go:76] unable to mount master volume: "device already mounted at /rootfs/mnt/master-vol-06f8ca421bf463dba, but is /dev/xvdv and we want /rootfs/dev/xvdv"

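It looks like protokube fails the mount because the device path it sees (/dev/xvdv) does not match the one it expects (/rootfs/dev/xvdv), so it never generates the etcd manifests for the kubelet.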

azuretek on 29 Mar 2018
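
To check other masters for the same symptom, the two device paths can be pulled out of the protokube warning and compared. This is a small sketch that assumes the warning text has exactly the wording shown in the broken-node log above; `check_mount_warning` is a hypothetical helper, not part of protokube, and it only illustrates the mismatch rather than how kops resolves it.

```shell
# Sketch: extract the "is" and "want" device paths from protokube's
# "unable to mount master volume" warning and compare them.
# check_mount_warning is a hypothetical helper, not part of protokube.
check_mount_warning() {
  local line="$1" have want
  # Path the mount table reports, e.g. /dev/xvdv
  have="$(printf '%s\n' "$line" | sed -n 's/.*but is \([^ ]*\) and we want.*/\1/p')"
  # Path protokube expected, e.g. /rootfs/dev/xvdv
  want="$(printf '%s\n' "$line" | sed -n 's/.*and we want \([^" ]*\).*/\1/p')"
  # Not the warning we are looking for.
  [ -n "$have" ] && [ -n "$want" ] || return 1
  # Strip a leading /rootfs from the expected path before comparing.
  if [ "${want#/rootfs}" = "$have" ]; then
    echo "same device, different prefix: $have vs $want"
  else
    echo "different devices: $have vs $want"
  fi
}
```

Fed the warning line from the broken node, the function reports that both paths name the same device and differ only by the /rootfs prefix, which is consistent with the reading that the mount check is comparing a host path against a container path.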

Something has changed with CoreOS' filesystem or with how the mount paths are reported.

I am seeing that the device path changed to “/dev” instead of “/rootfs/dev”? Can someone confirm this?

Also, does this work with c5 instances?

chrislovecnm on 30 Mar 2018

Marked as blocks-next. We also need to make it easy to turn off auto-upgrades for CoreOS :-)

justinsb on 30 Mar 2018

This is actually a duplicate ... https://github.com/kubernetes/kops/issues/4813

chrislovecnm on 31 Mar 2018

This is fixed in master by PR #4849, and the fix made it into the latest release (1.9.0-beta.2). Thanks @justinsb!

I've raised PR #4909 in relation to disabling the update-engine by default for CoreOS.

KashifSaadat on 4 Apr 2018

🎉1
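
Until a kops release with the update-engine change is in use, automatic updates can be turned off by hand on a CoreOS node. The sketch below stops and masks the two units that CoreOS Container Linux uses for updates (update-engine.service and locksmithd.service). It is not taken from PR #4909 and may differ from what that PR does; `disable_coreos_updates` and the `SYSTEMCTL` override are hypothetical names introduced here.

```shell
# Sketch: stop and mask the CoreOS update services on the current node.
# SYSTEMCTL is a hypothetical override; it defaults to the real systemctl.
disable_coreos_updates() {
  local systemctl_bin="${SYSTEMCTL:-systemctl}" unit
  for unit in update-engine.service locksmithd.service; do
    # Stop the running unit, then mask it so it cannot be started again.
    "$systemctl_bin" stop "$unit"
    "$systemctl_bin" mask "$unit"
  done
}

# Possible usage on a node, as root:
#   disable_coreos_updates
```

Masking only affects the node it is run on, so replacement instances launched from the same image would need the same treatment.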

I'm going to close this. I honestly don't know if this was fixed by PR #4849 or by the new CoreOS release or by both, but everything seems to work now. Thanks @justinsb @KashifSaadat @chrislovecnm and everybody else for helping debug.

Closing this.

chrissnell on 5 Apr 2018

Can we get this fix back-ported to 1.8?

macropin on 6 Apr 2018

👍1
