Kops: etcd is not started on CoreOS-based masters since 1688.4.0

Created on 29 Mar 2018 · 18 Comments · Source: kubernetes/kops

Thanks for submitting an issue! Please fill in as much of the template below as
you can.

------------- BUG REPORT TEMPLATE --------------------

  1. What kops version are you running? The command kops version, will display
    this information.

Version 1.9.0-alpha.3 (git-ad210dc4b)

  1. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.

1.9.3

  1. What cloud provider are you using?

AWS

  1. What commands did you run? What is the simplest way to reproduce this issue?

Build two clusters:

Cluster A (CoreOS-stable 1632.3.0)

kops create cluster     \
     --ssh-public-key=~/.ssh/my_ssh_key.pub \
     --authorization RBAC   \
     --node-count 3   \
     --zones "us-east-1a,us-east-1b,us-east-1c" \
     --master-zones "us-east-1a,us-east-1b,us-east-1c" \
     --node-size t2.large   \
     --master-size t2.medium  \
     --topology public  \
     --network-cidr=10.25.0.0/16 \
     --networking canal \
     --name coreos-1632-3-0.us-east-1.kube.redacted.com

Cluster B (CoreOS-stable 1688.4.0)

kops create cluster     \
     --ssh-public-key=~/.ssh/my_ssh_key.pub \
     --authorization RBAC   \
     --node-count 3   \
     --zones "us-east-1a,us-east-1b,us-east-1c" \
     --master-zones "us-east-1a,us-east-1b,us-east-1c" \
     --node-size t2.large   \
     --master-size t2.medium  \
     --topology public  \
     --network-cidr=10.25.0.0/16 \
     --networking canal \
     --name coreos-1688-4-0.us-east-1.kube.redacted.com

Edit the instance groups (IGs) for cluster A's masters to use the 595879546273/CoreOS-stable-1632.3.0-hvm image:
kops edit ig --name=coreos-1632-3-0.us-east-1.kube.redacted.com master-us-east-1a
kops edit ig --name=coreos-1632-3-0.us-east-1.kube.redacted.com master-us-east-1c
kops edit ig --name=coreos-1632-3-0.us-east-1.kube.redacted.com master-us-east-1b

Edit the instance groups (IGs) for cluster B's masters to use the 595879546273/CoreOS-stable-1688.4.0-hvm image:
kops edit ig --name=coreos-1688-4-0.us-east-1.kube.redacted.com master-us-east-1a
kops edit ig --name=coreos-1688-4-0.us-east-1.kube.redacted.com master-us-east-1c
kops edit ig --name=coreos-1688-4-0.us-east-1.kube.redacted.com master-us-east-1b

Instantiate the clusters:
kops update cluster coreos-1632-3-0.us-east-1.kube.redacted.com --yes
kops update cluster coreos-1688-4-0.us-east-1.kube.redacted.com --yes

After about ten minutes, validate both clusters and observe that the 1632.3.0 cluster is healthy but the 1688.4.0 cluster never validates. This is because etcd is not started under 1688.4.0.
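The wait-and-validate step can be scripted rather than timed by hand. The following is a minimal sketch, assuming bash; `wait_until` is a hypothetical helper written for this example (it is not part of kops) that retries any command until it succeeds or a timeout expires.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND every INTERVAL_SECONDS until it
# exits 0 (returns 0) or TIMEOUT_SECONDS have elapsed (returns 1).
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Example usage (commented out so the sketch runs without a cluster):
# wait_until 600 30 kops validate cluster --name coreos-1632-3-0.us-east-1.kube.redacted.com
# wait_until 600 30 kops validate cluster --name coreos-1688-4-0.us-east-1.kube.redacted.com
```

With the bug described here, the first call would be expected to return 0 within the timeout and the second to return 1.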

  1. What happened after the commands executed?

The 1688.4.0 cluster never validates; the 1632.3.0 cluster comes up and is healthy.

  1. What did you expect to happen?

1688.4.0 comes alive just like previous versions did.

  1. Please provide your cluster manifest. Execute

Available upon request.

blocks-next

All 18 comments

@chrissnell thanks for beating me to it. I ran into the same issue.

I am facing the exact same problem.

Rolling update using new CoreOS image triggered this bug for our clusters as well.

/assign @KashifSaadat @gambol99

Anyone have logs from protokube or etcd on why it is not starting?

Looks like 1688.4.0 is broken and auto-updates on the stable channel are paused https://groups.google.com/forum/#!topic/coreos-user/5ihE2cKuYck/discussion

@johanneswuerbach that certainly looks bad but I'm not sure if it's the same bug. On my cluster, 1688.4.0 does boot...it's just that etcd doesn't start.

Is kops attempting to use etcd2?

1688.4.0 is the first stable version since etcd2 was removed (first version was alpha 1675.0.1). More information @ https://coreos.com/blog/toward-etcd-v3-in-container-linux.html & https://groups.google.com/forum/#!topic/coreos-user/x89Az5blhFw/discussion

My cluster was configured to use etcd3 when I ran into this issue.

@arithx In my case, I was just using whatever kops chose by default which--I think--is still etcd2.

OK, I can confirm @sstarcher's report: specifying etcd 3.2.15 does not fix this issue. etcd still does not start.

Last night during troubleshooting we found this difference in the protokube log

working node:

I0329 00:47:38.308599    1276 mount_linux.go:408] Disk successfully formatted (mkfs): ext4 - /dev/xvdu /mnt/master-vol-0628600e8446e50cf
I0329 00:47:38.333788    1276 mount_linux.go:168] Detected OS with systemd
I0329 00:47:38.334423    1276 volume_mounter.go:194] mounting inside container: /rootfs/dev/xvdu -> /rootfs/mnt/master-vol-0628600e8446e50cf

broken node:

I0328 22:48:14.272611    1229 mount_linux.go:408] Disk successfully formatted (mkfs): ext4 - /dev/xvdv /mnt/master-vol-06f8ca421bf463dba
I0328 22:48:14.311179    1229 mount_linux.go:168] Detected OS with systemd
W0328 22:48:14.312184    1229 volume_mounter.go:76] unable to mount master volume: "device already mounted at /rootfs/mnt/master-vol-06f8ca421bf463dba, but is /dev/xvdv and we want /rootfs/dev/xvdv"

Looks like it mounts it incorrectly and never generates the etcd manifests for kubelet
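The warning in the broken node's log compares `/dev/xvdv` against `/rootfs/dev/xvdv`, which look like the same device seen with and without the container's `/rootfs` prefix. The sketch below only illustrates that mismatch; it assumes bash, the function names `strip_rootfs` and `same_device` are hypothetical, and it is not the change made in kops or protokube.

```shell
#!/usr/bin/env bash
# Hypothetical illustration only: not the actual protokube logic.

# strip_rootfs PATH [PREFIX]
# Prints PATH with a leading PREFIX directory (default: /rootfs) removed.
strip_rootfs() {
  local path="$1" prefix="${2:-/rootfs}"
  case "$path" in
    "$prefix"/*) printf '%s\n' "${path#"$prefix"}" ;;
    *)           printf '%s\n' "$path" ;;
  esac
}

# same_device A B
# Succeeds when A and B name the same path once the prefix is ignored.
same_device() {
  [ "$(strip_rootfs "$1")" = "$(strip_rootfs "$2")" ]
}

# The two paths from the broken node's warning compare equal this way:
if same_device /dev/xvdv /rootfs/dev/xvdv; then
  echo "same device"
fi
```

A plain string comparison of the two paths fails, which matches the "device already mounted" warning above.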

Something has changed with CoreOS’ filesystem or with how the mount paths are set up.

I am seeing that they changed to “/dev” instead of “/rootfs/dev”? Can someone confirm this?

Also does this work with c5’s?

Marked as blocks-next. We also need to make it easy to turn off auto-upgrades for CoreOS :-)

This is actually a duplicate ... https://github.com/kubernetes/kops/issues/4813

This is fixed in master following PR #4849 and made it into the latest release (1.9.0-beta.2), thanks @justinsb!

I've raised PR #4909 in relation to disabling the update-engine by default for CoreOS.

I'm going to close this. I honestly don't know if this was fixed by PR #4849 or by the new CoreOS release or by both, but everything seems to work now. Thanks @justinsb @KashifSaadat @chrislovecnm and everybody else for helping debug.

Closing this.

Can we get this fix back-ported to 1.8?
