kops version: Version 1.9.0-alpha.3 (git-ad210dc4b)
Kubernetes version: 1.9.3
Cloud provider: AWS
Build two clusters:
Cluster A (CoreOS-stable 1632.3.0)
kops create cluster \
--ssh-public-key=~/.ssh/my_ssh_key.pub \
--authorization RBAC \
--node-count 3 \
--zones "us-east-1a,us-east-1b,us-east-1c" \
--master-zones "us-east-1a,us-east-1b,us-east-1c" \
--node-size t2.large \
--master-size t2.medium \
--topology public \
--network-cidr=10.25.0.0/16 \
--networking canal \
--name coreos-1632-3-0.us-east-1.kube.redacted.com
Cluster B (CoreOS-stable 1688.4.0)
kops create cluster \
--ssh-public-key=~/.ssh/my_ssh_key.pub \
--authorization RBAC \
--node-count 3 \
--zones "us-east-1a,us-east-1b,us-east-1c" \
--master-zones "us-east-1a,us-east-1b,us-east-1c" \
--node-size t2.large \
--master-size t2.medium \
--topology public \
--network-cidr=10.25.0.0/16 \
--networking canal \
--name coreos-1688-4-0.us-east-1.kube.redacted.com
Edit the IGs for cluster A's masters to utilize the 595879546273/CoreOS-stable-1632.3.0-hvm image
kops edit ig --name=coreos-1632-3-0.us-east-1.kube.redacted.com master-us-east-1a
kops edit ig --name=coreos-1632-3-0.us-east-1.kube.redacted.com master-us-east-1c
kops edit ig --name=coreos-1632-3-0.us-east-1.kube.redacted.com master-us-east-1b
Edit the IGs for cluster B's masters to utilize the 595879546273/CoreOS-stable-1688.4.0-hvm image
kops edit ig --name=coreos-1688-4-0.us-east-1.kube.redacted.com master-us-east-1a
kops edit ig --name=coreos-1688-4-0.us-east-1.kube.redacted.com master-us-east-1c
kops edit ig --name=coreos-1688-4-0.us-east-1.kube.redacted.com master-us-east-1b
Instantiate the clusters:
kops update cluster coreos-1632-3-0.us-east-1.kube.redacted.com --yes
kops update cluster coreos-1688-4-0.us-east-1.kube.redacted.com --yes
After ten minutes, validate both clusters and observe that 1632.3.0 is healthy but 1688.4.0 never validates. This is because etcd does not start under 1688.4.0.
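For anyone reproducing this, the validation step can be scripted rather than checked by hand. The following is only a sketch: the function name `wait_for_validation`, the `KOPS_BIN` override, and the default attempt count and delay are assumptions made for illustration, not part of the original report. It relies only on `kops validate cluster --name`.

```shell
# Poll `kops validate cluster` until it succeeds or the attempts run out.
# KOPS_BIN is a hypothetical override so the loop can be dry-run without kops.
wait_for_validation() {
  name="$1"
  attempts="${2:-20}"   # assumed default: 20 tries
  delay="${3:-30}"      # assumed default: 30 seconds between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "${KOPS_BIN:-kops}" validate cluster --name "$name" >/dev/null 2>&1; then
      echo "$name: validated after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "$name: did not validate after $attempts attempt(s)"
  return 1
}

# Example (not run here):
#   wait_for_validation coreos-1632-3-0.us-east-1.kube.redacted.com
#   wait_for_validation coreos-1688-4-0.us-east-1.kube.redacted.com
```

With the defaults above the loop gives up after roughly ten minutes, which matches the wait described in the report.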
What happened: the 1688.4.0 cluster never validates; the 1632.3.0 cluster comes up and is healthy.
Expected: the 1688.4.0 cluster comes up healthy, just as previous versions did.
Cluster manifest: available upon request.
@chrissnell thanks for beating me to it. I ran into the same issue.
I am facing the exact same problem.
Rolling update using new CoreOS image triggered this bug for our clusters as well.
/assign @KashifSaadat @gambol99
Anyone have logs from protokube or etcd on why it is not starting?
Looks like 1688.4.0 is broken and auto-updates on the stable channel are paused https://groups.google.com/forum/#!topic/coreos-user/5ihE2cKuYck/discussion
@johanneswuerbach that certainly looks bad but I'm not sure if it's the same bug. On my cluster, 1688.4.0 does boot...it's just that etcd doesn't start.
Is kops attempting to use etcd2?
1688.4.0 is the first stable version since etcd2 was removed (first version was alpha 1675.0.1). More information @ https://coreos.com/blog/toward-etcd-v3-in-container-linux.html & https://groups.google.com/forum/#!topic/coreos-user/x89Az5blhFw/discussion
My cluster was configured to use etcd3 when I ran into this issue.
@arithx In my case, I was just using whatever kops chose by default which--I think--is still etcd2.
OK, I can confirm @sstarcher's report: specifying etcd 3.2.15 does not fix this issue. etcd still does not start.
Last night during troubleshooting we found this difference in the protokube log
working node:
I0329 00:47:38.308599 1276 mount_linux.go:408] Disk successfully formatted (mkfs): ext4 - /dev/xvdu /mnt/master-vol-0628600e8446e50cf
I0329 00:47:38.333788 1276 mount_linux.go:168] Detected OS with systemd
I0329 00:47:38.334423 1276 volume_mounter.go:194] mounting inside container: /rootfs/dev/xvdu -> /rootfs/mnt/master-vol-0628600e8446e50cf
broken node:
I0328 22:48:14.272611 1229 mount_linux.go:408] Disk successfully formatted (mkfs): ext4 - /dev/xvdv /mnt/master-vol-06f8ca421bf463dba
I0328 22:48:14.311179 1229 mount_linux.go:168] Detected OS with systemd
W0328 22:48:14.312184 1229 volume_mounter.go:76] unable to mount master volume: "device already mounted at /rootfs/mnt/master-vol-06f8ca421bf463dba, but is /dev/xvdv and we want /rootfs/dev/xvdv"
Looks like it mounts it incorrectly and never generates the etcd manifests for kubelet
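To make the mismatch in that warning easier to spot, here is a small sketch that pulls the two device paths out of a protokube log line and checks whether they differ only by a leading `/rootfs`. The function name `explain_mount_warning` is hypothetical, and this is not how protokube itself compares paths or how the issue was eventually fixed; it only illustrates what the warning text is saying.

```shell
# Parse an "unable to mount master volume" warning and compare the device
# path that was found with the one protokube wanted.
explain_mount_warning() {
  line="$1"
  # Path reported after "but is"
  found=$(printf '%s\n' "$line" | sed -n 's/.*but is \([^ ]*\) and we want.*/\1/p')
  # Path reported after "and we want" (stops at a space or closing quote)
  wanted=$(printf '%s\n' "$line" | sed -n 's/.*and we want \([^" ]*\).*/\1/p')
  if [ -z "$found" ] || [ -z "$wanted" ]; then
    echo "no mount mismatch found"
    return 1
  fi
  # Strip a leading /rootfs from the wanted path before comparing.
  if [ "${wanted#/rootfs}" = "$found" ]; then
    echo "same device, differs only by /rootfs prefix: $found vs $wanted"
  else
    echo "different devices: $found vs $wanted"
  fi
}
```

Run against the warning from the broken node, this reports that `/dev/xvdv` and `/rootfs/dev/xvdv` are the same device seen from two different mount namespaces, which suggests the volume is not actually mounted wrong; the path comparison is what fails.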
Something has changed with CoreOS's filesystem or with how the mount paths are set up.
I am seeing that they changed to “/dev” instead of “/rootfs/dev”? Can someone confirm this?
Also does this work with c5’s?
Marked as blocks-next. We also need to make it easy to turn off auto-upgrades for CoreOS :-)
This is actually a duplicate ... https://github.com/kubernetes/kops/issues/4813
This is fixed in master following PR #4849 and made it into the latest release (1.9.0-beta.2), thanks @justinsb!
I've raised PR #4909 in relation to disabling the update-engine by default for CoreOS.
I'm going to close this. I honestly don't know if this was fixed by PR #4849 or by the new CoreOS release or by both, but everything seems to work now. Thanks @justinsb @KashifSaadat @chrislovecnm and everybody else for helping debug.
Closing this.
Can we get this fix back-ported to 1.8?