RKE version: 1.5
Docker version: (docker version,docker info preferred) 1.12.6
Operating system and kernel: (cat /etc/os-release, uname -r preferred) 4.14.19-coreos
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Azure
cluster.yml file:
Steps to Reproduce:
v1.8.7-rancher1-1
v1.8.9-rancher1-1
docker ps -a, old containers still exist and cause problems with deployments/upgrades
docker rm manually
Results:
Cluster upgrades are not very stable and fail pretty often with the above specs. Typically this results in an error that looks like this:
FATA[0020] [controlPlane] Failed to bring up Control Plane: Can't rename Docker container [kube-apiserver] for host [172.28.2.5]: Error response from daemon: Error when allocating new name: Conflict. The name "/old-kube-apiserver" is already in use by container ff4bb53efc2ea63b46d993f3d93a94228076f5944c009342b87d7d98188d1f65. You have to remove (or rename) that container to be able to reuse that name.
These containers hang around even though they are stopped. Even if the cluster is destroyed, they still exist. This continues to cause problems once a failure occurs.
Wanted to add to this that after I removed these failed containers I got 5 successful upgrades in a row, so I think the failures were caused by them. We were hitting our ingress controllers + services at about 1000 rps while doing the upgrade as well.
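For anyone doing the same manual cleanup, here is a minimal sketch of how it could be scripted on a single node. It is not part of RKE. It assumes the leftovers are the stopped containers whose names start with `old-` (as in the error above), and the function names are made up for illustration.

```python
import subprocess


def stale_old_containers(ps_output: str, prefix: str = "old-") -> list[str]:
    """Return names of non-running containers whose name starts with `prefix`.

    `ps_output` is expected to hold one "name<TAB>status" pair per line.
    """
    stale = []
    for line in ps_output.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        name, status = line.split("\t", 1)
        # Running containers report a status beginning with "Up"; leave them alone.
        if name.startswith(prefix) and not status.startswith("Up"):
            stale.append(name)
    return stale


def remove_stale_old_containers() -> list[str]:
    """List containers on the local Docker daemon and remove stale old-* ones."""
    ps = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        check=True, capture_output=True, text=True,
    )
    names = stale_old_containers(ps.stdout)
    if names:
        subprocess.run(["docker", "rm", *names], check=True)
    return names
```

Calling `remove_stale_old_containers()` on a node returns the names it removed. Only `stale_old_containers` is pure, so that is the part worth checking against sample `docker ps -a` output before running it for real.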
I can't reproduce the problem with v0.1.5. What version was used for the initial deployment being upgraded here?
@moelsayed It was all 1.5 for everything. Stood up and tore down the cluster all in the same sitting.
@moelsayed do you have a way to fail a cluster upgrade? maybe you can recreate by bailing half way through. I can try to find some steps to replicate better.
Our upgrade process is:
1 - stop and rename the old container
2 - create and start the new container
3 - remove the old container
If the upgrade process was interrupted during step 2, it might cause this.
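To illustrate why an interrupted run keeps breaking later runs, here is a small in-memory simulation of the three steps. It is a sketch under stated assumptions, not RKE's actual code or its actual fix: `FakeDocker`, `upgrade`, `demo`, and the `clean_leftovers` flag are hypothetical, and the only Docker behavior modeled is the name conflict on rename and create.

```python
class Conflict(Exception):
    """Raised when a container name is already in use, as the Docker daemon does."""


class Interrupted(Exception):
    """Stands in for the operator cancelling the upgrade."""


class FakeDocker:
    """In-memory stand-in for one Docker host: maps container name -> image."""

    def __init__(self):
        self.containers = {}

    def create(self, name, image):
        if name in self.containers:
            raise Conflict(name)
        self.containers[name] = image

    def rename(self, src, dst):
        if dst in self.containers:
            raise Conflict(dst)
        self.containers[dst] = self.containers.pop(src)

    def remove(self, name):
        self.containers.pop(name, None)  # removing a missing container is a no-op


def upgrade(docker, name, image, clean_leftovers=False, interrupt=False):
    old = "old-" + name
    if clean_leftovers and name in docker.containers:
        # A leftover old-* container from an earlier failed run would block the rename.
        docker.remove(old)
    # Step 1: stop and rename the old container.
    if name in docker.containers:
        docker.rename(name, old)
    # Step 2: create and start the new container.
    docker.create(name, image)
    if interrupt:
        raise Interrupted("cancelled during step 2")
    # Step 3: remove the old container.
    docker.remove(old)


def demo(clean_leftovers):
    docker = FakeDocker()
    docker.create("kube-apiserver", "v1.8.10")
    try:
        upgrade(docker, "kube-apiserver", "v1.10.1", clean_leftovers, interrupt=True)
    except Interrupted:
        pass  # both kube-apiserver and old-kube-apiserver now exist
    try:
        upgrade(docker, "kube-apiserver", "v1.10.1", clean_leftovers)
        return "ok"
    except Conflict as exc:
        return f"conflict on {exc}"
```

In this model, `demo(clean_leftovers=False)` hits a conflict on `old-kube-apiserver` on the retry, which mirrors the reported error, while `demo(clean_leftovers=True)` completes because the leftover is removed before the rename.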
@moelsayed yeah, I think the concerning part is that if it does fail and you try to upgrade after that, it keeps failing, even if you destroy the cluster, because those containers hang around. So you have to manually SSH into every node and remove them yourself.
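Until that is handled automatically, the per-node cleanup could be batched. The sketch below only builds the `ssh` command lines, one per node. The SSH user and host list are placeholders you would supply, `xargs -r` assumes GNU xargs on the nodes, and `--filter name=old-` matches any container whose name contains `old-`, so check the list before removing anything.

```python
# Remote pipeline: list stopped containers with "old-" in the name, then remove them.
# `xargs -r` (GNU extension) skips `docker rm` when the list is empty.
REMOTE_CLEANUP = (
    "docker ps -aq --filter name=old- --filter status=exited "
    "| xargs -r docker rm"
)


def cleanup_commands(hosts, user):
    """Return one ssh argv list per host; nothing is executed here."""
    return [["ssh", f"{user}@{host}", REMOTE_CLEANUP] for host in hosts]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the target hosts have been reviewed.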
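To test this:
Create an rke cluster with version 1.8.10
Try to upgrade the cluster to 1.10.1 with debug enabled
Watch for the message Successfully stopped old container and cancel the upgrade.
Try to upgrade again. Before this fix, the upgrade would break. Using the fix, it will continue with no problem.
@moelsayed will test as well. Need to do those exact upgrades this week.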
Performed above test scenario.
To test this:
Create an rke cluster with version 1.8.10
Try to upgrade the cluster to 1.10.1 with debug enabled
Watch for the message Successfully stopped old container and cancel the upgrade.
Try to upgrade again. Before this fix, the upgrade would break. Using the fix, it will continue with no problem.
INFO[0039] Finished building Kubernetes cluster successfully
@adingilloRancher Flavortown achievement activated.