Rke: /old-kube-apiserver and /old-kube-proxy not cleaned up properly

Created on 20 Apr 2018 · 10 Comments · Source: rancher/rke

RKE version: 1.5

Docker version: 1.12.6

Operating system and kernel: 4.14.19-coreos

Type/provider of hosts: Azure

cluster.yml file:

Steps to Reproduce:

Deploy cluster with v1.8.7-rancher1-1
Upgrade cluster to v1.8.9-rancher1-1
Cluster upgrade fails
Destroy cluster
Run docker ps -a: the old containers still exist and cause problems with deployments/upgrades
SSH into every machine and docker rm manually
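
The manual cleanup in the last step can be partly scripted. The following is a minimal sketch, assuming the leftover containers all carry the "old-" name prefix seen in this issue and that the names were collected on each host with docker ps -a --format '{{.Names}}'. The helper names stale_container_names and removal_command are hypothetical, and the sample output is made up for illustration. The sketch only builds the docker rm command; it does not run it.

```python
def stale_container_names(ps_output, prefix="old-"):
    """Pick out leftover containers from `docker ps -a --format '{{.Names}}'` output.

    Assumption: leftovers are identified purely by the "old-" name prefix.
    """
    names = [line.strip() for line in ps_output.splitlines()]
    return [name for name in names if name.startswith(prefix)]


def removal_command(names):
    """Build the argv for `docker rm`, or None when there is nothing to remove."""
    return ["docker", "rm", *names] if names else None


# Illustrative output from one host (not taken from the issue).
sample = "kubelet\nold-kube-apiserver\nkube-proxy\nold-kube-proxy\n"
stale = stale_container_names(sample)
print(removal_command(stale))
```

Because the leftover containers are already stopped, plain docker rm without -f should be enough.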

Results:

Cluster upgrades are not very stable and fail pretty often with the above specs. Typically this results in an error that looks like this:

FATA[0020] [controlPlane] Failed to bring up Control Plane: Can't rename Docker container [kube-apiserver] for host [172.28.2.5]: Error response from daemon: Error when allocating new name: Conflict. The name "/old-kube-apiserver" is already in use by container ff4bb53efc2ea63b46d993f3d93a94228076f5944c009342b87d7d98188d1f65. You have to remove (or rename) that container to be able to reuse that name.
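
When this happens on several nodes, it helps to pull the host, the conflicting name, and the container ID out of the message so you know what to remove where. A rough sketch follows, assuming the message keeps exactly the wording shown above (other RKE or Docker versions may phrase it differently). The function name parse_conflict is hypothetical.

```python
import re

# Assumption: the pattern matches only the wording of the error quoted in this issue.
CONFLICT = re.compile(
    r'container \[(?P<container>[^\]]+)\] for host \[(?P<host>[^\]]+)\]'
    r'.*?The name "(?P<name>[^"]+)" is already in use by container (?P<id>[0-9a-f]+)'
)


def parse_conflict(message):
    """Return a dict with container, host, name, and id, or None if no match."""
    match = CONFLICT.search(message)
    return match.groupdict() if match else None


message = (
    "FATA[0020] [controlPlane] Failed to bring up Control Plane: "
    "Can't rename Docker container [kube-apiserver] for host [172.28.2.5]: "
    "Error response from daemon: Error when allocating new name: Conflict. "
    'The name "/old-kube-apiserver" is already in use by container '
    "ff4bb53efc2ea63b46d993f3d93a94228076f5944c009342b87d7d98188d1f65. "
    "You have to remove (or rename) that container to be able to reuse that name."
)
conflict = parse_conflict(message)
print(conflict["host"], conflict["name"], conflict["id"][:12])
```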

These containers hang around even though they are stopped. Even if the cluster is destroyed, they still exist. This continues to cause problems once a failure occurs.

kind/bug

rossedman

All 10 comments

Wanted to add to this that after I removed these failed containers I got 5 successful upgrades in a row. So I think it was caused by this. We were hitting our ingress controllers + services at about 1000 rps while doing the upgrade as well.

rossedman on 20 Apr 2018

🎉1 👍1

I can't reproduce the problem with v0.1.5. What version was used for the initial deployment being upgraded here?

moelsayed on 22 Apr 2018

@moelsayed It was all 1.5 for everything. Stood up and tore down the cluster all in the same sitting.

rossedman on 23 Apr 2018

@moelsayed Do you have a way to fail a cluster upgrade? Maybe you can recreate it by bailing halfway through. I can try to find some steps to replicate it better.

rossedman on 26 Apr 2018

Our upgrade process is:
1 - stop and rename the old container
2 - create and start the new container
3 - remove the old container

If the upgrade process was interrupted during step 2, it might cause this.
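
The failure mode can be illustrated with a small in-memory simulation of the three steps. This is a sketch only, not RKE's code: FakeDocker, NameConflict, upgrade, and the interrupted and clean_stale parameters are all hypothetical names, and removing a stale old- container before renaming is just one plausible remedy, not necessarily what the eventual fix does.

```python
class NameConflict(Exception):
    """Raised when a container name is already taken, as the Docker daemon does."""


class FakeDocker:
    """In-memory stand-in for one Docker host: container name -> state."""

    def __init__(self):
        self.containers = {}

    def create(self, name):
        if name in self.containers:
            raise NameConflict(name)
        self.containers[name] = "created"

    def start(self, name):
        self.containers[name] = "running"

    def stop(self, name):
        if name not in self.containers:
            raise KeyError(name)
        self.containers[name] = "exited"

    def rename(self, old, new):
        if new in self.containers:
            raise NameConflict(new)
        self.containers[new] = self.containers.pop(old)

    def remove(self, name):
        self.containers.pop(name, None)


def upgrade(docker, name, interrupted=False, clean_stale=False):
    old = "old-" + name
    if clean_stale:
        docker.remove(old)        # drop leftovers from an earlier failed attempt
    docker.stop(name)             # step 1: stop and rename the old container
    docker.rename(name, old)
    docker.create(name)           # step 2: create ...
    if interrupted:
        return                    # ... simulated interruption before start
    docker.start(name)
    docker.remove(old)            # step 3: remove the old container


host = FakeDocker()
host.create("kube-apiserver")
host.start("kube-apiserver")

upgrade(host, "kube-apiserver", interrupted=True)   # leaves old-kube-apiserver behind
try:
    upgrade(host, "kube-apiserver")                 # retry hits the name conflict
    retry_failed = False
except NameConflict as exc:
    retry_failed = True
    print("retry failed, name already in use:", exc)

upgrade(host, "kube-apiserver", clean_stale=True)   # retry with cleanup succeeds
print(host.containers)
```

In this model the retry fails at the rename in step 1, which matches the "Can't rename Docker container" error reported above.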

moelsayed on 26 Apr 2018

@moelsayed Yeah, I think the concerning part is that if it does fail and you try to upgrade after that, the failures keep happening even if you destroy the cluster, because those containers hang around. So you have to manually SSH into every node and remove them yourself.

rossedman on 26 Apr 2018

To test this:

Create an rke cluster with version 1.8.10
Try to upgrade the cluster to 1.10.1 with debug enabled
Watch for the message Successfully stopped old container and cancel the upgrade.
Try to upgrade again. Before this fix, the upgrade would break. Using the fix, it will continue with no problem.

moelsayed on 10 May 2018

@moelsayed will test as well. need to do those exact upgrades this week.

rossedman on 10 May 2018

👍1

Performed the above test scenario. The upgrade completed successfully:

INFO[0039] Finished building Kubernetes cluster successfully

adingilloRancher on 14 May 2018

👍2

@adingilloRancher Flavortown achievement activated.

rossedman on 14 May 2018
