RKE: /old-kube-apiserver and /old-kube-proxy not cleaned up properly

Created on 20 Apr 2018 · 10 Comments · Source: rancher/rke

RKE version: 1.5

Docker version: (docker version,docker info preferred) 1.12.6

Operating system and kernel: (cat /etc/os-release, uname -r preferred) 4.14.19-coreos

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Azure

cluster.yml file:

Steps to Reproduce:

  • Deploy cluster with v1.8.7-rancher1-1
  • Upgrade cluster to v1.8.9-rancher1-1
  • Cluster upgrade fails
  • Destroy cluster
  • Run docker ps -a: the old containers still exist and cause problems with later deployments/upgrades
  • SSH into every machine and remove them manually with docker rm
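
The manual cleanup in the last step can be sketched as a small helper. This is only an illustration: the function names are hypothetical, and the "old-" prefix is assumed from the container names in the error message below. It takes the output of `docker ps -a --format '{{.Names}}'` from one node and builds the matching `docker rm` command.

```python
import shlex
from typing import List

# Assumption: leftover containers carry the "old-" prefix seen in the
# error message (e.g. "old-kube-apiserver", "old-kube-proxy").
OLD_PREFIX = "old-"


def leftover_containers(ps_names_output: str) -> List[str]:
    """Return the "old-" container names found in the output of
    `docker ps -a --format '{{.Names}}'`, sorted alphabetically."""
    names = [line.strip() for line in ps_names_output.splitlines()]
    return sorted(name for name in names if name.startswith(OLD_PREFIX))


def removal_command(names: List[str]) -> str:
    """Build one `docker rm` command for the given containers.
    Returns an empty string when there is nothing to remove."""
    if not names:
        return ""
    return "docker rm " + " ".join(shlex.quote(name) for name in names)
```

The resulting command still has to be run on each node (for example over SSH); the helper only decides what to remove.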

Results:

Cluster upgrades are not very stable and fail fairly often with the above specs. Typically the failure results in an error that looks like this:

FATA[0020] [controlPlane] Failed to bring up Control Plane: Can't rename Docker container [kube-apiserver] for host [172.28.2.5]: Error response from daemon: Error when allocating new name: Conflict. The name "/old-kube-apiserver" is already in use by container ff4bb53efc2ea63b46d993f3d93a94228076f5944c009342b87d7d98188d1f65. You have to remove (or rename) that container to be able to reuse that name.

These containers hang around even though they are stopped. Even if the cluster is destroyed, they still exist. This continues to cause problems once a failure has occurred.

kind/bug

All 10 comments

I wanted to add that after I removed these failed containers, I got 5 successful upgrades in a row, so I think they were the cause. We were also hitting our ingress controllers and services at about 1000 rps during the upgrades.

I can't reproduce the problem with v0.1.5. Which version was used for the initial deployment that is being upgraded here?

@moelsayed It was all 1.5 for everything. Stood up and tore down the cluster all in the same sitting.

@moelsayed Do you have a way to make a cluster upgrade fail? Maybe you can reproduce it by bailing out halfway through. I can try to find better steps to replicate it.

Our upgrade process is:
1 - stop and rename the old container
2 - create and start the new container
3 - remove the old container

If the upgrade process was interrupted during step 2, it might cause this.
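
The three steps above, and how an interruption can leave a stale container behind, can be sketched with an in-memory stand-in for Docker. Everything here is hypothetical and is not RKE's actual code: `FakeDocker`, `upgrade`, `Interrupted`, and the `clean_stale` guard are illustrative names, and the interruption is assumed to hit after the new container exists but before step 3 runs.

```python
class Interrupted(Exception):
    """Raised to simulate an upgrade being cancelled part-way through."""


class FakeDocker:
    """In-memory stand-in for a Docker daemon: container name -> state."""

    def __init__(self):
        self.containers = {}

    def run(self, name):
        if name in self.containers:
            raise RuntimeError(f'Conflict. The name "/{name}" is already in use')
        self.containers[name] = "running"

    def stop(self, name):
        self.containers[name] = "exited"

    def rename(self, old, new):
        # Like Docker, refuse to reuse a name that is already taken.
        if new in self.containers:
            raise RuntimeError(f'Conflict. The name "/{new}" is already in use')
        self.containers[new] = self.containers.pop(old)

    def remove(self, name):
        self.containers.pop(name, None)


def upgrade(docker, name, interrupt_after_create=False, clean_stale=False):
    old = "old-" + name
    if clean_stale:
        # Hypothetical guard: drop a leftover from an earlier failed attempt.
        docker.remove(old)
    # Step 1: stop and rename the old container.
    docker.stop(name)
    docker.rename(name, old)
    # Step 2: create and start the new container.
    docker.run(name)
    if interrupt_after_create:
        raise Interrupted("upgrade cancelled before the old container was removed")
    # Step 3: remove the old container.
    docker.remove(old)
```

In this model, an interrupted run leaves both `kube-apiserver` and `old-kube-apiserver` behind, so the next plain `upgrade` fails at the rename in step 1 with a name conflict similar to the one reported. Calling it with `clean_stale=True` removes the leftover first and the upgrade completes.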

@moelsayed Yeah, I think the concerning part is that if an upgrade does fail and you try to upgrade again afterwards, failures keep recurring, even if you destroy the cluster, because those containers hang around. You then have to SSH into every node and remove them manually.

To test this:

  • Create an rke cluster with version 1.8.10
  • Try to upgrade the cluster to 1.10.1 with debug enabled
  • Watch for the message "Successfully stopped old container", then cancel the upgrade.
  • Try to upgrade again. Before this fix, the upgrade would break. Using the fix, it will continue with no problem.

@moelsayed I will test this as well. I need to do those exact upgrades this week.

Performed above test scenario.

To test this:

Create an rke cluster with version 1.8.10
Try to upgrade the cluster to 1.10.1 with debug enabled
Watch for the message "Successfully stopped old container", then cancel the upgrade.
Try to upgrade again. Before this fix, the upgrade would break. Using the fix, it will continue with no problem.

INFO[0039] Finished building Kubernetes cluster successfully

@adingilloRancher Flavortown achievement activated.
