1. What kops version are you running? The command kops version will display
this information.
% kops version
Version 1.10.0-alpha.1 (git-7f70266f5)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.4", GitCommit:"5ca598b4ba5abb89bb773071ce452e33fb66339d", GitTreeState:"clean", BuildDate:"2018-06-06T08:13:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:04:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Edited kops Cluster resource to set CNI to amazonvpc:
networking:
  amazonvpc: {}
Then I used kops update to apply that resource template (YAML) and generate the Terraform output.
Next, I ran terraform plan and terraform apply, which proceeded normally. At that point I started a rolling update, but noticed immediately that something was awry: as soon as I had updated kops with the new CNI configuration, an aws-node DaemonSet was launched on my cluster alongside the existing canal DaemonSet. This broke networking for pods on the cluster.
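Roughly, the command sequence was as follows (flags approximate, $NAME standing in for the cluster name, cluster.yaml a placeholder filename):
kops replace -f cluster.yaml                           # apply the edited cluster spec
kops update cluster $NAME --target=terraform --out=.   # regenerate the Terraform config
terraform plan
terraform apply
kops rolling-update cluster $NAME --yes                # rolling update; the aws-node DaemonSet had already appeared before this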
I ended up fixing it by deleting the aws-node DaemonSet and then deleting all of the canal pods, allowing the canal DaemonSet to start fresh pods, which seem to work. I'm going to have to revert my changes to the Cluster resource and hope that it stays fixed.
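Concretely, the cleanup was along these lines (the kube-system namespace and the canal pod label are assumptions based on the standard kops add-ons; adjust to your cluster):
kubectl -n kube-system delete daemonset aws-node      # remove the aws-node DaemonSet the CNI change created
kubectl -n kube-system delete pods -l k8s-app=canal   # assumed label; delete the canal pods so the DaemonSet recreates them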
6. What did you expect to happen?
I expected that nothing would change at the pod level until the rolling update was performed.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
9. Anything else we need to know?
I'm chris.snell on Slack #kops-users if you have any questions.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle rotten
I was facing a similar issue today when I wanted to switch from weave to flannel-vxlan. I couldn't find anything about changing/migrating the CNI provider in the docs. So I just tried. And it failed. Maybe just a docs issue? A small disclaimer "do not change the cni provider in a running cluster" would have been enough.
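(For reference, the change in question is presumably just the networking block of the cluster spec, along these lines:
networking:
  flannel:
    backend: vxlan
which is exactly the kind of edit that triggers the duplicate-DaemonSet behaviour described above.)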
I had a similar experience. The new DaemonSet pods started while the old DaemonSet was left in place, and then the networking issues started. I had trouble rolling back to weave but eventually succeeded.
There is actually a section about switching CNI providers https://github.com/kubernetes/kops/blob/master/docs/networking.md#switching-between-networking-providers
However, it says "Switching from kubenet to a CNI network provider has not been tested at this time.", so by extension you can expect that switching from one CNI provider to another hasn't been tested either.
Hello everyone,
Not sure how relevant this is now, but I've recently changed overlay networks, from weave and cilium to calico. This is a disruptive operation, since masters and nodes have to be rolled. If you can afford 10-30 minutes of downtime, this is how I did it:
1. Change the networking section of the cluster spec from
   ...
   networking:
     weave: {}
   ...
   to
   ...
   networking:
     cni: {}
   ...
   Save the file and run kops update cluster $NAME --yes
2. Remove everything that pertains to weave (services, cluster roles, DaemonSets, ...). This can be accomplished by running kubectl delete -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
3. Roll your masters and nodes as fast as possible: kops rolling-update cluster --cloudonly --force --master-interval=1s --node-interval=1s --yes
4. Wait until the new masters pop up (they will show as NotReady) and then apply your desired overlay. In my case it was calico, as described at https://docs.projectcalico.org/v3.6/getting-started/kubernetes/installation/calico (the "50 nodes or less" / "more than 50 nodes" variant, not the etcd method).
5. Wait until your calico pods are ready; your masters and nodes will then show as Ready (see the verification commands after this list).
If you configured your cluster with networking=cni to begin with, skip step 1.
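To verify those last two steps, something like the following is enough (the calico-node label is an assumption; use whatever selector your chosen CNI's DaemonSet carries):
kubectl get nodes -w                                     # watch nodes go from NotReady back to Ready
kubectl -n kube-system get pods -l k8s-app=calico-node   # assumed label for the calico DaemonSet pods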
HTH
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
What @ssro suggested worked for me also. I switched from calico to weave.
The reason was that calico had previously been installed with etcdv2, and the upgrade to etcd-manager broke it (calico relied on that etcd and could no longer find the certificates it needed to boot).
Ditched it and replaced it with weave. Everything's working as expected 😉
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.