Kops: PodCIDR not set on the master after moving to a different EC2 instance

Created on 13 Jul 2018 · 11 comments · Source: kubernetes/kops

1. What kops version are you running?

Version 1.10.0-alpha.1

2. What Kubernetes version are you running?
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T22:29:25Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:05:37Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

```
$ kops upgrade cluster $NAME --yes
W0712 15:49:11.154329   67809 s3context.go:210] Unable to read bucket encryption policy: will encrypt using AES256
ITEM     PROPERTY            OLD     NEW
Cluster  KubernetesVersion   1.9.6   1.10.3

Updates applied to configuration.
You can now apply these changes, using `kops update cluster foo`

$ kops update cluster $NAME --yes
W0712 15:50:01.445591 67811 s3context.go:210] Unable to read bucket encryption policy: will encrypt using AES256
I0712 15:50:03.724755 67811 executor.go:103] Tasks: 0 done / 73 total; 31 can run
I0712 15:50:04.132758 67811 executor.go:103] Tasks: 31 done / 73 total; 24 can run
I0712 15:50:04.618563 67811 executor.go:103] Tasks: 55 done / 73 total; 16 can run
I0712 15:50:05.900642 67811 executor.go:103] Tasks: 71 done / 73 total; 2 can run
I0712 15:50:06.270333 67811 executor.go:103] Tasks: 73 done / 73 total; 0 can run
I0712 15:50:06.270454 67811 dns.go:153] Pre-creating DNS records
I0712 15:50:06.585097 67811 update_cluster.go:290] Exporting kubecfg for cluster
kops has set your kubectl context to foo

Cluster changes have been applied to the cloud.

Changes may require instances to restart: kops rolling-update cluster


$ kops rolling-update cluster $NAME --yes
W0712 15:50:25.1.2.3.4 67814 s3context.go:210] Unable to read bucket encryption policy: will encrypt using AES256
NAME               STATUS       NEEDUPDATE  READY  MIN  MAX  NODES
master-us-west-2a  NeedsUpdate  1           0      1    1    1
nodes              NeedsUpdate  2           0      2    2    2
I0712 15:50:27.331775 67814 instancegroups.go:157] Draining the node: "ip-.us-west-2.compute.internal".
node "ip-.us-west-2.compute.internal" cordoned
node "ip-.us-west-2.compute.internal" cordoned
node "ip-.us-west-2.compute.internal" drained
I0712 15:50:28.300408 67814 instancegroups.go:333] Waiting for 1m30s for pods to stabilize after draining.
I0712 15:51:58.306755 67814 instancegroups.go:273] Stopping instance "i-xxx", node "ip-.us-west-2.compute.internal", in group "master-us-west-2a.masters.k8s.example.com" (this may take a while).
I0712 15:56:58.743357 67814 instancegroups.go:188] Validating the cluster.
I0712 15:57:28.924945 67814 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.k8s.example.com/api/v1/nodes: dial tcp 1.2.3.4:443: i/o timeout.
I0712 15:58:29.080624 67814 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.k8s.example.com/api/v1/nodes: dial tcp 1.2.3.4:443: i/o timeout.
I0712 15:58:59.462089 67814 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.k8s.example.com/api/v1/nodes: dial tcp 1.2.3.4:443: i/o timeout.
I0712 15:59:29.603733 67814 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.k8s.example.com/api/v1/nodes: dial tcp 1.2.3.4:443: i/o timeout.
```

In the above, the IP `1.2.3.4` is the public IP of the old EC2 instance where the old master was running, which had just been terminated by kops.
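For anyone hitting the same timeout, a quick way to confirm that the API record is still pointing at the terminated instance is to compare what the record resolves to against the new master's public IP. A minimal sketch with `dig` and the AWS CLI, assuming the instance `Name` tag matches the group name shown in the rolling-update log above:

```bash
# What the API record currently resolves to
dig +short api.k8s.example.com

# Public IP of the currently running master instance
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=master-us-west-2a.masters.k8s.example.com" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicIpAddress" \
  --output text
```

If the two IPs differ, the Route53 record was not updated when the master was replaced.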

5. What happened after the commands executed?

I had to go to Route53 and update the A record in the zone for k8s.example.com to point it to the new public IP of the EC2 instance where the new master was running. Shortly after updating the A record, I could finally see:

I0712 15:59:30.758285 67814 instancegroups.go:249] Cluster validated.
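For reference, the manual fix can also be done with the AWS CLI instead of the Route53 console. A sketch, where the hosted zone ID (`Z123EXAMPLE`) and the new master's public IP are placeholders to fill in:

```bash
# Upsert the API A record so it points at the new master's public IP
cat > change-batch.json <<'EOF'
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.k8s.example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "<new-master-public-ip>" }]
      }
    }
  ]
}
EOF

aws route53 change-resource-record-sets \
  --hosted-zone-id Z123EXAMPLE \
  --change-batch file://change-batch.json
```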

6. What did you expect to happen?

Validation should have worked; kops should have updated the A record in Route53.

7. Please provide your cluster manifest.

```yaml
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-06-22T23:09:49Z
  name: k8s.example.com
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://example.com-k8s-state-store/k8s.example.com
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    name: main
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.10.3
  masterPublicName: api.k8s.example.com
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-west-2a
    type: Public
    zone: us-west-2a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-06-22T23:09:49Z
  labels:
    kops.k8s.io/cluster: k8s.example.com
  name: master-us-west-2a
spec:
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a
  role: Master
  subnets:
  - us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-06-22T23:09:49Z
  labels:
    kops.k8s.io/cluster: k8s.example.com
  name: nodes
spec:
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  machineType: t2.small
  maxSize: 2
  minSize: 2
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-west-2a
```

8. Anything else do we need to know?

The master still didn't come up successfully. I posted logs for kubelet, api-server, and controller-manager here: https://gist.github.com/tsuna/594fef65be39ecd7e0ffe05bf8113998

Of interest is `Unable to update cni config: No networks found in /etc/cni/net.d/` (the directory is indeed empty), which I think led to a bunch of `Jul 12 22:53:38 ip-172-x-y-z kubelet[1635]: E0712 22:53:38.225953 1635 kubelet.go:2130] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR` and other connection errors trying to get to the api-server.

All 11 comments

The fact that the master's API DNS entry wasn't updated in Route53 seems to be a dup of #5289, but there is still the other issue (which may or may not be related) of `Kubenet does not have netConfig`, which prevents the controller-manager from starting.

So I figured it out. There were two issues:

  1. The Route53 entry for the external IP of the apiserver wasn't updated. Workaround: go into Route53 and manually update the DNS record to point to the IP of the new EC2 instance.
  2. The PodCIDR wasn't set on the new master node.

Since the first problem is already covered by #5289, I'm making this issue only about problem 2.

`kubectl get nodes` was showing old masters, and running `kubectl get node ip-172-x-y-z.us-west-2.compute.internal --template={{.spec.podCIDR}}` was returning `<no value>` on all except the oldest one (initially provisioned by kops), which was the only one correctly returning `100.96.0.0/24`.
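A quicker way to run the same check across all nodes at once is a sketch like the following, using `kubectl`'s custom-columns output; nodes with no pod CIDR assigned show up without a value:

```bash
# List every node together with its assigned pod CIDR (if any)
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR
```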

To fix this I had to `kubectl edit node ip-172-x-y-z.us-west-2.compute.internal` and manually set `podCIDR: 100.96.0.0/24` in the spec. As soon as I did this, kubelet reacted to the change:

```
Jul 19 05:23:31 ip-<snip> kubelet[1441]: E0719 05:23:31.296647    1441 kubelet.go:2130] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Jul 19 05:23:34 ip-<snip> kubelet[1441]: I0719 05:23:1.2.3.4    1441 kuberuntime_manager.go:917] updating runtime config through cri with podcidr 100.96.0.0/24
Jul 19 05:23:34 ip-<snip> kubelet[1441]: I0719 05:23:1.2.3.4    1441 docker_service.go:340] docker cri received runtime config &RuntimeConfig{NetworkConfig:&NetworkConfig{PodCidr:100.96.0.0/24,},
}
Jul 19 05:23:34 ip-<snip> kubelet[1441]: I0719 05:23:1.2.3.4    1441 kubenet_linux.go:258] CNI network config set to {
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "cniVersion": "0.1.0",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "name": "kubenet",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "type": "bridge",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "bridge": "cbr0",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "mtu": 9001,
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "addIf": "eth0",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "isGateway": true,
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "ipMasq": false,
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "hairpinMode": false,
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "ipam": {
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "type": "host-local",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "subnet": "100.96.0.0/24",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "gateway": "100.96.0.1",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "routes": [
Jul 19 05:23:34 ip-<snip> kubelet[1441]: { "dst": "0.0.0.0/0" }
Jul 19 05:23:34 ip-<snip> kubelet[1441]: ]
Jul 19 05:23:34 ip-<snip> kubelet[1441]: }
Jul 19 05:23:34 ip-<snip> kubelet[1441]: }
Jul 19 05:23:34 ip-<snip> kubelet[1441]: I0719 05:23:1.2.3.4    1441 kubelet_network.go:196] Setting Pod CIDR:  -> 100.96.0.0/24
```
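For anyone who prefers not to go through `kubectl edit`, the same workaround can be applied non-interactively. This is only a sketch: `spec.podCIDR` can be set while it is empty but cannot be changed afterwards, and the value has to be the CIDR the node is supposed to own (here the `100.96.0.0/24` that the original master used):

```bash
# Set the missing pod CIDR on the affected master node (only works while it is unset)
kubectl patch node ip-172-x-y-z.us-west-2.compute.internal \
  -p '{"spec":{"podCIDR":"100.96.0.0/24"}}'
```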

H/T kubernetes/kubernetes#32900 for putting me on the right track.

Now the question is why did kops not set this properly on the new master node?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Had exactly the same issue. Did anyone find the root cause?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen
/remove-lifecycle rotten

@AntoninBeaufort: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
