Kops: Nodes not joining the cluster

Created on 19 Feb 2019 · 27 comments · Source: kubernetes/kops

1. What kops version are you running? The command kops version will display this information.

Version 1.11.0 (git-2c2042465)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:39:04Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:30:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops validate cluster
kubectl get nodes

5. What happened after the commands executed?

kops validate cluster:

VALIDATION ERRORS
KIND    NAME                    MESSAGE
Machine i-0596d5c309b146318     machine "i-0596d5c309b146318" has not yet joined cluster
Machine i-05a5f9ce77fbaa5a9     machine "i-05a5f9ce77fbaa5a9" has not yet joined cluster
Machine i-067761576c64cb2f0     machine "i-067761576c64cb2f0" has not yet joined cluster
Machine i-0bb86d0bde11aa2d4     machine "i-0bb86d0bde11aa2d4" has not yet joined cluster
Machine i-0e9105f26a8d84b8a     machine "i-0e9105f26a8d84b8a" has not yet joined cluster

Validation Failed

kubectl get nodes:

NAME                          STATUS   ROLES    AGE   VERSION
ip-10-0-34-209.ec2.internal   Ready    node     2h    v1.11.6
ip-10-0-41-138.ec2.internal   Ready    node     3h    v1.11.6
ip-10-0-51-206.ec2.internal   Ready    node     1h    v1.11.6
ip-10-0-53-172.ec2.internal   Ready    master   22h   v1.11.6

6. What did you expect to happen?
Cluster to be ready

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2019-02-08T19:54:37Z
  name: mycluster
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://mycluster/mycluster
  dnsZone: myzone
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    name: main
    version: 3.2.24
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    name: events
    version: 3.2.24
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.11.6
  masterInternalName: api.internal.mycluster
  masterPublicName: api.mycluster
  networkCIDR: 10.0.0.0/16
  networkID: vpc-0bfd0c279ff45bc01
  networking:
    calico:
      majorVersion: v3
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - some-ips
  subnets:
  - cidr: 10.0.32.0/19
    id: subnet-0a0f0a2288311e115
    name: us-east-1a
    type: Private
    zone: us-east-1a
  - cidr: 10.0.64.0/19
    id: subnet-0f6e7ddae04b994db
    name: us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 10.0.0.0/22
    id: subnet-0a90362b197c32064
    name: utility-us-east-1a
    type: Utility
    zone: us-east-1a
  topology:
    bastion:
      bastionPublicName: bastion.mycluster
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-02-08T19:54:38Z
  labels:
    kops.k8s.io/cluster: mycluster
  name: bastions
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: bastions
  role: Bastion
  subnets:
  - utility-us-east-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-02-08T19:54:38Z
  labels:
    kops.k8s.io/cluster: mycluster
  name: master-us-east-1a
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t2.medium
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  rootVolumeSize: 100
  subnets:
  - us-east-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-02-08T19:54:38Z
  labels:
    kops.k8s.io/cluster: mycluster
  name: nodes
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t2.medium
  maxPrice: "0.4"
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
    spot: "true"
  role: Node
  rootVolumeSize: 200
  subnets:
  - us-east-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-02-18T16:55:20Z
  labels:
    kops.k8s.io/cluster: mycluster
  name: nodes-1b
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t2.medium
  maxPrice: "0.4"
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-1b
    spot: "true"
  role: Node
  subnets:
  - us-east-1b

Logs from one of the failing master nodes:
Besides this, there are a lot of Kubernetes scheduling failures.

==> daemon.log <==
Feb 19 17:24:14 ip-10-0-52-88 kubelet[2669]: E0219 17:24:14.848900    2669 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-0-52-88.ec2.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Feb 19 17:24:14 ip-10-0-52-88 kubelet[2669]: E0219 17:24:14.849865    2669 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Feb 19 17:24:14 ip-10-0-52-88 kubelet[2669]: E0219 17:24:14.850981    2669 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?fieldSelector=metadata.name%3Dip-10-0-52-88.ec2.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Feb 19 17:24:15 ip-10-0-52-88 kubelet[2669]: W0219 17:24:15.115294    2669 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/
Feb 19 17:24:15 ip-10-0-52-88 kubelet[2669]: E0219 17:24:15.115431    2669 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
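
The 127.0.0.1:443 "connection refused" errors mean the kubelet on that master can't reach a local kube-apiserver, and the CNI warning is just a consequence of that. For reference, a minimal diagnostic sketch on the failing instance (assuming the standard kops Debian image layout and default log paths):

```sh
# Rough checks on the instance itself; adjust paths if your image differs.
sudo journalctl -u kubelet --since "1 hour ago" | tail -n 50          # recent kubelet errors
sudo docker ps --format '{{.Names}}' | grep -E 'apiserver|protokube'  # is the control plane actually running?
ls /etc/cni/net.d/                                                    # Calico should have written a CNI config here
sudo tail -n 50 /var/log/kube-apiserver.log                           # kops writes apiserver logs here on masters
```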

9. Anything else we need to know?
Also, I created another instance group to have a master in the 1b zone, then deleted the instance group, but the auto scaling group was still there in AWS and I had to delete it manually.
I terminated the EC2 instances so kops would give me new ones, but still, nothing. (In previous versions this, or a simple instance reboot, used to solve the issue.)


Most helpful comment

Has anyone experienced this recently? I've been seeing this issue for the past couple of days in AWS.

All 27 comments

I thought there was something going on with etcd and the different AWS zones, so with kops edit cluster I created another 2 etcd members (so now I have 3: a, b, c). I remember that in another cluster things failed when adding masters to an existing instance group, so I created 3 instance groups with 1 master node each (each being an etcd member). But still, only zone us-east-1a is working with its 3 nodes; master-b and master-c, along with the 3 nodes-b instances, don't join the cluster. I think it's something about manually adding instance groups/nodes to the cluster in other zones.

I tried deleting the instance groups, getting a dump of the cluster/instance groups, editing it for multi-AZ, and running kops replace -f cluster.yaml, but it says the us-east-1c zone doesn't exist, so it keeps thinking it already has a us-east-1b instance group, which it doesn't :/ I'm going to delete the cluster and recreate it since this is wasting time (and I will have to recreate all the apps again, sadly).

Any updates on this issue?

Make sure the nodes have access to the S3 bucket where the kops config is stored.

Not to parachute in, but one thing I hit while playing with kops is that if you're using encrypted S3 buckets, you'll need to explicitly ensure that your nodes also have access to the key used to encrypt the bucket (i.e. the AWS KMS key). I was using Terraform with kops for this and it didn't generate the full IAM policy to cover the key usage.
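
As a quick sanity check of the two comments above, something like this, run from a node (so it uses the node's instance profile), confirms the state store is readable; the bucket and cluster names below are placeholders:

```sh
# Placeholders: swap in your own state-store bucket and cluster name.
aws s3 ls s3://my-kops-state-store/mycluster/
# On an SSE-KMS encrypted bucket this GetObject also exercises kms:Decrypt for the bucket key:
aws s3 cp s3://my-kops-state-store/mycluster/config /tmp/kops-config && echo "state store readable"
```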

I don't encrypt the bucket, and some nodes could join after I increased the max number of nodes. Weird.
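
For anyone trying the same workaround, bumping the instance group size with kops looks roughly like this (the state store and names are examples):

```sh
export KOPS_STATE_STORE=s3://my-kops-state-store   # example state store
kops edit ig nodes --name mycluster                # raise maxSize (and minSize if desired)
kops update cluster mycluster --yes                # push the change to the AWS auto scaling group
kops validate cluster --name mycluster             # re-check whether the machines join
```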

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Has anyone experienced this recently? I've been seeing this issue for the past couple of days in AWS.

We have been seeing this for the last 24 hours as well.
It's happening across multiple clusters, across AWS regions, with both Weave and Cilium CNIs, on Kubernetes versions 1.11.x, 1.12.x, and 1.13.x.

Any help would be appreciated. We did find that restarting kubelet 2-3 times seems to fix things, but it's happening for nearly every new node that comes up.
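
For comparison, the restart workaround we're using is nothing more than this, run on the affected node:

```sh
sudo systemctl restart kubelet
# then watch whether it gets past the "node not found" / CNI errors this time
sudo journalctl -u kubelet -f
```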

I was using kops 1.12.x and Kubernetes 1.11.x. It looks like the new protokube:1.14 and kope/dns-controller:1.14.0 images don't work properly with a 1.11.x cluster. It's no longer an issue after upgrading the cluster and kops to 1.14.

I'm now experiencing this error with both v1.12.10 and v1.10.13 on AWS.

These clusters have been up a while; I'm only now seeing this issue.

Edit: Kops version 1.12.2, on EC2 instances. Private cluster within a VPC.

We had been seeing this issue since 10/03/2019 12:00 PM PST. Before we could find the root cause, it seems to have started improving today (10/07/2019) around 3 PM PST. There was no change on our side.

Anyone aware of any change / update that might have caused this?

Same problem with old kubernetes 1.10.11 (deployed with kube-aws for me)

/remove-lifecycle stale

/assign

Based on my testing with old versions I'm not able to replicate this, so we'll likely need more info. Just to centralize things, please respond with the kops version and Kubernetes version for each cluster.

I'm wondering if this could have to do with https://github.com/kubernetes/kops/pull/7596. For those experiencing this issue, I think we need more info to debug. kops versions are easy to spot in the nodeup config in user-data: in the AWS console, right-click on an instance, Instance Settings, View/Change User Data, then scroll down and look for something like

NODEUP_URL=https://kubeupv2.s3.amazonaws.com/kops/1.10.1/linux/amd64/nodeup
                                                    ^ kops version

Please check that version for both working and non-working instances and report back.
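
If clicking through the console is tedious, the same check can be scripted against the EC2 API (the instance ID below is a placeholder):

```sh
aws ec2 describe-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --attribute userData \
  --query 'UserData.Value' --output text | base64 --decode | grep NODEUP_URL
```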

When I was using 1.11.10 for a cluster and 1.12 for kops, the behavior I was seeing was that only one master of three would join the cluster. The rest of the nodes kept restarting the services. I compared the image versions and found that two images had changed.

These are the image versions on the master node that's not joining the cluster (1.11.10):

REPOSITORY                           TAG                 IMAGE ID            CREATED             SIZE
protokube                            1.14.0              7c28bbfad0a2        3 days ago          288 MB
kope/dns-controller                  1.14.0              6ee2d45938ee        3 days ago          123 MB
weaveworks/weave-kube                2.5.2               f04a043bb67a        4 months ago        148 MB
weaveworks/weave-npc                 2.5.2               5ce48e0d813c        4 months ago        49.6 MB
k8s.gcr.io/kube-proxy                v1.11.10            6d859c5ba087        5 months ago        97.4 MB
k8s.gcr.io/kube-controller-manager   v1.11.10            3eb89ae87afc        5 months ago        156 MB
k8s.gcr.io/kube-apiserver            v1.11.10            b2e25e147ed1        5 months ago        187 MB
k8s.gcr.io/kube-scheduler            v1.11.10            b91e0e16786e        5 months ago        56.9 MB
k8s.gcr.io/pause-amd64               3.0                 99e59f495ffa        3 years ago         747 kB
k8s.gcr.io/etcd                      2.2.1               ef5842ca5c42        3 years ago         28.2 MB

These are the image versions on the master node of a cluster (1.11.10) that's working:

protokube                            1.12.1              f8e918dad22a        4 months ago        295 MB
kope/dns-controller                  1.12.0              f2b96ddac37c        4 months ago        125 MB
k8s.gcr.io/kube-proxy                v1.11.10            6d859c5ba087        5 months ago        97.4 MB
k8s.gcr.io/kube-controller-manager   v1.11.10            3eb89ae87afc        5 months ago        156 MB
k8s.gcr.io/kube-apiserver            v1.11.10            b2e25e147ed1        5 months ago        187 MB
k8s.gcr.io/kube-scheduler            v1.11.10            b91e0e16786e        5 months ago        56.9 MB
weaveworks/weave-npc                 2.5.1               789b7f496034        8 months ago        49.6 MB
weaveworks/weave-kube                2.5.1               1f394ae9e226        8 months ago        148 MB
k8s.gcr.io/pause-amd64               3.0                 99e59f495ffa        3 years ago         747 kB
k8s.gcr.io/etcd                      2.2.1               ef5842ca5c42        3 years ago         28.2 MB
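
(For anyone running the same comparison: both lists above come from docker images on each master, assuming the masters use the Docker runtime, which is the default for these kops versions. Filtering makes the interesting rows obvious:)

```sh
docker images --format 'table {{.Repository}}\t{{.Tag}}\t{{.CreatedSince}}' \
  | grep -E 'protokube|dns-controller|weave|kube-'
```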

Ah, that's helpful @weihh2017.

Based on what you said, your 1.11.10 cluster is running kops 1.14.0 (look at protokube and dns-controller). If I were you, I would download kops 1.12.1, as that's what your other masters are using, redo a kops update cluster --yes, then delete the instance that isn't registering and give it a go again.

I'm not sure why your cluster is having issues, but that is the clear difference between these two nodes. I would then upgrade version by version: kops 1.12 -> Kubernetes 1.12, then kops 1.13 and k8s 1.13, etc. Just make sure to read the release notes when doing this, as a lot has changed and it's possible you could hit a disruptive upgrade.
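
Concretely, that recovery path would look something like the sketch below; the download URL, state store, and instance group name are examples, so double-check them against your own setup and the kops release page:

```sh
# Pin the kops binary to match what the working masters were built with (example version from above).
curl -Lo kops https://github.com/kubernetes/kops/releases/download/1.12.1/kops-linux-amd64
chmod +x kops && sudo mv kops /usr/local/bin/kops

export KOPS_STATE_STORE=s3://my-kops-state-store
kops update cluster mycluster --yes                # re-render nodeup/protokube config at the pinned version
# then replace the master that never registered and let the ASG bring up a fresh one
kops rolling-update cluster mycluster --instance-group master-us-east-1a --yes
```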

I tried 4 different versions of kops and Kubernetes (even the latest stable release) and still have the same issue. I've been facing this on AWS for the past week. Do we have a solution for this yet?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Any update on this? From yesterday to today I'm seeing this error on 1.15.11.

I am currently facing this issue. I tried a t3.large instance and it's still the same.

Can you open a new issue on this?

Upgraded from 1.17 to 1.18.6 and the new masters are unable to join the cluster because of this issue.
Get https://127.0.0.1/apis/storage.k8s.io/v1/csinodes/ip-10-0-85-237.us-west-1.compute.internal: dial tcp 127.0.0.1:443: connect: connection refused
Aug 14 13:53:22 ip-10-0-85-237 kubelet[6167]: E0814 13:53:22.969616 6167 event.go:269] Unable to write event: 'Post https://127.0.0.1/api/v1/namespaces/default/events: dial tcp 127.0.0.1:443: connect: connection refused' (may retry after sleeping)
Aug 14 13:53:22 ip-10-0-85-237 kubelet[6167]: E0814 13:53:22.992813 6167 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Aug 14 13:53:23 ip-10-0-85-237 kubelet[6167]: E0814 13:53:23.032263 6167 kubelet.go:2268] node "ip-10-0-85-237.us-west-1.compute.internal" not found
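
That's the same class of symptom as the original report: the kubelet can't reach a local apiserver. A rough first-pass check on a kops 1.18 master (assuming the default Docker runtime and standard kops log paths):

```sh
ls /etc/kubernetes/manifests/                        # static pod manifests written by nodeup
sudo docker ps -a | grep -E 'kube-apiserver|etcd'    # are the control-plane containers starting at all?
sudo tail -n 50 /var/log/etcd.log /var/log/kube-apiserver.log
```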
