1. What kops version are you running? The command kops version will display this information.
Version 1.11.0 (git-2c2042465)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:39:04Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:30:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops validate cluster
kubectl get nodes
5. What happened after the commands executed?
kops validate cluster:
VALIDATION ERRORS
KIND NAME MESSAGE
Machine i-0596d5c309b146318 machine "i-0596d5c309b146318" has not yet joined cluster
Machine i-05a5f9ce77fbaa5a9 machine "i-05a5f9ce77fbaa5a9" has not yet joined cluster
Machine i-067761576c64cb2f0 machine "i-067761576c64cb2f0" has not yet joined cluster
Machine i-0bb86d0bde11aa2d4 machine "i-0bb86d0bde11aa2d4" has not yet joined cluster
Machine i-0e9105f26a8d84b8a machine "i-0e9105f26a8d84b8a" has not yet joined cluster
Validation Failed
kubectl get nodes:
NAME STATUS ROLES AGE VERSION
ip-10-0-34-209.ec2.internal Ready node 2h v1.11.6
ip-10-0-41-138.ec2.internal Ready node 3h v1.11.6
ip-10-0-51-206.ec2.internal Ready node 1h v1.11.6
ip-10-0-53-172.ec2.internal Ready master 22h v1.11.6
6. What did you expect to happen?
Cluster to be ready
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2019-02-08T19:54:37Z
  name: mycluster
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://mycluster/mycluster
  dnsZone: myzone
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    name: main
    version: 3.2.24
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    name: events
    version: 3.2.24
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.11.6
  masterInternalName: api.internal.mycluster
  masterPublicName: api.mycluster
  networkCIDR: 10.0.0.0/16
  networkID: vpc-0bfd0c279ff45bc01
  networking:
    calico:
      majorVersion: v3
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - some-ips
  subnets:
  - cidr: 10.0.32.0/19
    id: subnet-0a0f0a2288311e115
    name: us-east-1a
    type: Private
    zone: us-east-1a
  - cidr: 10.0.64.0/19
    id: subnet-0f6e7ddae04b994db
    name: us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 10.0.0.0/22
    id: subnet-0a90362b197c32064
    name: utility-us-east-1a
    type: Utility
    zone: us-east-1a
  topology:
    bastion:
      bastionPublicName: bastion.mycluster
    dns:
      type: Public
    masters: private
    nodes: private
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-02-08T19:54:38Z
  labels:
    kops.k8s.io/cluster: mycluster
  name: bastions
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: bastions
  role: Bastion
  subnets:
  - utility-us-east-1a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-02-08T19:54:38Z
  labels:
    kops.k8s.io/cluster: mycluster
  name: master-us-east-1a
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t2.medium
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  rootVolumeSize: 100
  subnets:
  - us-east-1a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-02-08T19:54:38Z
  labels:
    kops.k8s.io/cluster: mycluster
  name: nodes
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t2.medium
  maxPrice: "0.4"
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
    spot: "true"
  role: Node
  rootVolumeSize: 200
  subnets:
  - us-east-1a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-02-18T16:55:20Z
  labels:
    kops.k8s.io/cluster: mycluster
  name: nodes-1b
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t2.medium
  maxPrice: "0.4"
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-1b
    spot: "true"
  role: Node
  subnets:
  - us-east-1b
Logs from one of the failing master nodes:
Besides this, there are a lot of Kubernetes scheduling failures.
==> daemon.log <==
Feb 19 17:24:14 ip-10-0-52-88 kubelet[2669]: E0219 17:24:14.848900 2669 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-0-52-88.ec2.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Feb 19 17:24:14 ip-10-0-52-88 kubelet[2669]: E0219 17:24:14.849865 2669 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Feb 19 17:24:14 ip-10-0-52-88 kubelet[2669]: E0219 17:24:14.850981 2669 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?fieldSelector=metadata.name%3Dip-10-0-52-88.ec2.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Feb 19 17:24:15 ip-10-0-52-88 kubelet[2669]: W0219 17:24:15.115294 2669 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/
Feb 19 17:24:15 ip-10-0-52-88 kubelet[2669]: E0219 17:24:15.115431 2669 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
9. Anything else we need to know?
Also, I created another instance group to have a master in the 1b zone, then deleted the ig, but the Auto Scaling group was still there in AWS and I had to delete it manually.
I terminated the EC2 instances so kops would bring up new ones, but still nothing. (With previous versions, this, or a simple instance reboot, used to solve the issue.)
I thought there was something going on with etcd and the different AWS zones, so with kops edit cluster I created another 2 etcd members (so now I have 3: a, b, c). I remember that in another cluster things failed when adding masters to an existing ig, so I created 3 igs with 1 master node each (each being an etcd member). But still, only zone us-east-1a is working with its 3 nodes; master-b and master-c, plus the 3 nodes-1b instances, don't join the cluster. I think it has something to do with manually adding igs/nodes to the cluster in other zones.
I'm trying to delete the igs, dump the cluster/ig specs, edit them for multi-AZ, and run kops replace -f cluster.yaml, but it says the us-east-1c zone doesn't exist, so it keeps thinking it already has a us-east-1b ig, which it doesn't :/ I'm going to delete the cluster and recreate it since this is wasting time (and I will sadly have to recreate all the apps again).
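For reference, a minimal shell sketch of the dump/edit/replace workflow described above, plus removing a leftover Auto Scaling group by hand. The cluster and ig names come from the manifest above; the ASG name is an assumption based on kops' usual <ig-name>.<cluster-name> naming and should be confirmed in the AWS console before deleting anything.
# Dump the current cluster and instance group specs from the state store
kops get cluster mycluster -o yaml > cluster.yaml
kops get ig --name mycluster -o yaml > instancegroups.yaml
# After editing the YAML for multi-AZ, push it back and apply
kops replace -f cluster.yaml
kops replace -f instancegroups.yaml
kops update cluster mycluster --yes
# If deleting an ig left an orphaned Auto Scaling group behind, remove it by hand
# (ASG name below is a guess; confirm with: aws autoscaling describe-auto-scaling-groups)
aws autoscaling delete-auto-scaling-group \
  --auto-scaling-group-name nodes-1b.mycluster \
  --force-delete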
Any updates on this issue?
Make sure the nodes have access to the S3 bucket where the kops config is stored.
Not to parachute in, but one thing that I hit going through and playing with kops is that if you're using encrypted S3 buckets, you'll need to explicitly ensure that your nodes also have access to the key used to encrypt the S3 bucket (i.e. the AWS KMS key). I was using terraform with kops for this and it didn't generate the full IAM policy to encompass the key usage.
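Not from the original comments, but a minimal sketch of how to check both points from a node that won't join, assuming the state-store path from configBase above (s3://mycluster/mycluster) and kops' usual layout where the cluster spec lives at <configBase>/config. If the bucket is SSE-KMS encrypted and the instance role lacks kms:Decrypt, the copy fails with an access error even when s3:GetObject is granted.
# Run on a node that has not joined (reach it via the bastion).
# Listing exercises s3:ListBucket / s3:GetObject; copying the cluster spec also
# exercises kms:Decrypt when the bucket uses SSE-KMS.
# (object path assumes kops' standard <configBase>/config layout)
aws s3 ls s3://mycluster/mycluster/
aws s3 cp s3://mycluster/mycluster/config /tmp/kops-config && echo "state store is readable"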
I don't encrypt the bucket, and some nodes could join after I increased the max number of nodes. Weird.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Does anyone experience this recently? I'm seeing this issue for the past couple days in aws.
We have been seeing this for the last 24 hours as well.
Seeing this happen across multiple clusters, across AWS regions, Weave & Cilium CNIs, Kubernetes versions 1.11.x, 1.12.x & 1.13.x.
Any help would be appreciated. We did find that restarting kubelet 2-3 times seems to fix things, but it's happening for nearly every new node that's coming up.
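For reference, a minimal sketch of that workaround on an affected node (it only papers over the symptom, not the root cause):
# On the node that will not register: restart kubelet a few times with a pause
# in between, then check from a machine with cluster credentials whether it joined.
for i in 1 2 3; do
  sudo systemctl restart kubelet
  sleep 60
done
kubectl get nodes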
I was using kops 1.12.x and Kubernetes 1.11.x. It looks like the new protokube:1.14 and kope/dns-controller:1.14.0 don't work properly with a 1.11.x cluster. It's no longer an issue with an upgraded 1.14 Kubernetes cluster and kops.
I'm now experiencing this error with both v1.12.10 and v1.10.13 on AWS.
These clusters have been up a while; I'm only now seeing this issue.
Edit: kops version 1.12.2, on EC2 instances. Private cluster within a VPC.
We had been seeing this issue since 10/03/2019 12:00 PM PST; before we could find the root cause, it seems to have started improving today (10/07/2019) around 3 PM PST. There was no change on our side.
Anyone aware of any change / update that might have caused this?
Same problem with old Kubernetes 1.10.11 (deployed with kube-aws in my case).
/remove-lifecycle stale
/assign
Based on my old-version testing I'm not able to replicate this, so we'll likely need more info. Just to centralize information, please respond with the kops version and Kubernetes version for each cluster.
I'm wondering if this could have to do with https://github.com/kubernetes/kops/pull/7596. For those experiencing this issue, I think we may need more info to debug. Kops versions are easy to spot in the nodeup config in user data: in the AWS console, right-click an instance, then Instance Settings > View/Change User Data. Scroll down and look for something like
NODEUP_URL=https://kubeupv2.s3.amazonaws.com/kops/1.10.1/linux/amd64/nodeup
^ kops version
Please check that version for both working and non-working instances and report back.
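A sketch of the same check from the CLI, assuming the AWS CLI is configured; the instance ID below is one of the non-joining machines from the validation output above (EC2 returns user data base64-encoded):
# Print the nodeup/kops version baked into an instance's user data
# (swap in the ID of whichever instance you want to inspect).
aws ec2 describe-instance-attribute \
  --instance-id i-0596d5c309b146318 \
  --attribute userData \
  --query 'UserData.Value' \
  --output text | base64 --decode | grep NODEUP_URL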
When I was using 1.11.10 for a cluster and 1.12 for kops, the behavior I was seeing was that only one master of three would join the cluster. The rest of the nodes kept restarting their services. I compared the image versions and found that two images had changed.
These are the image versions on a master node that's not joining the cluster (1.11.10):
REPOSITORY TAG IMAGE ID CREATED SIZE
protokube 1.14.0 7c28bbfad0a2 3 days ago 288 MB
kope/dns-controller 1.14.0 6ee2d45938ee 3 days ago 123 MB
weaveworks/weave-kube 2.5.2 f04a043bb67a 4 months ago 148 MB
weaveworks/weave-npc 2.5.2 5ce48e0d813c 4 months ago 49.6 MB
k8s.gcr.io/kube-proxy v1.11.10 6d859c5ba087 5 months ago 97.4 MB
k8s.gcr.io/kube-controller-manager v1.11.10 3eb89ae87afc 5 months ago 156 MB
k8s.gcr.io/kube-apiserver v1.11.10 b2e25e147ed1 5 months ago 187 MB
k8s.gcr.io/kube-scheduler v1.11.10 b91e0e16786e 5 months ago 56.9 MB
k8s.gcr.io/pause-amd64 3.0 99e59f495ffa 3 years ago 747 kB
k8s.gcr.io/etcd 2.2.1 ef5842ca5c42 3 years ago 28.2 MB
The images on a master node from a cluster (1.11.10) that's working:
protokube 1.12.1 f8e918dad22a 4 months ago 295 MB
kope/dns-controller 1.12.0 f2b96ddac37c 4 months ago 125 MB
k8s.gcr.io/kube-proxy v1.11.10 6d859c5ba087 5 months ago 97.4 MB
k8s.gcr.io/kube-controller-manager v1.11.10 3eb89ae87afc 5 months ago 156 MB
k8s.gcr.io/kube-apiserver v1.11.10 b2e25e147ed1 5 months ago 187 MB
k8s.gcr.io/kube-scheduler v1.11.10 b91e0e16786e 5 months ago 56.9 MB
weaveworks/weave-npc 2.5.1 789b7f496034 8 months ago 49.6 MB
weaveworks/weave-kube 2.5.1 1f394ae9e226 8 months ago 148 MB
k8s.gcr.io/pause-amd64 3.0 99e59f495ffa 3 years ago 747 kB
k8s.gcr.io/etcd 2.2.1 ef5842ca5c42 3 years ago 28.2 MB
Ahh, that's helpful @weihh2017.
Based on what you said, your 1.11.10 cluster is running kops 1.14.0 (look at protokube and dns-controller). If I were you, I would download kops 1.12.1, as that's what your other masters are using, redo a kops update cluster --yes, then delete the instance that isn't registering and give it a go again.
I'm not sure why your cluster is having issues, but that is the clear difference between these two nodes. I would then upgrade version by version: kops 1.12 -> Kubernetes 1.12, then kops 1.13 and k8s 1.13, etc. Just make sure to look at the release notes when doing this, as a lot has changed and it's possible you could hit a disruptive upgrade.
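A minimal sketch of that version-by-version upgrade using the standard kops commands, assuming the cluster name from the manifest above; repeat the cycle for each minor version after reading its release notes:
# With the matching kops binary (here 1.12.x) on your PATH:
kops upgrade cluster mycluster --yes          # bumps kubernetesVersion in the cluster spec
kops update cluster mycluster --yes           # updates launch configurations and other cloud resources
kops rolling-update cluster mycluster --yes   # replaces instances so they pick up the new version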
Tried 4 different versions of kops and K8s (even the latest stable release) and still have the same issue. I have been facing this on AWS for the past week. Do we have a solution for this yet?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Any update on this? From yesterday to today I see this error on 1.15.11.
I am facing this issue currently. I tried a t3.large instance and it's still the same.
Can you open a new issue on this?
Upgraded from 1.17 to 1.18.6 and the new masters are unable to join the cluster because of this issue.
Get https://127.0.0.1/apis/storage.k8s.io/v1/csinodes/ip-10-0-85-237.us-west-1.compute.internal: dial tcp 127.0.0.1:443: connect: connection refused
Aug 14 13:53:22 ip-10-0-85-237 kubelet[6167]: E0814 13:53:22.969616 6167 event.go:269] Unable to write event: 'Post https://127.0.0.1/api/v1/namespaces/default/events: dial tcp 127.0.0.1:443: connect: connection refused' (may retry after sleeping)
Aug 14 13:53:22 ip-10-0-85-237 kubelet[6167]: E0814 13:53:22.992813 6167 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Aug 14 13:53:23 ip-10-0-85-237 kubelet[6167]: E0814 13:53:23.032263 6167 kubelet.go:2268] node "ip-10-0-85-237.us-west-1.compute.internal" not found