Kops: Cluster validation didn't pass after upgrading to kops version 1.11.0

Created on 3 Jan 2019 · 17 comments · Source: kubernetes/kops

1. What kops version are you running?
Version 1.11.0

2. What Kubernetes version are you running?
v1.10.11

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops upgrade cluster $NAME
ITEM     PROPERTY           OLD       NEW
Cluster  KubernetesVersion  1.10.11   1.11.6

kops upgrade cluster $NAME --yes
kops rolling-update cluster --yes

5. What happened after the commands executed?
Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "*" has not yet joined cluster
master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duration of \"5m0s\""

6. What did you expect to happen?
Cluster validation should pass after upgrading both kops and the Kubernetes version.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 
  name: 
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://
  etcdClusters:
  - etcdMembers:
    - instanceGroup: 
      name: a-1
    - instanceGroup: 
      name: b-1
    - instanceGroup: 
      name: a-2
    name: main
  - etcdMembers:
    - instanceGroup: 
      name: a-1
    - instanceGroup: 
      name: b-1
    - instanceGroup: 
      name: a-2
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - 
  kubernetesVersion: 1.11.6
  masterInternalName: 
  masterPublicName: 
  networkCIDR: 
  networking:
    calico: {}
  nonMasqueradeCIDR: 
  sshAccess:
  - 
  subnets:
  - cidr: 
    name: 
    type: Private
    zone: 
  - cidr: 
    name: 
    type: Private
    zone: 
  - cidr: 
    name: 
    type: Utility
    zone: 
  - cidr: 
    name: 
    type: Utility
    zone: 
  topology:
    dns:
      type: Public
    masters: private
    nodes: private




All 17 comments

I'm not able to reproduce this. I've tried with --topology=private, --networking=calico, both HA and non-HA. Is there anything additional that I can try to reproduce this?

Does the cluster recover despite the validation failure? In other words, is it just that 5 minutes is too short a time? It seems unlikely, but maybe if you have a pod that is slow to terminate or restart.
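A reproduction attempt along the lines described above might look roughly like the following; the cluster name, zones, and state store are placeholders, not values taken from this issue:

export KOPS_STATE_STORE=s3://example-kops-state
kops create cluster repro.example.com \
  --master-zones eu-west-1a,eu-west-1b,eu-west-1c \
  --zones eu-west-1a,eu-west-1b,eu-west-1c \
  --topology private \
  --networking calico \
  --kubernetes-version 1.10.11 \
  --yes
kops upgrade cluster repro.example.com --yes
kops rolling-update cluster --yes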

@justinsb Does kops 1.11.0 support etcd v2?

@tsahoo yes, and etcd3. The upgrade from etcd2 -> etcd3 relies on etcd-manager, and the plan is to finish up the final edge cases for that upgrade in kops 1.12.
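For anyone who wants to experiment with that path ahead of time, the opt-in is expected to be a provider field on each etcd cluster in the spec. The snippet below is only a sketch based on later kops releases (the instance group name is a placeholder), not something the kops 1.11 setup in this issue requires:

  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a-1
    name: main
    provider: Manager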

~~I've also realized that we really should print the validation failure on a kops rolling-update validation failure. (I take it we don't, which just isn't helpful.)~~

~~@tsahoo I don't suppose you ran kops validate cluster and were able to see the problem?~~

Edit: actually, it looks like we know what happened - the new machine did not join the cluster.

@justinsb Yes. When we upgrade the cluster, the new master node does not join the cluster with Kubernetes version 1.11.6, and after that cluster validation does not pass. kops 1.11.0 runs fine with Kubernetes versions below 1.11.x.

Thanks @tsahoo - are you able to SSH to the instance which didn't join (it should be the one that started most recently) and look at the logs to figure out what went wrong? The error should be in journalctl -u kops-configuration or journalctl -u kubelet - or maybe in journalctl -u protokube. Hopefully one of those gives us a hint as to why the node isn't rejoining the cluster.

You could also try kops validate cluster again to see if the master was just very slow to join.
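A sketch of that debugging session, assuming SSH access to the new master (the address is a placeholder, the admin user is typical of the Debian-based kops images, and with a private topology you would hop via your bastion):

kops validate cluster
ssh admin@<new-master-ip>
sudo journalctl -u kops-configuration --no-pager | tail -n 100
sudo journalctl -u kubelet --no-pager | tail -n 100
sudo journalctl -u protokube --no-pager | tail -n 100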

I'm also experiencing this problem upgrading from 1.10.11 to 1.11.6, with a very similar cluster config to the OP's, though using weave as the CNI.

I am seeing this log multiple times in kubelet logs:

Jan 08 16:05:23 ip-172-21-37-167 kubelet[2512]: W0108 16:05:23.846996    2512 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/

kube-apiserver seems to be in CrashLoopBackOff.
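Since kubectl may not respond while the API server is crash-looping, the next step is usually to look on the master itself. A sketch, assuming the usual kops layout where the static-pod containers run under Docker and the API server output is redirected to /var/log on the host (the container id is a placeholder):

sudo docker ps -a | grep kube-apiserver
sudo docker logs <container-id> 2>&1 | tail -n 50
sudo tail -n 50 /var/log/kube-apiserver.log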

I had the same problem, but it turned out to be because of the enable-custom-metrics flag, which is deprecated in 1.11.
Please make sure you do not have that set. Instead, please use the following spec:
spec:
  kubeControllerManager:
    horizontalPodAutoscalerUseRestClients: true
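If that matches your situation, the change can be rolled out with the usual kops workflow; a sketch, using $NAME as in the original report:

kops edit cluster $NAME        # remove kubelet.enableCustomMetrics, add the kubeControllerManager block above
kops update cluster $NAME --yes
kops rolling-update cluster --yes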

@justinsb It would be a good idea to put this in the required actions.

I think I've figured out the cause of my problems so I'll post a new issue as I don't want to hijack this one.

For folks having problems with 1.11, if you are using OIDC for cluster authentication, see this comment
https://github.com/kubernetes/kops/issues/6046#issuecomment-441579967

authorization-rbac-super-user was removed in 1.11, so you'll need to remove it from your cluster spec if you were using it.
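If your spec still carries it, it is typically the authorizationRbacSuperUser field under kubeAPIServer; a sketch of what to delete before moving to Kubernetes 1.11 (the value shown is a placeholder):

spec:
  kubeAPIServer:
    authorizationRbacSuperUser: admin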

I have the same issue in AWS with kops 1.11, trying to upgrade from 1.10.6, then 1.10.12, to 1.11.6. Every time I get something like this:
VALIDATION ERRORS
KIND     NAME                  MESSAGE
Machine  i-088ee22081adaa2b1   machine "i-088ee22081adaa2b1" has not yet joined cluster
Machine  i-0cd125be94a9e05fd   machine "i-0cd125be94a9e05fd" has not yet joined cluster

None of the advice about horizontalPodAutoscalerUseRestClients and rbac works for me.

Here is my upgrade procedure, which worked going from 1.9 -> 1.11.

Procedure

Pre-upgrade

The kubelet configuration (in my case) needed to be changed from:

  kubelet:
    enableCustomMetrics: true

to

  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook

The --enable-custom-metrics flag is no longer supported in v1.13 and will cause the kubelet to fail on startup. The new settings are there to secure the kubelet process. In kops v1.13 anonymous authentication defaults to being switched off, which in turn means we must enable webhook authentication so that processes like tiller (helm) and metrics-server can log in using bearer tokens.

Post-upgrade

Necessary kubelet-api fix:

kubectl create clusterrolebinding kubelet-api-admin --clusterrole=system:kubelet-api-admin --user=kubelet-api

Introduced in v1.10: you need to authorize kubelet-api to access the kubelet API.
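A quick way to sanity-check the binding afterwards (plain kubectl, nothing kops-specific; the metrics-server deployment name is a placeholder for whatever your install uses):

kubectl get clusterrolebinding kubelet-api-admin -o yaml
kubectl top nodes
kubectl -n kube-system logs deploy/metrics-server | tail -n 20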

Looks like it's fixed in the next version of kops.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

