Current inventory:
[kube-master]
master1
master2
master3
[etcd]
master1
master2
master3
[kube-node]
node1
node2
node3
After adding the new masters (master4, master5):
[kube-master]
master1
master2
master3
master4
master5
[etcd]
master1
master2
master3
master4
master5
[kube-node]
node1
node2
node3
Now I have 3 masters/etcd and 45 nodes. I've already referenced #1122 but couldn't fix it. Extending etcd succeeded, but extending the masters failed. kubectl shows this error:
Unable to connect to the server: x509: certificate is valid for "new master ip"
And my extend command is:
ansible-playbook -i inventory/mycluster/host.ini cluster.yml -l master1,master2,master3,master4,master5
My Kubernetes cluster version is 1.9.3. How can I fix this?
The feature of scaling master nodes seems imperfect, but it is possible to scale the etcd cluster separately. To do so, just add the etcd nodes under [etcd] and rerun cluster.yml.
I'm facing the same issue with adding new masters. I'm using Kubespray v2.10.x and the reason it fails is that Kubespray does not update the apiserver certificates to add the new master to the SAN list.
You can check your certificate with
openssl x509 -text -noout -in /etc/kubernetes/ssl/apiserver.crt
... and the new master IP and hostname should be listed in the Subject Alternative Name section.
X509v3 Subject Alternative Name:
DNS:infra00-lab, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:localhost, DNS:infra00-lab, DNS:lb-apiserver.kubernetes.local, IP Address:10.233.0.1, IP Address:172.31.134.110, IP Address:172.31.134.110, IP Address:10.233.0.1, IP Address:127.0.0.1, IP Address:172.31.134.110
The execution of cluster.yml adds the new master IP and hostname to /etc/kubernetes/kubeadm-config.yaml as expected. It seems, however, that Kubespray is not calling kubeadm to replace the certificate before trying to join the new master node. We fixed this by using kubeadm manually to recreate the certificate.
NOTE: This works for v2.10.x. I never tested it on older versions of Kubespray.
On your first master, recreate the apiserver certificate:
cd /etc/kubernetes/ssl
mv apiserver.crt apiserver.crt.old
mv apiserver.key apiserver.key.old
cd /etc/kubernetes
kubeadm init phase certs apiserver --config kubeadm-config.yaml
If you are doing this after you ended up with a broken master, be sure to run reset.yml with --limit=<broken_master_hostname> before continuing. If you take the precaution of recreating the certificate before adding the new master node, you won't need this.
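As a sketch, the reset run limited to the broken master might look like this (the inventory path and hostname are illustrative, not from the original post):

```shell
# Reset only the broken master; everything after -i and --limit= is an example,
# substitute your own inventory file and the broken master's hostname.
ansible-playbook -i inventory/mycluster/hosts.ini reset.yml --limit=broken-master-1
```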
Run cluster.yml to include the new master node. You should end up with a working cluster.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Is it possible to add a master, or replace a failed master with a new one?
You should be able to. In the past, we managed to replace all nodes in the cluster: master, etcd and workers. But... there are some missteps you need to be careful of along the way. After a lot of experiments and retries in our lab environment, we came up with a few guidelines.
Adding/replacing a master node
1) Recreate the apiserver certs manually to include the new master node in the cert SAN field.
For some reason, Kubespray will not update the apiserver certificate.
Edit /etc/kubernetes/kubeadm-config.yaml and include the new host in the certSANs list.
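As a sketch, the relevant part of kubeadm-config.yaml might look like this (the hostnames, IPs, and v1beta1 layout are illustrative; older kubeadm config versions use a top-level apiServerCertSANs key instead):

```yaml
# Illustrative fragment only; your file will contain many more fields.
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
apiServer:
  certSANs:
    - kubernetes
    - kubernetes.default
    - master1
    - master4           # new master hostname
    - 172.31.134.110
    - 172.31.134.111    # new master IP
```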
Use kubeadm to recreate the certs.
cd /etc/kubernetes/ssl
mv apiserver.crt apiserver.crt.old
mv apiserver.key apiserver.key.old
cd /etc/kubernetes
kubeadm init phase certs apiserver --config kubeadm-config.yaml
Check the certificate; the new host needs to be there:
openssl x509 -text -noout -in /etc/kubernetes/ssl/apiserver.crt
2) Run cluster.yml
Add the new host to the inventory and run cluster.yml.
3) Restart kube-system/nginx-proxy
On all hosts, restart the nginx-proxy pod. This pod is a local proxy for the apiserver. Kubespray will update its static config, but it needs to be restarted in order to reload it.
# run in every host
docker ps | grep k8s_nginx-proxy_nginx-proxy | awk '{print $1}' | xargs docker restart
4) Remove old master nodes
If you are replacing a node, remove the old one from the inventory and remove it from the cluster runtime.
kubectl drain --force --ignore-daemonsets --grace-period 300 --timeout 360s --delete-local-data NODE_NAME
kubectl delete node NODE_NAME
After that, the old node can be safely shut down. Also, make sure to restart nginx-proxy on all remaining nodes (step 3).
From any active master that remains in the cluster, re-upload kubeadm-config.yaml
kubeadm config upload from-file --config /etc/kubernetes/kubeadm-config.yaml
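You can check that the upload took effect by inspecting the kubeadm-config ConfigMap (this is the standard kubeadm location, also mentioned later in this thread):

```shell
# Dump the cluster-wide kubeadm configuration and confirm the new
# master appears in the certSANs list.
kubectl -n kube-system get cm kubeadm-config -o yaml
```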
Adding/replacing a worker node
This should be the easiest.
1) Add the new node to the inventory.
2) Run upgrade-cluster.yml
You can use --limit=node1 to avoid disturbing other nodes in the cluster.
3) Drain the node that will be removed
kubectl drain --force=true --grace-period=10 --ignore-daemonsets=true --timeout=0s --delete-local-data NODE_NAME
4) Run the remove-node.yml playbook
With the old node still in the inventory, run remove-node.yml. You need to pass -e node=NODE_NAME to limit the execution to the node being removed.
5) Remove the node from the inventory
That's it.
Adding/Replacing an etcd node
You need to make sure there is always an odd number of etcd nodes in the cluster. In that way, this is always a replace or scale-up operation: either add two new nodes or remove an old one.
1) Add the new node running cluster.yml
Update the inventory and run cluster.yml, passing --limit=etcd,kube-master -e ignore_assert_errors=yes.
Run upgrade-cluster.yml also passing --limit=etcd,kube-master -e ignore_assert_errors=yes. This is necessary to update all etcd configuration in the cluster.
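The two playbook runs above might look like this (the inventory path is an example, not from the original comment):

```shell
# Scale up etcd; the inventory path is illustrative.
ansible-playbook -i inventory/mycluster/hosts.ini cluster.yml \
  --limit=etcd,kube-master -e ignore_assert_errors=yes

# Propagate the new etcd configuration to the whole cluster.
ansible-playbook -i inventory/mycluster/hosts.ini upgrade-cluster.yml \
  --limit=etcd,kube-master -e ignore_assert_errors=yes
```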
At this point, you will have an even number of nodes. Everything should still be working, and you should only have problems if the cluster decides to elect a new etcd leader before you remove a node. Even so, running applications should continue to be available.
2) Remove an old etcd node
With the node still in the inventory, run remove-node.yml, passing -e node=NODE_NAME with the name of the node that should be removed.
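A full remove-node.yml invocation might look like this (the inventory path and node name are examples):

```shell
# Remove one node from the cluster; substitute your inventory file
# and the actual node name for the placeholders.
ansible-playbook -i inventory/mycluster/hosts.ini remove-node.yml -e node=NODE_NAME
```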
3) Make sure the remaining etcd members have their config updated
On each etcd host that remains in the cluster:
cat /etc/etcd.env | grep ETCD_INITIAL_CLUSTER
Only active etcd members should be in that list.
4) Remove old etcd members from the cluster runtime
Acquire a shell prompt in one of the etcd containers and use etcdctl to remove the old member.
# list all members
etcdctl member list
# remove old member
etcdctl member remove MEMBER_ID
# careful!!! if you remove a wrong member you will be in trouble
# note: these command lines are actually much bigger, since you need to pass all certificates to etcdctl.
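As noted above, the real command lines are longer because etcdctl needs the certificates. A sketch of a full invocation, assuming the etcd v3 API and Kubespray's usual certificate layout under /etc/ssl/etcd/ssl (your paths and endpoint may differ):

```shell
# Run on an etcd host (or inside the etcd container). All paths below are
# typical Kubespray defaults and should be verified in your environment.
export ETCDCTL_API=3
etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/member-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/member-$(hostname)-key.pem \
  member list

# Then remove the stale member by the ID shown in the list output.
# Careful: removing the wrong member will break the cluster.
# etcdctl ... member remove MEMBER_ID
```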
5) Make sure the apiserver config is correctly updated
On every master node, edit /etc/kubernetes/manifests/kube-apiserver.yaml. Make sure only active etcd nodes are still present in the apiserver command-line parameter --etcd-servers=....
6) Shut down the old instance
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Can https://github.com/kubernetes-sigs/kubespray/blob/48a182844c9c3438e36c78cbc4518c962e0a9ab2/docs/recover-control-plane.md be applied for adding new master/etcd nodes? @qvicksilver
@yujunz Not sure, haven't really tried that use case. Also I'm a bit unsure of the state of that playbook. Haven't had time to add it to CI. But please do try.
The procedure to add\remove masters belongs in the readme, not hidden away in a comment in this issue.
To be sure everybody sees this: it went in via PR #5570 and you can now find it here: https://kubespray.io/#/docs/nodes
docker ps | grep k8s_nginx-proxy_nginx-proxy | awk '{print $1}' | xargs docker restart
I think this line doesn't work anymore; there is no k8s_nginx-proxy_nginx-proxy pod.
Hello!
I have some issues with these commands:
quersys@node1:/etc/kubernetes$ sudo kubeadm init phase certs apiserver --config kubeadm-config.yaml
W0810 11:08:48.479307 31818 utils.go:26] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10]
W0810 11:08:48.479525 31818 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[certs] Using existing apiserver certificate and key on disk
What version of K8s are you using? It's been almost a year since I posted. Did something change in kubeadm since then?
I would start by searching for official instructions on how to renew and recreate certs.
quersys@node1:/etc/kubernetes$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
I tried to find information for about 6 hours :(
W0810 11:08:48.479307 31818 utils.go:26] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10]
W0810 11:08:48.479525 31818 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
Those look like warnings. Most people seem to ignore them. Are you sure no error messages appear as well? Does it hang and never return? If that's the case, I'd wait for a timeout to hopefully get some actual error messages.
Yeah, I also get a timeout error from my node when I run cluster.yml and try to add the node as a master.
Sounds like a connectivity problem, or something that leads to one. If you can provide further logs and relevant messages, that would be helpful.
Thanks!!!
I have my new node IP in apiserver.crt:
DNS:node1, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:localhost, DNS:node1, DNS:node3, DNS:lb-apiserver.kubernetes.local, DNS:node1.cluster.local, DNS:node3.cluster.local, IP Address:10.233.0.1, IP Address:172.26.1.225, IP Address:172.26.1.225, IP Address:10.233.0.1, IP Address:127.0.0.1, IP Address:172.26.1.225, IP Address:172.26.1.130
but when I run ansible-playbook -i inventory/quersyscluster/hosts.yml cluster.yml I get a connection timeout.
Please post relevant log messages for more context. At this level, "connection timeout" is a broad error message.
Hi,
I am interested in replacing the first master (and the others) in a Kubernetes cluster using the Kubespray scripts. Is it possible?
Story:
I built a k8s cluster using the Kubespray scripts on OpenStack with an old CentOS 7 image. Next I want to upgrade the OS, e.g. from 7.7 to 7.8. I have a newer OS image prepared on OpenStack, and I am able to deploy new masters and new workers with it. But there is a problem with the first master: I need to delete the whole VM and bring up a new one with the new OS. Did you have a similar problem?
I tried to force master2 to be the first one, but when I run the join task on a new master (e.g. master4), it looks like kubeadm still wants to connect to master1 (6.0.1.57):
kubeadm join --config kubeadm-controlplane.yaml --ignore-preflight-errors=all
W1028 12:37:17.050916 1666 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[WARNING FileExisting-ebtables]: ebtables not found in system path
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Get https://6.0.1.57:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s: dial tcp 6.0.1.57:6443: connect: no route to host
To see the stack trace of this error execute with --v=5 or higher
Present state, e.g.:
master1, centos7.7
master2, centos7.7
master3, centos7.7
worker1, centos7.7
worker2, centos7.7
worker3, centos7.7
Expected:
master2, centos7.8 - master2 becomes the first one
master3, centos7.8
master4, centos7.8
worker1, centos7.8
worker2, centos7.8
worker3, centos7.8
How did you manage to recreate the first master?
@juliohm1978, maybe you can help?
Thanks!