Running playbooks/redeploy-certificates.yml fails in the play "Restart nodes". The origin-node service then fails to come back up on the nodes because all of its API calls fail with x509: certificate signed by unknown authority.
I don't have any certificate-specific settings in my inventory, so I'm relying on the self-signed CA and certs generated by openshift-ansible.
ansible 2.6.3
config file = /home/os-admin/cluster/ansible.cfg
configured module search path = [u'/home/os-admin/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /bin/ansible
python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch
# oc get nodes
NAME               STATUS    ROLES            AGE       VERSION
n22.redacted.com   Ready     compute,master   1y        v1.10.0+b81c8f8
n23.redacted.com   Ready     compute,master   1y        v1.10.0+b81c8f8
n24.redacted.com   Ready     compute,master   1y        v1.10.0+b81c8f8
TASK [Wait for node to be ready] ***********************************************************************
FAILED - RETRYING: Wait for node to be ready (36 retries left).
...
FAILED - RETRYING: Wait for node to be ready (1 retries left).
fatal: [192.168.173.23 -> 192.168.173.22]: FAILED! => {"attempts": 36, "changed": false, "results": {"cmd": "/usr/bin/oc get node n23.redacted.com -o json -n default", "results": [{"apiVersion": "v1", "kind": "Node", "metadata": {"annotations": {"volumes.kubernetes.io/controller-managed-attach-detach": "true"}, "creationTimestamp": "2017-01-10T15:08:52Z", "labels": {"beta.kubernetes.io/arch": "amd64", "beta.kubernetes.io/os": "linux", "kubernetes.io/hostname": "n23.redacted.com", "node-role.kubernetes.io/compute": "true", "node-role.kubernetes.io/master": "true", "region": "infra"}, "name": "n23.redacted.com", "resourceVersion": "220141119", "selfLink": "/api/v1/nodes/n23.redacted.com", "uid": "b29b57b3-d746-11e6-a28f-0cc47af70e64"}, "spec": {"externalID": "n23.redacted.com"}, "status": {"addresses": [{"address": "192.168.173.23", "type": "InternalIP"}, {"address": "n23.redacted.com", "type": "Hostname"}], "allocatable": {"alpha.kubernetes.io/nvidia-gpu": "0", "cpu": "40", "hugepages-1Gi": "0", "hugepages-2Mi": "0", "memory": "131816644Ki", "pods": "250"}, "capacity": {"alpha.kubernetes.io/nvidia-gpu": "0", "cpu": "40", "hugepages-1Gi": "0", "hugepages-2Mi": "0", "memory": "131919044Ki", "pods": "250"}, "conditions": [{"lastHeartbeatTime": "2018-09-05T15:40:49Z", "lastTransitionTime": "2018-09-05T15:42:28Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "OutOfDisk"}, {"lastHeartbeatTime": "2018-09-05T15:40:49Z", "lastTransitionTime": "2018-09-05T15:42:28Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "MemoryPressure"}, {"lastHeartbeatTime": "2018-09-05T15:40:49Z", "lastTransitionTime": "2018-09-05T15:42:28Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "Ready"}, {"lastHeartbeatTime": "2018-09-05T15:40:49Z", "lastTransitionTime": "2018-09-05T15:42:28Z", "message": 
"Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "DiskPressure"}, {"lastHeartbeatTime": "2018-09-05T15:40:49Z", "lastTransitionTime": "2018-08-24T15:38:34Z", "message": "kubelet has sufficient PID available", "reason": "KubeletHasSufficientPID", "status": "False", "type": "PIDPressure"}], "daemonEndpoints": {"kubeletEndpoint": {"Port": 10250}}, "images": ["REDACTED"], "nodeInfo": {"architecture": "amd64", "bootID": "fc93f998-6475-4ec6-aab2-eee83ec78dd8", "containerRuntimeVersion": "docker://1.13.1", "kernelVersion": "3.10.0-862.11.6.el7.x86_64", "kubeProxyVersion": "v1.10.0+b81c8f8", "kubeletVersion": "v1.10.0+b81c8f8", "machineID": "74e3cadf71174ca1aff1c61f824e9b07", "operatingSystem": "linux", "osImage": "CentOS Linux 7 (Core)", "systemUUID": "00000000-0000-0000-0000-0CC47AF70E64"}}}], "returncode": 0}, "state": "list"}
# oc get nodes
NAME               STATUS     ROLES            AGE       VERSION
n22.redacted.com   NotReady   compute,master   1y        v1.10.0+b81c8f8
n23.redacted.com   NotReady   compute,master   1y        v1.10.0+b81c8f8
n24.redacted.com   NotReady   compute,master   1y        v1.10.0+b81c8f8
OutOfDisk Unknown Wed, 05 Sep 2018 11:41:06 -0400 Wed, 05 Sep 2018 11:42:27 -0400 NodeStatusUnknown Kubelet stopped posting node status.
MemoryPressure Unknown Wed, 05 Sep 2018 11:41:06 -0400 Wed, 05 Sep 2018 11:42:27 -0400 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 05 Sep 2018 11:41:06 -0400 Wed, 05 Sep 2018 11:42:27 -0400 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 05 Sep 2018 11:41:06 -0400 Wed, 05 Sep 2018 11:42:27 -0400 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure False Wed, 05 Sep 2018 11:41:06 -0400 Fri, 24 Aug 2018 11:41:17 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
The origin-node logs show that all API calls are failing:
Sep 05 12:31:53 n22.redacted.com origin-node[27743]: I0905 12:31:53.626700 27743 kubelet_node_status.go:82] Attempting to register node n22.redacted.com
Sep 05 12:31:53 n22.redacted.com origin-node[27743]: E0905 12:31:53.632462 27743 kubelet_node_status.go:106] Unable to register node "n22.redacted.com" with API server: Post https://n20.redacted.com:8443/api/v1/nodes: x509: certificate signed by unknown authority
Sep 05 12:31:54 n22.redacted.com origin-node[27743]: E0905 12:31:54.276000 27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://n20.redacted.com:8443/api/v1/nodes?fieldSelector=metadata.name%3Dn22.redacted.com&limit=500&resourceVersion=0: x509: certificate signed by unknown authority
Sep 05 12:31:54 n22.redacted.com origin-node[27743]: E0905 12:31:54.276513 27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://n20.redacted.com:8443/api/v1/services?limit=500&resourceVersion=0: x509: certificate signed by unknown authority
Sep 05 12:31:54 n22.redacted.com origin-node[27743]: E0905 12:31:54.277002 27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://n20.redacted.com:8443/api/v1/pods?fieldSelector=spec.nodeName%3Dn22.redacted.com&limit=500&resourceVersion=0: x509: certificate signed by unknown authority
Sep 05 12:31:55 n22.redacted.com origin-node[27743]: E0905 12:31:55.283133 27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://n20.redacted.com:8443/api/v1/nodes?fieldSelector=metadata.name%3Dn22.redacted.com&limit=500&resourceVersion=0: x509: certificate signed by unknown authority
Sep 05 12:31:55 n22.redacted.com origin-node[27743]: E0905 12:31:55.283667 27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://n20.redacted.com:8443/api/v1/services?limit=500&resourceVersion=0: x509: certificate signed by unknown authority
Sep 05 12:31:55 n22.redacted.com origin-node[27743]: E0905 12:31:55.284443 27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://n20.redacted.com:8443/api/v1/pods?fieldSelector=spec.nodeName%3Dn22.redacted.com&limit=500&resourceVersion=0: x509: certificate signed by unknown authority
playbooks/redeploy-certificates.yml should redeploy the certificates.
Nodes end up unable to talk to the API server because they don't trust the CA that signed its certificate.
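The x509 error above means the certificate presented by the API server does not chain to a CA the node trusts. A minimal sketch of how that could be checked by hand, assuming openssl is available on the node. The function names `cert_trusted_by` and `fetch_server_cert` are hypothetical helpers, not part of openshift-ansible, and the CA file to test against is whichever one the node is actually configured with.

```shell
#!/usr/bin/env bash

# cert_trusted_by CA_FILE CERT_FILE
# Succeeds (exit 0) only if openssl reports that CERT_FILE verifies
# against CA_FILE. The ": OK" output is checked instead of the exit
# code so the result does not depend on the openssl version's exit status.
cert_trusted_by() {
    local ca_file=$1 cert_file=$2
    openssl verify -CAfile "$ca_file" "$cert_file" 2>/dev/null | grep -q ': OK$'
}

# fetch_server_cert HOST:PORT
# Prints, in PEM form, the leaf certificate a TLS endpoint presents.
fetch_server_cert() {
    openssl s_client -connect "$1" </dev/null 2>/dev/null | openssl x509
}
```

For example, `fetch_server_cert n20.redacted.com:8443 > /tmp/api.crt` followed by `cert_trusted_by /path/to/node-ca.crt /tmp/api.crt || echo "not trusted"` would show whether the node's CA (path is a placeholder) still matches the API server certificate after the redeploy.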
# inventory
[OSEv3:children]
nodes
nfs
masters
lb
etcd
[OSEv3:vars]
openshift_master_default_subdomain=dev.redacted.com
ansible_ssh_user=root
openshift_override_hostname_check=true
openshift_master_cluster_method=native
openshift_master_cluster_hostname=n20.redacted.com
openshift_master_cluster_public_hostname=dev.redacted.com
deployment_type=origin
openshift_release="3.10"
openshift_repos_enable_testing=true
openshift_service_catalog_remove=true
openshift_enable_service_catalog=true
ansible_service_broker_remove=true
ansible_service_broker_install=false
template_service_broker_remove=true
template_service_broker_install=false
osm_default_node_selector='node-role.kubernetes.io/compute=true'
[nodes]
192.168.173.22 openshift_public_ip=192.168.173.22 openshift_ip=192.168.173.22 openshift_public_hostname=n22.redacted.com openshift_hostname=n22.redacted.com connect_to=192.168.173.22 openshift_schedulable=True openshift_node_group_name='node-config-compute'
192.168.173.23 openshift_public_ip=192.168.173.23 openshift_ip=192.168.173.23 openshift_public_hostname=n23.redacted.com openshift_hostname=n23.redacted.com connect_to=192.168.173.23 openshift_schedulable=True openshift_node_group_name='node-config-compute'
192.168.173.24 openshift_public_ip=192.168.173.24 openshift_ip=192.168.173.24 openshift_public_hostname=n24.redacted.com openshift_hostname=n24.redacted.com connect_to=192.168.173.24 openshift_schedulable=True openshift_node_group_name='node-config-compute'
192.168.173.25 openshift_public_ip=192.168.173.25 openshift_ip=192.168.173.25 openshift_public_hostname=n25.redacted.com openshift_hostname=n25.redacted.com connect_to=192.168.173.25 openshift_schedulable=True openshift_node_group_name='node-config-compute'
192.168.173.26 openshift_public_ip=192.168.173.26 openshift_ip=192.168.173.26 openshift_public_hostname=n26.redacted.com openshift_hostname=n26.redacted.com connect_to=192.168.173.26 openshift_schedulable=True openshift_node_group_name='node-config-compute'
[nfs]
192.168.173.27 openshift_public_ip=192.168.173.27 openshift_ip=192.168.173.27 openshift_public_hostname=n27.redacted.com openshift_hostname=n27.redacted.com connect_to=192.168.173.27
#192.168.173.28 openshift_public_ip=192.168.173.28 openshift_ip=192.168.173.28 openshift_public_hostname=n28.redacted.com openshift_hostname=n28.redacted.com connect_to=192.168.173.28
[masters]
192.168.173.22 openshift_public_ip=192.168.173.22 openshift_ip=192.168.173.22 openshift_public_hostname=n22.redacted.com openshift_hostname=n22.redacted.com connect_to=192.168.173.22 openshift_node_group_name='node-config-master'
192.168.173.23 openshift_public_ip=192.168.173.23 openshift_ip=192.168.173.23 openshift_public_hostname=n23.redacted.com openshift_hostname=n23.redacted.com connect_to=192.168.173.23 openshift_node_group_name='node-config-master'
192.168.173.24 openshift_public_ip=192.168.173.24 openshift_ip=192.168.173.24 openshift_public_hostname=n24.redacted.com openshift_hostname=n24.redacted.com connect_to=192.168.173.24 openshift_node_group_name='node-config-master'
[lb]
192.168.173.20 openshift_public_ip=8.43.84.242 openshift_ip=192.168.173.20 openshift_public_hostname=dev.redacted.com openshift_hostname=n20.redacted.com connect_to=192.168.173.20
[etcd]
192.168.173.22 openshift_public_ip=192.168.173.22 openshift_ip=192.168.173.22 openshift_public_hostname=n22.redacted.com openshift_hostname=n22.redacted.com connect_to=192.168.173.22
192.168.173.23 openshift_public_ip=192.168.173.23 openshift_ip=192.168.173.23 openshift_public_hostname=n23.redacted.com openshift_hostname=n23.redacted.com connect_to=192.168.173.23
192.168.173.24 openshift_public_ip=192.168.173.24 openshift_ip=192.168.173.24 openshift_public_hostname=n24.redacted.com openshift_hostname=n24.redacted.com connect_to=192.168.173.24
I've also tried redeploy-openshift-ca.yml to generate a fresh CA and redeploy the other certs from it, but I end up with the same errors.
At this point I'd be OK with starting over with all new certs but I don't seem to be able to achieve that with the current redeploy-certificates playbooks.
@sdodson we have definite problems with 3.10+ cert plays.
@jfchevrette thanks for the report. This is probably completely broken at the moment.
I can now report that running deploy_cluster.yml brings the cluster back to a good state. However, my web console remains inaccessible and returns an HTTP 502.
The webconsole issue was solved by reinstalling it:
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-web-console/config.yml -e openshift_web_console_install=false
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-web-console/config.yml -e openshift_web_console_install=true
Was this issue fixed by just deleting them or something? All the cert playbooks referenced in the 3.10 documentation do not exist anymore. So there's no way to renew certificates.
I ended up reinstalling the cluster completely and have not attempted to reset the certificates since.
I'm experiencing this problem with 3.10. During redeploy-certificates, a playbook called revert-client-ca is executed, which sets servingInfo.clientCA to ca.crt. I could work around this problem by patching this playbook to use ca-bundle.crt. According to https://bugzilla.redhat.com/show_bug.cgi?id=1493276, it's the wrong solution, but it's 3am and I don't have any other solution to bring the cluster back online until morning.
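To see which file the master is currently using for servingInfo.clientCA before and after a redeploy, something like the following could help. It is a rough sketch: the default path /etc/origin/master/master-config.yaml and the function name `master_client_ca` are assumptions, and it uses a simple awk scan of the top-level servingInfo block rather than a real YAML parser.

```shell
#!/usr/bin/env bash

# master_client_ca [CONFIG_FILE]
# Prints the clientCA value from the top-level servingInfo block.
# Nested servingInfo blocks (indented under other keys) are ignored.
master_client_ca() {
    local config=${1:-/etc/origin/master/master-config.yaml}  # assumed default path
    awk '
        /^servingInfo:/ { in_block = 1; next }   # enter the top-level block
        /^[^ \t#]/      { in_block = 0 }         # any other top-level key ends it
        in_block && $1 == "clientCA:" { print $2; exit }
    ' "$config"
}
```

On a master where the redeploy has run, the comment above suggests this would print ca.crt, and ca-bundle.crt after applying the workaround.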
So what's the problem here and what are the prospects of seeing a solution in 3.10? Is upgrading to 3.11 going to make this better?
The webconsole issue was solved by reinstalling it
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-web-console/config.yml -e openshift_web_console_install=false
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-web-console/config.yml -e openshift_web_console_install=true
This can also be fixed by deleting the webconsole-serving-cert secret and then deleting the webconsole pod. The issue here is that the service was still running with an old certificate that is no longer trusted.
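A possible way to script the secret-and-pod approach is sketched below. The namespace openshift-web-console is an assumption about a default 3.10 install, and `run` and `refresh_webconsole_cert` are hypothetical helper names. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# refresh_webconsole_cert [NAMESPACE]
# Deletes the serving-cert secret, then the console pods, so that
# replacement pods start with a newly issued certificate.
refresh_webconsole_cert() {
    local ns=${1:-openshift-web-console}   # assumed namespace
    run oc delete secret webconsole-serving-cert -n "$ns"
    run oc delete pods --all -n "$ns"
}

refresh_webconsole_cert
```

Deleting every pod in the namespace is a blunt choice made here to avoid guessing at pod labels; deleting only the console pod by name, as the comment describes, works as well.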
I found this while searching for the issue. It looks like you need to replace the certificates on each node using bootstrap. https://bugzilla.redhat.com/show_bug.cgi?id=1652746
UPDATED MANUAL STEPS to replace certs on NODE.
1A. Create a new bootstrap.kubeconfig for the nodes (master nodes just copy admin.kubeconfig, see 1B)
# oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra > bootstrap.kubeconfig
1B. JUST ON THE MASTERS
# cp /etc/origin/master/admin.kubeconfig /etc/origin/node/bootstrap.kubeconfig
2. Distribute the kubeconfig created in step 1A to the infra and compute nodes, replacing /etc/origin/node/bootstrap.kubeconfig
3A. Remove /etc/origin/node/certificates and move node.kubeconfig and client-ca.crt aside
# rm -rf /etc/origin/node/certificates
# mv /etc/origin/node/client-ca.crt{,.old}
# mv /etc/origin/node/node.kubeconfig{,.old}
4. Restart node service.
# systemctl restart atomic-openshift-node.service
5. Approve CSRs. 2 should be approved.
# oc get csr -o name | xargs oc adm certificate approve
This might be helpful for step 2:
ansible -K all -m copy -a "src=~/bootstrap.kubeconfig dest=/etc/origin/node/bootstrap.kubeconfig"
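Steps 3A and 4 above could be wrapped into a small per-node helper, sketched here under a few assumptions. The function names `run` and `reset_node_certs` are hypothetical. The service name is a parameter because the steps use atomic-openshift-node.service while the logs earlier in this thread come from origin-node, so an Origin install likely needs origin-node.service instead. The script prints the commands by default; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reset_node_certs [NODE_DIR] [SERVICE]
# Run on each node after the new bootstrap.kubeconfig is in place.
reset_node_certs() {
    local node_dir=${1:-/etc/origin/node}
    local service=${2:-atomic-openshift-node.service}
    run rm -rf "$node_dir/certificates"                              # step 3A
    run mv "$node_dir/client-ca.crt" "$node_dir/client-ca.crt.old"
    run mv "$node_dir/node.kubeconfig" "$node_dir/node.kubeconfig.old"
    run systemctl restart "$service"                                 # step 4
}

reset_node_certs
```

Step 5 (approving the CSRs) is left as the one-line command from the steps, since it runs once from a host with cluster-admin access rather than on every node.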