Openshift-ansible: playbooks/redeploy-certificates breaks nodes to master communication, x509: certificate signed by unknown authority

Created on 5 Sep 2018  ·  11 Comments  ·  Source: openshift/openshift-ansible

Description

Running playbooks/redeploy-certificates.yml fails in the play "Restart nodes". The origin-node service on each node then fails to come up because all of its API calls fail with x509: certificate signed by unknown authority

I don't have any specific certificate configs in my inventory so I'm relying on self-signed CA & certs as generated by openshift-ansible.

Version
ansible 2.6.3
  config file = /home/os-admin/cluster/ansible.cfg
  configured module search path = [u'/home/os-admin/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /bin/ansible
  python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch
Steps To Reproduce
  1. Cluster is in good shape
# oc get nodes
NAME               STATUS  ROLES            AGE       VERSION
n22.redacted.com   Ready   compute,master   1y        v1.10.0+b81c8f8
n23.redacted.com   Ready   compute,master   1y        v1.10.0+b81c8f8
n24.redacted.com   Ready   compute,master   1y        v1.10.0+b81c8f8
  2. Run openshift-ansible/playbooks/redeploy-certificates.yml, which results in an error
TASK [Wait for node to be ready] ***********************************************************************
FAILED - RETRYING: Wait for node to be ready (36 retries left).
...
FAILED - RETRYING: Wait for node to be ready (1 retries left).
fatal: [192.168.173.23 -> 192.168.173.22]: FAILED! => {"attempts": 36, "changed": false, "results": {"cmd": "/usr/bin/oc get node n23.redacted.com -o json -n default", "results": [{"apiVersion": "v1", "kind": "Node", "metadata": {"annotations": {"volumes.kubernetes.io/controller-managed-attach-detach": "true"}, "creationTimestamp": "2017-01-10T15:08:52Z", "labels": {"beta.kubernetes.io/arch": "amd64", "beta.kubernetes.io/os": "linux", "kubernetes.io/hostname": "n23.redacted.com", "node-role.kubernetes.io/compute": "true", "node-role.kubernetes.io/master": "true", "region": "infra"}, "name": "n23.redacted.com", "resourceVersion": "220141119", "selfLink": "/api/v1/nodes/n23.redacted.com", "uid": "b29b57b3-d746-11e6-a28f-0cc47af70e64"}, "spec": {"externalID": "n23.redacted.com"}, "status": {"addresses": [{"address": "192.168.173.23", "type": "InternalIP"}, {"address": "n23.redacted.com", "type": "Hostname"}], "allocatable": {"alpha.kubernetes.io/nvidia-gpu": "0", "cpu": "40", "hugepages-1Gi": "0", "hugepages-2Mi": "0", "memory": "131816644Ki", "pods": "250"}, "capacity": {"alpha.kubernetes.io/nvidia-gpu": "0", "cpu": "40", "hugepages-1Gi": "0", "hugepages-2Mi": "0", "memory": "131919044Ki", "pods": "250"}, "conditions": [{"lastHeartbeatTime": "2018-09-05T15:40:49Z", "lastTransitionTime": "2018-09-05T15:42:28Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "OutOfDisk"}, {"lastHeartbeatTime": "2018-09-05T15:40:49Z", "lastTransitionTime": "2018-09-05T15:42:28Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "MemoryPressure"}, {"lastHeartbeatTime": "2018-09-05T15:40:49Z", "lastTransitionTime": "2018-09-05T15:42:28Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "Ready"}, {"lastHeartbeatTime": "2018-09-05T15:40:49Z", "lastTransitionTime": "2018-09-05T15:42:28Z", "message": 
"Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "DiskPressure"}, {"lastHeartbeatTime": "2018-09-05T15:40:49Z", "lastTransitionTime": "2018-08-24T15:38:34Z", "message": "kubelet has sufficient PID available", "reason": "KubeletHasSufficientPID", "status": "False", "type": "PIDPressure"}], "daemonEndpoints": {"kubeletEndpoint": {"Port": 10250}}, "images": ["REDACTED"], "nodeInfo": {"architecture": "amd64", "bootID": "fc93f998-6475-4ec6-aab2-eee83ec78dd8", "containerRuntimeVersion": "docker://1.13.1", "kernelVersion": "3.10.0-862.11.6.el7.x86_64", "kubeProxyVersion": "v1.10.0+b81c8f8", "kubeletVersion": "v1.10.0+b81c8f8", "machineID": "74e3cadf71174ca1aff1c61f824e9b07", "operatingSystem": "linux", "osImage": "CentOS Linux 7 (Core)", "systemUUID": "00000000-0000-0000-0000-0CC47AF70E64"}}}], "returncode": 0}, "state": "list"}
  3. Notice the cluster is in bad shape
# oc get nodes
NAME               STATUS     ROLES            AGE       VERSION
n22.redacted.com   NotReady   compute,master   1y        v1.10.0+b81c8f8
n23.redacted.com   NotReady   compute,master   1y        v1.10.0+b81c8f8
n24.redacted.com   NotReady   compute,master   1y        v1.10.0+b81c8f8

  OutOfDisk        Unknown   Wed, 05 Sep 2018 11:41:06 -0400   Wed, 05 Sep 2018 11:42:27 -0400   NodeStatusUnknown         Kubelet stopped posting node status.
  MemoryPressure   Unknown   Wed, 05 Sep 2018 11:41:06 -0400   Wed, 05 Sep 2018 11:42:27 -0400   NodeStatusUnknown         Kubelet stopped posting node status.
  Ready            Unknown   Wed, 05 Sep 2018 11:41:06 -0400   Wed, 05 Sep 2018 11:42:27 -0400   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure     Unknown   Wed, 05 Sep 2018 11:41:06 -0400   Wed, 05 Sep 2018 11:42:27 -0400   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure      False     Wed, 05 Sep 2018 11:41:06 -0400   Fri, 24 Aug 2018 11:41:17 -0400   KubeletHasSufficientPID   kubelet has sufficient PID available
  4. origin-node logs show all API calls are failing
Sep 05 12:31:53 n22.redacted.com origin-node[27743]: I0905 12:31:53.626700   27743 kubelet_node_status.go:82] Attempting to register node n22.redacted.com
Sep 05 12:31:53 n22.redacted.com origin-node[27743]: E0905 12:31:53.632462   27743 kubelet_node_status.go:106] Unable to register node "n22.redacted.com" with API server: Post https://n20.redacted.com:8443/api/v1/nodes: x509: certificate signed by unknown authority
Sep 05 12:31:54 n22.redacted.com origin-node[27743]: E0905 12:31:54.276000   27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://n20.redacted.com:8443/api/v1/nodes?fieldSelector=metadata.name%3Dn22.redacted.com&limit=500&resourceVersion=0: x509: certificate signed by unknown authority
Sep 05 12:31:54 n22.redacted.com origin-node[27743]: E0905 12:31:54.276513   27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://n20.redacted.com:8443/api/v1/services?limit=500&resourceVersion=0: x509: certificate signed by unknown authority
Sep 05 12:31:54 n22.redacted.com origin-node[27743]: E0905 12:31:54.277002   27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://n20.redacted.com:8443/api/v1/pods?fieldSelector=spec.nodeName%3Dn22.redacted.com&limit=500&resourceVersion=0: x509: certificate signed by unknown authority
Sep 05 12:31:55 n22.redacted.com origin-node[27743]: E0905 12:31:55.283133   27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://n20.redacted.com:8443/api/v1/nodes?fieldSelector=metadata.name%3Dn22.redacted.com&limit=500&resourceVersion=0: x509: certificate signed by unknown authority
Sep 05 12:31:55 n22.redacted.com origin-node[27743]: E0905 12:31:55.283667   27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://n20.redacted.com:8443/api/v1/services?limit=500&resourceVersion=0: x509: certificate signed by unknown authority
Sep 05 12:31:55 n22.redacted.com origin-node[27743]: E0905 12:31:55.284443   27743 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://n20.redacted.com:8443/api/v1/pods?fieldSelector=spec.nodeName%3Dn22.redacted.com&limit=500&resourceVersion=0: x509: certificate signed by unknown authority
Expected Results

playbooks/redeploy-certificates.yml should redeploy the certificates

Observed Results

Nodes end up unable to talk to the API server because they no longer trust the CA that signed its certificate
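
One way to confirm which CA a node actually trusts is to pull the CA out of its kubeconfig and test it against the API endpoint. This is only a minimal sketch, assuming the CA is stored inline under certificate-authority-data; kubeconfig_ca is a hypothetical helper name and /tmp/node-ca.pem is an arbitrary scratch path.

```shell
# Print the first CA embedded in a kubeconfig as PEM.
# Assumption: the CA is inlined as base64 under certificate-authority-data
# (not referenced by file path via certificate-authority).
kubeconfig_ca() {
    awk '$1 == "certificate-authority-data:" { print $2; exit }' "$1" | base64 --decode
}

# Example use on a node (not executed here):
#   kubeconfig_ca /etc/origin/node/node.kubeconfig > /tmp/node-ca.pem
#   openssl s_client -connect n20.redacted.com:8443 -CAfile /tmp/node-ca.pem </dev/null
```

If the openssl output ends with a non-zero "Verify return code", the CA in the node's kubeconfig did not sign the certificate the API endpoint is serving, which would match the x509 errors in the logs above.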

Additional Information
# inventory
[OSEv3:children]
nodes
nfs
masters
lb
etcd

[OSEv3:vars]
openshift_master_default_subdomain=dev.redacted.com
ansible_ssh_user=root
openshift_override_hostname_check=true
openshift_master_cluster_method=native
openshift_master_cluster_hostname=n20.redacted.com
openshift_master_cluster_public_hostname=dev.redacted.com
deployment_type=origin
openshift_release="3.10"
openshift_repos_enable_testing=true

openshift_service_catalog_remove=true
openshift_enable_service_catalog=true
ansible_service_broker_remove=true
ansible_service_broker_install=false
template_service_broker_remove=true
template_service_broker_install=false

osm_default_node_selector='node-role.kubernetes.io/compute=true'

[nodes]
192.168.173.22  openshift_public_ip=192.168.173.22 openshift_ip=192.168.173.22 openshift_public_hostname=n22.redacted.com openshift_hostname=n22.redacted.com connect_to=192.168.173.22 openshift_schedulable=True openshift_node_group_name='node-config-compute'
192.168.173.23  openshift_public_ip=192.168.173.23 openshift_ip=192.168.173.23 openshift_public_hostname=n23.redacted.com openshift_hostname=n23.redacted.com connect_to=192.168.173.23 openshift_schedulable=True openshift_node_group_name='node-config-compute'
192.168.173.24  openshift_public_ip=192.168.173.24 openshift_ip=192.168.173.24 openshift_public_hostname=n24.redacted.com openshift_hostname=n24.redacted.com connect_to=192.168.173.24 openshift_schedulable=True openshift_node_group_name='node-config-compute'
192.168.173.25  openshift_public_ip=192.168.173.25 openshift_ip=192.168.173.25 openshift_public_hostname=n25.redacted.com openshift_hostname=n25.redacted.com connect_to=192.168.173.25 openshift_schedulable=True openshift_node_group_name='node-config-compute'
192.168.173.26  openshift_public_ip=192.168.173.26 openshift_ip=192.168.173.26 openshift_public_hostname=n26.redacted.com openshift_hostname=n26.redacted.com connect_to=192.168.173.26 openshift_schedulable=True openshift_node_group_name='node-config-compute'

[nfs]
192.168.173.27  openshift_public_ip=192.168.173.27 openshift_ip=192.168.173.27 openshift_public_hostname=n27.redacted.com openshift_hostname=n27.redacted.com connect_to=192.168.173.27
#192.168.173.28  openshift_public_ip=192.168.173.28 openshift_ip=192.168.173.28 openshift_public_hostname=n28.redacted.com openshift_hostname=n28.redacted.com connect_to=192.168.173.28

[masters]
192.168.173.22  openshift_public_ip=192.168.173.22 openshift_ip=192.168.173.22 openshift_public_hostname=n22.redacted.com openshift_hostname=n22.redacted.com connect_to=192.168.173.22 openshift_node_group_name='node-config-master'
192.168.173.23  openshift_public_ip=192.168.173.23 openshift_ip=192.168.173.23 openshift_public_hostname=n23.redacted.com openshift_hostname=n23.redacted.com connect_to=192.168.173.23 openshift_node_group_name='node-config-master'
192.168.173.24  openshift_public_ip=192.168.173.24 openshift_ip=192.168.173.24 openshift_public_hostname=n24.redacted.com openshift_hostname=n24.redacted.com connect_to=192.168.173.24 openshift_node_group_name='node-config-master'

[lb]
192.168.173.20  openshift_public_ip=8.43.84.242 openshift_ip=192.168.173.20 openshift_public_hostname=dev.redacted.com openshift_hostname=n20.redacted.com connect_to=192.168.173.20

[etcd]
192.168.173.22  openshift_public_ip=192.168.173.22 openshift_ip=192.168.173.22 openshift_public_hostname=n22.redacted.com openshift_hostname=n22.redacted.com connect_to=192.168.173.22
192.168.173.23  openshift_public_ip=192.168.173.23 openshift_ip=192.168.173.23 openshift_public_hostname=n23.redacted.com openshift_hostname=n23.redacted.com connect_to=192.168.173.23
192.168.173.24  openshift_public_ip=192.168.173.24 openshift_ip=192.168.173.24 openshift_public_hostname=n24.redacted.com openshift_hostname=n24.redacted.com connect_to=192.168.173.24

lifecycle/rotten

All 11 comments

I've also attempted to use redeploy-openshift-ca.yml to generate a fresh CA and redeploy the other certs from it, but I ended up with the same errors.

At this point I'd be OK with starting over with all new certs but I don't seem to be able to achieve that with the current redeploy-certificates playbooks.

@sdodson we have definite problems with 3.10+ cert plays.

@jfchevrette thanks for the report. This is probably completely broken at the moment.

I can now report that running deploy_cluster.yml brings the cluster back to a good state. However, my web console remains inaccessible and returns an HTTP 502.

The webconsole issue was solved by reinstalling it

ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-web-console/config.yml -e openshift_web_console_install=false
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-web-console/config.yml -e openshift_web_console_install=true

Was this issue fixed by just deleting them or something? All the cert playbooks referenced in the 3.10 documentation do not exist anymore. So there's no way to renew certificates.

I ended up reinstalling the cluster completely and have not attempted to reset the certificates since.

I'm experiencing this problem with 3.10. During redeploy-certificates, a playbook called revert-client-ca is executed, which sets servingInfo.clientCA to ca.crt. I could work around this problem by patching this playbook to use ca-bundle.crt. According to https://bugzilla.redhat.com/show_bug.cgi?id=1493276, it's the wrong solution, but it's 3am and I don't have any other solution to bring the cluster back online until morning.
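
To see which CA file the master is currently told to use for client certificates, the clientCA entries can be listed directly. A small sketch, assuming the default config path /etc/origin/master/master-config.yaml; show_client_ca is a hypothetical helper name.

```shell
# List every clientCA setting in a master config, with line numbers.
# Assumption: the config lives at the default 3.x path unless one is passed in.
show_client_ca() {
    grep -n 'clientCA:' "${1:-/etc/origin/master/master-config.yaml}"
}

# Example use on a master (not executed here):
#   show_client_ca
```

A line showing ca.crt versus ca-bundle.crt tells you whether revert-client-ca has already rewritten the setting.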

So what's the problem here and what are the prospects of seeing a solution in 3.10? Is upgrading to 3.11 going to make this better?

The webconsole issue was solved by reinstalling it

ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-web-console/config.yml -e openshift_web_console_install=false
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-web-console/config.yml -e openshift_web_console_install=true

This can also be fixed by deleting the webconsole-serving-cert secret and then deleting the webconsole pod. The issue here is that the service was still running with an old certificate that is no longer trusted.
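
As a rough sketch of that alternative: the namespace openshift-web-console and the pod label app=openshift-web-console are assumptions about a default install, and refresh_webconsole_cert and DRY_RUN are hypothetical names. Check them against your cluster before running anything.

```shell
# Delete the web console serving-cert secret and its pods so both get recreated.
# DRY_RUN=1 prints the commands instead of running them.
# Assumptions: namespace and pod label match a default 3.10 install.
refresh_webconsole_cert() {
    ns="${1:-openshift-web-console}"
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    $run oc delete secret webconsole-serving-cert -n "$ns"
    $run oc delete pod -l app=openshift-web-console -n "$ns"
}

# Example (prints the two oc commands without touching the cluster):
#   DRY_RUN=1 refresh_webconsole_cert
```

The secret is expected to be regenerated automatically, assuming the console service still carries its serving-cert annotation; the new pods then pick up the fresh certificate.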

I found this while searching the issue. Looks like you need to replace the certificates using bootstrap on each node. https://bugzilla.redhat.com/show_bug.cgi?id=1652746

UPDATED MANUAL STEPS to replace certs on NODE. 

1A. Create a new bootstrap.kubeconfig for nodes (MASTER nodes will just copy admin.kubeconfig)
# oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra > bootstrap.kubeconfig

1B. JUST ON THE MASTERS 
# cp /etc/origin/master/admin.kubeconfig /etc/origin/node/bootstrap.kubeconfig

2. Distribute the kubeconfig created in step 1A to the infra and compute nodes, replacing /etc/origin/node/bootstrap.kubeconfig

3. Remove /etc/origin/node/certificates and move node.kubeconfig and client-ca.crt aside
# rm -rf  /etc/origin/node/certificates
# mv /etc/origin/node/client-ca.crt{,.old}
# mv /etc/origin/node/node.kubeconfig{,.old}

4. Restart node service.
# systemctl restart atomic-openshift-node.service 

5. Approve CSRs. 2 should be approved.  
# oc get csr -o name | xargs oc adm certificate approve
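
For step 5, a slightly more cautious variant is to approve only the requests that are still pending rather than every CSR. This is a sketch, assuming CONDITION is the last column of `oc get csr` output; pending_csrs is a hypothetical helper name.

```shell
# Read `oc get csr --no-headers` output on stdin and print the names
# of requests whose last column (CONDITION) is exactly "Pending".
pending_csrs() {
    awk '$NF == "Pending" { print $1 }'
}

# Example use (not executed here; -r is a GNU xargs flag that skips
# the command when there is nothing to approve):
#   oc get csr --no-headers | pending_csrs | xargs -r oc adm certificate approve
```

It is still worth reading the pending list before approving, since this filter does not check who requested each certificate.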

This might be helpful for step 2:

ansible -K all -m copy -a "src=~/bootstrap.kubeconfig dest=/etc/origin/node/bootstrap.kubeconfig"

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale
