Openshift-ansible: Timed out accepting certificate signing requests.

Created on 16 Aug 2018 · 21 Comments · Source: openshift/openshift-ansible

Description

An OpenShift 3.10 cluster installation fails when attempting to accept certificate signing requests (CSRs). The oc_adm_csr.py module times out after 60 seconds.
Four certificates needed to be signed. All of them were approved, but the process took 67 seconds to complete.

Version

# ansible --version
ansible 2.4.4.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May 31 2018, 09:41:32) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

release-3.10 branch

$ git describe 
openshift-ansible-3.10.27-2-69-gd96b19f2a

Steps To Reproduce

Configure inventory using 1 master and 1 node on bare metal.
Run the cluster install script.

Expected Results

The installation completes successfully.

Observed Results

"Timed out accepting certificate signing requests. Failing as requested."

INSTALLER STATUS ***************************************************************
Initialization              : Complete (0:00:09)
Health Check                : Complete (0:02:48)
Node Bootstrap Preparation  : Complete (0:00:01)
etcd Install                : Complete (0:00:22)
Master Install              : Complete (0:01:29)
Master Additional Install   : Complete (0:00:48)
Node Join                   : In Progress (0:01:10)

Failure summary:
  1. Hosts:    benchserver7.acme.com
     Play:     Approve any pending CSR requests from inventory nodes
     Task:     Report approval errors
     Message:  Node approval failed

Detailed -vvv logging of Ansible script. All the gory details are here.

Additional Information

# uname -a
Linux benchserver7 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/ansible/hosts 
# This is the default ansible 'hosts' file.

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
containerized=false
openshift_deployment_type=openshift-enterprise
debug_level=0
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true',]}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true'], 'edits': [{ 'key': 'kubeletArguments.pods-per-core','value': ['20']}]}]
openshift_master_cluster_hostname=benchserver7
ansible_ssh_user=root
openshift_enable_service_catalog=false
disk_availability=false
openshift_disable_check=memory_availability,disk_availability

[masters]
benchserver7.acme.com

[etcd]
benchserver7.acme.com

[nodes]
benchserver7.acme.com openshift_node_group_name='node-config-master'
#benchserver5.acme.com openshift_node_group_name='node-config-infra'
benchserver2.acme.com openshift_node_group_name='node-config-compute'
#

Source

whitingjr

All 21 comments

I've seen this happening a few times as well, looks like random API server hangups

@mfojtik any ideas how to debug timeouts on api server requests?

vrutkovs on 16 Aug 2018

I've restarted the installation from scratch, setting the timeout to 120.
The task timed out again, this time at 133 seconds. Checking the CSRs:

# oc get csr
NAME        AGE       REQUESTOR                  CONDITION
csr-ll5xn   5m        system:admin               Approved,Issued
csr-p5mwc   3m        system:node:benchserver7   Approved,Issued
csr-qj5wc   5m        system:admin               Approved,Issued
csr-xdbw2   2m        system:node:benchserver7   Approved,Issued

Is the Python script oc_adm_csr.py functioning correctly?

whitingjr on 16 Aug 2018

@vrutkovs this isn't random for me. It is repeatable. Every time.

whitingjr on 16 Aug 2018

The weird part is that the hosts are named benchserver7.acme.com, but the CSR is created for the node benchserver7.

Does hostname -f on the host match the hostnames in the Ansible inventory? What's the output of oc get nodes and oc describe csr csr-p5mwc?
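
The mismatch asked about here can be checked mechanically. The following is a minimal sketch, assuming the CSR requestor strings have already been copied out of the oc get csr output; the function names short_name and classify_requestors are hypothetical and are not part of openshift-ansible.

```python
def short_name(hostname: str) -> str:
    """Return the part of a hostname before the first dot."""
    return hostname.split(".", 1)[0]


def classify_requestors(inventory_hosts, requestors):
    """Map each inventory host to 'exact', 'short', or 'missing'.

    requestors are CSR requestor strings such as 'system:node:benchserver7'.
    'short' means a CSR exists only under the unqualified hostname.
    """
    prefix = "system:node:"
    node_names = {r[len(prefix):] for r in requestors if r.startswith(prefix)}
    result = {}
    for host in inventory_hosts:
        if host in node_names:
            result[host] = "exact"
        elif short_name(host) in node_names:
            result[host] = "short"
        else:
            result[host] = "missing"
    return result


# Values taken from the inventory and the oc get csr output in this issue.
inventory = ["benchserver7.acme.com", "benchserver2.acme.com"]
requestors = ["system:admin", "system:node:benchserver7"]
report = classify_requestors(inventory, requestors)
print(report)
```

A host reported as 'short' has a CSR under its unqualified name only, and a host reported as 'missing' has no node CSR at all. Whether either case is what makes the installer time out is not established by this sketch.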

vrutkovs on 16 Aug 2018

I am hitting the same issue with 3.10 openshift-enterprise. The CSRs are approved and the output of hostname -f is correct, but Ansible fails with "Node approval failed".

karthikrmit on 16 Aug 2018

It looks like it fails because of
"server_accepted": false, "csrs": {}, "client_accepted": false, "name": "benchserver2"
Until these are true for both nodes, it will time out.
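
The condition described above can be sketched as a polling loop. This is an illustration only, not the code of oc_adm_csr.py: the function names, the fetch_nodes callback, and the sample node records are assumptions, and only the field names (client_accepted, server_accepted, csrs, name) come from the task output.

```python
import time


def all_accepted(nodes):
    """True only when every node has both certificates accepted."""
    return all(n["client_accepted"] and n["server_accepted"] for n in nodes)


def wait_for_approval(fetch_nodes, timeout, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_nodes() until all nodes are accepted or timeout elapses.

    Returns a (finished, nodes) tuple; finished is False on timeout.
    """
    deadline = clock() + timeout
    while True:
        nodes = fetch_nodes()
        if all_accepted(nodes):
            return True, nodes
        if clock() >= deadline:
            return False, nodes
        sleep(interval)


# Illustrative state: one node done, one node with no CSRs at all.
stuck = [
    {"name": "benchserver7", "csrs": {"csr-p5mwc": {}},
     "client_accepted": True, "server_accepted": True},
    {"name": "benchserver2", "csrs": {},
     "client_accepted": False, "server_accepted": False},
]
finished, _ = wait_for_approval(lambda: stuck, timeout=0, interval=0)
print(finished)
```

With a node that never produces a CSR, no timeout value is long enough under this logic, which would be consistent with the failures at both 60 and 120 seconds.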

whitingjr on 16 Aug 2018

@vrutkovs I have obfuscated the domain name with .acme.com

Yes, hostname -f matches the value of openshift_master_cluster_hostname.
Further down in the hosts file, the hostname is fully qualified with the domain.

whitingjr on 16 Aug 2018

@vrutkovs looking back over the Gist attached to this issue, lines 201-207 show the JSON for the CSRs requested by benchserver2. There are none; the list is empty, whereas benchserver7 had 4. No wonder server_accepted=false and client_accepted=false for benchserver2.
In 3.9, only the initial master had CSRs issued.
In 3.10, the CSR signing has been refactored. Currently both infrastructure and compute nodes have signing requests issued, which I suspect is not what was intended when the change to use all nodes was made. Is that correct, @michaelgugino @vrutkovs?

whitingjr on 16 Aug 2018

@whitingjr I have the log at node_approval_failure_log
I just tried the below config (fail_on_timeout: false) in playbooks/openshift-node/private/join.yml and it worked

- name: Approve bootstrap node
  oc_adm_csr:
    nodes: "{{ l_nodes_to_join }}"
    timeout: 60
    fail_on_timeout: false

karthikrmit on 17 Aug 2018

I also encountered the same problem

TASK [Approve bootstrap nodes] ******************************************************************************************
fatal: [10.10.244.212]: FAILED! => {"changed": true, "finished": false, "msg": "Timed out accepting certificate signing requests. Failing as requested.", "nodes": [{"client_accepted": true, "csrs": {"csr-4v5sv": {"apiVersion": "certificates.k8s.io/v1beta1", "kind": "CertificateSigningRequest", "metadata": 
{"creationTimestamp": "2018-08-17T06:50:04Z", "generateName": "csr-", "name": "csr-4v5sv", "namespace": "", "resourceVersion": "714", "selfLink": 
"/apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-4v5sv", "uid": "c5219812-a1e9-11e8-8618-525400614a73"}, "spec": {"groups": ["system:masters", "system:cluster-admins", "system:authenticated"], "request": 
"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQkx6Q0IxZ0lCQURBN01SVXdFd1lEVlFRS0V3eHplWE4wWlcwNmJtOWtaWE14SWpBZ0JnTlZCQU1UR1hONQpjM1JsYlRwdWIyUmxPakV3TFRFd0xUSTBOQzB5TVRJd1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBVGxsdlowZkZPak5zbnBjdDEwSEdBMkpjNTY5ZzFTK0Y3NWpOcUowRDkrUXJhaWx2eVIxN0x0T3ViVFp0RVUKQTdkTmViQzErd2dyc2tOaDlzOWVZaEhYb0Rrd053WUpLb1pJaHZjTkFRa09NU293S0RBbUJnTlZIUkVFSHpBZApnZzB4TUMweE1DMHlORFF0TWpFeWdnQ0hCQW9LOU5TSEJLd1JBQUV3Q2dZSUtvWkl6ajBFQXdJRFNBQXdSUUlnCkFRenRhRUhTcWcwMXNmdUVmQWRZcnRnQWhadVpaaUlPL3hBNTZUUHRKbElDSVFEWDRaWjgyNEJwK3hGWm9qaXgKYU1Bd2l1K3RQczhobTdFaVBQYnpVT3BVRUE9PQotLS0tLUVORCBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K", "usages": ["digital signature", "key encipherment", "server auth"], "username": "system:admin"},

TASK [Report approval errors] ******************************************************************************************
fatal: [10.10.244.212]: FAILED! => {"changed": false, "msg": "Node approval failed"}

  1. Hosts:    10.10.244.212
     Play:     Approve any pending CSR requests from inventory nodes
     Task:     Report approval errors
     Message:  Node approval failed

oc get csr             
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-4v5sv                                              12m       system:admin                                              Approved
csr-dt7s6                                              12m       system:admin                                              Approved
csr-zr8k5                                              6m        system:admin                                              Approved
node-csr-BFTK_sNDdRTqFpukK3a-t_Dw6utL0JgPqDloWm3xbdg   6m        system:serviceaccount:openshift-infra:node-bootstrapper   Approved
node-csr-iBMOx7LAmvdYUdE1cclIdg9ZafQUNkD7QW9uSkGthg8   6m        system:serviceaccount:openshift-infra:node-bootstrapper   Approved
node-csr-yGjHFFSyx3Js5SyW9Xu0r5VMCK3btSazgqRvB5ggy-8   6m        system:serviceaccount:openshift-infra:node-bootstrapper   Approved
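
In this listing every CSR shows Approved but none shows Issued, unlike the earlier listing in this thread. A minimal sketch for spotting that from the text output follows; the parser and its function names are hypothetical, it assumes the four-column layout shown here, and whether a missing Issued condition is related to the timeout is an assumption rather than something this thread confirms.

```python
def parse_csr_table(text):
    """Parse `oc get csr` text output into a list of dicts.

    Assumes four whitespace-separated columns: NAME AGE REQUESTOR CONDITION.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    rows = []
    for line in lines[1:]:  # skip the header row
        name, age, requestor, condition = line.split()
        rows.append({
            "name": name,
            "age": age,
            "requestor": requestor,
            "conditions": condition.split(","),
        })
    return rows


def not_issued(rows):
    """Return the names of CSRs whose conditions do not include 'Issued'."""
    return [r["name"] for r in rows if "Issued" not in r["conditions"]]


# Rows copied from the two listings in this issue.
sample = """\
NAME        AGE       REQUESTOR                  CONDITION
csr-ll5xn   5m        system:admin               Approved,Issued
csr-4v5sv   12m       system:admin               Approved
csr-zr8k5   6m        system:admin               Approved
"""
pending_issue = not_issued(parse_csr_table(sample))
print(pending_issue)
```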

oc describe csr csr-4v5sv                                                                                                           
Name:               csr-4v5sv
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Fri, 17 Aug 2018 14:50:04 +0800
Requesting User:    system:admin
Status:             Approved
Subject:
  Common Name:    system:node:10-10-244-212
  Serial Number:  
  Organization:   system:nodes
Subject Alternative Names:
         DNS Names:     10-10-244-212

         IP Addresses:  10.10.244.212
                        172.17.0.1

aland-zhang on 17 Aug 2018

@kmurthy1 thanks for sharing that workaround to disable the fail_on_timeout. It worked for me to progress beyond this issue.
@aland-zhang I suggest you try switching the boolean to false and having another go.

whitingjr on 17 Aug 2018

👍1

While there is a workaround, the underlying issue of unnecessary attempts at CSR requests for compute nodes is still there, if my understanding of the issue is correct.

whitingjr on 17 Aug 2018

@vrutkovs Is this issue fixed upstream?

karthikrmit on 21 Aug 2018

For what it's worth, I'm experiencing the same thing. I'm able to reliably reproduce the error "Timed out accepting certificate signing requests. Failing as requested."

I attempted the work-around suggested, with:

sed -i -e 's/fail_on_timeout: true/fail_on_timeout: false/' playbooks/openshift-node/private/join.yml

While this does make it so the plays continue after that failing play, I'm then faced with the play openshift_manage_node : Wait for Node Registration failing, and wind up with oc get nodes showing only the master available, no other nodes in the cluster.

The openshift-ansible version I have locally (pulled this morning)...

# [root@droctagon3 openshift-ansible]# git rev-parse HEAD
# 7a136e99c33927a00f2f3a58b2de5e170e880252

The inventory I'm using follows:

threeten-infra.test.example.com ansible_host=192.168.1.19
threeten-master.test.example.com ansible_host=192.168.1.43
threeten-node1.test.example.com ansible_host=192.168.1.197

[masters]
threeten-master.test.example.com

[etcd]
threeten-master.test.example.com

[nodes]
threeten-master.test.example.com openshift_node_group_name="node-config-master"
threeten-infra.test.example.com openshift_node_group_name="node-config-infra"
threeten-node1.test.example.com openshift_node_group_name="node-config-compute"

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
openshift_release="3.10"
openshift_install_examples=false
openshift_deployment_type=origin
openshift_master_default_subdomain=apps.test.example.com
openshift_master_cluster_hostname=threeten-master.test.example.com
openshift_disable_check=disk_availability,memory_availability
openshift_enable_docker_excluder=False
debug_level=2
ansible_ssh_user=centos
ansible_become=yes
ansible_ssh_private_key_file=/root/.ssh/id_vm_rsa

dougbtv on 22 Aug 2018

All, thank you for the detailed failure reports. I am in the process of creating a custom module to deal with this csr signing issue here: https://github.com/openshift/openshift-ansible/pull/9711

We plan to backport to 3.10 as soon as it's ready, hopefully in the next day or so.

michaelgugino on 22 Aug 2018

👍3

For what it's worth, I realized that one issue I had on my deployment was that my nodes weren't able to resolve the DNS names of the other nodes and the master in the cluster. In the past, this would've been caught in the prerequisites playbook, but this time it wasn't.

In my case, this is in a lab, and I have an /etc/hosts that has the DNS names and IPs of each node in the cluster. Once that was in place, this play succeeded.
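
A check along these lines can be run on each node before installing. This is a minimal sketch, not part of the prerequisites playbook: the function name unresolvable_hosts is hypothetical, and the resolver is passed in as a parameter so the example below can use a stand-in for /etc/hosts instead of real DNS.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that the resolver cannot turn into an address."""
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


# Stand-in for an /etc/hosts file that is missing one entry (illustrative).
hosts_file = {
    "threeten-master.test.example.com": "192.168.1.43",
    "threeten-infra.test.example.com": "192.168.1.19",
}


def fake_resolve(name):
    """Resolve from the dict above, failing the way a real lookup would."""
    if name not in hosts_file:
        raise socket.gaierror("Name or service not known")
    return hosts_file[name]


inventory = [
    "threeten-master.test.example.com",
    "threeten-infra.test.example.com",
    "threeten-node1.test.example.com",
]
missing = unresolvable_hosts(inventory, resolve=fake_resolve)
print(missing)
```

Calling unresolvable_hosts(inventory) without the resolve argument uses the system resolver, so it reflects both DNS and the local /etc/hosts on the machine where it runs.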

dougbtv on 22 Aug 2018

@michaelgugino the changes solved the problem. Doing an install using 3.10 progressed beyond this issue. Happy to consider this issue solved for me.

whitingjr on 6 Sep 2018

@whitingjr thanks for the update!

Closed by: https://github.com/openshift/openshift-ansible/pull/9711

michaelgugino on 6 Sep 2018

@whitingjr I have the log at node_approval_failure_log
I just tried the below config (fail_on_timeout: false) in playbooks/openshift-node/private/join.yml and it worked
- name: Approve bootstrap node
  oc_adm_csr:
    nodes: "{{ l_nodes_to_join }}"
    timeout: 60
    fail_on_timeout: false

I tried the same as you and it didn't work for me.

whitingjr on 17 Sep 2018

There have been a lot of changes in this area. If anyone is still having issues, I recommend using the latest commit of your desired release branch to see whether it has been fixed in your case.

If not, please open a new bugzilla or github issue.

michaelgugino on 17 Sep 2018

This issue may be related to whether the pyOpenSSL package is installed. You can check that first with rpm -aq|grep pyOpenSSL.