Openshift-ansible: Timed out accepting certificate signing requests.

Created on 16 Aug 2018 · 21 Comments · Source: openshift/openshift-ansible

Description

An OpenShift 3.10 cluster installation fails when attempting to accept certificate signing requests. The oc_adm_csr.py module times out after 60 seconds.
Four certificates needed to be signed. All of them were approved, but the process took 67 seconds to complete.

Version
# ansible --version
ansible 2.4.4.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May 31 2018, 09:41:32) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

If you're operating from a git clone:

release-3.10 branch

$ git describe 
openshift-ansible-3.10.27-2-69-gd96b19f2a
Steps To Reproduce
  1. Configure inventory using 1 master and 1 node on bare metal.
  2. Run the cluster install script.
Expected Results

The installation completes successfully.

Observed Results
"Timed out accepting certificate signing requests. Failing as requested."
INSTALLER STATUS ***************************************************************
Initialization              : Complete (0:00:09)
Health Check                : Complete (0:02:48)
Node Bootstrap Preparation  : Complete (0:00:01)
etcd Install                : Complete (0:00:22)
Master Install              : Complete (0:01:29)
Master Additional Install   : Complete (0:00:48)
Node Join                   : In Progress (0:01:10)

Failure summary:
  1. Hosts:    benchserver7.acme.com
     Play:     Approve any pending CSR requests from inventory nodes
     Task:     Report approval errors
     Message:  Node approval failed

Detailed -vvv logging of Ansible script. All the gory details are here.

Additional Information

# uname -a
Linux benchserver7 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/ansible/hosts 
# This is the default ansible 'hosts' file.

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
containerized=false
openshift_deployment_type=openshift-enterprise
debug_level=0
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true',]}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true'], 'edits': [{ 'key': 'kubeletArguments.pods-per-core','value': ['20']}]}]
openshift_master_cluster_hostname=benchserver7
ansible_ssh_user=root
openshift_enable_service_catalog=false
disk_availability=false
openshift_disable_check=memory_availability,disk_availability

[masters]
benchserver7.acme.com

[etcd]
benchserver7.acme.com

[nodes]
benchserver7.acme.com openshift_node_group_name='node-config-master'
#benchserver5.acme.com openshift_node_group_name='node-config-infra'
benchserver2.acme.com openshift_node_group_name='node-config-compute'
#

All 21 comments

I've seen this happening a few times as well, looks like random API server hangups

@mfojtik any ideas how to debug timeouts on api server requests?

I restarted the installation from scratch with the timeout set to 120.
It timed out again, this time at 133 seconds. Checking the CSRs:

# oc get csr
NAME        AGE       REQUESTOR                  CONDITION
csr-ll5xn   5m        system:admin               Approved,Issued
csr-p5mwc   3m        system:node:benchserver7   Approved,Issued
csr-qj5wc   5m        system:admin               Approved,Issued
csr-xdbw2   2m        system:node:benchserver7   Approved,Issued

Is the Python script oc_adm_csr.py functioning correctly?

@vrutkovs this isn't random for me. It is repeatable. Every time.

The weird part is that the hosts are named benchserver7.acme.com, but the CSR is created for the node benchserver7.

Does hostname -f on the host match the hostnames in the Ansible inventory? What's the output of oc get nodes and oc describe csr csr-p5mwc?
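
The comparison asked for above can be sketched in Python. This is only an illustration: hostname_report is a hypothetical helper, not part of openshift-ansible, and socket.getfqdn() is used as an approximation of hostname -f, which may differ on some hosts.

```python
import socket


def hostname_report(inventory_hosts, fqdn=None):
    """Compare this host's FQDN and short name against inventory names.

    hostname_report is a hypothetical helper for illustration only.
    socket.getfqdn() approximates `hostname -f`; pass fqdn explicitly
    to check a name obtained some other way.
    """
    fqdn = fqdn or socket.getfqdn()
    short = fqdn.split(".")[0]
    return {
        "fqdn": fqdn,
        "short": short,
        "fqdn_in_inventory": fqdn in inventory_hosts,
        "short_in_inventory": short in inventory_hosts,
    }


# Inventory names as they appear in the [nodes] section of this issue.
inventory = ["benchserver7.acme.com", "benchserver2.acme.com"]
print(hostname_report(inventory))
```

If the node registers under its short name while the inventory lists the fully qualified name, fqdn_in_inventory and short_in_inventory will disagree, which is the mismatch being asked about.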

I am hitting the same issue with 3.10 openshift-enterprise. The CSRs are approved and the output of hostname -f is correct, but Ansible fails with "Node approval failed".

It looks like it fails because of
"server_accepted": false, "csrs": {}, "client_accepted": false, "name": "benchserver2",
Until both server_accepted and client_accepted are true for a node, it will time out.
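
A minimal sketch of that condition, assuming the module result has the shape seen in the failure output in this thread (a "nodes" list whose entries carry "name", "client_accepted", "server_accepted" and "csrs"). The helper name unapproved_nodes and the sample data are illustrative, not taken from oc_adm_csr.py.

```python
def unapproved_nodes(result):
    """Return names of nodes whose client or server CSR is not yet accepted.

    Hypothetical helper: `result` is assumed to look like the JSON the
    oc_adm_csr task prints on failure.
    """
    return [
        node["name"]
        for node in result.get("nodes", [])
        if not (node.get("client_accepted") and node.get("server_accepted"))
    ]


# Illustrative sample only; the real output contains full CSR objects.
sample = {
    "nodes": [
        {"name": "benchserver7", "client_accepted": True,
         "server_accepted": True, "csrs": {"csr-p5mwc": {}}},
        {"name": "benchserver2", "client_accepted": False,
         "server_accepted": False, "csrs": {}},
    ]
}
print(unapproved_nodes(sample))
```

Running something like this against the -vvv output would show which node is holding up the approval step.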

@vrutkovs I have obfuscated the domain name with .acme.com

Yes, hostname -f matches the value of openshift_master_cluster_hostname.
Further down in hosts the hostname is fully qualified with domain.

@vrutkovs looking back over the Gist attached to this issue, lines 201-207 show the JSON for CSR requests by benchserver2. There are none; it is empty, whereas benchserver7 had 4. No wonder server_accepted=false and client_accepted=false for benchserver2.
In 3.9 only the initial master had CSR signing requests issued.
In 3.10 the CSR signing has been refactored. Currently both infrastructure and compute nodes are having signing requests issued, which I suspect is not what was intended when the change to use all nodes was made. Is that correct, @michaelgugino @vrutkovs?

@whitingjr I have the log at node_approval_failure_log
I just tried the below config (fail_on_timeout: false) in playbooks/openshift-node/private/join.yml and it worked

- name: Approve bootstrap node
  oc_adm_csr:
    nodes: "{{ l_nodes_to_join }}"
    timeout: 60
    fail_on_timeout: false
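
To illustrate what the timeout and fail_on_timeout options control, here is a generic poll-until-deadline sketch. It is not the code of oc_adm_csr.py; wait_for_approval and its parameters are hypothetical, and the real module's logic may differ.

```python
import time


def wait_for_approval(check, timeout=60, interval=1.0, fail_on_timeout=True,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds elapse.

    Hypothetical illustration of timeout / fail_on_timeout semantics:
    on timeout, either raise (fail_on_timeout=True) or return False so
    that the caller can carry on.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            break
        sleep(interval)
    if fail_on_timeout:
        raise RuntimeError("Timed out accepting certificate signing requests.")
    return False


# With fail_on_timeout=False the caller continues even though nothing was approved.
print(wait_for_approval(lambda: False, timeout=0, fail_on_timeout=False))
```

The sketch shows why the workaround only hides the symptom: the wait still ends unsuccessfully, and later plays may fail if the nodes never actually joined.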


I also encountered the same problem

TASK [Approve bootstrap nodes] ******************************************************************************************
fatal: [10.10.244.212]: FAILED! => {"changed": true, "finished": false, "msg": "Timed out accepting certificate signing requests. Failing as requested.", "nodes": [{"client_accepted": true, "csrs": {"csr-4v5sv": {"apiVersion": "certificates.k8s.io/v1beta1", "kind": "CertificateSigningRequest", "metadata": 
{"creationTimestamp": "2018-08-17T06:50:04Z", "generateName": "csr-", "name": "csr-4v5sv", "namespace": "", "resourceVersion": "714", "selfLink": 
"/apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-4v5sv", "uid": "c5219812-a1e9-11e8-8618-525400614a73"}, "spec": {"groups": ["system:masters", "system:cluster-admins", "system:authenticated"], "request": 
"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQkx6Q0IxZ0lCQURBN01SVXdFd1lEVlFRS0V3eHplWE4wWlcwNmJtOWtaWE14SWpBZ0JnTlZCQU1UR1hONQpjM1JsYlRwdWIyUmxPakV3TFRFd0xUSTBOQzB5TVRJd1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBVGxsdlowZkZPak5zbnBjdDEwSEdBMkpjNTY5ZzFTK0Y3NWpOcUowRDkrUXJhaWx2eVIxN0x0T3ViVFp0RVUKQTdkTmViQzErd2dyc2tOaDlzOWVZaEhYb0Rrd053WUpLb1pJaHZjTkFRa09NU293S0RBbUJnTlZIUkVFSHpBZApnZzB4TUMweE1DMHlORFF0TWpFeWdnQ0hCQW9LOU5TSEJLd1JBQUV3Q2dZSUtvWkl6ajBFQXdJRFNBQXdSUUlnCkFRenRhRUhTcWcwMXNmdUVmQWRZcnRnQWhadVpaaUlPL3hBNTZUUHRKbElDSVFEWDRaWjgyNEJwK3hGWm9qaXgKYU1Bd2l1K3RQczhobTdFaVBQYnpVT3BVRUE9PQotLS0tLUVORCBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K", "usages": ["digital signature", "key encipherment", "server auth"], "username": "system:admin"},

TASK [Report approval errors] ******************************************************************************************
fatal: [10.10.244.212]: FAILED! => {"changed": false, "msg": "Node approval failed"}

  1. Hosts:    10.10.244.212
     Play:     Approve any pending CSR requests from inventory nodes
     Task:     Report approval errors
     Message:  Node approval failed

oc get csr             
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-4v5sv                                              12m       system:admin                                              Approved
csr-dt7s6                                              12m       system:admin                                              Approved
csr-zr8k5                                              6m        system:admin                                              Approved
node-csr-BFTK_sNDdRTqFpukK3a-t_Dw6utL0JgPqDloWm3xbdg   6m        system:serviceaccount:openshift-infra:node-bootstrapper   Approved
node-csr-iBMOx7LAmvdYUdE1cclIdg9ZafQUNkD7QW9uSkGthg8   6m        system:serviceaccount:openshift-infra:node-bootstrapper   Approved
node-csr-yGjHFFSyx3Js5SyW9Xu0r5VMCK3btSazgqRvB5ggy-8   6m        system:serviceaccount:openshift-infra:node-bootstrapper   Approved

oc describe csr csr-4v5sv                                                                                                           
Name:               csr-4v5sv
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Fri, 17 Aug 2018 14:50:04 +0800
Requesting User:    system:admin
Status:             Approved
Subject:
  Common Name:    system:node:10-10-244-212
  Serial Number:  
  Organization:   system:nodes
Subject Alternative Names:
         DNS Names:     10-10-244-212

         IP Addresses:  10.10.244.212
                        172.17.0.1

@kmurthy1 thanks for sharing that workaround to disable the fail_on_timeout. It worked for me to progress beyond this issue.
@aland-zhang I suggest you try switching the boolean to false and having another go.

While there is a workaround, the underlying issue of unnecessary attempts at CSR requests for compute nodes is still there, if my understanding of the issue is correct.

@vrutkovs Is this issue fixed upstream?

For what it's worth I'm experiencing the same, I'm able to reliably reproduce the Timed out accepting certificate signing requests. Failing as requested.

I attempted the work-around suggested, with:

sed -i -e 's/fail_on_timeout: true/fail_on_timeout: false/' playbooks/openshift-node/private/join.yml

While this does make it so the plays continue after that failing play, I'm then faced with the play openshift_manage_node : Wait for Node Registration failing, and wind up with oc get nodes showing only the master available, no other nodes in the cluster.

The openshift-ansible version I have locally (pulled this morning)...

# [root@droctagon3 openshift-ansible]# git rev-parse HEAD
# 7a136e99c33927a00f2f3a58b2de5e170e880252

The inventory I'm using follows:

threeten-infra.test.example.com ansible_host=192.168.1.19
threeten-master.test.example.com ansible_host=192.168.1.43
threeten-node1.test.example.com ansible_host=192.168.1.197

[masters]
threeten-master.test.example.com

[etcd]
threeten-master.test.example.com

[nodes]
threeten-master.test.example.com openshift_node_group_name="node-config-master"
threeten-infra.test.example.com openshift_node_group_name="node-config-infra"
threeten-node1.test.example.com openshift_node_group_name="node-config-compute"

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
openshift_release="3.10"
openshift_install_examples=false
openshift_deployment_type=origin
openshift_master_default_subdomain=apps.test.example.com
openshift_master_cluster_hostname=threeten-master.test.example.com
openshift_disable_check=disk_availability,memory_availability
openshift_enable_docker_excluder=False
debug_level=2
ansible_ssh_user=centos
ansible_become=yes
ansible_ssh_private_key_file=/root/.ssh/id_vm_rsa

All, thank you for the detailed failure reports. I am in the process of creating a custom module to deal with this csr signing issue here: https://github.com/openshift/openshift-ansible/pull/9711

We plan to backport to 3.10 as soon as it's ready, hopefully in the next day or so.

For what it's worth, I realized that one issue I had on my deployment was that my nodes weren't able to resolve the DNS names of the other nodes/master in the cluster. In the past, this would have been caught by the prerequisites playbook, but this time it wasn't.

In my case, this is in a lab, and I have an /etc/hosts that has the DNS names and IPs of each node in the cluster. Once that was in place, this play succeeded.
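
A quick resolution check along these lines can be sketched in Python. The helper unresolvable is hypothetical; by default it uses socket.gethostbyname, which consults /etc/hosts and DNS through the system resolver. The demonstration below substitutes a stub resolver built from the inventory in this thread so that it runs without network access.

```python
import socket


def unresolvable(hosts, resolve=socket.gethostbyname):
    """Return the hosts that `resolve` cannot turn into an address.

    Hypothetical helper; socket.gaierror is a subclass of OSError, so
    catching OSError covers failed lookups.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:
            failed.append(host)
    return failed


# Stub resolver for the demo; drop the `resolve` argument to use real lookups.
known = {
    "threeten-master.test.example.com": "192.168.1.43",
    "threeten-infra.test.example.com": "192.168.1.19",
}


def fake_resolve(host):
    if host not in known:
        raise socket.gaierror("name not known: " + host)
    return known[host]


cluster = [
    "threeten-master.test.example.com",
    "threeten-infra.test.example.com",
    "threeten-node1.test.example.com",
]
print(unresolvable(cluster, resolve=fake_resolve))
```

Running the real-lookup variant on every node before the install would surface the missing /etc/hosts or DNS entries described above.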

@michaelgugino the changes solved the problem. Doing an install using 3.10 progressed beyond this issue. Happy to consider this issue solved for me.

@whitingjr thanks for the update!

Closed by: https://github.com/openshift/openshift-ansible/pull/9711

@whitingjr I have the log at node_approval_failure_log
I just tried the below config (fail_on_timeout: false) in playbooks/openshift-node/private/join.yml and it worked

- name: Approve bootstrap node
  oc_adm_csr:
    nodes: "{{ l_nodes_to_join }}"
    timeout: 60
    fail_on_timeout: false

I tried the same as you and it didn't work for me.

There have been a lot of changes in this area. If anyone is still having issues, I recommend using the latest commit of your desired release branch to see if it has been fixed in your case.

If not, please open a new bugzilla or github issue.

This issue may be related to whether the pyOpenSSL package is installed. You can check that first with rpm -aq | grep pyOpenSSL.
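
As an alternative to the rpm query, the same check can be sketched from Python itself. pyOpenSSL is imported under the module name OpenSSL; has_module is a hypothetical helper, and the result depends on which Python interpreter runs it, so it should be the one Ansible uses on that host.

```python
def has_module(name):
    """Return True if the named top-level module can be imported.

    Hypothetical helper; works on both Python 2.7 and Python 3.
    """
    try:
        __import__(name)
    except ImportError:
        return False
    return True


# pyOpenSSL installs the importable package `OpenSSL`.
print("pyOpenSSL importable:", has_module("OpenSSL"))
```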
