An OpenShift 3.10 cluster installation fails when attempting to accept certificate signing requests. The oc_adm_csr.py module times out after 60 seconds.
Four certificates needed to be signed. All of them passed, but signing took 67 seconds to complete.
# ansible --version
ansible 2.4.4.0
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5 (default, May 31 2018, 09:41:32) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
If you're operating from a git clone:
release-3.10 branch
$ git describe
openshift-ansible-3.10.27-2-69-gd96b19f2a
The installation completes successfully.
"Timed out accepting certificate signing requests. Failing as requested."
INSTALLER STATUS ***************************************************************
Initialization : Complete (0:00:09)
Health Check : Complete (0:02:48)
Node Bootstrap Preparation : Complete (0:00:01)
etcd Install : Complete (0:00:22)
Master Install : Complete (0:01:29)
Master Additional Install : Complete (0:00:48)
Node Join : In Progress (0:01:10)
Failure summary:
1. Hosts: benchserver7.acme.com
Play: Approve any pending CSR requests from inventory nodes
Task: Report approval errors
Message: Node approval failed
Detailed -vvv logging of Ansible script. All the gory details are here.
# uname -a
Linux benchserver7 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/ansible/hosts
# This is the default ansible 'hosts' file.
[OSEv3:children]
masters
nodes
etcd
[OSEv3:vars]
containerized=false
openshift_deployment_type=openshift-enterprise
debug_level=0
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true',]}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true'], 'edits': [{ 'key': 'kubeletArguments.pods-per-core','value': ['20']}]}]
openshift_master_cluster_hostname=benchserver7
ansible_ssh_user=root
openshift_enable_service_catalog=false
disk_availability=false
openshift_disable_check=memory_availability,disk_availability
[masters]
benchserver7.acme.com
[etcd]
benchserver7.acme.com
[nodes]
benchserver7.acme.com openshift_node_group_name='node-config-master'
#benchserver5.acme.com openshift_node_group_name='node-config-infra'
benchserver2.acme.com openshift_node_group_name='node-config-compute'
#
I've seen this happen a few times as well. It looks like random API server hangups.
@mfojtik any ideas how to debug timeouts on api server requests?
I've restarted the installation from scratch, setting the timeout to 120.
It timed out again, this time at 133 seconds. Checking the CSRs:
# oc get csr
NAME AGE REQUESTOR CONDITION
csr-ll5xn 5m system:admin Approved,Issued
csr-p5mwc 3m system:node:benchserver7 Approved,Issued
csr-qj5wc 5m system:admin Approved,Issued
csr-xdbw2 2m system:node:benchserver7 Approved,Issued
Is the Python script oc_adm_csr.py functioning correctly?
@vrutkovs this isn't random for me. It is repeatable. Every time.
The weird part is that hosts are named benchserver7.acme.com, but CSR is created for benchserver7 node.
Does hostname -f on the host match the hostnames in the Ansible inventory? What's the output of oc get nodes and oc describe csr csr-p5mwc?
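The short-name versus fully-qualified-name mismatch raised here can be checked with a small sketch. It assumes the REQUESTOR column has the form system:node:<name>, as in the oc get csr output above, and it assumes an exact string comparison between inventory names and CSR names; whether the installer compares names that way is not confirmed in this thread. The function names csr_node_names and unmatched_inventory_hosts are hypothetical.

```python
def csr_node_names(oc_get_csr_output):
    """Extract node names from the REQUESTOR column of `oc get csr` text output."""
    names = set()
    for line in oc_get_csr_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Assumption: node CSRs carry a requestor of the form system:node:<name>
        if len(fields) >= 3 and fields[2].startswith("system:node:"):
            names.add(fields[2][len("system:node:"):])
    return names


def unmatched_inventory_hosts(inventory_hosts, csr_names):
    """Return inventory hosts that have no CSR under their exact name."""
    return [host for host in inventory_hosts if host not in csr_names]
```

Fed the four-row output above and the two inventory names from this report, the sketch would flag both benchserver7.acme.com and benchserver2.acme.com, because the only node name found in the CSRs is the short name benchserver7.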
I am hitting the same issue with 3.10 openshift-enterprise. The CSRs are approved and hostname -f is correct, but Ansible fails with "Node approval failed".
It looks like it fails because of
"server_accepted": false, "csrs": {}, "client_accepted": false, "name": "benchserver2",
Until both flags are true for every node, the task will time out.
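The behaviour described in this comment can be illustrated with a minimal sketch. It is not the code from oc_adm_csr.py; it only assumes that each node result carries the client_accepted and server_accepted fields seen in the output above. The names all_accepted and wait_for_approval and the poll callback are hypothetical.

```python
import time


def all_accepted(nodes):
    """Return True only when every node has both CSR types accepted."""
    return all(n["client_accepted"] and n["server_accepted"] for n in nodes)


def wait_for_approval(poll, timeout, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Call poll() until all nodes are accepted or `timeout` seconds elapse.

    `poll` is a hypothetical callback returning the current list of node
    results. Returns a (finished, nodes) tuple.
    """
    deadline = clock() + timeout
    while True:
        nodes = poll()
        if all_accepted(nodes):
            return True, nodes
        if clock() >= deadline:
            return False, nodes  # at least one node never reached both flags
        sleep(interval)
```

Under this logic, a node such as benchserver2 with an empty csrs map and both flags false keeps the loop running until the deadline, regardless of how many CSRs were approved for other nodes.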
@vrutkovs I have obfuscated the domain name with .acme.com.
Yes, hostname -f matches the value for openshift_master_cluster_hostname.
Further down in the hosts file, the hostname is fully qualified with the domain.
@vrutkovs looking back over the Gist attached to this issue, lines 201-207 show the JSON for CSR requests by benchserver2. There are none; it is empty, whereas benchserver7 had 4. No wonder server_accepted=false and client_accepted=false for benchserver2.
In 3.9 only the initial master had CSR signing requests issued.
In 3.10 the CSR signing has been refactored. Currently both infrastructure and compute nodes are having signing requests issued, which I suspect is not what was intended when the change to use all nodes was made. Is that correct, @michaelgugino @vrutkovs?
@whitingjr I have the log at node_approval_failure_log
I just tried the config below (fail_on_timeout: false) in playbooks/openshift-node/private/join.yml and it worked:
- name: Approve bootstrap node
  oc_adm_csr:
    nodes: "{{ l_nodes_to_join }}"
    timeout: 60
    fail_on_timeout: false
I also encountered the same problem
TASK [Approve bootstrap nodes] ******************************************************************************************
fatal: [10.10.244.212]: FAILED! => {"changed": true, "finished": false, "msg": "Timed out accepting certificate signing requests. Failing as requested.", "nodes": [{"client_accepted": true, "csrs": {"csr-4v5sv": {"apiVersion": "certificates.k8s.io/v1beta1", "kind": "CertificateSigningRequest", "metadata":
{"creationTimestamp": "2018-08-17T06:50:04Z", "generateName": "csr-", "name": "csr-4v5sv", "namespace": "", "resourceVersion": "714", "selfLink":
"/apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-4v5sv", "uid": "c5219812-a1e9-11e8-8618-525400614a73"}, "spec": {"groups": ["system:masters", "system:cluster-admins", "system:authenticated"], "request":
"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQkx6Q0IxZ0lCQURBN01SVXdFd1lEVlFRS0V3eHplWE4wWlcwNmJtOWtaWE14SWpBZ0JnTlZCQU1UR1hONQpjM1JsYlRwdWIyUmxPakV3TFRFd0xUSTBOQzB5TVRJd1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBVGxsdlowZkZPak5zbnBjdDEwSEdBMkpjNTY5ZzFTK0Y3NWpOcUowRDkrUXJhaWx2eVIxN0x0T3ViVFp0RVUKQTdkTmViQzErd2dyc2tOaDlzOWVZaEhYb0Rrd053WUpLb1pJaHZjTkFRa09NU293S0RBbUJnTlZIUkVFSHpBZApnZzB4TUMweE1DMHlORFF0TWpFeWdnQ0hCQW9LOU5TSEJLd1JBQUV3Q2dZSUtvWkl6ajBFQXdJRFNBQXdSUUlnCkFRenRhRUhTcWcwMXNmdUVmQWRZcnRnQWhadVpaaUlPL3hBNTZUUHRKbElDSVFEWDRaWjgyNEJwK3hGWm9qaXgKYU1Bd2l1K3RQczhobTdFaVBQYnpVT3BVRUE9PQotLS0tLUVORCBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K", "usages": ["digital signature", "key encipherment", "server auth"], "username": "system:admin"},
TASK [Report approval errors] ******************************************************************************************
fatal: [10.10.244.212]: FAILED! => {"changed": false, "msg": "Node approval failed"}
1. Hosts: 10.10.244.212
Play: Approve any pending CSR requests from inventory nodes
Task: Report approval errors
Message: Node approval failed
oc get csr
NAME AGE REQUESTOR CONDITION
csr-4v5sv 12m system:admin Approved
csr-dt7s6 12m system:admin Approved
csr-zr8k5 6m system:admin Approved
node-csr-BFTK_sNDdRTqFpukK3a-t_Dw6utL0JgPqDloWm3xbdg 6m system:serviceaccount:openshift-infra:node-bootstrapper Approved
node-csr-iBMOx7LAmvdYUdE1cclIdg9ZafQUNkD7QW9uSkGthg8 6m system:serviceaccount:openshift-infra:node-bootstrapper Approved
node-csr-yGjHFFSyx3Js5SyW9Xu0r5VMCK3btSazgqRvB5ggy-8 6m system:serviceaccount:openshift-infra:node-bootstrapper Approved
oc describe csr csr-4v5sv
Name: csr-4v5sv
Labels: <none>
Annotations: <none>
CreationTimestamp: Fri, 17 Aug 2018 14:50:04 +0800
Requesting User: system:admin
Status: Approved
Subject:
Common Name: system:node:10-10-244-212
Serial Number:
Organization: system:nodes
Subject Alternative Names:
DNS Names: 10-10-244-212
IP Addresses: 10.10.244.212
172.17.0.1
@kmurthy1 thanks for sharing that workaround to disable the fail_on_timeout. It worked for me to progress beyond this issue.
@aland-zhang I suggest you try switching the boolean to false and having another go.
While there is a workaround, the underlying issue of unnecessary CSR attempts for compute nodes is still there, if my understanding of the issue is correct.
@vrutkovs Is this issue fixed upstream?
For what it's worth, I'm experiencing the same. I'm able to reliably reproduce the "Timed out accepting certificate signing requests. Failing as requested." error.
I attempted the suggested workaround with:
sed -i -e 's/fail_on_timeout: true/fail_on_timeout: false/' playbooks/openshift-node/private/join.yml
While this lets the plays continue past the failing play, I then hit a failure in the play "openshift_manage_node : Wait for Node Registration", and oc get nodes ends up showing only the master, with no other nodes in the cluster.
The openshift-ansible version I have locally (pulled this morning)...
# [root@droctagon3 openshift-ansible]# git rev-parse HEAD
# 7a136e99c33927a00f2f3a58b2de5e170e880252
The inventory I'm using follows:
threeten-infra.test.example.com ansible_host=192.168.1.19
threeten-master.test.example.com ansible_host=192.168.1.43
threeten-node1.test.example.com ansible_host=192.168.1.197
[masters]
threeten-master.test.example.com
[etcd]
threeten-master.test.example.com
[nodes]
threeten-master.test.example.com openshift_node_group_name="node-config-master"
threeten-infra.test.example.com openshift_node_group_name="node-config-infra"
threeten-node1.test.example.com openshift_node_group_name="node-config-compute"
[OSEv3:children]
masters
nodes
etcd
[OSEv3:vars]
openshift_release="3.10"
openshift_install_examples=false
openshift_deployment_type=origin
openshift_master_default_subdomain=apps.test.example.com
openshift_master_cluster_hostname=threeten-master.test.example.com
openshift_disable_check=disk_availability,memory_availability
openshift_enable_docker_excluder=False
debug_level=2
ansible_ssh_user=centos
ansible_become=yes
ansible_ssh_private_key_file=/root/.ssh/id_vm_rsa
All, thank you for the detailed failure reports. I am in the process of creating a custom module to deal with this csr signing issue here: https://github.com/openshift/openshift-ansible/pull/9711
We plan to backport to 3.10 as soon as it's ready, hopefully in the next day or so.
For what it's worth, I realized that one issue in my deployment was that my nodes weren't able to resolve the DNS names of the other nodes and the master in the cluster. In the past this would have been caught by the prerequisites playbook, but this time it wasn't.
In my case this is a lab, so I added an /etc/hosts file with the DNS names and IPs of each node in the cluster. Once that was in place, this play succeeded.
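A name-resolution check of the kind described here could look like the sketch below, run on each node against the inventory hostnames. It is not part of openshift-ansible, and the function name unresolvable_hosts is hypothetical. It relies only on the standard library call socket.gethostbyname, which also consults /etc/hosts on typical Linux resolver configurations.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that fail to resolve from this machine."""
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Hostnames taken from the inventory in this report
    inventory = [
        "threeten-master.test.example.com",
        "threeten-infra.test.example.com",
        "threeten-node1.test.example.com",
    ]
    for name in unresolvable_hosts(inventory):
        print("cannot resolve:", name)
```

The resolve parameter exists only so the check can be exercised without real DNS; in normal use the default is enough.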
@michaelgugino the changes solved the problem. Doing an install using 3.10 progressed beyond this issue. Happy to consider this issue solved for me.
@whitingjr thanks for the update!
Closed by: https://github.com/openshift/openshift-ansible/pull/9711
@whitingjr I have the log at node_approval_failure_log
I just tried the below config (fail_on_timeout: false) in playbooks/openshift-node/private/join.yml and it worked
- name: Approve bootstrap node
  oc_adm_csr:
    nodes: "{{ l_nodes_to_join }}"
    timeout: 60
    fail_on_timeout: false
I tried the same as you and it didn't work for me.
There have been a lot of changes in this area. If anyone is still having issues, I recommend using the latest of your desired release branch to see if it has been fixed in your case.
If not, please open a new Bugzilla or GitHub issue.
This issue may be related to whether the pyOpenSSL package is installed. You can check that first with rpm -aq|grep pyOpenSSL.