Openshift-ansible: Approve node certificates when bootstrapping - 3.11

Created on 22 Mar 2019 · 10 Comments · Source: openshift/openshift-ansible

Description

On a single-master, multi-worker install, the installation fails with:

PLAY RECAP **********************************************************************************************************************************************************************
k8s-ctl01.dev.mydomain.com     : ok=463  changed=85   unreachable=0    failed=1   
k8s-worker01.dev.mydomain.com  : ok=99   changed=11   unreachable=0    failed=0   
k8s-worker02.dev.mydomain.com  : ok=98   changed=11   unreachable=0    failed=0   
k8s-worker03.dev.mydomain.com  : ok=98   changed=11   unreachable=0    failed=0   
localhost                  : ok=11   changed=0    unreachable=0    failed=0   


INSTALLER STATUS ****************************************************************************************************************************************************************
Initialization              : Complete (0:00:21)
Health Check                : Complete (0:00:12)
Node Bootstrap Preparation  : Complete (0:01:05)
etcd Install                : Complete (0:00:30)
Master Install              : Complete (0:02:34)
Master Additional Install   : Complete (0:00:31)
Node Join                   : In Progress (0:03:01)
        This phase can be restarted by running: playbooks/openshift-node/join.yml


Failure summary:


  1. Hosts:    k8s-ctl01.dev.mydomain.com
     Play:     Approve any pending CSR requests from inventory nodes
     Task:     Approve node certificates when bootstrapping
     Message:  Could not find csr for nodes: k8s-worker03.dev.mydomain.com, k8s-worker02.dev.mydomain.com, k8s-worker01.dev.mydomain.com

Version

Ansible Version

ansible 2.5.7
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]

The output of git describe

openshift-ansible-3.11.98-1-4-g44190fcda

Steps To Reproduce

Clone this repo.
Using the inventory below, run the prereq and deploy playbooks.

Expected Results

Installation to complete successfully.

Observed Results

On the Master node, I see the following.

oc get node k8s-worker03.dev.mydomain.com results in:

No resources found.
Error from server (NotFound): nodes "k8s-worker03.dev.mydomain.com" not found

oc get node -w
NAME                     STATUS    ROLES                  AGE       VERSION
k8s-ctl01.dev.mydomain.com   Ready     compute,infra,master   2h        v1.11.0+d4cacc0

Ansible verbose output

Please see https://gist.github.com/magick93/814aaa5f825c5021c2d031d8b7944c28

Additional Information

Operating system and version: CentOS Linux release 7.6.1810 (Core)
Your inventory file

[OSEv3:children]
masters
etcd
nodes 


[masters]
k8s-ctl01.dev.mydomain.com  openshift_node_group_name='node-config-master-infra'
#openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=false']}]



[etcd]
k8s-ctl01.dev.mydomain.com openshift_node_group_name='node-config-master-infra'

[nodes]
k8s-worker01.dev.mydomain.com   openshift_schedulable=true openshift_node_group_name="node-config-compute"
k8s-worker02.dev.mydomain.com   openshift_schedulable=true openshift_node_group_name="node-config-compute"
k8s-worker03.dev.mydomain.com   openshift_schedulable=true openshift_node_group_name="node-config-compute"
k8s-ctl01.dev.mydomain.com openshift_node_group_name='node-config-master-infra'



[OSEv3:vars]
ansible_ssh_user=deploy
ansible_sudo=true
ansible_become=true
enable_excluders=False
enable_docker_excluder=False
ansible_service_broker_install=True

containerized=True



#os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability

#openshift_node_kubelet_args={'pods-per-core': ['10']}

#deployment_type=origin
openshift_deployment_type=origin



openshift_additional_repos=[{'id': 'centos-okd-ci', 'name': 'centos-okd-ci', 'baseurl' :'http://buildlogs.centos.org/centos/7/paas/x86_64/openshift-origin311/', 'gpgcheck' :'0', 'enabled' :'1'}]




openshift_disable_check=package_version
openshift_disable_check=package_version
openshift_disable_check=docker_image_availability,docker_storage

openshift_release="v3.11.0"
openshift_image_tag="v3.11.0"
openshift_pkg_version="-3.11.0"
openshift_install_examples=true

openshift_management_install_management=True

#openshift_web_console_nodeselector={'region':'nodes'}




template_service_broker_selector={"region":"infra"}
openshift_metrics_image_version="v3.11.0"
openshift_logging_image_version="v3.11.0"
openshift_logging_elasticsearch_proxy_image_version="v1.0.0"
logging_elasticsearch_rollout_override=false
osm_use_cockpit=true

# Install CloudForms/ManagementIQ
openshift_management_install_management=false
# We have a default storage class that will take care of everything
openshift_management_storage_class=preconfigured

# Prometheus
#openshift_prometheus_namespace=openshift-metrics
#openshift_prometheus_node_selector={"region":"nodes"}
#openshift_prometheus_storage_kind=glusterfs 
#openshift_prometheus_storage_type=pvc
#openshift_prometheus_alertbuffer_storage_type=pvc
#openshift_prometheus_alertmanager_storage_type=pvc

# logging
openshift_logging_install_logging=false                          
openshift_logging_es_cluster_size=3  
openshift_logging_es_nodeselector={"region":"nodes"}             
openshift_logging_kibana_nodeselector={"region":"nodes"}
openshift_logging_curator_nodeselector={"region":"nodes"}
openshift_logging_fluentd_nodeselector={"region":"nodes"}
#openshift_logging_storage_kind=pvc

openshift_metrics_hawkular_hostname=metrics.k8s-ctl01.dev.mydomain.com

openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind':'HTPasswdPasswordIdentityProvider'}]
#openshift_master_htpasswd_file='/etc/origin/master/htpasswd'

openshift_public_hostname=console.openshift.dev.mydomain.com
openshift_master_default_subdomain=k8s-ctl01.dev.mydomain.com
openshift_master_api_port=8443
openshift_master_console_port=8443

#[nfs]
#k8s-ctl01.dev.mydomain.com openshift_schedulable=true

Additional

I have checked and confirmed that on each worker node hostname and hostname -f return the same value, e.g.:

hostname -f
k8s-worker03.dev.mydomain.com
[deploy@k8s-worker03 ~]$ hostname
k8s-worker03.dev.mydomain.com
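
The same comparison can be scripted so it is easy to repeat on every node. This is only a minimal sketch: the function name names_match is hypothetical, and it compares two strings and nothing more, so it does not tell you whether DNS or /etc/hosts resolves the name correctly.

```shell
# Hypothetical helper: succeed only when the short hostname and the FQDN agree.
names_match() {
  short_name="$1"
  fqdn="$2"
  if [ "$short_name" = "$fqdn" ]; then
    echo "OK: hostname and hostname -f both report $fqdn"
  else
    echo "MISMATCH: hostname=$short_name, hostname -f=$fqdn" >&2
    return 1
  fi
}

# Usage (on each worker): names_match "$(hostname)" "$(hostname -f)"
```

A non-zero exit status marks a node where the two commands disagree.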

oc get csr on master:

NAME        AGE       REQUESTOR      CONDITION
csr-2swlc   28m       system:admin   Pending
csr-5swf5   1h        system:admin   Pending
csr-5tdl7   1h        system:admin   Pending
...
csr-xkq9n   1h        system:admin   Pending
csr-xtk8j   1h        system:admin   Pending

Nodes not joined

On the master node, when I run oc get node -w, I see only the master node and no others, e.g.:

k8s-ctl01.dev.mydomain.com   Ready     compute,infra,master   3h        v1.11.0+d4cacc0
k8s-ctl01.dev.mydomain.com   Ready     compute,infra,master   3h        v1.11.0+d4cacc0

Manually approving certs

I tried manually approving certs on the master, e.g. oc adm certificate approve csr-v85dj, but this also didn't make any difference.
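
For approving every pending request rather than one at a time, something like the following could be used. It is a sketch under assumptions: pending_csrs is a hypothetical helper name, and it relies on the oc get csr table layout shown above (name in the first column, condition in the last). Because a bootstrapping node requests a client certificate first and a server certificate afterwards, new pending requests may appear after the first round of approvals.

```shell
# Hypothetical helper: print the names of CSRs whose CONDITION column is "Pending".
# Reads the table printed by `oc get csr` on stdin and skips the header line.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage (on the master; xargs -r is the GNU "skip if empty" flag):
#   oc get csr | pending_csrs | xargs -r -n 1 oc adm certificate approve
```

If the nodes still do not join after approval, as happened here, the cause is more likely connectivity or hostname resolution than the approval step itself.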

Source

magick93

All 10 comments

I managed to resolve this - it was a firewall issue.

The firewall requirements are listed on https://docs.openshift.com/container-platform/3.11/install/prerequisites.html
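
A quick way to test a firewall rule from a worker is to try opening a TCP connection to the master. The sketch below is an illustration only: port_open is a hypothetical helper, it needs bash and the coreutils timeout command, and it covers TCP only, so any UDP ports on the prerequisites page need a different check. Port 8443 is taken from openshift_master_api_port in the inventory above; the remaining ports should come from the linked page.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port opens within 3 seconds.
# Uses bash's /dev/tcp pseudo-device; host and port are passed to the inner shell as $0 and $1.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage (from a worker):
#   port_open k8s-ctl01.dev.mydomain.com 8443 && echo reachable || echo blocked
```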

magick93 on 22 Mar 2019

👍1

Check your hostname as well, using the hostnamectl command.

arocki7 on 16 Jul 2019

👍3

I am observing this issue while adding a new node to the cluster.
Installation worked fine for me with a single master and a single worker. I am now trying to add a new node to the cluster, but every time it fails at the same step:
TASK [Approve node certificates when bootstrapping]
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
.
.
.
FAILED - RETRYING: Approve node certificates when bootstrapping (0 retries left).

fatal: [172.16.0.93]: FAILED! => {"all_subjects_found": ["subject=/O=system:nodes/CN=system:node:okd-new\n", "subject=/O=system:nodes/CN=system:node:okd-new\n", "subject=/O=system:nodes/CN=system:node:okd-new\n", "subject=/O=system:nodes/CN=system:node:okd-new\n"], "attempts": 30, "changed": false, "client_approve_results": [], "client_csrs": {}, "msg": "Could not find csr for nodes: okd-new", . . . ....

I tried the solution given by @magick93, i.e. adding os_firewall_use_firewalld=True to the inventory. I also added openshift_master_bootstrap_auto_approve=True, but I get the same result every time.

harshalkwagh on 29 Jul 2019

Same issue for me. I tried both suggestions from @magick93 and @harshalkwagh but ended up with the same result.

damora on 28 Aug 2019

Same here: blocked on scaleup.yml due to failed certificates.

aijanai on 29 Aug 2019

I'm once again running into this issue, and the previous solutions aren't working.

magick93 on 4 Sep 2019

A partial workaround, related to checking whether hostname and hostname -f return the same value, was to ensure that no stray alias existed in /etc/hosts on any server.

In my case, we got past it by sanitizing the hosts file everywhere. Ansible seemed to take that file into account when determining the FQDN, even when a DNS entry already existed and the hostname had been set explicitly with the hostnamectl utility.
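
To spot such aliases, one could list every hosts-file line that mentions a given name. This is a minimal sketch, not a definitive check: hosts_entries_for is a hypothetical helper, and what counts as a bad entry depends on your environment.

```shell
# Hypothetical helper: print every non-comment hosts-file line that maps an
# address to the given name, either as the canonical name or as an alias.
# The second argument defaults to /etc/hosts.
hosts_entries_for() {
  name="$1"
  file="${2:-/etc/hosts}"
  awk -v n="$name" '
    { sub(/#.*/, "") }                 # strip comments before matching
    NF >= 2 {
      for (i = 2; i <= NF; i++)
        if ($i == n) { print; break }
    }
  ' "$file"
}

# Usage (on each server):
#   hosts_entries_for "$(hostname -f)"
#   hosts_entries_for "$(hostname -s)"
```

More than one matching line, or a match on a loopback address, is worth a closer look.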

aijanai on 4 Sep 2019

👍1

The problem, in my case, was that the hosts files on the nodes were incorrect.

The solution was to use DNS.

magick93 on 5 Sep 2019

🎉2

I had the same problem, but what fixed this issue for me was: (i) reboot all the nodes (or VMs), (ii) reconfigure the certificates, and (iii) try again to join the openshift-nodes, as follows:
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml

ansible-playbook -i inventory -e openshift_master_bootstrap_auto_approve=true /usr/share/ansible/openshift-ansible/playbooks/openshift-node/join.yml

Although reconfiguring the certificates produces a new error, it fixed the previous "Approve node certificates" failure.
Therefore, it seems you can ignore this new error:
TASK [Verify that the console is running] ****************************************************************************
fatal: [master-0]: FAILED! => {"msg": "The conditional check 'console_deployment.module_results.results[0].status.readyReplicas is defined' failed. The error was: error while evaluating conditional (console_deployment.module_results.results[0].status.readyReplicas is defined): 'dict object' has no attribute 'module_results'"}

marceloamaral on 2 Mar 2020

I got rid of the issue by making hostname and hostname -f return the same value. Initially, hostname -f returned the FQDN. By editing /etc/hosts, I made it return the short name, which matches the hostname value.