On a single-master, multi-worker install, the installation fails with:
PLAY RECAP **********************************************************************************************************************************************************************
k8s-ctl01.dev.mydomain.com : ok=463 changed=85 unreachable=0 failed=1
k8s-worker01.dev.mydomain.com : ok=99 changed=11 unreachable=0 failed=0
k8s-worker02.dev.mydomain.com : ok=98 changed=11 unreachable=0 failed=0
k8s-worker03.dev.mydomain.com : ok=98 changed=11 unreachable=0 failed=0
localhost : ok=11 changed=0 unreachable=0 failed=0
INSTALLER STATUS ****************************************************************************************************************************************************************
Initialization : Complete (0:00:21)
Health Check : Complete (0:00:12)
Node Bootstrap Preparation : Complete (0:01:05)
etcd Install : Complete (0:00:30)
Master Install : Complete (0:02:34)
Master Additional Install : Complete (0:00:31)
Node Join : In Progress (0:03:01)
This phase can be restarted by running: playbooks/openshift-node/join.yml
Failure summary:
1. Hosts: k8s-ctl01.dev.mydomain.com
Play: Approve any pending CSR requests from inventory nodes
Task: Approve node certificates when bootstrapping
Message: Could not find csr for nodes: k8s-worker03.dev.mydomain.com, k8s-worker02.dev.mydomain.com, k8s-worker01.dev.mydomain.com
Ansible Version
ansible 2.5.7
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/home/user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
git describe: openshift-ansible-3.11.98-1-4-g44190fcda
Installation to complete successfully.
On the Master node, I see the following.
oc get node k8s-worker03.dev.mydomain.com results in:
No resources found.
Error from server (NotFound): nodes "k8s-worker03.dev.mydomain.com" not found
oc get node -w
NAME STATUS ROLES AGE VERSION
k8s-ctl01.dev.mydomain.com Ready compute,infra,master 2h v1.11.0+d4cacc0
Please see https://gist.github.com/magick93/814aaa5f825c5021c2d031d8b7944c28
CentOS Linux release 7.6.1810 (Core)
[OSEv3:children]
masters
etcd
nodes
[masters]
k8s-ctl01.dev.mydomain.com openshift_node_group_name='node-config-master-infra'
#openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=false']}]
[etcd]
k8s-ctl01.dev.mydomain.com openshift_node_group_name='node-config-master-infra'
[nodes]
k8s-worker01.dev.mydomain.com openshift_schedulable=true openshift_node_group_name="node-config-compute"
k8s-worker02.dev.mydomain.com openshift_schedulable=true openshift_node_group_name="node-config-compute"
k8s-worker03.dev.mydomain.com openshift_schedulable=true openshift_node_group_name="node-config-compute"
k8s-ctl01.dev.mydomain.com openshift_node_group_name='node-config-master-infra'
[OSEv3:vars]
ansible_ssh_user=deploy
ansible_sudo=true
ansible_become=true
enable_excluders=False
enable_docker_excluder=False
ansible_service_broker_install=True
containerized=True
#os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability
#openshift_node_kubelet_args={'pods-per-core': ['10']}
#deployment_type=origin
openshift_deployment_type=origin
openshift_additional_repos=[{'id': 'centos-okd-ci', 'name': 'centos-okd-ci', 'baseurl' :'http://buildlogs.centos.org/centos/7/paas/x86_64/openshift-origin311/', 'gpgcheck' :'0', 'enabled' :'1'}]
openshift_disable_check=package_version
openshift_disable_check=package_version
openshift_disable_check=docker_image_availability,docker_storage
openshift_release="v3.11.0"
openshift_image_tag="v3.11.0"
openshift_pkg_version="-3.11.0"
openshift_install_examples=true
openshift_management_install_management=True
#openshift_web_console_nodeselector={'region':'nodes'}
template_service_broker_selector={"region":"infra"}
openshift_metrics_image_version="v3.11.0"
openshift_logging_image_version="v3.11.0"
openshift_logging_elasticsearch_proxy_image_version="v1.0.0"
logging_elasticsearch_rollout_override=false
osm_use_cockpit=true
# Install CloudForms/ManagementIQ
openshift_management_install_management=false
# We have a default storage class that will take care of everything
openshift_management_storage_class=preconfigured
# Prometheus
#openshift_prometheus_namespace=openshift-metrics
#openshift_prometheus_node_selector={"region":"nodes"}
#openshift_prometheus_storage_kind=glusterfs
#openshift_prometheus_storage_type=pvc
#openshift_prometheus_alertbuffer_storage_type=pvc
#openshift_prometheus_alertmanager_storage_type=pvc
# logging
openshift_logging_install_logging=false
openshift_logging_es_cluster_size=3
openshift_logging_es_nodeselector={"region":"nodes"}
openshift_logging_kibana_nodeselector={"region":"nodes"}
openshift_logging_curator_nodeselector={"region":"nodes"}
openshift_logging_fluentd_nodeselector={"region":"nodes"}
#openshift_logging_storage_kind=pvc
openshift_metrics_hawkular_hostname=metrics.k8s-ctl01.dev.mydomain.com
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind':'HTPasswdPasswordIdentityProvider'}]
#openshift_master_htpasswd_file='/etc/origin/master/htpasswd'
openshift_public_hostname=console.openshift.dev.mydomain.com
openshift_master_default_subdomain=k8s-ctl01.dev.mydomain.com
openshift_master_api_port=8443
openshift_master_console_port=8443
#[nfs]
#k8s-ctl01.dev.mydomain.com openshift_schedulable=true
I have checked and confirmed that on each worker node hostname and hostname -f return the same value, e.g.:
hostname -f
k8s-worker03.dev.mydomain.com
[deploy@k8s-worker03 ~]$ hostname
k8s-worker03.dev.mydomain.com
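The comparison above can be scripted so that it is easy to repeat on every node. This is a minimal sketch, assuming a POSIX shell; the function names `hostnames_match` and `report_local_hostname` are hypothetical and not part of openshift-ansible.

```shell
# Succeeds (exit 0) when the two given names are identical.
# $1: name as reported by `hostname`
# $2: name as reported by `hostname -f`
hostnames_match() {
  [ "$1" = "$2" ]
}

# Print a one-line verdict for the local machine.
report_local_hostname() {
  short=$(hostname)
  # `hostname -f` can fail when the name does not resolve; treat that as empty.
  fqdn=$(hostname -f 2>/dev/null || echo "")
  if hostnames_match "$short" "$fqdn"; then
    echo "OK: $short"
  else
    echo "MISMATCH: hostname=$short hostname -f=$fqdn"
  fi
}
```

Sourcing the snippet on a node and running `report_local_hostname` prints either an `OK:` or a `MISMATCH:` line for that node.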
oc get csr on master:
NAME AGE REQUESTOR CONDITION
csr-2swlc 28m system:admin Pending
csr-5swf5 1h system:admin Pending
csr-5tdl7 1h system:admin Pending
...
csr-xkq9n 1h system:admin Pending
csr-xtk8j 1h system:admin Pending
On the master node, when I run oc get node -w I only see the master node, no others, eg:
k8s-ctl01.dev.mydomain.com Ready compute,infra,master 3h v1.11.0+d4cacc0
k8s-ctl01.dev.mydomain.com Ready compute,infra,master 3h v1.11.0+d4cacc0
I tried manually approving certificates on the master, e.g. oc adm certificate approve csr-v85dj, but this didn't make any difference either.
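For reference, approving every pending request in one pass, rather than one CSR at a time, could look like the sketch below. It assumes the tabular `oc get csr` layout shown above (a header row, with the condition in the last column); `pending_csrs` and `approve_pending_csrs` are hypothetical helper names. Since manual approval made no difference here, this mainly helps to rule the approval step out.

```shell
# Read `oc get csr` table output on stdin and print the names of
# requests whose last column (CONDITION) is exactly "Pending".
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Approve every pending CSR. Needs a logged-in `oc` with cluster-admin rights.
# `xargs -r` skips the approve call when there is nothing pending.
approve_pending_csrs() {
  oc get csr | pending_csrs | xargs -r oc adm certificate approve
}
```

Running `oc get csr | pending_csrs` alone lists what would be approved without changing anything.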
I managed to resolve this - it was a firewall issue.
The firewall requirements are listed on https://docs.openshift.com/container-platform/3.11/install/prerequisites.html
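A quick way to see whether a firewall is in the way is to probe the master API port from each worker. The sketch below assumes bash is installed (it relies on bash's `/dev/tcp` redirection) along with the coreutils `timeout` command. `port_open` is a hypothetical helper name, and 8443 is the `openshift_master_api_port` from the inventory above. It only tests TCP; the remaining ports to check are listed on the prerequisites page.

```shell
# Succeeds when a TCP connection to host:port can be opened within 3 seconds.
# $1: host name or IP address
# $2: TCP port
port_open() {
  # Inside the child bash, $0 is the host and $1 is the port.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

For example, on a worker: `port_open k8s-ctl01.dev.mydomain.com 8443 && echo reachable || echo blocked`.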
Check your hostname as well, using the hostnamectl command.
I am observing this issue while adding a new node to the cluster.
Installation worked fine for me with a single master and a single worker. I am now trying to add a new node to the cluster, but every time it fails at the same step:
TASK [Approve node certificates when bootstrapping]
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
.
.
.
FAILED - RETRYING: Approve node certificates when bootstrapping (0 retries left).
fatal: [172.16.0.93]: FAILED! => {"all_subjects_found": ["subject=/O=system:nodes/CN=system:node:okd-new\n", "subject=/O=system:nodes/CN=system:node:okd-new\n", "subject=/O=system:nodes/CN=system:node:okd-new\n", "subject=/O=system:nodes/CN=system:node:okd-new\n"], "attempts": 30, "changed": false, "client_approve_results": [], "client_csrs": {}, "msg": "Could not find csr for nodes: okd-new", . . . ....
I tried the solution given by @magick93, i.e. adding os_firewall_use_firewalld=True to the inventory. I also added openshift_master_bootstrap_auto_approve=True, but got the same result every time.
Same issue for me. I tried both suggestions from @magick93 and @harshalkwagh but also ended up with the same result.
Same here: blocked on scaleup.yml due to failed certificates.
I'm once again running into this issue, and previous solutions aren't working.
A partial workaround, related to checking whether hostname and hostname -f return the same value, was to ensure that no stray alias existed in /etc/hosts on any server.
In my case, we got past it by sanitizing the hosts file everywhere, because Ansible seemed to take that file into account when determining the FQDN, even when a DNS entry already existed and the hostname was explicitly set with the hostnamectl utility.
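To spot such aliases, it can help to list every hosts-file line that mentions a given name. This is a minimal sketch, assuming the standard `/etc/hosts` layout (address first, then names, with `#` starting a comment); `hosts_entries_for` is a hypothetical helper name.

```shell
# Print each hosts-file line (comments stripped) on which the given
# name appears as a canonical name or alias.
# $1: host name to look for
# $2: hosts file to read (defaults to /etc/hosts)
hosts_entries_for() {
  awk -v name="$1" '
    { sub(/#.*/, "") }   # drop trailing comments; awk re-splits the fields
    {
      # Field 1 is the address, so names start at field 2.
      for (i = 2; i <= NF; i++) {
        if ($i == name) { print; break }
      }
    }
  ' "${2:-/etc/hosts}"
}
```

For example, `hosts_entries_for "$(hostname -f)"` on a node. More than one matching line, or a match on a loopback address, is worth a closer look before rerunning the playbook.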
The problem, in my case, was that the hosts files on the nodes were incorrect.
The solution was to use DNS.
I had the same problem, but what fixed this issue for me was: (i) reboot all the nodes (or VMs), (ii) reconfigure the certificates, and (iii) try again to join the openshift-nodes, as follows:
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml
ansible-playbook -i inventory -e openshift_master_bootstrap_auto_approve=true /usr/share/ansible/openshift-ansible/playbooks/openshift-node/join.yml
Although reconfiguring the certificates gives a new error, it fixed the previous "Approve node certificates" failure.
Therefore, it seems you can ignore this new error...
TASK [Verify that the console is running] ****************************************************************************
fatal: [master-0]: FAILED! => {"msg": "The conditional check 'console_deployment.module_results.results[0].status.readyReplicas is defined' failed. The error was: error while evaluating conditional (console_deployment.module_results.results[0].status.readyReplicas is defined): 'dict object' has no attribute 'module_results'"}
I got rid of the issue by making "hostname" and "hostname -f" return the same value. Initially, hostname -f returned the FQDN. By editing /etc/hosts, I made it return the short name, which equals the hostname value.