Openshift-ansible: Upgrade 3.9 to 3.10 error: Cound not find csr for nodes

Created on 6 Sep 2018 · 18 Comments · Source: openshift/openshift-ansible

Description

I have an OpenShift Origin cluster running version 3.9. I want to upgrade it to 3.10, but at the "Approve the node" task I always get this message: "Cound not find csr for nodes: XXXX". The upgrade hangs at this step.

Version

ansible version: 2.6.3
git clone: release-3.10
openshift-ansible-3.10.43-1-2-gf78e916aa

Steps To Reproduce

ansible-playbook -vvv -i host.dll.upgrade ../openshift-ansible/playbooks/openshift-master/openshift_node_group.yml
ansible-playbook -vvv -i host.dll.upgrade ../openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml

Observed Results

The full traceback is:
File "/tmp/ansible_oAgH_A/ansible_module_oc_csr_approve.py", line 24, in
from json.decoder import JSONDecodeError

fatal: [uosm1.XXXX -> uosm1.XXXX]: FAILED! => {
"attempts": 30,
"changed": false,
"invocation": {
"module_args": {
"node_list": [
"uosm1.XXXX"
],
"oc_bin": "oc",
"oc_conf": "/etc/origin/master/admin.kubeconfig"
}
},
"msg": "Cound not find csr for nodes: uosm1.XXXX",
"state": "unknown"
}

Failure summary:
1. Hosts: uosm1.XXXX
  Play: Update master nodes
  Task: Approve the node
  Message: Cound not find csr for nodes: uosm1.XXXX

oc --config=admin.kubeconfig.udll get nodes
NAME STATUS ROLES AGE VERSION
uosi1.XXXX Ready infra 133d v1.9.1+a0ce1bc657
uosi2.XXXX Ready infra 133d v1.9.1+a0ce1bc657
uosi3.XXXX Ready infra 133d v1.9.1+a0ce1bc657
uosm1.XXXX NotReady master 133d v1.10.0+b81c8f8
uosm2.XXXX Ready master 133d v1.9.1+a0ce1bc657
uosm3.XXXX Ready master 133d v1.9.1+a0ce1bc657
uosn1.XXXX Ready compute 133d v1.9.1+a0ce1bc657
uosn2.XXXX Ready compute 133d v1.9.1+a0ce1bc657
uosn3.XXXX Ready compute 133d v1.9.1+a0ce1bc657

oc --config=admin.kubeconfig.udll get csr
NAME AGE REQUESTOR CONDITION
csr-2kgng 15h system:node:uosm1.XXXX Pending
csr-2z4f7 9h system:node:uosm1.XXXX Pending
csr-4hsbd 13h system:node:uosm1.XXXX Pending
[...]

Additional Information

3 masters, 3 infra nodes, and 3 compute nodes running RHEL 7.5 on VMware.
Inventory file:
[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
openshift_template_service_broker_namespaces=['openshift','default']
openshift_master_default_subdomain=uapps.XXXX
ansible_ssh_user=root
debug_level=2
openshift_master_cluster_hostname=uopenshift.XXXX
openshift_master_cluster_public_hostname=uopenshift.XXXX
openshift_deployment_type=origin
openshift_release="3.10"
openshift_clock_enabled=true
openshift_use_openshift_sdn=true

openshift_master_named_certificates=[{"certfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.crt", "keyfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.key", "cafile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.ca"}]

openshift_master_overwrite_named_certificates=true

openshift_hosted_router_certificate={"certfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.crt", "keyfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.key", "cafile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.ca"}

openshift_hosted_registry_routehost=registry.uapps.XXXX

openshift_hosted_registry_routecertificates={"certfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/uapps.XXXX.crt", "keyfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/uapps.XXXX.key", "cafile": "/home/XXXX/git/openshift/configuration/before_openshift/https/uapps.XXXX.ca"}

osm_cluster_network_cidr=173.18.0.0/16
openshift_portal_net=172.19.0.0/16
openshift_docker_options='--insecure-registry 172.18.0.0/15'
openshift_router_selector='node-role.kubernetes.io/infra=true'
openshift_registry_selector='node-role.kubernetes.io/infra=true'
osm_default_node_selector='node-role.kubernetes.io/compute=true'
openshift_master_api_port=443
openshift_master_console_port=443
openshift_disable_check=docker_image_availability,memory_availability,disk_availability,package_availability,docker_storage
openshift_http_proxy=http://XXXX:3128
openshift_https_proxy=https://XXX:3128
openshift_no_proxy='172.18.0.0/15,registry.uapps.XXXX,ceph-s3.XXXX'

openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=s3
openshift_hosted_registry_storage_s3_accesskey=XXXX
openshift_hosted_registry_storage_s3_secretkey=XXXX
openshift_hosted_registry_storage_s3_regionendpoint=http://ceph-s3.XXXX:8080
openshift_hosted_registry_storage_s3_bucket=XXXX
openshift_hosted_registry_storage_s3_region=default
openshift_hosted_registry_storage_s3_chunksize=26214400
openshift_hosted_registry_storage_s3_rootdirectory=/registry
openshift_hosted_registry_pullthrough=true
openshift_hosted_registry_acceptschema2=true
openshift_hosted_registry_enforcequota=false

[masters]
uosm1.XXXX
uosm2.XXXX
uosm3.XXXX

[etcd]
uosm1.XXXX
uosm2.XXXX
uosm3.XXXX

[nodes]
uosm1.XXXX openshift_node_group_name='node-config-master'
uosm2.XXXX openshift_node_group_name='node-config-master'
uosm3.XXXX openshift_node_group_name='node-config-master'
uosi1.XXXX openshift_node_group_name='node-config-infra'
uosi2.XXXX openshift_node_group_name='node-config-infra'
uosi3.XXXX openshift_node_group_name='node-config-infra'
uosn1.XXXX openshift_node_group_name='node-config-compute'
uosn2.XXXX openshift_node_group_name='node-config-compute'
uosn3.XXXX openshift_node_group_name='node-config-compute'

Source

infrasystemelille

All 18 comments

In version 3.10, the native DNS must be able to resolve all machine names.

aland-zhang on 7 Sep 2018

👍2 ❤1

Thanks for your answer.
All nodes can resolve each other. From the first master, I can resolve the other nodes as well as hosts outside this cluster.
The problem occurs when the CSR is created. When I run oc describe on it, I don't see the node's real name: the certificate request is created for the node's short hostname instead of its FQDN.

infrasystemelille on 7 Sep 2018

👍1
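
To check which node names actually appear on the pending requests, the output of `oc get csr` can be grouped by requestor. This is a minimal Python sketch, assuming the four-column layout shown in the issue (NAME, AGE, REQUESTOR, CONDITION) and the `system:node:` requestor prefix; the function names are hypothetical, and the sketch only reads the REQUESTOR column, which may differ from the names embedded in the request itself (those are shown by `oc describe`).

```python
from collections import defaultdict

NODE_PREFIX = "system:node:"


def pending_csrs_by_node(csr_table):
    """Map node name -> names of pending CSRs, from `oc get csr` text output."""
    result = defaultdict(list)
    for line in csr_table.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a full table row.
        if len(fields) < 4 or fields[0] == "NAME":
            continue
        name, _age, requestor, condition = fields[:4]
        if condition == "Pending" and requestor.startswith(NODE_PREFIX):
            result[requestor[len(NODE_PREFIX):]].append(name)
    return dict(result)


def nodes_without_csr(expected_nodes, csr_table):
    """Return the expected node names that have no pending CSR under that exact name."""
    found = pending_csrs_by_node(csr_table)
    return [node for node in expected_nodes if node not in found]
```

If a node shows up here under its short name while the inventory uses the FQDN, that would be consistent with the mismatch described above.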

If you were previously using /etc/hosts on each node, how do you fix this? Entering each node into DNS doesn't seem to resolve it by itself.

felsys on 12 Sep 2018

I have the same issue on a new installation of the 90-day trial from Red Hat for OpenShift Enterprise v3.10.41-1.

FAILED - RETRYING: Approve node certificates when bootstrapping (2 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).
fatal: [AgoodDNSHostname]: FAILED! => {"attempts": 30, "changed": false, "msg": "Cound not find csr for nodes: "AgoodDNSHostname", "state": "unknown"}

Edited: I have a ticket open with Red Hat, by the way; no workaround or diagnosis since the 6th of September.

a0149659 on 12 Sep 2018

I have the same problem upgrading to 3.10 using openshift-ansible.noarch 3.10.41-1.git.0.fd15dd7.el7 @rhel-7-server-ose-3.10-rpms

output of 'hostname': node1
output of 'hostname -f': node1.ourdomain.com

The nodes in the inventory file are listed by FQDN.

mark-00 on 14 Sep 2018
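
The `hostname` versus `hostname -f` comparison above can be repeated on each host with a small Python sketch. It assumes that `socket.gethostname()` and `socket.getfqdn()` roughly correspond to those two commands (the FQDN lookup depends on the local resolver configuration); the function names are hypothetical.

```python
import socket


def local_names():
    """Return (hostname, FQDN) as seen by the host running this script."""
    return socket.gethostname(), socket.getfqdn()


def compare_with_inventory(inventory_name, short_name, fqdn):
    """Report whether the inventory entry equals the hostname and/or the FQDN."""
    wanted = inventory_name.lower()
    return {
        "matches_hostname": wanted == short_name.lower(),
        "matches_fqdn": wanted == fqdn.lower(),
    }
```

For the host described above, `compare_with_inventory("node1.ourdomain.com", "node1", "node1.ourdomain.com")` reports a match on the FQDN but not on the hostname.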

> I have the same issue on a new installation of the 90-day trial from Red Hat for OpenShift Enterprise v3.10.41-1.
>
> FAILED - RETRYING: Approve node certificates when bootstrapping (2 retries left).
> FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).
> fatal: [AgoodDNSHostname]: FAILED! => {"attempts": 30, "changed": false, "msg": "Cound not find csr for nodes: "AgoodDNSHostname", "state": "unknown"}
>
> Edited: I have a ticket open with Red Hat, by the way; no workaround or diagnosis since the 6th of September.

Can you use openshift-ansible-3.10.21-1.git.0.6446011? It works for me with this release. If not, when your playbook reaches this stage, run "oc get csr" to list all CSRs, then approve the most recently generated one with "oc adm certificate approve CSR_ID".

infrasystemelille on 18 Sep 2018
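
The manual workaround above (find the most recent pending CSR for a node, then approve it) can be sketched in Python. This assumes the four-column `oc get csr` layout shown in the issue and AGE values such as `9h` or `2d3h`; the function names are hypothetical, and because AGE is coarse, two requests with the same age are ordered by name.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def age_to_seconds(age):
    """Convert an AGE value such as '15h' or '2d3h' to seconds."""
    parts = re.findall(r"(\d+)([smhd])", age)
    if not parts or "".join(num + unit for num, unit in parts) != age:
        raise ValueError("unrecognised age: %r" % age)
    return sum(int(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def newest_pending_csr(csr_table, node):
    """Return the name of the youngest pending CSR requested by `node`, or None."""
    candidates = []
    for line in csr_table.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a full table row.
        if len(fields) < 4 or fields[0] == "NAME":
            continue
        name, age, requestor, condition = fields[:4]
        if condition == "Pending" and requestor == "system:node:" + node:
            candidates.append((age_to_seconds(age), name))
    return min(candidates)[1] if candidates else None


def approve_command(csr_name):
    """Build the argument list for `oc adm certificate approve <csr>`."""
    return ["oc", "adm", "certificate", "approve", csr_name]
```

The list returned by `approve_command` can be passed to `subprocess.run(..., check=True)` on a master where `oc` is logged in as a cluster admin.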

That didn't help for me. With a bit of hacking I got past the CSR problem, but then I got CNI errors: "Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"

I then decided to change the hostnames to FQDNs.

After cleaning up the mess left by previous upgrade attempts, such as a failed etcd node and the sync daemon set, I got through the upgrade (at least until the installation of the service catalog, but that is another story).

mark-00 on 20 Sep 2018

There have been a series of fixes around the CSR process in the last couple of weeks. I don't think all the fixes have shipped yet, but the latest commit on the release-3.10 git branch should contain them.

It's important that when upgrading from 3.9, your hostnames match the node names in 'oc get nodes'; otherwise, we won't be able to find the CSRs for your nodes.

michaelgugino on 20 Sep 2018
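
The requirement above, that inventory hostnames match the names in 'oc get nodes', can be checked before starting the upgrade. This is a minimal Python sketch, assuming an INI-style inventory with a `[nodes]` group like the one in the issue and the plain-text `oc get nodes` table; the function names are hypothetical and the parsing is deliberately simple (no host ranges or included files).

```python
def inventory_hosts(inventory_text, group="nodes"):
    """Return the host names listed under [group] in an INI-style inventory."""
    hosts = []
    in_group = False
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("["):
            # A new section header starts; track whether it is the wanted group.
            in_group = line == "[%s]" % group
            continue
        if in_group:
            # The host name is the first token; host variables follow it.
            hosts.append(line.split()[0])
    return hosts


def cluster_node_names(nodes_table):
    """Return the NAME column from `oc get nodes` text output."""
    names = []
    for line in nodes_table.strip().splitlines():
        fields = line.split()
        if fields and fields[0] != "NAME":
            names.append(fields[0])
    return names


def mismatched_hosts(inventory_text, nodes_table):
    """Return inventory hosts that do not appear verbatim in `oc get nodes`."""
    known = set(cluster_node_names(nodes_table))
    return [host for host in inventory_hosts(inventory_text) if host not in known]
```

An empty result means every inventory name appears verbatim in the cluster; any host returned is a candidate for the short-name versus FQDN mismatch discussed in this thread.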

We were able to solve the problem using the updated playbooks for 3.10 from the GitHub repo: we replaced the RPM playbooks with the ones from the release-3.10 branch, and that worked.

However, for some reason, if I download the entire release-3.10 branch, it still fails with the same CSR issue.

Anyway, my OpenShift Enterprise 3.10 cluster is now online.

a0149659 on 21 Sep 2018

> We were able to solve the problem using the updated playbooks for 3.10 from the GitHub repo: we replaced the RPM playbooks with the ones from the release-3.10 branch, and that worked.

Can you please explain exactly what you did by "using the updated playbooks"? It seems to fail for me...

marc-ledent on 28 Sep 2018

Downloaded these and replaced:
https://github.com/openshift/openshift-ansible/pull/10055
roles/lib_openshift/library/oc_csr_approve.py
roles/lib_openshift/test/test_oc_csr_approve.py

a0149659 on 1 Oct 2018

Works for me with openshift-ansible-3.10.51-1

infrasystemelille on 8 Oct 2018

@infrasystemelille, I'm not able to find the openshift-ansible-3.10.51-1 RPM in the RHEL yum repo.

thiyagu06 on 12 Oct 2018

It is not an RPM from the RHEL repo: you have to clone the openshift-ansible git repository and check out this release, or download this RPM: http://mirror.centos.org/centos/7/paas/x86_64/openshift-origin310/openshift-ansible-3.10.51-1.git.0.44a646c.el7.noarch.rpm

infrasystemelille on 12 Oct 2018

release-3.10 is the branch, right?

thiyagu06 on 12 Oct 2018

> release-3.10 is the branch, right?

Yes.

infrasystemelille on 12 Oct 2018

@infrasystemelille, I am getting the same error using the RPM as well as the git source.

thiyagu06 on 12 Oct 2018

👍2

If you were previously using /etc/hosts instead of DNS, and then switch to DNS, is there any way to fix the existing nodes without rebuilding the cluster?

felsys on 29 Oct 2018

👍2
