Openshift-ansible: Upgrade 3.9 to 3.10 error: Cound not find csr for nodes

Created on 6 Sep 2018 · 18 Comments · Source: openshift/openshift-ansible

Description

I have an OpenShift Origin cluster on version 3.9. I want to upgrade it to 3.10, but at the "Approve the node" task I always get this message: "Cound not find csr for nodes: XXXX". The upgrade hangs at this step.

Version
  • ansible version: 2.6.3
  • git clone: release-3.10
  • openshift-ansible-3.10.43-1-2-gf78e916aa
Steps To Reproduce
  1. ansible-playbook -vvv -i host.dll.upgrade ../openshift-ansible/playbooks/openshift-master/openshift_node_group.yml
  2. ansible-playbook -vvv -i host.dll.upgrade ../openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml
Observed Results
  • The full traceback is:
    File "/tmp/ansible_oAgH_A/ansible_module_oc_csr_approve.py", line 24, in <module>
    from json.decoder import JSONDecodeError

fatal: [uosm1.XXXX -> uosm1.XXXX]: FAILED! => {
    "attempts": 30,
    "changed": false,
    "invocation": {
        "module_args": {
            "node_list": [
                "uosm1.XXXX"
            ],
            "oc_bin": "oc",
            "oc_conf": "/etc/origin/master/admin.kubeconfig"
        }
    },
    "msg": "Cound not find csr for nodes: uosm1.XXXX",
    "state": "unknown"
}

  • Failure summary:

    1. Hosts: uosm1.XXXX
      Play: Update master nodes
      Task: Approve the node
      Message: Cound not find csr for nodes: uosm1.XXXX
  • oc --config=admin.kubeconfig.udll get nodes
    NAME         STATUS     ROLES     AGE    VERSION
    uosi1.XXXX   Ready      infra     133d   v1.9.1+a0ce1bc657
    uosi2.XXXX   Ready      infra     133d   v1.9.1+a0ce1bc657
    uosi3.XXXX   Ready      infra     133d   v1.9.1+a0ce1bc657
    uosm1.XXXX   NotReady   master    133d   v1.10.0+b81c8f8
    uosm2.XXXX   Ready      master    133d   v1.9.1+a0ce1bc657
    uosm3.XXXX   Ready      master    133d   v1.9.1+a0ce1bc657
    uosn1.XXXX   Ready      compute   133d   v1.9.1+a0ce1bc657
    uosn2.XXXX   Ready      compute   133d   v1.9.1+a0ce1bc657
    uosn3.XXXX   Ready      compute   133d   v1.9.1+a0ce1bc657

  • oc --config=admin.kubeconfig.udll get csr
    NAME        AGE   REQUESTOR                CONDITION
    csr-2kgng   15h   system:node:uosm1.XXXX   Pending
    csr-2z4f7   9h    system:node:uosm1.XXXX   Pending
    csr-4hsbd   13h   system:node:uosm1.XXXX   Pending
    [...]

Additional Information
  • 3 masters, 3 infra nodes and 3 compute nodes, running RHEL 7.5 on VMware

  • Inventory file:
    [OSEv3:children]
    masters
    nodes
    etcd

[OSEv3:vars]
openshift_template_service_broker_namespaces=['openshift','default']
openshift_master_default_subdomain=uapps.XXXX
ansible_ssh_user=root
debug_level=2
openshift_master_cluster_hostname=uopenshift.XXXX
openshift_master_cluster_public_hostname=uopenshift.XXXX
openshift_deployment_type=origin
openshift_release="3.10"
openshift_clock_enabled=true
openshift_use_openshift_sdn=true

openshift_master_named_certificates=[{"certfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.crt", "keyfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.key", "cafile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.ca"}]

openshift_master_overwrite_named_certificates=true

openshift_hosted_router_certificate={"certfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.crt", "keyfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.key", "cafile": "/home/XXXX/git/openshift/configuration/before_openshift/https/XXXX.ca"}

openshift_hosted_registry_routehost=registry.uapps.XXXX

openshift_hosted_registry_routecertificates={"certfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/uapps.XXXX.crt", "keyfile": "/home/XXXX/git/openshift/configuration/before_openshift/https/uapps.XXXX.key", "cafile": "/home/XXXX/git/openshift/configuration/before_openshift/https/uapps.XXXX.ca"}

osm_cluster_network_cidr=173.18.0.0/16
openshift_portal_net=172.19.0.0/16
openshift_docker_options='--insecure-registry 172.18.0.0/15'
openshift_router_selector='node-role.kubernetes.io/infra=true'
openshift_registry_selector='node-role.kubernetes.io/infra=true'
osm_default_node_selector='node-role.kubernetes.io/compute=true'
openshift_master_api_port=443
openshift_master_console_port=443
openshift_disable_check=docker_image_availability,memory_availability,disk_availability,package_availability,docker_storage
openshift_http_proxy=http://XXXX:3128
openshift_https_proxy=https://XXX:3128
openshift_no_proxy='172.18.0.0/15,registry.uapps.XXXX,ceph-s3.XXXX'

openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=s3
openshift_hosted_registry_storage_s3_accesskey=XXXX
openshift_hosted_registry_storage_s3_secretkey=XXXX
openshift_hosted_registry_storage_s3_regionendpoint=http://ceph-s3.XXXX:8080
openshift_hosted_registry_storage_s3_bucket=XXXX
openshift_hosted_registry_storage_s3_region=default
openshift_hosted_registry_storage_s3_chunksize=26214400
openshift_hosted_registry_storage_s3_rootdirectory=/registry
openshift_hosted_registry_pullthrough=true
openshift_hosted_registry_acceptschema2=true
openshift_hosted_registry_enforcequota=false

[masters]
uosm1.XXXX
uosm2.XXXX
uosm3.XXXX

[etcd]
uosm1.XXXX
uosm2.XXXX
uosm3.XXXX

[nodes]
uosm1.XXXX openshift_node_group_name='node-config-master'
uosm2.XXXX openshift_node_group_name='node-config-master'
uosm3.XXXX openshift_node_group_name='node-config-master'
uosi1.XXXX openshift_node_group_name='node-config-infra'
uosi2.XXXX openshift_node_group_name='node-config-infra'
uosi3.XXXX openshift_node_group_name='node-config-infra'
uosn1.XXXX openshift_node_group_name='node-config-compute'
uosn2.XXXX openshift_node_group_name='node-config-compute'
uosn3.XXXX openshift_node_group_name='node-config-compute'

All 18 comments

In version 3.10, native DNS must be able to resolve all machine names.
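
A quick way to check name resolution for every inventory host is sketched below. This is only an illustration, not part of openshift-ansible: the function name `unresolvable_hosts` is made up here, and `socket.gethostbyname` uses the system resolver, so it will also succeed for names that only exist in /etc/hosts.

```python
import socket


def unresolvable_hosts(hosts, resolve=socket.gethostbyname):
    """Return the host names that the given resolver cannot resolve.

    `resolve` defaults to the system resolver; it is a parameter only so
    the function can be exercised without network access.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

Run it from each machine with the names listed under [nodes] in the inventory, for example `unresolvable_hosts(["uosm1.XXXX", "uosm2.XXXX"])`; an empty list means every name resolved from that machine.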

Thanks for your answer.
All nodes can resolve each other. From the first master, I can resolve the other nodes, and also hosts outside this cluster.
The problem appears when the CSR is created. When I run oc describe on it, I don't see the real name of the node: the certificate request is created for the node's short hostname instead of its FQDN.

If you were previously using /etc/hosts on each node, how do you fix this? Entering each node into DNS doesn't seem to resolve it by itself.

I have the same issue on a new installation of the 90-day trial from Red Hat for OpenShift Enterprise v3.10.41-1.

FAILED - RETRYING: Approve node certificates when bootstrapping (2 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).
fatal: [AgoodDNSHostname]: FAILED! => {"attempts": 30, "changed": false, "msg": "Cound not find csr for nodes: "AgoodDNSHostname", "state": "unknown"}

Edit: I have a ticket open with Red Hat, by the way. No workaround or diagnosis since the 6th of September.

I have the same problem upgrading to 3.10 using openshift-ansible.noarch 3.10.41-1.git.0.fd15dd7.el7 @rhel-7-server-ose-3.10-rpms

output of 'hostname': node1
output of 'hostname -f': node1.ourdomain.com

The nodes in the inventory file are listed by FQDN.

Can you use openshift-ansible-3.10.21-1.git.0.6446011? It works for me with this release. If not, when your playbook reaches this stage, run "oc get csr". You will see all the CSRs, and you can approve the most recently generated one with the command "oc adm certificate approve CSR_ID".
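
The manual approval described above can be sketched as follows. This is a rough helper, not the logic of the oc_csr_approve module: the function names are invented for illustration, and it assumes the four-column plain-text layout of "oc get csr" shown earlier in this issue. It only builds the commands; review each CSR before running them.

```python
def pending_csrs_for_node(csr_table, node_name):
    """Return the names of Pending CSRs requested by system:node:<node_name>.

    `csr_table` is the plain-text output of `oc get csr`, assumed to have
    the columns NAME, AGE, REQUESTOR and CONDITION.
    """
    wanted = "system:node:" + node_name
    names = []
    for line in csr_table.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a four-column row.
        if len(fields) != 4 or fields[0] == "NAME":
            continue
        name, _age, requestor, condition = fields
        if requestor == wanted and condition == "Pending":
            names.append(name)
    return names


def approve_commands(csr_names):
    """Build one `oc adm certificate approve` command line per CSR name."""
    return ["oc adm certificate approve " + name for name in csr_names]
```

Feeding it the "oc get csr" output from the issue description with node name "uosm1.XXXX" would list csr-2kgng, csr-2z4f7 and csr-4hsbd. Note that the requestor is matched exactly, so a CSR created for a short hostname will not be found when you pass the FQDN.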

That didn't help for me. With a bit of hacking I got past the CSR problem, but then I got CNI errors: "Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"

I then decided to change the hostnames to FQDNs.

After clearing up the mess left by previous upgrade attempts, such as a failed etcd node and the sync daemon set, I got through the upgrade (at least until the installation of the service catalog, but that is another story).

There has been a series of fixes around the CSR process in the last couple of weeks. I don't think all of them have shipped yet, but the latest commit on the release-3.10 git branch should contain them.

It's important that, when upgrading from 3.9, your hostnames match the node names in 'oc get nodes'; otherwise, we won't be able to find the CSRs for your nodes.
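
The comparison described above can be sketched like this. It is only an illustration of the check, not what the playbooks do internally; `compare_names` is a made-up helper that takes a node name from 'oc get nodes' and the name a host reports for itself.

```python
def compare_names(node_name, host_name):
    """Classify how a node name relates to the name reported by the host.

    Returns "match" when the two are identical, "short-vs-fqdn" when only
    the first DNS label agrees (for example node1 vs node1.ourdomain.com),
    and "different" otherwise.
    """
    if node_name == host_name:
        return "match"
    if node_name.split(".")[0] == host_name.split(".")[0]:
        return "short-vs-fqdn"
    return "different"
```

With the values reported earlier in this thread, `compare_names("node1.ourdomain.com", "node1")` gives "short-vs-fqdn", which is the situation where the node is registered under its FQDN while 'hostname' returns the short name.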

We were able to solve the problem by replacing the playbooks installed from the RPM with the updated ones from the release-3.10 branch of the GitHub repo. That worked.

However, for some reason, if I download the entire release-3.10 branch it still fails with the same CSR issue.

Anyway, my OpenShift Enterprise 3.10 cluster is now online.

Can you please explain what you did exactly by "using the updated playbooks"? It seems to fail for me...

Downloaded these and replaced:
https://github.com/openshift/openshift-ansible/pull/10055
roles/lib_openshift/library/oc_csr_approve.py
roles/lib_openshift/test/test_oc_csr_approve.py

Works for me with openshift-ansible-3.10.51-1

@infrasystemelille, I'm not able to find the openshift-ansible-3.10.51-1 RPM in the RHEL yum repo.

It is not in that repo as an RPM. You have to clone the openshift-ansible git repository and check out this release, or download this RPM: http://mirror.centos.org/centos/7/paas/x86_64/openshift-origin310/openshift-ansible-3.10.51-1.git.0.44a646c.el7.noarch.rpm

release-3.10 is the branch right?

yes

@infrasystemelille I am getting the same error using the RPM as well as the git source.

If you were previously using /etc/hosts instead of DNS, and then switch to DNS, is there any way to fix the existing nodes without rebuilding the cluster?
