Openshift-ansible: openshift_service_catalog install fails (OKD 3.11) - Wait for API Server rollout success

Created on 4 Dec 2018 · 3 comments · Source: openshift/openshift-ansible

Description

OKD 3.11 installation fails at:

TASK [openshift_service_catalog : Wait for API Server rollout success]

Version

Your ansible version per ansible --version

ansible 2.6.5
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Oct 30 2018, 23:45:53) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

If you're operating from a git clone:

The output of git describe

openshift-ansible-3.11.51-1

Steps To Reproduce

Node setup: 2 masters, 2 external etcd, 1 load-balancer, 2 nodes, 2 infra-nodes -- inventory file being used:

[OSEv3:children]
masters
etcd
lb
nodes

[masters]
master[1:2].example.com

[etcd]
etcd[1:2].example.com

[lb]
lb1.example.com

[nodes]
master[1:2].example.com openshift_node_group_name='node-config-master'
node[1:2].example.com openshift_node_group_name='node-config-compute'
infra-node[1:2].example.com openshift_node_group_name='node-config-infra'


[OSEv3:vars]
ansible_ssh_user=root
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true','challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]

openshift_deployment_type=origin
openshift_release=v3.11

openshift_master_cluster_method=native
openshift_master_cluster_hostname=console.example.com
openshift_master_default_subdomain=apps.example.com

openshift_master_api_port=8443
openshift_master_console_port=8443

openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability

os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'

## Needed for OKD 3.11
openshift_additional_repos=[{'id': 'centos-paas', 'name': 'centos-paas','baseurl' :'https://buildlogs.centos.org/centos/7/paas/x86_64/openshift-origin311','gpgcheck' :'0', 'enabled' :'1'}]

DNS Server running DNSMasq with:

/etc/hosts:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.4.1.100  ns1.example.com ns1
10.4.1.101  lb1.example.com lb1 console.example.com console
10.4.1.102  master1.example.com master1
10.4.1.103  master2.example.com master2
10.4.1.104  etcd1.example.com etcd1
10.4.1.105  etcd2.example.com etcd2
10.4.1.106  node1.example.com node1
10.4.1.107  node2.example.com node2
10.4.1.108  infra-node1.example.com infra-node1
10.4.1.109  infra-node2.example.com infra-node2

/etc/dnsmasq.conf:

conf-dir=/etc/dnsmasq.d,.rpmnew,.rpmsave,.rpmorig

strict-order
domain-needed
local=/example.com/
bind-dynamic
log-queries

address=/.example.com/10.4.1.101 # load-balancer

/etc/resolv.conf on each Node (DNS nameserver [ns1] additionally has an upstream DNS):

# Generated by NetworkManager
search example.com
nameserver 10.4.1.100

dig output to test DNS resolution -- run from Load Balancer:

[root@lb1 ~]# dig ns1.example.com @10.4.1.100 +short                                                                                                            
10.4.1.100
[root@lb1 ~]# dig lb1.example.com @10.4.1.100 +short                                                                                                            
10.4.1.101
[root@lb1 ~]# dig master1.example.com @10.4.1.100 +short
10.4.1.102
[root@lb1 ~]# dig master2.example.com @10.4.1.100 +short
10.4.1.103
[root@lb1 ~]# dig etcd1.example.com @10.4.1.100 +short
10.4.1.104
[root@lb1 ~]# dig etcd2.example.com @10.4.1.100 +short
10.4.1.105
[root@lb1 ~]# dig node1.example.com @10.4.1.100 +short                                                                                                          
10.4.1.106
[root@lb1 ~]# dig node2.example.com @10.4.1.100 +short
10.4.1.107
[root@lb1 ~]# dig infra-node1.example.com @10.4.1.100 +short
10.4.1.108
[root@lb1 ~]# dig infra-node2.example.com @10.4.1.100 +short
10.4.1.109

Installing from the DNS nameserver (IP: 10.4.1.100), with SSH keys copied to all nodes (including itself) via ssh-copy-id:

for host in \
  lb1.example.com \
  master1.example.com \
  master2.example.com \
  etcd1.example.com \
  etcd2.example.com \
  node1.example.com \
  node2.example.com \
  infra-node1.example.com \
  infra-node2.example.com
do
  # Copy the installer host's public key to each cluster host
  ssh-copy-id "${host}"
done

Ran Prerequisite Playbook:

# ansible-playbook -i inventory.ini openshift-ansible/playbooks/prerequisites.yml

Ran Deploy Cluster Playbook:

# ansible-playbook -i inventory.ini openshift-ansible/playbooks/deploy_cluster.yml

Expected Results

OKD 3.11 installs and the Service Catalog rolls out successfully.

Observed Results

OKD 3.11 fails to install the Service Catalog:

TASK [openshift_service_catalog : Wait for API Server rollout success]

FAILED - RETRYING: Wait for API Server rollout success (1 retries left).
fatal: [master1.example.com]: FAILED! => {
    "attempts": 5, 
    "changed": false, 
    "cmd": [
        "oc", 
        "rollout", 
        "status", 
        "--config=/etc/origin/master/admin.kubeconfig", 
        "-n", 
        "kube-service-catalog", 
        "ds/apiserver"
    ], 
    "delta": "0:00:00.219700", 
    "end": "2018-12-04 10:53:07.205662", 
    "invocation": {
        "module_args": {
            "_raw_params": "oc rollout status --config=/etc/origin/master/admin.kubeconfig -n kube-service-catalog ds/apiserver", 
            "_uses_shell": false, 
            "argv": null, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "msg": "non-zero return code", 
    "rc": 1, 
    "start": "2018-12-04 10:53:06.985962", 
    "stderr": "error: watch closed before Until timeout", 
    "stderr_lines": [
        "error: watch closed before Until timeout"
    ], 
    "stdout": "Waiting for daemon set \"apiserver\" rollout to finish: 0 of 2 updated pods are available...\nWaiting for daemon set \"apiserver\" rollout to finish: 0 of 2 updated pods are available...", 
    "stdout_lines": [
        "Waiting for daemon set \"apiserver\" rollout to finish: 0 of 2 updated pods are available...", 
        "Waiting for daemon set \"apiserver\" rollout to finish: 0 of 2 updated pods are available..."
    ]
}
...ignoring

TASK [openshift_service_catalog : Wait for Controller Manager rollout success]

FAILED - RETRYING: Wait for Controller Manager rollout success (1 retries left).
fatal: [master1.example.com]: FAILED! => {
    "attempts": 5, 
    "changed": false, 
    "cmd": [
        "oc", 
        "rollout", 
        "status", 
        "--config=/etc/origin/master/admin.kubeconfig", 
        "-n", 
        "kube-service-catalog", 
        "ds/controller-manager"
    ], 
    "delta": "0:00:00.221175", 
    "end": "2018-12-04 10:54:00.366703", 
    "invocation": {
        "module_args": {
            "_raw_params": "oc rollout status --config=/etc/origin/master/admin.kubeconfig -n kube-service-catalog ds/controller-manager", 
            "_uses_shell": false, 
            "argv": null, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "msg": "non-zero return code", 
    "rc": 1, 
    "start": "2018-12-04 10:54:00.145528", 
    "stderr": "error: watch closed before Until timeout", 
    "stderr_lines": [
        "error: watch closed before Until timeout"
    ], 
    "stdout": "Waiting for daemon set \"controller-manager\" rollout to finish: 0 of 2 updated pods are available...\nWaiting for daemon set \"controller-manager\" rollout to finish: 0 of 2 updated pods are available...", 
    "stdout_lines": [
        "Waiting for daemon set \"controller-manager\" rollout to finish: 0 of 2 updated pods are available...", 
        "Waiting for daemon set \"controller-manager\" rollout to finish: 0 of 2 updated pods are available..."
    ]
}
...ignoring

TASK [openshift_service_catalog : Verify that the Catalog API Server is running]

FAILED - RETRYING: Verify that the Catalog API Server is running (1 retries left).
fatal: [master1.example.com]: FAILED! => {
    "attempts": 60, 
    "changed": false, 
    "cmd": [
        "curl", 
        "-k", 
        "https://apiserver.kube-service-catalog.svc/healthz"
    ], 
    "delta": "0:00:01.014590", 
    "end": "2018-12-04 11:17:36.736802", 
    "invocation": {
        "module_args": {
            "_raw_params": "curl -k https://apiserver.kube-service-catalog.svc/healthz", 
            "_uses_shell": false, 
            "argv": null, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": false
        }
    }, 
    "msg": "non-zero return code", 
    "rc": 7, 
    "start": "2018-12-04 11:17:35.722212", 
    "stderr": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to apiserver.kube-service-catalog.svc:443; Connection refused", 
    "stderr_lines": [
        "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current", 
        "                                 Dload  Upload   Total   Spent    Left  Speed", 
        "", 
        "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0", 
        "  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to apiserver.kube-service-catalog.svc:443; Connection refused"
    ], 
    "stdout": "", 
    "stdout_lines": []
}
...ignoring

TASK [openshift_service_catalog : Report errors] ***********************************************************************************************************
fatal: [master1.example.com]: FAILED! => {"changed": false, "msg": "Catalog install failed."}
        to retry, use: --limit @/root/okd-installer/openshift-ansible/playbooks/openshift-service-catalog/config.retry

PLAY RECAP *************************************************************************************************************************************************
etcd1.example.com              : ok=18   changed=1    unreachable=0    failed=0   
etcd2.example.com              : ok=16   changed=1    unreachable=0    failed=0   
infra-node1.example.com        : ok=0    changed=0    unreachable=0    failed=0   
infra-node2.example.com        : ok=0    changed=0    unreachable=0    failed=0   
lb1.example.com                : ok=1    changed=0    unreachable=0    failed=0   
localhost                      : ok=12   changed=0    unreachable=0    failed=0   
master1.example.com            : ok=91   changed=25   unreachable=0    failed=1   
master2.example.com            : ok=28   changed=1    unreachable=0    failed=0   
node1.example.com              : ok=0    changed=0    unreachable=0    failed=0   
node2.example.com              : ok=0    changed=0    unreachable=0    failed=0   


INSTALLER STATUS *******************************************************************************************************************************************
Initialization           : Complete (0:00:34)
Service Catalog Install  : In Progress (0:24:32)
        This phase can be restarted by running: playbooks/openshift-service-catalog/config.yml

Additional Information

Your operating system and version:

CentOS Linux release 7.6.1810 (Core)

I've retried by running only playbooks/openshift-service-catalog/config.yml, with the same output.
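
When the rollout reports "0 of 2 updated pods are available", the state of the daemon sets and pods in the kube-service-catalog namespace usually says more than the playbook output does. A minimal sketch, assuming it is run on a master with the same admin kubeconfig the playbook uses; the function name catalog_status is made up for illustration and is not part of openshift-ansible:

```shell
# Hypothetical helper (not part of openshift-ansible): show the Service
# Catalog daemon sets and pods, then the events recorded in the namespace.
catalog_status() {
  # Assumption: default kubeconfig path taken from the failing task output.
  kubeconfig=${1:-/etc/origin/master/admin.kubeconfig}
  ns=kube-service-catalog
  oc --config="$kubeconfig" -n "$ns" get ds,pods -o wide
  oc --config="$kubeconfig" -n "$ns" get events
}
```

Calling catalog_status with no argument uses the kubeconfig path from the log above; a different path can be passed as the first argument.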

Anyone see anything wrong with my inventory/setup?

Thanks!

Source

DizzyThermal

Most helpful comment

I've resolved this particular issue, installation now finishes and the service catalog rolls out successfully.

My particular issue was in the DNSMasq configuration.

I had:

address=/.example.com/10.4.1.101 # load-balancer

and needed, instead (note the added apps):

address=/.apps.example.com/10.4.1.101 # load-balancer

My guess at the root cause: even though every host name I tested resolved correctly, the install still failed because the wildcard was set up to resolve every unknown name under example.com to the load balancer, not only the expected ones such as foobar.apps.example.com. That made it a bit tricky to figure out.
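
One way to picture the difference between the two address= lines: dnsmasq answers with the given IP for any name under the listed domain, while names listed in /etc/hosts keep their own addresses (which would explain why the dig checks all passed). The sketch below mimics that suffix match in plain shell. It is an illustration only, not dnsmasq's actual code; in_wildcard is a made-up name, and treating the leading dot as optional is an assumption.

```shell
# Hypothetical helper: succeeds if NAME falls under the address= DOMAIN.
# Illustrates the suffix match only; it does not query any DNS server.
in_wildcard() {
  name=${1%.}      # drop a trailing root dot, if any
  domain=${2#.}    # drop a leading dot, if any (assumed to be optional)
  case "$name" in
    "$domain" | *."$domain") return 0 ;;
    *) return 1 ;;
  esac
}

# Compare the old and the new wildcard against one expected application
# name and one arbitrary unknown name (both names are examples).
for d in .example.com .apps.example.com; do
  for n in foobar.apps.example.com some-unknown-host.example.com; do
    if in_wildcard "$n" "$d"; then
      echo "$d catches $n"
    else
      echo "$d skips $n"
    fi
  done
done
```

With the narrower .apps.example.com domain, only the application name is caught. To check the live server, querying an arbitrary unknown name, for example dig some-unknown-host.example.com @10.4.1.100 +short, would be expected to stop returning the load balancer address once the wildcard is narrowed.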

Sorry for the noise, hopefully this saves others some time if they wind up in a similar DNS situation.

DizzyThermal on 4 Dec 2018

👍3

All 3 comments

@JayKayy mentioned that the template used to create the daemonsets (in his case) was using the wrong value for etcd_servers. It was trying to use the master[0] host, even though etcd isn't co-located there.

Since my inventory is also set up with external etcd hosts, maybe this is a similar issue?
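
A quick way to test that hypothesis would be to look at the etcd endpoints the deployed daemon set actually received. A rough sketch, assuming the catalog API server takes them through an argument whose name contains etcd-servers or etcd_servers (an assumption, not confirmed in this thread); catalog_etcd_servers is a made-up helper name:

```shell
# Hypothetical helper: print the etcd server argument of the catalog
# apiserver daemon set together with the line that follows it.
catalog_etcd_servers() {
  # Assumption: default kubeconfig path taken from the failing task output.
  kubeconfig=${1:-/etc/origin/master/admin.kubeconfig}
  oc --config="$kubeconfig" -n kube-service-catalog get ds apiserver -o yaml \
    | grep -i -A1 'etcd[-_]servers'
}
```

If the printed endpoints point at a master rather than at etcd1/etcd2, the template issue described above is a likely suspect.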

DizzyThermal on 4 Dec 2018

Hey all,

I need your experience. I searched for ns1.example.com (ns1) in the inventory but couldn't find it.
I would like to know what this host is for. Can you please help me?

best regards
khaled Moez