Openshift-ansible: service catalog install failed; are there some prerequisites?

Created on 22 Mar 2018 · 23 Comments · Source: openshift/openshift-ansible

Description

I have tried installing openshift-origin v3.7, v3.8, v3.9, and v3.10, but every attempt hit the following issue.
Are there prerequisites for the service catalog?

fatal: [dev.cefcfco.com]: FAILED! => {
    "attempts": 120,
    "changed": false,
    "cmd": [
        "curl",
        "-k",
        "https://apiserver.kube-service-catalog.svc/healthz"
    ],
    "delta": "0:00:01.188682",
    "end": "2018-03-22 02:32:27.933614",
    "invocation": {
        "module_args": {
            "_raw_params": "curl -k https://apiserver.kube-service-catalog.svc/healthz",
            "_uses_shell": false,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": false
        }
    },
    "rc": 0,
    "start": "2018-03-22 02:32:26.744932",
    "stderr": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0\r100   180  100   180    0     0    153      0  0:00:01  0:00:01 --:--:--   153",
    "stderr_lines": [
        "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current",
        "                                 Dload  Upload   Total   Spent    Left  Speed",
        "",
        "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0",
        "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0",
        "  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0",
        "100   180  100   180    0     0    153      0  0:00:01  0:00:01 --:--:--   153"
    ],
    "stdout": "[+]ping ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-service-catalog-apiserver-informers ok\n[-]etcd failed: reason withheld\nhealthz check failed",
    "stdout_lines": [
        "[+]ping ok",
        "[+]poststarthook/generic-apiserver-start-informers ok",
        "[+]poststarthook/start-service-catalog-apiserver-informers ok",
        "[-]etcd failed: reason withheld",
        "healthz check failed"
    ]
}
        to retry, use: --limit @/root/openshift-ansible/playbooks/byo/config.retry

INSTALLER STATUS ***********************************************************************************************************************************
Initialization             : Complete
Health Check               : Complete
etcd Install               : Complete
Master Install             : Complete
Master Additional Install  : Complete
Node Install               : Complete
Hosted Install             : Complete
Service Catalog Install    : In Progress
        This phase can be restarted by running: playbooks/byo/openshift-cluster/service-catalog.yml



Failure summary:


  1. Hosts:    dev.cefcfco.com
     Play:     Service Catalog
     Task:     wait for api server to be ready
     Message:  Failed without returning a message.






[root@feng ~]# curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
healthz check failed
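In this healthz output, the passing checks are prefixed with [+] and the failing one with [-]. A minimal Python sketch for extracting the failing checks from a response body like the one above; the function name failed_checks is hypothetical and not part of OpenShift or the service catalog:

```python
def failed_checks(healthz_body: str) -> list:
    """Return the names of checks marked as failed ("[-]") in a healthz body."""
    failed = []
    for line in healthz_body.splitlines():
        line = line.strip()
        if line.startswith("[-]"):
            # "[-]etcd failed: reason withheld" -> "etcd"
            failed.append(line[3:].split(" ", 1)[0])
    return failed


# Sample body copied from the curl output in this issue.
sample = """[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
healthz check failed"""

print(failed_checks(sample))  # prints ['etcd']
```

Here the only failing check is etcd, which suggests the apiserver pod itself is running but cannot reach its etcd backend.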






[root@dev ~]# oc get pods -n kube-service-catalog
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-qbjj7            1/1       Running   0          9m
controller-manager-ptz7v   1/1       Running   1          9m






[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
ansible_ssh_user=root
enable_excluders=False
enable_docker_excluder=False
ansible_service_broker_install=False

containerized=True
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability,package_version

deployment_type=origin
openshift_deployment_type=origin

openshift_release=v3.7.2
openshift_release=v3.7.2
openshift_pkg_version=v3.7.2
openshift_image_tag=v3.7.2
openshift_service_catalog_image_version=v3.7.2
template_service_broker_image_version=v3.7.2
openshift_metrics_image_version=v3.7.2

osm_use_cockpit=true

openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}]

openshift_public_hostname=dev.cefcfco.com
openshift_master_default_subdomain=apps.dev.cefcfco.com

[masters]
dev.cefcfco.com openshift_schedulable=true

[etcd]
dev.cefcfco.com

[nodes]
dev.cefcfco.com openshift_schedulable=true openshift_node_labels="{'region': 'infra', 'zone': 'default'}"

lifecycle/rotten

Source

HP-dufeng

👍4

All 23 comments

I have the same problem with OpenShift 3.7.

flipkill1985 on 22 Mar 2018

I went through all the related content I could find, but none of it gave me an answer.

HP-dufeng on 23 Mar 2018

If anyone knows the answer, please let me know.

HP-dufeng on 23 Mar 2018

@flipkill1985
I finally installed it successfully this morning.

[root@localhost ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                       READY     STATUS             RESTARTS   AGE
default                             docker-registry-1-zmgt4    1/1       Running            0          10m
default                             registry-console-1-6dnjv   1/1       Running            0          10m
default                             router-1-n479h             1/1       Running            0          12m
kube-service-catalog                apiserver-8sd62            1/1       Running            0          9m
kube-service-catalog                controller-manager-5bbvb   1/1       Running            0          9m
openshift-ansible-service-broker    asb-1-deploy               1/1       Running            0          8m
openshift-ansible-service-broker    asb-1-scb6l                0/1       ImagePullBackOff   0          8m
openshift-ansible-service-broker    asb-etcd-1-jq6s5           1/1       Running            0          8m
openshift-template-service-broker   apiserver-dd6mh            1/1       Running            0          7m

I had always followed this video: https://blog.openshift.com/installing-openshift-3-7-1-30-minutes/
but every attempt failed.

After I followed these steps one by one, it succeeded: https://docs.openshift.org/latest/install_config/install/host_preparation.html

Maybe I was missing some prerequisite packages before.

HP-dufeng on 23 Mar 2018

https://docs.openshift.org/latest/install_config/install/host_preparation.html

Isn't this for OpenShift 3.9 rather than 3.7.x?

flipkill1985 on 23 Mar 2018

@flipkill1985
I installed v3.7 successfully.

HP-dufeng on 23 Mar 2018

Can you post your steps and the playbook you use? Please :)

flipkill1985 on 23 Mar 2018

This time I used a basic config for testing, so I did not set the hostname, DNS, or Docker storage, but I think those are easy to add once you have a successful install.

yum install wget git net-tools bind-utils iptables-services bridge-utils bash-completion kexec-tools sos psacct

yum update

yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

sed -i -e "s/^enabled=1/enabled=0/" /etc/yum.repos.d/epel.repo

yum -y --enablerepo=epel install ansible pyOpenSSL

git clone https://github.com/openshift/openshift-ansible

cd openshift-ansible

git checkout release-3.7

cd ~/

yum install docker-1.13.1

systemctl start docker
systemctl enable docker

ssh-keygen -t rsa

-- change this to your host IP
ssh-copy-id -i ~/.ssh/id_rsa.pub 10.1.7.39

vi /etc/ansible/hosts

-- my hosts file; replace the IPs with your host IP
-- dev.cefcfco.com is my domain; replace it with yours

[OSEv3:children]
masters
nodes
etcd
nfs

[OSEv3:vars]
ansible_ssh_user=root

os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability,package_version

openshift_docker_options='--selinux-enabled --insecure-registry 172.30.0.0/16'

deployment_type=origin
openshift_deployment_type=origin

openshift_release=v3.7

openshift_hosted_etcd_storage_kind=nfs
openshift_hosted_etcd_storage_nfs_options="*(rw,root_squash,sync,no_wdelay)"
openshift_hosted_etcd_storage_nfs_directory=/opt/osev3-etcd 
openshift_hosted_etcd_storage_volume_name=etcd-vol2 
openshift_hosted_etcd_storage_access_modes=["ReadWriteOnce"]
openshift_hosted_etcd_storage_volume_size=1G
openshift_hosted_etcd_storage_labels={'storage': 'etcd'}

ansible_service_broker_image_prefix=openshift/
ansible_service_broker_registry_url="registry.access.redhat.com"

openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}]

openshift_public_hostname=dev.cefcfco.com
openshift_master_default_subdomain=apps.dev.cefcfco.com

[masters]
10.1.7.39 openshift_schedulable=true

[etcd]
10.1.7.39

[nfs]
10.1.7.39

[nodes]
10.1.7.39 openshift_schedulable=true openshift_node_labels="{'region': 'infra', 'zone': 'default'}"

ansible-playbook -i /etc/ansible/hosts openshift-ansible/playbooks/byo/config.yml -vvv

HP-dufeng on 23 Mar 2018

It doesn't work :( Which distribution do you use? I use CentOS 7.4.

That's the error:

fatal: [sp-peter02.os.peter.es]: FAILED! => {
    "attempts": 120,
    "changed": false,
    "cmd": [
        "curl",
        "-k",
        "https://apiserver.kube-service-catalog.svc/healthz"
    ],
    "delta": "0:00:00.144529",
    "end": "2018-03-23 09:34:21.024849",
    "invocation": {
        "module_args": {
            "_raw_params": "curl -k https://apiserver.kube-service-catalog.svc/healthz",
            "_uses_shell": false,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": false
        }
    },
    "rc": 0,
    "start": "2018-03-23 09:34:20.880320",
    "stderr": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r100   180  100   180    0     0   1311      0 --:--:-- --:--:-- --:--:--  1313",
    "stderr_lines": [
        "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current",
        "                                 Dload  Upload   Total   Spent    Left  Speed",
        "",
        "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0",
        "100   180  100   180    0     0   1311      0 --:--:-- --:--:-- --:--:--  1313"
    ],
    "stdout": "[+]ping ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-service-catalog-apiserver-informers ok\n[-]etcd failed: reason withheld\nhealthz check failed",
    "stdout_lines": [
        "[+]ping ok",
        "[+]poststarthook/generic-apiserver-start-informers ok",
        "[+]poststarthook/start-service-catalog-apiserver-informers ok",
        "[-]etcd failed: reason withheld",
        "healthz check failed"
    ]
}


# curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
healthz check failed

flipkill1985 on 23 Mar 2018

Please can someone help me???

flipkill1985 on 24 Mar 2018

That's the problem:

# curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
healthz check failed

flipkill1985 on 24 Mar 2018

@flipkill1985
In your Ansible hosts file, do you use IPs or hostnames?
I found that the install fails if I use hostnames but succeeds if I use IPs.
I guess this is a DNS problem,
so I will try installing a DNS server next.

HP-dufeng on 27 Mar 2018

I use hostnames.

flipkill1985 on 27 Mar 2018

@flipkill1985 I think this _may_ be related to https://github.com/openshift/origin/issues/17316

Do you have a wildcard entry for *.dev.cefcfco.com configured in your DNS?

I've recently experienced a similar issue where the apiserver pod failed to resolve the etcd hosts correctly because the DNS lookup was matching a wildcard DNS entry, due to the search and ndots configuration in /etc/resolv.conf inside the apiserver pod.
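One way to check for this kind of mismatch is to compare how a name resolves with and without a trailing dot, since the trailing dot makes the resolver skip the search list. A minimal Python sketch, meant to be run from inside the affected pod or host; the function name resolves_differently is hypothetical, and the injectable resolve parameter is only there so the logic can be exercised without real DNS:

```python
import socket


def resolves_differently(name, resolve=socket.gethostbyname):
    """Compare the search-list lookup of `name` with its absolute form.

    Returns (differs, relative_address, absolute_address).
    """
    bare = name.rstrip(".")
    relative = resolve(bare)        # subject to search/ndots in resolv.conf
    absolute = resolve(bare + ".")  # trailing dot: looked up exactly as written
    return relative != absolute, relative, absolute


# Usage (performs real DNS lookups, so it is left commented out):
# print(resolves_differently("etcd-host.example.com"))
```

If the two addresses differ, a search-domain expansion (for example one matched by a wildcard record) is answering before the intended name is tried.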

ich199 on 10 Apr 2018

👀1

See my comment here, as I found similar behavior: https://github.com/openshift/openshift-ansible/issues/8076

JayKayy on 10 May 2018

I'm running into the same issue.

[OSEv3:children]
masters
nodes
etcd
lb

[OSEv3:vars]
ansible_python_interpreter=/usr/bin/python3
ansible_ssh_user=fedora
ansible_become=true
openshift_deployment_type=origin
openshift_release=v3.9
openshift_master_cluster_method=native

openshift_master_cluster_hostname=k8s.unigs.de
openshift_master_cluster_public_hostname=cloud.unigs.de

[masters]
node1.k8s.unigs.de
node3.k8s.unigs.de
node5.k8s.unigs.de

[etcd]
node1.k8s.unigs.de
node3.k8s.unigs.de
node5.k8s.unigs.de

[lb]
lb.k8s.unigs.de ansible_python_interpreter=/usr/bin/python ansible_ssh_user=root

[nodes]
node1.k8s.unigs.de openshift_node_labels="{'region': 'infra','zone': 'default'}"
node3.k8s.unigs.de openshift_node_labels="{'region': 'infra','zone': 'default'}"
node5.k8s.unigs.de openshift_node_labels="{'region': 'infra','zone': 'default'}"
node2.k8s.unigs.de openshift_node_labels="{'region': 'infra','primary': 'default'}"
node4.k8s.unigs.de openshift_node_labels="{'region': 'infra','primary': 'default'}"
node6.k8s.unigs.de openshift_node_labels="{'region': 'infra','primary': 'default'}"

Nodes 1 to 6 are Fedora Atomic; lb is CentOS 7. All are on the latest version.

I have run all the preparation commands and set up a fully working DNS (including wildcard records, which point to the lb).

I noticed that about 1 in 3 runs of curl -k https://apiserver.kube-service-catalog.svc/healthz returns ok.

Is there anything I can provide to give you a clue about what could be wrong?

foosinn on 23 May 2018

On a retest from scratch with an external load balancer, I got stuck on exactly the same error.

inventory
error message

The healthz URL seems to work on only a single node. It fails in ~66% of the curls.

for i in {1..1000}; do curl -s -k https://apiserver.kube-service-catalog.svc/healthz \
  | grep -oE '^ok|etcd.*'; done \
  | sort \
  | uniq -c
    662 etcd failed: reason withheld
    338 ok

foosinn on 25 May 2018

I think I found the reason why it's not working:

Some of the API servers do not work:

apiserver-8n5g5   @node1   curl -k https://10.128.0.4:6443  healthz check failed
apiserver-cdbfh   @node3   curl -k https://10.129.0.4:6443  healthz check failed
apiserver-n4qm7   @node2   curl -k https://10.130.0.6:6443  ok

A quick look with describe showed me that they try to resolve the etcd servers:

    Command:
      /usr/bin/service-catalog
    Args:
      apiserver
      --storage-type
      etcd
      --secure-port
      6443
      --etcd-servers
      https://node1.k8s.unigs.de:2379,https://node2.k8s.unigs.de:2379,https://node3.k8s.unigs.de:2379
      --etcd-cafile
      /etc/origin/master/master.etcd-ca.crt
      --etcd-certfile
      /etc/origin/master/master.etcd-client.crt
      --etcd-keyfile
      /etc/origin/master/master.etcd-client.key
      -v
      3
      --cors-allowed-origins
      localhost
      --admission-control
      KubernetesNamespaceLifecycle,DefaultServicePlan,ServiceBindingsLifecycle,ServicePlanChangeValidator,BrokerAuthSarCheck
      --feature-gates
      OriginatingIdentity=true

I exec'd into the container and ran the following commands:

sh-4.2# ping node1.k8s.unigs.de
PING node1.k8s.unigs.de.k8s.unigs.de (10.18.255.99) 56(84) bytes of data.
64 bytes from lb.k8s.unigs.de (10.18.255.99): icmp_seq=1 ttl=63 time=0.213 ms

That is clearly wrong. Notice the trailing dot in the next command.

sh-4.2# ping node1.k8s.unigs.de.
PING node1.k8s.unigs.de (10.18.255.1) 56(84) bytes of data.
64 bytes from node1.k8s.unigs.de (10.18.255.1): icmp_seq=1 ttl=63 time=0.730 ms

oh interesting!

sh-4.2# cat /etc/resolv.conf  
nameserver 10.18.255.2
search kube-service-catalog.svc.cluster.local svc.cluster.local cluster.local k8s.unigs.de
options ndots:5

As far as I understand it, the ndots:5 option makes the resolver try the search domains first for any hostname with fewer than 5 dots. Mine has 3, so node1.k8s.unigs.de gets resolved as node1.k8s.unigs.de.k8s.unigs.de.

Does this ndots option make sense? And how can I force it to use the domain name I provided?

I tried adding openshift_ip= to all of my hosts, but that did not change the result.
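The lookup order implied by the search and ndots settings can be sketched in Python. This is an approximation of the glibc-style stub resolver behavior described in resolv.conf(5), not the resolver's actual code, and the function name lookup_order is hypothetical:

```python
def lookup_order(name, search, ndots):
    """Approximate the order in which a glibc-style stub resolver tries names."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name.rstrip(".")]
    expanded = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as written first, then the search list.
        return [name] + expanded
    # Too few dots: try every search suffix first, the name as written last.
    return expanded + [name]


# Values taken from the /etc/resolv.conf shown above.
search = [
    "kube-service-catalog.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "k8s.unigs.de",
]

for candidate in lookup_order("node1.k8s.unigs.de", search, ndots=5):
    print(candidate)
```

Because node1.k8s.unigs.de has fewer than 5 dots, the name as written is tried last. The first candidate that returns an answer wins, so anything that answers for node1.k8s.unigs.de.k8s.unigs.de shadows the real host, while the trailing-dot form skips the search list entirely.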

foosinn on 28 May 2018

👍2

I finally got it to work. The cause of the issue was that I had a wildcard A record on the domain I used.
If there is no wildcard entry, node1.k8s.unigs.de.k8s.unigs.de does not resolve, and the resolver goes on to try the correct name.

I redeployed the same setup on another domain without a wildcard record, and it worked!
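A quick way to test a domain for a wildcard record before deploying is to look up a random label that almost certainly does not exist. A minimal Python sketch under that assumption; the function name has_wildcard and the probe label format are hypothetical, and the resolve parameter is injectable only so the logic can be checked without real DNS:

```python
import socket
import uuid


def has_wildcard(domain, resolve=socket.gethostbyname):
    """Return True if a random, almost certainly nonexistent label under
    `domain` resolves, which suggests a wildcard record."""
    # Trailing dot: query the name exactly as written, bypassing search domains.
    probe = f"wildcard-probe-{uuid.uuid4().hex}.{domain.rstrip('.')}."
    try:
        resolve(probe)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True


# Usage (performs a real DNS lookup, so it is left commented out):
# print(has_wildcard("k8s.unigs.de"))
```

A True result only shows that something answers for arbitrary names under the domain; it does not say which record type or which DNS server is responsible.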

This may also work for these issues:

#8195 #7611 #6572 #8076 #7639 #7578 #6355

foosinn on 29 May 2018

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot on 24 May 2020

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot on 23 Jun 2020

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot on 23 Jul 2020

@openshift-bot: Closing this issue.
