The installer hangs while waiting for the catalog-api-server. The reason is that the command curl -k https://apiserver.kube-service-catalog.svc/healthz cannot resolve the hostname of the service. The installer appears to be missing some SkyDNS setup, because there is no DNS server entry that points to the local cluster:
[vagrant@master ~]$ sudo cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local vnet.de
nameserver 192.168.60.150 <- this is the node-ip (master)
[vagrant@master ~]$ sudo cat /etc/dnsmasq.d/origin-upstream-dns.conf
server=192.168.60.100 <- this is my external DNS which resolves router-wildcards and so on
[vagrant@master ~]$
A workaround is an Ansible task that adds the following line to /etc/dnsmasq.d/origin-upstream-dns.conf on all nodes:
server=/svc/172.30.0.1
Once this line has been added and dnsmasq has been restarted, curl -k https://apiserver.kube-service-catalog.svc/healthz returns ok.
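For anyone who wants to try the workaround by hand before wrapping it in an Ansible task, here is a minimal shell sketch. The function name add_svc_forwarder is hypothetical; the file path and the 172.30.0.1 address are the ones from this report and may differ on other clusters.

```shell
#!/usr/bin/env bash
# Sketch: idempotently add a dnsmasq forwarder for the .svc domain.
# add_svc_forwarder is a hypothetical helper, not part of openshift-ansible.

add_svc_forwarder() {
    local conf="$1"      # e.g. /etc/dnsmasq.d/origin-upstream-dns.conf
    local dns_ip="$2"    # e.g. 172.30.0.1 (cluster DNS address in this report)
    local line="server=/svc/${dns_ip}"

    # Append the line only if the exact line is not already present,
    # so that repeated runs do not create duplicates.
    if ! grep -qxF "$line" "$conf" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf"
    fi
}

# Usage on each node (as root), followed by a restart so that dnsmasq
# rereads its configuration:
#   add_svc_forwarder /etc/dnsmasq.d/origin-upstream-dns.conf 172.30.0.1
#   systemctl restart dnsmasq
```

The append is guarded by an exact-line match so the helper can be run repeatedly, which is the same property an Ansible task would need.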
There have been several reports that this is a major problem in 3.10; our 3.9 installation does not suffer from it.
Ansible: 2.6.4
openshift-ansible-3.10.53-1
Expected: the installer (deploy_cluster.yml) finishes successfully without additional workarounds.
Observed: without the workaround mentioned above, no Kubernetes service can be resolved within the cluster.
Inventory:
# Create an OSEv3 group that contains the masters, nodes, and etcd groups
[OSEv3:children]
masters
nodes
etcd
# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=vagrant
# If ansible_ssh_user is not root, ansible_become must be set to true
ansible_become=true
openshift_deployment_type=origin
openshift_release=v3.10
openshift_master_cluster_public_hostname=openshift.vnet.de
openshift_master_default_subdomain=apps.vnet.de
openshift_disable_check=memory_availability,disk_availability,docker_storage, docker_storage_driver
# host group for masters
[masters]
master openshift_public_ip=192.168.60.150 ip=192.168.60.150
# host group for etcd
[etcd]
master etcd_ip=192.168.60.150 ip=192.168.60.150
# host group for nodes, includes region info
[nodes]
master openshift_node_group_name='node-config-master' openshift_public_ip=192.168.60.150 ip=192.168.60.150
infra openshift_node_group_name='node-config-infra' openshift_public_ip=192.168.60.160 ip=192.168.60.160
app1 openshift_node_group_name='node-config-compute' openshift_public_ip=192.168.60.170 ip=192.168.60.170
app2 openshift_node_group_name='node-config-compute' openshift_public_ip=192.168.60.171 ip=192.168.60.171
Same problem here.
In roles/openshift_node/files/networkmanager/99-origin-dns.sh there is a check:
if [ ! -f /etc/dnsmasq.d/origin-dns.conf ]; then
that prevents the file from being created with server entries.
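To see whether a node is affected, one could check whether any dnsmasq configuration file contains a forwarder for the cluster domains. The following is only a sketch: list_cluster_forwarders is a hypothetical helper, and the domains svc and cluster.local are taken from this thread.

```shell
#!/usr/bin/env bash
# Sketch: list dnsmasq "server=/<domain>/<ip>" forwarders for cluster domains.
# list_cluster_forwarders is a hypothetical helper name.

list_cluster_forwarders() {
    local conf_dir="${1:-/etc/dnsmasq.d}"
    # -r: search every file below the directory
    # -h: print matching lines without file names
    # -E: extended regex matching forwarders for svc or cluster.local
    grep -rhE '^server=/(svc|cluster\.local)/' "$conf_dir" 2>/dev/null
}

# Usage: no output and a non-zero exit status suggest that no forwarder is
# configured, which would be consistent with the failure reported here.
#   list_cluster_forwarders /etc/dnsmasq.d || echo "no cluster forwarders found"
```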
Running into the same issue when attempting to upgrade to the latest v3_10. The existing v3_10 cluster was installed using openshift-ansible-3.10.51-1.git.0.44a646c.el7 without issue. I upgraded the Ansible playbooks to openshift-ansible-3.10.68-1.git.0.f908cf5.el7 and tried to run the upgrade.yml playbook, which fails with "Could not resolve host: apiserver.kube-service-catalog.svc; Unknown error".
We run into the same issue with OKD 3.11.37 on CentOS 7.6.
I'm getting the same issue on RHEL 7.8:
[root@masteroc openshift-ansible]# git branch
master
* release-3.11
[root@masteroc openshift-ansible]# cat /etc/*release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.8 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.8"
PRETTY_NAME="Red Hat Enterprise Linux"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.8:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.8"
Red Hat Enterprise Linux Server release 7.8 (Maipo)
Red Hat Enterprise Linux Server release 7.8 (Maipo)
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.