Openshift-ansible: [3.10] DNS not set up correctly - cannot resolve services

Created on 10 Oct 2018 · 8 Comments · Source: openshift/openshift-ansible

Description

The installer hangs while waiting for the catalog-api-server. The reason is that the command curl -k https://apiserver.kube-service-catalog.svc/healthz cannot resolve the hostname of the service. The installer appears to be missing some SkyDNS setup, because dnsmasq has no entry that points to the local cluster.

[vagrant@master ~]$ sudo cat /etc/resolv.conf 
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local vnet.de
nameserver 192.168.60.150 <- this is the node-ip (master)

[vagrant@master ~]$ sudo cat /etc/dnsmasq.d/origin-upstream-dns.conf 
server=192.168.60.100 <- this is my external DNS which resolves router-wildcards and so on
[vagrant@master ~]$
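
To check a node for this condition without reading every file by hand, something like the following could be used. It is only a sketch: has_domain_forwarder is a hypothetical helper name, and it assumes the per-domain forwarders use the usual dnsmasq form server=/<domain>/<address>.

```shell
#!/bin/sh
# Sketch: report whether any dnsmasq drop-in forwards a given domain.
# has_domain_forwarder is a hypothetical helper, not part of openshift-ansible.
has_domain_forwarder() {
    conf_dir="$1"
    domain="$2"
    # Per-domain forwarders look like: server=/<domain>/<address>
    # Note: dots in $domain are treated as regex wildcards here.
    # -s silences errors when no *.conf file exists; the function then fails.
    grep -qs "^server=/${domain}/" "$conf_dir"/*.conf
}

# Example:
#   has_domain_forwarder /etc/dnsmasq.d svc || echo "no forwarder for .svc"
```

With only the origin-upstream-dns.conf content shown above, this check would report that no forwarder for svc is present.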

A workaround is an Ansible task which adds the following line to /etc/dnsmasq.d/origin-upstream-dns.conf on all nodes:

server=/svc/172.30.0.1

Once this has been added and dnsmasq is restarted, the curl -k https://apiserver.kube-service-catalog.svc/healthz returns ok.
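
For anyone who wants to apply the workaround by hand rather than through Ansible, a minimal shell sketch of the per-node step follows. The function name add_svc_forwarder is made up for illustration; the line it writes and the file path in the example are the ones quoted above.

```shell
#!/bin/sh
# Sketch: idempotently add the workaround line to a dnsmasq drop-in file.
# add_svc_forwarder is an illustrative name, not part of openshift-ansible.
add_svc_forwarder() {
    conf_file="$1"
    entry="server=/svc/172.30.0.1"   # line quoted in the report above

    # Create the file if it is missing so the checks below have input.
    [ -f "$conf_file" ] || : > "$conf_file"

    # If the file does not end with a newline, add one before appending.
    if [ -n "$(tail -c1 "$conf_file")" ]; then
        printf '\n' >> "$conf_file"
    fi

    # Append only when the exact line is absent, so repeated runs are harmless.
    if ! grep -qxF "$entry" "$conf_file"; then
        printf '%s\n' "$entry" >> "$conf_file"
    fi
}

# Example (as root on every node, then restart dnsmasq):
#   add_svc_forwarder /etc/dnsmasq.d/origin-upstream-dns.conf
#   systemctl restart dnsmasq
```

Because the function only appends when the line is missing, it can be run again after a reinstall or upgrade without duplicating the entry.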

There have been several reports of this being a major problem in 3.10; our 3.9 installation does not suffer from it.

Version

Ansible: 2.6.4
openshift-ansible-3.10.53-1

Steps To Reproduce

Run ansible-playbook prerequisites
Run ansible-playbook deploy_cluster

Expected Results

Installer (deploy_cluster.yaml) finishes successfully without additional workarounds

Observed Results

Without the workaround described above, no Kubernetes service can be resolved within the cluster.

Additional Information

Inventory.

# Create an OSEv3 group that contains the masters, nodes, and etcd groups
[OSEv3:children]
masters
nodes
etcd

# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=vagrant

# If ansible_ssh_user is not root, ansible_become must be set to true
ansible_become=true

openshift_deployment_type=origin
openshift_release=v3.10
openshift_master_cluster_public_hostname=openshift.vnet.de
openshift_master_default_subdomain=apps.vnet.de

openshift_disable_check=memory_availability,disk_availability,docker_storage,docker_storage_driver

# host group for masters
[masters]
master openshift_public_ip=192.168.60.150 ip=192.168.60.150

# host group for etcd
[etcd]
master etcd_ip=192.168.60.150 ip=192.168.60.150

# host group for nodes, includes region info
[nodes]
master openshift_node_group_name='node-config-master' openshift_public_ip=192.168.60.150 ip=192.168.60.150
infra openshift_node_group_name='node-config-infra' openshift_public_ip=192.168.60.160 ip=192.168.60.160
app1 openshift_node_group_name='node-config-compute' openshift_public_ip=192.168.60.170 ip=192.168.60.170
app2 openshift_node_group_name='node-config-compute' openshift_public_ip=192.168.60.171 ip=192.168.60.171

lifecycle/rotten

lostiniceland

👍1

All 8 comments

Same problem here.
In roles/openshift_node/files/networkmanager/99-origin-dns.sh there is a check:

if [ ! -f /etc/dnsmasq.d/origin-dns.conf ]; then

that prevents the file from being created with server entries.
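
To illustrate why a guard of that shape can leave an incomplete file in place, here is a small sketch of the "write only if missing" pattern. It is not the actual content of 99-origin-dns.sh; write_if_missing and the sample content are hypothetical.

```shell
#!/bin/sh
# Sketch of a "create only when absent" guard, as in the check quoted above.
# write_if_missing is a hypothetical helper; the real script's body differs.
write_if_missing() {
    target="$1"
    content="$2"
    # The body runs only when the file does not exist yet. A file that is
    # already present is never rewritten, whatever it contains.
    if [ ! -f "$target" ]; then
        printf '%s\n' "$content" > "$target"
    fi
}

# Example:
#   write_if_missing /tmp/origin-dns.conf "no-resolv"
#   write_if_missing /tmp/origin-dns.conf "server=/svc/172.30.0.1"   # no effect
```

Under this pattern, a second call with different content changes nothing, which would be consistent with the behaviour described in this comment.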

amon-ra on 11 Dec 2018

👍1

Running into the same issue when attempting to upgrade to the latest v3_10. The existing v3_10 cluster was installed using openshift-ansible-3.10.51-1.git.0.44a646c.el7 without issue. I upgraded the Ansible playbooks to openshift-ansible-3.10.68-1.git.0.f908cf5.el7 and tried to run the upgrade.yml playbook, which fails with "Could not resolve host: apiserver.kube-service-catalog.svc; Unknown error".

skynardo on 30 May 2019

👍1

We ran into the same issue with OKD 3.11.37 on CentOS 7.6.

datapresso on 16 Sep 2019

👍1

I'm getting the same issue on RHEL7.7

[root@masteroc openshift-ansible]# git branch
  master
* release-3.11

[root@masteroc openshift-ansible]# cat /etc/*release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.8 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.8"
PRETTY_NAME="Red Hat Enterprise Linux"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.8:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.8"
Red Hat Enterprise Linux Server release 7.8 (Maipo)
Red Hat Enterprise Linux Server release 7.8 (Maipo)

dbgoytia on 3 Apr 2020

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot on 2 Jul 2020

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot on 1 Aug 2020

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot on 31 Aug 2020

@openshift-bot: Closing this issue.
