Openshift-ansible: resolv.conf not configured properly

Created on 19 Sep 2017 · 16 Comments · Source: openshift/openshift-ansible

Description

After installing OpenShift Origin 3.6, resolv.conf does not contain the necessary
'search svc.cluster.local cluster.local'

Version

Ansible version - 2.3.1.0
git describe - openshift-ansible-3.6.173.0.37-1-2-g5929b6c

Steps To Reproduce

Clean install of RHEL 7.4
Install Docker and configure storage
ansible-playbook -i hosts openshift-ansible/playbooks/byo/config.yml

Expected Results

Installing OpenShift places a script (99-origin-dns.sh) in /etc/NetworkManager/dispatcher.d that should update /etc/resolv.conf with the appropriate values whenever the network status changes.

Observed Results

The script mentioned above is in fact created during the installation process. As far as I can tell, it also gets executed any time the network status changes. Something else, however, seems to come along right afterwards and wipe out the changes that 99-origin-dns.sh made to resolv.conf, so DNS does not work for the cluster.
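For anyone trying to confirm the symptom on their own nodes, a minimal check might look like the sketch below. The function name `has_cluster_search` is made up for illustration, and the check only looks for a `search` line mentioning cluster.local; it does not verify the nameserver entry.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the given resolv.conf-style file
# has a "search" line that includes cluster.local, fail otherwise.
has_cluster_search() {
    grep -Eq '^search[[:space:]].*cluster\.local' "$1"
}

# Usage on a node (commented out so the snippet has no side effects):
#   has_cluster_search /etc/resolv.conf && echo "search domains present" \
#       || echo "search domains missing"
```

Running the check right after a network change and again a few seconds later may help show whether the file is rewritten after the dispatcher script ran.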

Additional Information

This problem seems to be related to https://github.com/openshift/origin/issues/16097. Based on the information in that issue and the code referenced when it was closed, it looks like this should be working, but for some reason the change is not being persisted.

lifecycle/rotten

Source

ringmaster217

👍1

Most helpful comment

Well, I noticed that resolv.conf was not being changed properly even though the logs showed 99-origin-dns.sh being executed, so there is some kind of race going on. I didn't find the time to properly debug NetworkManager, but as a quick fix waiting a second is good enough here. It's just a test environment ... :)

Klaas- on 8 Mar 2018

😄1 👍1

All 16 comments

Also seem to be running into this issue after running the openshift-ansible-3.6.173.0.41-1 playbooks to scale up a number of nodes. Restarting the NetworkManager service temporarily fixes the problem, but resolv.conf gets overwritten after a while, or on reboot of the machine.

rezie on 26 Sep 2017

This is non-ideal, but as a (hopefully) temporary workaround, I can modify /etc/sysconfig/network-scripts/ifcfg-<IFACE> to add SEARCH="cluster.local". This seems to persist across a reboot, which had been the sticking point for me.

update#1
Never mind - the workaround doesn't behave consistently across repeated reboots. It feels like there's a race condition going on here.

update#2
After further debugging, this looks like it might be caused by something in our environment. But for reference in case others run into this issue, depending on your settings (e.g. if NM_CONTROLLED is set to yes) you should be able to confirm via /var/log/messages that dhclient first updates resolv.conf, followed by the 99-origin-dns.sh script, once nm-dispatcher starts up.
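A rough way to see that ordering is to filter the log for the two process names. This is only a sketch: the function name `show_dns_writers` is hypothetical, and it assumes both dhclient and nm-dispatcher log to /var/log/messages under those names on your system.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that mention dhclient or nm-dispatcher,
# in file order, so the sequence of resolv.conf writers can be compared.
# $1 = log file (defaults to /var/log/messages)
show_dns_writers() {
    grep -E 'dhclient|nm-dispatcher' "${1:-/var/log/messages}"
}

# Usage on a node (commented out; reading the log usually needs root):
#   show_dns_writers | tail -n 20
```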

I haven't tested this option yet, but from Red Hat's docs, it sounds like it might be possible to make resolv.conf immutable as another workaround.

rezie on 27 Sep 2017

update#3
I fixed the issue in our environment that was overwriting resolv.conf, but it looks like something else continues to overwrite the file. Strangely enough, I'm noticing that after a certain amount of time (the dhclient renewal time - check the NetworkManager logs), resolv.conf does contain the changes made by 99-origin-dns.sh. However, this is still unacceptable, because the initial configuration after a reboot is incorrect and the nodes are not in a usable state until then.

Another workaround to fix the issue at boot is to modify the /etc/sysconfig/network-scripts/ifcfg-<IFACE> file to set PEERDNS to no and manually add the correct nameserver (the IP of the node) and the search domain (cluster.local), which will correctly configure the node's resolv.conf to provide pods with a working DNS configuration/service.
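As an illustration of that workaround, the relevant part of the interface file might look like the fragment below. The IP address is a documentation placeholder, and the use of DNS1 and DOMAIN as the keys for the nameserver and search domain is an assumption here; the comment above does not say which keys were used.

```shell
# /etc/sysconfig/network-scripts/ifcfg-<IFACE> (fragment, other keys unchanged)

# Keep DNS settings supplied by DHCP out of resolv.conf.
PEERDNS=no

# Placeholder address: replace with the IP of this node.
DNS1=192.0.2.10

# Assumed key for the search domain.
DOMAIN=cluster.local
```

The network service (or the node) would need a restart for the change to be picked up.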

rezie on 28 Sep 2017

@rezie As you mentioned, Red Hat published a solution for making permanent changes to /etc/resolv.conf that involves making the file immutable. I have been using this as a workaround, and it's been working well.
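For reference, making a file immutable is normally done with `chattr +i`. The sketch below wraps that in two hypothetical helper functions. Note the trade-off: once the flag is set, nothing can rewrite the file, including 99-origin-dns.sh, so the contents must already be correct and have to be maintained by hand.

```shell
#!/bin/sh
# Hypothetical helpers around chattr; both need root.
# $1 = file to lock or unlock (defaults to /etc/resolv.conf)

# Set the immutable flag so no process can modify or replace the file.
lock_resolv_conf() {
    chattr +i "${1:-/etc/resolv.conf}"
}

# Clear the flag again, for example before editing the file by hand.
unlock_resolv_conf() {
    chattr -i "${1:-/etc/resolv.conf}"
}

# Usage on a node (commented out so the snippet has no side effects):
#   lock_resolv_conf
#   lsattr /etc/resolv.conf    # the attribute list should now include "i"
```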

ringmaster217 on 28 Sep 2017

I have a similar issue with openshift-enterprise 3.7, but it doesn't seem to affect my master, only the nodes. I see in logs that /etc/NetworkManager/dispatcher.d/99-origin-dns.sh is being executed and it seems to work but something else is overwriting the resolv.conf afterwards. If I insert sleep 1 into /etc/NetworkManager/dispatcher.d/99-origin-dns.sh it works :)

Greetings
Klaas

Klaas- on 15 Feb 2018

@Klaas- Where exactly did you insert the sleep statement?

chris-str-cst on 8 Mar 2018

@chris-str-cst
/etc/NetworkManager/dispatcher.d/99-origin-dns.sh
Near the start of the script: I inserted the line, plus a comment explaining why I changed it, above "cd /etc/sysconfig/network-scripts".
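A scripted version of that edit could look like the sketch below. It assumes GNU sed (as shipped on RHEL and CentOS) and that the installed script contains the `cd /etc/sysconfig/network-scripts` line mentioned above; the function name and the comment text are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: insert "sleep 1" above the
# "cd /etc/sysconfig/network-scripts" line of a dispatcher script.
# $1 = path to the script, e.g. /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
add_dispatcher_delay() {
    # Do nothing if a previous run already added the delay.
    grep -q '^[[:space:]]*sleep 1 # workaround' "$1" && return 0
    # GNU sed one-line "i" command: insert the text before each matching line.
    sed -i '/cd \/etc\/sysconfig\/network-scripts/i sleep 1 # workaround: suspected race with another resolv.conf writer' "$1"
}

# Usage on a node (commented out; needs root):
#   add_dispatcher_delay /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
```

Keep in mind that re-running the installer playbooks may replace the script and drop the change.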

Klaas- on 8 Mar 2018

👍1

@Klaas- Thanks mate! This fixes my network issues. I reinstalled the cluster fresh last week and the same problems were starting again. Thanks for providing a solution! How did you find that?

chris-str-cst on 8 Mar 2018

Klaas- on 8 Mar 2018

😄1 👍1

I am hitting this only on some of the nodes after I upgraded from 3.6 to 3.7.42-1. I potentially have the same issue on one 3.6 node too, as its resolv.conf looks different from the others.

In my case it was fixed by just restarting NetworkManager. I also rebooted the node to see whether it was still OK, and it was. Apparently there is still some race condition going on.

bortek on 18 May 2018

I will post this here too, since I am not the only one who has discovered that the dispatcher is called:

I hit exactly this issue in Origin 3.9 on CentOS 7.5 and added a lot of debug output, written to a file, to the dispatcher.d/99-origin-dns.sh script.

I started having the issue when I switched my NetworkManager config from dhcp-provided IPs to static configuration.

What I found out is the following:

1. NetworkManager writes /etc/resolv.conf with the static resolver/search/etc.
2. The dispatcher script is called with $2 = up; the script sets up /etc/resolv.conf correctly.
3. The dispatcher script is called with $2 = connectivity-change; it does nothing, and /etc/resolv.conf stays identical (working).
4. NetworkManager overwrites /etc/resolv.conf with the static resolver/search/etc.
5. The dispatcher script is not called, which leaves OpenShift unable to pull images from the private registry and new pods unable to resolve Kubernetes services.
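The kind of debug output that produces such a trace can be sketched as follows. This is not the actual debugging code used above; the function name and log path are hypothetical. It relies only on the fact that NetworkManager passes the interface as $1 and the action as $2 to dispatcher scripts.

```shell
#!/bin/sh
# Hypothetical helper: append one record per dispatcher invocation, followed
# by an indented copy of the current resolv.conf contents.
# $1 = interface, $2 = action, $3 = log file,
# $4 = resolv.conf path (defaults to /etc/resolv.conf)
log_dispatcher_event() {
    {
        printf '%s iface=%s action=%s\n' "$(date '+%F %T')" "$1" "$2"
        sed 's/^/    /' "${4:-/etc/resolv.conf}"
    } >> "$3"
}

# Usage near the top of a dispatcher script (commented out here):
#   log_dispatcher_event "$1" "$2" /var/log/origin-dns-debug.log
```

Comparing the logged contents with the file a few seconds later shows whether something rewrote resolv.conf without triggering the dispatcher.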

I have found a workaround: setting dns=none in the [main] section of the NetworkManager config makes things work in my case. With that setting NetworkManager no longer writes /etc/resolv.conf itself, but the dispatcher script is still called and so sets up /etc/dnsmasq.d/origin-upstream-dns.conf correctly with the NetworkManager-provided resolvers, and /etc/resolv.conf with the local dnsmasq IP.
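One way to apply that setting without editing the main config file is a drop-in under /etc/NetworkManager/conf.d. That location and the file name 90-dns-none.conf are assumptions for this sketch; putting the same two lines into /etc/NetworkManager/NetworkManager.conf should be equivalent.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that sets dns=none in [main], which
# tells NetworkManager not to manage resolv.conf.
# $1 = drop-in directory (defaults to /etc/NetworkManager/conf.d)
write_nm_dns_none() {
    printf '[main]\ndns=none\n' > "${1:-/etc/NetworkManager/conf.d}/90-dns-none.conf"
}

# Usage on a node (commented out; needs root):
#   write_nm_dns_none
#   systemctl restart NetworkManager
```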

dabelenda on 16 Jul 2018

In my case (OCP 3.9 containerized on Atomic) I had to add this to '/etc/dnsmasq.d/node-dnsmasq.conf' and restart dnsmasq.

server=/default.svc/172.30.0.1
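If this needs to be applied to several nodes, an idempotent version of the edit might look like the sketch below. The function name is hypothetical, and 172.30.0.1 is simply the address quoted above; it may differ in other clusters.

```shell
#!/bin/sh
# Hypothetical helper: append the server= line to a dnsmasq config file
# unless an identical line is already present.
# $1 = config file, e.g. /etc/dnsmasq.d/node-dnsmasq.conf
add_dnsmasq_server() {
    line='server=/default.svc/172.30.0.1'
    grep -qxF "$line" "$1" 2>/dev/null || printf '%s\n' "$line" >> "$1"
}

# Usage on a node (commented out; needs root):
#   add_dnsmasq_server /etc/dnsmasq.d/node-dnsmasq.conf
#   systemctl restart dnsmasq
```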

ToroNZ on 24 Jul 2018

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot on 25 May 2020

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot on 25 Jun 2020

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot on 25 Jul 2020

@openshift-bot: Closing this issue.
