After installing OpenShift Origin 3.6, resolv.conf does not contain the necessary
'search svc.cluster.local cluster.local'
Ansible version - 2.3.1.0
git describe - openshift-ansible-3.6.173.0.37-1-2-g5929b6c
Installing OpenShift places a script (99-origin-dns.sh) in /etc/NetworkManager/dispatcher.d that should update /etc/resolv.conf with the appropriate values whenever the network status changes.
The script mentioned above is in fact created during the installation process. As far as I can tell, it also gets executed any time the network status changes. Something else, however, seems to come along right afterwards and clean out the changes that 99-origin-dns.sh made to resolv.conf, resulting in DNS not working for the cluster.
This problem seems to be related to https://github.com/openshift/origin/issues/16097. Based on the information in that issue and the code referenced when it was closed, it looks like this should be working, but for some reason the change is not being persisted.
Also seem to be running into this issue after running the openshift-ansible-3.6.173.0.41-1 playbooks to scale up a number of nodes. Restarting the NetworkManager service temporarily fixes the problem, but resolv.conf gets overwritten after a while, or on reboot of the machine.
This is non-ideal, but as a (hopefully) temporary workaround, I can modify /etc/sysconfig/network-scripts/ifcfg-<IFACE> to add SEARCH="cluster.local". This seems to persist across a reboot, which had been the sticking point for me.
update#1
Never mind - the solution doesn't seem consistent after repeated reboots. It feels like there's a race condition going on here.
update#2
After further debugging, this looks like it might be caused by something in our environment. But for reference in case others run into this issue, depending on your settings (e.g. if NM_CONTROLLED is set to yes) you should be able to confirm via /var/log/messages that dhclient first updates resolv.conf, followed by the 99-origin-dns.sh script, once nm-dispatcher starts up.
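To check the ordering described above, something like the following could be used. It is only a sketch: the function name resolv_writers and the match patterns are my own assumptions, and the process names that actually appear in the log may differ on your system.

```shell
# List log lines from the components that may touch resolv.conf, in log order.
# Assumption: the patterns below are illustrative; adjust them to what your
# log actually contains.
resolv_writers() {
    log_file="${1:-/var/log/messages}"
    grep -E 'dhclient|nm-dispatcher|99-origin-dns' "$log_file"
}
```

Calling resolv_writers with no argument reads /var/log/messages (usually as root). If dhclient entries appear after the last nm-dispatcher entry, that would be consistent with the dispatcher script's changes being overwritten.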
I haven't tested this option yet, but from Red Hat's docs, it sounds like it might be possible to make resolv.conf immutable as another workaround.
update#3
I fixed the issue in our environment that was overwriting resolv.conf, but it looks like something else continues to override the file. Strangely enough, I'm noticing that after a certain amount of time (the dhclient renewal time - check NetworkManager logs), resolv.conf seems to contain the output of the 99-origin-dns.sh changes. However, this would still be unacceptable as the nodes are not in a usable state after a reboot due to the initial configuration that is still incorrect.
Another workaround to fix the issue at boot is to modify the /etc/sysconfig/network-scripts/ifcfg-<IFACE> file to set PEERDNS to no and manually add the correct nameserver (the IP of the node) and the search domain (cluster.local), which will correctly configure the node's resolv.conf to provide pods with a working DNS configuration/service.
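For reference, the relevant part of the ifcfg file might look roughly like this. Treat it as a sketch: 192.0.2.10 is a placeholder for the node's own IP, and DOMAIN is the key I am assuming for the search domain (the earlier workaround in this thread used SEARCH instead, and which key is honored may depend on whether NetworkManager or the legacy network scripts manage the interface).

```shell
# /etc/sysconfig/network-scripts/ifcfg-<IFACE> (only the DNS-related keys shown)

# Do not take DNS settings from DHCP.
PEERDNS=no

# Placeholder address: replace with this node's own IP.
DNS1=192.0.2.10

# Assumed key for the search domain; see the note above.
DOMAIN=cluster.local
```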
@rezie As you mentioned, Red Hat published a solution for making permanent changes to /etc/resolv.conf that involves making the file immutable. I have been using this as a workaround, and it's been working well.
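A minimal sketch of the immutable-file approach, assuming the standard chattr tool and a filesystem that supports the immutable attribute. The function name lock_resolv_conf and the sanity check are my own additions, not part of the Red Hat solution.

```shell
# Freeze resolv.conf so that no other process can rewrite it.
# Assumption: run as root on a filesystem that supports chattr +i.
lock_resolv_conf() {
    file="${1:-/etc/resolv.conf}"
    # Refuse to freeze a file that is missing the cluster search domain,
    # because once locked, 99-origin-dns.sh cannot correct it either.
    if ! grep -q '^search .*cluster\.local' "$file"; then
        echo "refusing: $file has no cluster.local search entry" >&2
        return 1
    fi
    chattr +i "$file"   # undo later with: chattr -i "$file"
}
```

Running lsattr on the file afterwards should show the i flag. Note that the file has to be unlocked again with chattr -i before any legitimate DNS change can be written.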
I have a similar issue with openshift-enterprise 3.7, but it doesn't seem to affect my master, only the nodes. I can see in the logs that /etc/NetworkManager/dispatcher.d/99-origin-dns.sh is being executed and it seems to work, but something else overwrites resolv.conf afterwards. If I insert sleep 1 into /etc/NetworkManager/dispatcher.d/99-origin-dns.sh, it works :)
Greetings
Klaas
@Klaas- Where exactly did you insert the sleep statement?
@chris-str-cst
/etc/NetworkManager/dispatcher.d/99-origin-dns.sh
Near the start of the script: I inserted the sleep line, plus a comment explaining why I changed it, directly above "cd /etc/sysconfig/network-scripts".
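The edit described above could be scripted roughly as follows. This is a sketch, not the exact change that was made: the function name add_sleep_workaround and the comment text are made up for illustration, and it assumes the script contains a line beginning with "cd /etc/sysconfig/network-scripts".

```shell
# Insert "sleep 1" (with an explanatory comment) directly above the first
# "cd /etc/sysconfig/network-scripts" line of the given script.
add_sleep_workaround() {
    script="$1"
    tmp="$(mktemp)"
    awk '
        /^[[:space:]]*cd \/etc\/sysconfig\/network-scripts/ && !inserted {
            print "# Workaround: give other resolv.conf writers time to finish"
            print "sleep 1"
            inserted = 1
        }
        { print }
    ' "$script" > "$tmp" && cat "$tmp" > "$script"   # cat keeps the file mode
    rm -f "$tmp"
}
```

On a node it would be called as add_sleep_workaround /etc/NetworkManager/dispatcher.d/99-origin-dns.sh (as root, after taking a backup). A later openshift-ansible run may regenerate the script and drop the change.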
@Klaas- Thanks mate! This fixes my network issues. I reinstalled the cluster fresh last week and the same problems started again. Thanks for providing a solution! How did you find that?
Well, I noticed that resolv.conf was not being changed properly, yet the logs showed 99-origin-dns.sh being executed, which suggests some kind of race. I didn't find the time to properly debug NetworkManager, but as a quick fix waiting a second is good enough here. It's just a test environment ... :)
I am hitting this only on some of the nodes after I upgraded from 3.6 to 3.7.42-1. I potentially have the same issue on one 3.6 node too, as its resolv.conf looks different from the others.
In my case it was fixed by just restarting NetworkManager. I rebooted the node too to see if it was still OK, and it was. Apparently there is still some race condition going on.
I will post this here too, since I am not the only one who has discovered that the dispatcher is called:
I hit exactly this issue in Origin 3.9 on CentOS 7.5 and added a lot of debug printing (to a file) in the dispatcher.d/99-origin-dns.sh script.
I started having the issue when I switched my NetworkManager config from dhcp-provided IPs to static configuration.
What I found out is the following:
1. NetworkManager writes /etc/resolv.conf with the static resolver/search/etc.
2. The dispatcher script is called with $2 = up; the script sets up /etc/resolv.conf correctly.
3. The dispatcher script is called with $2 = connectivity-change; it does nothing, and /etc/resolv.conf stays identical (working).
4. NetworkManager overwrites /etc/resolv.conf with the static resolver/search/etc.
5. The dispatcher script is not called, which leaves OpenShift unable to pull images from the private registry and new pods unable to resolve Kubernetes services.
I have found a workaround: adding dns=none to the [main] section of the NetworkManager config makes things work in my case. The dispatcher script is still called, so it sets up /etc/dnsmasq.d/origin-upstream-dns.conf correctly with the NetworkManager-provided resolvers and /etc/resolv.conf with the local dnsmasq IP.
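One possible way to apply the dns=none setting is a drop-in file rather than an edit to the main config. This is a sketch under assumptions: the function name write_dns_none_dropin and the file name 90-dns-none.conf are made up, and it relies on NetworkManager reading *.conf files from /etc/NetworkManager/conf.d.

```shell
# Write a NetworkManager drop-in that stops NetworkManager from managing
# /etc/resolv.conf itself (dns=none in the [main] section).
# Assumption: the drop-in file name is arbitrary and chosen for illustration.
write_dns_none_dropin() {
    conf_dir="${1:-/etc/NetworkManager/conf.d}"
    printf '[main]\ndns=none\n' > "$conf_dir/90-dns-none.conf"
}
```

After writing the file (as root), NetworkManager has to be restarted for the setting to take effect, for example with systemctl restart NetworkManager.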
In my case (OCP 3.9 containerized on Atomic) I had to add this to '/etc/dnsmasq.d/node-dnsmasq.conf' and restart dnsmasq.
server=/default.svc/172.30.0.1
@openshift-bot: Closing this issue.