Openshift-ansible: could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory

Created on 19 Oct 2017 · 15 Comments · Source: openshift/openshift-ansible

Description

Reinstalling OpenShift after uninstalling an existing installation fails:

ansible-playbook /data/openshift-ansible/playbooks/adhoc/uninstall.yml

ansible-playbook /data/openshift-ansible/playbooks/byo/config.yml
Version
ansible 2.3.2.0

openshift-ansible 2017-10-19 update from master, commit id ca6581dbd5bf06152ad8a321e1fb45911a91cce4
ansible log
TASK [openshift_manage_node : Wait for Node Registration] **************************************************************************************************************************************************************************************
Thursday 19 October 2017  21:32:38 +0800 (0:00:00.078)       0:03:00.870 ******
FAILED - RETRYING: Wait for Node Registration (50 retries left).
ok: [master -> master]
FAILED - RETRYING: Wait for Node Registration (49 retries left).
FAILED - RETRYING: Wait for Node Registration (48 retries left).
FAILED - RETRYING: Wait for Node Registration (47 retries left).
FAILED - RETRYING: Wait for Node Registration (46 retries left).
FAILED - RETRYING: Wait for Node Registration (45 retries left).
FAILED - RETRYING: Wait for Node Registration (44 retries left).
FAILED - RETRYING: Wait for Node Registration (43 retries left).
FAILED - RETRYING: Wait for Node Registration (42 retries left).
FAILED - RETRYING: Wait for Node Registration (41 retries left).
FAILED - RETRYING: Wait for Node Registration (40 retries left).
message log
Oct 19 21:24:19 node1 systemd: origin-node.service holdoff time over, scheduling restart.
Oct 19 21:24:19 node1 systemd: Starting OpenShift Node...
Oct 19 21:24:19 node1 dnsmasq[4965]: setting upstream servers from DBus
Oct 19 21:24:19 node1 dnsmasq[4965]: using nameserver 127.0.0.1#53 for domain in-addr.arpa
Oct 19 21:24:19 node1 dnsmasq[4965]: using nameserver 127.0.0.1#53 for domain cluster.local
Oct 19 21:24:20 node1 origin-node: I1019 21:24:20.297680   17564 start_node.go:251] Reading node configuration from /etc/origin/node/node-config.yaml
Oct 19 21:24:20 node1 origin-node: I1019 21:24:20.406336   17564 node.go:123] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "node1" (IP ""), iptables sync period "30s"
Oct 19 21:24:20 node1 origin-node: I1019 21:24:20.416313   17564 docker.go:364] Connecting to docker on unix:///var/run/docker.sock
Oct 19 21:24:20 node1 origin-node: I1019 21:24:20.416379   17564 docker.go:384] Start docker client with request timeout=2m0s
Oct 19 21:24:20 node1 origin-node: W1019 21:24:20.418569   17564 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 19 21:24:20 node1 origin-node: F1019 21:24:20.438965   17564 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory
Oct 19 21:24:20 node1 systemd: origin-node.service: main process exited, code=exited, status=255/n/a
Oct 19 21:24:20 node1 dnsmasq[4965]: setting upstream servers from DBus
Oct 19 21:24:20 node1 systemd: Failed to start OpenShift Node.
Oct 19 21:24:20 node1 systemd: Unit origin-node.service entered failed state.
Oct 19 21:24:20 node1 systemd: origin-node.service failed.
Temporary solution
  1. Delete the content in /etc/resolv.conf that was added by 99-origin-dns

  2. Manually create /etc/origin/node/resolv.conf

echo 'nameserver 192.168.1.142' > /etc/origin/node/resolv.conf
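
As a rough sketch of step 2, the manual creation could be wrapped so that it does not overwrite a file that already exists. The function name write_node_resolv_conf and its optional target argument are hypothetical and only here for illustration; the nameserver address has to be the one that is valid in your own environment.

```shell
# Hypothetical helper (not part of openshift-ansible): write a
# single-nameserver resolv.conf for the node unless one already exists.
write_node_resolv_conf() {
    local nameserver="$1"
    local target="${2:-/etc/origin/node/resolv.conf}"

    # Do nothing if a non-empty file is already in place.
    if [ -s "$target" ]; then
        return 0
    fi

    mkdir -p "$(dirname "$target")" || return 1
    printf 'nameserver %s\n' "$nameserver" > "$target"
}
```

For example, write_node_resolv_conf 192.168.1.142 would produce the same file as the echo command above. This only works around the missing file; it does not explain why the file was not generated.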
Normal ansible log
TASK [openshift_manage_node : Wait for Node Registration] **************************************************************************************************************************************************************************************
Thursday 19 October 2017  21:32:38 +0800 (0:00:00.078)       0:03:00.870 ******
FAILED - RETRYING: Wait for Node Registration (50 retries left).
ok: [master -> master]
FAILED - RETRYING: Wait for Node Registration (49 retries left).
FAILED - RETRYING: Wait for Node Registration (48 retries left).
FAILED - RETRYING: Wait for Node Registration (47 retries left).
FAILED - RETRYING: Wait for Node Registration (46 retries left).
FAILED - RETRYING: Wait for Node Registration (45 retries left).
FAILED - RETRYING: Wait for Node Registration (44 retries left).
FAILED - RETRYING: Wait for Node Registration (43 retries left).
FAILED - RETRYING: Wait for Node Registration (42 retries left).
FAILED - RETRYING: Wait for Node Registration (41 retries left).
FAILED - RETRYING: Wait for Node Registration (40 retries left).
ok: [node1 -> master]

The origin-node starts normally

All 15 comments

I am also seeing this issue, see full Ansible run here:

https://logs.rdoproject.org/59/10259/3/check/rdo-registry-integration/Zecce4a275f1347aab39ae5b022f86631/ara/

This task fails to restart origin node here with the same error in the original issue post:

Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 origin-node[18209]: I1025 01:48:05.172137   18209 start_node.go:257] Reading node configuration from /etc/origin/node/node-config.yaml
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 origin-node[18209]: I1025 01:48:05.191755   18209 node.go:146] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "192.168.1.42" (IP ""), iptables sync period "30s"
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 origin-node[18209]: F1025 01:48:05.192394   18209 start_node.go:146] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 dnsmasq[14479]: setting upstream servers from DBus
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 systemd[1]: Failed to start OpenShift Node.
-- Subject: Unit origin-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit origin-node.service has failed.
-- 
-- The result is failed.
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 systemd[1]: Unit origin-node.service entered failed state.
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 systemd[1]: origin-node.service failed.

Bringing back the discussion from the pull request, I've investigated the issue a little bit.

These are the suspicious things I found, and I hope someone more knowledgeable than I am can shed some light:

  • I'm not sure what openshift_node_bootstrap is for, but it seems it might prevent NetworkManager from being restarted here because it defaults to false. In my testing, I naively set it to true in the inventory file. That didn't work: node registration ended up timing out because the origin-node service was not restarted or enabled when openshift_node_bootstrap was true. ಠ_ಠ
  • The /etc/origin/node/resolv.conf file is supposed to be set up by NetworkManager, or more precisely by the dispatcher script. This script fails to set up the resolv.conf file; the bash -x output is available here. The script exits very early at the comparison if [[ ${DEVICE_IFACE} == ${def_route_int} ]], which evaluates to [[ docker0 == eth0 ]]. I've run some commands to show you what the network configuration looks like here.

So it seems the dispatcher runs only for the docker0 interface.
For what it's worth, this is reproduced on an OpenStack virtual machine with a fairly minimal CentOS image. I'll try to reproduce it on a more conventional CentOS image.
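
To make the early exit easier to follow, here is a minimal sketch of the comparison described above. It is not the dispatcher script itself: the function names are hypothetical, and the route text is expected in the format printed by ip route, which is an assumption about where def_route_int comes from.

```shell
# Print the interface that carries the default route, reading
# `ip route` style text from stdin.
default_route_iface() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) {
            if ($i == "dev") { print $(i + 1); exit }
        }
    }'
}

# Mirror the comparison quoted above: succeed only when the interface
# being dispatched is the one that carries the default route.
dispatch_applies() {
    local device_iface="$1"
    local def_route_int="$2"
    [ "$device_iface" = "$def_route_int" ]
}
```

With eth0 as the default-route interface, dispatch_applies docker0 eth0 fails, which matches the [[ docker0 == eth0 ]] result seen in the trace. The script would only get past that point if it were invoked for eth0.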

I found the problem with @jfchevrette (Thanks JF!).

The issue is that our environment configures eth0 in /etc/sysconfig/network-scripts/ifcfg-eth0 to explicitly not use NetworkManager for that interface:

# Automatically generated, do not edit
DEVICE=eth0
BOOTPROTO=dhcp
HWADDR=fa:16:3e:b1:98:77
ONBOOT=yes
NM_CONTROLLED=no
TYPE=Ethernet

This means that the interface is not controlled by NetworkManager and therefore restarting NetworkManager does not bring that interface up and the dispatcher script does not run for that interface.
Simply commenting out NM_CONTROLLED=no in the ifcfg-eth0 file and restarting NetworkManager created /etc/origin/node/resolv.conf properly.
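
A quick way to spot this setting on a host might look like the sketch below. The function name nm_controlled_value is hypothetical. Treating a missing or commented-out key as "yes" is an assumption, although it is consistent with the behaviour described above.

```shell
# Print the effective NM_CONTROLLED value from an ifcfg file, lowercased.
# Assumption: an absent or commented-out key means "yes".
nm_controlled_value() {
    local ifcfg="$1"
    local value

    # Take the last uncommented assignment, strip quotes, lowercase it.
    value="$(sed -n 's/^[[:space:]]*NM_CONTROLLED=//p' "$ifcfg" \
        | tail -n 1 | tr -d "\"'" | tr '[:upper:]' '[:lower:]')"

    printf '%s\n' "${value:-yes}"
}
```

Running nm_controlled_value /etc/sysconfig/network-scripts/ifcfg-eth0 against the file shown above would print no, which is the case where the dispatcher script never runs for eth0.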

I think a proper "fix" in openshift-ansible would be to add a check that verifies if the interface is in the output of "nmcli con", if it's not, fail with a friendly message. I'll send a PR for that.
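
The kind of check I have in mind could be sketched roughly as follows. This is only an illustration, not the actual PR: the function names are hypothetical, and the device list is assumed to arrive on stdin, one name per line, for example from nmcli -t -f DEVICE connection show.

```shell
# Succeed if the given interface appears in a list of device names
# read from stdin, one name per line.
iface_known_to_nm() {
    local iface="$1"
    grep -Fxq -- "$iface"
}

# Hypothetical pre-flight check: fail with a readable message when the
# interface is not among the devices NetworkManager reports.
check_iface_managed() {
    local iface="$1"
    if iface_known_to_nm "$iface"; then
        return 0
    fi
    echo "Interface $iface is not managed by NetworkManager; check NM_CONTROLLED in its ifcfg file." >&2
    return 1
}
```

Usage would be something like nmcli -t -f DEVICE connection show | check_iface_managed eth0, failing early instead of timing out later at node registration.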

Interestingly enough, I found this playbook that seems to ensure that NM_CONTROLLED is set to yes: https://github.com/openshift/openshift-ansible/blob/a974791553efd8f2080cc6735c0c5ba9e5bfe941/playbooks/common/openshift-node/network_manager.yml

That playbook is included by: https://github.com/openshift/openshift-ansible/blob/298f1aafc42e5c34938e1071353d103ad8964725/playbooks/byo/openshift-node/network_manager.yml

However, as far as I can tell, that playbook is not included anywhere when running openshift-ansible/playbooks/byo/config.yml. Is it perhaps meant to be run standalone?

@dmsimard I think that code is dead, we don't use it as far as I know.

We have a fix out for this, which should work for everyone: https://github.com/openshift/openshift-ansible/pull/5953

@michaelgugino As discussed (writing here for posterity), that pull request doesn't resolve the issue.

The root cause of the problem is that openshift-ansible will keep going even though the default network interface might have NM_CONTROLLED=no in its configuration.
This is bound to fail because it means resolv.conf will not be set up in /etc/origin/node.

We need to either implement a check that verifies whether that is the case and exits with a friendly error message, or include the playbook that installs NetworkManager and configures NM_CONTROLLED=yes.

@dmsimard my preference is the check.

Hi,
I also have this issue; I am running an upgrade from 3.5 to 3.6.
Could you provide a PR to fix this?
Thanks

@sdodson @dmsimard Any clue about this issue? I am still getting the message because I have NM_CONTROLLED=no and, as explained by @dmsimard, it cannot work in this case because the interface is not brought up by NM and the script is not executed.
Can you include the NM playbooks where needed, or provide a fix for it, please?

@jkhelil there is already a playbook that ensures NetworkManager is installed and that the primary interface is managed by it.

I haven't had time to push a PR which would give a friendly error message yet but you can find the playbook here: https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-node/network_manager.yml

@dmsimard @sdodson The network_manager playbook is not included anywhere, so should we understand that we need to run it before everything else? In any case, some information seems to be missing: the documentation should state that network interfaces must have NM_CONTROLLED set to yes.

@sdodson Would you mind explaining why NM_CONTROLLED needs to be enabled on interfaces? I understand it is needed for the script that generates /etc/origin/node/resolv.conf, but what are the architectural considerations behind that? Why not use /etc/resolv.conf?
Thanks

I think the network_manager.yml include is a bug. Without running it, it won't work (I have had the exact same issues).

+1 same issue here as well. Setting my primary NIC to NM_CONTROLLED=yes fixed the issue.

We have documented that NetworkManager is required here: https://docs.openshift.org/latest/install_config/install/prerequisites.html#prereq-networkmanager
