Openshift-ansible: could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory

Created on 19 Oct 2017 · 15 Comments · Source: openshift/openshift-ansible

Description

Uninstall OpenShift after an installation, then reinstall:

ansible-playbook /data/openshift-ansible/playbooks/adhoc/uninstall.yml

ansible-playbook /data/openshift-ansible/playbooks/byo/config.yml

Version

ansible 2.3.2.0

openshift-ansible, updated from master on 2017-10-19, commit ca6581dbd5bf06152ad8a321e1fb45911a91cce4

ansible log

TASK [openshift_manage_node : Wait for Node Registration] **************************************************************************************************************************************************************************************
Thursday 19 October 2017  21:32:38 +0800 (0:00:00.078)       0:03:00.870 ******
FAILED - RETRYING: Wait for Node Registration (50 retries left).
ok: [master -> master]
FAILED - RETRYING: Wait for Node Registration (49 retries left).
FAILED - RETRYING: Wait for Node Registration (48 retries left).
FAILED - RETRYING: Wait for Node Registration (47 retries left).
FAILED - RETRYING: Wait for Node Registration (46 retries left).
FAILED - RETRYING: Wait for Node Registration (45 retries left).
FAILED - RETRYING: Wait for Node Registration (44 retries left).
FAILED - RETRYING: Wait for Node Registration (43 retries left).
FAILED - RETRYING: Wait for Node Registration (42 retries left).
FAILED - RETRYING: Wait for Node Registration (41 retries left).
FAILED - RETRYING: Wait for Node Registration (40 retries left).

message log

Oct 19 21:24:19 node1 systemd: origin-node.service holdoff time over, scheduling restart.
Oct 19 21:24:19 node1 systemd: Starting OpenShift Node...
Oct 19 21:24:19 node1 dnsmasq[4965]: setting upstream servers from DBus
Oct 19 21:24:19 node1 dnsmasq[4965]: using nameserver 127.0.0.1#53 for domain in-addr.arpa
Oct 19 21:24:19 node1 dnsmasq[4965]: using nameserver 127.0.0.1#53 for domain cluster.local
Oct 19 21:24:20 node1 origin-node: I1019 21:24:20.297680   17564 start_node.go:251] Reading node configuration from /etc/origin/node/node-config.yaml
Oct 19 21:24:20 node1 origin-node: I1019 21:24:20.406336   17564 node.go:123] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "node1" (IP ""), iptables sync period "30s"
Oct 19 21:24:20 node1 origin-node: I1019 21:24:20.416313   17564 docker.go:364] Connecting to docker on unix:///var/run/docker.sock
Oct 19 21:24:20 node1 origin-node: I1019 21:24:20.416379   17564 docker.go:384] Start docker client with request timeout=2m0s
Oct 19 21:24:20 node1 origin-node: W1019 21:24:20.418569   17564 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 19 21:24:20 node1 origin-node: F1019 21:24:20.438965   17564 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory
Oct 19 21:24:20 node1 systemd: origin-node.service: main process exited, code=exited, status=255/n/a
Oct 19 21:24:20 node1 dnsmasq[4965]: setting upstream servers from DBus
Oct 19 21:24:20 node1 systemd: Failed to start OpenShift Node.
Oct 19 21:24:20 node1 systemd: Unit origin-node.service entered failed state.
Oct 19 21:24:20 node1 systemd: origin-node.service failed.

Temporary solution

Delete the content that 99-origin-dns added to /etc/resolv.conf.
Manually create /etc/origin/node/resolv.conf:

echo 'nameserver 192.168.1.142' > /etc/origin/node/resolv.conf
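
To confirm the failure condition before and after applying the workaround, it can help to check that the file named in the error exists and carries a nameserver entry. The following is only a sketch; has_nameserver is a hypothetical helper name and not part of openshift-ansible.

```shell
# Hypothetical helper: succeeds only if the given resolv.conf-style file
# is readable and contains at least one "nameserver <address>" line.
has_nameserver() {
    [ -r "$1" ] && grep -Eq '^[[:space:]]*nameserver[[:space:]]+[^[:space:]]' "$1"
}

# Example, using the path from the error message:
#   has_nameserver /etc/origin/node/resolv.conf \
#       || echo "node resolv.conf is missing or has no nameserver" >&2
```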

Normal ansible log

TASK [openshift_manage_node : Wait for Node Registration] **************************************************************************************************************************************************************************************
Thursday 19 October 2017  21:32:38 +0800 (0:00:00.078)       0:03:00.870 ******
FAILED - RETRYING: Wait for Node Registration (50 retries left).
ok: [master -> master]
FAILED - RETRYING: Wait for Node Registration (49 retries left).
FAILED - RETRYING: Wait for Node Registration (48 retries left).
FAILED - RETRYING: Wait for Node Registration (47 retries left).
FAILED - RETRYING: Wait for Node Registration (46 retries left).
FAILED - RETRYING: Wait for Node Registration (45 retries left).
FAILED - RETRYING: Wait for Node Registration (44 retries left).
FAILED - RETRYING: Wait for Node Registration (43 retries left).
FAILED - RETRYING: Wait for Node Registration (42 retries left).
FAILED - RETRYING: Wait for Node Registration (41 retries left).
FAILED - RETRYING: Wait for Node Registration (40 retries left).
ok: [node1 -> master]

The origin-node service then starts normally.

Source

ss75710541

Most helpful comment

I found the problem with @jfchevrette (Thanks JF!).

The issue is that our environment configures eth0 in /etc/sysconfig/network-scripts/ifcfg-eth0 to explicitly not use NetworkManager for that interface:

# Automatically generated, do not edit
DEVICE=eth0
BOOTPROTO=dhcp
HWADDR=fa:16:3e:b1:98:77
ONBOOT=yes
NM_CONTROLLED=no
TYPE=Ethernet

This means that the interface is not controlled by NetworkManager, so restarting NetworkManager does not bring that interface up and the dispatcher script does not run for that interface.
Just commenting out NM_CONTROLLED=no in the ifcfg-eth0 file and restarting NetworkManager created /etc/origin/node/resolv.conf properly.

I think a proper "fix" in openshift-ansible would be to add a check that verifies if the interface is in the output of "nmcli con", if it's not, fail with a friendly message. I'll send a PR for that.

dmsimard on 27 Oct 2017
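
The proposed check could look roughly like the sketch below. It assumes the nmcli on the host can print the DEVICE field of its connections in terse mode, one device per line; iface_in_nmcli_con is a hypothetical helper name, and this is not the code of any openshift-ansible pull request.

```shell
# Hypothetical helper: reads one device name per line on stdin and succeeds
# only if the given interface appears as an exact, whole-line match.
iface_in_nmcli_con() {
    [ -n "$1" ] && grep -Fxq -- "$1"
}

# On a live host (assumption: terse DEVICE output is available):
#   nmcli -t -f DEVICE connection show | iface_in_nmcli_con eth0 \
#       || echo "eth0 is not managed by NetworkManager" >&2
```

Reading the list from stdin keeps the helper testable without NetworkManager installed.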

All 15 comments

I am also seeing this issue, see full Ansible run here:

https://logs.rdoproject.org/59/10259/3/check/rdo-registry-integration/Zecce4a275f1347aab39ae5b022f86631/ara/

This task fails to restart origin node here with the same error in the original issue post:

Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 origin-node[18209]: I1025 01:48:05.172137   18209 start_node.go:257] Reading node configuration from /etc/origin/node/node-config.yaml
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 origin-node[18209]: I1025 01:48:05.191755   18209 node.go:146] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "192.168.1.42" (IP ""), iptables sync period "30s"
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 origin-node[18209]: F1025 01:48:05.192394   18209 start_node.go:146] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 dnsmasq[14479]: setting upstream servers from DBus
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 systemd[1]: Failed to start OpenShift Node.
-- Subject: Unit origin-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit origin-node.service has failed.
-- 
-- The result is failed.
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 systemd[1]: Unit origin-node.service entered failed state.
Oct 25 01:48:05 rdo-centos-7-rdo-cloud-24489 systemd[1]: origin-node.service failed.

dmsimard on 25 Oct 2017

Bringing back the discussion from the pull request, I've investigated the issue a little bit.

These are the suspicious things I found; I hope someone more knowledgeable than I am can shed some light:

I'm not sure what openshift_node_bootstrap is about, but it seems it might inhibit NetworkManager from restarting here because it defaults to false. In my testing, I naively tried setting it to true in the inventory file. That didn't work: node registration ended up timing out because the origin-node service was not restarted/enabled when openshift_node_bootstrap was true. ಠ_ಠ

The /etc/origin/node/resolv.conf file is supposed to be set up by NetworkManager, or in reality by the dispatcher script. This script fails to set up the resolv.conf file; the bash -x output is available here. The script exits very early, at the comparison if [[ ${DEVICE_IFACE} == ${def_route_int} ]], which evaluates to [[ docker0 == eth0 ]]. I've run some commands to show you what the network configuration looks like here.

So it seems like the dispatcher runs on the docker0 interface and that's it.
For what it's worth, this is reproduced on an OpenStack virtual machine with a fairly minimal CentOS image. I'll try to reproduce this on a more conventional CentOS image.

dmsimard on 25 Oct 2017
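
The comparison quoted above only passes when the dispatcher is invoked for the interface that carries the default route. As a rough sketch (this is not the dispatcher script's own code, which is not quoted in this thread), the default-route interface can be read from ip route output; default_route_iface is a hypothetical helper name.

```shell
# Hypothetical helper: reads `ip route` output on stdin and prints the
# interface that follows "dev" on the first "default" route line.
default_route_iface() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# On a live host (assumes iproute2 is installed):
#   def_route_int="$(ip route show default | default_route_iface)"
#   [ "$DEVICE_IFACE" = "$def_route_int" ] \
#       || echo "dispatcher would skip $DEVICE_IFACE" >&2
```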

I found the problem with @jfchevrette (Thanks JF!).

The issue is that our environment configures eth0 in /etc/sysconfig/network-scripts/ifcfg-eth0 to explicitly not use NetworkManager for that interface:

# Automatically generated, do not edit
DEVICE=eth0
BOOTPROTO=dhcp
HWADDR=fa:16:3e:b1:98:77
ONBOOT=yes
NM_CONTROLLED=no
TYPE=Ethernet

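This means that the interface is not controlled by NetworkManager, so restarting NetworkManager does not bring that interface up and the dispatcher script does not run for that interface.
Just commenting out NM_CONTROLLED=no in the ifcfg-eth0 file and restarting NetworkManager created /etc/origin/node/resolv.conf properly.

I think a proper "fix" in openshift-ansible would be to add a check that verifies if the interface is in the output of "nmcli con", if it's not, fail with a friendly message. I'll send a PR for that.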

dmsimard on 27 Oct 2017

Interestingly enough, I found this playbook that seems to ensure that NM_CONTROLLED is set to yes: https://github.com/openshift/openshift-ansible/blob/a974791553efd8f2080cc6735c0c5ba9e5bfe941/playbooks/common/openshift-node/network_manager.yml

That playbook is included by: https://github.com/openshift/openshift-ansible/blob/298f1aafc42e5c34938e1071353d103ad8964725/playbooks/byo/openshift-node/network_manager.yml

However, that playbook is not included anywhere (as far as I can tell) when running openshift-ansible/playbooks/byo/config.yml. Is it perhaps meant to be run standalone?

dmsimard on 28 Oct 2017

@dmsimard I think that code is dead; as far as I know, we don't use it.

We have a fix out for this, which should work for everyone: https://github.com/openshift/openshift-ansible/pull/5953

michaelgugino on 31 Oct 2017

@michaelgugino as discussed (writing here for posterity), that pull request doesn't resolve the issue.

The root cause of the problem is that openshift-ansible keeps going even though the default network interface might have NM_CONTROLLED=no in its configuration.
This is bound to fail because it means resolv.conf will not be set up in /etc/origin/node.

We need to either implement a check that detects this case and exits with a friendly error message, or include the playbook that installs NetworkManager and configures NM_CONTROLLED=yes.

dmsimard on 31 Oct 2017
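
One way such a check might inspect the interface configuration directly is sketched below. It assumes the ifcfg file format shown earlier in this thread and treats "no" in any letter case, with optional quotes, as an opt-out; ifcfg_opts_out_of_nm is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the given ifcfg file contains an
# uncommented NM_CONTROLLED=no line (optionally quoted, any letter case).
ifcfg_opts_out_of_nm() {
    grep -Eiq "^[[:space:]]*NM_CONTROLLED=[\"']?no[\"']?[[:space:]]*\$" "$1"
}

# Example with the path from this thread:
#   if ifcfg_opts_out_of_nm /etc/sysconfig/network-scripts/ifcfg-eth0; then
#       echo "eth0 has NM_CONTROLLED=no; the dispatcher script may not run for it" >&2
#   fi
```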

@dmsimard my preference is the check.

michaelgugino on 31 Oct 2017

Hi,
I also have this issue; I am running an upgrade from 3.5 to 3.6.
Could you provide a PR that fixes this?
Thanks

jkhelil on 20 Nov 2017

@sdodson @dmsimard Any clue about this issue? I still get the message because I have NM_CONTROLLED=no and, as explained by @dmsimard, it cannot work in this case because the interface is not brought up by NM and the script is not executed.
Can you include the NM playbooks where they are needed, or provide a fix for it, please?

jkhelil on 22 Nov 2017

@jkhelil there is already a playbook that ensures NetworkManager is installed and that the primary interface is managed by it.

I haven't had time to push a PR which would give a friendly error message yet but you can find the playbook here: https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-node/network_manager.yml

dmsimard on 25 Nov 2017

@dmsimard @sdodson The network_manager playbook is not included anywhere, so should we understand that we need to run it before everything else? In any case, some information seems to be missing: the documentation should state that network interfaces need NM_CONTROLLED set to yes.

jkhelil on 27 Nov 2017

@sdodson Would you mind explaining why NM_CONTROLLED needs to be enabled on interfaces? I understand it is needed for the script that generates /etc/origin/node/resolv.conf, but what are the architectural considerations behind that? Why not use /etc/resolv.conf?
Thanks

jkhelil on 28 Nov 2017

I think the missing network_manager.yml include is a bug. Without running it, the installation won't work (I had the exact same issue).

piwi91 on 28 Nov 2017

+1 same issue here as well. Setting my primary NIC to NM_CONTROLLED=yes fixed the issue.

ayen-tibco on 20 Mar 2018
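
For reference, the change described above can be scripted along the following lines. This is a sketch under the assumption that the file is a plain KEY=value ifcfg file; set_nm_controlled_yes is a hypothetical helper name and is not the openshift-ansible playbook.

```shell
# Hypothetical helper: rewrites an ifcfg-style file so that it contains
# exactly one NM_CONTROLLED line, set to yes.
set_nm_controlled_yes() {
    ifcfg="$1"
    [ -f "$ifcfg" ] || return 1
    scratch="$(mktemp)" || return 1
    # Drop any existing NM_CONTROLLED line (commented out or not).
    grep -Ev '^[[:space:]]*#?[[:space:]]*NM_CONTROLLED=' "$ifcfg" > "$scratch" || true
    echo 'NM_CONTROLLED=yes' >> "$scratch"
    # Overwrite in place so the original file's ownership and mode are kept.
    cat "$scratch" > "$ifcfg" && rm -f "$scratch"
}

# Example (back up the file first; restart NetworkManager afterwards):
#   set_nm_controlled_yes /etc/sysconfig/network-scripts/ifcfg-eth0
#   systemctl restart NetworkManager
```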

We have documented that NetworkManager is required here: https://docs.openshift.org/latest/install_config/install/prerequisites.html#prereq-networkmanager

michaelgugino on 20 Mar 2018
