Openshift-ansible: Start of origin-node fails if there are no NetworkManager-controlled interfaces

Created on 11 Sep 2017 · 8 Comments · Source: openshift/openshift-ansible

Description

Installing OpenShift Origin on a node without NetworkManager-controlled interfaces fails.

/etc/origin/node/resolv.conf is generated by /etc/NetworkManager/dispatcher.d/99-dnsmasq-origin-dns.sh, which is not invoked at all when there are no NM-controlled NICs. systemctl start origin-node then fails with:

-- Subject: Unit origin-node.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit origin-node.service has begun starting up.
Sep 11 00:24:04 ocp-node-4 dnsmasq[15980]: setting upstream servers from DBus
Sep 11 00:24:04 ocp-node-4 dnsmasq[15980]: using nameserver 127.0.0.1#53 for domain cluster.local
Sep 11 00:24:04 ocp-node-4 dnsmasq[15980]: using nameserver 127.0.0.1#53 for domain in-addr.arpa
Sep 11 00:24:04 ocp-node-4 origin-node[7442]: I0911 00:24:04.919558    7442 start_node.go:251] Reading node configuration from /etc/origin/node/node-config.yaml
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.038696    7442 node.go:123] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "ocp-node-4.yeslab.local" (IP ""), iptables sync period "30s"
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.038981    7442 common.go:85] Skipping loopback/non-IPv4 addr: "127.0.0.1" for node ocp-node-4.yeslab.local
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.039007    7442 node.go:136] Failed to determine node address from hostname ocp-node-4.yeslab.local; using default interface (Failed to obtain IP address from node name: ocp-node-4.yeslab.local)
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.039217    7442 interface.go:248] Default route transits interface "ens4"
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.039454    7442 interface.go:93] Interface ens4 is up
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.039544    7442 interface.go:138] Interface "ens4" has 2 addresses :[192.168.8.19/24 fe80::c0ff:fea8:813/64].
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.039580    7442 interface.go:105] Checking addr  192.168.8.19/24.
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.039595    7442 interface.go:114] IP found 192.168.8.19
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.039618    7442 interface.go:144] valid IPv4 address for interface "ens4" found as 192.168.8.19.
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.039630    7442 interface.go:254] Choosing IP 192.168.8.19
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.039657    7442 node.go:143] Resolved IP address to "192.168.8.19"
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.039760    7442 ipcmd.go:44] Executing: /usr/sbin/ip link set lbr0 down
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.049676    7442 ipcmd.go:48] Error executing /usr/sbin/ip: Cannot find device "lbr0"
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.050085    7442 docker.go:364] Connecting to docker on unix:///var/run/docker.sock
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.050123    7442 docker.go:384] Start docker client with request timeout=2m0s
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: W0911 00:24:05.052102    7442 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.059097    7442 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.069173    7442 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: I0911 00:24:05.077280    7442 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Sep 11 00:24:05 ocp-node-4 origin-node[7442]: F0911 00:24:05.079517    7442 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory
Sep 11 00:24:05 ocp-node-4 systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Sep 11 00:24:05 ocp-node-4 dnsmasq[15980]: setting upstream servers from DBus
Sep 11 00:24:05 ocp-node-4 systemd[1]: Failed to start OpenShift Node.
-- Subject: Unit origin-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit origin-node.service has failed.
-- 
-- The result is failed.
Sep 11 00:24:05 ocp-node-4 systemd[1]: Unit origin-node.service entered failed state.
Sep 11 00:24:05 ocp-node-4 systemd[1]: origin-node.service failed.
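
A pre-flight check along these lines might catch the condition before the playbook runs. This is a minimal sketch, not part of openshift-ansible: the function name count_nm_managed is hypothetical, and it assumes the terse DEVICE:STATE output of `nmcli -t -f DEVICE,STATE device`, in which interfaces that NetworkManager ignores report the state "unmanaged".

```shell
#!/bin/sh
# Hypothetical helper: count interfaces that NetworkManager manages.
# Reads "DEVICE:STATE" lines on stdin (assumed nmcli terse format).
count_nm_managed() {
    count=0
    while IFS=: read -r dev state; do
        [ -z "$dev" ] && continue                 # skip empty lines
        [ "$dev" = "lo" ] && continue             # ignore the loopback device
        [ "$state" = "unmanaged" ] && continue    # not under NM control
        count=$((count + 1))
    done
    echo "$count"
}
```

It could be fed with `nmcli -t -f DEVICE,STATE device | count_nm_managed`. A result of 0 would suggest that the dispatcher script never runs, so /etc/origin/node/resolv.conf will be missing.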

Version
$ ansible --version
ansible 2.3.1.0
  config file = /home/bacek/openshift/openshift-ansible.bacek/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, Nov  6 2016, 00:28:07) [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]

$ git describe
openshift-ansible-3.6.173.0.5-5-44-g66ea091


Steps To Reproduce
  1. Create a new VM with no NM-controlled interfaces, e.g.:
# cat /etc/sysconfig/network-scripts/ifcfg-ens3 
DEVICE=ens3
BOOTPROTO=static
NM_CONTROLLED=no
TYPE=Ethernet
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=10.1.0.145
  2. Install and enable NetworkManager to avoid #4950.

  3. Install or scale up the cluster to use this node.
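
To see whether a host matches the setup in step 1, the interface files can be scanned for the opt-out flag. The sketch below is illustrative only: find_unmanaged_ifcfg is a hypothetical name, and the default directory is the one shown in the reproduction step.

```shell
#!/bin/sh
# Hypothetical helper: list ifcfg files that opt out of NetworkManager control.
# $1: directory holding ifcfg-* files (defaults to the path used above).
find_unmanaged_ifcfg() {
    dir="${1:-/etc/sysconfig/network-scripts}"
    for f in "$dir"/ifcfg-*; do
        [ -f "$f" ] || continue   # also skips the unexpanded glob when nothing matches
        # Match NM_CONTROLLED=no, optionally double-quoted; -i is only a precaution.
        if grep -Eiq '^NM_CONTROLLED="?no"?[[:space:]]*$' "$f"; then
            echo "$f"
        fi
    done
}
```

Calling find_unmanaged_ifcfg with no argument prints one path per affected interface file and nothing if none opt out.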

Expected Results

Node is installed and added to the cluster.

Observed Results
$ ansible-playbook  -i ../hosts.ini playbooks/byo/openshift-node/scaleup.yml
...
RUNNING HANDLER [openshift_node : restart node] ***************************************************************************************************************************
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
fatal: [ocp-node-3.yeslab.local]: FAILED! => {
    "attempts": 3, 
    "changed": false, 
    "failed": true
}

MSG:

Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See "systemctl status origin-node.service" and "journalctl -xe" for details.


Additional Information

Simply running touch /etc/origin/node/resolv.conf lets the node start after the Ansible run has failed.

[root@ocp-node-4 ~]# systemctl status origin-node
● origin-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/origin-node.service.d
           └─openshift-sdn-ovs.conf
   Active: activating (auto-restart) (Result: exit-code) since Mon 2017-09-11 00:32:09 UTC; 2s ago
     Docs: https://github.com/openshift/origin
  Process: 10605 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
  Process: 10602 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
  Process: 10576 ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
  Process: 10573 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
  Process: 10571 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
 Main PID: 10576 (code=exited, status=255)

Sep 11 00:32:09 ocp-node-4 systemd[1]: Failed to start OpenShift Node.
Sep 11 00:32:09 ocp-node-4 systemd[1]: Unit origin-node.service entered failed state.
Sep 11 00:32:09 ocp-node-4 systemd[1]: origin-node.service failed.
[root@ocp-node-4 ~]# touch /etc/origin/node/resolv.conf
[root@ocp-node-4 ~]# systemctl reset-failed origin-node
[root@ocp-node-4 ~]# systemctl start origin-node
[root@ocp-node-4 ~]# systemctl status origin-node
● origin-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/origin-node.service.d
           └─openshift-sdn-ovs.conf
   Active: active (running) since Mon 2017-09-11 00:32:28 UTC; 10s ago
     Docs: https://github.com/openshift/origin
  Process: 10686 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
  Process: 10684 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
  Process: 10692 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
  Process: 10690 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
 Main PID: 10695 (openshift)
   Memory: 40.6M
   CGroup: /system.slice/origin-node.service
           ├─10695 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=5
           └─10737 journalctl -k -f

Sep 11 00:32:37 ocp-node-4 origin-node[10695]: I0911 00:32:37.265373   10695 kubelet.go:2069] Container runtime status: Runtime Conditions: RuntimeReady=true ...: message:
Sep 11 00:32:37 ocp-node-4 origin-node[10695]: I0911 00:32:37.307678   10695 eviction_manager.go:197] eviction manager: synchronize housekeeping
Sep 11 00:32:37 ocp-node-4 origin-node[10695]: I0911 00:32:37.348088   10695 summary.go:389] Missing default interface "eth0" for node:ocp-node-4.yeslab.local
Sep 11 00:32:37 ocp-node-4 origin-node[10695]: I0911 00:32:37.348219   10695 helpers.go:744] eviction manager: observations: signal=nodefs.inodesFree, availab... +0000 UTC
Sep 11 00:32:37 ocp-node-4 origin-node[10695]: I0911 00:32:37.348257   10695 helpers.go:744] eviction manager: observations: signal=imagefs.available, availab... +0000 UTC
Sep 11 00:32:37 ocp-node-4 origin-node[10695]: I0911 00:32:37.348271   10695 helpers.go:746] eviction manager: observations: signal=allocatableMemory.availabl... 7908384Ki
Sep 11 00:32:37 ocp-node-4 origin-node[10695]: I0911 00:32:37.348282   10695 helpers.go:744] eviction manager: observations: signal=memory.available, availabl... +0000 UTC
Sep 11 00:32:37 ocp-node-4 origin-node[10695]: I0911 00:32:37.348295   10695 helpers.go:744] eviction manager: observations: signal=nodefs.available, availabl... +0000 UTC
Sep 11 00:32:37 ocp-node-4 origin-node[10695]: I0911 00:32:37.348321   10695 eviction_manager.go:292] eviction manager: no resources are starved
Sep 11 00:32:37 ocp-node-4 origin-node[10695]: I0911 00:32:37.842202   10695 generic.go:182] GenericPLEG: Relisting
Hint: Some lines were ellipsized, use -l to show in full.

lifecycle/rotten

All 8 comments

I'm facing the same issue as well.
Any suggestions for a workaround?

I am facing the same issue as well.
```
# ansible --version
ansible 2.3.1.0
config file = /root/openshift-ansible/ansible.cfg
configured module search path = Default w/o overrides
python version = 2.7.5 (default, Aug 2 2016, 04:20:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]

"msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n",

```

Is there any update on this issue? I tried to run the playbook from tag openshift-ansible-3.6.173.0.58-1.
When I create resolv.conf manually before installation, it works without problems. Could someone tell me whether I can use this cluster now, or whether there will be any consequences?
I checked here where the following is mentioned for my case (static IP):

_Disabled, then configure your network interface to be static, and add DNS nameservers to NetworkManager._

Do I still have to let NetworkManager manage my static IP? How else could I add a DNS server to NetworkManager?

Yes, NetworkManager is required, as described in the install doc: https://docs.openshift.org/latest/install_config/install/prerequisites.html#prereq-networkmanager

Just remove NM_CONTROLLED=no from your NIC configuration.
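
The suggested change can be scripted roughly as follows. This is a sketch under stated assumptions rather than a documented procedure: enable_nm_control is a hypothetical name, and the .bak backup next to the original file is a choice made here, not something the thread prescribes.

```shell
#!/bin/sh
# Hypothetical helper: drop the NM_CONTROLLED=no line from one ifcfg file.
# $1: path to the ifcfg file; a copy is kept at "$1.bak".
enable_nm_control() {
    file="$1"
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    cp -p "$file" "$file.bak" || return 1
    # Rewrite the file from the backup, omitting the opt-out line
    # (plain or double-quoted value).
    sed -e '/^NM_CONTROLLED=no[[:space:]]*$/d' \
        -e '/^NM_CONTROLLED="no"[[:space:]]*$/d' \
        "$file.bak" > "$file"
}
```

For the file from the original report this would be called as `enable_nm_control /etc/sysconfig/network-scripts/ifcfg-ens3`. NetworkManager presumably has to re-read the file afterwards, for example with `nmcli connection reload` or a service restart; that step is an assumption and is not confirmed in this thread.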

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

