Openshift-ansible: Empty /etc/cni/net.d/ causing nodes to fail

Created on 16 Apr 2018  ·  60Comments  ·  Source: openshift/openshift-ansible

Description

The ansible installer sometimes fails to create the file /etc/cni/net.d/80-openshift-network.conf which causes the node to fail. This seems to be a random event and happens on OpenStack and AWS environments. In one OpenStack environment this happens for approx 50% of nodes, on other environments it is less frequent.

If OpenShift is uninstalled (using the uninstall.yaml playbook) and then reinstalled the same nodes fail.

This has been discussed in the users mailing list in these topics, and at least one other user has reported similar problems:
https://lists.openshift.redhat.com/openshift-archives/users/2018-April/msg00035.html
https://lists.openshift.redhat.com/openshift-archives/users/2018-April/msg00045.html

Version
$ ansible --version
ansible 2.5.0
  config file = /home/centos/37-test/ansible.cfg
  configured module search path = [u'/home/centos/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Aug  4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
[centos@orndev-bastion-002 37-test]$ cd ../openshift-ansible
[centos@orndev-bastion-002 openshift-ansible]$ git describe
openshift-ansible-3.7.44-1-6-gbb4a8d7

$ git describe
openshift-ansible-3.7.44-1-6-gbb4a8d7

Nodes are based on plain Centos7 cloud images with the necessary packages described in the Origin docs installed.

OpenShift is version 3.7.

Steps To Reproduce

Using an Inventory file for a minimal 4 node setup (see below) we run the installer:

$ ansible-playbook -i inventory -vv ~/openshift-ansible/playbooks/byo/config.yml
<snip>
PLAY RECAP *****************************************************************************************************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
test37-infra               : ok=163  changed=61   unreachable=0    failed=1   
test37-master              : ok=615  changed=264  unreachable=0    failed=1   
test37-node-001            : ok=187  changed=65   unreachable=0    failed=0   
test37-node-002            : ok=163  changed=61   unreachable=0    failed=1

In this case the test37- infra and test37-node-002 nodes failed. On both of those the /etc/cni/net.d/ directory was empty.
On the test37-master and test37-node-001 nodes the file 80-openshift-network.conf was present. The master node fails because there was no infra node, not because of the problem described here.
If new nodes were created and the process repeated a different set of nodes would fail.

Expected Results

All nodes should start. The file /etc/cni/net.d/80-openshift-network.conf should be present on all nodes and the origin-node service should start successfully.

Observed Results

On the failed nodes the file /etc/cni/net.d/80-openshift-network.conf is not present and the origin-node service fails to start.
Note that this missing file may just be a symptom of the real problem that lies upstream.

On the failed nodes:

$ sudo systemctl status -l origin-node.service
● origin-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/origin-node.service.d
           └─openshift-sdn-ovs.conf
   Active: activating (start) since Mon 2018-04-16 09:36:24 UTC; 39s ago
     Docs: https://github.com/openshift/origin
  Process: 18844 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
  Process: 18840 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
  Process: 18886 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
  Process: 18882 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
 Main PID: 18889 (openshift)
   Memory: 43.7M
   CGroup: /system.slice/origin-node.service
           ├─18889 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
           └─18933 journalctl -k -f

Apr 16 09:36:54 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:54.950946   18889 manager.go:306] Starting recovery of all containers
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.020197   18889 kubelet_node_status.go:270] Setting node annotation to enable volume controller attach/detach
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027120   18889 kubelet_node_status.go:433] Recording NodeHasSufficientDisk event message for node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027172   18889 kubelet_node_status.go:433] Recording NodeHasSufficientMemory event message for node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027194   18889 kubelet_node_status.go:433] Recording NodeHasNoDiskPressure event message for node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027224   18889 kubelet_node_status.go:82] Attempting to register node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.071793   18889 manager.go:311] Recovery completed
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: E0416 09:36:55.237121   18889 eviction_manager.go:238] eviction manager: unexpected err: failed GetNode: node 'test37-infra.openstacklocal' not found
Apr 16 09:37:00 test37-infra.openstacklocal origin-node[18889]: W0416 09:37:00.225298   18889 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d
Apr 16 09:37:00 test37-infra.openstacklocal origin-node[18889]: E0416 09:37:00.227084   18889 kubelet.go:2112] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

The bit that says Unable to update cni config: No networks found in /etc/cni/net.d seems to be the key thing, and is the result of the missing file.

Another result of this is that DNS does not get set up correctly. On failed nodes:

$ sudo netstat -tunlp | grep tcp | grep 53 | grep -v tcp6
tcp        0      0 10.0.0.30:53            0.0.0.0:*               LISTEN      14567/dnsmasq       
tcp        0      0 172.17.0.1:53           0.0.0.0:*               LISTEN      14567/dnsmasq 

On successful nodes:

$ sudo netstat -tunlp | grep tcp | grep 53 | grep -v tcp6
tcp        0      0 127.0.0.1:53            0.0.0.0:*               LISTEN      17353/openshift     
tcp        0      0 10.129.0.1:53           0.0.0.0:*               LISTEN      14544/dnsmasq       
tcp        0      0 10.0.0.35:53            0.0.0.0:*               LISTEN      14544/dnsmasq       
tcp        0      0 172.17.0.1:53           0.0.0.0:*               LISTEN      14544/dnsmasq

Ansible log:
https://gist.github.com/a4f6e17554e6c77db7d97eeecd2cde8f

The log makes no mention of 80-openshift-network.conf so no obvious clues to why it did not get created.

Additional Information

Provide any additional information which may help us diagnose the
issue.

$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core) 

Inventory file for a minimal 4 node setup:

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
ansible_ssh_user=centos
ansible_become=yes

openshift_deployment_type=origin
openshift_release=v3.7

openshift_disable_check=disk_availability,docker_storage,memory_availability
openshift_clock_enabled=true

# Enable htpasswd authentication
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/users.htpasswd'}]
# make sure this htpasswd file exists
openshift_master_htpasswd_file=/home/centos/users.htpasswd

openshift_master_cluster_public_hostname=130.238.28.199.nip.io
openshift_master_default_subdomain=130.238.28.131.nip.io

# default project node selector
osm_default_node_selector='zone=default'

openshift_docker_additional_registries = registry.access.redhat.com
openshift_docker_insecure_registries = registry.access.redhat.com

openshift_metrics_install_metrics=false
openshift_logging_install_logging=false
openshift_hosted_prometheus_deploy=false

ansible_service_broker_image_prefix=registry.access.redhat.com/openshift3/ose-
ansible_service_broker_registry_url=registry.access.redhat.com

[masters]
test37-master

[etcd]
test37-master

[nodes]
test37-master   openshift_hostname=test37-master.openstacklocal
test37-infra    openshift_hostname=test37-infra.openstacklocal openshift_node_labels="{'region': 'infra', 'zone': 'default'}" 
test37-node-001 openshift_hostname=test37-node-001.openstacklocal openshift_node_labels="{'region': 'primary', 'zone': 'default'}" 
test37-node-002 openshift_hostname=test37-node-002.openstacklocal openshift_node_labels="{'region': 'primary', 'zone': 'default'}" 

Most helpful comment

Here's how it works, so everyone is clear.

1) kubelet/openshift-node starts
2) kubelet registers itself as a Node object with the apiserver (eg 'oc get nodes')
3) master SDN controller notices the new Node and creates a new HostSubnet for the node (eg 'oc get hostsubnets')
4) node SDN (built into openshift-node in <= 3.9, but a DaemonSet in 3.10+) periodically polls for its HostSubnet; eventually times out if it doesn't find one
5) if the node SDN finds a HostSubnet matching its hostname, continues initialization
6) when initialization has completed successfully, writes out /etc/cni/net.d/80-openshift-sdn.conf
7) kubelet polls /etc/cni/net.d every 30 seconds for a config file, so it finally sees the config that's written and updates the apiserver with network-ready status

So an empty /etc/cni/net.d is a symptom of the SDN not being able to initialize, not a cause of any particular problem.

All 60 comments

I'm seeing the same error with a similar setup:

  • 4 nodes: 1 master, 1 infra, 2 compute
  • CentOS 7.4.1708, deployed as VMware virtual machines
  • OpenShift Origin 3.9

Here is my hosts file:

```[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]

containerized=true

openshift_use_system_containers=true

openshift_docker_use_system_container=false

openshift_repos_enable_testing=true

deploy OpenShift Origin

openshift_deployment_type=origin

OpenShift version

openshift_release=v3.9.0

use root

ansible_ssh_user=root

don't verify the node memory or disk

openshift_disable_check=memory_availability,disk_availability,docker_storage

enable ntp

openshift_clock_enabled=true

[masters]
oso-master-01

[etcd]
oso-master-01

[nodes]
oso-master-01
oso-node-01
oso-node-02
oso-infra-01 openshift_node_labels="{'region': 'infra','zone': 'default'}"
```

I have tried...

  • both the containerized and RPM based methods
  • RHEL 7.5 and CentOS 7.4.1708
  • Calico networking, openshift-ovs-subnet, and openshift-ovs-multitenant

Here are the logs from the most recent run, which was based on RHEL 7.5 + Calico.

@acsulli Yes, definitely seems to be the same problem. Contains the "No networks found in /etc/cni/net.d" error which results from that directory missing the file 80-openshift-network.conf

Running into the same exact problem on AWS origin 3.7. My infra node also suddenly separated itself from the cluster

EDIT: Should probably mention that this is an existing cluster and not a new install

I'm hitting the same symptom at install time on 3.10. Filed https://bugzilla.redhat.com/show_bug.cgi?id=1568583

3.7 and 3.10/master are worlds apart at this point in time. Likely a different issue for 3.10.

@acsulli hey, can you try with release-3.9 branch? The master branch is going through a huge refactoring to move to using static pods and it isn't expected to work for 3.9 at this point in time. Sorry for the inconvenience.

Seems to be quite a common problem!
The empty /etc/cni/net.d syndrom.
@mffiedler could you link to this issue from bugzilla - I tried to but bugzilla not liking my login any more :-(

the SDN itself creates the /etc/cni/net.d config file when it has started up and is ready. So if that file is not being created, then either the SDN daemonset has not been correctly started, or even if it has been started it is unable to start correct.

We'd need docker container logs from the SDN daemonset container to debug further, which should be available with kubectl/oc or 'docker logs'.

In one log I've seen, it seems to be a general problem starting containers on the node:

failed to create a sandbox for pod "ovs-rpljr": Error response from daemon: lstat /var/lib/docker/overlay2/09a5b1f6274d724b06bde4ba9f93bd7ac7254c56dfcf72b5989c47806de6e47c: no such file or directory

and not specifically network related.

SDN daemonset?
I'm running with 3.7 and don't see any such daemonsets

@tdudgeon when using openshift-sdn as the network plugin, whether daemonset (3.10+) or built-in (< 3.10), it still writes out the network configuration itself. Ansible is not responsible for writing /etc/cni/net.d.

So if you are on 3.7 then you'd want to get 'journalctl -b -u atomic-openshift-node' and those logs should indicate why openshift-sdn is not ready and thus has not written out the config file to /etc.

When not using openshift-sdn as the network plugin, whatever network plugin you do use typically writes that file out itself.

@sdodson I have successfully deployed one cluster using the release-3.9 branch on RHEL 7.5, which I was unable to do before. Next test is to use CentOS 7.4.

This is what I see in the logs for the node service where this fails (using journalctl -b -u origin-node):

Apr 19 07:46:41 simple-node-001.openrisknet systemd[1]: Starting OpenShift Node...
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: I0419 07:46:42.308735    5401 start_node.go:288] Reading node configuration from /etc/origin/node/node-config.yaml
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: W0419 07:46:42.313718    5401 server.go:190] WARNING: all flags other than --config, --write-config-to, and --cleanup-iptables are deprecated. Please begin using a config file ASAP.
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: I0419 07:46:42.324198    5401 node.go:146] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "simple-node-001.openrisknet" (IP ""), iptables sync period "30s"
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: I0419 07:46:42.459471    5401 mount_linux.go:200] Detected OS with systemd
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: I0419 07:46:42.459519    5401 client.go:72] Connecting to docker on unix:///var/run/docker.sock
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: I0419 07:46:42.459547    5401 client.go:92] Start docker client with request timeout=2m0s
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: W0419 07:46:42.467320    5401 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: I0419 07:46:42.822892    5401 node_config.go:120] Successfully initialized cloud provider: "openstack" from the config file: "/etc/origin/cloudprovider/openstack.conf"
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: I0419 07:46:42.822989    5401 start_node.go:469] Starting node simple-node-001.openrisknet (v3.7.2+cd74924-1)
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: I0419 07:46:42.823478    5401 client.go:72] Connecting to docker on unix:///var/run/docker.sock
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: I0419 07:46:42.823516    5401 client.go:92] Start docker client with request timeout=2m0s
Apr 19 07:46:42 simple-node-001.openrisknet origin-node[5401]: I0419 07:46:42.835720    5401 node.go:109] Connecting to Docker at unix:///var/run/docker.sock
Apr 19 07:47:12 simple-node-001.openrisknet origin-node[5401]: I0419 07:47:12.839419    5401 node.go:358] Starting openshift-sdn network plugin

The error that is reported (Unable to update cni config: No networks found in /etc/cni/net.d) seems be the RESULT of the file being missing, not the reason why it is missing).

My node didn't get hostsubnet assigned to it for some reason. Running oc get hostsubnet only returned 2 out of 3 hostsubnets for the node. I manually created hostsubnet with a yaml file, ran the scaleup playbook and it got reattached and the cni config reappeared in /etc/cni/net.d and the services came back up.

@daniyalj I can confirm that hostsubnet is not present for a failed node.
In my case I have 1 master/infra node and 5 worker nodes. node-001 failed to deploy this time and was missing the /etc/cni/net.d/80-openshift-network.conf file.
When I ran oc get hostsubnet against master I saw the following:

$ oc get hostsubnet
NAME                          HOST                          HOST IP     SUBNET          EGRESS IPS
simple-master.openrisknet     simple-master.openrisknet     10.0.0.27   10.128.0.0/23   []
simple-node-002.openrisknet   simple-node-002.openrisknet   10.0.0.6    10.129.0.0/23   []
simple-node-003.openrisknet   simple-node-003.openrisknet   10.0.0.33   10.128.2.0/23   []
simple-node-004.openrisknet   simple-node-004.openrisknet   10.0.0.29   10.130.0.0/23   []
simple-node-005.openrisknet   simple-node-005.openrisknet   10.0.0.18   10.131.0.0/23   []

Notice that node-001 is missing.
So maybe this pushes the problem further up the food chain?

@tdudgeon here's the yaml file we used to create the hostsubnet on the master. You will need to change it with your network. Then ran scaleup playbook which simply attaches a node to your existing cluster. But I agree there is a problem upstream..

```[centos@master-1 ~]$ cat subnet.yml
apiVersion: v1
host: ip-10-0-1-11.ca-central-1.compute.internal
hostIP: 10.0.1.11
kind: HostSubnet
subnet: 10.131.0.0/23
metadata:
name: ip-10-0-1-11.ca-central-1.compute.internal

We'd need to check the master controller logs to see why it's not creating the hostsubnet upon node registration as it should be.

/assign @openshift/networking

@tdudgeon can you get master controller logs from the time when the node was being added?

In the node log snipped you pasted, nothing is wrong there. Kubelet/openshift-node will print out the "no networks found" every 30 seconds until the SDN has been able to initialize, that's just how upstream kubelet works right now.

Here's how it works, so everyone is clear.

1) kubelet/openshift-node starts
2) kubelet registers itself as a Node object with the apiserver (eg 'oc get nodes')
3) master SDN controller notices the new Node and creates a new HostSubnet for the node (eg 'oc get hostsubnets')
4) node SDN (built into openshift-node in <= 3.9, but a DaemonSet in 3.10+) periodically polls for its HostSubnet; eventually times out if it doesn't find one
5) if the node SDN finds a HostSubnet matching its hostname, continues initialization
6) when initialization has completed successfully, writes out /etc/cni/net.d/80-openshift-sdn.conf
7) kubelet polls /etc/cni/net.d every 30 seconds for a config file, so it finally sees the config that's written and updates the apiserver with network-ready status

So an empty /etc/cni/net.d is a symptom of the SDN not being able to initialize, not a cause of any particular problem.

@dcbw Thanks for that explanation. Makes more sense now.

Here's some data from the origin-master-controllers logs as the host subnets are being created.
I have 5 nodes (001-005), plus a single master/infra node.
In this case nodes 002 and 005 failed, the others succeeded. The origin-master-controllers log shows this:

Apr 20 12:18:30 simple-master.openrisknet origin-master-controllers[19062]: E0420 12:18:30.886665   19062 actual_state_of_world.go:478] Failed to set statusUpdateNeeded to needed true because nodeName="simple-node-001.openrisknet"  does not exist
Apr 20 12:18:30 simple-master.openrisknet origin-master-controllers[19062]: E0420 12:18:30.886699   19062 actual_state_of_world.go:492] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true because nodeName="si
Apr 20 12:18:31 simple-master.openrisknet origin-master-controllers[19062]: E0420 12:18:31.030795   19062 actual_state_of_world.go:478] Failed to set statusUpdateNeeded to needed true because nodeName="simple-node-003.openrisknet"  does not exist
Apr 20 12:18:31 simple-master.openrisknet origin-master-controllers[19062]: E0420 12:18:31.030821   19062 actual_state_of_world.go:492] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true because nodeName="si
Apr 20 12:18:31 simple-master.openrisknet origin-master-controllers[19062]: E0420 12:18:31.030838   19062 actual_state_of_world.go:478] Failed to set statusUpdateNeeded to needed true because nodeName="simple-node-004.openrisknet"  does not exist
Apr 20 12:18:31 simple-master.openrisknet origin-master-controllers[19062]: E0420 12:18:31.030846   19062 actual_state_of_world.go:492] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true because nodeName="si
Apr 20 12:18:31 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:31.065521   19062 subnets.go:105] Created HostSubnet simple-node-001.openrisknet (host: "simple-node-001.openrisknet", ip: "10.0.0.30", subnet: "10.128.0.0/23")
Apr 20 12:18:31 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:31.095162   19062 subnets.go:105] Created HostSubnet simple-node-003.openrisknet (host: "simple-node-003.openrisknet", ip: "10.0.0.26", subnet: "10.129.0.0/23")
Apr 20 12:18:31 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:31.197434   19062 subnets.go:105] Created HostSubnet simple-node-004.openrisknet (host: "simple-node-004.openrisknet", ip: "10.0.0.33", subnet: "10.130.0.0/23")
Apr 20 12:18:31 simple-master.openrisknet origin-master-controllers[19062]: E0420 12:18:31.223991   19062 actual_state_of_world.go:478] Failed to set statusUpdateNeeded to needed true because nodeName="simple-master.openrisknet"  does not exist
Apr 20 12:18:31 simple-master.openrisknet origin-master-controllers[19062]: E0420 12:18:31.224024   19062 actual_state_of_world.go:492] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true because nodeName="si
Apr 20 12:18:31 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:31.390791   19062 subnets.go:105] Created HostSubnet simple-master.openrisknet (host: "simple-master.openrisknet", ip: "10.0.0.6", subnet: "10.131.0.0/23")
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.669733   19062 nodecontroller.go:525] NodeController observed a new Node: "simple-node-001.openrisknet"
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.669780   19062 controller_utils.go:274] Recording Registered Node simple-node-001.openrisknet in NodeController event message for node simple-node-001.openrisknet
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.669823   19062 nodecontroller.go:542] Initializing eviction metric for zone: UPPMAX:
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.669851   19062 nodecontroller.go:525] NodeController observed a new Node: "simple-node-003.openrisknet"
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.669864   19062 controller_utils.go:274] Recording Registered Node simple-node-003.openrisknet in NodeController event message for node simple-node-003.openrisknet
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.669884   19062 nodecontroller.go:525] NodeController observed a new Node: "simple-node-004.openrisknet"
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.669896   19062 controller_utils.go:274] Recording Registered Node simple-node-004.openrisknet in NodeController event message for node simple-node-004.openrisknet
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.669914   19062 nodecontroller.go:525] NodeController observed a new Node: "simple-master.openrisknet"
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.669927   19062 controller_utils.go:274] Recording Registered Node simple-master.openrisknet in NodeController event message for node simple-master.openrisknet
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: W0420 12:18:33.669981   19062 nodecontroller.go:877] Missing timestamp for Node simple-node-001.openrisknet. Assuming now as a timestamp.
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.670017   19062 openstack_instances.go:45] Claiming to support Instances
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.670753   19062 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"simple-node-003.openrisknet", UID:"f0216b15-4494-11e8-8b21-fa163ef97afb", APIVersion:"", Resou
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.670783   19062 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"simple-node-004.openrisknet", UID:"f028baf8-4494-11e8-8b21-fa163ef97afb", APIVersion:"", Resou
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.670804   19062 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"simple-master.openrisknet", UID:"f04c05ff-4494-11e8-8b21-fa163ef97afb", APIVersion:"", Resourc
Apr 20 12:18:33 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:33.670822   19062 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"simple-node-001.openrisknet", UID:"f0189f4a-4494-11e8-8b21-fa163ef97afb", APIVersion:"", Resou
Apr 20 12:18:34 simple-master.openrisknet origin-master-controllers[19062]: W0420 12:18:34.358121   19062 nodecontroller.go:877] Missing timestamp for Node simple-node-003.openrisknet. Assuming now as a timestamp.
Apr 20 12:18:34 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:34.358182   19062 openstack_instances.go:45] Claiming to support Instances
Apr 20 12:18:34 simple-master.openrisknet origin-master-controllers[19062]: W0420 12:18:34.577973   19062 nodecontroller.go:877] Missing timestamp for Node simple-node-004.openrisknet. Assuming now as a timestamp.
Apr 20 12:18:34 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:34.578060   19062 openstack_instances.go:45] Claiming to support Instances
Apr 20 12:18:34 simple-master.openrisknet origin-master-controllers[19062]: W0420 12:18:34.798315   19062 nodecontroller.go:877] Missing timestamp for Node simple-master.openrisknet. Assuming now as a timestamp.
Apr 20 12:18:34 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:34.798405   19062 openstack_instances.go:45] Claiming to support Instances
Apr 20 12:18:35 simple-master.openrisknet origin-master-controllers[19062]: I0420 12:18:35.017397   19062 nodecontroller.go:743] NodeController detected that all Nodes are not-Ready. Entering master disruption mode.

Note that nodes 002 and 005 are not mentioned. In fact there is no mention of them anywhere in that log file.

A little more on this. I looked at the ansible logs in some detail to look for differences between a node that succeeded (node-001) and one that failed (node-002) and there were hardly any significant differences. The one bit I found that might be indicative is this section:

2018-04-20 12:18:24,777 p=3270 u=centos |  TASK [openshift_node : Start and enable node dep] *****************************************************************************************************************************************************************
2018-04-20 12:18:24,777 p=3270 u=centos |  Friday 20 April 2018  12:18:24 +0000 (0:00:00.449)       0:19:50.675 ********** 
2018-04-20 12:18:24,860 p=3270 u=centos |  skipping: [simple-master.openrisknet]
2018-04-20 12:18:24,927 p=3270 u=centos |  skipping: [simple-node-001.openrisknet]
2018-04-20 12:18:25,059 p=3270 u=centos |  skipping: [simple-node-002.openrisknet]
2018-04-20 12:18:25,122 p=3270 u=centos |  skipping: [simple-node-003.openrisknet]
2018-04-20 12:18:25,124 p=3270 u=centos |  skipping: [simple-node-004.openrisknet]
2018-04-20 12:18:25,153 p=3270 u=centos |  skipping: [simple-node-005.openrisknet]
2018-04-20 12:18:25,205 p=3270 u=centos |  TASK [openshift_node : Start and enable node] *********************************************************************************************************************************************************************
2018-04-20 12:18:25,205 p=3270 u=centos |  Friday 20 April 2018  12:18:25 +0000 (0:00:00.427)       0:19:51.103 ********** 
2018-04-20 12:18:33,432 p=3270 u=centos |  changed: [simple-node-001.openrisknet]
2018-04-20 12:18:33,688 p=3270 u=centos |  changed: [simple-node-004.openrisknet]
2018-04-20 12:18:33,717 p=3270 u=centos |  changed: [simple-master.openrisknet]
2018-04-20 12:18:33,748 p=3270 u=centos |  changed: [simple-node-003.openrisknet]
2018-04-20 12:19:27,719 p=3270 u=centos |  FAILED - RETRYING: Start and enable node (1 retries left).
2018-04-20 12:19:27,980 p=3270 u=centos |  FAILED - RETRYING: Start and enable node (1 retries left).
2018-04-20 12:19:58,166 p=3270 u=centos |  ok: [simple-node-002.openrisknet]
2018-04-20 12:19:58,381 p=3270 u=centos |  ok: [simple-node-005.openrisknet]
2018-04-20 12:19:58,453 p=3270 u=centos |  TASK [openshift_node : Dump logs from node service if it failed] **************************************************************************************************************************************************

Notice that both the nodes that fail do not restart initially but then seem to restart OK at the second attempt. Not sure if this is significant, but it looks like its a strange coincidence that this happen for both of the failed nodes, but none of the successful ones.

Any info on this from anyone who understands the networking issues?

@tdudgeon do you still have the same issue. If not please close this issue.
Otherwise, I confirm that, according to @dcbw explanation, the root cause of a such issue would be a communication problem between the node(s) and the master (apiserver).

Yes, this is most certainly still happening, and I can easily reproduce it.
I see it when using both OpenStack and AWS.
Others have confirmed they see the same problem.

@abessifi
Yes, it exists. Happens, approximately, in 1 of 2 deployments. I use VMWare cloud platform.

This is happening to our cluster as well, same OpenShift version. Has anybody sourced a fix for it yet?

IHAC who is seeing this also during upgrades from a functional 3.7 cluster to 3.9 (AWS, 1x1x2, unremarkable configuration). Seems like there's some weird race condition going on, a simple reboot of the errant node clears it up and it's able to register properly with the SDN.

Rebooting the failed nodes has not worked for me. Once a node is 'tainted' is stays tainted and the only solution is to replace it with a new one and try the install again.

Please could someone who understands the networking setup take ownership of this issue.
I can very easily demonstrate it to anyone who wants to see it happening.

This problem is also being tracked in the Red Hat bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1592010

For me error was with disabled ip forwarding:
F0716 09:39:01.849262 2645 network.go:46] SDN node startup failed: node SDN setup failed: net/ipv4/ip_forward=0, it must be set to 1

After setting it to 1 docker container started and node became "Ready".

What I'm finding now with the latest code on the release-3.9 branch of the ansible installer is that this problem is NOT manifesting itself. I ran through the installation process about 8 times and did not hit this problem.

I'm not confident that this means its 'fixed', but right now I cannot reproduce it.

If the other users who have encountered this could also check if its still happening for them it would be useful.

The problem persists in latest OCP 3.10 with 1 master, 1 infra, 3 worker config

I saw this error this morning while adding a node to a fresh built Origin 3.9 stack. The same issue happened to all 3 nodes I was adding to a fresh built OCP 3.9 stack. I was using openshift-ansible-3.9.32 and will try the same process after updating to openshift-ansible-3.9.40.

I'm using "openshift-ansible-3.9.40-1" and seeing this problem. Is there a work-around so I can get past this?

I've found that the missing /etc/cni/net.d/ files is a semi-random event, and the only solution is to kill the failing node(s) and replace with a new one(s) and try again.
I documented an approach here but its pretty painful!

that is awful... what is supposed to be in /etc/cni/net.d? Can these files be created manually on the nodes before we run the broken openshift-ansible playbooks?

A summary for those joining this issue:

  • Starting in 3.9, the SDN process writes the cni configuration file itself. Ansible no longer does it.
  • The SDN process only writes the file when it is healthy. This means it must be able to talk to the apiserver

To debug this, check the logs of the openshift-node process and, if configured, the SDN pod. It should indicate why it is unhappy.

So I'm installing 3.9, if the API server is not accessible via the nodes, this would make sense. I've had to debug some API server related issues earlier this morning during the install and have found that there is a "Verify API Server" check that runs. If its passing (which it is now), why would this be happening?
TASK [Verify API Server] ********************************************************************************************* task path: /Users/xxxxx/src/xxxxxxx/openshift-ansible/playbooks/openshift-master/private/tasks/wire_aggregator.yml:205 Thursday 23 August 2018 10:24:21 -0400 (0:00:01.492) 0:13:42.184 ******* ok: [52.87.240.177] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://master.internal-elb.mydomain.com:8443/healthz/ready"], "delta": "0:00:00.120436", "end": "2018-08-23 14:24:22.285763", "rc": 0, "start": "2018-08-23 14:24:22.165327", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]} META: ran handlers META: ran handlers

For me there should be a file named /etc/cni/net.d/80-openshift-network.conf that has content like this:

{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}

But I believe the file gets created as a result of successful creation of the SDN. Just creating it manually will not solve the problem.

I found my issue was a problem where the new node could not contact the master.
Our internal DNS is configured with bind views and queries from untrusted servers are forwarded. The new node was untrusted and getting the external ip for the master load balancer, which was being blocked by security groups. When I added the new node to these trusted zone, the node added successfully.
I assume this configuration does not apply to others seeing the issue, but I hope the detail might help in debugging.
Thanks

Found issue with 3.10 but manually creating file and directory did solve the problem. I rebooted after creating the file. 80-openshift-network.conf was missing on some nodes (approximately 50% of them).

Is this file (/etc/cni/net.d/80-openshift-network.conf) should be on the node server itself?
I see it inside /var/lib/docker/overlay/BLABLA/upper/etc/cni/net.d

You shouldn't need to create these files, SDN daemonset should take care of those. Maybe something in the logs of this daemonset pods?

/assign @openshift/sig-networking

I thought issues could be assigned to teams?
/assign @squeed @dcbw

As I mentioned in https://github.com/openshift/openshift-ansible/issues/7967#issuecomment-415459742 - I don't think there is a bug with the installer. The missing SDN file is a flag for larger issues with the cluster.

/close

if you follow the thread after https://github.com/openshift/openshift-ansible/issues/7967#issuecomment-415459742 you'll see multiple folks commenting just not on issues on a fresh install on 3.9 but 3.10 as well.
maybe closing this based on that comment was a bit premature?

I had a corrupted file in the etcd pod that may have been leading to my issue. I did add the files /etc/cni/net.d/80-openshift-network.conf to fix but I'm not able to verify this again. Just wanted to update.

@Azulinho if nodes are reliably failing to come up and join the cluster due to a bug or rough edge in openshift-ansible, then we should definitely open a specific issue for that. However, this particular github issue has become unreadable and, in the 4 months it's been open, hasn't actually been caused by any issues with openshift-ansible. Read the history and you'll see various misconfigurations (some of which should probably be caught... but that's separate).

In any case, this issue is particularly misleading because openshift-ansible shouldn't be writing the file in question anyways.

We should probably try to fix the error message. eg, by having the SDN call glog.Fatalf() with its own error rather than letting kubelet log the wrong thing.

I also met the issue when I tried to install OCP v3.9.14 in my lab. Should I create a /etc/cni/net.d/80-openshift-network.conf as https://github.com/openshift/openshift-ansible/issues/7967#issuecomment-415462544 mention to solve the problem? or any recommend solution?

Creating the CNI config file will not help, because the missing CNI file is just a symptom of the underlying problem, not the problem itself.

Same happens to me with a fresh install of OKD 3.10. (openshift-ansible) and Centos7.5 running on VMWare ESXi.
Master + node on same machine.

Same here, no way to get out. Tried with a fresh install of 3.10 with single machine.

Getting the same result on a fresh install of OCP 3.10 on GCP

same issue on fresh install 3.10 with three master, 2 infra, 2 app nodes

same issue on fresh install 3.10 with three master, 2 infra, 2 app nodes

I am trying to do a fresh install (RHEL 7.5) of OSE 3.10 with 3 masters, 3 infra and 4 compute nodes and it fails miserably when running deploy_cluster playbook. Looking at the messages logs reveals that I am hitting /tc/cni issue discussed here.

Is there any workaround to get going?

I'm going to lock this conversation. This error message is a symptom of 'SDN failed to start', which may have different reason for that - ansible misconfiguration, network issues etc. Commenting 'I have this issue too' won't help and would only confuse others

Was this page helpful?
0 / 5 - 0 ratings