Openshift-ansible: Installing 3.11 cluster fails with "Node start failed"

Created on 27 Nov 2018 · 4 Comments · Source: openshift/openshift-ansible

Description

I am trying to install OKD 3.11 on a cluster that previously ran 3.9 without problems. I removed 3.9 with the uninstall.yml playbook because I got no help on the other issue I raised, #10690, about a failed 3.9 to 3.10 upgrade.

Version

Ansible

$ ansible --version
ansible 2.7.2
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

openshift-ansible RPM

$ rpm -q openshift-ansible
openshift-ansible-3.11.37-1.git.0.3b8b341.el7.noarch

Steps To Reproduce

Try to install a 3.11 cluster with this hosts file.

Expected Results

Successful install

Observed Results

Installation fails with the following output:

TASK [openshift_control_plane : fail] ************************************************************************************************************************
fatal: [os-master-1.example.com]: FAILED! => {"changed": false, "msg": "Node start failed."}

NO MORE HOSTS LEFT *******************************************************************************************************************************************
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.retry

PLAY RECAP ***************************************************************************************************************************************************
localhost                  : ok=11   changed=0    unreachable=0    failed=0
os-master-1.example.com    : ok=286  changed=41   unreachable=0    failed=1
os-node-1.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-10.example.com     : ok=101  changed=15   unreachable=0    failed=0
os-node-2.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-3.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-4.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-5.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-6.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-7.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-8.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-9.example.com      : ok=101  changed=15   unreachable=0    failed=0


INSTALLER STATUS *********************************************************************************************************************************************
Initialization              : Complete (0:01:54)
Health Check                : Complete (0:00:42)
Node Bootstrap Preparation  : Complete (0:05:39)
etcd Install                : Complete (0:01:04)
Master Install              : In Progress (0:02:16)

On the master node, the origin-node service fails to come up with the following error:

Nov 27 06:22:30 os-master-1.example.com systemd[1]: Unit origin-node.service entered failed state.
Nov 27 06:22:30 os-master-1.example.com systemd[1]: origin-node.service failed.
Nov 27 06:22:35 os-master-1.example.com systemd[1]: origin-node.service holdoff time over, scheduling restart.
Nov 27 06:22:35 os-master-1.example.com systemd[1]: Starting OpenShift Node...
-- Subject: Unit origin-node.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit origin-node.service has begun starting up.
Nov 27 06:22:35 os-master-1.example.com origin-node[4173]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" d
Nov 27 06:22:35 os-master-1.example.com systemd[1]: origin-node.service: main process exited, code=exited, status=1/FAILURE
Nov 27 06:22:35 os-master-1.example.com systemd[1]: Failed to start OpenShift Node.
-- Subject: Unit origin-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit origin-node.service has failed.
-- 
-- The result is failed.

I don't understand why it's failing to read the file /etc/origin/node/node-config.yaml when it's there and has read permissions for the root user:

$ ls -l /etc/origin/node/node-config.yaml
-rw-------. 1 root root 1795 Nov 27 06:23 /etc/origin/node/node-config.yaml

Looking into the journal logs I see this:

Nov 27 06:28:32 os-master-1.example.com origin-node[7312]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" due to an error: error reading config: v1.NodeConfig.KubeletArguments: []string: decode slice: expect [ or n, but found 5, error found in #10 byte of ...|tainers":50,"maximum|..., bigger context ...|70"],"max-pods":["80"],"maximum-dead-containers":50,"maximum-dead-containers-per-container":2,"minim|...
Nov 27 06:28:32 os-master-1.example.com systemd[1]: origin-node.service: main process exited, code=exited, status=1/FAILURE
Nov 27 06:28:32 os-master-1.example.com systemd[1]: Failed to start OpenShift Node.
Nov 27 06:28:32 os-master-1.example.com systemd[1]: Unit origin-node.service entered failed state.
Nov 27 06:28:32 os-master-1.example.com systemd[1]: origin-node.service failed.
Nov 27 06:28:37 os-master-1.example.com systemd[1]: origin-node.service holdoff time over, scheduling restart.
Nov 27 06:28:37 os-master-1.example.com systemd[1]: Starting OpenShift Node...
Nov 27 06:28:37 os-master-1.example.com origin-node[7378]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" due to an error: error reading config: v1.NodeConfig.KubeletArguments: []string: decode slice: expect [ or n, but found 5, error found in #10 byte of ...|tainers":50,"maximum|..., bigger context ...|70"],"max-pods":["80"],"maximum-dead-containers":50,"maximum-dead-containers-per-container":2,"minim|...
Nov 27 06:28:37 os-master-1.example.com systemd[1]: origin-node.service: main process exited, code=exited, status=1/FAILURE
Nov 27 06:28:38 os-master-1.example.com systemd[1]: Failed to start OpenShift Node.
Nov 27 06:28:38 os-master-1.example.com systemd[1]: Unit origin-node.service entered failed state.
Nov 27 06:28:38 os-master-1.example.com systemd[1]: origin-node.service failed.

So the problem is not that the file cannot be read; the configuration I wrote does not have the shape the node expects. I'm basing that on this log line:

Nov 27 06:28:37 os-master-1.example.com origin-node[7378]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" due to an error: error reading config: v1.NodeConfig.KubeletArguments: []string: decode slice: expect [ or n, but found 5, error found in #10 byte of ...|tainers":50,"maximum|..., bigger context ...|70"],"max-pods":["80"],"maximum-dead-containers":50,"maximum-dead-containers-per-container":2,"minim|...
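
The message says the decoder wants a []string for each kubeletArguments entry ("expect [ or n, but found 5"). A rough way to see which keys in the quoted fragment are not written as lists is to scan it with a regular expression. This is only a sketch: the names pair and scalar_keys are made up for illustration, and the fragment is just the "bigger context" text from the log, not the whole node-config.yaml.

```python
import re

# Fragment copied from the "bigger context" part of the journal message above.
fragment = ('70"],"max-pods":["80"],"maximum-dead-containers":50,'
            '"maximum-dead-containers-per-container":2,"minim')

# Match "key": followed by either a bracketed list or a bare scalar.
pair = re.compile(r'"([\w-]+)":(\[[^\]]*\]|[^,\[\]"]+)')

def scalar_keys(text):
    """Return the keys whose value is not written as a [...] list."""
    return [key for key, value in pair.findall(text) if not value.startswith("[")]

print(scalar_keys(fragment))
```

On this fragment the scan reports maximum-dead-containers and maximum-dead-containers-per-container, the two keys that appear with bare numbers. It says nothing about how those values ended up that way in the rendered config.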

And the corresponding configuration in hosts file:

openshift_node_groups=[{'name': 'ccp-openshift-master', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/node-type=metrics', 'node-role.kubernetes.io/zone=default', 'node-role.kubernetes.io/infra=true'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['80']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': '2h'}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}, {'name': 'ccp-openshift-node', 'labels': ['node-role.kubernetes.io/node-type=logging', 'node-role.kubernetes.io/zone=default'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['80']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': '2h'}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}]

At least it's a valid Python literal (a list of dictionaries), but something is probably still off.
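
One way to look for entries that are "off" is to parse the value and check that every kubeletArguments edit carries a list of strings. The sketch below uses ast.literal_eval as a stand-in for however the inventory value gets parsed (an assumption), works on an abbreviated copy of the line above, and uses a made-up helper name, non_list_edits.

```python
import ast

# Abbreviated copy of the openshift_node_groups value from the hosts file above
# (first group only, labels and edits trimmed), kept as raw inventory text.
raw = ("[{'name': 'ccp-openshift-master', "
       "'labels': ['node-role.kubernetes.io/master=true'], "
       "'edits': ["
       "{'key': 'kubeletArguments.max-pods', 'value': ['80']}, "
       "{'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': '2h'}, "
       "{'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}]}]")

def non_list_edits(text):
    """Yield (group, key, value) for kubeletArguments edits whose value is not a list of strings."""
    for group in ast.literal_eval(text):
        for edit in group.get('edits', []):
            value = edit['value']
            is_string_list = (isinstance(value, list)
                              and all(isinstance(item, str) for item in value))
            if edit['key'].startswith('kubeletArguments.') and not is_string_list:
                yield group['name'], edit['key'], value

problems = list(non_list_edits(raw))
print(problems)
```

For this abbreviated input the only entry flagged is minimum-container-ttl-duration, whose value is the bare string '2h'. Whether that entry alone is what produced the journal error is not established here.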

Control plane pods (if I'm correct in assuming that these are the control plane pods) seem to be coming up just fine:

$ oc get pods -n kube-system
NAME                                         READY     STATUS    RESTARTS   AGE
master-api-os-master-1.example.com           1/1       Running   0          19h
master-controllers-os-master-1.example.com   1/1       Running   0          19h
master-etcd-os-master-1.example.com          1/1       Running   0          19h

Can someone please help?

Source

dharmit

All 4 comments

This issue seems to have been fixed by wrapping every value set for the value key in [ and ], so that each one is a list.

openshift_node_groups=[{'name': 'ccp-openshift-master', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/node-type=metrics', 'node-role.kubernetes.io/zone=default', 'node-role.kubernetes.io/infra=true'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['70']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': ['2h']}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}, {'name': 'ccp-openshift-node', 'labels': ['node-role.kubernetes.io/node-type=logging', 'node-role.kubernetes.io/zone=default'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['70']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': ['2h']}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}]

dharmit on 28 Nov 2018

Please avoid INI-style inventories. Their syntax is confusing; use YAML instead.

vrutkovs on 28 Nov 2018

❤2

@vrutkovs Thanks for the tip. I think the examples in the documentation are INI style. Could you please point me to a YAML-based example that you'd recommend instead?

dharmit on 29 Nov 2018

👍3

I know this issue is closed, but for completeness' sake: could someone translate the openshift_node_groups line above to YAML, so we have an example?
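
No YAML version was posted in the thread. As a rough sketch, the fixed value can be converted mechanically: parse the inline literal and print it as block-style YAML. The to_yaml helper below is made up for illustration and only handles this particular shape (name, labels, edits with list values). The input is an abbreviated copy of the fixed line, and where the resulting openshift_node_groups key belongs in a YAML inventory (presumably under some group's vars) is an assumption the thread does not confirm.

```python
import ast

# Abbreviated copy of the fixed openshift_node_groups value
# (first group only, labels and edits trimmed).
raw = ("[{'name': 'ccp-openshift-master', "
       "'labels': ['node-role.kubernetes.io/master=true', "
       "'node-role.kubernetes.io/infra=true'], "
       "'edits': ["
       "{'key': 'kubeletArguments.max-pods', 'value': ['70']}, "
       "{'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': ['2h']}]}]")

def to_yaml(groups):
    """Render the node-group list as block-style YAML.

    Every scalar is single-quoted so values such as '70' stay strings.
    """
    lines = ["openshift_node_groups:"]
    for group in groups:
        lines.append(f"  - name: '{group['name']}'")
        lines.append("    labels:")
        lines += [f"      - '{label}'" for label in group['labels']]
        lines.append("    edits:")
        for edit in group['edits']:
            lines.append(f"      - key: '{edit['key']}'")
            lines.append("        value:")
            lines += [f"          - '{item}'" for item in edit['value']]
    return "\n".join(lines)

yaml_text = to_yaml(ast.literal_eval(raw))
print(yaml_text)
```

Running it on the full line instead of the abbreviated copy would give one list item per node group, with each value written as a YAML list of quoted strings.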