I am trying to install OKD 3.11 on a cluster that previously had 3.9 installed and working fine. I removed 3.9 with the uninstall.yml playbook because I didn't get any help on the other issue I raised about the 3.9 to 3.10 upgrade failure (#10690).
Ansible
$ ansible --version
ansible 2.7.2
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
openshift-ansible RPM
$ rpm -q openshift-ansible
openshift-ansible-3.11.37-1.git.0.3b8b341.el7.noarch
Expected result: a successful install.
Observed result: the installation fails with the following output:
TASK [openshift_control_plane : fail] ************************************************************************************************************************
fatal: [os-master-1.example.com]: FAILED! => {"changed": false, "msg": "Node start failed."}
NO MORE HOSTS LEFT *******************************************************************************************************************************************
to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.retry
PLAY RECAP ***************************************************************************************************************************************************
localhost : ok=11 changed=0 unreachable=0 failed=0
os-master-1.example.com : ok=286 changed=41 unreachable=0 failed=1
os-node-1.example.com : ok=101 changed=15 unreachable=0 failed=0
os-node-10.example.com : ok=101 changed=15 unreachable=0 failed=0
os-node-2.example.com : ok=101 changed=15 unreachable=0 failed=0
os-node-3.example.com : ok=101 changed=15 unreachable=0 failed=0
os-node-4.example.com : ok=101 changed=15 unreachable=0 failed=0
os-node-5.example.com : ok=101 changed=15 unreachable=0 failed=0
os-node-6.example.com : ok=101 changed=15 unreachable=0 failed=0
os-node-7.example.com : ok=101 changed=15 unreachable=0 failed=0
os-node-8.example.com : ok=101 changed=15 unreachable=0 failed=0
os-node-9.example.com : ok=101 changed=15 unreachable=0 failed=0
INSTALLER STATUS *********************************************************************************************************************************************
Initialization : Complete (0:01:54)
Health Check : Complete (0:00:42)
Node Bootstrap Preparation : Complete (0:05:39)
etcd Install : Complete (0:01:04)
Master Install : In Progress (0:02:16)
On the master node, the origin-node service fails to come up with the following error:
Nov 27 06:22:30 os-master-1.example.com systemd[1]: Unit origin-node.service entered failed state.
Nov 27 06:22:30 os-master-1.example.com systemd[1]: origin-node.service failed.
Nov 27 06:22:35 os-master-1.example.com systemd[1]: origin-node.service holdoff time over, scheduling restart.
Nov 27 06:22:35 os-master-1.example.com systemd[1]: Starting OpenShift Node...
-- Subject: Unit origin-node.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit origin-node.service has begun starting up.
Nov 27 06:22:35 os-master-1.example.com origin-node[4173]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" d
Nov 27 06:22:35 os-master-1.example.com systemd[1]: origin-node.service: main process exited, code=exited, status=1/FAILURE
Nov 27 06:22:35 os-master-1.example.com systemd[1]: Failed to start OpenShift Node.
-- Subject: Unit origin-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit origin-node.service has failed.
--
-- The result is failed.
I don't understand why it's failing to read the file /etc/origin/node/node-config.yaml when it's there and has read permissions for the root user:
$ ls -l /etc/origin/node/node-config.yaml
-rw-------. 1 root root 1795 Nov 27 06:23 /etc/origin/node/node-config.yaml
Looking into the journal logs I see this:
Nov 27 06:28:32 os-master-1.example.com origin-node[7312]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" due to an error: error reading config: v1.NodeConfig.KubeletArguments: []string: decode slice: expect [ or n, but found 5, error found in #10 byte of ...|tainers":50,"maximum|..., bigger context ...|70"],"max-pods":["80"],"maximum-dead-containers":50,"maximum-dead-containers-per-container":2,"minim|...
Nov 27 06:28:32 os-master-1.example.com systemd[1]: origin-node.service: main process exited, code=exited, status=1/FAILURE
Nov 27 06:28:32 os-master-1.example.com systemd[1]: Failed to start OpenShift Node.
Nov 27 06:28:32 os-master-1.example.com systemd[1]: Unit origin-node.service entered failed state.
Nov 27 06:28:32 os-master-1.example.com systemd[1]: origin-node.service failed.
Nov 27 06:28:37 os-master-1.example.com systemd[1]: origin-node.service holdoff time over, scheduling restart.
Nov 27 06:28:37 os-master-1.example.com systemd[1]: Starting OpenShift Node...
Nov 27 06:28:37 os-master-1.example.com origin-node[7378]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" due to an error: error reading config: v1.NodeConfig.KubeletArguments: []string: decode slice: expect [ or n, but found 5, error found in #10 byte of ...|tainers":50,"maximum|..., bigger context ...|70"],"max-pods":["80"],"maximum-dead-containers":50,"maximum-dead-containers-per-container":2,"minim|...
Nov 27 06:28:37 os-master-1.example.com systemd[1]: origin-node.service: main process exited, code=exited, status=1/FAILURE
Nov 27 06:28:38 os-master-1.example.com systemd[1]: Failed to start OpenShift Node.
Nov 27 06:28:38 os-master-1.example.com systemd[1]: Unit origin-node.service entered failed state.
Nov 27 06:28:38 os-master-1.example.com systemd[1]: origin-node.service failed.
So the problem does not seem to be that the file cannot be read, but that the configuration I wrote is not in the format the node expects. I'm guessing that based on this log line:
Nov 27 06:28:37 os-master-1.example.com origin-node[7378]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" due to an error: error reading config: v1.NodeConfig.KubeletArguments: []string: decode slice: expect [ or n, but found 5, error found in #10 byte of ...|tainers":50,"maximum|..., bigger context ...|70"],"max-pods":["80"],"maximum-dead-containers":50,"maximum-dead-containers-per-container":2,"minim|...
And here is the corresponding configuration in the hosts file:
openshift_node_groups=[{'name': 'ccp-openshift-master', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/node-type=metrics', 'node-role.kubernetes.io/zone=default', 'node-role.kubernetes.io/infra=true'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['80']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': '2h'}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}, {'name': 'ccp-openshift-node', 'labels': ['node-role.kubernetes.io/node-type=logging', 'node-role.kubernetes.io/zone=default'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['80']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': '2h'}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}]
It is at least valid Python (a list of dictionaries), but something is probably still off.
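The error message mentions `KubeletArguments: []string`, which suggests each kubeletArguments value is expected to be a list of strings. A minimal sketch for spotting entries that are not, assuming Python 3 and that the inventory value is pasted in as a Python literal (the helper name `find_non_list_values` and the shortened `sample` string are made up for illustration):

```python
import ast


def find_non_list_values(node_groups_literal):
    """Return (group name, key, value) for each kubeletArguments edit
    whose value is not a list of strings."""
    groups = ast.literal_eval(node_groups_literal)
    problems = []
    for group in groups:
        for edit in group.get("edits", []):
            key = edit.get("key", "")
            value = edit.get("value")
            is_string_list = isinstance(value, list) and all(
                isinstance(item, str) for item in value
            )
            if key.startswith("kubeletArguments.") and not is_string_list:
                problems.append((group["name"], key, value))
    return problems


# Shortened copy of the inventory value above: one group, two edits.
sample = (
    "[{'name': 'ccp-openshift-master', 'labels': [], 'edits': ["
    "{'key': 'kubeletArguments.max-pods', 'value': ['80']}, "
    "{'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': '2h'}]}]"
)

for name, key, value in find_non_list_values(sample):
    print(f"{name}: {key} has non-list value {value!r}")
```

With the shortened sample, only the minimum-container-ttl-duration edit is reported, because its value is the bare string '2h' rather than a list.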
Control plane pods (if I'm correct in assuming that these are the control plane pods) seem to be coming up just fine:
$ oc get pods -n kube-system
NAME READY STATUS RESTARTS AGE
master-api-os-master-1.example.com 1/1 Running 0 19h
master-controllers-os-master-1.example.com 1/1 Running 0 19h
master-etcd-os-master-1.example.com 1/1 Running 0 19h
Can someone please help?
This issue seems to be fixed by wrapping every value set for the value key in [ and ], so that each one is a list:
openshift_node_groups=[{'name': 'ccp-openshift-master', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/node-type=metrics', 'node-role.kubernetes.io/zone=default', 'node-role.kubernetes.io/infra=true'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['70']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': ['2h']}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}, {'name': 'ccp-openshift-node', 'labels': ['node-role.kubernetes.io/node-type=logging', 'node-role.kubernetes.io/zone=default'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['70']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': ['2h']}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}]
Please avoid using INI-style inventories. They have a confusing syntax; YAML should be used instead.
@vrutkovs Thanks for the tip. I think the examples in the documentation are INI style. Could you please point me to a YAML-based example that you would recommend instead?
I know this is closed, but for completeness' sake: could someone translate that line to YAML, so we have an example?
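As a rough sketch of what a YAML form could look like, the Python 3 snippet below prints the fixed openshift_node_groups value from the earlier comment as block-style YAML. The helper names `quote` and `node_groups_to_yaml` are made up for illustration, and where the resulting block belongs in a YAML inventory (which group's vars section) is not confirmed anywhere in this thread, so treat the output as a starting point rather than a tested inventory.

```python
def quote(text):
    """Single-quote a scalar so values such as '70' stay strings in YAML."""
    return "'" + str(text).replace("'", "''") + "'"


def node_groups_to_yaml(groups, indent="  "):
    """Render the node group list as block-style YAML text."""
    lines = ["openshift_node_groups:"]
    for group in groups:
        lines.append(f"{indent}- name: {quote(group['name'])}")
        lines.append(f"{indent}  labels:")
        for label in group["labels"]:
            lines.append(f"{indent}    - {quote(label)}")
        lines.append(f"{indent}  edits:")
        for edit in group["edits"]:
            lines.append(f"{indent}    - key: {quote(edit['key'])}")
            lines.append(f"{indent}      value:")
            for item in edit["value"]:
                lines.append(f"{indent}        - {quote(item)}")
    return "\n".join(lines)


# Both groups in the fixed inventory line use the same edits.
edits = [
    {"key": "kubeletArguments.max-pods", "value": ["70"]},
    {"key": "kubeletArguments.image-gc-high-threshold", "value": ["70"]},
    {"key": "kubeletArguments.minimum-container-ttl-duration", "value": ["2h"]},
    {"key": "kubeletArguments.maximum-dead-containers", "value": ["50"]},
    {"key": "kubeletArguments.maximum-dead-containers-per-container", "value": ["2"]},
]
groups = [
    {
        "name": "ccp-openshift-master",
        "labels": [
            "node-role.kubernetes.io/master=true",
            "node-role.kubernetes.io/node-type=metrics",
            "node-role.kubernetes.io/zone=default",
            "node-role.kubernetes.io/infra=true",
        ],
        "edits": edits,
    },
    {
        "name": "ccp-openshift-node",
        "labels": [
            "node-role.kubernetes.io/node-type=logging",
            "node-role.kubernetes.io/zone=default",
        ],
        "edits": edits,
    },
]

yaml_text = node_groups_to_yaml(groups)
print(yaml_text)
```

Every scalar is single-quoted on purpose: an unquoted 70 would be read as an integer by a YAML parser, which is the same kind of type mismatch the node complained about.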