Openshift-ansible: Ansible 2.7: Etcd install skipped -> Control plane pods didn't come up

Created on 10 Oct 2018 · 19 Comments · Source: openshift/openshift-ansible

Description

The etcd installation is skipped on a simple single-node setup (in Vagrant).

Version

Ansible version per ansible --version:

ansible 2.7.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/lennart/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/dist-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.15rc1 (default, Apr 15 2018, 21:51:34) [GCC 7.3.0]

The output of git describe:

openshift-ansible-3.10.53-1-2-gb839d825c

Steps To Reproduce

ansible-playbook -i single-node.ini openshift-ansible/playbooks/prerequisites.yml
ansible-playbook -i single-node.ini openshift-ansible/playbooks/deploy_cluster.yml

Expected Results

The cluster installation is successful.

Observed Results

The deploy_cluster.yml playbook fails with the following message:

TASK [openshift_control_plane : Report control plane errors] *******************************************************************************
fatal: [master]: FAILED! => {"changed": false, "msg": "Control plane pods didn't come up"}

NO MORE HOSTS LEFT *************************************************************************************************************************
    to retry, use: --limit @/home/lennart/workspace/elastisys/vagrant/openshift/openshift-ansible/playbooks/deploy_cluster.retry

PLAY RECAP *********************************************************************************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
master                     : ok=248  changed=102  unreachable=0    failed=1   


INSTALLER STATUS ***************************************************************************************************************************
Initialization              : Complete (0:00:11)
Health Check                : Complete (0:00:38)
Node Bootstrap Preparation  : Complete (0:02:03)
etcd Install                : Complete (0:00:04)
Master Install              : In Progress (0:17:28)
    This phase can be restarted by running: playbooks/openshift-master/config.yml


Failure summary:


  1. Hosts:    master
     Play:     Configure masters
     Task:     Report control plane errors
     Message:  Control plane pods didn't come up

Full logs here: deploy.txt

Verbose logs here: deploy-verbose.txt

There were three warnings:

[WARNING]: Could not match supplied host pattern, ignoring: oo_lb_to_config
[WARNING]: Could not match supplied host pattern, ignoring: oo_nfs_to_config
[WARNING]: flush_handlers task does not support when conditional

I further debugged this and found this in the logs of the api pod:

I1010 08:38:44.914483       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"
I1010 08:38:44.914599       1 plugins.go:84] Registered admission plugin "Initializers"
I1010 08:38:44.914608       1 plugins.go:84] Registered admission plugin "ValidatingAdmissionWebhook"
I1010 08:38:44.914615       1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
I1010 08:38:44.914621       1 plugins.go:84] Registered admission plugin "AlwaysAdmit"
I1010 08:38:44.914626       1 plugins.go:84] Registered admission plugin "AlwaysPullImages"
I1010 08:38:44.914634       1 plugins.go:84] Registered admission plugin "LimitPodHardAntiAffinityTopology"
I1010 08:38:44.914643       1 plugins.go:84] Registered admission plugin "DefaultTolerationSeconds"
I1010 08:38:44.914648       1 plugins.go:84] Registered admission plugin "AlwaysDeny"
I1010 08:38:44.914655       1 plugins.go:84] Registered admission plugin "EventRateLimit"
I1010 08:38:44.914660       1 plugins.go:84] Registered admission plugin "DenyEscalatingExec"
I1010 08:38:44.914663       1 plugins.go:84] Registered admission plugin "DenyExecOnPrivileged"
I1010 08:38:44.914668       1 plugins.go:84] Registered admission plugin "ExtendedResourceToleration"
I1010 08:38:44.914675       1 plugins.go:84] Registered admission plugin "OwnerReferencesPermissionEnforcement"
I1010 08:38:44.914683       1 plugins.go:84] Registered admission plugin "ImagePolicyWebhook"
I1010 08:38:44.914688       1 plugins.go:84] Registered admission plugin "InitialResources"
I1010 08:38:44.914693       1 plugins.go:84] Registered admission plugin "LimitRanger"
I1010 08:38:44.914698       1 plugins.go:84] Registered admission plugin "NamespaceAutoProvision"
I1010 08:38:44.914703       1 plugins.go:84] Registered admission plugin "NamespaceExists"
I1010 08:38:44.914707       1 plugins.go:84] Registered admission plugin "NodeRestriction"
I1010 08:38:44.914712       1 plugins.go:84] Registered admission plugin "PersistentVolumeLabel"
I1010 08:38:44.914717       1 plugins.go:84] Registered admission plugin "PodNodeSelector"
I1010 08:38:44.914722       1 plugins.go:84] Registered admission plugin "PodPreset"
I1010 08:38:44.914726       1 plugins.go:84] Registered admission plugin "PodTolerationRestriction"
I1010 08:38:44.914731       1 plugins.go:84] Registered admission plugin "ResourceQuota"
I1010 08:38:44.914736       1 plugins.go:84] Registered admission plugin "PodSecurityPolicy"
I1010 08:38:44.914741       1 plugins.go:84] Registered admission plugin "Priority"
I1010 08:38:44.914747       1 plugins.go:84] Registered admission plugin "SecurityContextDeny"
I1010 08:38:44.914752       1 plugins.go:84] Registered admission plugin "ServiceAccount"
I1010 08:38:44.914757       1 plugins.go:84] Registered admission plugin "DefaultStorageClass"
I1010 08:38:44.914762       1 plugins.go:84] Registered admission plugin "PersistentVolumeClaimResize"
I1010 08:38:44.914766       1 plugins.go:84] Registered admission plugin "StorageObjectInUseProtection"
Invalid MasterConfig /etc/origin/master/master-config.yaml
  etcdClientInfo.ca: Invalid value: "/etc/origin/master/master.etcd-ca.crt": could not read file: stat /etc/origin/master/master.etcd-ca.crt: no such file or directory

Then I went back to the ansible logs and realized that it skipped almost all tasks when installing etcd. (Relevant part of logs here: etcd-installation-log.txt)

Why is this happening? Did I miss something in the inventory file?

Additional Information

There are quite a few issues about control plane pods not starting but I don't think this is a duplicate of any of them.
Here are some of the issues that I looked at before reporting:

#10110 did not include the master node in the nodes group
#10047 fails much later when control plane is already up
#9973 and #7967 could be the same issue. I am missing /etc/cni; however, these issues do not mention anything about etcd not being present.
#9894 seems to be a mistake in the inventory
#9852 seems related to AWS only if I understand correctly

Your operating system and version: CentOS Linux release 7.5.1804 (Core)
Your inventory file:

# single-node.ini mostly copied from inventory/hosts.localhost

master ansible_host=192.168.121.159 ansible_port=22 ansible_user='vagrant' ansible_ssh_private_key_file='/home/lennart/.vagrant.d/insecure_private_key'

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
ansible_become=yes
openshift_deployment_type=origin
openshift_portal_net=172.30.0.0/16
openshift_disable_check=disk_availability,memory_availability,docker_storage

openshift_node_groups=[{'name': 'node-config-all-in-one', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/infra=true', 'node-role.kubernetes.io/compute=true']}]


[masters]
master

[etcd]
master

[nodes]
master openshift_node_group_name="node-config-all-in-one"

Source

lentzi90

Most helpful comment

Ansible 2.7 is not recommended yet - the issue should go away once you downgrade to 2.6

This is a perfectly described issue though, let's use it as the 'ansible 2.7 support' tracking bug

vrutkovs on 10 Oct 2018

❤6

All 19 comments

Thanks for the quick reply!
I switched to the containerized installer to get the correct version, and the control plane came up as expected :)

lentzi90 on 10 Oct 2018

👍2

@lentzi90 thanks for pointing this out. I ran into the exact same issue jumping to 2.7.0. I reverted back to ansible 2.6.2 and the installation proceeded without error.

watsonb on 10 Oct 2018

Just curious: why are you jumping to the latest Ansible version when the requirements clearly state which version to use?

_Not trying to start a debate, I just see more and more people ignoring the requirements and jumping to the latest version._

DanyC97 on 10 Oct 2018

The laptop I'm using is running Ubuntu, which is still stuck with ansible 2.5 in the official repos, so I went with a PPA and got the latest.
To be honest I didn't think Ansible >= 2.6.2 implied that 2.7 was unsupported.

lentzi90 on 10 Oct 2018

@DanyC97

I've got a CI job that I use to run daily tests of the localhost origin installer on the release-3.10 branch. As part of that job, I first install the requirements per the README.md. Reading the requirements, I got the impression it was OK to use _latest_ given I was installing origin and not OCP.

Requirements in README.md (release-3.10)

Requirements:

Ansible >= 2.4.3.0, 2.5.x is not currently supported for OCP installations
Jinja >= 2.7
pyOpenSSL
python-lxml

Code I used to prep installer:

- name: install required pip packages for installer
  pip:
    name: "{{ item }}"
    state: latest
  loop:
    - ansible
    - pyOpenSSL
  tags:
    - skip_ansible_lint

Would it make sense to add a check in the prerequisites.yml playbook that fails if required packages are higher than the supported versions?
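
A check along these lines could be sketched as a single task. This is only a sketch under assumptions, not something taken from openshift-ansible: the lower bound 2.6.2 is the requirement mentioned earlier in this thread, the upper bound 2.7.0 reflects the breakage reported here, and the task name and message are made up for illustration. It relies on the built-in `ansible_version` variable and the `version` test (available since Ansible 2.5).

```yaml
# Sketch only: stop before any installation work starts if the controller's
# Ansible version is outside the range assumed to be supported.
- name: Verify Ansible version is within the supported range (hypothetical check)
  assert:
    that:
      # Assumed bounds; adjust them to whatever the release branch documents.
      - ansible_version.full is version('2.6.2', '>=')
      - ansible_version.full is version('2.7.0', '<')
    msg: >-
      Ansible {{ ansible_version.full }} is outside the assumed supported
      range (>= 2.6.2, < 2.7.0).
  run_once: true
```

Failing with an explicit message would point at the version mismatch directly, instead of leaving it to surface later as "Control plane pods didn't come up".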

nagonzalez on 10 Oct 2018

state: latest

That would pull in latest ansible, which is 2.7. You should use requirements.txt instead
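
Applied to the task quoted above, that could look roughly like the following. It is a sketch: `openshift_ansible_dir` is a hypothetical variable pointing at a local checkout of openshift-ansible, and the versions that get installed are whatever that checkout's requirements.txt specifies.

```yaml
- name: install pip packages listed in openshift-ansible's requirements.txt
  pip:
    # openshift_ansible_dir is an assumed variable, not one defined by the installer.
    requirements: "{{ openshift_ansible_dir }}/requirements.txt"
```

Because the version constraints live in the file, `state: latest` is no longer needed and the task does not upgrade past what the file allows.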

vrutkovs on 11 Oct 2018

👍1

I think it would be a good idea to also mention requirements.txt in the readme. Something like:

You can ensure that you get supported versions of the packages by using requirements.txt when installing pip packages: pip install -r requirements.txt.

lentzi90 on 12 Oct 2018

Faced the same issue today. Continuing the failed installation after downgrading to ansible 2.6.5 didn't work but a clean installation with ansible 2.6.5 did work.

zizzencs on 13 Oct 2018

I believe the problem that caused this should be fixed in Ansible 2.7.1.

In Ansible 2.7.0 we made a change to variable exposure from import_role as described at:

https://docs.ansible.com/ansible/latest/porting_guides/porting_guide_2.7.html#include-role-and-import-role-variable-exposure

However, that exposed a bug in regards to mutable defaults which was resolved for the Ansible 2.7.1 release as part of https://github.com/ansible/ansible/pull/46833

Based on what I understand of these playbooks, I believe that etcd_ca_setup: False carried over from 1 play to another play, causing the when statement on additional etcd role calls to be skipped.

I've had 2 co-workers confirm that Ansible 2.7.1 is behaving properly with the etcd roles.

sivel on 26 Oct 2018

/close
Fixed in Ansible 2.7.1

@sivel (and co-workers) thanks for confirming!

sdodson on 26 Oct 2018

@sdodson: Closing this issue.

In response to this:

/close
Fixed in Ansible 2.7.1

@sivel (and co-workers) thanks for confirming!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot on 26 Oct 2018

/reopen
@mtnbikenc indicates other problems remain https://github.com/openshift/openshift-ansible/pull/10523#issuecomment-433472776

sdodson on 26 Oct 2018

@sdodson: Reopening this issue.

In response to this:

/reopen
@mtnbikenc indicates other problems remain https://github.com/openshift/openshift-ansible/pull/10523#issuecomment-433472776

openshift-ci-robot on 26 Oct 2018

I had this issue as well trying to do a single node install.

A bad (edit: was "the" but surely this is not "the" solution) solution was to ditch the node-config-all-in-one node group and use just node-config-master while also setting openshift_schedulable=True. Then just manually re-label the node and re-run deploy_cluster.

When using node-config-all-in-one the Ansible playbook for some reason decides to try and configure node services before master services.

calston on 29 Oct 2018

@DanyC97

I've got a CI job that I use to run daily tests

I was about to ask what OpenShift/OKD's CI looks like in this regard, because the regressions I've seen in various bits of the install procedure over the last 3 releases suggest there's a big void in functional test coverage.

calston on 29 Oct 2018

This breaks okd deployment on CentOS + EPEL, since they upgraded Ansible to 2.7 (not for the first time, either). @extra is still at 2.4, so EPEL is necessary. Maybe the CentOS SCL responsible for OpenShift should maintain their own Ansible packages which are compatible with openshift-ansible, like RHEL does?

Happy to contribute, if someone points me in the right direction.

leoluk on 1 Nov 2018

So now with 3.11 being out, you can see that a new Ansible RPM has been released in the CentOS extras repo specifically to address the above issue.

If people are happy with it, I think we can close this issue.

DanyC97 on 9 Nov 2018

Ansible 2.7.4 is in CentOS repos - see https://cbs.centos.org/repos/configmanagement7-ansible-27-testing/x86_64/os/ - this seems to work for me, could someone verify that it works?

vrutkovs on 14 Dec 2018
