OpenShift upgrade 3.7.x -> 3.9 croaks miserably if there is an issue with repodata on a master
A step that checks whether an excluder is available runs an Ansible repoquery task and registers the result as repoquery_out. If that task fails, Ansible simply marks the host as failed and stops executing tasks on it for the rest of the play. Unfortunately, that means the host never reaches the next step, which uses the Ansible fail module to abort the whole playbook.
In this case, we had an issue with repomd on our 2nd master, and the result was that the 1st and 3rd masters were both upgraded to 3.9.0 while the 2nd master was stuck at 3.7.2. Normal execution of the upgrade playbook was then impossible, and we had to manually bump the version of origin-master-api and origin-master-controllers to 3.8.0 in /etc/sysconfig/${componentname} and re-run the upgrade play. Other issues fell out of this as well, such as the compute/master node relabel for 3.9's updated role nomenclature, which had to be run by hand.
The repoquery output from the failed play is as follows. It shows the failure, which is not trapped by a failed_when or ignore_errors clause on the repoquery task, and then shows master2 simply being pruned from the host list while the play goes on without ever tripping the fail safety check.
TASK [openshift_excluder : Get available excluder version] ************************************************************************************************************************************************************************
Monday 21 May 2018 14:56:41 +0000 (0:00:00.523) 0:01:08.273 ************
ok: [master1]
ok: [master3]
fatal: [master2]: FAILED! => {"changed": false, "msg": {"cmd": "/usr/bin/repoquery --plugins --quiet --pkgnarrow=repos --queryformat=%{version}|%{release}|%{arch}|%{repo}|%{version}-%{release} --config=/tmp/tmpIyrcR1 origin-docker-excluder-3.9*", "package_found": false, "package_name": "origin-docker-excluder-3.9*", "results": {}, "returncode": 1, "stderr": "Not using downloaded centos-all-x86_64-redacted-repo-name/repomd.xml because it is older than what we have:\n Current : Tue Dec 19 19:07:12 2017\n Downloaded: Tue Dec 19 19:07:07 2017\nfailure: repodata/a47046e8ca319d5b41f80b84e384e21fec560391-primary.sqlite.bz2 from centos-all-x86_64-redacted-repo-name: [Errno 256] No more mirrors to try.\nhttp://YUMREPO/REPONAME/REPOENV/rhel/x86_64/all/repodata/a47046e8ca319d5b41f80b84e384e21fec560391-primary.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found\n", "stdout": ""}}
TASK [openshift_excluder : Fail when excluder package is not found] ***************************************************************************************************************************************************************
Monday 21 May 2018 14:57:16 +0000 (0:00:35.110) 0:01:43.384 ************
skipping: [master1]
skipping: [master3]
TASK [openshift_excluder : Set fact excluder_version] *****************************************************************************************************************************************************************************
Monday 21 May 2018 14:57:16 +0000 (0:00:00.271) 0:01:43.655 ************
ok: [master1]
ok: [master3]
Version: openshift-ansible 3.9.27-1 (and master), built from the Dockerfile and run as a container.
Expected: a clean abend that aborts the entire attempted upgrade when one of the 3 masters is unprepared for it.
Observed: the plays march on to their doom, and ~10 minutes later the upgrade of the master api/controllers is completed on master1 and master3, leaving master2 untouched.
Recommended change: add failed_when or a simple ignore_errors to the repoquery task, since the next step is going to cause the entire play to fail anyway when the repoquery output does not indicate that the relevant excluder is available.
I would submit a PR, but it's a single line of code and I don't know whether you'd rather I submit a PR or just file this issue. The pseudocode fix is inline in the issue; the approach (ignore_errors or failed_when) is purely a matter of style/preference.
-m
@mshutt I think this is related to https://github.com/openshift/openshift-ansible/pull/8446 but that PR is not yet merged in 3.9
@DanyC97 I do not believe it is, no. I just checked that PR and it would seem that a failure in "repoquery" would still result in the same outcome...
Again, the underlying issue was a broken yum cache... with broken cached repo metadata from a broken repository. While these are not specifically in the purview of openshift-ansible to handle, if openshift-ansible has a dependency on something and that something is broken and if the next step is to fail if that thing is broken, it ought to do that rather than just proceeding on but removing one of the master nodes from the control plane upgrade.
@mshutt thx for the info. I suggest opening a PR against the master branch first, and then either
a) wait until it is merged and ask a core contributor to kick the bot to cherry-pick it into the other release branches, or
b) create PRs for the other branches like 3.9/3.7 yourself
@DanyC97 after looking deeper, there are myriad places where a single host might fail during the upgrade play... Pursuant to https://github.com/openshift/openshift-ansible/pull/7225/files, would it make better sense to simply add "any_errors_fatal: true" to the root of the upgrade playbooks? Is there any possible reason that a failure would be desirable on a single master or node and for the plays to continue on hosts other than the one with the failure?
To be more fine grained, perhaps the any_errors_fatal option should be added only to the control plane plays rather than the node plays, and perhaps even then only to the "verification" steps?
I've also seen a lot of chatter about bugs related to any_errors_fatal being in includes or blocks, so this may be a lot of "fail" (sorry for the pun)...
Thoughts? I think this speaks to a larger project-wide decision about how to handle upgrades and how to deal with the fact that a single master might have an issue during the upgrade? Moreover, this may also only be "such a big deal" with the 3.9 upgrade because the upgrade plays actually detect if you're 3.7 to take you to 3.8... and then detect if you are 3.8 to take you to 3.9... and it does so by checking the /etc/sysconfig/ based image name for the origin-master-api systemd service (container)... so once you've gotten the first master to 3.9, if you still have a master at 3.7, the playbooks themselves cannot handle moving forward?
Is there any possible reason that a failure would be desirable on a single master or node and for the plays to continue on hosts other than the one with the failure?
It's certainly possible to have only some nodes (or even masters) set up or updated and still have a working install. The remaining parts could be fixed later on.
In this particular step it's complicated for repoquery to detect whether it's a temporary issue or missing/incorrect repos
@vrutkovs Sure, but given that this is the 3.7 -> 3.9 upgrade, once the first master is upgraded to 3.9 and the play abends, the playbook is no longer able to do the 3.7 -> 3.9 upgrade. Perhaps this is a better thing to fix than to ask the upgrade to fail entirely before trying if one of the masters cannot be upgraded? In our case, I manually upgraded the origin-master-{api,controllers} to v3.8.0 by editing the /etc/sysconfig/${thing} k/v pair to define the 3.8.0 container version on the failed master and manually relabeled the un-upgraded master as node-role.kubernetes.io/master: "true" and re-ran the full upgrade playbook which took us through the rest of the 3.8 -> 3.9 upgrade steps such as generating the configmap for the openshift-web-console from the assetConfig in master-config.yaml and so on and so forth.
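For anyone who hits the same state, the manual recovery described above could be sketched as a one-off play along these lines. This is only a rough sketch under assumptions: the inventory name master2, the IMAGE_VERSION key in the sysconfig files, and the use of the inventory hostname as the node name are illustrative and should be checked against the actual environment before running anything.

```yaml
# Sketch of the manual recovery, not part of openshift-ansible.
# Assumptions: "master2" is the stranded master's inventory name, the
# sysconfig files carry an IMAGE_VERSION=<tag> line, and the node is
# registered under its inventory hostname.
- name: Bring the stranded master to the interim 3.8.0 image
  hosts: master2
  become: true
  tasks:
    - name: Pin origin-master-api and origin-master-controllers to v3.8.0
      lineinfile:
        path: "/etc/sysconfig/{{ item }}"
        regexp: "^IMAGE_VERSION="
        line: "IMAGE_VERSION=v3.8.0"
      with_items:
        - origin-master-api
        - origin-master-controllers

    - name: Apply the 3.9-style master role label by hand
      command: >
        oc label node {{ inventory_hostname }}
        node-role.kubernetes.io/master=true --overwrite
```

After a play like this, the full upgrade playbook would be re-run so that it detects 3.8 on that master and carries it through the remaining 3.8 -> 3.9 steps.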
I do think that this situation is unique since there are essentially two upgrades being done by one instantiation of the upgrade playbook... and this changes the proposition entirely whereas this may be a non-issue in an environment that was only going from 3.6 to 3.7 or from 3.9 to 3.10 (in theory)
This is why I am seeking guidance before I propose a fix one way or the other :) . If this is ENOTABUG and Working as intended, I will simply document the behavior and transmogrify the containerized ansible.cfg by way of binding a custom one over the top to define any_errors_fatal: true for our own environment. I would not want to have an operator be forced to intervene at this level during a scheduled upgrade of a production environment.
Ah, yes, 3.7 -> 3.9 is special, probably any_errors_fatal would be required to make it pass.
@sdodson @michaelgugino @mtnbikenc WDYT?
@vrutkovs Yeah, my first thought was to add any_errors_fatal: true to the next task in verify_excluder.yml and to set failed_when: false on the repoquery step, but then I looked through all of the upgrade stuff and saw that this is only one of many potential pitfalls wherein some random environmental issue unrelated to openshift might cause this same problem.
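A rough sketch of that first idea, for reference. The task names are taken from the log output in the issue; the repoquery arguments, the excluder variable, and the shape of repoquery_out are assumptions rather than a copy of verify_excluder.yml.

```yaml
# Sketch only. Module arguments and the layout of repoquery_out are assumed.
- name: Get available excluder version
  repoquery:
    name: "{{ excluder }}"        # hypothetical variable holding the package glob
    ignore_excluders: true
  register: repoquery_out
  # Never mark the host as failed here, so it stays in the play and
  # reaches the explicit check below.
  failed_when: false

- name: Fail when excluder package is not found
  fail:
    msg: "Package {{ excluder }} not found"
  # Treat a missing "results" key (as in the failed output above) the same
  # as package_found being false.
  when: "not ((repoquery_out.results | default({})).package_found | default(false))"
  # Abort the play for every host, not just the one that tripped the check.
  any_errors_fatal: true
```

Without any_errors_fatal on the second task (or on the play), the fail module would only drop the one host again, which is the behavior this issue is about.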
Maybe it would be better to handle the case wherein some masters made it to 3.9.0 and 1 is still stuck at 3.7.x. Or maybe it would be better to have the 3.8 interim upgrade broken into its own playbook, but then you'd have the issue(s) wherein users would be running 3.8 and that's obviously undesirable as well.
It's a tricky situation; however, I wouldn't "rewind" the initial decision and end up with _but then you'd have the issue(s) wherein users would be running 3.8_
@vrutkovs any_errors_fatal: true at the play level should do it for any master plays.
I know we tolerate some level of failure for nodes.
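As a minimal illustration of that split (not the real upgrade playbooks: the play names, host groups, and placeholder tasks below are assumptions), the keyword would sit on the master plays only:

```yaml
# Illustrative layout only; the debug tasks stand in for the real roles.
- name: Verify and upgrade the control plane
  hosts: oo_masters_to_config     # assumed group name for the masters
  any_errors_fatal: true          # one failed master ends the play for all masters
  tasks:
    - name: Placeholder for control plane verification and upgrade tasks
      debug:
        msg: "control plane tasks go here"

- name: Upgrade nodes
  hosts: oo_nodes_to_upgrade      # assumed group name for the nodes
  # No any_errors_fatal here: a failed node is dropped from the play and
  # the remaining nodes carry on, which matches the tolerance for node failures.
  tasks:
    - name: Placeholder for node upgrade tasks
      debug:
        msg: "node tasks go here"
```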
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.