Ref: #695 #1909
Details:
Go through and rerun the playbooks on all Linux Machines, using AWX. I'll do them in batches to make sure it's obvious what caused an issue, if any occur.
Running on build*rhel* : https://awx.adoptopenjdk.net/#/jobs/playbook/648
EDIT: No Issues, moving on
Running on build*centos* : https://awx.adoptopenjdk.net/#/jobs/playbook/650
Funny thing I noticed about AWX: in the centos run, it appears to be running only a single centos machine from each provider. For example, build-osuosl-centos74-ppc64le-2 is in the run, but build-osuosl-centos74-ppc64le-1 is not, despite being in the inventory. Very odd - I'll note down the machines that aren't run in subsequent runs too.
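A quick way to rule out the host pattern itself as the cause is to check which names it would match. This is only a sketch: it assumes the pattern behaves like a shell-style glob, which Python's `fnmatch` approximates, and it does not reproduce AWX's or Ansible's actual matching. The host list is a hand-picked sample of names from this thread, not the real inventory.

```python
from fnmatch import fnmatch

# Pattern used for the centos batch in this thread.
PATTERN = "build*centos*"

# Illustrative sample only; the real inventory lives in AWX.
hosts = [
    "build-osuosl-centos74-ppc64le-1",
    "build-osuosl-centos74-ppc64le-2",
    "build-packet-ubuntu1804-armv8-1",
]

# Keep only the hosts whose names match the glob.
matched = [h for h in hosts if fnmatch(h, PATTERN)]
print(matched)
```

Under that assumption both ppc64le machines match, so a skipped -1 host would point at the inventory or AWX rather than at the pattern.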
After build*centos* run:
build-packet-centos74-ppc64le-2 succeeded
on Apt Upgrade task: fatal: [build-digitalocean-centos69-x64-2]: FAILED! => {"changed": false, "msg": "Error: Cannot find a valid baseurl for repo: base\n", "rc": 1, "results": []}
on Enable EPEL Release task: fatal: [build-osuosl-centos74-ppc64le-2]: FAILED! => {"changed": false, "module_stderr": "Shared connection to 140.211.168.117 closed.\r\n", "module_stdout": "error: rpmdb: BDB0113 Thread/process 7996/70367273601024 failed: BDB1507 Thread died in Berkeley DB library\r\nerror: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
I've seen the EPEL release task failure in https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1868
and the Base URL issue here: https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1745
I'll get to fixing those and rerun the playbook
EDIT: Fixes done (I also accidentally rebuilt the databases on build-osuosl-centos74-ppc64le-1 too :facepalm: ), new build*centos* run here: https://awx.adoptopenjdk.net/#/jobs/playbook/662
build*centos* succeeded.
Onto build*ubuntu* : https://awx.adoptopenjdk.net/#/jobs/playbook/664
Of the build*ubuntu*:
build-scaleway-ubuntu1604-x64-1, build-scaleway-ubuntu1604-armv7-2 and build-packet-ubuntu1804-armv8-1 were all UNREACHABLE with something to the effect of:
UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added '<IP>' (ECDSA) to the list of known hosts.\r\nno such identity: /var/lib/awx/.ssh/id_rsa: No such file or directory\r\nroot@<IP>: Permission denied (publickey).", "unreachable": true}
Looks like there's something wrong with AWX's Private key? Ping @sxa
build-scaleway-ubuntu1604-armv7-1 FAILED on the Swap_File: Create swap file - via DD task, with:
dd: failed to open '//swapfile': Text file busy"
build-alibaba-ubuntu1804-armv8-1 succeeded
No hosts were skipped like they were in build*centos*.
> build-alibaba-ubuntu1804-armv8-1 succeeded
That's reassuring since I ran the playbook on it yesterday successfully :-)
I'm a little surprised we haven't seen that issue with the swapfile previously. I suspect we need to adjust the conditions under which that is executed ...
That should be an easy enough fix - just check if /swapfile exists.
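The existence check could look roughly like the sketch below. It is written in Python purely to illustrate the logic; the real fix would be a condition on the playbook task, and the function name `should_create_swapfile` is hypothetical. The demonstration runs against a temporary directory so it never touches the real /swapfile.

```python
import os
import tempfile


def should_create_swapfile(path="/swapfile"):
    """Return True only when nothing exists at path yet."""
    return not os.path.exists(path)


# Demonstration against a temporary directory rather than the real /swapfile.
with tempfile.TemporaryDirectory() as tmp:
    candidate = os.path.join(tmp, "swapfile")
    first = should_create_swapfile(candidate)   # nothing there yet
    open(candidate, "wb").close()               # stand-in for the dd step
    second = should_create_swapfile(candidate)  # file is now present
```

With a guard like this, a second playbook run would skip the dd step instead of trying to overwrite a swap file that is already in use.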
For the ssh key problem, I did a quick re-run of the failed boxes, and it recurred. I've ssh'd to build-scaleway-ubuntu1604-armv7-2 and added the AWX ssh key to the authorized_keys file, and it now works. The AWX secret file says that the AWX authorized_key is added to a machine via bastillion, so I presume those 3 machines aren't in bastillion?
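For reference, the manual workaround amounts to an idempotent append to authorized_keys. The sketch below shows that logic only; the helper name `ensure_authorized_key` and the key string are placeholders, it does not handle file permissions or ownership, and it is not how bastillion distributes keys.

```python
import os
import tempfile


def ensure_authorized_key(path, public_key):
    """Append public_key to the file at path unless it is already listed.

    Returns True if the key was added, False if it was already present.
    """
    key = public_key.strip()
    content = ""
    if os.path.exists(path):
        with open(path) as fh:
            content = fh.read()
    if key in (line.strip() for line in content.splitlines()):
        return False
    with open(path, "a") as fh:
        # Keep the new key on its own line even if the file lacks a trailing newline.
        if content and not content.endswith("\n"):
            fh.write("\n")
        fh.write(key + "\n")
    return True


# Placeholder key for illustration only; not AWX's real public key.
PLACEHOLDER_KEY = "ssh-rsa AAAAexample awx@example"

with tempfile.TemporaryDirectory() as tmp:
    keys_file = os.path.join(tmp, "authorized_keys")
    added_first = ensure_authorized_key(keys_file, PLACEHOLDER_KEY)
    added_again = ensure_authorized_key(keys_file, PLACEHOLDER_KEY)
    with open(keys_file) as fh:
        final_lines = fh.read().splitlines()
```

Running it twice leaves a single copy of the key, so repeating the workaround on a machine that already has the key is harmless.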
While I wait for the PR to be merged, I've started test*rhel* : https://awx.adoptopenjdk.net/#/jobs/playbook/670?job_search=page_size:20;order_by:-finished;not__launch_type:sync
EDIT:
test-ibmcloud-rhel6-x64-1 : FAILED : No package matching locales
test-aws-rhel76-armv8-1 : FAILED : 404 Error from link in the _Install missing Rhel7 aarch64 deps from Centos Mirror_ task
test-aws-rhel8-x64-1: FAILED: Can't get certain deps for nagios-plugins-all
test-ibmcloud-rhel7-x64-1 : SUCCESS
EDIT2: These should have all been addressed in : #1999
Given the above PR, I'm going to just add any small fixes to that, and carry on with the search.
First test*centos* run: https://awx.adoptopenjdk.net/#/jobs/playbook/672?job_search=page_size:20;order_by:-finished;not__launch_type:sync
Looks like the missing machines issue is back; test-osuosl-centos74-ppc64le-4 isn't running, despite being in the inventory (FYI @sxa)
test*centos* was successful! Running test*ubuntu* : https://awx.adoptopenjdk.net/#/jobs/playbook/674?job_search=page_size:20;order_by:-finished;not__launch_type:sync
Looks like all 19 Ubuntu Hosts are running :)
EDIT: All of them passed! (slightly shocked).
I think that's it for all the Linux hosts, so I'll look to get those PRs in. Once they've been merged, I can rerun all the hosts that failed or were skipped.
Okay, the PRs for the issues I found have been merged, so we have the following machines to do:
Hosts that AWX never ran on:
Hosts that AWX failed on:
Hosts that were 'unreachable' by AWX (I'm just going to ssh to it and put AWX's key into the authorized_keys file):
First run, containing the 'Hosts that AWX never ran on' and 'Hosts that AWX failed on':
test-aws-rhel8-x64-1 failed on task "Create Symlink to (Nagios) Plugins":
refusing to convert from directory to symlink for /usr/local/nagios/libexec",
test-ibmcloud-rhel6-x64-1 failed adoptopenjdk_install task with:
Failed to validate the SSL certificate for github-releases.githubusercontent.com:443. Make sure your managed systems have a valid CA certificate installed.
(Interesting - Python 2.7.18 should be on the machine and used for that task - I'll confirm)
test-osuosl-centos74-ppc64le-4 was unreachable (maybe no AWX key on it)
Ignore the osuosl one for now.
Got it!
For the RHEL machine: the playbook only installs Python 2.7.18 on CentOS machines. I'll fix that up so it applies to RedHat too.
The machines that were initially unreachable by AWX: https://awx.adoptopenjdk.net/#/jobs/playbook/689?job_search=page_size:20;order_by:-finished;not__launch_type:sync
EDIT: Those two worked! woo
test-ibmcloud-rhel6-x64-1 rerun (now that #2005 has been merged): https://awx.adoptopenjdk.net/#/jobs/playbook/692?job_search=page_size:20;order_by:-finished;not__launch_type:sync
EDIT: Failed - in #2005 I forgot to make RHEL6 use the alternative install (like CentOS6)
Rerunning: https://awx.adoptopenjdk.net/#/jobs/playbook/694?job_search=page_size:20;order_by:-finished;not__launch_type:sync
EDIT: It worked! :tada:
With #2006 merged, Rerunning test-aws-rhel8-x64-1 : https://awx.adoptopenjdk.net/#/jobs/playbook/700?job_search=page_size:20;order_by:-finished;not__launch_type:sync
EDIT: Passed!
So, all Linux hosts have had the playbooks run on them successfully, except for 3 of them:
test-osuosl-centos74-ppc64le-4
build-packet-ubuntu1804-armv8-1
build-osuosl-centos74-ppc64le-1
build-packet-ubuntu1804-armv8-1 is one that doesn't need the full playbook to run on it at the moment, although I'll need to look into why the keys aren't being propagated to it.
test-osuosl-centos74-ppc64le-4 doesn't exist and has been replaced with another system with a later Ubuntu level that isn't yet live, so it can continue to be ignored until all the inventory updates are in place.
build-osuosl-centos74-ppc64le-1 ... No idea why this isn't in the inventory, but I've manually added it in AWX and the playbook has run OK on it.
Awesome! Thanks for the update. In that case, every other machine has been done, so, closing issue :+1: