Openjdk-infrastructure: Regular & visible ansible refreshes of machines

Created on 5 Feb 2019 · 17 comments · Source: AdoptOpenJDK/openjdk-infrastructure

Now that the playbooks are more stable, it would be good to have regular and visible machine refreshes, so that new updates to the playbooks are picked up and deployed on a regular basis. (Related: https://github.com/AdoptOpenJDK/openjdk-infrastructure/pull/624 was submitted 2 weeks ago and needs deployment to test machines.) In addition to refreshes of the full set of machines, one-off updates to a particular machine should also be made visible and recorded in some easy-to-find communication.

Benefits include faster test triage, easier on-boarding of new helpers to the infra team, and visibility to all interested parties.

At present, what tools are used for deployment? Ansible Tower (not visible to non-infra folks)? I ask because this request for scheduled, visible machine refreshes could possibly be addressed by using the Ansible plugins for Jenkins and scheduling a set of infra jobs to run regularly. These jobs would then be visible to more than the infra team, and the infra tasks would be handled in the same way as the build and test tasks. But maybe Ansible Tower gives other benefits, which would be good to understand (as it comes at the cost of visibility/transparency).

I know it's already been discussed by infra and was possibly already planned, so if this is already being done, please point me to it. I would like to help.

bug

All 17 comments

This is something we are looking into at the OpenJ9 CI as well so there's possibility for collaboration or reusing solutions here. eclipse/openj9#4221

Thanks Adam, I was going to post that as a related effort. I was also going to ask on that issue about the difference between using the Jenkins Ansible plugins with a schedule and the Tower approach.

AWX should be rolling out updates regularly - I suspect there is a bug.

"Now that the playbooks are more stable"

While refreshing regularly is a goal, we're not at the stage where they're stable enough to do it on a regular basis, and that has to be a prereq. We're working on it, but bear in mind that we're currently getting very regular requests for new types of systems. Having an infrastructure capable of testing changes before they're deployed in production is therefore critical to ensuring that visible ansible refreshes of production machines don't break anything.

We're a lot closer than we were a couple of months ago but I would not advocate putting this in place right now as I believe the risk would be too great.

Current issue list: https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues?q=is%3Aissue+is%3Aopen+label%3Aansible

For reference, my plan is to start running them manually on subsets of machines and ensure that they are running "green" (many haven't been, and I've done a pile of work under the infra repo to get them closer - we're almost there on xlinux I think, but other things like builds keep getting in the way!). This is the best way to understand any stability problems. Then I'll start running the schedules for them automatically in AWX. Bear in mind that at present it's still just me (Windows excepted) really working on playbook stabilisation.

@sxa555 - apologies, I understood you and Husain were getting very close to playbook goodness, so thought it was timely to propose. I know you are holding the fort on this work (which I greatly appreciate).

Please let me know if there are any small tasks we can help with... (tricky I know due to permissions, etc).

No need to apologise - I want to get there as much as you do :-)

Shelley has confirmed that my re-run on the first machine has gone cleanly and resolved the issue, so I will be continuing to redeploy on other systems - I'll update this comment as and when each one is done.

test-softlayer-rhel69-x64-1 not yet done as it's subject to #698

And I will mention that I verified I could still run openjdk regression tests on test-packet-ubuntu1604-x64-3 after its refresh.

Aha - this is the issue I was missing. Right - I'm happy to pair with @sxa555 and get through this as well. Got stuck on s390 and ppcle with docker.

s390x and ppc64le now resolved as per #714 ... Now running on all UNIX build-*1 machines to validate whether there are outstanding issues
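As a rough illustration of what a selection like build-*1 or test-softlayer* picks out, here is a minimal Python sketch using shell-style globbing. The host list is just a handful of names copied from this thread, not the real inventory, and `select_hosts` is a hypothetical helper, not part of the playbooks or AWX.

```python
from fnmatch import fnmatchcase

# Hostnames copied from this thread for illustration only;
# the real inventory is larger and lives elsewhere.
HOSTS = [
    "build-linaro-centos74-armv8-1",
    "build-linaro-centos74-armv8-2",
    "build-packet-ubuntu1604-armv8-2",
    "build-marist-rhel74-s390x-1",
    "build-joyent-centos69-x64-1",
    "test-softlayer-rhel69-x64-1",
    "test-packet-ubuntu1604-x64-3",
]


def select_hosts(hosts, pattern):
    """Return the hosts whose names match a shell-style glob, in input order."""
    return [host for host in hosts if fnmatchcase(host, pattern)]


# The first build machine of each type, then the softlayer test machines.
first_build_hosts = select_hosts(HOSTS, "build-*1")
softlayer_test_hosts = select_hosts(HOSTS, "test-softlayer*")

print(first_build_hosts)
print(softlayer_test_hosts)
```

Note that a glob such as build-*1 would also match a name ending in 11 or 21, so a tighter pattern (for example build-*-1) may be safer once there are ten or more machines of one type.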

Several machines failed due to the issue addressed by https://github.com/AdoptOpenJDK/openjdk-infrastructure/pull/729

build-linaro-centos74-armv8-1
build-packet-centos74-armv8-1
build-packet-ubuntu1604-armv8-2

There were a few issues with machines being unreachable (armv7 offline, others likely temporary)

build-marist-rhel74-s390x-1
build-marist-rhel74-s390x-2
build-scaleway-ubuntu1604-armv7-2

And a few special-case failures:
build-joyent-centos69-x64-1 - out of space - removed /swapfile - wasn't in use
build-marist-sles12-s390x-1 - zypper upgrade failed - managed to hand hold manually
build-osuosl-ubuntu1604-ppc64le-1 - AO10 download failed with 404 from this link - Seems correct based on this link - Fix is in https://github.com/AdoptOpenJDK/openjdk-infrastructure/pull/730

Now running on a subset of the test machines (test-softlayer*) to validate those.

Thanks for the efforts in getting us to green @sxa555 !

/var/run/systemd/sessions on build-linaro-centos74-armv8-2 is chewing a lot of disk space and is blocking yum commands - rebooting to clear.

numactl needs to be excluded on arm32:

failed: [build-scaleway-ubuntu1604-armv7-2] (item=numactl) => {"changed": false, "failed": true, "item": "numactl", "msg": "No package matching 'numactl' is available"}
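When a run covers many machines, failure lines like the one above can be picked apart mechanically for triage. The following is a minimal Python sketch that assumes every failure line has the same shape as the single line quoted here; `parse_failure` and the regular expression are illustrative, not part of Ansible or the playbooks.

```python
import json
import re

# Line shape assumed from the one failure quoted in this thread:
#   failed: [<host>] (item=<item>) => {<json result>}
FAILED_LINE = re.compile(
    r"^failed: \[(?P<host>[^\]]+)\] \(item=(?P<item>[^)]*)\) => (?P<result>\{.*\})$"
)


def parse_failure(line):
    """Return host, item and message for a failure line, or None if it does not match."""
    match = FAILED_LINE.match(line.strip())
    if match is None:
        return None
    result = json.loads(match.group("result"))
    return {
        "host": match.group("host"),
        "item": match.group("item"),
        "msg": result.get("msg", ""),
    }


# The failure line from this thread, split across string literals for readability.
SAMPLE = (
    'failed: [build-scaleway-ubuntu1604-armv7-2] (item=numactl) => '
    '{"changed": false, "failed": true, "item": "numactl", '
    '"msg": "No package matching \'numactl\' is available"}'
)

parsed = parse_failure(SAMPLE)
print(parsed["host"], parsed["item"], parsed["msg"])
```

Grouping the parsed results by item or message would show quickly whether a failure is specific to one architecture, as with numactl on arm32 here.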

With the new AWX server set up as part of #756 and the fix in https://github.com/AdoptOpenJDK/openjdk-infrastructure/pull/1870 this is mostly resolved. We don't currently have a good AWX process for the Windows machines (we need something that doesn't involve exposing the passwords in cleartext to AWX users), but otherwise we're more or less there. I'm running across all Ubuntu/CentOS/RHEL test machines at the moment, and will do the same for the Linux build ones in the next few days.

Windows password issue mitigated (though automating it might still be problematic). Further issues to cover setting up other platforms:
