Openjdk-infrastructure: Verify all new test machines have all test prereqs

Created on 22 Jul 2019 · 11 Comments · Source: AdoptOpenJDK/openjdk-infrastructure

We received a bunch of new Macs and Windows machines on the day of the release.
It appears that some of the new machines do not have all of the test prereqs installed:

Example:
from https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk13_j9_sanity.openjdk_x86-64_mac/1/
test-macincloud-macos1013-x64-1 appears to be missing Text/CSV.pm

11:34:20 Can't locate Text/CSV.pm in @INC (you may need to install the Text::CSV module) (@INC contains: ./makeGenTool /Library/Perl/5.18/darwin-thread-multi-2level /Library/Perl/5.18 /Network/Library/Perl/5.18/darwin-thread-multi-2level /Network/Library/Perl/5.18 /Library/Perl/Updates/5.18.2 /System/Library/Perl/5.18/darwin-thread-multi-2level /System/Library/Perl/5.18 /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level /System/Library/Perl/Extras/5.18 .) at makeGenTool/parseFiles.pl line 27.
11:34:20 BEGIN failed--compilation aborted at makeGenTool/parseFiles.pl line 27.
11:34:20 Compilation failed in require at makeGenTool/mkgen.pl line 93.
11:34:20 Using projectRootDir: /Users/jenkins/workspace/Test_openjdk13_j9_sanity.openjdk_x86-64_mac/openjdk-tests/TestConfig/scripts/testKitGen/../../..
11:34:20 Getting modes data from modes.xml and ottawa.csv...
11:34:20 settings.mk:54: /Users/jenkins/workspace/Test_openjdk13_j9_sanity.openjdk_x86-64_mac/openjdk-tests/TestConfig/../TestConfig/utils.mk: No such file or directory
11:34:20 makefile:39: count.mk: No such file or directory
11:34:20 make: *** No rule to make target `count.mk'. Stop.

Now that the release is done, we should revisit all the machines that were added in that 24-hour window and verify that they have all the required prereqs.
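A check for this particular gap could be sketched roughly as below. It is only an illustration, not part of the playbooks or the test framework: it asks `perl` to load each module with `-M<module> -e 1` and reports the ones that fail. The module list and the function name `missing_perl_modules` are assumptions; only Text::CSV is confirmed by the log above.

```python
import subprocess

# Hypothetical list of Perl modules the test scripts need; only Text::CSV
# is confirmed by the console log in this issue.
REQUIRED_PERL_MODULES = ["Text::CSV"]


def missing_perl_modules(modules, run=subprocess.run):
    """Return the modules that `perl -M<module> -e 1` fails to load."""
    missing = []
    for module in modules:
        try:
            result = run(
                ["perl", f"-M{module}", "-e", "1"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            loaded = result.returncode == 0
        except FileNotFoundError:
            # perl itself is not on PATH, so no module can be loaded.
            loaded = False
        if not loaded:
            missing.append(module)
    return missing


if __name__ == "__main__":
    for name in missing_perl_modules(REQUIRED_PERL_MODULES):
        print(f"missing Perl module: {name}")
```

Run on each new machine, this would print one line per module that cannot be loaded and nothing when everything is present.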

bug

All 11 comments

@smlambert All the Windows machines were set up with the playbooks, so if anything is missing from them, we haven't got it documented/covered in the playbooks. The macOS machines were set up by gdams, who has been creating a suitable playbook as he goes, so based on the above example I'll assign this to him :-)

@smlambert Are there any machines still playing up? The playbooks should have all the deps in them now.

I created this issue hoping someone had the time to verify machine config of new machines (and to not forget to revisit)... I have not had time to look, but some machines behave differently than other machines, and it would be good to determine why:

Example 1:
test-macincloud-macos1013-x64-2 - takes 3.5 hrs to run sanity.openjdk
test-macstadium-macos1010-1-XJ - takes 55min to run sanity.openjdk

Also, there are different failures on XJ: 3 tests fail there with "Synchronization failed for node '/'", which suggests a possible problem in config: https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_hs_sanity.openjdk_x86-64_mac/11/

test-macstadium-macos1010-1-XJ is a much meatier machine tbf, so that might explain some of the time differences.

Example 2:
Consistent failure across the new Windows machines: java.lang.RuntimeException: cannot bind first socket to s160-153-234-8/160.153.234.8:5050 unexpected java.net.BindException: Address already in use: Cannot bind

https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_hs_sanity.openjdk_x86-64_windows/11/testReport/java_net_DatagramSocket_ReuseAddressTest/java/ReuseAddressTest/

Last passing on test-azure-win2012r2-x64-1, July 15th when last run on that machine.
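One way to narrow down an "Address already in use" report is to check, outside the test, whether the port can be bound at all on the affected machine. The sketch below is a hypothetical diagnostic, not part of the test suite: it tries a plain UDP bind (the failing test is a DatagramSocket test) and reports whether it succeeded. The function name `udp_port_is_free` is an assumption.

```python
import socket


def udp_port_is_free(host, port):
    """Return True if a plain UDP socket can bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Covers "address already in use" as well as other bind errors,
            # such as an address that does not belong to this machine.
            return False
    return True
```

For example, `udp_port_is_free("0.0.0.0", 5050)` returning False on an idle machine would point at something else holding the port. A True result does not clear the machine, since the test deliberately exercises address reuse and may fail for other reasons.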

@smlambert I'm not sure any particular prereq on the machine would account for that. Has this test been run on a Windows Server 2016 machine before now as the four new godaddy machines are 2016?

I probably titled this issue poorly... not just check prereqs, but note differences of machines.

I'd like to have help noting any differences between all the different machines, including differences in config, network settings and such.

Ideally this would be a co-ordinated effort by infra across all the servers on which we run tests. I do not know the machine lists at openj9/internal in enough detail at the moment to answer your question about which Windows variants we run, but I can try to find out.

@sxa555 - Related to all of this, I saw you have an sxaCheck job. What does that do? Does it check that the machine config matches the playbook? That would be a useful test. In our test suite, I added a test called MachineInfo, which can be augmented to report versions of installed prereqs and fail if they are older than recommended (which does seem to catch us at times with errors, old versions of ant etc). For now that job reports on stuff but will never fail; it exists to help triage failures when tests pass on one machine but not on another...
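The "fail if older than recommended" idea could look something like the sketch below. This is not the MachineInfo test itself, just an illustration under stated assumptions: the minimum versions in `MINIMUM_VERSIONS` are made up, and the names `parse_version` and `outdated_prereqs` are hypothetical.

```python
import re

# Hypothetical minimum versions; the real recommended values would have to
# come from the playbooks or the test documentation.
MINIMUM_VERSIONS = {"ant": "1.10.0", "perl": "5.18.0"}


def parse_version(text):
    """Turn the leading dotted number of a string, e.g. '1.10.5', into (1, 10, 5)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def outdated_prereqs(installed, minimum=MINIMUM_VERSIONS):
    """Return {tool: (found, required)} for tools that are missing or too old.

    Components are compared as integers, so 1.9.14 sorts before 1.10.0.
    """
    problems = {}
    for tool, required in minimum.items():
        found = installed.get(tool)
        if found is None or parse_version(found) < parse_version(required):
            problems[tool] = (found, required)
    return problems
```

A job could collect the installed versions into a dict, call `outdated_prereqs`, and fail when the result is non-empty, while still printing the full report for triage.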

I suspect it would be a lot of work to pull every detail of every package/config on the machine and do a compare at the moment, unfortunately. The playbooks are intended to ensure that the machines have the same config, and re-running them on a regular basis would help ensure that, but the setup they give is the one on the new machines, which is currently not what is required based on your results. We're in a situation (particularly with AIX, Windows and macOS, where the playbooks are relatively new) where we have old machines that were configured manually, while the new ones are set up using the playbooks, which is the correct way forward for providing machines. Ideally we want to see the fallout from doing them this way, identify anything that is missing, and add it to the playbooks where feasible.

Also it's worth noting that as well as being the later version of Windows, the four new test windows servers are on a new (to us) infrastructure provider (godaddy) as opposed to the SoftLayer/Azure systems that most of the existing ones are, so it's always possible that there's something specific to those servers (potentially more likely in the networking area) that may result in different behaviour. Related: Azure machine setup #627 Godaddy machine setup https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/827

The processCheck job is looking for any java processes other than the jenkins slave agent, and will fail if it finds any. It's a WIP job hence the reason it's still prefixed with SXA- but will hopefully make it easier to trap situations where processes have been left around from any jobs, which has caused problems before (And it's possible that your address in use could be due to such a process, however that test appears to be testing specifically for re-use of addresses so there may be more subtleties in that one :-) ) Related: https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/770
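The idea behind such a check can be sketched as follows. This is not the actual processCheck job, only an illustration of the filtering step: given (pid, command line) pairs, keep the java processes that do not look like the Jenkins agent. The jar names in `AGENT_MARKERS` are assumptions about how the agent is launched, and `stray_java_processes` is a hypothetical name.

```python
# Assumption: the Jenkins agent is started from a jar with one of these
# names; the real marker depends on how each machine launches its agent.
AGENT_MARKERS = ("agent.jar", "slave.jar", "remoting.jar")


def stray_java_processes(processes, agent_markers=AGENT_MARKERS):
    """Return (pid, cmdline) pairs for java processes that are not the agent.

    `processes` is an iterable of (pid, cmdline) pairs, for example parsed
    from `ps -eo pid,args` output. Executable paths containing spaces are
    not handled by this simple split.
    """
    strays = []
    for pid, cmdline in processes:
        parts = cmdline.split()
        if not parts:
            continue
        # Reduce the first token to a bare executable name on any platform.
        executable = parts[0].replace("\\", "/").rsplit("/", 1)[-1].lower()
        if executable not in ("java", "java.exe", "javaw.exe"):
            continue
        if any(marker in cmdline for marker in agent_markers):
            continue
        strays.append((pid, cmdline))
    return strays
```

A job built on this would fail whenever the returned list is non-empty and print the leftover command lines, so it is clear which earlier job left them behind.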

@smlambert Looking at https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_hs_sanity.openjdk_x86-64_windows/27/tapResults/ which was on one of the godaddy win2016 systems, I don't see a ReuseAddressTest failure - has it been disabled?

@smlambert Looks like https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_hs_sanity.openjdk_x86-64_windows/71/console ran on one of the godaddy windows 2016 machines and passed ok - are we good to close this now?

Yes, let's close it (it's too general an issue anyway). We can raise more specific issues as they arise.
