Openjdk-infrastructure: AIX test machines at OSUOSL not available

Created on 22 Oct 2020 · 23 Comments · Source: AdoptOpenJDK/openjdk-infrastructure

test-osuosl-aix72-ppc64-1 is marked as having CPAN allegedly not working on it (need to verify the current issue via Grinder).
test-osuosl-aix72-ppc64-2 is currently offline - raising with OSUOSL.

bug aix osuosl systemdown testFail

All 23 comments

-2 is now back up and running. The ssh keys weren't up to date on either machine, but that has now been resolved by refreshing them. It was trying to connect to -1 using a DNS entry which was no longer in place, so that has now been fixed too. Just need to see what the issues are with CPAN and whether either of them can now run test jobs properly.

I'll run a sanity system and openjdk test on both to begin with. I think these would trigger an error if either machine is exhibiting CPAN issues.

I ran both system and openjdk sanity tests on both machines. The tests were able to run without error. The following test cases, from openjdk sanity, failed on both machines:

jdk_lang_j9_0
jdk_math_j9_0
jdk_util_j9_0

None of the system tests failed. Where were you notified that CPAN was not working on either machine?

I'm also running an openjdk sanity test on test-osuosl-aix72-ppc64-1 via Grinder, in case the CPAN issues occur only via Grinder.
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/4288/console

Those three suites failing is a concern: JDK11/J9 sanity.openjdk appears to pass on the other machines, so we have something that needs to be fixed: https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_j9_sanity.openjdk_ppc64_aix/211

@smlambert ref the discussion we had in the team meeting.
On https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/4288/, a sanity openjdk test ran on test-osuosl-aix72-ppc64-1.

In terms of machine dependencies and configuration, would you know why java/lang/ProcessBuilder/Basic.java#id0.Basic_id0 might fail? The machine meets the prereqs

It may be helpful to look at what the test itself does (and whether it is doing anything special on AIX). If it is behaving well on one machine but not another, can you compare what LIBPATH is on the machines you are trying to compare? If you search for AIX in the test source, you will see several places where there is AIX-specific handling of args and such, starting with:

https://github.com/AdoptOpenJDK/openjdk-jdk11u/blob/master/test/jdk/java/lang/ProcessBuilder/Basic.java#L75
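The LIBPATH comparison suggested above could be sketched roughly as follows. This is only an illustration: the function names are made up for this example, and the LIBPATH values would need to be collected by hand from each machine (for instance with `echo $LIBPATH`).

```python
def split_libpath(libpath):
    """Split a colon-separated LIBPATH value into unique, non-empty entries, keeping order."""
    return list(dict.fromkeys(entry for entry in libpath.split(":") if entry))


def compare_libpath(first, second):
    """Report entries unique to each machine and whether the shared entries keep the same order."""
    a, b = split_libpath(first), split_libpath(second)
    return {
        "only_first": [entry for entry in a if entry not in b],
        "only_second": [entry for entry in b if entry not in a],
        # Search order matters for library lookup, so flag reordering too.
        "same_order": [e for e in a if e in b] == [e for e in b if e in a],
    }


if __name__ == "__main__":
    # Hypothetical values, not taken from the real machines.
    print(compare_libpath("/usr/lib:/opt/freeware/lib",
                          "/opt/freeware/lib:/usr/lib:/usr/local/lib"))
```

A difference in either the entries or their order would be a reasonable first thing to rule out before digging into the test itself.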

@Haroon-Khel Have you looked more into this? Would be good to get these two machines live again if possible. We are restricted on AIX testing capacity.

The test failure is caused by https://github.com/ibmruntimes/openj9-openjdk-jdk11/blob/29d8a1d89c10cfd0cf86075b292bb4be6b196e29/test/jdk/java/lang/ProcessBuilder/Basic.java#L1794, and the 3 lines that follow it.
From what I have gathered, the test tries to trigger an expected java.lang.OutOfMemoryError, and then tries to look for certain expected results in stderr. In this case the stderr is empty, causing this test to fail.
Continuing to look into this.

Weirdly, the test has just passed on test-osuosl-aix72-ppc64-1. It appears intermittent, as it has just failed again after a subsequent run.
When it passed, I captured what the stderr for the test is supposed to look like:

JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2020/11/24 07:39:36 - please wait.
JVMDUMP032I JVM requested Java dump using '/home/jenkins/jdk-11.0.8+10/bin/JTwork/scratch/javacore.20201124.073936.24445318.0001.txt' in response to an event
JVMDUMP010I Java dump written to /home/jenkins/jdk-11.0.8+10/bin/JTwork/scratch/javacore.20201124.073936.24445318.0001.txt
JVMDUMP032I JVM requested Snap dump using '/home/jenkins/jdk-11.0.8+10/bin/JTwork/scratch/Snap.20201124.073936.24445318.0002.trc' in response to an event
JVMDUMP010I Snap dump written to /home/jenkins/jdk-11.0.8+10/bin/JTwork/scratch/Snap.20201124.073936.24445318.0002.trc
JVMDUMP007I JVM Requesting Tool dump using '/home/jenkins/jdk-11.0.8+10/bin/java -version'
JVMDUMP011I Tool dump created process 27132240
openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.8+10)
Eclipse OpenJ9 VM AdoptOpenJDK (build openj9-0.21.0, JRE 11 AIX ppc64-64-Bit Compressed References 20200715_695 (JIT enabled, AOT enabled)
OpenJ9   - 34cf4c075
OMR      - 113e54219
JCL      - 95bb504fbb based on jdk-11.0.8+10)
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at Basic$JavaChild.main(Basic.java:368)
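For comparing a passing run against a failing one, a small helper could pull the written dump files out of stderr like the capture above. This is a sketch that assumes the `JVMDUMP010I <type> dump written to <path>` line shape seen in this capture; other OpenJ9 messages may be formatted differently.

```python
import re

# Pattern based only on the JVMDUMP010I lines captured in this thread.
DUMP_WRITTEN = re.compile(r"^JVMDUMP010I (\w+) dump written to (\S+)$")


def written_dumps(stderr):
    """Return (dump_type, path) pairs for every 'dump written' line in stderr."""
    dumps = []
    for line in stderr.splitlines():
        match = DUMP_WRITTEN.match(line.strip())
        if match:
            dumps.append((match.group(1), match.group(2)))
    return dumps
```

An empty list from a failing run would confirm that the JVM never got as far as writing the dumps, or that stderr was lost before the test read it.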

Look at ojdk01, ojdk02, (the AIX 7.1 systems) and ojdk03 and ojdk04 (the AIX 7.2 pair):

Since this is about CPAN: the default perl used on the AIX 7.1 systems is the AIX-supplied perl, which is ancient (5.10).

ojdk03 - does not have all the ssh keys it is supposed to have - to allow automated login from OSUNIM; ojdk04 - for the 3rd time at least, no longer has either the OSUNIM or my admin authorized keys.

IMHO: there are systems outside these systems making unauthorized changes - because my PKI keys keep getting restored, and keep getting removed. What else is being modified?

OSUNIM key added to the set of machines that you have access to, and your key has also been reinstated on 03/04 - hopefully it won't disappear this time, as it was deployed properly through our automation.

I did some digging.

This same test failure affected aarch64, https://github.com/eclipse/openj9/issues/9032. The solution there was to exclude the test case for that platform, https://github.com/AdoptOpenJDK/openjdk-tests/pull/1716/files.

This test used to be excluded on aix due to https://github.com/AdoptOpenJDK/openjdk-tests/issues/1397 but has since been reincluded, https://github.com/AdoptOpenJDK/openjdk-tests/pull/1788, due to an upstream fix.

For the sake of re-adding the ci.role.test label to test-osuosl-aix72-ppc64-1 and test-osuosl-aix72-ppc64-2, could this test be excluded for aix? Thoughts @smlambert @sxa

Yes to re-excluding, but will want someone to chase down the reason we thought the upstream fix would/did fix the issue.

If I've understood it correctly, I think the upstream fix was for a different issue related to the same test.

@adamfarley, given it was your upstream fix, can you check whether the test failure that is happening is different from what was fixed via: https://bugs.openjdk.java.net/browse/JDK-8239365 ?

No, I don't think so. My issue wasn't an OOM, and the bug I fixed wasn't checking against the error class. It was checking against an error message supplied by the OS, derived from an error message "set" that could change depending on what sets you'd installed.

If you weren't referring to the OOM, please include a job link, trss link, or a copy of the error output.

OSUNIM key added to the set of machines that you have access to, and your key has also been reinstated on 03/04 - hopefully it won't disappear this time, as it was deployed properly through our automation.

The key for 03/04 has been removed - again. 01/02 is working fine.

root@p8-aix2-osunim:[/home/root]ssh [email protected] date
[email protected]'s password:
root@p8-aix2-osunim:[/home/root]ssh [email protected] date
[email protected]'s password:
root@p8-aix2-osunim:[/home/root]ssh [email protected] date
Fri Dec  4 06:03:49 PST 2020
root@p8-aix2-osunim:[/home/root]ssh [email protected] date
Fri Dec  4 06:03:56 PST 2020
root@p8-aix2-osunim:[/home/root]

Using my desktop I can access 01/02, but not 03/04 - when using the hostname (but can when using IP address??)

 ssh [email protected] date
Warning: Permanently added 'p8-aix2-ojdk02.osuosl.org' (RSA) to the list of known hosts.
X11 forwarding request failed on channel 0
Fri Dec  4 06:12:09 PST 2020

++++++
  04/12/2020 î‚°  15:06.58 î‚°  /home/mobaxterm î‚° ssh [email protected] date
Warning: Permanently added '140.211.9.28' (RSA) to the list of known hosts.
X11 forwarding request failed on channel 0
Fri Dec  4 08:07:09 CST 2020
                                                                                                                                  ✔

  04/12/2020 î‚°  15:07.10 î‚°  /home/mobaxterm î‚° ssh [email protected] date
Warning: Permanently added 'p8-aix2-ojdk03.osuosl.org' (RSA) to the list of known hosts.
[email protected]'s password:

                                                                                                                                  ✘

  04/12/2020   15:08.19   /home/mobaxterm  nslookup 140.211.9.28

Name:      140.211.9.28
Address 1: 140.211.9.28 p8-aix2-ojdk03.osuosl.org
                                                                                                                                  ✔

  04/12/2020 î‚°  15:08.36 î‚°  /home/mobaxterm î‚° ssh [email protected] date
X11 forwarding request failed on channel 0
Fri Dec  4 08:08:56 CST 2020

++++++++++

Strangely enough, a few moments later it works for both IP and hostname addressing:


  04/12/2020 î‚°  15:10.13 î‚°  /home/mobaxterm î‚° ssh [email protected] date
Warning: Permanently added '140.211.9.36' (RSA) to the list of known hosts.
X11 forwarding request failed on channel 0
Fri Dec  4 08:10:26 CST 2020
                                                                                                                                  ✔

  04/12/2020 î‚°  15:10.26 î‚°  /home/mobaxterm î‚° ssh [email protected] date
Warning: Permanently added 'p8-aix2-ojdk02.osuosl.org' (RSA) to the list of known hosts.
X11 forwarding request failed on channel 0
Fri Dec  4 06:12:09 PST 2020
                                                                                                                                  ✔

  04/12/2020 î‚°  15:12.09 î‚°  /home/mobaxterm î‚° ssh [email protected] date
X11 forwarding request failed on channel 0
Fri Dec  4 08:12:45 CST 2020
                                                                                                                                  ✔

  04/12/2020 î‚°  15:12.45 î‚°  /home/mobaxterm î‚° ssh [email protected] date
X11 forwarding request failed on channel 0
Fri Dec  4 08:13:02 CST 2020

No idea what is causing this - but not a warm and cozy feeling.

My idea now - is that there is - perhaps - an unknown second agent or program that is updating the authorized file.

Again - I cannot access ojdk04 - either as myself or as the nim admin account - both internal and external IP addresses attempted.

root@p8-aix2-osunim:[/home/root]ssh [email protected]
[email protected]'s password:

[email protected]'s password:

[email protected]'s password:

This is getting tiresome. Somewhere there is a bug - and it should not be this host - but I have no clue.

When I get access again, I'll try to remember to create an audit record to at least see when the authorized file is being updated. Maybe from that we can locate the source.
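As a rough idea of what such a record could look like, here is a polling sketch that logs whenever the file's modification time or contents change. It is a stand-in written for illustration, not the AIX audit subsystem, and the function names and log format are made up for this example.

```python
import hashlib
import os
import time


def fingerprint(path):
    """Return (mtime, sha256 hex digest) for the file, or None if it does not exist."""
    try:
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        return os.stat(path).st_mtime, digest
    except FileNotFoundError:
        return None


def record_change(path, previous, log_path):
    """Append a timestamped line to log_path if the file differs from `previous`.

    Returns the current fingerprint so the caller can pass it to the next poll.
    """
    current = fingerprint(path)
    if current != previous:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(log_path, "a") as log:
            log.write(f"{stamp} {path} changed: {current}\n")
    return current
```

Calling `record_change` in a loop (or from cron) with the previous return value would give a timeline of when the authorized keys file is rewritten. It would not say who rewrote it, so it only narrows down the time window to correlate with other logs.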

My idea now - is that there is - perhaps - an unknown second agent or program that is updating the authorized file.

Nothing unknown about it - we use Bastillion to manage access. That machine (and 9.28) had duplicate entries in the system, so it was updating the keys file twice - once for the full admin user set and once for the AIX set. I've removed the duplicate so it won't happen again.
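A check for that kind of duplicate could look roughly like this. It is a hypothetical sketch over a plain list of (host, profile) pairs; it does not use Bastillion's actual data model or API, and the host and profile names are invented.

```python
from collections import defaultdict


def duplicate_hosts(registrations):
    """Map each host registered more than once to the profiles it appears under."""
    by_host = defaultdict(list)
    for host, profile in registrations:
        by_host[host].append(profile)
    return {host: profiles for host, profiles in by_host.items() if len(profiles) > 1}


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = [("host-a", "admins"), ("host-a", "aix"), ("host-b", "aix")]
    print(duplicate_hosts(inventory))
```

Any host that shows up under more than one profile would have its keys file written once per profile, so the last writer wins and keys from the other set disappear.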

On the basis that the problematic tests have been excluded, I'm going to re-enable those two test machines as we have a significant backlog on AIX testing just now.

Added ci.role.test back onto:

FYI @andrew-m-leonard both are now running test jobs starting with these two:

Seeing as the failing test was excluded, can this issue be closed?

Yep the machines are running the tests on a regular basis now so this can be closed :-)
