Openj9: Functional Sanity JDK10 Linux s390x tests suddenly take 7 hours

Created on 5 Jul 2018 · 27 Comments · Source: eclipse/openj9

First observed on June 28 in OMR build 574
Test before: https://ci.eclipse.org/openj9/job/Test-Sanity-JDK10-linux_390-64_cmprssptrs/200/
Test After: https://ci.eclipse.org/openj9/job/Test-Sanity-JDK10-linux_390-64_cmprssptrs/201/
Typical build time:
Compile test material: 10min
Sanity functional tests: 1.5hrs
After regression:
Compile test material: 1hr
Sanity functional tests: 6hrs
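
As a quick sanity check on the numbers above, the two phases can be summed and compared. This is only a sketch using the rounded durations quoted in this report; the dictionary names are made up for illustration.

```python
from datetime import timedelta

# Rounded durations quoted in the report above.
typical = {
    "compile": timedelta(minutes=10),
    "functional": timedelta(hours=1, minutes=30),
}
regressed = {
    "compile": timedelta(hours=1),
    "functional": timedelta(hours=6),
}

# Total wall-clock time for each case.
typical_total = sum(typical.values(), timedelta())
regressed_total = sum(regressed.values(), timedelta())

# Per-phase slowdown factor (timedelta / timedelta yields a float).
factors = {phase: regressed[phase] / typical[phase] for phase in typical}

print("typical total:", typical_total)
print("regressed total:", regressed_total)
print("slowdown per phase:", factors)
```

The regressed total comes out at the 7 hours in the title, and both phases slow down by a similar factor rather than only one of them.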

Diff between build 573/574
OpenJ9:
https://github.com/eclipse/openj9/compare/693fe845...be52aeb9
No OMR diff between builds
PRs merged

  • #2295 CMake: Update exclude list in the compiler component

    • No PR build

  • ~#2296 Also fetch branches for checks~

    • Jenkins file changes

  • ~#2245 Process zip files up to 4 GB~

    • PR build on xlinux 8 only

    • Revert PR no change in perf

  • ~#2271 Prevent recognizing JIT Helpers not for their ISA~

    • PR build was typical time

  • ~#2188 [JDK11] Add Dockerfile(s) for z/p Linux and update xLinux Dockerfile~

    • Dockerfile changes

  • #2167 Contribute the source code of CDS adapter to OpenJ9

    • PR compile plinux 8 only

  • ~#2283 Add stub methods to bringup jdk-11+19~

    • PR build was typical time

  • ~#2154 Change Git to SCM step for Copyright and Line Endings checks~

    • Jenkins file changes

(Crossing off the PRs that have been ruled out)
Also affects PR builds
https://ci.eclipse.org/openj9/job/PullRequest-Sanity-JDK10-linux_390-64_cmprssptrs-OpenJ9/

bug build high

All 27 comments

Running a revert of #2245 here

It looks like all cmdLineTester tests are affected. For example:

cmdLineTester_XcheckJNI_0: changed from 8mins to 50mins
cmdLineTester_SCURLClassLoaderNPTests_SE100_1: changed from 9mins to 48mins
cmdLineTester_SCURLClassLoaderTests_1: changed from 9mins to 49mins
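
To put those three examples side by side, here is a small sketch that computes the slowdown factor for each test. The durations are the minutes quoted above; the helper name `slowdown` is just for illustration.

```python
# Durations in minutes, copied from the comment above: (before, after).
timings = {
    "cmdLineTester_XcheckJNI_0": (8, 50),
    "cmdLineTester_SCURLClassLoaderNPTests_SE100_1": (9, 48),
    "cmdLineTester_SCURLClassLoaderTests_1": (9, 49),
}


def slowdown(before, after):
    """Return how many times longer the test takes after the regression."""
    return after / before


slowdowns = {name: slowdown(b, a) for name, (b, a) in timings.items()}

for name, factor in slowdowns.items():
    print(f"{name}: {factor:.2f}x slower")
```

All three land in roughly the same 5x to 6x range, which suggests a common cause rather than a problem in any single test.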

Were any changes made to the machine configuration, either by (re)running the ansible scripts or at the machine provider level?

Does rerunning the Jenkins 200 build have the same good perf it had before?

I think the SDK and tests are fine; it is a machine configuration issue. I reran the test cmdLineTester_XcheckJNI_0 on a lab machine with one of the "bad" SDKs: https://ci.eclipse.org/openj9/job/Build-JDK10-linux_390-64_cmprssptrs/246/artifact/OpenJ9-JDK10-linux_390-64_cmprssptrs-201805070321.tar.gz. The test execution time was normal (~8mins).

I disabled the PR build until we can fix this.

Further tested a full sanity.functional build, which used latest SDK: https://ci.eclipse.org/openj9/job/Build-JDK10-linux_390-64_cmprssptrs/247/artifact/OpenJ9-JDK10-linux_390-64_cmprssptrs-201805071103.tar.gz

The build took only 1hr39mins to complete.

Related to Dan's question, is there a log of configuration activity (given the smaller set of people with access to machines this should be easier to accomplish) or an ansible schedule that can shine a light on this?

If not, it would be good to institute one, putting as much transparency on machine-layer changes as possible.

fyi @jdekonin

Rebuilt the last "good" levels here
Tested here
Rolled back the default gcc version (which we upgraded last week) on ub16-390-1; running a test with the same sdk here

Definitely not a code change issue.
Also doesn't seem to be related to gcc as the rollback of gcc followed by build&test didn't change the perf.
@jdekonin is going to look at the logs to see what else was updated with the gcc7 install and the apt upgrade.

I haven't been able to successfully reboot with the old kernel. zLinux doesn't use grub; it uses zipl as a bootloader. I've followed the basic instructions, but the machine just will not reboot with another kernel specified. At least not through the machine reboot cmdln that sudo has access to, which reboots the instance in under 10sec. I think this needs to be rebooted from the openstack host.

@mstoodle @AdamBrousseau do either of you recall how this can be done on our zLinux machines?

@joransiu helped get these machines, maybe he has the requisite abilities?

I expect this problem is an aspect of the problem being discussed in #1888: slow startup related to Java 9 and later setting -Xmx to 25% of the physical memory on the machine by default, vs Java 8, which uses a default of 512MB.
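
To illustrate how much the default heap request differs between the two rules described in this comment, here is a minimal sketch. It encodes only the rule as stated here (25% of physical memory for Java 9 and later, 512MB for Java 8); any other limits the JVM may apply are not modelled, and the function name `default_max_heap` is hypothetical.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def default_max_heap(physical_bytes, java_version):
    """Default -Xmx per the rule described in the comment (a simplification)."""
    if java_version >= 9:
        return physical_bytes // 4  # 25% of physical memory
    return 512 * MIB  # Java 8 default as stated in the comment


# Example machine sizes are assumptions chosen for illustration only.
for mem_gib in (4, 16, 64):
    physical = mem_gib * GIB
    java8 = default_max_heap(physical, 8) // MIB
    java10 = default_max_heap(physical, 10) // MIB
    print(f"{mem_gib} GiB machine: Java 8 -> {java8} MiB, Java 10 -> {java10} MiB")
```

On larger machines the Java 9+ default asks for a much bigger contiguous region, so any slowness in finding that region would show up on JDK10 jobs first.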

@pshipton that change for Java 9 has existed for months so I doubt it is actually the cause here. It may be related if something in the kernel changed which causes the port library to exhibit the same behaviour as the other issues. The new Linux kernel is likely causing a few different problems here so let's make sure we figure out all of them.

@jdekonin mentioned creating an internal machine with the same kernel level which doesn't exhibit the same slowness, so it's not necessarily the kernel change which caused the slowdown.

Bottom line seems to be that the machines changed and caused the JVM memory allocation to get really slow. While perhaps we could figure out what changed and revert the machines (which is problematic at this time), we should improve the memory allocation to avoid others finding the same issue.

FWIW, the internal machine we created to test this (where the sdk runs fine) is Ubuntu 16.04.4 kernel version 4.4.0-130-generic
The OpenJ9 Jenkins zlinux machines are 16.04.4 with kernel 4.4.0-128-generic

One of the problems is fixed by https://github.com/eclipse/omr/pull/2743, however there is still a problem outstanding. The QUICK memory allocation algorithm can fail to find a suitable candidate but then it falls back to a brute force search which also won't find any suitable memory and can be very slow.
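
The "quick attempt, then brute-force fallback" pattern described in this comment can be sketched as a toy model. This is not the OMR port library code: the function names, the probe order, the address range and the step size are all assumptions made up for illustration. It only shows why the fallback is expensive when no suitable memory exists.

```python
def is_free(addr, size, occupied):
    """True if [addr, addr + size) overlaps no occupied (start, end) range."""
    return all(addr + size <= start or addr >= end for start, end in occupied)


def reserve(size, lo, hi, step, occupied, quick_tries=4):
    """Toy two-phase search. Returns (address or None, number of probes)."""
    probes = 0

    # Quick phase: probe only a few candidates, starting at the top of the range.
    addr = hi - size
    for _ in range(quick_tries):
        if addr < lo:
            break
        probes += 1
        if is_free(addr, size, occupied):
            return addr, probes
        addr -= step

    # Fallback: brute-force scan of every aligned address in the range.
    for addr in range(lo, hi - size + 1, step):
        probes += 1
        if is_free(addr, size, occupied):
            return addr, probes

    return None, probes


# Hypothetical numbers: a 4 GiB range, 1 MiB steps, a 256 MiB request.
HI = 1 << 32
STEP = 1 << 20
SIZE = 1 << 28

print(reserve(SIZE, 0, HI, STEP, occupied=[]))         # range is empty
print(reserve(SIZE, 0, HI, STEP, occupied=[(0, HI)]))  # range is fully occupied
```

In the empty case the quick phase succeeds on the first probe. In the fully occupied case the quick phase fails and the fallback then probes every step in the range before also failing, which is the slow path described above.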

That looks promising as compiling test material only took 6 mins instead of the recent 1hr plus. Testing appears to be going quickly as well.

The whole build took about 1.5 hours.

This is a great result! I'll admit I was skeptical this would address the regression so I'm very pleased to see it resolved.

Thanks to everyone for all the work tracking this down!

> I disabled the PR build until we can fix this.
@AdamBrousseau can you please re-enable the PR build?

Done. I assume this can be closed now.

Thanks, Adam.

For the record, eclipse/openj9-omr#12 merged eclipse/omr#2796 to the v0.9.0-release branch.
