win2012r2-x86-4 got a test timeout from running slow; pls reboot it. Or maybe just reboot all the Windows machines except win2012r2-x86-3, which was rebooted yesterday.
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/143/
@jdekonin @AdamBrousseau
The sanitize job reboots Windows.
Ok, I'll just keep an eye out for more problems.
The problem happened again on win2012r2-x86-2; please take a look at the machine to see if there is anything left running.
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/144
There doesn't appear to be anything Java-related running on it other than the Jenkins agent.
Latest timeout is on win2012r2-x86-3
@Mesbah-Alam did something change with the SharedClasses.SCM23.MultiCL_0 test? It's been consistently running for over two hours and timing out on Windows for the past 3-4 days.
Same for this machine. Appears to be okay. I rebooted this one too since it was idle in Jenkins.
did something change with the SharedClasses.SCM23.MultiCL_0 test? It's been consistently running for over two hours and timing out on Windows for the past 3-4 days.
Hi @pshipton,
The change done recently in https://github.com/eclipse/openj9-systemtest/pull/98/files reduces load in the MultiThreadMultiCL variation of this test. No change was done for the MultiCL variation of the test - which is what this issue is about.
If this test is consistently timing out on Windows, we need to reduce the load on it.
Or, rethinking, did the reboot solve the issue? Because the test has not changed in the recent past, it should not start timing out all of a sudden due to a test issue.
@Mesbah-Alam I don't think the reboot helped. I don't know why the test was timing out, but I think it always runs slow even when it doesn't timeout. Pls check other platforms as well. If the test is running for over an hour across multiple platforms I think it needs to be re-tuned.
If the test is running for over an hour across multiple platforms I think it needs to be re-tuned.
The timeout for this test is set to 2h, so it looks like the workload of this test needs to be reduced.
Will check other platforms. Thanks.
Another timeout, on win2012r2-x86-6
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/147
IGNORE THIS, IT IS INCORRECT
This test used to take 10 min; it got slow between
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/142/
OpenJ9: 535e903
OMR: 640f1df
OpenJDK: 8bf2a81
================================================
timestamp: 20190920-011107
repo dir: /cygdrive/c/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests
git repo:
Fetch URL: https://github.com/AdoptOpenJDK/openjdk-tests.git
sha:
c91c6009de728a74cdfca10218929c2b1706c144
================================================
timestamp: 20190920-011335
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/stf
git repo:
Fetch URL: https://github.com/AdoptOpenJDK/stf.git
sha:
5b2abc832159739899df0d2153b86ea5eb823564
================================================
timestamp: 20190920-011337
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/openjdk-systemtest
git repo:
Fetch URL: https://github.com/AdoptOpenJDK/openjdk-systemtest.git
sha:
7de3fbe8a7ef85954645f35bef04906990f519b2
================================================
timestamp: 20190920-011338
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/openj9-systemtest
git repo:
Fetch URL: https://github.com/eclipse/openj9-systemtest.git
sha:
4facf81fff1eb0897bc764839f0e6d62cb54571b
and
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/143/
OpenJ9: 82609ef
OMR: 20db4fb
OpenJDK: 502ab02
================================================
timestamp: 20190920-220322
repo dir: /cygdrive/c/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests
git repo:
Fetch URL: https://github.com/AdoptOpenJDK/openjdk-tests.git
sha:
7ced48a96d1d4f63f4f5d8a6a0f3dee832f1de0c
================================================
timestamp: 20190920-220550
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/stf
git repo:
Fetch URL: https://github.com/AdoptOpenJDK/stf.git
sha:
5b2abc832159739899df0d2153b86ea5eb823564
================================================
timestamp: 20190920-220553
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/openjdk-systemtest
git repo:
Fetch URL: https://github.com/AdoptOpenJDK/openjdk-systemtest.git
sha:
6fad60418aa33671d59de6d25bdce854d7c320ba
================================================
timestamp: 20190920-220554
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/openj9-systemtest
git repo:
Fetch URL: https://github.com/eclipse/openj9-systemtest.git
sha:
79a666d92f7b3e7a11491c7436e07a96e23f70fa
IGNORE THIS, IT IS INCORRECT
https://github.com/eclipse/openj9/compare/535e903...82609ef
https://github.com/eclipse/openj9-omr/compare/640f1df...20db4fb
https://github.com/ibmruntimes/openj9-openjdk-jdk11/compare/8bf2a81...502ab02
https://github.com/AdoptOpenJDK/openjdk-tests/compare/c91c600...7ced48a
https://github.com/AdoptOpenJDK/openjdk-systemtest/compare/7de3fbe...6fad604
https://github.com/eclipse/openj9-systemtest/compare/4facf81...79a666d
IGNORE THIS, IT IS INCORRECT
I suspect #7149. Trying a grinder with the feature disabled. https://ci.eclipse.org/openj9/view/Test/job/Grinder/519/
I see the -XX:-ShareUnsafeClasses option being used in the grinder, but it's still running slow.
@doomerxe pls create a JVM with the change reverted and run a grinder to see if the problem still occurs.
08:24:50 To rebuild the failed test in a jenkins job, copy the following link and fill out the <Jenkins URL> and <FAILED test target>:
08:24:50 <Jenkins URL>/parambuild/?JDK_VERSION=11&JDK_IMPL=openj9&BUILD_LIST=system&JenkinsFile=openjdk_x86-64_windows&TARGET=<FAILED test target>
08:24:50
08:24:50 For example, to rebuild the failed tests in <Jenkins URL>=https://ci.adoptopenjdk.net/job/Grinder, use the following links:
08:24:50 https://ci.adoptopenjdk.net/job/Grinder/parambuild/?JDK_VERSION=11&JDK_IMPL=openj9&BUILD_LIST=system&JenkinsFile=openjdk_x86-64_windows&TARGET=SharedClasses.SCM23.MultiCL_0
@hangshao0 @DanHeidinga
@mpirvu fyi in case it's the Port j9methodserver change
IGNORE THIS, IT IS INCORRECT
@doomerxe Shoot, ignore the above, I got the fast/slow runs backwards. Trying again.
fast
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/140
OpenJ9: c9d84df
OMR: 54fb162
OpenJDK: 8bf2a81
slow
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/142/
OpenJ9: 535e903
OMR: 640f1df
OpenJDK: 8bf2a81
https://github.com/eclipse/openj9/compare/c9d84df...535e903
https://github.com/eclipse/openj9-omr/compare/54fb162...640f1df
There was a build in between, but it failed. Maybe another platform will show a similar slowdown.
Between the following builds, SharedClasses.SCM23.MultiCL_0 on ppc64le went from 13 min to 55 min.
The systemtest material didn't change.
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_ppc64le_linux_Nightly/147
OpenJ9: c9d84df
OMR: 54fb162
OpenJDK: 8bf2a81
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_ppc64le_linux_Nightly/148
OpenJ9: b0e9913
OMR: ffb0d98
OpenJDK: 8bf2a81
https://github.com/eclipse/openj9/compare/c9d84df...b0e9913
https://github.com/eclipse/openj9-omr/compare/54fb162...ffb0d98
I think it has to be one of the JIT changes.
@andrewcraik @fjeremic
JDK8 runs don't seem affected, only JDK11+
Confirmed with Windows JDK13 runs
fast (11m)
https://ci.eclipse.org/openj9/job/Test_openjdk13_j9_extended.system_x86-64_windows_Nightly/42
OpenJ9: e60693e
OMR: 54fb162
OpenJDK: 05ba4bd
slow (58m)
https://ci.eclipse.org/openj9/job/Test_openjdk13_j9_extended.system_x86-64_windows_Nightly/43
OpenJ9: 0449f02
OMR: ffb0d98
OpenJDK: 05ba4bd
I suspect it is the changes to the method handle inlining - JDK11 has them and JDK8 doesn't. The improvements will not be reverted since they are providing orders-of-magnitude performance improvements in general and are the strategic direction for the optimizer. @cathyzhyi is working on further reducing the overhead of the inlining to improve performance in the next release. A similar timeout exists on a few handle-intensive tests on z.
FYI @liqunl. @hzongaro, can we look at where the compile-time increase is happening to see if we can do anything simple to help cut the path length?
@pshipton what mode are the tests running in - is there a count=0 or disableAsyncCompilation in play?
is there a count=0 or disableAsyncCompilation in play
SharedClasses.SCM23.MultiCL uses -Xaot:forceAoT,count=1
Hmm - so I'm not sure whether the problem is the compile time or the fact that, with the aggressive handle inlining, we won't be able to do AOT. @mpirvu we might need your help to figure out whether the inliner changes are affecting AOT or not. There is a plan to support method handles with AOT, but that is still being designed.
We can get a sense of compilation time from verbose logs: -Xjit:verbose={compilePerformance},vlog=vlog.txt
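For anyone reproducing this locally, a hypothetical invocation combining the test's AOT options with the verbose logging above might look like the following; the jar and main class are placeholders rather than the actual STF harness entry point, and the braces need quoting in most shells:
java -Xaot:forceAoT,count=1 "-Xjit:verbose={compilePerformance},vlog=vlog.txt" -cp sharedclasses-tests.jar net.openj9.sc.MultiCLDriver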
@Mesbah-Alam is there a setup that can be accessed to do investigation? Would you be able to get the log mentioned by @mpirvu above?
Would you be able to get the log
@andrewcraik - I am running a grinder with -Xjit:verbose={compilePerformance},vlog=vlog.txt. Will share the log here once it finishes.
Following attachments contain the vlogs:
Generated by running the test on Windows
Generated by running the test on Linux
@andrewcraik @mpirvu
I looked at the verbose logs on Windows and there are some pretty large compilation times.
Example:
vlog.txt.20190926.111756.2632
Samples MAX(ms) TOTAL(ms) AVG(usec) MIN(usec)
Total 12103 23799.8 1921149.8 158733 1
aotl 4079 5033.9 26443.8 6483 1
cold 389 924.6 6539.0 16810 502
aotw 5256 18279.7 314258.4 59790 582
warm 2298 23799.8 1570534.7 683435 717
pvhot 1 2077.3 2077.3 2077320 2077320
scorc 1 601.8 601.8 601772 601772
jni 79 619.1 694.8 8795 584
MAXLINE=+ (warm) net/openj9/sc/classes/Test_Init_0200.makeString(I)Ljava/lang/String; @ 0x0000000040FB2980-0x0000000040FB68BF OrdinaryMethod - Q_SZ=431 Q_SZI=431 QW=3282 j9m=000000000020D0C8 bcsz=82 OSR time=23799795us mem=[region=32128 system=32768]KB compThread=0 CpuLoad=398%(99%avg) JvmCpu=0%
Samples MAX(ms) TOTAL(ms) AVG(usec) MIN(usec)
Total 12112 22830.5 1999275.8 165066 89
aotl 4589 1648.9 15966.2 3479 89
cold 410 531.9 6975.5 17014 496
aotw 4777 12036.7 219629.1 45976 265
warm 2256 22830.5 1754560.8 777731 699
pvhot 1 2086.9 2086.9 2086927 2086927
jni 79 3.0 57.3 726 567
MAXLINE=+ (warm) net/openj9/sc/classes/Test_Init_0318.makeString(I)Ljava/lang/String; @ 0x0000000041105C80-0x00000000411099D1 OrdinaryMethod - Q_SZ=300 Q_SZI=252 QW=2376 j9m=00000000002F78C8 bcsz=82 OSR time=22830462us mem=[region=31680 system=32768]KB compThread=1 CpuLoad=400%(100%avg) JvmCpu=0%
Total compilation time on Linux is about half (AOT warm compilations take less time on average):
Samples MAX(ms) TOTAL(ms) AVG(usec) MIN(usec)
Total 10945 1996.4 909153.1 83066 23
aotl 1440 214.4 2537.1 1762 23
cold 1404 323.8 10573.0 7531 181
aotw 6277 554.4 93173.2 14844 205
warm 1739 1996.4 802518.4 461483 1463
jni 85 249.7 351.4 4134 159
MAXLINE=+ (warm) net/openj9/sc/classes/TestClass_5088.stringOperations(III)V @ 00007F1A280898C8-00007F1A2808AB04 OrdinaryMethod - Q_SZ=3207 Q_SZI=3206 QW=20312 j9m=0000000001316148 bcsz=241 time=1996394us mem=[region=33600 system=49152]KB compThread=1 CpuLoad=391%(97%avg) JvmCpu=73%
Samples MAX(ms) TOTAL(ms) AVG(usec) MIN(usec)
Total 11478 6544.4 968261.5 84358 22
aotl 1515 210.2 1760.4 1162 22
cold 1559 484.8 11537.3 7400 182
aotw 6509 6544.4 99417.8 15274 206
warm 1808 1517.9 851852.7 471157 1583
hot 1 3658.5 3658.5 3658467 3658467
jni 86 5.8 34.7 404 161
MAXLINE=+ (AOT warm) jdk/internal/module/SystemModules$default.moduleDescriptors()[Ljava/lang/module/ModuleDescriptor; @ 00007FFA8D961240-00007FFA8D9E3B65 OrdinaryMethod - Q_SZ=1330 Q_SZI=1330 QW=9353 j9m=000000000081F860 bcsz=19723 GCR time=6544374us mem=[region=74304 system=81920]KB compThread=2 CpuLoad=401%(100%avg) JvmCpu=70%
Hmm - now that is odd - why is Windows so much slower? Worse JIT compiler optimization at build time or something? It's unlikely that the difference would be the inliner change if it is platform specific, but I've been proven wrong before.
Note plinux seemed to have been affected as well https://github.com/eclipse/openj9/issues/7178#issuecomment-535484726
The other platforms seem fine according to https://github.com/eclipse/openj9-systemtest/pull/99#issuecomment-535156261
Andrew @andrewcraik, Marius @mpirvu, do you think it would be worth the time to get -Xjit:verbose={compilePerformance} logs for Windows before the degradation and for ppc64le_linux both before and after?
Yes, it makes sense to see whether compilation time increased or it was always that bad (24 sec for one warm compilation!)
FWIW I see that 54ff80e51f1bc6d11c574548e3f6e4fbf03e3790 [1] is in the list of changes in the diff from Peter's https://github.com/eclipse/openj9/issues/7178#issuecomment-535492114. Internally during Z testing we've found the same change is causing over 2x increase in footprint and massive compilation time increases for Octane benchmarks.
I would focus my attention to that commit in particular. I bet it is the same offender here. See internal Problem Report 142384 for those of you who can access it.
[1] https://github.ibm.com/runtimes/openj9/commit/54ff80e51f1bc6d11c574548e3f6e4fbf03e3790
@hzongaro would be worth getting compilePerformance and inlining vlog for both platforms before and after. That way we can see if the long compiles are inlining a lot of method handles.
@Mesbah-Alam, may I ask you to generate the verbose logs on Windows and ppc64le_linux for builds before and after the relevant changes, with the following options?
-Xjit:verbose={compilePerformance},verbose={inlining},vlog=vlog.txt
I'm unable to submit grinder runs on those builds myself. I'll do the same with the test Filip mentioned above.
@hzongaro - I have sent the vlogs by email as some of the attachments were more than 10 MB and could not be attached here. Please let me know if you need anything else.
Actually, the email bounced back saying attachments are too large as well.
@hzongaro - I've sent you the attachments via slack. Please check.
Thanks, Mesbah @Mesbah-Alam, for collecting that data! The differences are quite pronounced.
Before the relevant changes on Linux for Power, this was a typical result from the verbose log:
Processing vlog Linux-PPC64-LE-before/SharedClasses.SCM23.MultiCL_0/20190927-162117-SharedClasses/results/vlog.txt.20190927.162252.18751
Hotness Count Total(ms) Average(us) Max(us)
AOT_cold 7423 65926921ms 8881us 3960168us
AOT_load 1999 3237327ms 1619us 119632us
AOT_warm 285 3541503ms 12426us 273968us
cold 34643 756692542ms 21842us 297925us
hot 1 616872ms 616872us 616872us
warm 11535 823005056ms 71348us 630607us
Overall 55886 1653020221ms 29578us 3960168us
Maximum: AOT cold jdk/internal/module/SystemModules$default.moduleDescriptors()[Ljava/lang/module/ModuleDescriptor; 3960168
and this was a typical one from after the change
Processing vlog Linux-PPC64-LE-after/SharedClasses.SCM23.MultiCL_0/20190927-162007-SharedClasses/results/vlog.txt.20190927.162137.22615
Hotness Count Total(ms) Average(us) Max(us)
AOT_cold 6686 40796889ms 6101us 4204892us
AOT_load 1541 1486183ms 964us 75401us
AOT_warm 272 4712307ms 17324us 656389us
cold 361 2342435ms 6488us 82409us
warm 1857 3172147990ms 1708211us 4647531us
Overall 10717 3221485804ms 300595us 4647531us
Maximum: warm net/openj9/sc/classes/Test_Init_4566.stringOperations(III)V 4647531
The method with the maximum compile time was usually one of the many net/openj9/sc/classes/Test_Init_*.stringOperations(III)V methods. Those methods typically went from having about 15 methods inlined to about 150 or so, and seemed to require about ten times more memory per compile.
On Windows, the maximum compile-time method was sometimes one of the Test_Init_*.stringOperations(III)V methods, sometimes Test_Init_*.makeString(I), and sometimes Test_Init.selectUppercaseCharacter()C. The following was a typical result from the verbose log on Windows before the change:
Processing vlog Windows-before/SharedClasses.SCM23.MultiCL_0/20190927-162521-SharedClasses/results/vlog.txt.20190927.165449.9988
Hotness Count Total(ms) Average(us) Max(us)
AOT_load 2761 5949575ms 2154us 634748us
cold 27439 1352566894ms 49293us 902195us
profiled_very-hot 2 10210419ms 5105209us 8026508us
scorching 1 10674813ms 10674813us 10674813us
warm 16107 538950423ms 33460us 1288224us
Overall 46310 1918352124ms 41424us 10674813us
Maximum: scorching jdk/internal/loader/BuiltinClassLoader.loadClassOrNull(Ljava/lang/String;Z)Ljava/lang/Class; 10674813
and this was a typical one from after the change
Processing vlog Windows-after/SharedClasses.SCM23.MultiCL_0/20190927-162543-SharedClasses/results/vlog.txt.20190927.164228.14160
Hotness Count Total(ms) Average(us) Max(us)
AOT_load 5309 49508347ms 9325us 8696524us
AOT_warm 3963 238331835ms 60139us 13331566us
cold 426 3929445ms 9224us 271078us
profiled_very-hot 1 2497294ms 2497294us 2497294us
warm 2222 2159991480ms 972093us 37301270us
Overall 11921 2454258401ms 205876us 37301270us
Maximum: warm net/openj9/sc/classes/Test_Init_0168.makeString(I)Ljava/lang/String; 37301270
The total number of inlined methods across all the compiles in any particular run went from roughly 100000-150000 to roughly 270000-400000, while the total number of compiles fell by about 70% to 80%.
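For anyone wanting to reproduce this kind of summary, below is a rough Java sketch of the vlog post-processing; it assumes the "+ (hotness) method ... time=NNNus" line format visible in the excerpts above and is an approximation, not the actual script used here:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough vlog summarizer: groups compile times by hotness level and tracks
// the single longest compilation, mirroring the tables above.
public class VlogSummary {
    public static void main(String[] args) throws IOException {
        Pattern p = Pattern.compile("\\+ \\(([^)]+)\\) (\\S+).* time=(\\d+)us");
        Map<String, long[]> stats = new TreeMap<>(); // hotness -> {count, totalUs, maxUs}
        String maxHotness = "", maxMethod = "";
        long maxUs = 0;
        try (BufferedReader r = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = r.readLine()) != null) {
                Matcher m = p.matcher(line);
                if (!m.find()) continue;
                long us = Long.parseLong(m.group(3));
                long[] s = stats.computeIfAbsent(m.group(1), k -> new long[3]);
                s[0]++; s[1] += us; s[2] = Math.max(s[2], us);
                if (us > maxUs) { maxUs = us; maxHotness = m.group(1); maxMethod = m.group(2); }
            }
        }
        System.out.printf("%-20s %8s %14s %12s %12s%n", "Hotness", "Count", "Total(us)", "Average(us)", "Max(us)");
        for (Map.Entry<String, long[]> e : stats.entrySet()) {
            long[] s = e.getValue();
            System.out.printf("%-20s %8d %14d %12d %12d%n", e.getKey(), s[0], s[1], s[1] / s[0], s[2]);
        }
        System.out.printf("Maximum: %s %s %dus%n", maxHotness, maxMethod, maxUs);
    }
}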
@andrewcraik it's not clear to me if this should be a blocker for the 0.17 release or not. The test that exposed the problem is using -Xaot:forceAoT,count=1, but the increase in inlining and compile time could be affecting default mode as well. Do you know if this has shown up in any other performance testing?
Also weird that only certain platforms seem to be affected.
@pshipton I have not had any reports of such. @piyush286 (he helps monitor OpenJ9 performance at IBM) have you seen anything on any of your monitoring that might be showing JSR292 inlining being a problem? I don't know if it should block 0.17, but I think @hzongaro (with some help from me when needed) should continue treating it as a priority to see if there is anything simple and safe we can do to rein in any true excesses and reduce the chances of any field impact.
@andrewcraik We haven't seen this issue in our JSR292 regression runs but I'll report here if we do.
Related to JSR292 tests,
There are no regressions seen with JRuby related benchmarks.
With the basic tests, of the two string-based tests, testFilterReturnI2String and testFilterReturnString2I:
testFilterReturnString2I - didn't show any difference in scores at all.
testFilterReturnI2String - showed a high drop in score with the 20190913 build, but most of the performance is restored with the later build.
Build Date InvokeDynamic_Basic-default_testFilterReturnI2String
20190418 2864.83 -- Baseline
20190821 2716.35
20190903 3228.39
20190913 8219.72 --- high drop
20190927 3398.04
Taking another look at logs produced with -Xjit:verbose={inlining} on Linux x86_64, I noticed that there is far less cold call inlining happening in the compilations of the various Test_Init*.makeString* methods on that platform than on Linux Power 64 LE or Windows.
I'm getting inlining logs for the two runs on Linux to see if I can spot where and why they diverge.
@hzongaro @andrewcraik Do we have any new information on this issue?
So @hzongaro and @cathyzhyi have been looking at this. From my offline discussions with them, they have not been able to reproduce a 2-hour timeout using the development machines we have here at IBM. @hzongaro has reported that in these tests we are spawning 5 instances of the tests, all running forceAOT,count=1, on a 4-core machine. When he looks at the runtime on bare-metal machines he has access to, we see the runtime double (4 mins to 7 mins, that kind of thing), but we have not been able to reproduce the pathological 2-hour timeout. A compile-time increase with the changes to the 292 inlining is expected per the discussion above. The machines which generated this defect seem to be seeing the issue more regularly. At least one of the machines had 4 cores, so it is possible the machine is overcommitted and thrashing with the increased compilation load, but we have yet to prove this.
Investigation continues. The most surprising thing about the 2-hour runs is that the compilation time does not account for the increase in runtime, so something odd seems to be happening, either precipitated by the inlining changes or possibly related to how the workload is being run in the cloud.
@andrewcraik Andrew, perhaps I had been too focused on Linux on Power in trying to reproduce the problem, and had managed to juxtapose in my mind the increase in the time to run the test that was seen on Windows onto Linux on Power.
The increase in execution time on Power that Peter had reported was about four- to five-fold - going from 13 minutes to about 55 minutes. As Andrew said, the test executes five processes concurrently, all running forceAOT,count=1. It also does that twice, sequentially. So the individual runs on Power that Peter reported on went from about 6.5 minutes to roughly 27 minutes. The increases in time required on Windows with JDK 11 were much more dramatic.
Andrew mentioned offline that the priorities of the compilation threads on Windows are raised. Marius @mpirvu provided me with the -Xjit:compThreadPriority option, and recommended setting a priority of 2. I was able to produce numbers on Windows with and without that option, and this time saw a dramatic difference:
Grinder run 2226 used -Xjit:compThreadPriority=2, and required just over 10 minutes for one run and just over 20 minutes for two sequential runs.
Grinder run 2234 did not specify -Xjit:compThreadPriority, and each sequential run was killed after ten minutes, having completed roughly 10% to 15% of the run - so about a six- to ten-fold degradation.
I will rerun the latter with a build prior to this change.
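For reference, the only difference between the two grinder runs was that one extra JIT option - something along these lines, with the actual harness invocation elided:
java -Xaot:forceAoT,count=1 -Xjit:compThreadPriority=2 ...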
That all makes a bit more sense. I still wonder why the compilation threads can starve the process to that extent - @mpirvu?
@hzongaro Why was 2234 killed after 10 mins when 2226 was allowed to run "just over 10 minutes"?
Why was 2234 killed after 10 mins?
This was from a recent change from Mesbah @Mesbah-Alam to kill the run after ten minutes if it hasn't completed. The test loads classes from a jar using a URLClassLoader, and after every thousand classes it checks whether ten minutes have passed - so the last batch of 1000 classes in the jar might finish being loaded after ten minutes have passed. So 2234 was also killed after just over ten minutes - not after ten minutes precisely. Sorry for the confusion.
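A minimal sketch of that pattern, with hypothetical class and jar names (the real harness differs):

import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch of the test's load loop: classes are loaded from a jar
// via a URLClassLoader, and the elapsed time is only checked after every
// 1000 classes, so a run can overshoot the ten-minute limit by up to one batch.
public class LoadLoopSketch {
    static final long LIMIT_MS = 10 * 60 * 1000;

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        try (URLClassLoader loader =
                new URLClassLoader(new URL[] { new URL("file:classes.jar") })) {
            for (int i = 0; i < 10000; i++) {
                // Class names are placeholders for the generated test classes.
                loader.loadClass("net.openj9.sc.classes.TestClass_" + i);
                if (i % 1000 == 999
                        && System.currentTimeMillis() - start > LIMIT_MS) {
                    System.out.println("Ten-minute limit reached; stopping early.");
                    break;
                }
            }
        }
    }
}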
We didn't deliver that 10min restriction on the test run, so I assume you are using this for testing purposes.
https://github.com/eclipse/openj9-systemtest/pull/99
I still wonder why the compilation threads can starve the process to that extent
When thread priorities are obeyed, if a compilation thread can run it will preempt any application thread.
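As a toy illustration (not OpenJ9 code): on an OS that strictly honors thread priorities, a higher-priority busy thread can monopolize a core whenever it is runnable, which the normal-priority worker below would observe as far fewer iterations:

// Toy illustration (not OpenJ9 code): on a single busy core, a strictly
// priority-obeying scheduler lets the MAX_PRIORITY spinner preempt the
// normal-priority worker whenever the spinner is runnable.
public class PrioritySketch {
    public static void main(String[] args) throws InterruptedException {
        Thread spinner = new Thread(() -> { while (true) { } }, "high-prio-spinner");
        spinner.setPriority(Thread.MAX_PRIORITY);
        spinner.setDaemon(true); // let the JVM exit once the worker finishes

        Thread worker = new Thread(() -> {
            long iterations = 0;
            long end = System.currentTimeMillis() + 5000;
            while (System.currentTimeMillis() < end) iterations++;
            System.out.println("worker iterations: " + iterations);
        }, "normal-prio-worker");

        spinner.start();
        worker.start();
        worker.join(); // on a saturated core, iterations will be far lower
    }
}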
Andrew @andrewcraik asked me off-line to check whether the difference in compile time accounts for the entire difference in execution time for these tests. The following are the results from grinder runs on Windows; I added up the elapsed times for each of the ten runs (two sequential executions of five concurrent runs).
Before
After
So there's roughly 10000s of difference unaccounted for in total.
We're about out of time for making more changes for the 0.17 release, unless the release is delayed. The preferred approach is to complete the release on schedule to pick up the OpenJDK security fixes. Adding another release to update 0.17 later is open for discussion.
Testing of the current content will occur on the weekend (Oct 12). We expect to get OpenJDK changes Tues afternoon (Oct 15), and as long as these don't cause any problem the 0.17 release will be finalized.
@andrewcraik @DanHeidinga
@andrewcraik @mpirvu Is this a stop ship issue? Can you pull together a case on why (or not) this should block the release?
@andrewcraik We were talking yesterday about a possible solution to this problem. Are there unforeseen roadblocks?
So I don't think my theory holds water - we are still trying to figure out what is going on. I still think the Windows case is an extreme stress case - you have 5 JVMs, each with 3 compile threads, loading and compiling methods in 10000 classes with count=1. With the elevated compile thread priority, the system starves the application on Windows. Lowering the priority lets the test pass. We're trying to understand why the system is so stressed.
All in all we don't think this is a stop ship issue.
Changes to inlining method handle chains have increased the compilation overhead substantially, and the test on Windows times out because the compilation threads, having higher priority, don't give the application threads many chances to run (5 JVMs * 3 compThreads = 15 comp threads working on 4 cores). From the vlogs I looked at, the total time on CPU by compilation threads jumped from 2822 sec to 6509 sec. The biggest increase comes from warm compilations which, even though they are substantially fewer now, are way more expensive (20x). However, this level of stress is not common in practice (count=1 creates many compilations).
We looked at the perf results from TrafficLights and neither JSR292 specific benchmarks nor Daytrader3 on Java11 see any negative side effects.
Are there any objections to merging the system test change that limits the test time to 10 minutes, regardless of how many classes get loaded in that time?
https://github.com/eclipse/openj9-systemtest/pull/99
And the follow-up question: does the system test change resolve this issue, or should it stay open for more investigation or further changes?
@andrewcraik @mpirvu fyi the preceding two comments.
Without further input I'll go ahead with the test change today so we are using it for the weekend testing.
If I understand correctly, the system test change will put an upper bound on the testing time, but the number of classes loaded will not be a criterion for failure. I am ok with this change.
@andrewcraik mentioned about future work that will attempt to reduce compilation time. Whether or not this issue should stay open, I'll let Andrew decide.
https://github.com/eclipse/openj9-systemtest/pull/99 is merged. Removing the blocker label and moving this issue to the next milestone as a reminder for any future work.
Additional perf improvement is open as https://github.com/eclipse/openj9/pull/7785