OpenJ9: SharedClasses.SCM23.MultiCL_0 times out on Windows

Created on 21 Sep 2019 · 62 comments · Source: eclipse/openj9

win2012r2-x86-4 got a test timeout from running slow; pls reboot it. Or maybe just reboot all the Windows machines except win2012r2-x86-3, which was rebooted yesterday.
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/143/

Label: jit

All 62 comments

@jdekonin @AdamBrousseau

The sanitize job reboots Windows.

Ok, I'll just keep an eye out for more problems.

Problem happened again on win2012r2-x86-2, please take a look at the machine to see if there is anything left running.
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/144

There doesn't appear to be anything Java-related running on it other than the Jenkins agent.

Latest timeout is on win2012r2-x86-3

@Mesbah-Alam did something change with the SharedClasses.SCM23.MultiCL_0 test? It's been consistently running for over two hours and timing out on Windows for the past 3-4 days.

Same for this machine. Appears to be okay. I rebooted this one too since it was idle in Jenkins.

did something change with the SharedClasses.SCM23.MultiCL_0 test? It's been consistently running for over two hours and timing out on Windows for the past 3-4 days.

Hi @pshipton,

The change made recently in https://github.com/eclipse/openj9-systemtest/pull/98/files reduces the load in the MultiThreadMultiCL variation of this test. No change was made to the MultiCL variation - which is what this issue is about.

If this test is consistently timing out on Windows, we need to reduce the load on it.

Or, rethinking, did the reboot solve the issue? Because the test has not changed in the recent past - so it should not start to time out all of a sudden due to any test issue.

@Mesbah-Alam I don't think the reboot helped. I don't know why the test was timing out, but I think it always runs slow even when it doesn't timeout. Pls check other platforms as well. If the test is running for over an hour across multiple platforms I think it needs to be re-tuned.

If the test is running for over an hour across multiple platforms I think it needs to be re-tuned.

The timeout for this test is set to 2h, so it looks like the workload of this test needs to be reduced.

Will check other platforms. Thanks.

IGNORE THIS, IT IS INCORRECT
This test used to take 10 min; it got slow between

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/142/
OpenJ9: 535e903
OMR: 640f1df
OpenJDK: 8bf2a81

================================================
timestamp: 20190920-011107
repo dir: /cygdrive/c/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests
git repo: 
  Fetch URL: https://github.com/AdoptOpenJDK/openjdk-tests.git
sha:
c91c6009de728a74cdfca10218929c2b1706c144
================================================
timestamp: 20190920-011335
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/stf
git repo: 
  Fetch URL: https://github.com/AdoptOpenJDK/stf.git
sha:
5b2abc832159739899df0d2153b86ea5eb823564
================================================
timestamp: 20190920-011337
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/openjdk-systemtest
git repo: 
  Fetch URL: https://github.com/AdoptOpenJDK/openjdk-systemtest.git
sha:
7de3fbe8a7ef85954645f35bef04906990f519b2
================================================
timestamp: 20190920-011338
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/openj9-systemtest
git repo: 
  Fetch URL: https://github.com/eclipse/openj9-systemtest.git
sha:
4facf81fff1eb0897bc764839f0e6d62cb54571b

and

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/143/
OpenJ9: 82609ef
OMR: 20db4fb
OpenJDK: 502ab02

================================================
timestamp: 20190920-220322
repo dir: /cygdrive/c/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests
git repo: 
  Fetch URL: https://github.com/AdoptOpenJDK/openjdk-tests.git
sha:
7ced48a96d1d4f63f4f5d8a6a0f3dee832f1de0c
================================================
timestamp: 20190920-220550
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/stf
git repo: 
  Fetch URL: https://github.com/AdoptOpenJDK/stf.git
sha:
5b2abc832159739899df0d2153b86ea5eb823564
================================================
timestamp: 20190920-220553
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/openjdk-systemtest
git repo: 
  Fetch URL: https://github.com/AdoptOpenJDK/openjdk-systemtest.git
sha:
6fad60418aa33671d59de6d25bdce854d7c320ba
================================================
timestamp: 20190920-220554
repo dir: C:/Users/jenkins/workspace/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/openjdk-tests//system/openj9-systemtest
git repo: 
  Fetch URL: https://github.com/eclipse/openj9-systemtest.git
sha:
79a666d92f7b3e7a11491c7436e07a96e23f70fa

IGNORE THIS, IT IS INCORRECT
https://github.com/eclipse/openj9/compare/535e903...82609ef
https://github.com/eclipse/openj9-omr/compare/640f1df...20db4fb
https://github.com/ibmruntimes/openj9-openjdk-jdk11/compare/8bf2a81...502ab02
https://github.com/AdoptOpenJDK/openjdk-tests/compare/c91c600...7ced48a
https://github.com/AdoptOpenJDK/openjdk-systemtest/compare/7de3fbe...6fad604
https://github.com/eclipse/openj9-systemtest/compare/4facf81...79a666d

IGNORE THIS, IT IS INCORRECT
I suspect #7149. Trying a grinder with the feature disabled. https://ci.eclipse.org/openj9/view/Test/job/Grinder/519/

I see the -XX:-ShareUnsafeClasses option being used in the grinder, but it's still running slow.
@doomerxe pls create a JVM with the change reverted and run a grinder to see if the problem still occurs.

08:24:50  To rebuild the failed test in a jenkins job, copy the following link and fill out the <Jenkins URL> and <FAILED test target>:
08:24:50  <Jenkins URL>/parambuild/?JDK_VERSION=11&JDK_IMPL=openj9&BUILD_LIST=system&JenkinsFile=openjdk_x86-64_windows&TARGET=<FAILED test target>
08:24:50  
08:24:50  For example, to rebuild the failed tests in <Jenkins URL>=https://ci.adoptopenjdk.net/job/Grinder, use the following links:
08:24:50  https://ci.adoptopenjdk.net/job/Grinder/parambuild/?JDK_VERSION=11&JDK_IMPL=openj9&BUILD_LIST=system&JenkinsFile=openjdk_x86-64_windows&TARGET=SharedClasses.SCM23.MultiCL_0

@hangshao0 @DanHeidinga
@mpirvu fyi in case it's the Port j9methodserver change
IGNORE THIS, IT IS INCORRECT

@doomerxe Shoot, ignore the above, I got the fast / slow runs incorrect. Trying again

fast
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/140
OpenJ9: c9d84df
OMR: 54fb162
OpenJDK: 8bf2a81

slow
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_x86-64_windows_Nightly/142/
OpenJ9: 535e903
OMR: 640f1df
OpenJDK: 8bf2a81

https://github.com/eclipse/openj9/compare/c9d84df...535e903
https://github.com/eclipse/openj9-omr/compare/54fb162...640f1df

There was a build in between, but it failed. Maybe another platform will show a similar slowdown.

Between the following builds, SharedClasses.SCM23.MultiCL_0 on ppc64le went from 13 min to 55 min.
The systemtest material didn't change.
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_ppc64le_linux_Nightly/147
OpenJ9: c9d84df
OMR: 54fb162
OpenJDK: 8bf2a81

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_ppc64le_linux_Nightly/148
OpenJ9: b0e9913
OMR: ffb0d98
OpenJDK: 8bf2a81

https://github.com/eclipse/openj9/compare/c9d84df...b0e9913
https://github.com/eclipse/openj9-omr/compare/54fb162...ffb0d98

I think it has to be one of the JIT changes.
@andrewcraik @fjeremic

JDK8 runs don't seem affected, only JDK11+

I suspect it is the changes to the method handle inlining - JDK11 has that and JDK8 doesn't have that. The improvements will not be reverted as they are providing orders of magnitude performance improvement in general and it is the strategic direction for the optimizer. @cathyzhyi is working on further reducing the overhead of the inlining to improve performance in the next release. A similar timeout exists on a few handle intensive tests on z.

FYI @liqunl. @hzongaro can we look at where the compile-time increase is happening to see if we can do anything simple to help cut the pathlength.

@pshipton what mode are the tests running in - is there a count=0 or disableAsyncCompilation in play?

is there a count=0 or disableAsyncCompilation in play

SharedClasses.SCM23.MultiCL uses -Xaot:forceAoT,count=1
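
(For context: count=1 makes the JIT compile each method after a single invocation, and forceAoT steers those compilations to AOT wherever possible, so class loading alone drives a heavy compilation load. Below is a minimal Java sketch of launching a child JVM with that option mix, the way a harness might - the cache name "scm23", classes.jar and TestRunner are illustrative names, not the actual STF setup.)

import java.io.IOException;

// Hypothetical launcher illustrating the option mix the test runs with.
public class LaunchSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "java",
                "-Xshareclasses:name=scm23",  // shared class cache under test (name is made up)
                "-Xaot:forceAoT,count=1",     // AOT-compile each method after one invocation
                "-cp", "classes.jar",
                "TestRunner")
            .inheritIO()                      // surface the child JVM's output
            .start();
        System.exit(p.waitFor());
    }
}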

hmm - so I'm not sure if the problem is the compile time or the fact that with the aggressive handle inlining we won't be able to do AOT. @mpirvu we might need your help to figure out if the inliner changes are affecting AOT or not. There is a plan to support method handles with AOT but that is still being designed.

We can get a sense of compilation time from verbose logs: -Xjit:verbose={compilePerformance},vlog=vlog.txt

@Mesbah-Alam is there a setup that can be accessed to do investigation? Would you be able to get the log mentioned by @mpirvu above?

Would you be able to get the log

@andrewcraik - I am running a grinder with -Xjit:verbose={compilePerformance},vlog=vlog.txt. Will share the log here once it finishes.

The following attachments contain the vlogs:

Generated by running the test on Windows

Generated by running the test on Linux

@andrewcraik @mpirvu

I looked at the verbose logs on Windows and there are some pretty large compilation times.
Example:

vlog.txt.20190926.111756.2632
        Samples MAX(ms) TOTAL(ms)       AVG(usec)       MIN(usec)
Total   12103   23799.8 1921149.8       158733          1
aotl    4079    5033.9  26443.8         6483            1
cold    389     924.6   6539.0          16810           502
aotw    5256    18279.7 314258.4         59790           582
warm    2298    23799.8 1570534.7      683435          717
pvhot   1       2077.3  2077.3          2077320         2077320
scorc   1       601.8   601.8           601772          601772
jni     79      619.1   694.8           8795            584
MAXLINE=+ (warm) net/openj9/sc/classes/Test_Init_0200.makeString(I)Ljava/lang/String; @ 0x0000000040FB2980-0x0000000040FB68BF OrdinaryMethod - Q_SZ=431 Q_SZI=431 QW=3282 j9m=000000000020D0C8 bcsz=82 OSR time=23799795us mem=[region=32128 system=32768]KB compThread=0 CpuLoad=398%(99%avg) JvmCpu=0%
        Samples MAX(ms) TOTAL(ms)       AVG(usec)       MIN(usec)
Total   12112   22830.5 1999275.8       165066          89
aotl    4589    1648.9  15966.2         3479            89
cold    410     531.9   6975.5          17014           496
aotw    4777    12036.7 219629.1        45976           265
warm    2256    22830.5 1754560.8       777731          699
pvhot   1       2086.9  2086.9          2086927         2086927
jni     79      3.0     57.3            726             567
MAXLINE=+ (warm) net/openj9/sc/classes/Test_Init_0318.makeString(I)Ljava/lang/String; @ 0x0000000041105C80-0x00000000411099D1 OrdinaryMethod - Q_SZ=300 Q_SZI=252 QW=2376 j9m=00000000002F78C8 bcsz=82 OSR time=22830462us mem=[region=31680 system=32768]KB compThread=1 CpuLoad=400%(100%avg) JvmCpu=0%

Total compilation time on Linux is about half (AOT warm compilations take less time on average):

        Samples MAX(ms) TOTAL(ms)       AVG(usec)       MIN(usec)
Total   10945   1996.4  909153.1        83066           23
aotl    1440    214.4   2537.1          1762            23
cold    1404    323.8   10573.0         7531            181
aotw    6277    554.4   93173.2         14844           205
warm    1739    1996.4  802518.4        461483          1463
jni     85      249.7   351.4           4134            159
MAXLINE=+ (warm) net/openj9/sc/classes/TestClass_5088.stringOperations(III)V @ 00007F1A280898C8-00007F1A2808AB04 OrdinaryMethod - Q_SZ=3207 Q_SZI=3206 QW=20312 j9m=0000000001316148 bcsz=241 time=1996394us mem=[region=33600 system=49152]KB compThread=1 CpuLoad=391%(97%avg) JvmCpu=73%
        Samples MAX(ms) TOTAL(ms)       AVG(usec)       MIN(usec)
Total   11478   6544.4  968261.5        84358           22
aotl    1515    210.2   1760.4          1162            22
cold    1559    484.8   11537.3         7400            182
aotw    6509    6544.4  99417.8         15274           206
warm    1808    1517.9  851852.7        471157          1583
hot     1       3658.5  3658.5          3658467         3658467
jni     86      5.8     34.7            404             161
MAXLINE=+ (AOT warm) jdk/internal/module/SystemModules$default.moduleDescriptors()[Ljava/lang/module/ModuleDescriptor; @ 00007FFA8D961240-00007FFA8D9E3B65 OrdinaryMethod - Q_SZ=1330 Q_SZI=1330 QW=9353 j9m=000000000081F860 bcsz=19723 GCR time=6544374us mem=[region=74304 system=81920]KB compThread=2 CpuLoad=401%(100%avg) JvmCpu=70%
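
To pick out outliers like these, here is a minimal sketch of a filter over the time=<n>us field that appears in the MAXLINE entries above (the format assumption and the 5-second threshold are mine, not part of the vlog tooling):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Print vlog lines whose compile time exceeds a threshold, assuming each
// compilation entry carries a "time=<n>us" field as in the samples above.
public class SlowCompiles {
    private static final Pattern TIME = Pattern.compile("\\btime=(\\d+)us\\b");

    public static void main(String[] args) throws IOException {
        long thresholdUs = 5_000_000L; // flag anything over 5 seconds
        try (Stream<String> lines = Files.lines(Path.of(args[0]))) {
            lines.forEach(line -> {
                Matcher m = TIME.matcher(line);
                if (m.find() && Long.parseLong(m.group(1)) > thresholdUs) {
                    System.out.println(line);
                }
            });
        }
    }
}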

Hmm - now that is odd - why is Windows so much slower? Worse JIT compiler optimization at build time or something? It's unlikely that the difference would be the inliner change if it is platform specific, but I've been proven wrong before.

Note plinux seemed to have been affected as well https://github.com/eclipse/openj9/issues/7178#issuecomment-535484726
The other platforms seem fine according to https://github.com/eclipse/openj9-systemtest/pull/99#issuecomment-535156261

Andrew @andrewcraik, Marius @mpirvu, do you think it would be worth the time to get -Xjit:verbose={compilePerformance} logs for Windows before the degradation and for ppc64le_linux both before and after?

Yes, it makes sense to see whether compilation time increased or whether it was always that bad (24 sec for one warm compilation!)

FWIW I see that 54ff80e51f1bc6d11c574548e3f6e4fbf03e3790 [1] is in the list of changes in the diff from Peter's https://github.com/eclipse/openj9/issues/7178#issuecomment-535492114. Internally during Z testing we've found the same change is causing an over 2x increase in footprint and massive compilation time increases for Octane benchmarks.

I would focus my attention to that commit in particular. I bet it is the same offender here. See internal Problem Report 142384 for those of you who can access it.

[1] https://github.ibm.com/runtimes/openj9/commit/54ff80e51f1bc6d11c574548e3f6e4fbf03e3790

@hzongaro it would be worth getting compilePerformance and inlining vlogs for both platforms before and after. That way we can see if the long compiles are inlining a lot of method handles.

@Mesbah-Alam, may I ask you to generate the verbose logs on Windows and ppc64le_linux for builds before and after the relevant changes, with the following options?

-Xjit:verbose={compilePerformance},verbose={inlining},vlog=vlog.txt

I'm unable to submit grinder runs on those builds myself. I'll do the same with the test Filip mentioned above.

@hzongaro - I have sent the vlogs in email as some of the attachments were more than 10mb and could not be attached here. Please let me know if you need anything else.

Actually, the email bounced back saying attachments are too large as well.

@hzongaro - I've sent you the attachments via slack. Please check.

Thanks, Mesbah @Mesbah-Alam, for collecting that data! The differences are quite pronounced.

Before the relevant changes on Linux for Power, this was a typical result from the verbose log:

Processing vlog Linux-PPC64-LE-before/SharedClasses.SCM23.MultiCL_0/20190927-162117-SharedClasses/results/vlog.txt.20190927.162252.18751
Hotness   Count  Total(ms)     Average(us)  Max(us)
AOT_cold  7423   65926921ms    8881us       3960168us
AOT_load  1999   3237327ms     1619us       119632us
AOT_warm  285    3541503ms     12426us      273968us
cold      34643  756692542ms   21842us      297925us
hot       1      616872ms      616872us     616872us
warm      11535  823005056ms   71348us      630607us
Overall   55886  1653020221ms  29578us      3960168us
Maximum:  AOT cold  jdk/internal/module/SystemModules$default.moduleDescriptors()[Ljava/lang/module/ModuleDescriptor;  3960168

and this was a typical one from after the change

Processing vlog Linux-PPC64-LE-after/SharedClasses.SCM23.MultiCL_0/20190927-162007-SharedClasses/results/vlog.txt.20190927.162137.22615
Hotness   Count  Total(ms)     Average(us)  Max(us)
AOT_cold  6686   40796889ms    6101us       4204892us
AOT_load  1541   1486183ms     964us        75401us
AOT_warm  272    4712307ms     17324us      656389us
cold      361    2342435ms     6488us       82409us
warm      1857   3172147990ms  1708211us    4647531us
Overall   10717  3221485804ms  300595us     4647531us
Maximum:  warm  net/openj9/sc/classes/Test_Init_4566.stringOperations(III)V  4647531

The maximum compile-time was usually one of the many net/openj9/sc/classes/Test_Init_*.stringOperations(III)V methods. Those methods typically went from having about 15 methods inlined to about 150 or so inlined, and seemed to require about ten times more memory per compile.

On Windows, the maximum compile-time method was sometimes one of the Test_Init_*.stringOperations(III)V methods, sometimes Test_Init_*.makeString(I), and sometimes Test_Init.selectUppercaseCharacter()C. The following was a typical result from the verbose log on Windows before the change:

Processing vlog Windows-before/SharedClasses.SCM23.MultiCL_0/20190927-162521-SharedClasses/results/vlog.txt.20190927.165449.9988
Hotness            Count  Total(ms)     Average(us)  Max(us)
AOT_load           2761   5949575ms     2154us       634748us
cold               27439  1352566894ms  49293us      902195us
profiled_very-hot  2      10210419ms    5105209us    8026508us
scorching          1      10674813ms    10674813us   10674813us
warm               16107  538950423ms   33460us      1288224us
Overall            46310  1918352124ms  41424us      10674813us
Maximum:  scorching  jdk/internal/loader/BuiltinClassLoader.loadClassOrNull(Ljava/lang/String;Z)Ljava/lang/Class;  10674813

and this was a typical one from after the change

Processing vlog Windows-after/SharedClasses.SCM23.MultiCL_0/20190927-162543-SharedClasses/results/vlog.txt.20190927.164228.14160
Hotness            Count  Total(ms)     Average(us)  Max(us)
AOT_load           5309   49508347ms    9325us       8696524us
AOT_warm           3963   238331835ms   60139us      13331566us
cold               426    3929445ms     9224us       271078us
profiled_very-hot  1      2497294ms     2497294us    2497294us
warm               2222   2159991480ms  972093us     37301270us
Overall            11921  2454258401ms  205876us     37301270us
Maximum:  warm  net/openj9/sc/classes/Test_Init_0168.makeString(I)Ljava/lang/String;  37301270

The total number of inlined methods across all the compiles in any particular run went from roughly 100000-150000 to roughly 270000-400000, while the total number of compiles fell by about 70% to 80%.

@andrewcraik it's not clear to me if this should be a blocker for the 0.17 release or not. The test that exposed the problem is using -Xaot:forceAoT,count=1, but the increase in inlining and compile time could be affecting default mode as well. Do you know if this has shown up in any other performance testing?

Also weird that only certain platforms seem to be affected.

@pshipton I have not had any reports of such. @piyush286 (he helps monitor OpenJ9 performance at IBM) have you seen anything in any of your monitoring that might be showing JSR292 inlining being a problem? I don't know if it should block 0.17, but I think @hzongaro (with some help from me when needed) should continue treating it as a priority to see if there is anything simple and safe we can do to rein in any true excesses to reduce the chances of any field impact.

@andrewcraik We haven't seen this issue in our JSR292 regression runs but I'll report here if we do.

Related to the JSR292 tests:
There are no regressions seen with the JRuby-related benchmarks.
Of the two string-based basic tests, testFilterReturnI2String and testFilterReturnString2I:
testFilterReturnString2I - didn't show any difference in scores at all.
testFilterReturnI2String - has a high drop in score with the 20190913 build, but most of the perf is restored with the later build.

Build Date  InvokeDynamic_Basic-default_testFilterReturnI2String

20190418    2864.83  -- Baseline
20190821    2716.35
20190903    3228.39
20190913    8219.72  -- high drop
20190927    3398.04

Taking another look at logs produced with -Xjit:verbose={inlining} on Linux x86_64, I noticed that there is far less cold call inlining happening in the compilations of the various Test_Init*.makeString* methods on that platform than on Linux Power 64 LE or Windows.

I'm getting inlining logs for the two runs on Linux to see if I can spot where and why they diverge.

@hzongaro @andrewcraik Do we have any new information on this issue?

So @hzongaro and @cathyzhyi have been looking at this. From my offline discussions with them, they have not been able to reproduce a 2 hour timeout using the development machines we have here at IBM. @hzongaro has reported that in these tests we are spawning 5 instances of the test, all running forceAOT,count=1, on a 4 core machine. When he looks at the runtime on bare metal machines he has access to, we see the runtime double (4 mins to 7 mins kind of thing), but we have not been able to reproduce the pathological 2 hour timeout. A compile-time increase with the changes to the 292 inlining is expected per the above discussion. The machines which generated this defect seem to be seeing the issue more regularly. At least one of the machines had 4 cores, so it is possible the machine is overcommitted and thrashing with the increased compilation load, but we have yet to prove this.

Investigation continues. The most surprising thing on the 2 hour runs is that the compilation time does not account for the increase in runtime so something odd seems to be happening either precipitated by the inlining changes or possibly related to how the workload is being run in the cloud.

@andrewcraik Andrew, perhaps I had been too focused on Linux on Power in trying to reproduce the problem, and had managed to juxtapose in my mind the increase in the time to run the test that was seen on Windows onto Linux on Power.

The increase in execution time on Power that Peter had reported was about four- to five-fold - going from 13 minutes to about 55 minutes. As Andrew said, the test executes five processes concurrently, all running forceAOT,count=1. It also does that twice, sequentially. So the individual runs on Power that Peter reported on went from about 6.5 minutes to roughly 27 minutes. The increases in time required on Windows with JDK 11 were much more dramatic.

Andrew mentioned offline that the priorities of the compilation threads on Windows are raised. Marius @mpirvu provided me with the -Xjit:compThreadPriority option, and recommended setting a priority of 2. I was able to produce numbers on Windows with and without that option, and this time saw a dramatic difference:

Grinder run 2226 used -Xjit:compThreadPriority=2 and required just over 10 minutes for one run, just over 20 minutes for two sequential runs.

Grinder run 2234 did not specify -Xjit:compThreadPriority, and each sequential run was killed after ten minutes, having completed roughly 10% to 15% of the run, so about a six to ten-fold degradation.

I will rerun the latter with a build prior to this change.

That all makes a bit more sense. I still wonder why the compilation threads can starve the process to that extent - @mpirvu?

@hzongaro Why was 2234 killed after 10 mins when 2226 was allowed to run "just over 10 minutes"?

Why was 2234 killed after 10 mins?

This was from a recent change from Mesbah @Mesbah-Alam to kill the run after ten minutes if it hasn't completed. The test loads classes from a jar using a URLClassLoader, and after every thousand classes it checks whether ten minutes have passed - so the last batch of 1000 classes in the jar might finish being loaded after ten minutes have passed. So 2234 was also killed after just over ten minutes - not after ten minutes precisely. Sorry for the confusion.
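
For reference, the check described above looks roughly like this (a minimal Java sketch of the described behavior; the class names, jar path, class count and batch size are illustrative, not the actual openj9-systemtest source):

import java.net.URL;
import java.net.URLClassLoader;

public class LoadLoopSketch {
    public static void main(String[] args) throws Exception {
        long deadline = System.currentTimeMillis() + 10 * 60 * 1000L; // 10-minute cap
        try (URLClassLoader loader =
                new URLClassLoader(new URL[] { new URL("file:classes.jar") })) {
            for (int i = 0; i < 10_000; i++) {
                loader.loadClass("net.openj9.sc.classes.TestClass_" + i);
                // The deadline is only checked once per 1000 classes, so the
                // run can overshoot ten minutes by up to one batch.
                if ((i + 1) % 1000 == 0 && System.currentTimeMillis() > deadline) {
                    System.out.println("Ten-minute limit reached; stopping early");
                    break;
                }
            }
        }
    }
}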

We didn't deliver that 10min restriction on the test run, so I assume you are using this for testing purposes.
https://github.com/eclipse/openj9-systemtest/pull/99

I still wonder why the compilation threads can starve the process to that extent

When thread priorities are obeyed, if a compilation thread can run it will preempt any application thread.

Andrew @andrewcraik asked me off-line to check whether the difference in compile-time accounts for the entire difference in execution time for these tests. Following are the results from grinder runs on Windows. I added up the elapsed time for each of the ten runs (two sequential executions of five concurrent runs):

• Before
  • Accumulated run-time: 7701s
  • Total time spent in compile threads: 2403s
• After
  • Accumulated run-time: 68564s
  • Total time spent in compile threads: 53519s

So there's roughly 10000s of difference unaccounted for in total (non-compile time went from 7701 - 2403 = 5298s before to 68564 - 53519 = 15045s after).

We're about out of time for making more changes for the 0.17 release, unless the release is delayed. The preferred approach is to complete the release on schedule to pick up the OpenJDK security fixes. Adding another release to update 0.17 later is open for discussion.

Testing of the current content will occur on the weekend (Oct 12). We expect to get OpenJDK changes Tues afternoon (Oct 15), and as long as these don't cause any problem the 0.17 release will be finalized.

@andrewcraik @DanHeidinga

@andrewcraik @mpirvu Is this a stop ship issue? Can you pull together a case on why (or not) this should block the release?

@andrewcraik We were talking yesterday about a possible solution to this problem. Are there unforeseen roadblocks?

So I don't think my theory holds water - we are still trying to figure out what is going on. I still think the Windows case is an extreme stress case - you have 5 JVMs, all with 3 compile threads, loading and compiling methods in 10000 classes with count=1. With the elevated compile thread priority the system starves the application on Windows. Lowering the priority lets the test pass. We're trying to understand why the system is so stressed.

All in all we don't think this is a stop ship issue.
Changes to inlining method handle chains have increased the compilation overhead substantially, and the test on Windows times out because the compilation threads, having higher priority, don't give the application threads many chances to run (5 JVMs * 3 compThreads = 15 comp threads working on 4 cores). From the vlogs I looked at, the total time on CPU by compilation threads jumped from 2822 sec to 6509 sec. The biggest increase comes from warm compilations which, even though they are substantially fewer now, are way more expensive (20x). However, this level of stress is not common in practice (count=1 creates many compilations).
We looked at the perf results from TrafficLights and neither JSR292 specific benchmarks nor Daytrader3 on Java11 see any negative side effects.

Are there any objections to merging the system test change that limits the test time to 10 minutes, regardless of how many classes get loaded in that time?
https://github.com/eclipse/openj9-systemtest/pull/99

And the follow-up question: does the system test change resolve this issue, or should it stay open for more investigation or further changes?

@andrewcraik @mpirvu fyi the preceding two comments.
Without further input I'll go ahead with the test change today so we are using it for the weekend testing.

If I understand correctly, the system test change will put an upper bound on the testing time, but the number of classes loaded will not be a criterion for failure. I am ok with this change.

@andrewcraik mentioned future work that will attempt to reduce compilation time. Whether or not this issue should stay open, I'll let Andrew decide.

https://github.com/eclipse/openj9-systemtest/pull/99 is merged. Removing the blocker label and moving this issue to the next milestone as a reminder for any future work.

An additional perf improvement is open as https://github.com/eclipse/openj9/pull/7785
