Openj9: Strange performance issues with OpenJ9 in this jmh bench...

Created on 17 Sep 2017 · 23Comments · Source: eclipse/openj9

I'm creating a jmh version of the benchmarks of The computer language benchmarks game and I decided to try it out comparing OpenJ9 to Zulu9 from azul. Zulu9's performance is around 2600ms/op on my system, but OpenJ9 shows some strange performance issues where it slows down with every successive warmup iteration:

Zulu9 - sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvmusr/lib/jvm/zulu-9/bin/java"
[info] # Run progress: 0.00% complete, ETA 00:05:40
[info] # Fork: 1 of 1
[info] # Warmup Iteration   1: 2630.687 ms/op
[info] # Warmup Iteration   2: 2780.907 ms/op
[info] # Warmup Iteration   3: 2630.115 ms/op
[info] # Warmup Iteration   4: 2587.016 ms/op
[info] # Warmup Iteration   5: 2504.948 ms/op
[info] # Warmup Iteration   6: 2566.648 ms/op
[info] # Warmup Iteration   7: 2537.053 ms/op
[info] # Warmup Iteration   8: 2500.211 ms/op
[info] # Warmup Iteration   9: 2576.959 ms/op
[info] # Warmup Iteration  10: 2517.836 ms/op
[info] # Warmup Iteration  11: 2514.588 ms/op
[info] # Warmup Iteration  12: 2508.471 ms/op
[info] # Warmup Iteration  13: 2548.678 ms/op
[info] # Warmup Iteration  14: 2529.385 ms/op
[info] # Warmup Iteration  15: 2508.744 ms/op
[info] # Warmup Iteration  16: 2475.521 ms/op
[info] # Warmup Iteration  17: 2518.160 ms/op
[info] # Warmup Iteration  18: 2572.912 ms/op
[info] # Warmup Iteration  19: 2605.698 ms/op
[info] # Warmup Iteration  20: 2556.147 ms/op


OpenJ9 - sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvmhome/mhammons/jdk-9+181/bin/java"
[info] # Run progress: 0.00% complete, ETA 00:05:40
[info] # Fork: 1 of 1
[info] # Warmup Iteration   1: 36902.857 ms/op
[info] # Warmup Iteration   2: 41640.554 ms/op
[info] # Warmup Iteration   3: 49802.187 ms/op
[info] # Warmup Iteration   4: 59686.587 ms/op
[info] # Warmup Iteration   5: 69178.086 ms/op
[info] # Warmup Iteration   6: 78664.955 ms/op
...

As you can see, OpenJ9 starts off markedly slower than Zulu9. But what's more concerning is that instead of improving in performance during the warmup phase as is to be expected, the subsequent warmup iterations grow slower and slower on OpenJ9.

The benches were compiled with openjdk 8, and run with a forked jvm by the jmh harness. You can invoke the benches and indicate the jvm to be used to run them with the following command:
sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvm path/to/jvm"

JVM arguments can be supplied with an added -jvmArgs argument. I did this with -Xjit:verbose to test that jitting was occurring, and it was, so this slowdown is puzzling. Another thing I noted while running this bench is that although it is designed to use as much threading as the system's cpus will allow for, half or less of my cpu was unused with OpenJ9.

perf

Source

markehammons

Most helpful comment

That's fine. gencon seems like an interesting GC. It's outperforming G1GC for java 8 at 512/512/384 so it seems to run better with less memory usage than a lot of the openjdk garbage collectors. The JIT for openJ9 seems to perform better than openjdk in some of my microbenchmarks too. I'm looking forward to it working better with more software so I can test it more with real world software...

markehammons on 18 Sep 2017

👍9

All 23 comments

The performance issue originates from the GC. Massive slow downs (12x to 20x) are seen with gencon [default GC policy]. Performance improves with other GC policies: balanced, metronome, optavgpause and optthruput. With metronome, the performance issue is resolved.

Warmup iterations with different GC policies + CPU profiles:

-Xgcpolicy:gencon

  [info] # Warmup Iteration   1: 32667.888 ms/op
  [info] # Warmup Iteration   2: 35529.382 ms/op
  [info] # Warmup Iteration   3: 40413.470 ms/op

  CPU profile:
  28.65%  libj9gc29.so       [.] MM_Scavenger::incrementalScanCacheBySlot
   8.68%  libj9gc29.so       [.] MM_MarkingScheme::completeScan
   6.78%  libjclse9_29.so    [.] gcEndEvent
   5.97%  libj9gc29.so       [.] MM_CollectorLanguageInterfaceImpl::scavenger_getObjectScanner
   4.89%  libj9gc29.so       [.] GC_ObjectHeapIteratorAddressOrderedList::nextObjectNoAdvance
   3.57%  libj9gc29.so       [.] GC_ObjectModelDelegate::calculateObjectDetailsForCopy

-Xgcpolicy:balanced

  [info] # Warmup Iteration   1: 7386.035 ms/op
  [info] # Warmup Iteration   2: 8001.562 ms/op
  [info] # Warmup Iteration   3: 8600.087 ms/op
  [info] # Warmup Iteration   4: 9109.380 ms/op
  [info] # Warmup Iteration   5: 8655.517 ms/op

  CPU profile:
  19.76%  libj9gc29.so     [.] MM_CopyForwardScheme::copy
  19.74%  libj9gc29.so     [.] MM_CopyForwardScheme::scanMixedObjectSlots
   7.58%  libj9gc29.so     [.] MM_GlobalMarkingScheme::scanMixedObject
   3.70%  libj9gc29.so     [.] MM_ClassLoaderRememberedSet::rememberInstance
   3.57%  libj9gc29.so     [.] MM_CopyForwardScheme::completeScanCache



md5-c50656dcdcb56d6cdb18e2195ac868a8



-Xgcpolicy:metronome

  [info] # Warmup Iteration   1: 4199.195 ms/op
  [info] # Warmup Iteration   2: 2717.050 ms/op
  [info] # Warmup Iteration   3: 2669.678 ms/op
  [info] # Warmup Iteration   4: 2718.966 ms/op
  [info] # Warmup Iteration   5: 2697.947 ms/op
  [info] # Warmup Iteration   6: 2643.536 ms/op
  [info] # Warmup Iteration   7: 2682.550 ms/op

  CPU profile:
  11.19%  libj9gc29.so   [.] MM_RealtimeMarkingScheme::incrementalConsumeQueue
  10.46%  libj9gc29.so   [.] MM_SweepSchemeSegregated::sweepSmallRegion



md5-c50656dcdcb56d6cdb18e2195ac868a8



-Xgcpolicy:optavgpause

  [info] # Warmup Iteration   1: 9636.824 ms/op
  [info] # Warmup Iteration   2: 8594.115 ms/op
  [info] # Warmup Iteration   3: 8564.137 ms/op
  [info] # Warmup Iteration   4: 7936.804 ms/op
  [info] # Warmup Iteration   5: 7872.895 ms/op

  CPU profile:
  36.49%  libj9gc29.so   [.] MM_MarkingScheme::completeScan
   7.46%  libj9gc29.so   [.] MM_CompactScheme::getForwardingPtr
   3.39%  libj9gc29.so   [.] MM_ConcurrentSweepScheme::postConnectChunk



md5-c50656dcdcb56d6cdb18e2195ac868a8



-Xgcpolicy:optthruput

  [info] # Warmup Iteration   1: 6449.687 ms/op
  [info] # Warmup Iteration   2: 5368.190 ms/op
  [info] # Warmup Iteration   3: 5982.922 ms/op
  [info] # Warmup Iteration   4: 6439.730 ms/op
  [info] # Warmup Iteration   5: 5780.017 ms/op
  [info] # Warmup Iteration   6: 4925.901 ms/op
  [info] # Warmup Iteration   7: 5586.993 ms/op

  CPU profile:
  54.00%  libj9gc29.so   [.] MM_MarkingScheme::completeScan
   1.97%  libj9gc29.so   [.] MM_CompactScheme::getForwardingPtr

References:

Zulu 9 (Java 9) - http://zulu.org/zulu-9-pre-release-downloads/
OpenJ9 (Java 9) - https://adoptopenjdk.net/releases.html?variant=openjdk9-openj9#x64_linux
OpenJ9 GC Policies - https://www.ibm.com/support/knowledgecenter/en/SSYKE2_8.0.0/com.ibm.java.lnx.80.doc/diag/appendixes/cmdline/xgcpolicy.html
sbt - http://www.scala-sbt.org/0.13/docs/Installing-sbt-on-Linux.html

babsingh on 18 Sep 2017

👍2

@markehammons Changing GC policy to metronome should resolve the performance issue. With metronome, OpenJ9 is comparable to Zulu9. The GC policy can be changed using the following JVM cmdline option: -Xgcpolicy:metronome.

-Xverbosegclog option should generate a log [verbosegc.20170917.232806.28228.txt]. Within this log, OpenJ9's GC policy can be verified: <attribute name="gcPolicy" value="-Xgcpolicy:metronome" />.

OpenJ9 vs. Zulu9:

Zulu9 Results:

[info] Benchmark                      Mode  Cnt     Score    Error  Units
[info] o.d.a.b.b.main                 avgt   10  2831.880 ± 80.085  ms/op
[info] o.d.a.b.b.main               sample   60  2823.116 ± 42.887  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       2726.298           ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       2805.989           ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       2898.264           ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       2948.386           ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       3342.860           ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       3342.860           ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       3342.860           ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       3342.860           ms/op
[info] o.d.a.b.b.main                   ss   10  2794.083 ± 84.560  ms/op

OpenJ9 Results (with -Xgcpolicy:metronome):

[info] Benchmark                      Mode  Cnt     Score    Error  Units
[info] o.d.a.b.b.main                 avgt   10  2740.335 ± 23.958  ms/op
[info] o.d.a.b.b.main               sample   60  2799.418 ± 16.621  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       2726.298           ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       2797.601           ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       2847.932           ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       2852.127           ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       2936.013           ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       2936.013           ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       2936.013           ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       2936.013           ms/op
[info] o.d.a.b.b.main                   ss   10  2808.452 ± 77.603  ms/op

babsingh on 18 Sep 2017

👍2

Adding GC members to investigate the perf issue with gencon GC policy: @dmitripivkine @amicic

fyi: @charliegracie @DanHeidinga

babsingh on 18 Sep 2017

Hi @babsingh can you provide the complete command line options you used for each test? Also would you be able to provide verbose:gc logs for each of the runs?

The metronome GC policy is our incremental soft realtime collector so I am not that surprised that it showed better performance on this benchmark. This benchmark seems to be measuring how long it takes for complete an operation and that GC pauses seem to have a real impact on the responsiveness. The metronome GC attempts keep all pauses increments to ~3ms (default) so in systems where responsiveness is key metronome can regularly perform better out of the box. For applications where total throughput is king gencon (default) is likely the best choice.

charliegracie on 18 Sep 2017

@charliegracie soft realtime shouldn't matter much for this bench. Java 8 using the default settings and not using g1gc performs well on this bench too. Further, I'd assume that throughput would be more helpful, as this bench is supposed to stress the GC with garbage and see how well the GC can clean up massive amounts of garbage without getting in the way of the executing code. So the GC stutter from high throughput GC should be averaged out by the numerous iterations and should only show up in the sampletime results as rare hiccups in an iteration.

Here's the bench results for java8 with -XX:+UseParallelGC

[info] Benchmark                      Mode  Cnt     Score    Error  Units
[info] o.d.a.b.b.main               sample   67  2439.802 ± 93.047  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       2136.998           ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       2369.782           ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       2754.819           ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       2863.032           ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       2956.984           ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       2956.984           ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       2956.984           ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       2956.984           ms/op

As you can see, the worst case caused by gc stutter is in the 99th percentile of bad performance

Here's with the deprecated -XX:+UseParNewGC

[info] Benchmark                      Mode  Cnt     Score     Error  Units
[info] o.d.a.b.b.main               sample   31  5299.706 ± 264.359  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       4722.786            ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       5251.269            ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       5731.097            ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       6059.930            ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       6467.617            ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       6467.617            ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       6467.617            ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       6467.617            ms/op

As you can see, it performs much worse because it uses serial gc for the old generation (which this bench creates an unusual amount of old-gen garbage).

Here's with -XX:+UseSerialGC, which should have terrible latency on my system, being a singlethreaded stop the world collector:

[info] Benchmark                      Mode  Cnt     Score     Error  Units
[info] o.d.a.b.b.main               sample   30  6828.886 ± 243.483  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       6048.186            ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       6819.938            ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       7255.307            ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       7472.991            ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       7625.245            ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       7625.245            ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       7625.245            ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       7625.245            ms/op

As you can see, while none are as performant as the java 8 default (which gets results similar to azul java 9), they all outperform the default GC for openJ9 on this bench.

markehammons on 18 Sep 2017

@charliegracie: Here's a log of two warmup iterations of the gencon gc. I used default settings with the command:

sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvm /home/mhammons/jdk-9+181/bin/java -jvmArgs -Xloggc:gc_gencon.log benchmarksgame.*"

I've compressed the file with xz (though github doesn't like that format so I had to wrap it in a gz as well) because the original log was around 140MB.

gc_gencon.log.xz.gz

It seems this bench hits an edgecase in gencon and causes it to freak out, cause I doubt 140MB of logs in less than 2 minutes is normal.

For good measure, here's the gc logs from doing a full run of the same bench with metronome:
gc_metronome.log.gz

markehammons on 18 Sep 2017

👍1

Looking at the gencon log there is definitely something mis-behaving. The nursery is not expanding which is causing the JVM to spend ~90% of the total runtime in the GC. We are looking into why it is not expanding.

@markehammons while we are looking into things locally would you be able to do another run with -Xmx512m -Xms512m to see if that improves anything?

charliegracie on 18 Sep 2017

Sure!

With -Xmx512m -Xms512m I'm finally able to complete the full sampling bench with the gencon gc, as opposed to only being able to do 2-5 warmup iterations

[info] Benchmark                      Mode  Cnt     Score     Error  Units
[info] o.d.a.b.b.main               sample   20  7777.078 ± 114.003  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       7541.359            ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       7797.211            ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       7962.467            ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       7985.116            ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       7985.955            ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       7985.955            ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       7985.955            ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       7985.955            ms/op
[success] Total time: 310 s, completed Sep 18, 2017 5:48:31 PM

gc_genconXmx512Xms512.log.gz

The performance is around openjdk's serialGC, which is unfortunate, but still sane compared to my previous results. Here's a bench of SerialGC with the same memory constraints for reference:

[info] Benchmark                      Mode  Cnt     Score     Error  Units
[info] o.d.a.b.b.main               sample   30  6224.068 ± 304.897  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       5125.439            ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       6274.679            ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       6767.090            ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       7075.371            ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       7172.260            ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       7172.260            ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       7172.260            ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       7172.260            ms/op

markehammons on 18 Sep 2017

Hi Mark, thanks for quick response and observing this issue. Indeed as Charlie suspected, it's a recent regression in Nursery expansion logic. We understand the problem now and working on a fix.

Meanwhile you can continue experimenting with manual heap sizing. The heap sizing Charlie suggested is still too small. Try even larger values, specifically with Nursery size (-Xmn option): For example -Xmx8G -Xms8G -Xmn6G (total heap of 8G and Nursery 3/4 of it, all fully expanded from the very beginning). You can also provide OpenJDK/Zulu verbose GC logs so we can see how big Heap/Nursery was for them, so we can suggest more appropriate values.

amicic on 18 Sep 2017

@amicic How would I tell openj9 to direct verbose logs to a specific file? What exactly is the option for that? I'll run the tests with the parameters you've suggested

markehammons on 18 Sep 2017

It's -Xverbosegclog option. More about it and some others here: https://www.ibm.com/support/knowledgecenter/en/SSYKE2_8.0.0/com.ibm.java.lnx.80.doc/diag/appendixes/cmdline/commands_gc.html

amicic on 18 Sep 2017

Ok.

OpenJ9 Benches with manually set Xmx Xms and Xmn

Xmx8g Xms8g Xmn6g

[info] Benchmark                      Mode  Cnt     Score    Error  Units
[info] o.d.a.b.b.main               sample   70  2334.550 ± 24.099  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       2227.175           ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       2336.227           ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       2402.917           ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       2417.807           ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       2487.222           ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       2487.222           ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       2487.222           ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       2487.222           ms/op

gc_genconXmx8gXms8gXmn6g.log.gz

Xmx4g Xms4g Xmn3g - OpenJDK standard Xmx for my system

[info] Benchmark                      Mode  Cnt     Score    Error  Units
[info] o.d.a.b.b.main               sample   60  2553.283 ± 27.023  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       2470.445           ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       2543.845           ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       2662.544           ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       2675.756           ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       2730.492           ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       2730.492           ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       2730.492           ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       2730.492           ms/op

gc_genconXmx4gXms4gXmn3g.log.gz

Xmx2g Xms2g Xmn1536m

[info] Benchmark                      Mode  Cnt     Score    Error  Units
[info] o.d.a.b.b.main               sample   60  2628.500 ± 28.380  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       2508.194           ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       2623.537           ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       2701.132           ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       2717.699           ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       2885.681           ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       2885.681           ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       2885.681           ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       2885.681           ms/op

gc_genconXmx2gXms2gXmn1536m.log.gz

Xmx1g Xms1g Xmn768m

[info] Benchmark                      Mode  Cnt     Score    Error  Units
[info] o.d.a.b.b.main               sample   60  2882.326 ± 47.436  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       2680.160           ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       2860.515           ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       3042.129           ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       3061.213           ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       3263.169           ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       3263.169           ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       3263.169           ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       3263.169           ms/op

gc_genconXmx1gXms1gXmn768m.log.gz

Xmx512m Xms512m Xmn384m

[info] Benchmark                      Mode  Cnt     Score     Error  Units
[info] o.d.a.b.b.main               sample   40  4159.282 ± 161.941  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       3581.936            ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       4135.584            ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       4537.398            ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       4579.761            ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       4680.843            ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       4680.843            ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       4680.843            ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       4680.843            ms/op

gc_genconXmx512mXms512mXmn384m.log.gz

Sorry this took a while, I ran benches till I got closer to the Xmx512 results I posted before. It looks like the nursery gen regression is hurting performance really badly. With manually setting the nursery to 384m, the performance of gencon is approximately 2x what I was getting with just Xmx and Xms set to 512m.

markehammons on 18 Sep 2017

❤3 👍1

OpenJDK8 Benches - Default settings with manually set Xmx Xms and Xmn

Xmx8g Xms8g Xmn6g

[info] Benchmark                      Mode  Cnt     Score    Error  Units
[info] o.d.a.b.b.main               sample   74  2179.083 ± 42.629  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       2046.820           ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       2147.484           ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       2338.324           ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       2441.085           ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       2462.056           ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       2462.056           ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       2462.056           ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       2462.056           ms/op

gc_openJDKDefaultXmx8192mXms8192mXmn6144m.log.gz

Xmx4g Xms4g Xmn3g

[info] Benchmark                      Mode  Cnt     Score    Error  Units
[info] o.d.a.b.b.main               sample   72  2170.203 ± 49.907  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       2057.306           ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       2134.901           ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       2318.611           ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       2468.767           ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       2717.909           ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       2717.909           ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       2717.909           ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       2717.909           ms/op

gc_openJDKDefaultXmx4096mXms4096mXmn3072m.log.gz

Xmx2g Xms2g Xmn1536m

[info] Benchmark                      Mode  Cnt     Score    Error  Units
[info] o.d.a.b.b.main               sample   70  2369.872 ± 81.561  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       2107.638           ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       2315.256           ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       2688.129           ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       2827.171           ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       2982.150           ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       2982.150           ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       2982.150           ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       2982.150           ms/op

gc_openJDKDefaultXmx2048mXms2048mXmn1536m.log.gz

Xmx1g Xms1g Xmn768m

[info] Benchmark                      Mode  Cnt     Score    Error  Units
[info] o.d.a.b.b.main               sample   60  2770.478 ± 78.780  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       2424.308           ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       2778.726           ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       3027.029           ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       3122.450           ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       3133.145           ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       3133.145           ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       3133.145           ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       3133.145           ms/op

gc_openJDKDefaultXmx1024mXms1024mXmn768m.log.gz

Xmx512m Xms512m Xmn384m

[info] Benchmark                      Mode  Cnt      Score      Error  Units
[info] o.d.a.b.b.main               sample   10  16822.514 ± 1119.825  ms/op
[info] o.d.a.b.b.main:main·p0.00    sample       15904.801             ms/op
[info] o.d.a.b.b.main:main·p0.50    sample       16701.719             ms/op
[info] o.d.a.b.b.main:main·p0.90    sample       17877.801             ms/op
[info] o.d.a.b.b.main:main·p0.95    sample       17884.512             ms/op
[info] o.d.a.b.b.main:main·p0.99    sample       17884.512             ms/op
[info] o.d.a.b.b.main:main·p0.999   sample       17884.512             ms/op
[info] o.d.a.b.b.main:main·p0.9999  sample       17884.512             ms/op
[info] o.d.a.b.b.main:main·p1.00    sample       17884.512             ms/op

gc_openJDKDefaultXmx512mXms512mXmn384m.log.gz

markehammons on 18 Sep 2017

This workload is rather memory hungry. And it's very generational (most of the objects die young and very few get promoted to Tenure area). Even with 6GB Nursery, there is 12% GC over head. On your machine with (16GB physical memory), you can probably squeeze another 6-7% in throughput by running with 14-15G total heap and giving ~95% of it to Nursery.

However, out of box throughput (no heap size tuning) ,even when Nursery resizing logic regression is resolved, will be noticeably lower. I believe we take only 1/4 of physical memory for total heap size and then 1/4 of total heap size for Nursery (so, about 1GB Nursery). There is also a delay to fully expand Nursery from JVM startup - so expect results similar to what your experiment with -Xmn768M showed.

OpenJDK is known to be more aggressive in Nursery expansion than OpenJ9. It's very subjective and depends on user what is better (CPU or memory savings). For the most fair comparison you want to use identical total heap/Nursery sizing and fully expand them from the start.

amicic on 18 Sep 2017

markehammons on 18 Sep 2017

👍9

@markehammons I don't suppose you'd consider updating your comments in your reddit post :) (assuming I've correctly connected this issue with the jmh comments in the reddit posting!) ?

Just to close the loop :) .

mstoodle on 18 Sep 2017

In what way exactly? Do you want me to clarify it was a regression in the nursery gen expansion of the gencon gc for openj9 and not a problem endemic to jmh benchmarking on openj9?

I was thinking about doing a post on either r/java or r/scala once I had some more concrete results from OpenJ9. I was specifically wanting to look into the gpu and aot features and see how they effect JITC and remove warmup times (if I understand how they work correctly) and demonstrate that.

markehammons on 19 Sep 2017

I should have read your most recent comment in light of the data you posted before that; I interpreted it too broadly. My apologies.

I look forward to hearing about your experiments with gpu and AOT!

mstoodle on 19 Sep 2017

It's no problem. I need to work on my own company's projects for now but I'll probably get to that on the weekend.

markehammons on 19 Sep 2017

@markehammons

Mark, latest builds (Sept 22) have the Nursery expansion regression resolved.

You can compare performance
1) with no heap sizing options altogether (I believe on your machine with 16GB, we will effectively run as if -Xmx4G -Xmnx1G were specified)
2) or you can override max total Heap and Nursery (remember, this workload likes large Nursery) but without fully expanding it (no -Xms or -Xmns option).

With dynamic expansion it's not only interesting/important to compare throughput, but footprint, too.

On Linux, you can get native footprint (resident memory) with ps -p ${JVMPID) -o rss.

amicic on 22 Sep 2017

Hi @amicic, I haven't had a chance to rebench things yet, but I'll do that soon. Sorry, I've been busy with work obligations...

markehammons on 24 Sep 2017

Sorry, I rebenched this last night and nearly forgot to post the results...

bench-shootout-all-openj9.log

I need to rebench zulu9 to get a good comparison for you guys, but it appears the gencon gc is p much totally fixed now.

markehammons on 1 Oct 2017

Closing, please re-open if there is still a problem, or create a new Issue.