I'm creating a JMH version of the benchmarks from The Computer Language Benchmarks Game, and I decided to try it out by comparing OpenJ9 to Zulu9 from Azul. Zulu9's performance is around 2600 ms/op on my system, but OpenJ9 shows a strange performance issue: it slows down with every successive warmup iteration:
Zulu9 - sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvm /usr/lib/jvm/zulu-9/bin/java"
[info] # Run progress: 0.00% complete, ETA 00:05:40
[info] # Fork: 1 of 1
[info] # Warmup Iteration 1: 2630.687 ms/op
[info] # Warmup Iteration 2: 2780.907 ms/op
[info] # Warmup Iteration 3: 2630.115 ms/op
[info] # Warmup Iteration 4: 2587.016 ms/op
[info] # Warmup Iteration 5: 2504.948 ms/op
[info] # Warmup Iteration 6: 2566.648 ms/op
[info] # Warmup Iteration 7: 2537.053 ms/op
[info] # Warmup Iteration 8: 2500.211 ms/op
[info] # Warmup Iteration 9: 2576.959 ms/op
[info] # Warmup Iteration 10: 2517.836 ms/op
[info] # Warmup Iteration 11: 2514.588 ms/op
[info] # Warmup Iteration 12: 2508.471 ms/op
[info] # Warmup Iteration 13: 2548.678 ms/op
[info] # Warmup Iteration 14: 2529.385 ms/op
[info] # Warmup Iteration 15: 2508.744 ms/op
[info] # Warmup Iteration 16: 2475.521 ms/op
[info] # Warmup Iteration 17: 2518.160 ms/op
[info] # Warmup Iteration 18: 2572.912 ms/op
[info] # Warmup Iteration 19: 2605.698 ms/op
[info] # Warmup Iteration 20: 2556.147 ms/op
OpenJ9 - sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvm /home/mhammons/jdk-9+181/bin/java"
[info] # Run progress: 0.00% complete, ETA 00:05:40
[info] # Fork: 1 of 1
[info] # Warmup Iteration 1: 36902.857 ms/op
[info] # Warmup Iteration 2: 41640.554 ms/op
[info] # Warmup Iteration 3: 49802.187 ms/op
[info] # Warmup Iteration 4: 59686.587 ms/op
[info] # Warmup Iteration 5: 69178.086 ms/op
[info] # Warmup Iteration 6: 78664.955 ms/op
...
As you can see, OpenJ9 starts off markedly slower than Zulu9. What's more concerning is that instead of improving during the warmup phase, as expected, the subsequent warmup iterations grow slower and slower on OpenJ9.
The benches were compiled with OpenJDK 8 and run in a forked JVM by the JMH harness. You can invoke the benches and indicate the JVM used to run them with the following command:
sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvm path/to/jvm"
JVM arguments can be supplied with an added -jvmArgs argument. I did this with -Xjit:verbose to verify that JIT compilation was occurring, and it was, so this slowdown is puzzling. Another thing I noted while running this bench is that although it is designed to use as much threading as the system's CPUs will allow for, half or less of my CPU was actually in use with OpenJ9.
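For example, a run with JIT verbose logging enabled might look like this (same options as the commands above, with only -jvmArgs added):
sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvm /home/mhammons/jdk-9+181/bin/java -jvmArgs -Xjit:verbose"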
The performance issue originates from the GC. Massive slowdowns (12x to 20x) are seen with gencon (the default GC policy). Performance improves with the other GC policies: balanced, metronome, optavgpause and optthruput. With metronome, the performance issue is resolved.
Warmup iterations with different GC policies + CPU profiles:
-Xgcpolicy:gencon
[info] # Warmup Iteration 1: 32667.888 ms/op
[info] # Warmup Iteration 2: 35529.382 ms/op
[info] # Warmup Iteration 3: 40413.470 ms/op
CPU profile:
28.65% libj9gc29.so [.] MM_Scavenger::incrementalScanCacheBySlot
8.68% libj9gc29.so [.] MM_MarkingScheme::completeScan
6.78% libjclse9_29.so [.] gcEndEvent
5.97% libj9gc29.so [.] MM_CollectorLanguageInterfaceImpl::scavenger_getObjectScanner
4.89% libj9gc29.so [.] GC_ObjectHeapIteratorAddressOrderedList::nextObjectNoAdvance
3.57% libj9gc29.so [.] GC_ObjectModelDelegate::calculateObjectDetailsForCopy
-Xgcpolicy:balanced
[info] # Warmup Iteration 1: 7386.035 ms/op
[info] # Warmup Iteration 2: 8001.562 ms/op
[info] # Warmup Iteration 3: 8600.087 ms/op
[info] # Warmup Iteration 4: 9109.380 ms/op
[info] # Warmup Iteration 5: 8655.517 ms/op
CPU profile:
19.76% libj9gc29.so [.] MM_CopyForwardScheme::copy
19.74% libj9gc29.so [.] MM_CopyForwardScheme::scanMixedObjectSlots
7.58% libj9gc29.so [.] MM_GlobalMarkingScheme::scanMixedObject
3.70% libj9gc29.so [.] MM_ClassLoaderRememberedSet::rememberInstance
3.57% libj9gc29.so [.] MM_CopyForwardScheme::completeScanCache
-Xgcpolicy:metronome
[info] # Warmup Iteration 1: 4199.195 ms/op
[info] # Warmup Iteration 2: 2717.050 ms/op
[info] # Warmup Iteration 3: 2669.678 ms/op
[info] # Warmup Iteration 4: 2718.966 ms/op
[info] # Warmup Iteration 5: 2697.947 ms/op
[info] # Warmup Iteration 6: 2643.536 ms/op
[info] # Warmup Iteration 7: 2682.550 ms/op
CPU profile:
11.19% libj9gc29.so [.] MM_RealtimeMarkingScheme::incrementalConsumeQueue
10.46% libj9gc29.so [.] MM_SweepSchemeSegregated::sweepSmallRegion
-Xgcpolicy:optavgpause
[info] # Warmup Iteration 1: 9636.824 ms/op
[info] # Warmup Iteration 2: 8594.115 ms/op
[info] # Warmup Iteration 3: 8564.137 ms/op
[info] # Warmup Iteration 4: 7936.804 ms/op
[info] # Warmup Iteration 5: 7872.895 ms/op
CPU profile:
36.49% libj9gc29.so [.] MM_MarkingScheme::completeScan
7.46% libj9gc29.so [.] MM_CompactScheme::getForwardingPtr
3.39% libj9gc29.so [.] MM_ConcurrentSweepScheme::postConnectChunk
-Xgcpolicy:optthruput
[info] # Warmup Iteration 1: 6449.687 ms/op
[info] # Warmup Iteration 2: 5368.190 ms/op
[info] # Warmup Iteration 3: 5982.922 ms/op
[info] # Warmup Iteration 4: 6439.730 ms/op
[info] # Warmup Iteration 5: 5780.017 ms/op
[info] # Warmup Iteration 6: 4925.901 ms/op
[info] # Warmup Iteration 7: 5586.993 ms/op
CPU profile:
54.00% libj9gc29.so [.] MM_MarkingScheme::completeScan
1.97% libj9gc29.so [.] MM_CompactScheme::getForwardingPtr
References:
@markehammons Changing the GC policy to metronome should resolve the performance issue. With metronome, OpenJ9 is comparable to Zulu9. The GC policy can be changed using the following JVM command-line option: -Xgcpolicy:metronome.
The -Xverbosegclog option should generate a log [verbosegc.20170917.232806.28228.txt]. Within this log, OpenJ9's GC policy can be verified: <attribute name="gcPolicy" value="-Xgcpolicy:metronome" />.
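With the jmh harness used in this issue, the policy can be passed through -jvmArgs, for example (JVM path as in the earlier commands):
sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvm /home/mhammons/jdk-9+181/bin/java -jvmArgs -Xgcpolicy:metronome"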
OpenJ9 vs. Zulu9:
Zulu9 Results:
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main avgt 10 2831.880 ± 80.085 ms/op
[info] o.d.a.b.b.main sample 60 2823.116 ± 42.887 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 2726.298 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 2805.989 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 2898.264 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 2948.386 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 3342.860 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 3342.860 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 3342.860 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 3342.860 ms/op
[info] o.d.a.b.b.main ss 10 2794.083 ± 84.560 ms/op
OpenJ9 Results (with -Xgcpolicy:metronome):
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main avgt 10 2740.335 ± 23.958 ms/op
[info] o.d.a.b.b.main sample 60 2799.418 ± 16.621 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 2726.298 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 2797.601 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 2847.932 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 2852.127 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 2936.013 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 2936.013 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 2936.013 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 2936.013 ms/op
[info] o.d.a.b.b.main ss 10 2808.452 ± 77.603 ms/op
Adding GC members to investigate the perf issue with gencon GC policy: @dmitripivkine @amicic
fyi: @charliegracie @DanHeidinga
Hi @babsingh, can you provide the complete command-line options you used for each test? Also, would you be able to provide verbose GC logs for each of the runs?
The metronome GC policy is our incremental soft-realtime collector, so I am not that surprised that it showed better performance on this benchmark. This benchmark seems to be measuring how long it takes to complete an operation, and GC pauses seem to have a real impact on responsiveness. The metronome GC attempts to keep all pause increments to ~3ms (the default), so in systems where responsiveness is key, metronome can regularly perform better out of the box. For applications where total throughput is king, gencon (the default) is likely the best choice.
@charliegracie soft realtime shouldn't matter much for this bench. Java 8 with default settings, not using G1GC, performs well on this bench too. Further, I'd assume that a throughput-oriented collector would be more helpful, as this bench is supposed to stress the GC with garbage and see how well the GC can clean up massive amounts of it without getting in the way of the executing code. So the GC stutter from a high-throughput GC should be averaged out over the numerous iterations and should only show up in the sample-time results as rare hiccups in an iteration.
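The actual benchmark source isn't included in this issue, but for readers unfamiliar with the setup, here is a minimal, hypothetical sketch of the kind of allocation-heavy, multi-threaded JMH benchmark being described. Class, package, and helper names are invented (only the "main" method name and benchmarksgame package prefix are borrowed from the result labels and commands above), and it assumes sbt-jmh with Scala 2.12-era parallel collections:

package benchmarksgame

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

// Hypothetical sketch only; not the author's actual benchmark code.
@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MILLISECONDS)
class GcStressBench {
  final case class Node(left: Node, right: Node)

  // Build a complete binary tree; every node becomes garbage as soon as
  // the checksum below has been computed, which is what pressures the GC.
  private def tree(depth: Int): Node =
    if (depth == 0) Node(null, null)
    else Node(tree(depth - 1), tree(depth - 1))

  private def check(n: Node): Int =
    if (n.left == null) 1
    else 1 + check(n.left) + check(n.right)

  @Benchmark
  def main(): Int =
    // Allocate and discard many trees in parallel (.par uses a pool sized
    // to the machine's CPUs), mimicking the "as many threads as the CPUs
    // allow" behaviour described above.
    (1 to 1000).par.map(_ => check(tree(14))).sum
}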
Here's the bench results for java8 with -XX:+UseParallelGC
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 67 2439.802 ± 93.047 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 2136.998 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 2369.782 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 2754.819 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 2863.032 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 2956.984 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 2956.984 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 2956.984 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 2956.984 ms/op
As you can see, the worst case caused by GC stutter only shows up at the 99th percentile.
Here's with the deprecated -XX:+UseParNewGC
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 31 5299.706 ± 264.359 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 4722.786 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 5251.269 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 5731.097 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 6059.930 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 6467.617 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 6467.617 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 6467.617 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 6467.617 ms/op
As you can see, it performs much worse because it uses a serial collector for the old generation (and this bench creates an unusually large amount of old-gen garbage).
Here's with -XX:+UseSerialGC, which should have terrible latency on my system, being a single-threaded stop-the-world collector:
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 30 6828.886 ± 243.483 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 6048.186 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 6819.938 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 7255.307 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 7472.991 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 7625.245 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 7625.245 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 7625.245 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 7625.245 ms/op
As you can see, while none are as performant as the Java 8 default (which gets results similar to Azul's Java 9), they all outperform the default GC for OpenJ9 on this bench.
@charliegracie: Here's a log of two warmup iterations of the gencon gc. I used default settings with the command:
sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvm /home/mhammons/jdk-9+181/bin/java -jvmArgs -Xloggc:gc_gencon.log benchmarksgame.*"
I've compressed the file with xz (though github doesn't like that format so I had to wrap it in a gz as well) because the original log was around 140MB.
It seems this bench hits an edge case in gencon and causes it to freak out, because I doubt 140MB of logs in less than 2 minutes is normal.
For good measure, here's the gc logs from doing a full run of the same bench with metronome:
gc_metronome.log.gz
Looking at the gencon log, there is definitely something misbehaving. The nursery is not expanding, which is causing the JVM to spend ~90% of the total runtime in the GC. We are looking into why it is not expanding.
@markehammons while we are looking into things locally would you be able to do another run with -Xmx512m -Xms512m to see if that improves anything?
Sure!
With -Xmx512m -Xms512m I'm finally able to complete the full sampling bench with the gencon gc, as opposed to only being able to do 2-5 warmup iterations
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 20 7777.078 ± 114.003 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 7541.359 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 7797.211 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 7962.467 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 7985.116 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 7985.955 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 7985.955 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 7985.955 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 7985.955 ms/op
[success] Total time: 310 s, completed Sep 18, 2017 5:48:31 PM
The performance is around OpenJDK's SerialGC, which is unfortunate, but still sane compared to my previous results. Here's a bench of SerialGC with the same memory constraints for reference:
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 30 6224.068 ± 304.897 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 5125.439 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 6274.679 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 6767.090 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 7075.371 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 7172.260 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 7172.260 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 7172.260 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 7172.260 ms/op
Hi Mark, thanks for the quick response and for observing this issue. Indeed, as Charlie suspected, it's a recent regression in the Nursery expansion logic. We understand the problem now and are working on a fix.
Meanwhile you can continue experimenting with manual heap sizing. The heap sizing Charlie suggested is still too small. Try even larger values, specifically for the Nursery size (the -Xmn option): for example, -Xmx8G -Xms8G -Xmn6G (a total heap of 8G with the Nursery being 3/4 of it, all fully expanded from the very beginning). You can also provide OpenJDK/Zulu verbose GC logs so we can see how big the Heap/Nursery was for them, so we can suggest more appropriate values.
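For example, combined with the jmh invocation used earlier in this thread, that sizing might be passed roughly like this (quoting of multiple -jvmArgs values may need adjusting for your shell/sbt):
sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvm /home/mhammons/jdk-9+181/bin/java -jvmArgs '-Xmx8G -Xms8G -Xmn6G'"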
@amicic How would I tell openj9 to direct verbose logs to a specific file? What exactly is the option for that? I'll run the tests with the parameters you've suggested
It's the -Xverbosegclog option. More about it and some others here: https://www.ibm.com/support/knowledgecenter/en/SSYKE2_8.0.0/com.ibm.java.lnx.80.doc/diag/appendixes/cmdline/commands_gc.html
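For example, to direct the log to a named file (the file name here is only an illustration):
sbt "jmh:run -i 10 -wi 20 -f1 -t1 -r 15 -jvm /home/mhammons/jdk-9+181/bin/java -jvmArgs -Xverbosegclog:gc_gencon_verbose.log"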
Ok.
Xmx8g Xms8g Xmn6g
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 70 2334.550 ± 24.099 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 2227.175 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 2336.227 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 2402.917 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 2417.807 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 2487.222 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 2487.222 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 2487.222 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 2487.222 ms/op
gc_genconXmx8gXms8gXmn6g.log.gz
Xmx4g Xms4g Xmn3g - OpenJDK standard Xmx for my system
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 60 2553.283 ± 27.023 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 2470.445 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 2543.845 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 2662.544 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 2675.756 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 2730.492 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 2730.492 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 2730.492 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 2730.492 ms/op
gc_genconXmx4gXms4gXmn3g.log.gz
Xmx2g Xms2g Xmn1536m
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 60 2628.500 ± 28.380 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 2508.194 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 2623.537 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 2701.132 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 2717.699 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 2885.681 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 2885.681 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 2885.681 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 2885.681 ms/op
gc_genconXmx2gXms2gXmn1536m.log.gz
Xmx1g Xms1g Xmn768m
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 60 2882.326 ± 47.436 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 2680.160 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 2860.515 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 3042.129 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 3061.213 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 3263.169 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 3263.169 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 3263.169 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 3263.169 ms/op
gc_genconXmx1gXms1gXmn768m.log.gz
Xmx512m Xms512m Xmn384m
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 40 4159.282 ± 161.941 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 3581.936 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 4135.584 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 4537.398 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 4579.761 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 4680.843 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 4680.843 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 4680.843 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 4680.843 ms/op
gc_genconXmx512mXms512mXmn384m.log.gz
Sorry this took a while; I ran benches until I got closer to the Xmx512 results I posted before. It looks like the nursery expansion regression is hurting performance really badly. With the nursery manually set to 384m, the performance of gencon is approximately 2x what I was getting with just Xmx and Xms set to 512m.
Xmx8g Xms8g Xmn6g
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 74 2179.083 ± 42.629 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 2046.820 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 2147.484 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 2338.324 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 2441.085 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 2462.056 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 2462.056 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 2462.056 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 2462.056 ms/op
gc_openJDKDefaultXmx8192mXms8192mXmn6144m.log.gz
Xmx4g Xms4g Xmn3g
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 72 2170.203 ± 49.907 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 2057.306 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 2134.901 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 2318.611 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 2468.767 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 2717.909 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 2717.909 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 2717.909 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 2717.909 ms/op
gc_openJDKDefaultXmx4096mXms4096mXmn3072m.log.gz
Xmx2g Xms2g Xmn1536m
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 70 2369.872 ± 81.561 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 2107.638 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 2315.256 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 2688.129 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 2827.171 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 2982.150 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 2982.150 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 2982.150 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 2982.150 ms/op
gc_openJDKDefaultXmx2048mXms2048mXmn1536m.log.gz
Xmx1g Xms1g Xmn768m
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 60 2770.478 ± 78.780 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 2424.308 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 2778.726 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 3027.029 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 3122.450 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 3133.145 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 3133.145 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 3133.145 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 3133.145 ms/op
gc_openJDKDefaultXmx1024mXms1024mXmn768m.log.gz
Xmx512m Xms512m Xmn384m
[info] Benchmark Mode Cnt Score Error Units
[info] o.d.a.b.b.main sample 10 16822.514 ± 1119.825 ms/op
[info] o.d.a.b.b.main:main·p0.00 sample 15904.801 ms/op
[info] o.d.a.b.b.main:main·p0.50 sample 16701.719 ms/op
[info] o.d.a.b.b.main:main·p0.90 sample 17877.801 ms/op
[info] o.d.a.b.b.main:main·p0.95 sample 17884.512 ms/op
[info] o.d.a.b.b.main:main·p0.99 sample 17884.512 ms/op
[info] o.d.a.b.b.main:main·p0.999 sample 17884.512 ms/op
[info] o.d.a.b.b.main:main·p0.9999 sample 17884.512 ms/op
[info] o.d.a.b.b.main:main·p1.00 sample 17884.512 ms/op
This workload is rather memory-hungry, and it's very generational (most of the objects die young and very few get promoted to the Tenure area). Even with a 6GB Nursery, there is 12% GC overhead. On your machine (16GB physical memory), you can probably squeeze out another 6-7% in throughput by running with a 14-15G total heap and giving ~95% of it to the Nursery.
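Purely as an illustration of that suggestion, that would mean something in the neighbourhood of -Xmx14G -Xms14G -Xmn13G; the exact values depend on what else the machine is running.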
However, out-of-box throughput (no heap size tuning), even once the Nursery resizing regression is resolved, will be noticeably lower. I believe we take only 1/4 of physical memory for the total heap size and then 1/4 of the total heap size for the Nursery (so, about a 1GB Nursery). There is also a delay to fully expand the Nursery from JVM startup, so expect results similar to what your experiment with -Xmn768M showed.
OpenJDK is known to be more aggressive in Nursery expansion than OpenJ9. It's very subjective, and it depends on the user which is better (CPU or memory savings). For the fairest comparison you want to use identical total heap/Nursery sizing and fully expand them from the start.
That's fine. gencon seems like an interesting GC. It's outperforming G1GC for Java 8 at 512/512/384, so it seems to run better with less memory than a lot of the OpenJDK garbage collectors. The JIT for OpenJ9 seems to perform better than OpenJDK's in some of my microbenchmarks too. I'm looking forward to it working better with more software so I can test it more with real-world software...
@markehammons I don't suppose you'd consider updating your comments in your reddit post :) (assuming I've correctly connected this issue with the jmh comments in the reddit posting!) ?
Just to close the loop :) .
In what way exactly? Do you want me to clarify it was a regression in the nursery gen expansion of the gencon gc for openj9 and not a problem endemic to jmh benchmarking on openj9?
I was thinking about doing a post on either r/java or r/scala once I had some more concrete results from OpenJ9. I was specifically wanting to look into the GPU and AOT features, see how they affect JIT compilation and reduce warmup times (if I understand how they work correctly), and demonstrate that.
I should have read your most recent comment in light of the data you posted before that; I interpreted it too broadly. My apologies.
I look forward to hearing about your experiments with gpu and AOT!
It's no problem. I need to work on my own company's projects for now but I'll probably get to that on the weekend.
@markehammons
Mark, latest builds (Sept 22) have the Nursery expansion regression resolved.
You can compare performance
1) with no heap sizing options altogether (I believe on your machine with 16GB, we will effectively run as if -Xmx4G -Xmnx1G were specified)
2) or you can override the max total Heap and Nursery sizes (remember, this workload likes a large Nursery) but without fully expanding them (no -Xms or -Xmns options).
With dynamic expansion it's not only interesting/important to compare throughput, but footprint, too.
On Linux, you can get the native footprint (resident memory) with ps -p ${JVMPID} -o rss.
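For example, to sample the forked JVM's resident set every few seconds while the bench runs (the PID below is a placeholder for the forked benchmark JVM's process id):
watch -n 5 'ps -p <forked-jvm-pid> -o rss='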
Hi @amicic, I haven't had a chance to rebench things yet, but I'll do that soon. Sorry, I've been busy with work obligations...
Sorry, I rebenched this last night and nearly forgot to post the results...
I need to rebench Zulu9 to get a good comparison for you guys, but it appears the gencon GC is pretty much totally fixed now.
Closing, please re-open if there is still a problem, or create a new Issue.