In order to retain optimal performance in the interpreter in runtime compressed refs mode, we'll need to multiply-compile it (we already do this for debug mode).
The idea is to define something like J9_COMPRESSED_REFS_OVERRIDE to either TRUE or FALSE, and use the override value, if present, in the J9VMTHREAD_COMPRESS_OBJECT_REFERENCES and J9JAVAVM_COMPRESS_OBJECT_REFERENCES macros.
In keeping with the current naming conventions, I'd suggest replacing BytecodeInterpreter.cpp with BytecodeInterpreterCompressed.cpp and BytecodeInterpreterFull.cpp, and renaming the C entry points appropriately. (Almost) the entirety of these files should be ifdeffed on OMR_GC_COMPRESSED_POINTERS or OMR_GC_FULL_POINTERS (you'll need to include j9cfg.h to get those ifdefs).
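A rough sketch of the shape of one of the split files, using the override name suggested above (the entry point name below is a hypothetical placeholder, not the actual one):

```
/* BytecodeInterpreterCompressed.cpp - sketch only */
#define J9_COMPRESSED_REFS_OVERRIDE TRUE /* the override suggested above */
#include "j9cfg.h" /* provides OMR_GC_COMPRESSED_POINTERS / OMR_GC_FULL_POINTERS */

#if defined(OMR_GC_COMPRESSED_POINTERS)
/* ... body of the interpreter, with the C entry point renamed, e.g.
 * bytecodeLoopCompressed() instead of bytecodeLoop() (hypothetical names) ... */
#endif /* OMR_GC_COMPRESSED_POINTERS */
```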
The new entry points will need to be used in:
For now, let's not bother splitting the debug interpreter or MH interpreter.
@DanHeidinga @sharon-wang
Makefiles will also need to be updated - search for BytecodeInterpreter and duplicate accordingly.
> Makefiles will also need to be updated - search for BytecodeInterpreter and duplicate accordingly.
@dnakamura FYI for CMake.
I'm working on this
To start with, there's no need to actually be in mixed mode - if the individual interpreters are ifdeffed on the full/compressed flags, then they'll work in the normal builds, and won't bloat them.
For an example, see:
For the override check, I was thinking of defining J9_COMPRESSED_REFS_OVERRIDE at the top of each split interpreter file, before including much of anything else (particularly j9nonbuilder.h or anything that includes it),
and in j9nonbuilder.h, the J9VMTHREAD_COMPRESS_OBJECT_REFERENCES / J9JAVAVM_COMPRESS_OBJECT_REFERENCES macros would cover three cases: mixed mode (necessarily 64-bit), compressed only (necessarily 64-bit), and full only (could be 32 or 64-bit).
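A minimal sketch of what the thread-level macro might look like with the override honoured (the vmThread field name and the exact conditions here are assumptions, not the actual j9nonbuilder.h code):

```
/* Sketch only - the field name and exact conditions are assumptions */
#if defined(J9_COMPRESSED_REFS_OVERRIDE)
/* Override present - the result is a compile-time constant, so checks fold away */
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) J9_COMPRESSED_REFS_OVERRIDE
#elif defined(OMR_GC_COMPRESSED_POINTERS) && defined(OMR_GC_FULL_POINTERS)
/* Mixed mode - necessarily 64-bit */
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) (0 != (vmThread)->compressObjectReferences)
#elif defined(OMR_GC_COMPRESSED_POINTERS)
/* Compressed only - necessarily 64-bit */
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) TRUE
#else
/* Full only - could be 32 or 64-bit */
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) FALSE
#endif
```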
Some initial benchmark results...
split: "mixed" build
baseline: regular build
LibertyStartupDT7 - slight improvement in startup time
| | startup time (ms) |
|---|---|
| split | 12417 |
| baseline | 12931 |
| diff | -514 |
Throughput Benchmark - significant reduction in throughput
Options: -Xnocompressedrefs -Xms6g -Xmx6g -Xmn4g
| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff | -46.8% | -62.3% | -42.0% | -39.3% |
I'll be running more iterations of these benchmarks to see if results are consistent.
@sharon-wang those numbers seem really off. The overhead from the compressed checks should mostly be in the GC at this point, which shouldn't account for a 2-3x overhead.
@amicic any comment on that?
Can you collect some profiles from a combo build run? It would really help to see where time is going.
The probable reason for the awful numbers is that the bench was being run in GC stress mode (which is something we will eventually need to address, but for now, the new numbers will be in normal mode to give us a general idea).
Hard to speculate without seeing heap sizing, machine config, GC logs, CPU profiles...
Ran a few more iterations of LibertyStartupDT7, which yielded similar results as above.
For GC, I've collected a few profiles and logs. What sort of data do we want from the profiles and GC logs? Are there specific measurements that we are interested in?
For the profiles, we want to compare the profiles from the combo build to a non-combo build to see which symbols have become hotter. This helps to identify where we're spending more time.
> For GC, I've collected a few profiles and logs
Please send collected GC verbose log files (both combo and non-combo) to me or @amicic
GC times are about 10% longer. More or less expected at this stage.
CPU profiles are too short to be meaningful. I asked @sharon-wang to generate them again.
Updated Throughput Benchmark Results
The GC regression appears to be much less than the initial results. It's not super clear why those results were so off, other than the small heap.
COMPOSITE + PRESET, IR = 5000, Heap settings: -Xms12g -Xmx12g -Xmn10g

COMPOSITE + PRESET, IR = 20000, Heap settings: -Xms12g -Xmx12g -Xmn10g

COMPOSITE + COMPLIANT, Heap settings: -Xms12g -Xmx12g -Xmn10g
| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff | -2.22% | -9.88% | 0 | -1.97% |
COMPOSITE + COMPLIANT, Heap settings: -Xms8g -Xmx8g -Xmn6g
| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff | -1.18% | -2.54% | 0 | 0|
Original runs were with -Xnocompressedrefs. Maybe there is something specific to it that led to such a big difference.
The machine is also different, and the heap sizing is a bit different, but I would not think those would cause such a big difference.
Anyhow, nonCR should be tried again.
I find it odd that the lower heap sizes are showing better results.
Run-to-run variation is easily 3%.
We did not pick the right combo of heap sizes to emphasize the difference between low and high GC overhead scenarios, either. When I picked the large heap (10G nursery), I assumed the small one would be 4G (like in the original runs), so about 2.5x smaller GC overhead. But I should've also picked one even larger than 10G.
I can do some more runs: one set with compressed refs and one without. Are we interested in both COMPLIANT and PRESET runs?
What is suggested for the heap settings? I can use -Xms6g -Xmx6g -Xmn4g for small - what is recommended for a larger heap size?
Let's do compliant runs only for nonCR. If we indeed reproduce the big gap on GC perf machine, then will do PRESET runs and with CPU profiles.
large: -Xmx24G -Xms24G -Xmn20G
small: -Xmx6G -Xms6G -Xmn4G
New Throughput Benchmark Results - COMPOSITE + COMPLIANT, -Xnocompressedrefs
There is not a big gap in performance like initially measured. Might have been a machine issue, since those first runs were done on a different machine than the subsequent runs.
Heap settings: -Xms24g -Xmx24g -Xmn20g
| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff | -1.21% | -5.81% | 0 | -2.65% |
Heap settings: -Xms6g -Xmx6g -Xmn4g
| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff | 0 | -2.45% | 0 | -1.10% |
GC (Scavenge specifically) time slowdown for nonCR
| | Reg | Mix | Diff (%) |
|---|---|---|---|
| Large | 154.851 | 174.017 | 12.38 |
| Small | 115.13 | 123.128 | 6.95 |
Similar for CR:
| | Reg | Mix | Diff (%) |
|---|---|---|---|
| Large | 147.904 | 172.263 | 16.47 |
| Small | 145.972 | 156.681 | 7.34 |
Those are average GC times as obtained by:
`grep '<gc-op.*scavenge' verbosegc.log | awk -F\" '{x+=$6} END {print x/NR}'`
@amicic Should more perf results be collected or is further profiling needed?
I don't need more perf results. These slowdowns are more or less expected, knowing the changes that came in... The next step is to try to reduce the gap.
Generally, we need to reduce the number of runtime 'if CR' checks, if possible by hoisting them out of tight loops. Very often that won't be possible, so we'll have to resort to versioning the code - something similar to what is done for the interpreter (although to avoid maintaining two versions of the source code, I'd consider C++ templates). Either way, we should do it only for the minimum amount of code that is most impacted by the runtime checks.
Specifically, for the Scavenger, which is the most frequently executed GC in Gencon, we can version:
So where do we go from here? @rwy0717 suggested that an interpreter-style solution (the compressed override define) might work better than templates. Any thoughts on that?
Another obvious option is to simply multiply-compile the majority of the GC and fill in the MM function table appropriately. This would obviously perform well, but may be prohibitively large, and it would also need to address the ODR constraint (i.e. the classes would need to be multiply-named).
I'm not familiar at all with templates, so I have nothing to offer in that regard.
Can anyone provide an example of what a piece of code would look like with templates?
One of the major places we probably want to optimize is the scanners/iterators, but I had heard they were being reworked (for VT?), so I've been loath to put a lot of effort into changing those. One thing I had looked at was keeping an internal SlotObject instead of a scan pointer, and using the new instance-based APIs to increment the pointer.
Here's a template example from a discussion with @rwy0717.
In this example:
- PointerModePrinter is parameterized on the PointerMode enum
- there is a PointerModePrinter for COMPRESSED mode and another for FULL mode
- print_pointer_mode() is also parameterized on the PointerMode enum, so that it can be used to declare the appropriate PointerModePrinter based on the PointerMode specified
- do_print_pointer_mode() calls the templated function print_pointer_mode() based on the PointerMode provided at runtime
- To keep the example simple, there are no ifdefs on OMR_GC_COMPRESSED_POINTERS
```
#include <cstdio>

enum class PointerMode { COMPRESSED, FULL };

// class template declaration
template <PointerMode M>
class PointerModePrinter;

// class template specialization for COMPRESSED mode
template <>
class PointerModePrinter<PointerMode::COMPRESSED> {
public:
	void print() { printf("compressed\n"); }
};

// class template specialization for FULL mode
template <>
class PointerModePrinter<PointerMode::FULL> {
public:
	void print() { printf("full\n"); }
};

// templated function
template <PointerMode M>
void print_pointer_mode() {
	PointerModePrinter<M> printer;
	printer.print();
}

// call templated function from wrapper that does runtime dispatch
void do_print_pointer_mode(PointerMode m) {
	switch (m) {
	case PointerMode::COMPRESSED:
		return print_pointer_mode<PointerMode::COMPRESSED>();
	case PointerMode::FULL:
		return print_pointer_mode<PointerMode::FULL>();
	}
}

int main() {
	do_print_pointer_mode(PointerMode::COMPRESSED);
	do_print_pointer_mode(PointerMode::FULL);
	return 0;
}
```
This morning, I made an extremely rough pass at what templating scanObject and completeScan in the MarkingScheme might look like. I did not spend a lot of time on it; there are probably errors in the code, and some things are less elegant than they could be.
@sharon-wang, @rwy0717 thanks for samples and initial work. The switch statement would hopefully go only in very few places, very high in call hierarchy (completeScan() for various Collector classes).
Clearly a significant portion of the GC code would be affected. It would take some time to get right, it would be harder to read and maintain, and from that perspective this does not seem like an ideal approach.
Another perspective is how this affects the resident set (loaded executable code, that is; we don't have static/global variables that would be specialized). If this is linked into a single libj9gc library, does only one specialization get loaded, or both? We are probably talking about a couple of MBs of executable code being specialized, which might be a problem for tiny workloads.
Now let's consider how the multiple-compile approach would look... Would we need per-file/class changes? We would get two separate libj9gccr and libj9gcfull libraries, and only one would get loaded? Well, we would probably need a third one, libj9gccommon (containing, at a minimum, the initial startup logic that decides heap geometry, plus potentially any other code that is unaware of CR)? Then, consequently, some inter-library APIs would have to be defined between CR/Full and Common, then makefile/config shuffling. What else?
The templated code should remain fairly readable - almost every class has its own compressObjectReferences function, which is the only thing that would need to change for the templating.
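For illustration, a minimal sketch of the idea, with a hypothetical class name rather than an actual GC class:

```
#include <stdint.h>

/* Hypothetical example class - not taken from the real GC code */
template<bool compressed>
class MM_ExampleScanner {
public:
	/* The runtime query becomes a compile-time constant, so branches on it fold away */
	bool compressObjectReferences() const { return compressed; }

	/* e.g. the reference slot size now resolves at compile time */
	uintptr_t referenceSize() const { return compressObjectReferences() ? sizeof(uint32_t) : sizeof(uintptr_t); }
};
```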
Hitting a bit of a roadblock with templates: IBM builds may use the old ddrgen, which doesn't support C++ templates, and the new ddrgen would need to be extended to properly handle templates.
@keithc-ca noted that we'd have to look into our own (portable) name mangling for template instantiations and teach ddrgen to map all the compiler-specific manglings to ours. We also discussed using typedefs to hide the templated types, which would seem like a way around dealing with templates in DDR; however, it's possible for compilers to "see through" the typedefs and encounter the templated types anyway.
A typedef won't be much use, since we need both definitions available in the same build. It would also be horrendous to have to use !SomeGCStructCompressed directly in DDR - whatever we do, the unmangled pretty printers have to remain working as-is - nothing would be worse than using the wrong structure during a debug session and going down the garden path.
To elaborate on my typedef comments, imagine we have:
```
typedef GCAlgorithm<Compressed> GCAlgorithmCompressed;
typedef GCAlgorithm<Full> GCAlgorithmFull;
```
If compilers don't inline references to things like GCAlgorithmFull, then we have some hope of generating GCAlgorithmFull.java, but we know that VisualStudio is not such a compiler.
If, as I believe should be the case, the compressed vs. full distinction need only be made for algorithms operating on objects in the heap, then we need neither GCAlgorithmCompressed nor GCAlgorithmFull (or whatever their mangled names might be) for DDR.
With respect to data types like SomeGCStruct: these need to continue with the same names as they currently have. Their fields, offsets and types will continue to be derived from the information extracted from object files. !SomeGCStruct should be the command to be used for both old and new core files.
As a stop-gap measure, I'm going to try compiling two different GC libraries and selecting the appropriate one at runtime. To keep DDR happy, I will leave all of the compressed/full optional fields in structures present in both libraries and only fix the implementations of compressObjectReferences() to return a compile-time constant.
@dmitripivkine @amicic Would you object to me removing the compressed/full ifdefs from the variables (this is where we would eventually end up anyway with a true mixed mode build)? This solves any DDR issues, and will probably make my life a lot easier even with the two-library solution.
You are talking about fields like `bool _compressObjectReferences;` in various GC meta structures (Environment, Extensions, ObjectModel, etc.)?
I don't see a problem.
I'm referring to things like this:
```
#if defined (OMR_GC_COMPRESSED_POINTERS)
uintptr_t _compressedPointersShift; /**< the number of bits to shift by when converting between the compressed pointers heap and real heap */
```
There will be no need to un-ifdef any fields - the proposed multi-compile solution will leave both of the pointer flags defined in both compiles, so the structures will be identical. The optimization will be achieved by defining the override flag (like some of the VM files) instead.
The optimization will be controlled by defining OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES to 0 (full) or 1 (compressed) for all of the J9 and OMR GC code.
The J9 override has been changed to OMR for consistency.
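A sketch of how an inlined getter might consume that flag so the check becomes a compile-time constant in the split compiles (the class here is a hypothetical stand-in, not an actual GC class):

```
/* Hypothetical example - mirrors the pattern described above */
class MM_ExampleBase {
private:
	bool _compressObjectReferences; /* runtime flag, only consulted in mixed mode */
public:
	bool compressObjectReferences() {
#if defined(OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES)
		/* split compile: constant result, so dead branches can be discarded */
		return (0 != OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES);
#else /* OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES */
		/* mixed mode: read the runtime flag */
		return _compressObjectReferences;
#endif /* OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES */
	}
};
```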
So the question now is how to get this into the various build systems. If we end up only making this work in cmake, that's probably OK (and certainly a good start).
Do we want to put the override on the compile command line, or use some other more nefarious mechanism (to test the mechanism, I made objectdescription.h declare the flag for OMR, but that's gross and impractical)?
Things we need to consider:
- how to get the flag into the builds at all
- how to compile the OMR and J9 portions twice with the different flag definitions
- do we compile once in the current locations and once more somewhere else, or do we abandon the in-place compiles and do two compiles to new locations?
@rwy0717 @youngar @dnakamura
Under cmake, I would build the GC as two different libraries from the same set of sources. To get the glue code to build twice, we just need to make both libraries link against the gc-glue interface library. I would just pass the override flag along on the command line; I haven't been able to come up with a cleaner way to do it.
> I haven't been able to come up with a cleaner way to do it.
@rwy0717 That seems reasonably 'clean' to me: what about that is less than ideal?
Also bear in mind the glue is going to be compiled twice as well.
I'm hoping @sharon-wang will actually be doing this work, so she'll no doubt have more specific questions about adding options to the command line (in particular, how does OpenJ9 tell OMR that it wants extra options?).
One of the issues is that we will not need the default compiled OMR GC (which will be runtime mixed mode). If we compile it and discard it, it's not the end of the world, but it would make more sense not to do so.
I'm leaning towards leaving the existing compiles in place (with the flag defined one way) in order to keep the impact of the dual compile minimal (DDR looks in those directories) and compiling the second way somewhere else.
If we produce two GC shared libraries, we'll need to produce two DDR blobs as well (one corresponding to each GC shared library) and the appropriate blob will need to be loaded.
No, the whole point of doing it this way is that we will not need separate blobs - the structures are identical in both libraries (both still think they're in mixed mode, but have their compressed query overridden).
This only works if we are producing shared libraries, which is why it's only an interim solution (which may last for many years).
What will prevent (or detect) some data structures differing between the two GC shared libraries?
Which of the two GC shared libraries should be analyzed by ddrgen?
Reiterating the comment from a few lines ago, there is no possibility of different structures - only inlined query methods are affected by the multiple compiles.
For reference, the PRs adding the multi-compile override are:
https://github.com/eclipse/omr/pull/5416
https://github.com/eclipse/openj9/pull/10206
I saw your comment and I can accept that it is true today, but you haven't answered my questions.
1) How can we be sure that your comment continues to be true?
2) Which GC shared library will be examined by ddrgen (assuming it doesn't matter)?
Code reviewers who understand the process will prevent problems occurring in the future (i.e. no one can use the override to define structure fields).
Again, either library can be used since they will contain identical structures. My suggestion is that we build one of the libraries in the current location so no changes are required to DDR (I plan to go the rest of my life without having to interact with the DDR build process again).
@sharon-wang Now that you're looking into this, I recommend that the current directory structures be used for the compressed version of the GC (still named j9gc29) and the second compile be used for the full version (j9gcfull29). Leaving the existing structure in place allows DDR to continue to work unmodified. This will mean the in-place compile of OMR will need to have the override flag passed to it.
@gacholio By "current directory structures...of the GC", are you referring to the directories for gc, gc_api, gc_base, etc. in the runtime directory? So we would be building 2 copies of each of these GC libraries?
Correct - build the current GC with the compressed override set, and build it again elsewhere with the full override set.
@tajila @sharon-wang Hi Sharon, any update on the build work?
@gacholio Currently working through building two copies of each of the GC libraries (updating cmake files to build 2 versions, updating existing module.xml files and creating new module.xml files for the second copy). The existing GC library structure will stay as is for the compressed override, and there will be a new directory gc_full with subdirectories for each duplicated library. The spots where dynamic GC library loading needs to occur have been identified, and the appropriate library (full or compressed) will be determined based on J9VMTHREAD_COMPRESS_OBJECT_REFERENCES/J9JAVAVM_COMPRESS_OBJECT_REFERENCES.
There's been some discussion on how/where to set the override and pass it to the GC/OMR, but the current effort is still centered around building two copies of each GC library.
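A sketch of the runtime selection described above; the helper name is hypothetical, and the library names follow the earlier suggestion (j9gc29 for compressed, j9gcfull29 for full):

```
#include "j9.h" /* assumption: provides J9JavaVM and the query macro */

/* Hypothetical helper: pick which GC shared library to load at runtime */
static const char *
selectGCLibraryName(J9JavaVM *vm)
{
	return J9JAVAVM_COMPRESS_OBJECT_REFERENCES(vm) ? "j9gc29" : "j9gcfull29";
}
```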
Thanks, sounds like good progress.
An update:
I've shifted to getting things working with CMake since it's more straightforward than UMA. Here's a summary of what's working right now (tested on mac and linux so far, jdk11):
- configure with --with-mixedrefs
- a PLATFORM_mxdptrs.cmake file is used to build the jdk (basically this just sets the pointer mode to mixed)
- j9gc/omrgc are built with OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES=1, j9gc_full/omrgc_full are built with OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES=0 (the OVERRIDE is set in omr and is read by both omr and openj9)
- in jvminit.c, DLLMain.cpp, dllinit.c, mminit.cpp and xcheck.c, there are runtime checks to select either the compressed GC or the full version
- -Xcompressedrefs or -Xnocompressedrefs chooses between compressed refs or large heap -- the corresponding libraries/code will be used

I'll be building on the rest of the platforms to confirm that everything's working. Then, I'll move on to splitting all the GC libraries with UMA.
@amicic @dmitripivkine Would you like a build for perf analysis or other GC testing purposes?
I think only a trivial perf measurement need be done - there are only two scenarios here:
Is a DayTrader startup perf test sufficient or would we like to see another set of SPECjbb runs? SPECjbb tends to take several hours per run, but would allow us to compare the current perf with perf previously posted in this issue (https://github.com/eclipse/openj9/issues/8878#issuecomment-622436065)
I think we would want a comparison between a current VM and a mixed build based on the same sources, rather than relying on historical runs.
I think a throughput benchmark would be more appropriate for the GC work. I am expecting identical performance (within the tolerance of course).
Sounds good. I'll run a throughput benchmark on a current VM and a mixed build to get us some throughput results. Should have some results by tomorrow.
Throughput Benchmark COMPOSITE Results (run on a different machine than previous perf runs in this issue)
baseline: configure --with-cmake
mixedrefs: configure --with-cmake --with-mixedrefs + -Xcompressedrefs at runtime
Heap settings: -Xms12g -Xmx12g -Xmn10g
Average of 2 runs for each:
| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | +0.06% | -0.80% | +6.64% | -1.37% |
The results for baseline were not strictly better than mixedrefs or vice versa in the two runs, so the perf is pretty close.
@mpirvu Do you agree that the results above are within the expected tolerance?
Results look ok to me for max_jOPS and critical_jOPS. I don't know what hbIR_max is.
FYI: @vijaysun-omr
Can we do Daytrader7 startup and throughput experiments as well?
I'm not sure that will be valuable - we've already confirmed the interpreter and JIT changes, and the GC is being built twice, so it will perform exactly as well as the normal builds. The above run was just to verify that fact.
ok
That jbb run would exercise local GCs (Scavenge) almost exclusively (due to the nature of the bench). If there is a reason to believe Marking/Sweep could be affected while Scavenge is not, we can do a similar comparison with the optthruput GC policy (and even balanced). @gacholio ?
It is possible that I missed some of the override locations, so runs with the other supported GC policies would make sense. If you'd like other workloads tested at the same time, please speak up.
@sharon-wang Can we please get some numbers with -Xgcpolicy:optthruput, -Xgcpolicy:metronome and -Xgcpolicy:balanced ?
I think I'd feel more comfortable if at least DT7 throughput and startup were tried with gencon, given its importance as a workload. I checked with @mpirvu and he would be willing to do those runs if he can be given the builds to try.
Sure thing, I'll do 2 runs of a Throughput Benchmark for mixedrefs and baseline for each of those policies. I expect it'll take a week or so to run everything and put the results together.
I'll coordinate with @mpirvu to get him the builds - thanks for doing the DT7 runs!
I compared the baseline (compressedrefs) to the mixedrefs build, both being run with -Xcompressedrefs.
I used DT7 throughput runs, both with SCC on and off, and DT7 start-up experiments with SCC on. I did not see any change in throughput, footprint after load, start-up time or footprint after start-up.
================= Detailed results =============
Results for JDK=/home/mpirvu/sdks/Sharon/Baseline jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m
Throughput avg=3561.54 min=3529.70 max=3603.30 stdDev=27.7 maxVar=2.09% confInt=0.45% samples=10
Intermediate results:
Run 0 241.9 3372.0 3556.4 3533.0 Avg=3533 CPU=184399 ms Footprint=972096 KB
Run 1 270.2 3175.6 3562.2 3546.6 Avg=3547 CPU=138939 ms Footprint=945864 KB
Run 2 278.6 3135.5 3610.9 3588.3 Avg=3588 CPU=140642 ms Footprint=936156 KB
Run 3 278.3 3157.1 3542.8 3551.6 Avg=3552 CPU=135591 ms Footprint=942300 KB
Run 4 273.4 3216.3 3615.3 3603.3 Avg=3603 CPU=137415 ms Footprint=951392 KB
Run 5 301.2 3102.5 3592.5 3584.7 Avg=3585 CPU=138050 ms Footprint=946904 KB
Run 6 271.4 3177.7 3613.8 3529.7 Avg=3530 CPU=140451 ms Footprint=958532 KB
Run 7 281.1 3103.0 3619.0 3592.2 Avg=3592 CPU=141665 ms Footprint=959196 KB
Run 8 294.0 3170.7 3577.3 3551.2 Avg=3551 CPU=141405 ms Footprint=944408 KB
Run 9 276.8 3203.7 3605.0 3534.8 Avg=3535 CPU=137581 ms Footprint=947024 KB
CompTime avg=143613.80 min=135591.00 max=184399.00 stdDev=14464.3 maxVar=36.00% confInt=5.84% samples=10
Footprint avg=950387.20 min=936156.00 max=972096.00 stdDev=10348.4 maxVar=3.84% confInt=0.63% samples=10
Results for JDK=/home/mpirvu/sdks/Sharon/MixedRefs jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m
Throughput avg=3569.66 min=3486.60 max=3616.00 stdDev=35.2 maxVar=3.71% confInt=0.57% samples=10
Intermediate results:
Run 0 256.2 3306.2 3531.7 3486.6 Avg=3487 CPU=189048 ms Footprint=971648 KB
Run 1 280.0 3183.4 3589.9 3557.0 Avg=3557 CPU=138386 ms Footprint=946104 KB
Run 2 279.1 3100.3 3521.4 3584.2 Avg=3584 CPU=140922 ms Footprint=941188 KB
Run 3 281.8 3205.2 3617.9 3616.0 Avg=3616 CPU=135744 ms Footprint=946652 KB
Run 4 269.1 3118.3 3629.6 3557.1 Avg=3557 CPU=135273 ms Footprint=940276 KB
Run 5 290.2 3230.1 3648.3 3598.8 Avg=3599 CPU=131868 ms Footprint=953932 KB
Run 6 265.5 3200.7 3595.9 3580.5 Avg=3580 CPU=137383 ms Footprint=942832 KB
Run 7 291.2 3182.2 3592.9 3552.6 Avg=3553 CPU=135526 ms Footprint=951772 KB
Run 8 278.3 3215.6 3588.7 3583.6 Avg=3584 CPU=137627 ms Footprint=946320 KB
Run 9 283.0 3021.0 3553.9 3580.2 Avg=3580 CPU=140904 ms Footprint=950332 KB
CompTime avg=142268.10 min=131868.00 max=189048.00 stdDev=16658.7 maxVar=43.36% confInt=6.79% samples=10
Footprint avg=949105.60 min=940276.00 max=971648.00 stdDev=9085.2 maxVar=3.34% confInt=0.55% samples=10
Results for JDK=/home/mpirvu/sdks/Sharon/Baseline jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m -Xshareclasses:none
Throughput avg=3647.57 min=3586.80 max=3700.40 stdDev=32.4 maxVar=3.17% confInt=0.52% samples=10
Intermediate results:
Run 0 288.9 3297.1 3607.4 3586.8 Avg=3587 CPU=143930 ms Footprint=957548 KB
Run 1 294.8 3463.2 3652.6 3700.4 Avg=3700 CPU=131022 ms Footprint=918148 KB
Run 2 271.7 3297.0 3684.4 3681.1 Avg=3681 CPU=141237 ms Footprint=927300 KB
Run 3 301.5 3281.0 3611.4 3629.5 Avg=3630 CPU=137024 ms Footprint=916332 KB
Run 4 296.4 3311.9 3651.2 3624.1 Avg=3624 CPU=133670 ms Footprint=933276 KB
Run 5 294.3 3352.8 3667.4 3626.5 Avg=3626 CPU=138411 ms Footprint=928344 KB
Run 6 306.2 3360.2 3632.1 3645.6 Avg=3646 CPU=133881 ms Footprint=929844 KB
Run 7 306.9 3320.7 3647.3 3657.3 Avg=3657 CPU=136465 ms Footprint=923268 KB
Run 8 290.3 3354.3 3722.6 3665.2 Avg=3665 CPU=134035 ms Footprint=923764 KB
Run 9 293.7 3310.0 3655.2 3659.2 Avg=3659 CPU=140797 ms Footprint=923900 KB
CompTime avg=137047.20 min=131022.00 max=143930.00 stdDev=4055.1 maxVar=9.85% confInt=1.72% samples=10
Footprint avg=928172.40 min=916332.00 max=957548.00 stdDev=11522.8 maxVar=4.50% confInt=0.72% samples=10
Results for JDK=/home/mpirvu/sdks/Sharon/MixedRefs jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m -Xshareclasses:none
Throughput avg=3643.11 min=3607.00 max=3681.20 stdDev=23.8 maxVar=2.06% confInt=0.38% samples=10
Intermediate results:
Run 0 269.9 3336.6 3652.3 3607.0 Avg=3607 CPU=144425 ms Footprint=954116 KB
Run 1 283.2 3361.4 3700.8 3650.1 Avg=3650 CPU=138062 ms Footprint=922316 KB
Run 2 287.4 3344.3 3728.8 3681.2 Avg=3681 CPU=133078 ms Footprint=934540 KB
Run 3 282.3 3350.6 3671.7 3635.5 Avg=3636 CPU=134608 ms Footprint=924100 KB
Run 4 268.5 3352.2 3639.9 3626.4 Avg=3626 CPU=136802 ms Footprint=926748 KB
Run 5 291.6 3311.9 3633.3 3643.4 Avg=3643 CPU=140688 ms Footprint=928880 KB
Run 6 290.7 3365.5 3678.4 3665.8 Avg=3666 CPU=133991 ms Footprint=932460 KB
Run 7 272.6 3367.4 3654.5 3610.3 Avg=3610 CPU=128890 ms Footprint=930800 KB
Run 8 289.8 3385.1 3633.9 3659.7 Avg=3660 CPU=136654 ms Footprint=908452 KB
Run 9 294.8 3379.3 3640.8 3651.7 Avg=3652 CPU=133142 ms Footprint=925196 KB
CompTime avg=136034.00 min=128890.00 max=144425.00 stdDev=4365.3 maxVar=12.05% confInt=1.86% samples=10
Footprint avg=928760.80 min=908452.00 max=954116.00 stdDev=11482.4 maxVar=5.03% confInt=0.72% samples=10
Results for JDK=/home/mpirvu/sdks/Sharon/Baseline jvmOpts=-Xcompressedrefs -Xmx256m
StartupTime avg=3990 min=3838 max=4278 stdDev=78.2 maxVar=11.5% confInt=0.36% samples= 80
Footprint avg=217608 min=212580 max=223988 stdDev=2425.8 maxVar=5.4% confInt=0.23% samples= 64
Outlier values: 279768 273776 270648 268024 277728 276680 271772 271932 271836 278256 273840 276412 280328 268892 267852 271984
CThreadTime avg=2287 min=1897 max=2825 stdDev=205.1 maxVar=48.9% confInt=1.67% samples= 80
ProcessTime avg=7583 min=7000 max=9040 stdDev=533.7 maxVar=29.1% confInt=1.32% samples= 79
Results for JDK=/home/mpirvu/sdks/Sharon/MixedRefs jvmOpts=-Xcompressedrefs -Xmx256m
StartupTime avg=3995 min=3826 max=4243 stdDev=86.1 maxVar=10.9% confInt=0.40% samples= 80
Footprint avg=217352 min=210596 max=235584 stdDev=3546.9 maxVar=11.9% confInt=0.33% samples= 67
Outlier values: 275104 275252 273148 276632 274804 283276 281312 275076 277008 265988 273980 276528 272844
CThreadTime avg=2292 min=1904 max=3211 stdDev=252.0 maxVar=68.6% confInt=2.06% samples= 79
ProcessTime avg=7571 min=6900 max=9060 stdDev=526.9 maxVar=31.3% confInt=1.30% samples= 79
Some more Throughput Benchmark COMPOSITE Results
baseline: configure --with-cmake
mixedrefs: configure --with-cmake --with-mixedrefs + -Xcompressedrefs at runtime
Heap settings: -Xms12g -Xmx12g -Xmn10g
Measurements below are an average of 3 runs for each. Run-to-run measurements varied slightly for both configurations and there is no major performance difference between baseline and mixedrefs.
-Xgcpolicy:optthruput

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | -2.13% | +0.41% | -- | -1.06% |
-Xgcpolicy:balanced

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | -2.25% | +4.54% | -2.09% | +2.88% |
-Xgcpolicy:metronome

This GC policy appears to be incompatible with the Throughput Benchmark? The benchmark errors out or crashes even when running with the latest openj9 Adopt JDK15 build.
> -Xgcpolicy:metronome - This GC policy appears to be incompatible with SPECjbb? The benchmark errors out or crashes even when running with the latest openj9 Adopt JDK15 build.
Crashes? That is interesting... Do you have an example around to get an idea of where/how it crashed?
@dmitripivkine Yes, I'll send you the segfault and stack trace
@sharon-wang were you planning to investigate why max_jOPS regressed by ~2% for both optthruput and balanced?
I'm assuming that's random variance - these GCs should be identical to the normal builds. Perhaps another run is in order. I've manually verified that none of the getters are missing the override check.
I will do another set of runs to check if the same regression is seen.
New set of Throughput Benchmark COMPOSITE runs:
baseline: configure --with-cmake
mixedrefs: configure --with-cmake --with-mixedrefs + -Xcompressedrefs at runtime
Heap settings: -Xms12g -Xmx12g -Xmn10g
Measurements below are an average of 5 runs of each.
-Xgcpolicy:optthruput

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | +0.16% | -2.68% | -0.19% | +2.53% |
-Xgcpolicy:balanced

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | +3.15% | +3.00% | +4.98% | -1.52% |
Seeing similar run-to-run variance as previous results. Seems like the two builds show the same performance.
For this initial set of changes to enable mixed builds with CMake, are we focused on JDK11 specifically, or do we want to enable this feature for all versions? I assume this also depends on which versions/platforms CMake is currently available on.
The changes to openj9 to support mixed references should not be specific to any version of java, so I would expect it should work for all (with changes similar to ibmruntimes/openj9-openjdk-jdk11#359 made for the other extensions repositories).
Just an update that all CMake mixed refs changes are now merged. The test story is in progress and can be followed here: https://github.com/eclipse/openj9/issues/9231.