OpenJ9: Runtime compressed refs interpreter performance

Created on 16 Mar 2020 · 80 comments · Source: eclipse/openj9

In order to retain optimal performance in the interpreter in runtime compressed refs mode, we'll need to multiply-compile it (we already do this for debug mode).

The idea is to define something like J9_COMPRESSED_REFS_OVERRIDE as either TRUE or FALSE, and use the override value, if present, in the J9VMTHREAD_COMPRESS_OBJECT_REFERENCES and J9JAVAVM_COMPRESS_OBJECT_REFERENCES macros.

In keeping with the current naming conventions, I'd suggest replacing BytecodeInterpreter.cpp with BytecodeInterpreterCompressed.cpp and BytecodeInterpreterFull.cpp, and renaming the C entry points appropriately. (Almost) the entirety of these files should be ifdeffed on OMR_GC_COMPRESSED_POINTERS or OMR_GC_FULL_POINTERS (you'll need to include j9cfg.h to get those ifdefs).
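A minimal sketch of what the two translation units might look like (the entry point names and the exact way the interpreter body is shared are assumptions, not the final layout):

```
/* BytecodeInterpreterCompressed.cpp - sketch only */
#define J9_COMPRESSED_REFS_OVERRIDE TRUE /* must precede j9nonbuilder.h */
#include "j9cfg.h"
#if defined(OMR_GC_COMPRESSED_POINTERS)
/* ...existing interpreter body, with the C entry point renamed,
 * e.g. bytecodeLoopCompressed()... */
#endif /* OMR_GC_COMPRESSED_POINTERS */

/* BytecodeInterpreterFull.cpp - identical apart from the override/ifdef */
#define J9_COMPRESSED_REFS_OVERRIDE FALSE
#include "j9cfg.h"
#if defined(OMR_GC_FULL_POINTERS)
/* ...existing interpreter body, entry point e.g. bytecodeLoopFull()... */
#endif /* OMR_GC_FULL_POINTERS */
```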

The new entry points will need to be used in:

https://github.com/eclipse/openj9/blob/b542db86f655d22d1697c64d071bd483c3da3695/runtime/vm/jvminit.c#L2455
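At that call site, the selection might look roughly like this (a hedged sketch: the bytecodeLoop field assignment and the renamed entry points mirror the suggestion above and are not taken from the final code):

```
#if defined(OMR_GC_COMPRESSED_POINTERS) && defined(OMR_GC_FULL_POINTERS)
	/* runtime mixed mode - pick the interpreter matching the reference mode */
	if (J9JAVAVM_COMPRESS_OBJECT_REFERENCES(vm)) {
		vm->bytecodeLoop = bytecodeLoopCompressed;
	} else {
		vm->bytecodeLoop = bytecodeLoopFull;
	}
#elif defined(OMR_GC_COMPRESSED_POINTERS) /* && !OMR_GC_FULL_POINTERS */
	vm->bytecodeLoop = bytecodeLoopCompressed;
#else /* OMR_GC_FULL_POINTERS only */
	vm->bytecodeLoop = bytecodeLoopFull;
#endif
```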

For now, let's not bother splitting the debug interpreter or MH interpreter.

build vm perf

All 80 comments

@DanHeidinga @sharon-wang

Makefiles will also need to be updated - search for BytecodeInterpreter and duplicate accordingly.

> Makefiles will also need to be updated - search for BytecodeInterpreter and duplicate accordingly.

@dnakamura FYI for CMake.

I'm working on this

To start with, there's no need to actually be in mixed mode - if the individual interpreters are ifdeffed on the full/compressed flags, then they'll work in the normal builds, and won't bloat them.

For an example, see:

https://github.com/eclipse/openj9/blob/6a8a0aef2e12405cc08654f15c62952d7c5f4fc6/runtime/compiler/runtime/Runtime.cpp#L274-L281

https://github.com/eclipse/openj9/blob/6a8a0aef2e12405cc08654f15c62952d7c5f4fc6/runtime/compiler/runtime/Runtime.cpp#L1094-L1107

For the override check, I was thinking something like:
```
#define J9_COMPRESSED_OVERRIDE TRUE /* or FALSE */
```

before including much of anything else (particularly j9nonbuilder.h or anything that includes it), and in j9nonbuilder.h:

```
#if defined(OMR_GC_COMPRESSED_POINTERS)
#if defined(OMR_GC_FULL_POINTERS)
/* Mixed mode - necessarily 64-bit */
#if defined(J9_COMPRESSED_OVERRIDE)
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) J9_COMPRESSED_OVERRIDE
#else /* J9_COMPRESSED_OVERRIDE */
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) (0 != (vmThread)->compressObjectReferences)
#endif /* J9_COMPRESSED_OVERRIDE */
#else /* OMR_GC_FULL_POINTERS */
/* Compressed only - necessarily 64-bit */
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) TRUE
#endif /* OMR_GC_FULL_POINTERS */
#else /* OMR_GC_COMPRESSED_POINTERS */
/* Full only - could be 32 or 64-bit */
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) FALSE
#endif /* OMR_GC_COMPRESSED_POINTERS */
```

Some initial benchmark results...

split: "mixed" build
baseline: regular build

LibertyStartupDT7 - slight improvement in startup time
| | startup time (ms) |
|---|---|
| split | 12417 |
| baseline | 12931 |
| diff | -514 |

Throughput Benchmark - significant reduction in throughput
Options: -Xnocompressedrefs -Xms6g -Xmx6g -Xmn4g
| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff | -46.8% | -62.3% | -42.0% | -39.3% |

I'll be running more iterations of these benchmarks to see if results are consistent.

@sharon-wang those numbers seem really off. The overhead from the compressed checks should mostly be in the GC at this point, which shouldn't account for a 2-3x overhead.
@amicic any comment on that?

Can you collect some profiles from a combo build run? It would really help to see where time is going.

The probable reason for the awful numbers is that the bench was being run in GC stress mode (which is something we will eventually need to address, but for now, the new numbers will be in normal mode to give us a general idea).

Hard to speculate without seeing heap sizing, machine config, GC logs, CPU profiles...

Ran a few more iterations of LibertyStartupDT7, which yielded similar results as above.

For GC, I've collected a few profiles and logs. What sort of data do we want from the profiles and GC logs? Are there specific measurements that we are interested in?

For the profiles, we want to compare the profiles from the combo build to a non-combo build to see which symbols have become hotter. This helps to identify where we're spending more time.

> For GC, I've collected a few profiles and logs

Please send collected GC verbose log files (both combo and non-combo) to me or @amicic

GC times are about 10% longer. More or less expected at this stage.

CPU profiles are too short to be meaningful. I asked @sharon-wang to generate them again

Updated Throughput Benchmark Results
The GC regression appears to be much smaller than the initial results suggested. It's not clear why those results were so far off, other than the small heap.

COMPOSITE + PRESET, IR = 5000, Heap settings: -Xms12g -Xmx12g -Xmn10g
(results image omitted)

COMPOSITE + PRESET, IR = 20000, Heap settings: -Xms12g -Xmx12g -Xmn10g
(results image omitted)

COMPOSITE + COMPLIANT + Heap settings: -Xms12g -Xmx12g -Xmn10g
| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff | -2.22% | -9.88% | 0 | -1.97% |

COMPOSITE + COMPLIANT + Heap settings: -Xms8g -Xmx8g -Xmn6g
| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff | -1.18% | -2.54% | 0 | 0|

Original runs were with -Xnocompressedrefs. Maybe there is something specific to it that led to such a big difference.
The machine is also different, and the heap sizing is also a bit different, but I would not think those would be the cause of such a big difference.
Anyhow, nonCR should be tried again.

I find it odd that the lower heap sizes are showing better results.

Run-to-run variation is easily 3%.

We did not pick the right combo of heap sizes to emphasize the difference between low and high GC overhead scenarios, either. When I picked the large heap (10G Nursery) I assumed the small one would be 4G (like in the original runs), so about 2.5x smaller GC overhead. But I should've also picked something even larger than 10G.

I can do some more runs: one set with compressed refs and one without. Are we interested in both COMPLIANT and PRESET runs?

What is suggested for the heap settings? I can use -Xms6g -Xmx6g -Xmn4g for small - what is recommended for a larger heap size?

Let's do COMPLIANT runs only for nonCR. If we indeed reproduce the big gap on the GC perf machine, then we will do PRESET runs with CPU profiles.

large: -Xmx24G -Xms24G -Xmn20G
small: -Xmx6G -Xms6G -Xmn4G

New Throughput Benchmark Results - COMPOSITE + COMPLIANT, -Xnocompressedrefs

There is not a big gap in performance like the one initially measured. It might have been a machine issue, since those first runs were done on a different machine than the subsequent runs.

Heap settings: -Xms24g -Xmx24g -Xmn20g
| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff | -1.21% | -5.81% | 0 | -2.65% |

Heap settings: -Xms6g -Xmx6g -Xmn4g
| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff | 0 | -2.45% | 0 | -1.10% |

GC (Scavenge specifically) time slowdown for nonCR

| | Reg | Mix | Diff (%) |
|---|---|---|---|
| Large | 154.851 | 174.017 | 12.38 |
| Small | 115.13 | 123.128 | 6.95 |

Similar for CR:

| | Reg | Mix | Diff (%) |
|---|---|---|---|
| Large | 147.904 | 172.263 | 16.47 |
| Small | 145.972 | 156.681 | 7.34 |

Those are average GC times as obtained by:

```
grep '<gc-op.*scavenge' verbosegc.log | awk -F\" '{x+=$6} END {print x/NR}'
```

@amicic Should more perf results be collected or is further profiling needed?

I don't need more perf results. These slowdowns are more or less expected, knowing the changes that came in... The next step is to try to reduce the gap.

Generally, we need to reduce the number of runtime 'if CR' checks, if possible by hoisting them out of tight loops. Very often that won't be possible, so we'll have to resort to versioning the code - something similar to what was done for the interpreter (although, to avoid maintaining two versions of the source code, I'd consider C++ templates). Either way, we should do it only for the minimum amount of code that is most impacted by the runtime checks.

Specifically, for the Scavenger, which is the most frequently executed GC in Gencon, we can version:

  • top level completeScan() loop and several methods that are dragged in like various scanObject() and copyAndForward() methods
  • Heap Object Iterator, specifically next() method
  • ForwardedHeader, probably the whole class

So where do we go from here? @rwy0717 suggested that an interpreter-style solution (the compressed override define) might work better than templates. Any thoughts on that?

Another obvious option is to simply multiply-compile the majority of the GC and fill in the MM function table appropriately. This would obviously perform, but may be prohibitively large, and it would also need to address the ODR constraint (i.e. the classes would need to be multiply-named).

I'm not familiar at all with templates, so I have nothing to offer in that regard.

Can anyone provide an example of what a piece of code would look like with templates?

One of the major places we probably want to optimize is the scanners/iterators, but I had heard they were being reworked (for VT?), so I've been loath to put a lot of effort into changing those. One thing I had looked at was keeping an internal SlotObject instead of a scan pointer, and using the new instance-based APIs to increment the pointer.
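A minimal sketch of that direction, using a purely hypothetical slot wrapper rather than the real OMR GC_SlotObject API:

```
#include <cstdint>

/* Hypothetical slot wrapper: the scanner keeps one of these instead of a
 * raw scan pointer, so the compressed/full decision lives inside the
 * wrapper rather than at every use site. */
class SlotObject {
public:
    SlotObject(void *slot, bool compressed)
        : _slot(slot), _compressed(compressed) {}

    /* read the reference held in the current slot */
    uintptr_t readReference() const {
        return _compressed ? (uintptr_t)*(uint32_t *)_slot
                           : *(uintptr_t *)_slot;
    }

    /* advance to the next slot in the object being scanned */
    void next() {
        _slot = (uint8_t *)_slot + (_compressed ? sizeof(uint32_t)
                                                : sizeof(uintptr_t));
    }

private:
    void *_slot;
    bool _compressed;
};
```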

Here's a template example from a discussion with @rwy0717.

In this example:

  1. We declare a class template PointerModePrinter parameterized on the PointerMode enum
  2. Create a class template specialization of PointerModePrinter for COMPRESSED mode and another for FULL mode
  3. The templated function print_pointer_mode() is also parameterized on the PointerMode enum, so that it can be used to declare the appropriate PointerModePrinter based on the PointerMode specified
  4. The wrapper function do_print_pointer_mode() calls the templated function print_pointer_mode() based on the PointerMode provided at runtime.

To keep the example simple, there are no ifdefs on OMR_GC_COMPRESSED_POINTERS

```
#include <cstdio>

enum class PointerMode { COMPRESSED, FULL };

// class template declaration
template <PointerMode M>
class PointerModePrinter;

// class template specialization for COMPRESSED mode
template <>
class PointerModePrinter<PointerMode::COMPRESSED> {
public:
    void print() { printf("compressed\n"); }
};

// class template specialization for FULL mode
template <>
class PointerModePrinter<PointerMode::FULL> {
public:
    void print() { printf("full\n"); }
};

// templated function
template <PointerMode M>
void print_pointer_mode() {
    PointerModePrinter<M> printer;
    printer.print();
}

// call templated function from wrapper that does runtime dispatch
void do_print_pointer_mode(PointerMode m) {
    switch (m) {
        case PointerMode::COMPRESSED:
            return print_pointer_mode<PointerMode::COMPRESSED>();
        case PointerMode::FULL:
            return print_pointer_mode<PointerMode::FULL>();
    }
}

int main() {
    do_print_pointer_mode(PointerMode::COMPRESSED);
    do_print_pointer_mode(PointerMode::FULL);
}
```

This morning, I've made an extremely rough pass at what templating scanObject and completeScan in the MarkingScheme might look like. I did not spend a lot of time on it, there are probably errors in the code, and some things are less elegant than they could be.
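To make that concrete, here is a heavily simplified sketch (dummy types and method bodies, not the real MM_MarkingScheme API) of how the pointer-mode template parameter could thread through completeScan() and scanObject(), with the runtime switch made only once at the top:

```
#include <cstdint>

enum class PointerMode { COMPRESSED, FULL };

/* Hypothetical, heavily simplified stand-in for a marking scheme. */
template <PointerMode M>
class MarkingScheme {
public:
    /* read one reference slot; the compressed variant widens and shifts */
    uintptr_t readSlot(const void *slot) const {
        if (M == PointerMode::COMPRESSED) {
            return ((uintptr_t)*(const uint32_t *)slot) << _compressedShift;
        }
        return *(const uintptr_t *)slot;
    }

    void scanObject(void *objectPtr) {
        /* walk the object's reference slots via readSlot(), mark and push
         * children - no per-slot "compressed or full?" branch remains */
        (void)objectPtr;
    }

    void completeScan() {
        /* tight loop: pop work packets and scanObject() each entry */
    }

private:
    uintptr_t _compressedShift = 3;
};

/* Runtime dispatch happens once, high in the call hierarchy. */
void completeScan(bool compressObjectReferences) {
    if (compressObjectReferences) {
        MarkingScheme<PointerMode::COMPRESSED>().completeScan();
    } else {
        MarkingScheme<PointerMode::FULL>().completeScan();
    }
}
```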

@sharon-wang, @rwy0717 thanks for the samples and initial work. The switch statement would hopefully go in only a very few places, very high in the call hierarchy (completeScan() for the various Collector classes).

Clearly a significant portion of GC code would be affected. It would take some time to get it right, and it would be harder to read and maintain, so from that perspective this does not seem like an ideal approach.

Another perspective is how this affects the resident set (loaded executable code, that is; we don't have static/global variables that would be specialized). If this is linked into a single libj9gc library, does only one specialization get loaded, or both? We are probably talking about a couple of MBs of executable code being specialized, which might be a problem for tiny workloads.

Now let's consider how the multiple-compile approach would look... Would we need per-file/class changes? We would get two separate libj9gccr and libj9gcfull libraries, and only one would get loaded? We would probably also need a third one, libj9gccommon (containing, at minimum, the initial startup logic that decides heap geometry, plus potentially any other code that is unaware of CR), and consequently some inter-library APIs defined between CR/Full and Common. Then makefile/config shuffling. What else?

The templated code should remain fairly readable - almost every class has its own compressObjectReferences() function, which is the only thing that would need to change for the templating.

Hitting a bit of a roadblock with templates: IBM builds may use the old ddrgen, which doesn't support C++ templates, and the new ddrgen needs to be extended to properly handle templates.

@keithc-ca noted that we'd have to look into our own (portable) name mangling for template instantiations and teach ddrgen to map all the compiler-specific manglings to ours. We also discussed using typedefs to hide the templated types, which would seem like a way around dealing with templates in DDR; however, compilers may "see through" the typedefs and encounter the templated types anyway.

A typedef won't be much use, since we need both definitions available in the same build. It would also be horrendous to have to use !SomeGCStructCompressed directly in DDR - whatever we do, the unmangled pretty printers have to remain working as-is - nothing would be worse than using the wrong structure during a debug session and going down the garden path.

To elaborate on my typedef comments, imagine we have:

```
typedef GCAlgorithm<Compressed> GCAlgorithmCompressed;
typedef GCAlgorithm<Full> GCAlgorithmFull;
```

If compilers don't inline references to things like GCAlgorithmFull, then we have some hope of generating GCAlgorithmFull.java, but we know that VisualStudio is not such a compiler.

If, as I believe should be the case, the compressed vs. full distinction need only be made for algorithms operating on objects in the heap, then we need neither GCAlgorithmCompressed nor GCAlgorithmFull (or whatever their mangled names might be) for DDR.

With respect to data types like SomeGCStruct: these need to continue with the same names as they currently have. Their fields, offsets and types will continue to be derived from the information extracted from object files. !SomeGCStruct should be the command to be used for both old and new core files.

As a stop-gap measure, I'm going to try compiling two different GC libraries and select the appropriate one at runtime. To keep DDR happy, I will leave all of the compressed/full optional fields in structures present in both libraries and only fix the implementations of compressObjectReferences() to return a compile-time constant.

@dmitripivkine @amicic Would you object to me removing the compressed/full ifdefs from the variables (this is where we would eventually end up anyway with a true mixed mode build)? This solves any DDR issues, and will probably make my life a lot easier even with the two-library solution.

You are talking about fields like

```
#if defined(OMR_GC_COMPRESSED_POINTERS) && defined(OMR_GC_FULL_POINTERS)
	bool _compressObjectReferences;
#endif /* defined(OMR_GC_COMPRESSED_POINTERS) && defined(OMR_GC_FULL_POINTERS) */
```

in various GC meta structures (Environment, Extensions, ObjectModel, etc.)?

I don't see a problem.

I'm referring to things like this:

```
#if defined (OMR_GC_COMPRESSED_POINTERS)
    uintptr_t _compressedPointersShift; /**< the number of bits to shift by when converting between the compressed pointers heap and real heap */
```

There will be no need to un-ifdef any fields - the proposed multi-compile solution will leave both of the pointer flags defined in both compiles, so the structures will be identical. The optimization will be achieved by defining the override flag (like some of the VM files) instead.

The optimization will be controlled by defining OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES to 0 (full) or 1 (compressed) for all of the J9 and OMR GC code.
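As an illustration only (the struct here is a simplified stand-in; in practice these inlined getters live on the existing GC classes such as the extensions, environment and object model), the query collapses to a compile-time constant when the override is defined, e.g. via -DOMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES=1 on the compile line:

```
struct GCExtensionsLike {
	bool _compressObjectReferences;

	bool compressObjectReferences() const {
#if defined(OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES)
		/* override set: the branch disappears from the optimized library */
		return 0 != OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES;
#else /* OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES */
		return _compressObjectReferences; /* runtime mixed mode */
#endif /* OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES */
	}
};
```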

The J9 override has been changed to OMR for consistency.

So the question now is how to get this into the various build systems. If we end up only making this work in cmake, that's probably OK (and certainly a good start).

Do we want to put the override on the compile command line, or use some other more nefarious mechanism (to test the mechanism, I made objectdescription.h declare the flag for OMR, but that's gross and impractical)?

Things we need to consider:

  • how to get the flag into the builds at all

  • how to compile the OMR and J9 portions twice with the different flag definitions

  • do we compile once in the current locations and once more somewhere else or do we abandon the in-place compiles and do two compiles to new locations?

@rwy0717 @youngar @dnakamura

Under cmake, I would build the GC as two different libraries, from the same set of sources. To get the glue code to build twice, we just need to make both libraries link against the gc-glue interface-library. I would just pass the override flag along on the command line, I haven't been able to come up with a cleaner way to do it.

> I haven't been able to come up with a cleaner way to do it.

@rwy0717 That seems reasonably 'clean' to me: what about that is less than ideal?

Also bear in mind the glue is going to be compiled twice as well.

I'm hoping @sharon-wang will actually be doing this work, so she'll no doubt have more specific questions about adding options to the command line (in particular, how does OpenJ9 tell OMR that it wants extra options?).

One of the issues is that we will not need the default compiled OMR GC (which will be runtime mixed mode). If we compile it and discard it, it's not the end of the world, but it would make more sense not to do so.

I'm leaning towards leaving the existing compiles in place (with the flag defined one way) in order to keep the impact of the dual compile minimal (DDR looks in those directories) and compiling the second way somewhere else.

If we produce two GC shared libraries, we'll need to produce two DDR blobs as well (one corresponding to each GC shared library) and the appropriate blob will need to be loaded.

No, the whole point of doing it this way is that we will not need separate blobs - the structures are identical in both libraries (both still think they're in mixed mode, but have their compressed query overridden).

This only works if we are producing shared libraries, which is why it's only an interim solution (which may last for many years).

What will prevent (or detect) some data structures differing between the two GC shared libraries?
Which of the two GC shared libraries should be analyzed by ddrgen?

Reiterating the comment from a few lines ago, there is no possibility of different structures - only inlined query methods are affected by the multiple compiles.

For reference, the PRs adding the multi-compile override are:

https://github.com/eclipse/omr/pull/5416
https://github.com/eclipse/openj9/pull/10206

I saw your comment and I can accept that it is true today, but you haven't answered my questions.
1) How can we be sure that your comment continues to be true?
2) Which GC shared library will be examined by ddrgen (assuming it doesn't matter)?

Code reviewers who understand the process will prevent problems occurring in the future (i.e. no one can use the override to define structure fields).

Again, either library can be used since they will contain identical structures. My suggestion is that we build one of the libraries in the current location so no changes are required to DDR (I plan to go the rest of my life without having to interact with the DDR build process again).

@sharon-wang Now that you're looking into this, I recommend that the current directory structures be used for the compressed version of the GC (still named j9gc29) and the second compile be used for the full version (j9gcfull29). Leaving the existing structure in place allows DDR to continue to work unmodified. This will mean the in-place compile of OMR will need to have the override flag passed to it.

@gacholio By "current directory structures...of the GC", are you referring to the directories for gc, gc_api, gc_base, etc. in the runtime directory? So we would be building 2 copies of each of these GC libraries?

Correct - build the current GC with the compressed override set, and build it again elsewhere with the full override set.

@tajila @sharon-wang Hi Sharon, any update on the build work?

@gacholio Currently working through building two copies of each of the GC libraries (updating CMake files to build two versions, updating existing module.xml files and creating new module.xml files for the second copy). The existing GC library structure will stay as-is for the compressed override, and there will be a new directory gc_full with subdirectories for each duplicated library. The spots where dynamic GC library loading needs to occur have been identified, and the appropriate library (full or compressed) will be selected based on J9VMTHREAD_COMPRESS_OBJECT_REFERENCES/J9JAVAVM_COMPRESS_OBJECT_REFERENCES.
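A minimal sketch of the kind of runtime selection being described (the library base names follow the j9gc29/j9gcfull29 naming suggested earlier in this issue; the surrounding DLL-loading code is omitted, and this is not the actual OpenJ9 loader logic):

```
#include "j9.h"

/* Once the reference mode is known, choose which GC shared library to load. */
static const char *
selectGCLibraryName(J9JavaVM *vm)
{
	return J9JAVAVM_COMPRESS_OBJECT_REFERENCES(vm) ? "j9gc29" : "j9gcfull29";
}
```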

There's been some discussion on how/where to set the override and pass it to the GC/OMR, but the current effort is still centered around building two copies of each GC library.

Thanks, sounds like good progress.

An update:

I've shifted to getting things working with CMake since it's more straightforward than UMA. Here's a summary of what's working right now (tested on mac and linux so far, jdk11):

  • New configure flag --with-mixedrefs
  • When that flag is specified, the corresponding PLATFORM_mxdptrs.cmake file is used to build the jdk (basically this just sets the pointer mode to mixed)
  • openj9 gc libs are all split into two libs (full, compressed) and use the corresponding omr gc as appropriate
  • j9gc/omrgc are built with OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES=1, j9gc_full/omrgc_full are built with OMR_OVERRIDE_COMPRESS_OBJECT_REFERENCES=0 (the OVERRIDE is set in omr and is read by both omr and openj9)
  • In jvminit.c, DLLMain.cpp, dllinit.c, mminit.cpp and xcheck.c, there are runtime checks to select either the compressed GC or the full version
  • When running Java, use -Xcompressedrefs or -Xnocompressedrefs to choose between compressed refs or large heap -- the corresponding libraries/code will be used

I'll be building on the rest of the platforms to confirm that everything's working. Then, I'll move on to splitting all the GC libraries with UMA.

@amicic @dmitripivkine Would you like a build for perf analysis or other GC testing purposes?

I think only a trivial perf measurement need be done - there are only two scenarios here:

  • work didn't split appropriately, perf will be miserable
  • work was done correctly, perf is exactly what is was before

Is a DayTrader startup perf test sufficient or would we like to see another set of SPECjbb runs? SPECjbb tends to take several hours per run, but would allow us to compare the current perf with perf previously posted in this issue (https://github.com/eclipse/openj9/issues/8878#issuecomment-622436065)

I think we would want a comparison between a current VM and a mixed build based on the same sources, rather than relying on historical runs.

I think a throughput benchmark would be more appropriate for the GC work. I am expecting identical performance (within the tolerance of course).

Sounds good. I'll run a throughput benchmark on a current VM and a mixed build to get us some throughput results. Should have some results by tomorrow.

Throughput Benchmark COMPOSITE Results (run on a different machine than previous perf runs in this issue)

baseline: configure --with-cmake
mixedrefs: configure --with-cmake --with-mixedrefs + -Xcompressedrefs at runtime

Heap settings: -Xms12g -Xmx12g -Xmn10g

Average of 2 runs for each:

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | +0.06% | -0.80% | +6.64% | -1.37% |

The results for baseline were not strictly better than mixedrefs or vice versa in the two runs, so the perf is pretty close.

@mpirvu Do you agree that the results above are within the expected tolerance?

Results look ok to me for max_jOPS and critical_jOPS. I don't know what hbIR_max is.
FYI: @vijaysun-omr

Can we do Daytrader7 startup and throughput experiments as well?

I'm not sure that will be valuable - we've already confirmed the interpreter and JIT changes, and the GC is being built twice, so it will perform exactly as well as the normal builds. The above run was just to verify that fact.

ok

That jbb run would exercise Local GCs (Scavenge) almost exclusively (due to the nature of the bench). If there is a reason to believe Marking/Sweep could be affected while Scavenge is not, we can do a similar comparison with the optthruput GC policy (and even Balanced). @gacholio ?

It is possible that I missed some of the override locations, so runs with the other supported GC policies would make sense. If you'd like other workloads tested at the same time, please speak up.

@sharon-wang Can we please get some numbers with -Xgcpolicy:optthruput, -Xgcpolicy:metronome and -Xgcpolicy:balanced ?

I think I'd feel more comfortable if at least DT7 throughput and startup was tried with gencon given its importance as a workload. I checked with @mpirvu and he would be willing to do those runs if he can be given the builds to try.

Sure thing, I'll do 2 runs of a Throughput Benchmark for mixedrefs and baseline for each of those policies. Expecting it'll take a week or so to run everything and put the results together.

I'll coordinate with @mpirvu to get him the builds - thanks for doing the DT7 runs!

I compared the baseline (compressedrefs) to the mixedrefs build, both being run with -Xcompressedrefs.
I used DT7 throughput runs, both with SCC on and off, and DT7 start-up experiments with SCC on. I did not see any change in throughput, footprint after load, start-up time or footprint after start-up.

================= Detailed results =============

```
Results for JDK=/home/mpirvu/sdks/Sharon/Baseline jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m
Throughput      avg=3561.54     min=3529.70     max=3603.30     stdDev=27.7     maxVar=2.09%    confInt=0.45%   samples=10
Intermediate results:
Run 0   241.9   3372.0  3556.4  3533.0  Avg=3533        CPU=184399 ms  Footprint=972096 KB
Run 1   270.2   3175.6  3562.2  3546.6  Avg=3547        CPU=138939 ms  Footprint=945864 KB
Run 2   278.6   3135.5  3610.9  3588.3  Avg=3588        CPU=140642 ms  Footprint=936156 KB
Run 3   278.3   3157.1  3542.8  3551.6  Avg=3552        CPU=135591 ms  Footprint=942300 KB
Run 4   273.4   3216.3  3615.3  3603.3  Avg=3603        CPU=137415 ms  Footprint=951392 KB
Run 5   301.2   3102.5  3592.5  3584.7  Avg=3585        CPU=138050 ms  Footprint=946904 KB
Run 6   271.4   3177.7  3613.8  3529.7  Avg=3530        CPU=140451 ms  Footprint=958532 KB
Run 7   281.1   3103.0  3619.0  3592.2  Avg=3592        CPU=141665 ms  Footprint=959196 KB
Run 8   294.0   3170.7  3577.3  3551.2  Avg=3551        CPU=141405 ms  Footprint=944408 KB
Run 9   276.8   3203.7  3605.0  3534.8  Avg=3535        CPU=137581 ms  Footprint=947024 KB
CompTime        avg=143613.80   min=135591.00   max=184399.00   stdDev=14464.3  maxVar=36.00%   confInt=5.84%   samples=10
Footprint       avg=950387.20   min=936156.00   max=972096.00   stdDev=10348.4  maxVar=3.84%    confInt=0.63%   samples=10

Results for JDK=/home/mpirvu/sdks/Sharon/MixedRefs jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m
Throughput      avg=3569.66     min=3486.60     max=3616.00     stdDev=35.2     maxVar=3.71%    confInt=0.57%   samples=10
Intermediate results:
Run 0   256.2   3306.2  3531.7  3486.6  Avg=3487        CPU=189048 ms  Footprint=971648 KB
Run 1   280.0   3183.4  3589.9  3557.0  Avg=3557        CPU=138386 ms  Footprint=946104 KB
Run 2   279.1   3100.3  3521.4  3584.2  Avg=3584        CPU=140922 ms  Footprint=941188 KB
Run 3   281.8   3205.2  3617.9  3616.0  Avg=3616        CPU=135744 ms  Footprint=946652 KB
Run 4   269.1   3118.3  3629.6  3557.1  Avg=3557        CPU=135273 ms  Footprint=940276 KB
Run 5   290.2   3230.1  3648.3  3598.8  Avg=3599        CPU=131868 ms  Footprint=953932 KB
Run 6   265.5   3200.7  3595.9  3580.5  Avg=3580        CPU=137383 ms  Footprint=942832 KB
Run 7   291.2   3182.2  3592.9  3552.6  Avg=3553        CPU=135526 ms  Footprint=951772 KB
Run 8   278.3   3215.6  3588.7  3583.6  Avg=3584        CPU=137627 ms  Footprint=946320 KB
Run 9   283.0   3021.0  3553.9  3580.2  Avg=3580        CPU=140904 ms  Footprint=950332 KB
CompTime        avg=142268.10   min=131868.00   max=189048.00   stdDev=16658.7  maxVar=43.36%   confInt=6.79%   samples=10
Footprint       avg=949105.60   min=940276.00   max=971648.00   stdDev=9085.2   maxVar=3.34%    confInt=0.55%   samples=10

Results for JDK=/home/mpirvu/sdks/Sharon/Baseline jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m -Xshareclasses:none
Throughput      avg=3647.57     min=3586.80     max=3700.40     stdDev=32.4     maxVar=3.17%    confInt=0.52%   samples=10
Intermediate results:
Run 0   288.9   3297.1  3607.4  3586.8  Avg=3587        CPU=143930 ms  Footprint=957548 KB
Run 1   294.8   3463.2  3652.6  3700.4  Avg=3700        CPU=131022 ms  Footprint=918148 KB
Run 2   271.7   3297.0  3684.4  3681.1  Avg=3681        CPU=141237 ms  Footprint=927300 KB
Run 3   301.5   3281.0  3611.4  3629.5  Avg=3630        CPU=137024 ms  Footprint=916332 KB
Run 4   296.4   3311.9  3651.2  3624.1  Avg=3624        CPU=133670 ms  Footprint=933276 KB
Run 5   294.3   3352.8  3667.4  3626.5  Avg=3626        CPU=138411 ms  Footprint=928344 KB
Run 6   306.2   3360.2  3632.1  3645.6  Avg=3646        CPU=133881 ms  Footprint=929844 KB
Run 7   306.9   3320.7  3647.3  3657.3  Avg=3657        CPU=136465 ms  Footprint=923268 KB
Run 8   290.3   3354.3  3722.6  3665.2  Avg=3665        CPU=134035 ms  Footprint=923764 KB
Run 9   293.7   3310.0  3655.2  3659.2  Avg=3659        CPU=140797 ms  Footprint=923900 KB
CompTime        avg=137047.20   min=131022.00   max=143930.00   stdDev=4055.1   maxVar=9.85%    confInt=1.72%   samples=10
Footprint       avg=928172.40   min=916332.00   max=957548.00   stdDev=11522.8  maxVar=4.50%    confInt=0.72%   samples=10

Results for JDK=/home/mpirvu/sdks/Sharon/MixedRefs jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m -Xshareclasses:none
Throughput      avg=3643.11     min=3607.00     max=3681.20     stdDev=23.8     maxVar=2.06%    confInt=0.38%   samples=10
Intermediate results:
Run 0   269.9   3336.6  3652.3  3607.0  Avg=3607        CPU=144425 ms  Footprint=954116 KB
Run 1   283.2   3361.4  3700.8  3650.1  Avg=3650        CPU=138062 ms  Footprint=922316 KB
Run 2   287.4   3344.3  3728.8  3681.2  Avg=3681        CPU=133078 ms  Footprint=934540 KB
Run 3   282.3   3350.6  3671.7  3635.5  Avg=3636        CPU=134608 ms  Footprint=924100 KB
Run 4   268.5   3352.2  3639.9  3626.4  Avg=3626        CPU=136802 ms  Footprint=926748 KB
Run 5   291.6   3311.9  3633.3  3643.4  Avg=3643        CPU=140688 ms  Footprint=928880 KB
Run 6   290.7   3365.5  3678.4  3665.8  Avg=3666        CPU=133991 ms  Footprint=932460 KB
Run 7   272.6   3367.4  3654.5  3610.3  Avg=3610        CPU=128890 ms  Footprint=930800 KB
Run 8   289.8   3385.1  3633.9  3659.7  Avg=3660        CPU=136654 ms  Footprint=908452 KB
Run 9   294.8   3379.3  3640.8  3651.7  Avg=3652        CPU=133142 ms  Footprint=925196 KB
CompTime        avg=136034.00   min=128890.00   max=144425.00   stdDev=4365.3   maxVar=12.05%   confInt=1.86%   samples=10
Footprint       avg=928760.80   min=908452.00   max=954116.00   stdDev=11482.4  maxVar=5.03%    confInt=0.72%   samples=10
Results for JDK=/home/mpirvu/sdks/Sharon/Baseline jvmOpts=-Xcompressedrefs -Xmx256m
StartupTime     avg=3990        min=3838        max=4278        stdDev=78.2     maxVar=11.5%    confInt=0.36%   samples= 80
Footprint       avg=217608      min=212580      max=223988      stdDev=2425.8   maxVar=5.4%     confInt=0.23%   samples= 64
        Outlier values:  279768 273776 270648 268024 277728 276680 271772 271932 271836 278256 273840 276412 280328 268892 267852 271984
CThreadTime     avg=2287        min=1897        max=2825        stdDev=205.1    maxVar=48.9%    confInt=1.67%   samples= 80
ProcessTime     avg=7583        min=7000        max=9040        stdDev=533.7    maxVar=29.1%    confInt=1.32%   samples= 79

Results for JDK=/home/mpirvu/sdks/Sharon/MixedRefs jvmOpts=-Xcompressedrefs -Xmx256m
StartupTime     avg=3995        min=3826        max=4243        stdDev=86.1     maxVar=10.9%    confInt=0.40%   samples= 80
Footprint       avg=217352      min=210596      max=235584      stdDev=3546.9   maxVar=11.9%    confInt=0.33%   samples= 67
        Outlier values:  275104 275252 273148 276632 274804 283276 281312 275076 277008 265988 273980 276528 272844
CThreadTime     avg=2292        min=1904        max=3211        stdDev=252.0    maxVar=68.6%    confInt=2.06%   samples= 79
ProcessTime     avg=7571        min=6900        max=9060        stdDev=526.9    maxVar=31.3%    confInt=1.30%   samples= 79
```

Some more Throughput Benchmark COMPOSITE Results

baseline: configure --with-cmake
mixedrefs: configure --with-cmake --with-mixedrefs + -Xcompressedrefs at runtime

Heap settings: -Xms12g -Xmx12g -Xmn10g

Measurements below are an average of 3 runs for each. Run-to-run measurements varied slightly for both configurations and there is no major performance difference between baseline and mixedrefs.

-Xgcpolicy:optthruput

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | -2.13% | +0.41% | -- | -1.06% |

-Xgcpolicy:balanced

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | -2.25% | +4.54% | -2.09% | +2.88% |

-Xgcpolicy:metronome

This GC policy appears to be incompatible with the Throughput Benchmark? The benchmark errors out or crashes even when running with the latest openj9 Adopt JDK15 build.

> -Xgcpolicy:metronome
>
> This GC policy appears to be incompatible with SPECjbb? The benchmark errors out or crashes even when running with the latest openj9 Adopt JDK15 build.

Crashes? That is interesting... Do you have an example around that would give any idea where/how it crashed?

@dmitripivkine Yes, I'll send you the segfault and stack trace

@sharon-wang were you planning to investigate why max_jOPS regressed by ~2% for both optthruput and balanced?

I'm assuming that's random variance - these GCs should be identical to the normal builds. Perhaps another run is in order. I've manually verified that none of the getters are missing the override check.

I will do another set of runs to check if the same regression is seen.

New set of Throughput Benchmark COMPOSITE runs:

baseline: configure --with-cmake
mixedrefs: configure --with-cmake --with-mixedrefs + -Xcompressedrefs at runtime

Heap settings: -Xms12g -Xmx12g -Xmn10g

Measurements below are an average of 5 runs of each.

-Xgcpolicy:optthruput

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | +0.16% | -2.68% | -0.19% | +2.53% |

-Xgcpolicy:balanced

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | +3.15% | +3.00% | +4.98% | -1.52% |

Seeing similar run-to-run variance as previous results. Seems like the two builds show the same performance.

For this initial set of changes to enable mixed builds with CMake, are we focused on JDK11 specifically, or do we want to enable this feature for all versions? I assume this also depends on which versions/platforms CMake is currently available on.

The changes to openj9 to support mixed references should not be specific to any version of java, so I would expect it should work for all (with changes similar to ibmruntimes/openj9-openjdk-jdk11#359 made for the other extensions repositories).

Just an update that all CMake mixed refs changes are now merged. The test story is in progress and can be followed here: https://github.com/eclipse/openj9/issues/9231.
