OpenJ9: Portable SCC: Compressed Refs

Created on 4 Dec 2019 · 59 comments · Source: eclipse/openj9

There are two approaches that were brought up in the Portable SCC discussion regarding how to deal with the potential for the compressed refs shift changing with the heap size.

  1. Have the JIT assume that the compressed shift might be 4. The generated code then loads the shift value into a register. This load can then be relocated.
  2. Fix the shift value to 3 if the JVM is going to use AOT code. (A sketch contrasting the two approaches follows this list.)
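For context, here is a minimal sketch (assuming a plain heap-base/shift encoding and hypothetical names, not OpenJ9's actual codegen) of what the two approaches amount to when a compressed reference is decoded:

```c
#include <stdint.h>

/* Approach 1: the shift is read from a per-JVM location that the AOT loader
 * can relocate, so the same compiled body works for any shift value.
 * 'jvmCompressedRefShift' is a hypothetical name for that location. */
extern uint32_t jvmCompressedRefShift;

static inline uintptr_t decodeWithRelocatedShift(uintptr_t heapBase, uint32_t compressedRef)
{
    return heapBase + ((uintptr_t)compressedRef << jvmCompressedRefShift);
}

/* Approach 2: the shift is baked in as a compile-time constant (3), which the
 * code generator can fold into a scaled addressing mode, but the code then
 * only matches JVMs that actually run with that shift. */
static inline uintptr_t decodeWithFixedShift3(uintptr_t heapBase, uint32_t compressedRef)
{
    return heapBase + ((uintptr_t)compressedRef << 3);
}
```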
Labels: gc, jit, compaot, vm, discussion

All 59 comments

fyi @dmitripivkine @amicic

Can one of you comment on what would be affected in the GC (and rest of the runtime) by forcing the CR shift to always be 3? How would this affect the heap size? Occupancy? More wasted space? etc

also fyi @fjeremic @andrewcraik @gita-omr @knn-k to give a codegen perspective.

For a large part the GC would not be affected (roots and metastructures like the remembered set do not use CR). The biggest performance impact, I guess, would come from jitted code, which is more for the JIT folks to comment on, e.g. how an unnecessary shift for a <4GB heap would affect performance.

I don't see obvious footprint implications.

Don't remember if this was discussed already, but what I struggle with more is jitted code (non)portability in a unified CR/nonCR VM.

jitted code (non)portability in unified CR/nonCR VM.

This was brought up in the discussion; running in CR / non-CR results in a different SCC. I suppose theoretically the same SCC could be used to store both CR and non-CR versions of a compiled method, but that's better discussed in another issue (which I can open if we feel it's a discussion worth having).

Don't remember if this was discussed already, but what I struggle with more is jitted code (non)portability in a unified CR/nonCR VM.

The CR-ness is encoded in the cache name today so that they are forced to be separate caches. I expect the initial unified CR/nonCR VM will still use separate caches for the initial approach.

fyi @dmitripivkine @amicic

Can one of you comment on what would be affected in the GC (and rest of the runtime) by forcing the CR shift to always be 3? How would this affect the heap size? Occupancy? More wasted space? etc

  • As Aleks mentioned before, loss of performance for cases where the heap can be allocated below the 4GB bar and could run with a 0-bit shift
  • also loss of support for heaps up to 64GB using a 4-bit shift (see the sketch below for where these limits come from)
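For reference, those limits follow from a 32-bit compressed reference addressing at most 2^(32+shift) bytes of heap (roughly speaking, ignoring where the heap base lands); a small sketch of the arithmetic:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* A 32-bit compressed reference covers a (4GB << shift) range of heap,
     * so shift 0 tops out at 4GB, shift 3 at 32GB, and shift 4 at 64GB. */
    for (unsigned shift = 0; shift <= 4; shift++) {
        uint64_t maxHeapGB = (((uint64_t)1 << 32) << shift) >> 30;
        printf("shift %u -> up to %llu GB of heap\n", shift, (unsigned long long)maxHeapGB);
    }
    return 0;
}
```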

From the codegen perspective, proposed solution 1 is going to be less performant and harder to implement correctly; solution 2 will give the most performance but the least flexibility.

@fjeremic @andrewcraik Do you guys have any issues/concerns with moving forward with solution 2 here (fixing the shift value to 3 for portable AOT)?

Summary of the solution:

  • When -XX:+PortableSharedCache is specified, the compressedrefs shift will be fixed to 3 (the shift is 3 for heaps <= 32GB, 4 for > 32GB, and compressedrefs cannot be used above 64GB; I could be wrong on these numbers).
  • If the user always operates under 32GB then there will be no problem.
  • The only limitation is that if the shift was fixed to 3 and the heap size was later increased to more than 32GB (which requires a 4-bit shift), then we will need to deem the AOT code in the existing cache/layers unusable. (Not sure how often this scenario occurs.)

FYI @mpirvu @vijaysun-omr @ymanton @dsouzai

the compressedref shift will be fixed to 3 (3 for <= 32gb and 4 for > 32gb).

Worth noting that there is a point when even shift 4 won't work, e.g. if the heap is so big that we have to run without compressedrefs; I don't know if the JVM will still explicitly require the user to pass in -Xnocompressedrefs once we have one build that can do both.

If we can't use compressedrefs, what should the portable AOT story be? Should we generate a SCC with the default CPU features & nocompressedrefs (thereby mandating that all future layers must have compressedrefs disabled)?

If we can't use compressedrefs, what should the portable AOT story be? Should we generate a SCC with the default CPU features & nocompressedrefs (thereby mandating that all future layers must have compressedrefs disabled)?

It sounds reasonable to me as we are giving the same treatment to shift 3/shift 4/nocompressedrefs. This shouldn't be that bad if most of the use cases fall under shift 3.

Note compressedrefs and non-compressedrefs don't share any cache, there are different cache files for these atm.

Is it possible to estimate the impact of applying a constant shift by 3 in compiled code? In general I am supportive of this proposed direction that you are taking, but I wanted to get a sense for how much slower we would be making portable AOT code because of the compressed refs shift. My expectation/hope would be that the overhead isn't much more than 5%, but this is what I'm asking to be measured.

Is it possible to estimate the impact of applying a constant shift by 3 in compiled code? In general I am supportive of this proposed direction that you are taking, but I wanted to get a sense for how much slower we would be making portable AOT code because of the compressed refs shift. My expectation/hope would be that the overhead isn't much more than 5%, but this is what I'm asking to be measured.

Daytrader7

  • liberty version: liberty-20.0.0.2_wlp_webProfile7
  • seeing a 1% drop here (and that could be due to the fluctuations), will do another run and update the result again tomorrow

CompressedShift = 0

Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput      avg=2850.45     min=2759.30     max=2900.90     stdDev=45.5     maxVar=5.13%    confInt=1.07%   samples= 8
Intermediate results:
Run 0   187.3   2575.9  2890.0  2861.1  Avg=2861        CPU=113492 ms  Footprint=712364 KB
Run 1   200.6   2573.6  2887.4  2867.5  Avg=2868        CPU=108388 ms  Footprint=700740 KB
Run 2   221.0   2579.6  2901.8  2759.3  Avg=2759        CPU=112528 ms  Footprint=699352 KB
Run 3   222.1   2628.1  2830.4  2892.9  Avg=2893        CPU=107786 ms  Footprint=706204 KB
Run 4   180.4   2628.9  2903.8  2830.5  Avg=2830        CPU=108510 ms  Footprint=706704 KB
Run 5   226.5   2598.6  2705.5  2867.6  Avg=2868        CPU=112368 ms  Footprint=713240 KB
Run 6   221.2   2647.2  2837.1  2900.9  Avg=2901        CPU=110313 ms  Footprint=698736 KB
Run 7   231.8   2619.8  2928.2  2823.8  Avg=2824        CPU=110404 ms  Footprint=707608 KB
CompTime        avg=110473.62   min=107786.00   max=113492.00   stdDev=2150.7   maxVar=5.29%    confInt=1.30%   samples= 8
Footprint       avg=705618.50   min=698736.00   max=713240.00   stdDev=5599.8   maxVar=2.08%    confInt=0.53%   samples= 8

CompressedShift = 3

Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput      avg=2819.82     min=2776.80     max=2861.30     stdDev=29.0     maxVar=3.04%    confInt=0.69%   samples= 8
Intermediate results:
Run 0   163.2   2442.4  2795.4  2776.8  Avg=2777        CPU=137886 ms  Footprint=651608 KB
Run 1   144.9   2350.9  2826.5  2847.1  Avg=2847        CPU=137913 ms  Footprint=636708 KB
Run 2   152.9   2429.0  2857.6  2826.9  Avg=2827        CPU=131592 ms  Footprint=637768 KB
Run 3   174.9   2363.9  2790.5  2832.9  Avg=2833        CPU=140504 ms  Footprint=642980 KB
Run 4   161.6   2433.0  2810.8  2803.1  Avg=2803        CPU=132384 ms  Footprint=632412 KB
Run 5   139.9   2409.7  2819.2  2861.3  Avg=2861        CPU=132907 ms  Footprint=649168 KB
Run 6   177.9   2467.1  2801.9  2787.1  Avg=2787        CPU=137302 ms  Footprint=636512 KB
Run 7   178.0   2431.0  2764.5  2823.4  Avg=2823        CPU=133845 ms  Footprint=638280 KB
CompTime        avg=135541.62   min=131592.00   max=140504.00   stdDev=3256.5   maxVar=6.77%    confInt=1.61%   samples= 8
Footprint       avg=640679.50   min=632412.00   max=651608.00   stdDev=6681.6   maxVar=3.04%    confInt=0.70%   samples= 8

AcmeAir in Docker

Shift0:

run0: summary = 2938081 in  600s = 4896.7/s Avg:   1 Min:   0 Max:  891 Err:   0 (0.00%)
run1: summary = 3136727 in  600s = 5227.7/s Avg:   1 Min:   0 Max:  129 Err:   0 (0.00%)
run2: summary = 3147370 in  600s = 5245.4/s Avg:   1 Min:   0 Max:  109 Err:   0 (0.00%)
run3: summary = 3139280 in  600s = 5232.0/s Avg:   1 Min:   0 Max:  117 Err:   0 (0.00%)
run4: summary = 3133830 in  600s = 5222.8/s Avg:   1 Min:   0 Max:  79 Err:   0 (0.00%)
run5: summary = 3136712 in  600s = 5227.7/s Avg:   1 Min:   0 Max:  156 Err:   0 (0.00%)

5231.12 
Shift3:

run0: summary = 2964754 in  600s = 4941.1/s Avg:   1 Min:   0 Max:  260 Err:   0 (0.00%)
run1: summary = 3137234 in  600s = 5228.3/s Avg:   1 Min:   0 Max:  124 Err:   0 (0.00%)
run2: summary = 3126874 in  600s = 5211.3/s Avg:   1 Min:   0 Max:  110 Err:   0 (0.00%)
run3: summary = 3139452 in  600s = 5232.2/s Avg:   1 Min:   0 Max:  64 Err:   0 (0.00%)
run4: summary = 3134675 in  600s = 5224.3/s Avg:   1 Min:   0 Max:  100 Err:   0 (0.00%)
run5: summary = 3139328 in  600s = 5232.1/s Avg:   1 Min:   0 Max:  113 Err:   0 (0.00%)

5225.64

  • No throughput drop observed

Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:none -Xnoaot -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput      avg=2997.16     min=2976.00     max=3022.10     stdDev=13.9     maxVar=1.55%    confInt=0.31%   samples= 8
Intermediate results:
Run 0   161.6   2661.4  3026.7  3022.1  Avg=3022        CPU=131266 ms  Footprint=597944 KB
Run 1   175.7   2600.5  3029.5  2990.0  Avg=2990        CPU=127742 ms  Footprint=594136 KB
Run 2   178.5   2622.8  3002.0  2984.6  Avg=2985        CPU=130536 ms  Footprint=602824 KB
Run 3   161.8   2686.5  3003.7  3000.8  Avg=3001        CPU=129596 ms  Footprint=596588 KB
Run 4   131.0   2617.7  2820.5  2976.0  Avg=2976        CPU=143361 ms  Footprint=603160 KB
Run 5   157.4   2657.5  3016.9  2999.4  Avg=2999        CPU=129978 ms  Footprint=604736 KB
Run 6   175.0   2656.1  2978.4  3001.4  Avg=3001        CPU=130472 ms  Footprint=592384 KB
Run 7   192.8   2599.7  3042.6  3003.0  Avg=3003        CPU=130239 ms  Footprint=613256 KB
CompTime        avg=131648.75   min=127742.00   max=143361.00   stdDev=4843.3   maxVar=12.23%   confInt=2.46%   samples= 8
Footprint       avg=600628.50   min=592384.00   max=613256.00   stdDev=6774.0   maxVar=3.52%    confInt=0.76%   samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:none -Xnoaot -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput      avg=2933.98     min=2877.50     max=2974.10     stdDev=29.9     maxVar=3.36%    confInt=0.68%   samples= 8
Intermediate results:
Run 0   173.7   2609.1  2960.8  2918.4  Avg=2918        CPU=133268 ms  Footprint=599836 KB
Run 1   179.3   2641.4  2952.0  2920.2  Avg=2920        CPU=128775 ms  Footprint=597212 KB
Run 2   154.8   2589.6  2955.9  2956.3  Avg=2956        CPU=131750 ms  Footprint=606756 KB
Run 3   191.9   2553.4  2955.2  2945.6  Avg=2946        CPU=129687 ms  Footprint=597068 KB
Run 4   149.1   2640.5  2965.7  2974.1  Avg=2974        CPU=129883 ms  Footprint=600332 KB
Run 5   193.3   2638.6  2941.3  2927.3  Avg=2927        CPU=127431 ms  Footprint=603852 KB
Run 6   168.7   2530.4  2892.4  2877.5  Avg=2878        CPU=145817 ms  Footprint=596084 KB
Run 7   176.7   2569.6  2945.2  2952.4  Avg=2952        CPU=134123 ms  Footprint=596880 KB
CompTime        avg=132591.75   min=127431.00   max=145817.00   stdDev=5798.9   maxVar=14.43%   confInt=2.93%   samples= 8
Footprint       avg=599752.50   min=596084.00   max=606756.00   stdDev=3809.2   maxVar=1.79%    confInt=0.43%   samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput      avg=2964.54     min=2933.90     max=3002.30     stdDev=25.2     maxVar=2.33%    confInt=0.57%   samples= 8
Intermediate results:
Run 0   220.7   2656.6  2940.8  2933.9  Avg=2934        CPU=103509 ms  Footprint=712676 KB
Run 1   228.4   2674.1  2920.4  2950.9  Avg=2951        CPU=105973 ms  Footprint=719892 KB
Run 2   223.8   2727.0  2960.8  2947.2  Avg=2947        CPU=103124 ms  Footprint=704384 KB
Run 3   215.6   2704.4  2978.5  2978.7  Avg=2979        CPU=103663 ms  Footprint=709576 KB
Run 4   235.9   2666.1  2967.8  3002.3  Avg=3002        CPU=103964 ms  Footprint=710316 KB
Run 5   218.4   2676.8  2964.8  2997.3  Avg=2997        CPU=101415 ms  Footprint=704660 KB
Run 6   176.1   2719.4  2953.1  2958.0  Avg=2958        CPU=103691 ms  Footprint=726336 KB
Run 7   214.4   2654.4  2957.3  2948.0  Avg=2948        CPU=106512 ms  Footprint=714952 KB
CompTime        avg=103981.38   min=101415.00   max=106512.00   stdDev=1608.1   maxVar=5.03%    confInt=1.04%   samples= 8
Footprint       avg=712849.00   min=704384.00   max=726336.00   stdDev=7481.4   maxVar=3.12%    confInt=0.70%   samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput      avg=2889.20     min=2842.80     max=2943.50     stdDev=37.0     maxVar=3.54%    confInt=0.86%   samples= 8
Intermediate results:
Run 0   181.7   2536.2  2979.7  2933.4  Avg=2933        CPU=123566 ms  Footprint=693680 KB
Run 1   185.0   2523.2  2887.1  2910.4  Avg=2910        CPU=128682 ms  Footprint=640464 KB
Run 2   178.7   2494.4  2877.1  2943.5  Avg=2944        CPU=128197 ms  Footprint=637768 KB
Run 3   175.7   2553.8  2879.9  2889.7  Avg=2890        CPU=129298 ms  Footprint=639216 KB
Run 4   157.6   2498.3  2862.5  2875.5  Avg=2876        CPU=123893 ms  Footprint=640976 KB
Run 5   187.6   2497.7  2868.2  2842.8  Avg=2843        CPU=125097 ms  Footprint=632568 KB
Run 6   178.9   2420.8  2828.6  2865.9  Avg=2866        CPU=127235 ms  Footprint=637264 KB
Run 7   173.4   2305.5  2862.9  2852.4  Avg=2852        CPU=119157 ms  Footprint=636492 KB
CompTime        avg=125640.62   min=119157.00   max=129298.00   stdDev=3410.0   maxVar=8.51%    confInt=1.82%   samples= 8
Footprint       avg=644803.50   min=632568.00   max=693680.00   stdDev=19923.9  maxVar=9.66%    confInt=2.07%   samples= 8

I have spent some time coming up with an implementation and here's the whole story:

First, recap on the original compressed shift design:

  • Users specify -XX:+PortableSharedCache during the cold run; if the compressed shift value is <= 3 then 3 will be used and persisted to the shared class cache, and if the compressed shift value is 4 then 4 will be used.
  • During warm runs, users can pick up the compressed shift value from the shared class cache and use that for all the AOT compilations (we didn't decide whether we want to do this by default for all AOT compilations or just for users who specified -XX:+PortableSharedCache during the warm run). A sketch of this flow follows the list.
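A rough sketch of that original flow, using invented names (the real persistence would go through the SCC descriptor, not a struct like this):

```c
#include <stdint.h>

/* Hypothetical record that the cold run would store in the shared class cache. */
typedef struct PortableCacheInfo {
    uint32_t compressedRefShift;   /* 3, or 4 for very large heaps */
} PortableCacheInfo;

/* Cold run with -XX:+PortableSharedCache: clamp the shift to 3 when possible
 * and persist the value the AOT code was compiled against. */
static void recordShiftOnColdRun(PortableCacheInfo *cache, uint32_t currentShift)
{
    cache->compressedRefShift = (currentShift <= 3) ? 3 : 4;
}

/* Warm run: compile AOT methods against the recorded shift rather than the
 * shift the current heap size would normally produce. */
static uint32_t shiftForAotCompilations(const PortableCacheInfo *cache)
{
    return cache->compressedRefShift;
}
```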

I proceeded to implement this, and have found some limitations with our existing infrastructure in the codebase:

  • It looks like the compressed shift value is calculated and set very early in initializeRunTimeObjectAlignmentAndCRShift(); in fact, earlier than the earliest point at which the VM is able to load the SCC. As a result, with the current infrastructure it may not be possible to pick up the CR shift value from the SCC and apply it to the current JVM.
  • Another minor issue is that -XX:+PortableSharedCache is yet to be parsed and processed when initializeRunTimeObjectAlignmentAndCRShift() is called, but this looks possible to work around without too much effort.

Due to these limitations, I'm proposing an alternative solution:

  • When users specify -XX:+PortableSharedCache, if the compressed shift value is <= 3 we will fix the compressed shift to 3, and if the compressed shift value is 4 we will generate a warning message to the user that the heap may be too large for the -XX:+PortableSharedCache option to work (see the sketch after this list).
  • We will never check the SCC for the compressed shift value and will rely only on the -XX:+PortableSharedCache option.
  • The only downside is that we no longer support the compressed shift 4 case. I don't think this is too big of a problem, and the throughput drop for compressed shift 4 may not even be acceptable for us anyway (just guessing here).
  • This solution may be much easier to implement than the original one and makes the original design look over-engineered.
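A minimal sketch of the proposed option handling (hypothetical function; the real check would sit in the GC/VM startup path):

```c
#include <stdbool.h>
#include <stdio.h>

/* Alternative proposal: never consult the SCC; decide purely from the
 * -XX:+PortableSharedCache option and the shift the GC already computed. */
static unsigned applyPortableSharedCacheShift(bool portableSharedCache, unsigned computedShift)
{
    if (!portableSharedCache) {
        return computedShift;   /* normal behaviour, no change */
    }
    if (computedShift <= 3) {
        return 3;               /* force the portable 3-bit shift */
    }
    /* Shift 4: warn that the heap is likely too large for portable AOT. */
    fprintf(stderr, "warning: heap may be too large for -XX:+PortableSharedCache to be effective\n");
    return computedShift;
}
```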

@vijaysun-omr @mpirvu @dsouzai Let me know if there are any concerns with the alternative solution.

@dmitripivkine Does the CR shift change the object alignment requirements? Does forcing a 3-bit shift mean small heaps will waste more space and therefore incur higher GC overhead?

if the compressed shift value is 4 then we will generate a warning message to the user that the heap may be too large for the portableSharedCache option to work.

I assume we will continue to generate AOT which is portable from the processor point of view.

@dmitripivkine Does the CR shift change the object alignment requirements? Does forcing a 3-bit shift mean small heaps will waste more space and therefore incur higher GC overhead?

Only for the 4-bit shift (which requires 16-byte alignment for objects). All other cases are covered by the minimum heap object alignment of 8 bytes.
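In other words (a minimal round-trip sketch, not OpenJ9 code), encoding divides the heap offset by 2^shift, so the offset must be a multiple of 2^shift for decoding to reproduce the original address; the default 8-byte object alignment already covers shifts 0-3, and only shift 4 forces 16-byte alignment:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t encodeRef(uintptr_t addr, uintptr_t heapBase, unsigned shift)
{
    /* Lossless only if the object offset is a multiple of (1 << shift),
     * i.e. the object is aligned to 2^shift bytes. */
    assert(((addr - heapBase) & (((uintptr_t)1 << shift) - 1)) == 0);
    return (uint32_t)((addr - heapBase) >> shift);
}

static uintptr_t decodeRef(uint32_t ref, uintptr_t heapBase, unsigned shift)
{
    return heapBase + ((uintptr_t)ref << shift);
}
```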

I assume we will continue to generate AOT which is portable from the processor point of view.

Yes, we will always use the portable processor feature set when -XX:+PortableSharedCache is specified.

@DanHeidinga @vijaysun-omr Moving the question here: https://github.com/eclipse/omr/pull/5436 forces any run in a container to use shift 3 (except when it needs to use shift 4). It obviously prevents using the most performant shift 0 for small container applications. Would you please confirm this is the desired behaviour?

I think in containers the plan is to sacrifice a bit of performance in exchange for maximum portability. I have run some experiments comparing shift0 and shift3 and didn't see a significant throughput drop. I'll leave the decision to Vijay @vijaysun-omr though.

Moving the question here: eclipse/omr#5436 forces any run in a container to use shift 3 (except when it needs to use shift 4). It obviously prevents using the most performant shift 0 for small container applications. Would you please confirm this is the desired behaviour?

My understanding is that this is the expected behaviour only when -XX:+PortableSharedCache is specified and in that case we accept the tradeoff for better portability.

My understanding is that this is the expected behaviour only when -XX:+PortableSharedCache is specified and in that case we accept the tradeoff for better portability.

Yes that is correct. But in containers the PortableSharedCache feature is enabled by default. In containers, the portable processor feature set will be used by default for AOT compilations unless disabled by -XX:-PortableSharedCache. The question here is whether we want to also have the shift set to 3 by default in containers.

The question here is whether we want to also have the shift set to 3 by default in containers.

I would say yes, but only if AOT is enabled.

The shift by 3 code is only generated for an AOT compilation in containers.

So, in the unaffected category are : 1) JIT compilations inside or outside containers and 2) AOT compilations outside containers.

This is a conscious choice being made in the same philosophy as the change to use the "portable processor feature set", i.e. we want to make the code portable at a small performance cost in containers (in the case of the compressed refs shift, the cost was negligible as @harryyu1994 measured). Since AOT compilations can be (and are) recompiled as JIT compilations if they are deemed important for peak throughput (this is neither the first nor the most significant way in which AOT compilations are worse than JIT compilations), the steady-state impact won't even be as much as measured in the AOT experiments with and without portability changes. During startup and rampup phases, when AOT code is used heavily, these minor performance differences are unlikely to be a big enough deal to compromise on portability in containers.

The shift by 3 code is only generated for an AOT compilation in containers.

So, in the unaffected category are : 1) JIT compilations inside or outside containers and 2) AOT compilations outside containers.

This is a conscious choice being made in the same philosophy as the change to use the "portable processor feature set", i.e. we want to make the code portable at a small performance cost in containers (in the case of the compressed refs shift, the cost was negligible as @harryyu1994 measured). Since AOT compilations can be (and are) recompiled as JIT compilations if they are deemed important for peak throughput (this is neither the first nor the most significant way in which AOT compilations are worse than JIT compilations), the steady-state impact won't even be as much as measured in the AOT experiments with and without portability changes. During startup and rampup phases, when AOT code is used heavily, these minor performance differences are unlikely to be a big enough deal to compromise on portability in containers.

Correct me if I'm wrong, but I thought the JIT compilations inside containers would also have to use shift3 if we made the AOT compilations shift by 3. So JIT compilations inside containers are affected (though we didn't see a throughput drop in my experiment when comparing AOT+JIT shift0 vs. AOT+JIT shift3).

The question here is whether we want to also have the shift set to 3 by default in containers.

I would say yes, but only if AOT is enabled.

It may not be possible to check whether AOT is enabled this early.

enum INIT_STAGE {
    PORT_LIBRARY_GUARANTEED,         /* 0 */
    ALL_DEFAULT_LIBRARIES_LOADED,    /* 1 */
    ALL_LIBRARIES_LOADED,            /* 2 */
    DLL_LOAD_TABLE_FINALIZED,        /* 3 - consume JIT-specific X options */
    VM_THREADING_INITIALIZED,        /* 4 */
    HEAP_STRUCTURES_INITIALIZED,     /* 5 */
    ALL_VM_ARGS_CONSUMED,            /* 6 */

The shift is set at ALL_LIBRARIES_LOADED, very early into the initialization.

Looking at the code https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L34-L66 I see that vm->sharedCacheAPI->sharedCacheEnabled is set very early and SCC options are also parsed very early. But then there is this piece of code in the same function: https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L304-L328
which seems to deal with the -Xshareclasses:none option. I wonder why this piece of code can't be grouped with the previous one. That way we would know very early whether the SCC is 'likely' going to be used. @hangshao0

@harryyu1994 Discussing with @mpirvu some more, I feel that we need more data points if we are going to slow down JITed code (in addition to AOTed code) inside containers. Could you please run SPECjbb2015 (please ask Piyush if you need help with accessing a setup for it) and maybe SPECjbb2005 (which is much easier to run) and check what the throughput overhead is?

Additionally, the overhead of the shift would be platform dependent and so if one wanted to take a design decision for all platforms, the effect of the shift ought to be measured on the other platforms first.

I would also add quarkus throughput experiments since quarkus is more likely to be run in containers.

But then there is this piece of code in the same function: https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L304-L328
which seems to deal with the -Xshareclasses:none option. I wonder why this piece of code can't be grouped with the previous one. That way we would know very early whether the SCC is 'likely' going to be used.

Looking at the code, what it does is unload the SCC dll if -Xshareclasses:none is present. I guess that is the reason why it is done in stage DLL_LOAD_TABLE_FINALIZED. Once the SCC dll is unloaded, all SCC-related functionality will be inactive.

unload the SCC dll if -Xshareclasses:none is present

This means that we load the SCC dll before checking the command line options. Be that as it may, we could add another check for -Xshareclasses:none when SCC options are parsed.

we could add another check for -Xshareclasses:none when SCC options are parsed.

Yes. It looks fine to me if another check for -Xshareclasses:none is added in the block L34 to L66.

Quarkus+CRUD on x86 loses 0.9% in throughput when we force shift3 instead of shift0 for compressedrefs.

Stats for rest-crud-quarkus-openj9:j11 with JAVA_OPTS=-Xms128m -Xmx128m -Xshareclasses:none -XXgc:forcedShiftingCompressionAmount=3
Throughput:     avg=12040.20    min=11931.00    max=12111.10    stdDev=57.7     maxVar=1.51%    confInt=0.28%   samples=10
Footprint:      avg=123.01      min=105.90      max=129.90      stdDev=6.7      maxVar=22.66%   confInt=3.17%   samples=10

Stats for rest-crud-quarkus-openj9:j11 with JAVA_OPTS=-Xms128m -Xmx128m -Xshareclasses:none -XXgc:forcedShiftingCompressionAmount=0
Throughput:     avg=12140.48    min=12065.70    max=12209.50    stdDev=48.3     maxVar=1.19%    confInt=0.23%   samples=10
Footprint:      avg=125.31      min=120.40      max=129.30      stdDev=2.8      maxVar=7.39%    confInt=1.28%   samples=10

SPECjbb2015GMR multi_2grp_gencon

  • -Xms2g -Xmx2g -Xmn1g -Xgcpolicy:gencon -Xlp -Xcompressedrefs

Shift3

Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9962189 | 12177 | 4435 | 13838 | 13099
9962190 | 13688 | 5824 | 19279 | 16099
9962191 | 14006 | 5749 | 16099 | 13449
9962192 | 11277 | 4417 | 13587 | 13415
means | 12787 | 5106.25 | 15700.75 | 14015.5
medians | 12932.5 | 5092 | 14968.5 | 13432
confidence_interval | 0.15982393515568 | 0.24493717290884 | 0.26746374600271 | 0.1586869314041
min | 11277 | 4417 | 13587 | 13099
max | 14006 | 5824 | 19279 | 16099
stddev | 1284.5183273637 | 786.11592656554 | 2639.4603457273 | 1397.9111798203

Shift0

Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9962180 | 12718 | 4750 | 16099 | 14334
9962181 | 14972 | 5971 | 16099 | 13449
9962182 | 13040 | 4531 | 16099 | 13795
9962183 | 14167 | 5686 | 16099 | 13449
means | 13724.25 | 5234.5 | 16099 | 13756.75
medians | 13603.5 | 5218 | 16099 | 13622
confidence_interval | 0.12035581689665 | 0.21319070129398 | 0 | 0.048339382455907
min | 12718 | 4531 | 16099 | 13449
max | 14972 | 5971 | 16099 | 14334
stddev | 1038.2107605555 | 701.41214702912 | | 417.97158994362

Added more runs

Shift 3

Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994492 | 13253 | 5439 | 16566 | 15678
9994493 | 13445 | 5534 | 14939 | 14322
9994494 | | | |
9994495 | 15133 | 6707 | 16099 | 13449
means | 13943.666666667 | 5893.3333333333 | 15868 | 14483
medians | 13445 | 5534 | 16099 | 14322
confidence_interval | 0.15737812859215 | 0.2542198958441 | 0.11199389127022 | 0.16451397337762
min | 13253 | 5439 | 14939 | 13449
max | 15133 | 6707 | 16566 | 15678
stddev | 1034.4570234347 | 706.25514747387 | 837.73683218538 | 1123.1878738662

Shift 0

Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994483 | 12508 | 5022 | 13449 | 13024
9994484 | 13146 | 5101 | 14939 | 14322
9994485 | 16099 | 6128 | 16099 | 13449
9994486 | 13362 | 5131 | 16099 | 13795
means | 13778.75 | 5345.5 | 15146.5 | 13647.5
medians | 13254 | 5116 | 15519 | 13622
confidence_interval | 0.18344971565841 | 0.1558672596321 | 0.13202131493524 | 0.064024675274026
min | 12508 | 5022 | 13449 | 13024
max | 16099 | 6128 | 16099 | 14322
stddev | 1588.754097818 | 523.68852065581 | 1256.8578545988 | 549.19972080595

Don't think this is a good benchmark for this as the fluctuations are too large.

That's a 6.8% drop in max_jOPS and a 2.5% drop in critical_jOPS.
Fluctuations are huge though (min-max ~20%), so I am not sure we can draw any conclusions with such a small dataset.

My DT7 experiments with AOT enabled show a 2.1% regression when moving from shift 0 to shift 3

Results for JDK=/home/mpirvu/sdks/OpenJDK11U-jre_x64_linux_openj9_2020-05-21-10-15 jvmOpts=-Xms1024m -Xmx1024m -XXgc:forcedShiftingCompressionAmount=0
Throughput      avg=3530.25     min=3434.60     max=3587.30     stdDev=44.9     maxVar=4.45%    confInt=0.74%   samples=10
CompTime        avg=137425.30   min=128296.00   max=179999.00   stdDev=15107.7  maxVar=40.30%   confInt=6.37%   samples=10
Footprint       avg=932900.80   min=912844.00   max=948924.00   stdDev=9184.8   maxVar=3.95%    confInt=0.57%   samples=10

Results for JDK=/home/mpirvu/sdks/OpenJDK11U-jre_x64_linux_openj9_2020-05-21-10-15 jvmOpts=-Xms1024m -Xmx1024m -XXgc:forcedShiftingCompressionAmount=3
Throughput      avg=3455.40     min=3392.70     max=3521.00     stdDev=36.2     maxVar=3.78%    confInt=0.61%   samples=10
CompTime        avg=139633.70   min=132116.00   max=182410.00   stdDev=15162.1  maxVar=38.07%   confInt=6.29%   samples=10
Footprint       avg=930844.00   min=922164.00   max=945488.00   stdDev=7221.0   maxVar=2.53%    confInt=0.45%   samples=10

That's a 6.8% drop in max_jOPS and a 2.5% drop in critical_jOPS.
Fluctuations are huge though (min-max ~20%), so I am not sure we can draw any conclusions with such a small dataset.

Not sure why the fluctuations are so large. Originally the heap size was set to 24GB; I had to change it to 2GB to be able to use shift0. Maybe the test does not work well with a smaller heap.

ILOG_WODM 851-4way-Seg5FastpathRVEJB (on Power)

Shift 0

Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
-- | -- | -- | -- | -- | -- | --
9994528 | 8638.9371504854 | 1.8573037069837 | 504.7412381469 | 582.0975 | 13.859526383526 | 14.445926640927
9994529 | 8762.1322573549 | 1.8264175610697 | 534.71616320959 | 560.30609923475 | 13.895542351454 | 14.478946902655
9994530 | 8740.2146440973 | 1.835861262807 | 506.15 | 586.72603318492 | 13.904566037736 | 14.485850314465
9994531 | 8664.0184392624 | 1.8513182105413 | 510.4425 | 579.65 | 14.166381979695 | 14.671502538071
9994532 | 8577.5554538932 | 1.8694810666913 | 509.735 | 571.67 | 13.984703208556 | 14.73531684492
9994533 | 8691.1423921766 | 1.8423403917498 | 525.7875 | 567.96858007855 | 14.010517902813 | 14.684122762148
9994534 | 8502.5874741253 | 1.8831568605721 | 494.7275 | 555.5325 | 14.109282694848 | 14.744678996037
9994535 | 8658.9934676143 | 1.8495857046636 | 515.9275 | 564.43 | 13.919508322663 | 14.490800256082
means | 8654.4476598762 | 1.8519330956348 | 512.77842516956 | 571.04758906228 | 13.981253610162 | 14.592143156913
medians | 8661.5059534384 | 1.8504519576024 | 510.08875 | 569.81929003927 | 13.95210576561 | 14.581151397076
confidence_interval | 0.0081348395932811 | 0.0082178422245874 | 0.02052919348065 | 0.016151072515423 | 0.0065275035989177 | 0.0073217207304309
min | 8502.5874741253 | 1.8264175610697 | 494.7275 | 555.5325 | 13.859526383526 | 14.445926640927
max | 8762.1322573549 | 1.8831568605721 | 534.71616320959 | 586.72603318492 | 14.166381979695 | 14.744678996037
stddev | 84.198081874972 | 0.018201070854603 | 12.589702870929 | 11.030304909653 | 0.10914581344745 | 0.1277750589018

Shift 3

Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
-- | -- | -- | -- | -- | -- | --
9994541 | 8119.8086062203 | 1.9734693630701 | 482.52 | 533.94 | 13.927745554036 | 14.523974008208
9994542 | 8286.7210135417 | 1.9335406181712 | 484.4725 | 541.3 | 14.028153225806 | 14.621639784946
9994543 | 8238.9619674721 | 1.9421575015578 | 502.385 | 524.27 | 14.281521505376 | 14.873037634409
9994544 | 8388.4166737499 | 1.9096835513286 | 500.78124804688 | 547.5725 | 13.915002663116 | 14.43093608522
9994545 | 8408.0668386632 | 1.9034667560754 | 515.315 | 539.43 | 13.90702393617 | 14.452569148936
9994546 | 8298.7935361939 | 1.9281740577079 | 509.36622658443 | 526.865 | 13.899852393617 | 14.452348404255
9994547 | 8441.6219797253 | 1.8954390538826 | 523.06 | 533.69 | 14.077266311585 | 14.597009320905
9994548 | 8412.5731827431 | 1.9026361479098 | 513.48871627821 | 540.7625 | 13.878481333333 | 14.53224
means | 8324.3704747887 | 1.9235708812129 | 503.92358636369 | 535.97875 | 13.98938086538 | 14.56046929836
medians | 8343.6051049719 | 1.9189288045183 | 505.87561329222 | 536.685 | 13.921374108576 | 14.528107004104
confidence_interval | 0.010993300632059 | 0.011362164570798 | 0.024010815613053 | 0.012185469952163 | 0.0081766615307008 | 0.0082655166059187
min | 8119.8086062203 | 1.8954390538826 | 482.52 | 524.27 | 13.878481333333 | 14.43093608522
max | 8441.6219797253 | 1.9734693630701 | 523.06 | 547.5725 | 14.281521505376 | 14.873037634409
stddev | 109.44435177091 | 0.026138647857235 | 14.470563630049 | 7.8109472171159 | 0.13680071373816 | 0.14393261774689

Seeing a 4% drop in throughput on Power.

@andrewcraik @zl-wang see above overhead(s)

I have also updated the original post for SPECjbb2015GMR. I don't think we can draw any conclusions from that particular benchmark as we always seem to have large fluctuations (multiple attempts and not a small dataset considering each run takes over 3 hours).

Most of the variability typically comes from the JIT, but this heap size is also fairly small for jbb2015 and may contribute as well (by causing large variations in the number of global GCs, which are relatively expensive).

If these tests are to be repeated, try a heap as big as possible while still being able to run shift0, with 2GB given to Tenure and the rest to the Nursery (for example -Xmx3200M -Xms3200M -Xmn1200M).

Maybe due to the shift-3 case's bigger measurement variability, the overhead looked about twice as large as expected. We had prior experience with this overhead ... about 2-2.5%. It might be worth another, more stable measurement of the shift-3 case.

Most of the variability typically comes from the JIT, but this heap size is also fairly small for jbb2015 and may contribute as well (by causing large variations in the number of global GCs, which are relatively expensive).

If these tests are to be repeated, try a heap as big as possible while still being able to run shift0, with 2GB given to Tenure and the rest to the Nursery (for example -Xmx3200M -Xms3200M -Xmn1200M).

Tried with -Xmx3200M -Xms3200M -Xmn1200M on x86.
The shift0 runs were pretty stable, the shift3 runs were not.
3.2% drop in max_jOPS and 2% drop in critical_jOPS.
Going to give this another try.

Shift 3

Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994645 | 18938 | 11502 | 23095 | 22360
9994646 | 18938 | 11227 | 23095 | 21847
9994647 | 19169 | 10885 | 23095 | 21419
9994648 | 21247 | 11684 | 23095 | 19279
means | 19573 | 11324.5 | 23095 | 21226.25
medians | 19053.5 | 11364.5 | 23095 | 21633
confidence_interval | 0.09114537985632 | 0.048897960523052 | 0 | 0.1014854988189
min | 18938 | 10885 | 23095 | 19279
max | 21247 | 11684 | 23095 | 22360
stddev | 1121.3001382324 | 348.04836828617 | | 1353.9639027685

Shift 0

Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994636 | 20555 | 12086 | 23095 | 22548
9994637 | 20093 | 11388 | 23095 | 22360
9994638 | 20202 | 11982 | 27674 | 23095
9994639 | 20093 | 11496 | 23095 | 22706
means | 20235.75 | 11738 | 24239.75 | 22677.25
medians | 20147.5 | 11739 | 23095 | 22627
confidence_interval | 0.017214402858354 | 0.047064354097083 | 0.15027360018152 | 0.021914250955476
min | 20093 | 11388 | 23095 | 22360
max | 20555 | 12086 | 27674 | 23095
stddev | 218.94805319984 | 347.22903104435 | 2289.5 | 312.35383248276

Do you use -XXgc:forcedShiftingCompressionAmount=3 to force shift 3?

Do you use -XXgc:forcedShiftingCompressionAmount=3 to force shift 3?

Yes

With a larger dataset, I'm measuring a 3.5% throughput drop on Power.

First 8 runs: 8288.3387340932 (shift 3) vs. 8581.7531089419 (shift 0)
Next 8 runs: 8316.9258529547 (shift 3) vs. 8627.474541907 (shift 0)
Both have a 3.5% throughput drop.

Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
-- | -- | -- | -- | -- | -- | --
9994855 | 8227.0338779673 | 1.9450356503132 | 503.18874202814 | 526.035 | 13.904885598923 | 14.4253243607
9994856 | 8087.2274532643 | 1.9783969421294 | 494.9975 | 517.92093638361 | 13.935554945055 | 14.535127747253
9994857 | 8317.3519463409 | 1.9245225413624 | 500.05874985313 | 537.415 | 14.015772543742 | 14.649139973082
9994858 | 8286.6920988556 | 1.9310648048965 | 504.66 | 527.8886802783 | 14.190730201342 | 14.867684563758
9994859 | 8207.8741850326 | 1.9494444556917 | 506.135 | 519.855 | 13.985191117093 | 14.581004037685
9994860 | 8392.4408795395 | 1.907375533393 | 503.745 | 545.16 | 13.938728 | 14.444909333333
9994861 | 8334.6407679616 | 1.920294696344 | 499.66 | 533.34116664708 | 14.107698795181 | 14.611708165997
9994862 | 8453.4486637834 | 1.8928913679089 | 509.36872657818 | 532.835 | 14.051724842767 | 14.721377358491
means | 8288.3387340932 | 1.9311282490049 | 502.72671480743 | 530.05634791362 | 14.016285755513 | 14.604534442537
medians | 8302.0220225982 | 1.9277936731294 | 503.46687101407 | 530.36184013915 | 14.000481830417 | 14.596356101841
confidence_interval | 0.011550726630632 | 0.011514626729118 | 0.0073576725719721 | 0.014272083518414 | 0.0057905694289637 | 0.0083223576011986
min | 8087.2274532643 | 1.8928913679089 | 494.9975 | 517.92093638361 | 13.904885598923 | 14.4253243607
max | 8453.4486637834 | 1.9783969421294 | 509.36872657818 | 545.16 | 14.190730201342 | 14.867684563758
stddev | 114.49608734331 | 0.026593458983634 | 4.4237061399032 | 9.0473890683649 | 0.097066208198194 | 0.14536101225863

Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
-- | -- | -- | -- | -- | -- | --
9994868 | 8429.0328467609 | 1.9004089367754 | 506.765 | 552.32 | 14.128936925099 | 14.630515111695
9994869 | 8687.8546821798 | 1.8426826705212 | 522.0325 | 564.32608918478 | 13.933602791878 | 14.440649746193
9994870 | 8468.1746712228 | 1.8895803645447 | 514.3225 | 542.93614265964 | 14.165972440945 | 14.769738845144
9994871 | 8627.8107263089 | 1.8540877613996 | 525.47618630953 | 544.83613790966 | 13.899932484076 | 14.414615286624
9994872 | 8670.780576214 | 1.8459311333678 | 526.85723142769 | 555.88111029722 | 14.051856780735 | 14.553642585551
9994873 | 8555.9774620284 | 1.8699833702903 | 527.8925 | 541.9025 | 13.964604139715 | 14.498144890039
9994874 | 8579.5265630195 | 1.8652614005592 | 516.575 | 546.735 | 13.892619607843 | 14.611988235294
9994875 | 8634.8673438012 | 1.854260194939 | 509.6125 | 568.4310789223 | 13.914638569604 | 14.44150063857
means | 8581.7531089419 | 1.8652744790497 | 518.69167721715 | 552.1710073717 | 13.994020467487 | 14.545099417389
medians | 8603.6686446642 | 1.8597607977491 | 519.30375 | 549.5275 | 13.949103465797 | 14.525893737795
confidence_interval | 0.0090968857000075 | 0.0092466689375262 | 0.013021956977964 | 0.015143005290046 | 0.0064300747604867 | 0.0069793709871253
min | 8429.0328467609 | 1.8426826705212 | 506.765 | 541.9025 | 13.892619607843 | 14.414615286624
max | 8687.8546821798 | 1.9004089367754 | 527.8925 | 568.4310789223 | 14.165972440945 | 14.769738845144
stddev | 93.364677712505 | 0.02062727721853 | 8.0779169549425 | 9.9999889949771 | 0.107614892342 | 0.12140786619901

Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
-- | -- | -- | -- | -- | -- | --
9994930 | 8177.5107248928 | 1.9590636485304 | 487.47 | 541.2825 | 14.013406035665 | 14.80304526749
9994931 | 8432.8923393013 | 1.8986411307841 | 507.995 | 551.18 | 13.983329333333 | 14.492921333333
9994932 | 8228.0696491294 | 1.9454009387451 | 489.7725 | 529.6475 | 13.910288227334 | 14.588560216509
9994933 | 8403.3447909506 | 1.9049858210628 | 508.8425 | 543.6725 | 13.894525827815 | 14.501691390728
9994934 | 8385.7767133493 | 1.9085944995934 | 508.14622963443 | 541.58364604088 | 13.929327516778 | 14.648150335571
9994935 | 8330.2474844166 | 1.9258762781886 | 486.82378294054 | 558.22860442849 | 14.191576974565 | 14.858611780455
9994936 | 8267.1334929976 | 1.9375925939066 | 488.9325 | 546.98 | 14.161099319728 | 14.669331972789
9994937 | 8310.4316285999 | 1.9272640311479 | 494.9175 | 544.84113789716 | 14.038435549525 | 14.781244233378
means | 8316.9258529547 | 1.9259273677449 | 496.61250157187 | 544.67698604582 | 14.015248598093 | 14.667944566282
medians | 8320.3395565083 | 1.9265701546682 | 492.345 | 544.25681894858 | 13.998367684499 | 14.65874115418
confidence_interval | 0.0089671497091839 | 0.0091344216653754 | 0.016839629548937 | 0.012702234285658 | 0.0066472032951229 | 0.0078402754117856
min | 8177.5107248928 | 1.8986411307841 | 486.82378294054 | 529.6475 | 13.894525827815 | 14.492921333333
max | 8432.8923393013 | 1.9590636485304 | 508.8425 | 558.22860442849 | 14.191576974565 | 14.858611780455
stddev | 89.19306714941 | 0.021039470646772 | 10.001474451658 | 8.2743329580123 | 0.11141755278116 | 0.13753537856553

Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
-- | -- | -- | -- | -- | -- | --
9994917 | 8635.7070897714 | 1.853296592018 | 524.31 | 556.5375 | 14.183319371728 | 14.806257853403
9994918 | 8621.7088487345 | 1.8558855922125 | 531.7725 | 545.95 | 13.98879791395 | 14.501852672751
9994919 | 8533.2685029174 | 1.8753910779895 | 522.7925 | 546.08863477841 | 13.958782664942 | 14.470852522639
9994920 | 8574.0599079032 | 1.8675608874445 | 513.61871595321 | 564.5675 | 13.962389175258 | 14.472323453608
9994921 | 8510.2283664363 | 1.8815516956704 | 505.67 | 550.7275 | 14.622058124174 | 15.321787318362
9994922 | 8703.9959200816 | 1.8385274078447 | 526.2775 | 557.65 | 13.938370656371 | 14.615827541828
9994923 | 8781.1981281123 | 1.8221623975359 | 535.02 | 559.17860205349 | 14.360284634761 | 14.88064231738
9994924 | 8659.6295712997 | 1.8498789700963 | 506.085 | 569.3425 | 14.483114068441 | 14.988921419518
means | 8627.474541907 | 1.8555318276015 | 520.69327699415 | 556.25527960399 | 14.187139576203 | 14.757308137436
medians | 8628.707969253 | 1.8545910921153 | 523.55125 | 557.09375 | 14.086058642839 | 14.711042697615
confidence_interval | 0.008675996428163 | 0.0087775211883997 | 0.017859539791627 | 0.012590065519452 | 0.015918315024142 | 0.017111525799816
min | 8510.2283664363 | 1.8221623975359 | 505.67 | 545.95 | 13.938370656371 | 14.470852522639
max | 8781.1981281123 | 1.8815516956704 | 535.02 | 569.3425 | 14.622058124174 | 15.321787318362
stddev | 89.519345731438 | 0.019478438704893 | 11.121569557209 | 8.37560108854 | 0.27008830852039 | 0.30200193835895

SPECjbb2015 on x86. No throughput drop observed this time. (I grabbed the build from a different location this time, non-source code version of that build)

-Xmx3200M -Xms3200M -Xmn1200M

Shift0

Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994899 | 20202 | 11928 | 27674 | 23095
9994900 | 19649 | 11962 | 23392 | 22775
9994901 | 20324 | 11498 | 23095 | 21847
9994902 | 20479 | 11678 | 27674 | 23095
means | 20163.5 | 11766.5 | 25458.75 | 22703
medians | 20263 | 11803 | 25533 | 22935
confidence_interval | 0.028503988445181 | 0.029647326634529 | 0.16003411426044 | 0.041365280701093
min | 19649 | 11498 | 23095 | 21847
max | 20479 | 11962 | 27674 | 23095
stddev | 361.24460780289 | 219.26163975184 | 2560.8224427581 | 590.26773586229

Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9995099 | 20913 | 11952 | 24604 | 23716
9995100 | 19631 | 11039 | 23095 | 21062
9995101 | | | |
9995102 | | | |
means | 20272 | 11495.5 | 23849.5 | 22389
medians | 20272 | 11495.5 | 23849.5 | 22389
confidence_interval | 0.14229072923525 | 0.17870145527142 | 0.14236234682501 | 0.26671743115415
min | 19631 | 11039 | 23095 | 21062
max | 20913 | 11952 | 24604 | 23716
stddev | 906.51089348115 | 645.58849122332 | 1067.0241328105 | 1876.6613972691

Shift3

Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994908 | 20785 | 12163 | 23095 | 20517
9994909 | 20093 | 11586 | 23095 | 22360
9994910 | 20785 | 11861 | 23095 | 21847
9994911 | 20324 | 12112 | 23095 | 20517
means | 20496.75 | 11930.5 | 23095 | 21310.25
medians | 20554.5 | 11986.5 | 23095 | 21182
confidence_interval | 0.026852923897151 | 0.035325331114737 | 0 | 0.070149805133889
min | 20093 | 11586 | 23095 | 20517
max | 20785 | 12163 | 23095 | 22360
stddev | 345.94448013133 | 264.89557691035 | | 939.60395025422

Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9995108 | 20755 | 12015 | 27674 | 23095
9995109 | 21309 | 12721 | 27674 | 23095
9995110 | 21016 | 12134 | 23095 | 21847
9995111 | 20755 | 12249 | 27674 | 23095
means | 20958.75 | 12279.75 | 26529.25 | 22783
medians | 20885.5 | 12191.5 | 27674 | 23095
confidence_interval | 0.020035367673363 | 0.040072637108334 | 0.13730484276789 | 0.043575648509854
min | 20755 | 12015 | 23095 | 21847
max | 21309 | 12721 | 27674 | 23095
stddev | 263.93228298183 | 309.29099027723 | 2289.5 | 624

Judging from the various experiments we tried, the overhead of shift3 on x86 isn't very significant.

SPECjbb2015 on zLinux

  • -Xmx3200m -Xms3200m -Xmn1200m

Shift0
RUN RESULT: hbIR (max attempted) = 13837, hbIR (settled) = 11548, max-jOPS = 10516, critical-jOPS = 2144
RUN RESULT: hbIR (max attempted) = 12302, hbIR (settled) = 11859, max-jOPS = 10088, critical-jOPS = 2224
RUN RESULT: hbIR (max attempted) = 13837, hbIR (settled) = 11548, max-jOPS = 10793, critical-jOPS = 2135
RUN RESULT: hbIR (max attempted) = 13837, hbIR (settled) = 11548, max-jOPS = 10931, critical-jOPS = 2154

max-jOPS = 10582
critical-jOPS = 2164.25

Shift3
RUN RESULT: hbIR (max attempted) = 12302, hbIR (settled) = 11859, max-jOPS = 10703, critical-jOPS = 2161
RUN RESULT: hbIR (max attempted) = 11859, hbIR (settled) = 11489, max-jOPS = 10673, critical-jOPS = 2080
RUN RESULT: hbIR (max attempted) = 11548, hbIR (settled) = 11430, max-jOPS = 10278, critical-jOPS = 2129
RUN RESULT: hbIR (max attempted) = 11548, hbIR (settled) = 10959, max-jOPS = 10278, critical-jOPS = 2071

max-jOPS = 10483
critical-jOPS = 2110.25
  • 1% drop in max-jOPS
  • 2.5% drop in critical-jOPS

SPECjbb2005 on Power

-Xmx3200m -Xms3200m -Xmn2600m -Xjit:scratchSpaceLimit=2048000,acceptHugeMethods -Xgcpolicy:gencon -Xcompressedrefs -XXgc:forcedShiftingCompressionAmount=0

Shift 0

SPECjbb2005 bops = 72214
SPECjbb2005 bops = 72204
SPECjbb2005 bops = 72304
SPECjbb2005 bops = 72690
SPECjbb2005 bops = 71765
Average bops = 72235.4

Shift 3

SPECjbb2005 bops = 69579
SPECjbb2005 bops = 69860
SPECjbb2005 bops = 70412
SPECjbb2005 bops = 70350
SPECjbb2005 bops = 69376
Average bops = 69915.4

3.3% throughput drop

Summary

  • 1% throughput drop on x86
  • 1-2% throughput drop on Z
  • 3% throughput drop on Power

FYI @vijaysun-omr @zl-wang @mpirvu

@harryyu1994 could you please make a summary of all the experiments that were tried? It seems that only Power sees more than 2% regression from the move to shift3

It is within expectation. I remember the overhead was about 2.5% when we did the experiments previously.

Shift 0 vs. Shift 3 Summary

X86

Daytrader7

  • 1% throughput drop
  • Shift 0: Throughput = 2850.45
  • Shift 3: Throughput = 2819.82

Marius' Daytrader7 Experiment

  • 2% throughput drop
  • Shift 0: Throughput = 3530.25
  • Shift 3: Throughput = 3455.40

AcmeAir in Docker

  • no throughput drop observed
  • Shift 0: Throughput = 5231.12
  • Shift 3: Throughput = 5225.64

Quarkus+CRUD

  • 0.9% throughput drop
  • Shift 0: Throughput = 12140.48
  • Shift 3: Throughput = 12040.20

Specjbb2015

  • no throughput drop observed
  • Shift 0: max_jOPS = 20163.5, critical_jOPS = 11766.5
  • Shift 3: max_jOPS = 20496.75, critical_jOPS = 11930.5
  • It looks like we have a throughput gain for shift 3, but that is just fluctuation

Z

SPECjbb2015

  • 1% max-jOPS drop
  • 2.5% critical-jOPS drop
  • Shift 0: max-jOPS = 10582, critical-jOPS = 2164.25
  • Shift 3: max-jOPS = 10483, critical-jOPS = 2110.25

Power

ILOG

  • 3.5% throughput drop
  • Shift 0: Global Throughput = 8604.6
  • Shift 3: Global Throughput = 8302.6

SPECjbb2005

  • 3.3% throughput drop
  • Shift 0: Average bops = 72235.4
  • Shift 3: Average bops = 69915.4

@vijaysun-omr I have all the results listed here, will be waiting for your final call on this.

@zl-wang I am worried by the high throughput loss on Power still (in excess of 3%). I don't know if you can afford to slow down everything 3+% inside OpenShift on Power. While I agree we used to have an overhead of approximately 2-3% on all platforms previously due to the shift, we now find that the overhead on other platforms (X86 has more data shown than Z) is lower. Can you please try the same on your Open Liberty setup ?

@vijaysun-omr i will give DT7/OpenLiberty a spin next as I talked to @harryyu1994

shift0 average throughput: 2798/s
shift3 average throughput: 2757/s
The gap is about 1.5%.

However, the up-and-down within the same run could be as big as 3-4%. I haven't investigated why it is not as stable as my older driver: this one is the July 29 build from the Adopt site, as Harry suggested using a recent build.

Can we try to get a Daytrader7 run done on Z as well so that we have more than just that one data point?

@zl-wang A fluctuation of 3-4% is high enough that we don't know whether the overhead is in the 3% range on Power in this case as well. Ideally we should try to understand what is different on Power before going ahead, but I am okay with delivering the change to make things portable wrt compressed refs with the general approach taken in this design first, and then working out how to make the situation better on Power as a continuing effort past that initial delivery.
