There are two approaches that were brought up in the Portable SCC discussion regarding how to deal with the potential for the compressed refs shift changing with the heap size.
fyi @dmitripivkine @amicic
Can one of you comment on what would be affected in the GC (and rest of the runtime) by forcing the CR shift to always be 3? How would this affect the heap size? Occupancy? More wasted space? etc
also fyi @fjeremic @andrewcraik @gita-omr @knn-k to give a codegen perspective.
For the most part GC would not be affected (roots and meta-structures like the remembered set do not use CR). The biggest performance impact, I guess, would come from jitted code, which is more for the JIT folks to comment on, e.g. how an unnecessary shift for a <4GB heap would affect performance.
I don't see obvious footprint implications.
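For readers following along, here is a rough sketch of where the JIT cost comes from; this is illustrative only and not OpenJ9's actual implementation (heapBase and the helper names are made up). With a non-zero shift, every compressed reference load/store in jitted code pays for an extra shift instruction when converting between the 32-bit compressed form and the full 64-bit pointer:

```c
#include <stdint.h>

/* Illustrative sketch only (not OpenJ9's actual code): a compressed reference
 * is a 32-bit offset from the heap base, optionally scaled by the shift. */
static uintptr_t heapBase;  /* assumed heap base address */

/* Shift 0: decompression is just an add (or nothing if the heap base is 0). */
static inline void *decompressShift0(uint32_t compressedRef)
{
    return (void *)(heapBase + (uintptr_t)compressedRef);
}

/* Shift 3: every reference load/store in jitted code pays for one extra
 * shift instruction; this is the overhead being discussed here. */
static inline void *decompressShift3(uint32_t compressedRef)
{
    return (void *)(heapBase + ((uintptr_t)compressedRef << 3));
}

/* Compression goes the other way: subtract the base and shift right. */
static inline uint32_t compressShift3(void *obj)
{
    return (uint32_t)(((uintptr_t)obj - heapBase) >> 3);
}
```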
Don't remember if this was discussed already, but what I struggle with more is jitted code (non)portability in a unified CR/non-CR VM.
jitted code (non)portability in a unified CR/non-CR VM.
This was brought up in the discussion; running in CR / non-CR results in a different SCC. I suppose theoretically the same SCC could be used to store both CR and non-CR versions of a compiled method, but that's better discussed in another issue (which I can open if we feel it's a discussion worth having).
Don't remember if this was discussed already, but what I struggle with more is jitted code (non)portability in a unified CR/non-CR VM.
The CR-ness is encoded in the cache name today so that they are forced to be separate caches. I expect the initial unified CR/nonCR VM will still use separate caches for the initial approach.
fyi @dmitripivkine @amicic
Can one of you comment on what would be affected in the GC (and rest of the runtime) by forcing the CR shift to always be 3? How would this affect the heap size? Occupancy? More wasted space? etc
From the codegen perspective, proposed solution 1 is going to be less performant and harder to implement correctly; solution 2 will give the most performance but the least flexibility.
@fjeremic @andrewcraik Do you guys have any issues/concerns with moving forward with solution 2 here (fixing the shift value to 3 for portable AOT)?
Summary of the solution:
When -XX:+PortableSharedCache is specified, the compressedrefs shift will be fixed to 3 (3 for heaps <= 32GB, 4 for heaps > 32GB, and nocompressedrefs above 64GB; I could be wrong on these numbers). FYI @mpirvu @vijaysun-omr @ymanton @dsouzai
the compressedref shift will be fixed to 3 (3 for <= 32gb and 4 for > 32gb).
Worth noting that there is a point when even shift 4 won't work, e.g. if the heap is so big that we have to run without compressedrefs; I don't know if the JVM will still explicitly require the user to pass in -Xnocompressedrefs once we have one build that can do both.
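As a rough sanity check of the heap-size boundaries mentioned above: a 32-bit compressed reference scaled by 2^shift can address about 4GB << shift, so shift 3 covers heaps up to roughly 32GB, shift 4 up to roughly 64GB, and anything beyond that needs -Xnocompressedrefs. A minimal sketch of that arithmetic (the exact boundaries in the JVM may differ slightly because of heap base placement):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Approximate maximum heap addressable by a 32-bit compressed reference
     * for each shift value: 4GB scaled by 2^shift. */
    for (uint32_t shift = 0; shift <= 4; shift++) {
        unsigned long long maxHeapGB = 4ULL << shift;
        printf("shift %u -> up to ~%lluGB\n", shift, maxHeapGB);
    }
    /* Prints: shift 0 -> ~4GB, ..., shift 3 -> ~32GB, shift 4 -> ~64GB.
     * Heaps larger than ~64GB require -Xnocompressedrefs. */
    return 0;
}
```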
If we can't use compressedrefs, what should the portable AOT story be? Should we generate a SCC with the default CPU features & nocompressedrefs (thereby mandating that all future layers must have compressedrefs disabled)?
If we can't use compressedrefs, what should the portable AOT story be? Should we generate a SCC with the default CPU features & nocompressedrefs (thereby mandating that all future layers must have compressedrefs disabled)?
It sounds reasonable to me as we are giving the same treatment to shift 3/shift 4/nocompressedrefs. This shouldn't be that bad if most of the use cases fall under shift 3.
Note compressedrefs and non-compressedrefs don't share any cache; there are different cache files for these atm.
Is it possible to estimate the impact of applying a constant shift by 3 in compiled code? In general I am supportive of this proposed direction that you are taking, but I wanted to get a sense for how much slower we would be making portable AOT code because of the compressed refs shift. My expectation/hope would be that the overhead isn't much more than 5%, but this is what I'm asking to be measured.
Is it possible to estimate the impact of applying a constant shift by 3 in compiled code? In general I am supportive of this proposed direction that you are taking, but I wanted to get a sense for how much slower we would be making portable AOT code because of the compressed refs shift. My expectation/hope would be that the overhead isn't much more than 5%, but this is what I'm asking to be measured.
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=2850.45 min=2759.30 max=2900.90 stdDev=45.5 maxVar=5.13% confInt=1.07% samples= 8
Intermediate results:
Run 0 187.3 2575.9 2890.0 2861.1 Avg=2861 CPU=113492 ms Footprint=712364 KB
Run 1 200.6 2573.6 2887.4 2867.5 Avg=2868 CPU=108388 ms Footprint=700740 KB
Run 2 221.0 2579.6 2901.8 2759.3 Avg=2759 CPU=112528 ms Footprint=699352 KB
Run 3 222.1 2628.1 2830.4 2892.9 Avg=2893 CPU=107786 ms Footprint=706204 KB
Run 4 180.4 2628.9 2903.8 2830.5 Avg=2830 CPU=108510 ms Footprint=706704 KB
Run 5 226.5 2598.6 2705.5 2867.6 Avg=2868 CPU=112368 ms Footprint=713240 KB
Run 6 221.2 2647.2 2837.1 2900.9 Avg=2901 CPU=110313 ms Footprint=698736 KB
Run 7 231.8 2619.8 2928.2 2823.8 Avg=2824 CPU=110404 ms Footprint=707608 KB
CompTime avg=110473.62 min=107786.00 max=113492.00 stdDev=2150.7 maxVar=5.29% confInt=1.30% samples= 8
Footprint avg=705618.50 min=698736.00 max=713240.00 stdDev=5599.8 maxVar=2.08% confInt=0.53% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=2819.82 min=2776.80 max=2861.30 stdDev=29.0 maxVar=3.04% confInt=0.69% samples= 8
Intermediate results:
Run 0 163.2 2442.4 2795.4 2776.8 Avg=2777 CPU=137886 ms Footprint=651608 KB
Run 1 144.9 2350.9 2826.5 2847.1 Avg=2847 CPU=137913 ms Footprint=636708 KB
Run 2 152.9 2429.0 2857.6 2826.9 Avg=2827 CPU=131592 ms Footprint=637768 KB
Run 3 174.9 2363.9 2790.5 2832.9 Avg=2833 CPU=140504 ms Footprint=642980 KB
Run 4 161.6 2433.0 2810.8 2803.1 Avg=2803 CPU=132384 ms Footprint=632412 KB
Run 5 139.9 2409.7 2819.2 2861.3 Avg=2861 CPU=132907 ms Footprint=649168 KB
Run 6 177.9 2467.1 2801.9 2787.1 Avg=2787 CPU=137302 ms Footprint=636512 KB
Run 7 178.0 2431.0 2764.5 2823.4 Avg=2823 CPU=133845 ms Footprint=638280 KB
CompTime avg=135541.62 min=131592.00 max=140504.00 stdDev=3256.5 maxVar=6.77% confInt=1.61% samples= 8
Footprint avg=640679.50 min=632412.00 max=651608.00 stdDev=6681.6 maxVar=3.04% confInt=0.70% samples= 8
Shift0:
run0: summary = 2938081 in 600s = 4896.7/s Avg: 1 Min: 0 Max: 891 Err: 0 (0.00%)
run1: summary = 3136727 in 600s = 5227.7/s Avg: 1 Min: 0 Max: 129 Err: 0 (0.00%)
run2: summary = 3147370 in 600s = 5245.4/s Avg: 1 Min: 0 Max: 109 Err: 0 (0.00%)
run3: summary = 3139280 in 600s = 5232.0/s Avg: 1 Min: 0 Max: 117 Err: 0 (0.00%)
run4: summary = 3133830 in 600s = 5222.8/s Avg: 1 Min: 0 Max: 79 Err: 0 (0.00%)
run5: summary = 3136712 in 600s = 5227.7/s Avg: 1 Min: 0 Max: 156 Err: 0 (0.00%)
Average (run1-run5) = 5231.12
Shift3:
run0: summary = 2964754 in 600s = 4941.1/s Avg: 1 Min: 0 Max: 260 Err: 0 (0.00%)
run1: summary = 3137234 in 600s = 5228.3/s Avg: 1 Min: 0 Max: 124 Err: 0 (0.00%)
run2: summary = 3126874 in 600s = 5211.3/s Avg: 1 Min: 0 Max: 110 Err: 0 (0.00%)
run3: summary = 3139452 in 600s = 5232.2/s Avg: 1 Min: 0 Max: 64 Err: 0 (0.00%)
run4: summary = 3134675 in 600s = 5224.3/s Avg: 1 Min: 0 Max: 100 Err: 0 (0.00%)
run5: summary = 3139328 in 600s = 5232.1/s Avg: 1 Min: 0 Max: 113 Err: 0 (0.00%)
Average (run1-run5) = 5225.64
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:none -Xnoaot -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=2997.16 min=2976.00 max=3022.10 stdDev=13.9 maxVar=1.55% confInt=0.31% samples= 8
Intermediate results:
Run 0 161.6 2661.4 3026.7 3022.1 Avg=3022 CPU=131266 ms Footprint=597944 KB
Run 1 175.7 2600.5 3029.5 2990.0 Avg=2990 CPU=127742 ms Footprint=594136 KB
Run 2 178.5 2622.8 3002.0 2984.6 Avg=2985 CPU=130536 ms Footprint=602824 KB
Run 3 161.8 2686.5 3003.7 3000.8 Avg=3001 CPU=129596 ms Footprint=596588 KB
Run 4 131.0 2617.7 2820.5 2976.0 Avg=2976 CPU=143361 ms Footprint=603160 KB
Run 5 157.4 2657.5 3016.9 2999.4 Avg=2999 CPU=129978 ms Footprint=604736 KB
Run 6 175.0 2656.1 2978.4 3001.4 Avg=3001 CPU=130472 ms Footprint=592384 KB
Run 7 192.8 2599.7 3042.6 3003.0 Avg=3003 CPU=130239 ms Footprint=613256 KB
CompTime avg=131648.75 min=127742.00 max=143361.00 stdDev=4843.3 maxVar=12.23% confInt=2.46% samples= 8
Footprint avg=600628.50 min=592384.00 max=613256.00 stdDev=6774.0 maxVar=3.52% confInt=0.76% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:none -Xnoaot -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=2933.98 min=2877.50 max=2974.10 stdDev=29.9 maxVar=3.36% confInt=0.68% samples= 8
Intermediate results:
Run 0 173.7 2609.1 2960.8 2918.4 Avg=2918 CPU=133268 ms Footprint=599836 KB
Run 1 179.3 2641.4 2952.0 2920.2 Avg=2920 CPU=128775 ms Footprint=597212 KB
Run 2 154.8 2589.6 2955.9 2956.3 Avg=2956 CPU=131750 ms Footprint=606756 KB
Run 3 191.9 2553.4 2955.2 2945.6 Avg=2946 CPU=129687 ms Footprint=597068 KB
Run 4 149.1 2640.5 2965.7 2974.1 Avg=2974 CPU=129883 ms Footprint=600332 KB
Run 5 193.3 2638.6 2941.3 2927.3 Avg=2927 CPU=127431 ms Footprint=603852 KB
Run 6 168.7 2530.4 2892.4 2877.5 Avg=2878 CPU=145817 ms Footprint=596084 KB
Run 7 176.7 2569.6 2945.2 2952.4 Avg=2952 CPU=134123 ms Footprint=596880 KB
CompTime avg=132591.75 min=127431.00 max=145817.00 stdDev=5798.9 maxVar=14.43% confInt=2.93% samples= 8
Footprint avg=599752.50 min=596084.00 max=606756.00 stdDev=3809.2 maxVar=1.79% confInt=0.43% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=2964.54 min=2933.90 max=3002.30 stdDev=25.2 maxVar=2.33% confInt=0.57% samples= 8
Intermediate results:
Run 0 220.7 2656.6 2940.8 2933.9 Avg=2934 CPU=103509 ms Footprint=712676 KB
Run 1 228.4 2674.1 2920.4 2950.9 Avg=2951 CPU=105973 ms Footprint=719892 KB
Run 2 223.8 2727.0 2960.8 2947.2 Avg=2947 CPU=103124 ms Footprint=704384 KB
Run 3 215.6 2704.4 2978.5 2978.7 Avg=2979 CPU=103663 ms Footprint=709576 KB
Run 4 235.9 2666.1 2967.8 3002.3 Avg=3002 CPU=103964 ms Footprint=710316 KB
Run 5 218.4 2676.8 2964.8 2997.3 Avg=2997 CPU=101415 ms Footprint=704660 KB
Run 6 176.1 2719.4 2953.1 2958.0 Avg=2958 CPU=103691 ms Footprint=726336 KB
Run 7 214.4 2654.4 2957.3 2948.0 Avg=2948 CPU=106512 ms Footprint=714952 KB
CompTime avg=103981.38 min=101415.00 max=106512.00 stdDev=1608.1 maxVar=5.03% confInt=1.04% samples= 8
Footprint avg=712849.00 min=704384.00 max=726336.00 stdDev=7481.4 maxVar=3.12% confInt=0.70% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=2889.20 min=2842.80 max=2943.50 stdDev=37.0 maxVar=3.54% confInt=0.86% samples= 8
Intermediate results:
Run 0 181.7 2536.2 2979.7 2933.4 Avg=2933 CPU=123566 ms Footprint=693680 KB
Run 1 185.0 2523.2 2887.1 2910.4 Avg=2910 CPU=128682 ms Footprint=640464 KB
Run 2 178.7 2494.4 2877.1 2943.5 Avg=2944 CPU=128197 ms Footprint=637768 KB
Run 3 175.7 2553.8 2879.9 2889.7 Avg=2890 CPU=129298 ms Footprint=639216 KB
Run 4 157.6 2498.3 2862.5 2875.5 Avg=2876 CPU=123893 ms Footprint=640976 KB
Run 5 187.6 2497.7 2868.2 2842.8 Avg=2843 CPU=125097 ms Footprint=632568 KB
Run 6 178.9 2420.8 2828.6 2865.9 Avg=2866 CPU=127235 ms Footprint=637264 KB
Run 7 173.4 2305.5 2862.9 2852.4 Avg=2852 CPU=119157 ms Footprint=636492 KB
CompTime avg=125640.62 min=119157.00 max=129298.00 stdDev=3410.0 maxVar=8.51% confInt=1.82% samples= 8
Footprint avg=644803.50 min=632568.00 max=693680.00 stdDev=19923.9 maxVar=9.66% confInt=2.07% samples= 8
I have spent some time coming up with an implementation and here's the whole story:
First, a recap of the original compressed shift design:
- When -XX:+PortableSharedCache is specified during the cold run: if the compressed shift value is <= 3 then 3 will be used and persisted to the shared class cache; if the compressed shift value is 4 then 4 will be used.
- When -XX:+PortableSharedCache is specified during the warm run: the shift value persisted in the shared class cache will be picked up and applied to the current JVM.
I proceeded to implement this, and found some limitations with our existing infrastructure in the codebase:
- The CR shift is set very early in VM initialization, in initializeRunTimeObjectAlignmentAndCRShift(); this is in fact earlier than the earliest point the VM is able to load the SCC. As a result, with the current infrastructure it may not be possible to pick up the CR shift value from the SCC and then apply it to the current JVM.
- -XX:+PortableSharedCache is yet to be parsed and processed when initializeRunTimeObjectAlignmentAndCRShift() is called, but this looks possible to work around without too much effort.
Due to these limitations, I'm proposing an alternative solution: when -XX:+PortableSharedCache is specified, if the compressed shift value is <= 3 we will fix the compressed shift to 3; if the compressed shift value is 4 then we will generate a warning message to the user that the heap may be too large for the portableSharedCache option to work.
@vijaysun-omr @mpirvu @dsouzai Let me know if there are any concerns with the alternative solution.
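In pseudocode form, the alternative solution amounts to something like the sketch below; this is not the actual OpenJ9 code, and determineDefaultShiftForHeap(), setCompressedRefShift() and issueWarning() are hypothetical helpers standing in for the real initialization paths:

```c
#include <stdint.h>

/* Hypothetical helpers standing in for the real VM initialization code. */
extern uint32_t determineDefaultShiftForHeap(uint64_t maxHeapBytes); /* 0..4 */
extern void setCompressedRefShift(uint32_t shift);
extern void issueWarning(const char *message);

/* Sketch of the proposed -XX:+PortableSharedCache behaviour. */
static void choosePortableShift(uint64_t maxHeapBytes)
{
    uint32_t shift = determineDefaultShiftForHeap(maxHeapBytes);

    if (shift <= 3) {
        /* Fix the shift to 3 so the AOT code is valid for any heap <= 32GB. */
        setCompressedRefShift(3);
    } else {
        /* The heap needs shift 4: portable AOT as proposed can't cover it. */
        issueWarning("heap may be too large for -XX:+PortableSharedCache");
        setCompressedRefShift(shift);
    }
}
```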
@dmitripivkine Does the CR shift change the object alignment requirements? Does forcing a 3-bit shift mean small heaps will waste more space and therefore incur higher GC overhead?
if the compressed shift value is 4 then we will generate a warning message to the user that the heap may be too large for the portableSharedCache option to work.
I assume we will continue to generate AOT which is portable from the processor point of view.
@dmitripivkine Does the CR shift change the object alignment requirements? Does forcing a 3-bit shift mean small heaps will waste more space and therefore incur higher GC overhead?
Only for the 4-bit shift (which requires 16-byte alignment for objects). All other cases are covered by the minimum heap object alignment of 8 bytes.
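In other words, the alignment implied by the shift is 1 << shift, and only shift 4 exceeds the default 8-byte heap object alignment; a quick sketch of that relationship (the 8-byte default is the assumption stated above):

```c
#include <stdint.h>

#define DEFAULT_HEAP_OBJECT_ALIGNMENT 8 /* minimum object alignment, per the comment above */

/* Object alignment implied by a given compressed refs shift: shifts 0-3 are
 * already covered by the 8-byte minimum; only shift 4 forces 16-byte aligned
 * objects, which is where extra wasted space could come from. */
static inline uintptr_t objectAlignmentForShift(uint32_t shift)
{
    uintptr_t shiftAlignment = (uintptr_t)1 << shift;
    return (shiftAlignment > DEFAULT_HEAP_OBJECT_ALIGNMENT)
        ? shiftAlignment
        : DEFAULT_HEAP_OBJECT_ALIGNMENT;
}
```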
I assume we will continue to generate AOT which is portable from the processor point of view.
Yes, we will always use the portable processor feature set when -XX:+PortableSharedCache is specified.
@DanHeidinga @vijaysun-omr Moving the question here: https://github.com/eclipse/omr/pull/5436 forces any run in a container to use shift 3 (unless it needs to use shift 4). It obviously prevents using the most performant shift 0 for small container applications. Would you please confirm this is the desired behaviour?
I think in containers the plan is to sacrifice a bit of performance in exchange for maximum portability. I have run some experiments comparing shift0 and shift3 and didn't see a significant throughput drop. I'll leave the decision to Vijay @vijaysun-omr though.
Moving the question here: eclipse/omr#5436 forces any run in a container to use shift 3 (unless it needs to use shift 4). It obviously prevents using the most performant shift 0 for small container applications. Would you please confirm this is the desired behaviour?
My understanding is that this is the expected behaviour only when -XX:+PortableSharedCache is specified and in that case we accept the tradeoff for better portability.
My understanding is that this is the expected behaviour only when -XX:+PortableSharedCache is specified and in that case we accept the tradeoff for better portability.
Yes, that is correct. But in containers the PortableSharedCache feature is enabled by default: the portable processor feature set will be used by default for AOT compilations unless disabled by -XX:-PortableSharedCache. The question here is whether we want to also have the shift set to 3 by default in containers.
The question here is whether we want to also have the shift set to 3 by default in containers.
I would say yes, but only if AOT is enabled.
The shift by 3 code is only generated for an AOT compilation in containers.
So, in the unaffected category are : 1) JIT compilations inside or outside containers and 2) AOT compilations outside containers.
This is a conscious choice being made in the same philosophy as the change to use the "portable processor feature set", i.e. we want to make the code portable at a small performance cost in containers (in the case of the compressed refs shift, the cost was negligible, as @harryyu1994 measured). Since AOT compilations can be (and are) recompiled as JIT compilations if they are deemed important for peak throughput (this is neither the first nor the most significant way in which AOT compilations are worse than JIT compilations), the steady state impact won't even be as much as measured in the AOT experiments with and without the portability changes. During startup and rampup phases, when AOT code is used heavily, these minor performance differences are unlikely to be a big enough deal to compromise on portability in containers.
The shift by 3 code is only generated for an AOT compilation in containers.
So, in the unaffected category are : 1) JIT compilations inside or outside containers and 2) AOT compilations outside containers.
This is a conscious choice being made in the same philosophy as the change to use the "portable processor feature set", i.e. we want to make the code portable at a small performance cost in containers (in the case of the compressed refs shift, the cost was negligible, as @harryyu1994 measured). Since AOT compilations can be (and are) recompiled as JIT compilations if they are deemed important for peak throughput (this is neither the first nor the most significant way in which AOT compilations are worse than JIT compilations), the steady state impact won't even be as much as measured in the AOT experiments with and without the portability changes. During startup and rampup phases, when AOT code is used heavily, these minor performance differences are unlikely to be a big enough deal to compromise on portability in containers.
Correct me if I'm wrong, but I thought the JIT compilations inside containers would also have to use shift3 if we made the AOT compilations shift by 3. So JIT compilations inside containers are affected (though we didn't see a throughput drop in my experiment when comparing AOT+JIT shift0 vs. AOT+JIT shift3).
The question here is whether we want to also have the shift set to 3 by default in containers.
I would say yes, but only if AOT is enabled.
May not be possible to check for whether AOT is enabled this early.
enum INIT_STAGE {
    PORT_LIBRARY_GUARANTEED,        /* 0 */
    ALL_DEFAULT_LIBRARIES_LOADED,   /* 1 */
    ALL_LIBRARIES_LOADED,           /* 2 */
    DLL_LOAD_TABLE_FINALIZED,       /* 3 - Consume JIT specific X options */
    VM_THREADING_INITIALIZED,       /* 4 */
    HEAP_STRUCTURES_INITIALIZED,    /* 5 */
    ALL_VM_ARGS_CONSUMED,           /* 6 */
The shift is set at ALL_LIBRARIES_LOADED, very early in the initialization.
Looking at the code https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L34-L66 I see that vm->sharedCacheAPI->sharedCacheEnabled is set very early and SCC options are also parsed very early. But then there is this piece of code in the same function: https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L304-L328
which seems to deal with the -Xshareclasses:none option. I wonder why this piece of code can't be grouped with the previous one; that way we would know very early whether the SCC is 'likely' going to be used. @hangshao0
@harryyu1994 Discussing with @mpirvu some more, I feel that we need more data points if we are going to slow down JITed code (in addition to AOTed code) inside containers. Could you please run SPECjbb2015 (please ask Piyush if you need help with accessing a setup for it) and maybe SPECjbb2005 (that is much easier to run) and check what the throughput overhead is?
Additionally, the overhead of the shift would be platform dependent and so if one wanted to take a design decision for all platforms, the effect of the shift ought to be measured on the other platforms first.
I would also add quarkus throughput experiments since quarkus is more likely to be run in containers.
But then there is this piece of code in the same function: https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L304-L328
which seems to deal with the -Xshareclasses:none option. I wonder why this piece of code can't be grouped with the previous one. That way we would know very early whether the SCC is 'likely' going to be used.
Looking at the code, what it does is unload the SCC dll if -Xshareclasses:none is present. I guess that is the reason why it is done in the DLL_LOAD_TABLE_FINALIZED stage. Once the SCC dll is unloaded, all SCC related functionality will be inactive.
unload the SCC dll if -Xshareclasses:none is present
This means that we load the SCC dll before checking the command line options. Be that as it may, we could add another check for -Xshareclasses:none when SCC options are parsed.
we could add another check for -Xshareclasses:none when SCC options are parsed.
Yes. It looks fine to me if another check for -Xshareclasses:none is added in the block L34 to L66.
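For illustration, a very rough sketch of what such an early check could look like; this is not the actual shrclssup.c code, and the argv-style scan is a stand-in for walking the real parsed VM argument tables, but it shows the "rightmost occurrence wins" idea:

```c
#include <string.h>

/* Illustration only: return nonzero if the rightmost -Xshareclasses option on
 * the command line is -Xshareclasses:none. In the real VM this would walk the
 * parsed VM argument tables in the early block of shrclssup.c instead. */
static int isShareclassesNoneSpecified(int argc, char **argv)
{
    int noneSeen = 0;
    for (int i = 0; i < argc; i++) {
        if (0 == strncmp(argv[i], "-Xshareclasses", strlen("-Xshareclasses"))) {
            /* A later -Xshareclasses:<subopts> overrides an earlier :none. */
            noneSeen = (0 == strcmp(argv[i], "-Xshareclasses:none"));
        }
    }
    return noneSeen;
}
```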
Quarkus+CRUD on x86 loses 0.9% in throughput when we force shift3 instead of shift0 for compressedrefs.
Stats for rest-crud-quarkus-openj9:j11 with JAVA_OPTS=-Xms128m -Xmx128m -Xshareclasses:none -XXgc:forcedShiftingCompressionAmount=3
Throughput: avg=12040.20 min=11931.00 max=12111.10 stdDev=57.7 maxVar=1.51% confInt=0.28% samples=10
Footprint: avg=123.01 min=105.90 max=129.90 stdDev=6.7 maxVar=22.66% confInt=3.17% samples=10
Stats for rest-crud-quarkus-openj9:j11 with JAVA_OPTS=-Xms128m -Xmx128m -Xshareclasses:none -XXgc:forcedShiftingCompressionAmount=0
Throughput: avg=12140.48 min=12065.70 max=12209.50 stdDev=48.3 maxVar=1.19% confInt=0.23% samples=10
Footprint: avg=125.31 min=120.40 max=129.30 stdDev=2.8 maxVar=7.39% confInt=1.28% samples=10
-Xms2g -Xmx2g -Xmn1g -Xgcpolicy:gencon -Xlp -Xcompressedrefs
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9962189 | 12177 | 4435 | 13838 | 13099
9962190 | 13688 | 5824 | 19279 | 16099
9962191 | 14006 | 5749 | 16099 | 13449
9962192 | 11277 | 4417 | 13587 | 13415
means | 12787 | 5106.25 | 15700.75 | 14015.5
medians | 12932.5 | 5092 | 14968.5 | 13432
confidence_interval | 0.15982393515568 | 0.24493717290884 | 0.26746374600271 | 0.1586869314041
min | 11277 | 4417 | 13587 | 13099
max | 14006 | 5824 | 19279 | 16099
stddev | 1284.5183273637 | 786.11592656554 | 2639.4603457273 | 1397.9111798203
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9962180 | 12718 | 4750 | 16099 | 14334
9962181 | 14972 | 5971 | 16099 | 13449
9962182 | 13040 | 4531 | 16099 | 13795
9962183 | 14167 | 5686 | 16099 | 13449
means | 13724.25 | 5234.5 | 16099 | 13756.75
medians | 13603.5 | 5218 | 16099 | 13622
confidence_interval | 0.12035581689665 | 0.21319070129398 | 0 | 0.048339382455907
min | 12718 | 4531 | 16099 | 13449
max | 14972 | 5971 | 16099 | 14334
stddev | 1038.2107605555 | 701.41214702912 |  | 417.97158994362
Added more runs
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994492 | 13253 | 5439 | 16566 | 15678
9994493 | 13445 | 5534 | 14939 | 14322
9994494 |  |  |  |
9994495 | 15133 | 6707 | 16099 | 13449
means | 13943.666666667 | 5893.3333333333 | 15868 | 14483
medians | 13445 | 5534 | 16099 | 14322
confidence_interval | 0.15737812859215 | 0.2542198958441 | 0.11199389127022 | 0.16451397337762
min | 13253 | 5439 | 14939 | 13449
max | 15133 | 6707 | 16566 | 15678
stddev | 1034.4570234347 | 706.25514747387 | 837.73683218538 | 1123.1878738662
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994483 | 12508 | 5022 | 13449 | 13024
9994484 | 13146 | 5101 | 14939 | 14322
9994485 | 16099 | 6128 | 16099 | 13449
9994486 | 13362 | 5131 | 16099 | 13795
means | 13778.75 | 5345.5 | 15146.5 | 13647.5
medians | 13254 | 5116 | 15519 | 13622
confidence_interval | 0.18344971565841 | 0.1558672596321 | 0.13202131493524 | 0.064024675274026
min | 12508 | 5022 | 13449 | 13024
max | 16099 | 6128 | 16099 | 14322
stddev | 1588.754097818 | 523.68852065581 | 1256.8578545988 | 549.19972080595
Don't think this is a good benchmark for this as the fluctuations are too large.
That's a 6.8% drop in max_jOPS and 2.5% drop in critical_jOPS.
Fluctuations are huge though (min-max ~20%), so I am not sure we can draw any conclusions from such a small dataset.
My DT7 experiments with AOT enabled show a 2.1% regression when moving from shift 0 to shift 3
Results for JDK=/home/mpirvu/sdks/OpenJDK11U-jre_x64_linux_openj9_2020-05-21-10-15 jvmOpts=-Xms1024m -Xmx1024m -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=3530.25 min=3434.60 max=3587.30 stdDev=44.9 maxVar=4.45% confInt=0.74% samples=10
CompTime avg=137425.30 min=128296.00 max=179999.00 stdDev=15107.7 maxVar=40.30% confInt=6.37% samples=10
Footprint avg=932900.80 min=912844.00 max=948924.00 stdDev=9184.8 maxVar=3.95% confInt=0.57% samples=10
Results for JDK=/home/mpirvu/sdks/OpenJDK11U-jre_x64_linux_openj9_2020-05-21-10-15 jvmOpts=-Xms1024m -Xmx1024m -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=3455.40 min=3392.70 max=3521.00 stdDev=36.2 maxVar=3.78% confInt=0.61% samples=10
CompTime avg=139633.70 min=132116.00 max=182410.00 stdDev=15162.1 maxVar=38.07% confInt=6.29% samples=10
Footprint avg=930844.00 min=922164.00 max=945488.00 stdDev=7221.0 maxVar=2.53% confInt=0.45% samples=10
That's a 6.8% drop in max_jOPS and 2.5% drop in critical_jOPS.
Fluctuations are huge though (min-max ~20%), so I am not sure we can draw any conclusions from such a small dataset.
Not sure why fluctuations are so large. Originally the heap size was set to 24GB; I had to change it to 2GB to be able to use shift0. Maybe the test does not work well with a smaller heap.
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
-- | -- | -- | -- | -- | -- | --
9994528 | 8638.9371504854 | 1.8573037069837 | 504.7412381469 | 582.0975 | 13.859526383526 | 14.445926640927
9994529 | 8762.1322573549 | 1.8264175610697 | 534.71616320959 | 560.30609923475 | 13.895542351454 | 14.478946902655
9994530 | 8740.2146440973 | 1.835861262807 | 506.15 | 586.72603318492 | 13.904566037736 | 14.485850314465
9994531 | 8664.0184392624 | 1.8513182105413 | 510.4425 | 579.65 | 14.166381979695 | 14.671502538071
9994532 | 8577.5554538932 | 1.8694810666913 | 509.735 | 571.67 | 13.984703208556 | 14.73531684492
9994533 | 8691.1423921766 | 1.8423403917498 | 525.7875 | 567.96858007855 | 14.010517902813 | 14.684122762148
9994534 | 8502.5874741253 | 1.8831568605721 | 494.7275 | 555.5325 | 14.109282694848 | 14.744678996037
9994535 | 8658.9934676143 | 1.8495857046636 | 515.9275 | 564.43 | 13.919508322663 | 14.490800256082
means | 8654.4476598762 | 1.8519330956348 | 512.77842516956 | 571.04758906228 | 13.981253610162 | 14.592143156913
medians | 8661.5059534384 | 1.8504519576024 | 510.08875 | 569.81929003927 | 13.95210576561 | 14.581151397076
confidence_interval | 0.0081348395932811 | 0.0082178422245874 | 0.02052919348065 | 0.016151072515423 | 0.0065275035989177 | 0.0073217207304309
min | 8502.5874741253 | 1.8264175610697 | 494.7275 | 555.5325 | 13.859526383526 | 14.445926640927
max | 8762.1322573549 | 1.8831568605721 | 534.71616320959 | 586.72603318492 | 14.166381979695 | 14.744678996037
stddev | 84.198081874972 | 0.018201070854603 | 12.589702870929 | 11.030304909653 | 0.10914581344745 | 0.1277750589018
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
-- | -- | -- | -- | -- | -- | --
9994541 | 8119.8086062203 | 1.9734693630701 | 482.52 | 533.94 | 13.927745554036 | 14.523974008208
9994542 | 8286.7210135417 | 1.9335406181712 | 484.4725 | 541.3 | 14.028153225806 | 14.621639784946
9994543 | 8238.9619674721 | 1.9421575015578 | 502.385 | 524.27 | 14.281521505376 | 14.873037634409
9994544 | 8388.4166737499 | 1.9096835513286 | 500.78124804688 | 547.5725 | 13.915002663116 | 14.43093608522
9994545 | 8408.0668386632 | 1.9034667560754 | 515.315 | 539.43 | 13.90702393617 | 14.452569148936
9994546 | 8298.7935361939 | 1.9281740577079 | 509.36622658443 | 526.865 | 13.899852393617 | 14.452348404255
9994547 | 8441.6219797253 | 1.8954390538826 | 523.06 | 533.69 | 14.077266311585 | 14.597009320905
9994548 | 8412.5731827431 | 1.9026361479098 | 513.48871627821 | 540.7625 | 13.878481333333 | 14.53224
means | 8324.3704747887 | 1.9235708812129 | 503.92358636369 | 535.97875 | 13.98938086538 | 14.56046929836
medians | 8343.6051049719 | 1.9189288045183 | 505.87561329222 | 536.685 | 13.921374108576 | 14.528107004104
confidence_interval | 0.010993300632059 | 0.011362164570798 | 0.024010815613053 | 0.012185469952163 | 0.0081766615307008 | 0.0082655166059187
min | 8119.8086062203 | 1.8954390538826 | 482.52 | 524.27 | 13.878481333333 | 14.43093608522
max | 8441.6219797253 | 1.9734693630701 | 523.06 | 547.5725 | 14.281521505376 | 14.873037634409
stddev | 109.44435177091 | 0.026138647857235 | 14.470563630049 | 7.8109472171159 | 0.13680071373816 | 0.14393261774689
Seeing a 4% drop in throughput on Power.
@andrewcraik @zl-wang see above overhead(s)
I have also updated the original post for SPECjbb2015GMR. I don't think we can draw any conclusions from that particular benchmark as we always seem to have large fluctuations (despite multiple attempts, and not a small dataset considering each run takes over 3 hours).
most of variability typically comes from JIT, but this heap size is also fairly small for jbb2015 and may contribute (by having large variations in number of global GCs that are relatively expensive) as well.
if these tests are to be repeated try heap as big as possible while still being able to run shift0, with 2GB given to Tenure and rest to Nursery (for example -Xmx3200M -Xms3200M -Xmn1200M)
Maybe due to the shift-3 case's bigger measurement variability, the overhead looked like twice what was expected. We had prior experience with this overhead ... about 2-2.5%. It might be worth another, more stable measurement of the shift-3 case.
most of variability typically comes from JIT, but this heap size is also fairly small for jbb2015 and may contribute (by having large variations in number of global GCs that are relatively expensive) as well.
if these tests are to be repeated try heap as big as possible while still being able to run shift0, with 2GB given to Tenure and rest to Nursery (for example -Xmx3200M -Xms3200M -Xmn1200M)
Tried with -Xmx3200M -Xms3200M -Xmn1200M on x86
The shift0 runs were pretty stable, the shift3 runs were not.
3.2% drop in max_jOPS and 2% drop in critical_jOPS
Going to give this another try..
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994645 | 18938 | 11502 | 23095 | 22360
9994646 | 18938 | 11227 | 23095 | 21847
9994647 | 19169 | 10885 | 23095 | 21419
9994648 | 21247 | 11684 | 23095 | 19279
means | 19573 | 11324.5 | 23095 | 21226.25
medians | 19053.5 | 11364.5 | 23095 | 21633
confidence_interval | 0.09114537985632 | 0.048897960523052 | 0 | 0.1014854988189
min | 18938 | 10885 | 23095 | 19279
max | 21247 | 11684 | 23095 | 22360
stddev | 1121.3001382324 | 348.04836828617 |  | 1353.9639027685
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994636 | 20555 | 12086 | 23095 | 22548
9994637 | 20093 | 11388 | 23095 | 22360
9994638 | 20202 | 11982 | 27674 | 23095
9994639 | 20093 | 11496 | 23095 | 22706
means | 20235.75 | 11738 | 24239.75 | 22677.25
medians | 20147.5 | 11739 | 23095 | 22627
confidence_interval | 0.017214402858354 | 0.047064354097083 | 0.15027360018152 | 0.021914250955476
min | 20093 | 11388 | 23095 | 22360
max | 20555 | 12086 | 27674 | 23095
stddev | 218.94805319984 | 347.22903104435 | 2289.5 | 312.35383248276
Do you use -XXgc:forcedShiftingCompressionAmount=3 to force shift 3?
Do you use -XXgc:forcedShiftingCompressionAmount=3 to force shift 3?
Yes
With a larger dataset, I'm measuring a 3.5% throughput drop on Power.
First 8 runs: (8288.3387340932 vs. 8581.7531089419)
Next 8 runs: (8316.9258529547 vs. 8627.474541907)
Both have 3.5% throughput drop.
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
-- | -- | -- | -- | -- | -- | --
9994855 | 8227.0338779673 | 1.9450356503132 | 503.18874202814 | 526.035 | 13.904885598923 | 14.4253243607
9994856 | 8087.2274532643 | 1.9783969421294 | 494.9975 | 517.92093638361 | 13.935554945055 | 14.535127747253
9994857 | 8317.3519463409 | 1.9245225413624 | 500.05874985313 | 537.415 | 14.015772543742 | 14.649139973082
9994858 | 8286.6920988556 | 1.9310648048965 | 504.66 | 527.8886802783 | 14.190730201342 | 14.867684563758
9994859 | 8207.8741850326 | 1.9494444556917 | 506.135 | 519.855 | 13.985191117093 | 14.581004037685
9994860 | 8392.4408795395 | 1.907375533393 | 503.745 | 545.16 | 13.938728 | 14.444909333333
9994861 | 8334.6407679616 | 1.920294696344 | 499.66 | 533.34116664708 | 14.107698795181 | 14.611708165997
9994862 | 8453.4486637834 | 1.8928913679089 | 509.36872657818 | 532.835 | 14.051724842767 | 14.721377358491
means | 8288.3387340932 | 1.9311282490049 | 502.72671480743 | 530.05634791362 | 14.016285755513 | 14.604534442537
medians | 8302.0220225982 | 1.9277936731294 | 503.46687101407 | 530.36184013915 | 14.000481830417 | 14.596356101841
confidence_interval | 0.011550726630632 | 0.011514626729118 | 0.0073576725719721 | 0.014272083518414 | 0.0057905694289637 | 0.0083223576011986
min | 8087.2274532643 | 1.8928913679089 | 494.9975 | 517.92093638361 | 13.904885598923 | 14.4253243607
max | 8453.4486637834 | 1.9783969421294 | 509.36872657818 | 545.16 | 14.190730201342 | 14.867684563758
stddev | 114.49608734331 | 0.026593458983634 | 4.4237061399032 | 9.0473890683649 | 0.097066208198194 | 0.14536101225863
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
9994868 | 8429.0328467609 | 1.9004089367754 | 506.765 | 552.32 | 14.128936925099 | 14.630515111695
9994869 | 8687.8546821798 | 1.8426826705212 | 522.0325 | 564.32608918478 | 13.933602791878 | 14.440649746193
9994870 | 8468.1746712228 | 1.8895803645447 | 514.3225 | 542.93614265964 | 14.165972440945 | 14.769738845144
9994871 | 8627.8107263089 | 1.8540877613996 | 525.47618630953 | 544.83613790966 | 13.899932484076 | 14.414615286624
9994872 | 8670.780576214 | 1.8459311333678 | 526.85723142769 | 555.88111029722 | 14.051856780735 | 14.553642585551
9994873 | 8555.9774620284 | 1.8699833702903 | 527.8925 | 541.9025 | 13.964604139715 | 14.498144890039
9994874 | 8579.5265630195 | 1.8652614005592 | 516.575 | 546.735 | 13.892619607843 | 14.611988235294
9994875 | 8634.8673438012 | 1.854260194939 | 509.6125 | 568.4310789223 | 13.914638569604 | 14.44150063857
means | 8581.7531089419 | 1.8652744790497 | 518.69167721715 | 552.1710073717 | 13.994020467487 | 14.545099417389
medians | 8603.6686446642 | 1.8597607977491 | 519.30375 | 549.5275 | 13.949103465797 | 14.525893737795
confidence_interval | 0.0090968857000075 | 0.0092466689375262 | 0.013021956977964 | 0.015143005290046 | 0.0064300747604867 | 0.0069793709871253
min | 8429.0328467609 | 1.8426826705212 | 506.765 | 541.9025 | 13.892619607843 | 14.414615286624
max | 8687.8546821798 | 1.9004089367754 | 527.8925 | 568.4310789223 | 14.165972440945 | 14.769738845144
stddev | 93.364677712505 | 0.02062727721853 | 8.0779169549425 | 9.9999889949771 | 0.107614892342 | 0.12140786619901
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
9994930 | 8177.5107248928 | 1.9590636485304 | 487.47 | 541.2825 | 14.013406035665 | 14.80304526749
9994931 | 8432.8923393013 | 1.8986411307841 | 507.995 | 551.18 | 13.983329333333 | 14.492921333333
9994932 | 8228.0696491294 | 1.9454009387451 | 489.7725 | 529.6475 | 13.910288227334 | 14.588560216509
9994933 | 8403.3447909506 | 1.9049858210628 | 508.8425 | 543.6725 | 13.894525827815 | 14.501691390728
9994934 | 8385.7767133493 | 1.9085944995934 | 508.14622963443 | 541.58364604088 | 13.929327516778 | 14.648150335571
9994935 | 8330.2474844166 | 1.9258762781886 | 486.82378294054 | 558.22860442849 | 14.191576974565 | 14.858611780455
9994936 | 8267.1334929976 | 1.9375925939066 | 488.9325 | 546.98 | 14.161099319728 | 14.669331972789
9994937 | 8310.4316285999 | 1.9272640311479 | 494.9175 | 544.84113789716 | 14.038435549525 | 14.781244233378
means | 8316.9258529547 | 1.9259273677449 | 496.61250157187 | 544.67698604582 | 14.015248598093 | 14.667944566282
medians | 8320.3395565083 | 1.9265701546682 | 492.345 | 544.25681894858 | 13.998367684499 | 14.65874115418
confidence_interval | 0.0089671497091839 | 0.0091344216653754 | 0.016839629548937 | 0.012702234285658 | 0.0066472032951229 | 0.0078402754117856
min | 8177.5107248928 | 1.8986411307841 | 486.82378294054 | 529.6475 | 13.894525827815 | 14.492921333333
max | 8432.8923393013 | 1.9590636485304 | 508.8425 | 558.22860442849 | 14.191576974565 | 14.858611780455
stddev | 89.19306714941 | 0.021039470646772 | 10.001474451658 | 8.2743329580123 | 0.11141755278116 | 0.13753537856553
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time
9994917 | 8635.7070897714 | 1.853296592018 | 524.31 | 556.5375 | 14.183319371728 | 14.806257853403
9994918 | 8621.7088487345 | 1.8558855922125 | 531.7725 | 545.95 | 13.98879791395 | 14.501852672751
9994919 | 8533.2685029174 | 1.8753910779895 | 522.7925 | 546.08863477841 | 13.958782664942 | 14.470852522639
9994920 | 8574.0599079032 | 1.8675608874445 | 513.61871595321 | 564.5675 | 13.962389175258 | 14.472323453608
9994921 | 8510.2283664363 | 1.8815516956704 | 505.67 | 550.7275 | 14.622058124174 | 15.321787318362
9994922 | 8703.9959200816 | 1.8385274078447 | 526.2775 | 557.65 | 13.938370656371 | 14.615827541828
9994923 | 8781.1981281123 | 1.8221623975359 | 535.02 | 559.17860205349 | 14.360284634761 | 14.88064231738
9994924 | 8659.6295712997 | 1.8498789700963 | 506.085 | 569.3425 | 14.483114068441 | 14.988921419518
means | 8627.474541907 | 1.8555318276015 | 520.69327699415 | 556.25527960399 | 14.187139576203 | 14.757308137436
medians | 8628.707969253 | 1.8545910921153 | 523.55125 | 557.09375 | 14.086058642839 | 14.711042697615
confidence_interval | 0.008675996428163 | 0.0087775211883997 | 0.017859539791627 | 0.012590065519452 | 0.015918315024142 | 0.017111525799816
min | 8510.2283664363 | 1.8221623975359 | 505.67 | 545.95 | 13.938370656371 | 14.470852522639
max | 8781.1981281123 | 1.8815516956704 | 535.02 | 569.3425 | 14.622058124174 | 15.321787318362
stddev | 89.519345731438 | 0.019478438704893 | 11.121569557209 | 8.37560108854 | 0.27008830852039 | 0.30200193835895
SPECjbb2015 on x86. No throughput drop observed this time. (I grabbed the build from a different location this time, non-source code version of that build)
-Xmx3200M -Xms3200M -Xmn1200M
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994899 | 20202 | 11928 | 27674 | 23095
9994900 | 19649 | 11962 | 23392 | 22775
9994901 | 20324 | 11498 | 23095 | 21847
9994902 | 20479 | 11678 | 27674 | 23095
means | 20163.5 | 11766.5 | 25458.75 | 22703
medians | 20263 | 11803 | 25533 | 22935
confidence_interval | 0.028503988445181 | 0.029647326634529 | 0.16003411426044 | 0.041365280701093
min | 19649 | 11498 | 23095 | 21847
max | 20479 | 11962 | 27674 | 23095
stddev | 361.24460780289 | 219.26163975184 | 2560.8224427581 | 590.26773586229
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9995099 | 20913 | 11952 | 24604 | 23716
9995100 | 19631 | 11039 | 23095 | 21062
9995101 |  |  |  |
9995102 |  |  |  |
means | 20272 | 11495.5 | 23849.5 | 22389
medians | 20272 | 11495.5 | 23849.5 | 22389
confidence_interval | 0.14229072923525 | 0.17870145527142 | 0.14236234682501 | 0.26671743115415
min | 19631 | 11039 | 23095 | 21062
max | 20913 | 11952 | 24604 | 23716
stddev | 906.51089348115 | 645.58849122332 | 1067.0241328105 | 1876.6613972691
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9994908 | 20785 | 12163 | 23095 | 20517
9994909 | 20093 | 11586 | 23095 | 22360
9994910 | 20785 | 11861 | 23095 | 21847
9994911 | 20324 | 12112 | 23095 | 20517
means | 20496.75 | 11930.5 | 23095 | 21310.25
medians | 20554.5 | 11986.5 | 23095 | 21182
confidence_interval | 0.026852923897151 | 0.035325331114737 | 0 | 0.070149805133889
min | 20093 | 11586 | 23095 | 20517
max | 20785 | 12163 | 23095 | 22360
stddev | 345.94448013133 | 264.89557691035 |  | 939.60395025422
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled
-- | -- | -- | -- | --
9995108 | 20755 | 12015 | 27674 | 23095
9995109 | 21309 | 12721 | 27674 | 23095
9995110 | 21016 | 12134 | 23095 | 21847
9995111 | 20755 | 12249 | 27674 | 23095
means | 20958.75 | 12279.75 | 26529.25 | 22783
medians | 20885.5 | 12191.5 | 27674 | 23095
confidence_interval | 0.020035367673363 | 0.040072637108334 | 0.13730484276789 | 0.043575648509854
min | 20755 | 12015 | 23095 | 21847
max | 21309 | 12721 | 27674 | 23095
stddev | 263.93228298183 | 309.29099027723 | 2289.5 | 624
Judging from the various experiments we tried, the overhead of shift3 on x86 isn't very significant.
-Xmx3200m -Xms3200m -Xmn1200m
Shift0
RUN RESULT: hbIR (max attempted) = 13837, hbIR (settled) = 11548, max-jOPS = 10516, critical-jOPS = 2144
RUN RESULT: hbIR (max attempted) = 12302, hbIR (settled) = 11859, max-jOPS = 10088, critical-jOPS = 2224
RUN RESULT: hbIR (max attempted) = 13837, hbIR (settled) = 11548, max-jOPS = 10793, critical-jOPS = 2135
RUN RESULT: hbIR (max attempted) = 13837, hbIR (settled) = 11548, max-jOPS = 10931, critical-jOPS = 2154
Average max-jOPS = 10582
Average critical-jOPS = 2164.25
Shift3
RUN RESULT: hbIR (max attempted) = 12302, hbIR (settled) = 11859, max-jOPS = 10703, critical-jOPS = 2161
RUN RESULT: hbIR (max attempted) = 11859, hbIR (settled) = 11489, max-jOPS = 10673, critical-jOPS = 2080
RUN RESULT: hbIR (max attempted) = 11548, hbIR (settled) = 11430, max-jOPS = 10278, critical-jOPS = 2129
RUN RESULT: hbIR (max attempted) = 11548, hbIR (settled) = 10959, max-jOPS = 10278, critical-jOPS = 2071
Average max-jOPS = 10483
Average critical-jOPS = 2110.25
-Xmx3200m -Xms3200m -Xmn2600m -Xjit:scratchSpaceLimit=2048000,acceptHugeMethods -Xgcpolicy:gencon -Xcompressedrefs -XXgc:forcedShiftingCompressionAmount=0
SPECjbb2005 bops = 72214
SPECjbb2005 bops = 72204
SPECjbb2005 bops = 72304
SPECjbb2005 bops = 72690
SPECjbb2005 bops = 71765
Average bops = 72235.4
With -XXgc:forcedShiftingCompressionAmount=3:
SPECjbb2005 bops = 69579
SPECjbb2005 bops = 69860
SPECjbb2005 bops = 70412
SPECjbb2005 bops = 70350
SPECjbb2005 bops = 69376
Average bops = 69915.4
3.3% throughput drop
FYI @vijaysun-omr @zl-wang @mpirvu
@harryyu1994 could you please make a summary of all the experiments that were tried? It seems that only Power sees more than 2% regression from the move to shift3
It is within expectation. I remember the overhead was about 2.5% when we did these experiments previously.
@vijaysun-omr I have all the results listed here, will be waiting for your final call on this.
@zl-wang I am worried by the high throughput loss on Power still (in excess of 3%). I don't know if you can afford to slow down everything 3+% inside OpenShift on Power. While I agree we used to have an overhead of approximately 2-3% on all platforms previously due to the shift, we now find that the overhead on other platforms (X86 has more data shown than Z) is lower. Can you please try the same on your Open Liberty setup?
@vijaysun-omr i will give DT7/OpenLiberty a spin next as I talked to @harryyu1994
shift0 average throughput: 2798/s
shift3 average throughput: 2757/s
The gap is about 1.5%.
However, the fluctuation within the same run could be as big as 3-4%. I haven't investigated why it is not as stable as my older driver; this one is the July 29 build from the Adopt site, as Harry suggested a recent build.
Can we try to get a Daytrader7 run done on Z as well so that we have more than just that one data point?
@zl-wang fluctuation of 3-4% is high enough that we don't know if the overhead is in the 3% range on Power in this case as well. Ideally we should try to understand what is different on Power before going ahead but I am okay with delivering the change to make things portable wrt compressed refs with the general approach taken in this design first and then work out how to make the situation better on Power as a continuing effort past that initial delivery.