When running AcmeAir on lara with these options:
-Xjit:limit={java/lang/StringBuilder.lengthInternal*},verbose,initialOptLevel=hot,dontDowngradeToCold,scorchingSampleThreshold=0 -Xshareclasses:none -Xms512m -Xmx512m
I get the following crash:
Unexpected character (o) at position 0.
at org.json.simple.parser.Yylex.yylex(Yylex.java:610)
at org.json.simple.parser.JSONParser.nextToken(JSONParser.java:269)
at org.json.simple.parser.JSONParser.parse(JSONParser.java:118)
at org.json.simple.parser.JSONParser.parse(JSONParser.java:81)
at org.json.simple.parser.JSONParser.parse(JSONParser.java:75)
at com.acmeair.jmeter.functions.UpdateCustomerFunction.processJSonString(UpdateCustomerFunction.java:81)
at com.acmeair.jmeter.functions.UpdateCustomerFunction.execute(UpdateCustomerFunction.java:52)
at org.apache.jmeter.engine.util.CompoundVariable.execute(CompoundVariable.java:142)
at org.apache.jmeter.engine.util.CompoundVariable.execute(CompoundVariable.java:118)
at org.apache.jmeter.testelement.property.FunctionProperty.getStringValue(FunctionProperty.java:101)
at org.apache.jmeter.testelement.AbstractTestElement.getPropertyAsString(AbstractTestElement.java:274)
at org.apache.jmeter.config.Argument.getValue(Argument.java:146)
at org.apache.jmeter.protocol.http.util.HTTPArgument.getEncodedValue(HTTPArgument.java:236)
at org.apache.jmeter.protocol.http.util.HTTPArgument.getEncodedValue(HTTPArgument.java:217)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.sendPostData(HTTPHC4Impl.java:1113)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.handleMethod(HTTPHC4Impl.java:453)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.sample(HTTPHC4Impl.java:329)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerProxy.sample(HTTPSamplerProxy.java:74)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1146)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1135)
at org.apache.jmeter.threads.JMeterThread.process_sampler(JMeterThread.java:434)
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:261)
at java.lang.Thread.run(Thread.java:812)
However, when I change initialOptLevel=hot to optLevel=hot, the crash goes away, so this is probably caused by a recompilation issue.
Although the verbose log only shows one compilation, enabling traceFull produces two logs, which further supports this idea.
Can you please specify which architecture you're running on? This is a very tiny method [1] so it will be very obvious if we screwed something up in there.
It's on x86-64
@andrewcraik FYI. @dmitry-ten can you please provide a trace log of the method?
Which log is which?
The only difference between these two is that lengthInternal.1.txt is a hot compile with GCR while lengthInternal.2.txt is a plain hot compile. lengthInternal.2.txt is a trivial indirect load from the parameter register and a return. lengthInternal.1.txt is the same, but with the added complexity of GCR trees to trip a recompile. If lengthInternal.1.txt is causing the problem, try running with disableGuardedCountingRecompilation and see if it starts passing - if so, there may be some weird interaction between the mainline and the recomp helper.
Running with disableGuardedCountingRecompilation made the crash disappear.
So, I asked @mpirvu, and he said that we shouldn't even have GCR trees in a hot compilation.
Not sure if this is related to the problem.
The GCR trees are inserted because of this code
else if (jitConfig->javaVM->phase != J9VM_PHASE_NOT_STARTUP &&
details.isOrdinaryMethod() &&
!options->getOption(TR_DisableNoServerDuringStartup))
{
options->setOption(TR_NoOptServer);
reducedWarm = true;
// These guys should be compiled with GCR hooks so that we get the throughput back
options->setInsertGCRTrees();
}
that attempts to do cheaper compilations during start-up. This is not a common scenario because the JIT refrains as much as possible from doing hot compilations during start-up; however, it is possible. GCR is not well tested in the context of hot compilations, but there is no reason why it shouldn't work.
Still haven't figured out the problem, but, apparently, all GCR compilations will always be recompiled at warm, according to this code in control/CompilationController.cpp:
if (recomp->_bodyInfo->getUsesGCR())
{
level = warm; // GCR recompilations should be performed at warm
// If a GCR count was specified, used that
if (options->getGCRCount() > 0)
{
countValue = options->getGCRCount();
}
else // find the count corresponding to the warm level (or next available)
{
countValue = options->getCountValue(mayHaveLoops, (TR_Hotness) level);
if (countValue < 0)
{
// Last resort: use some sensible values
countValue = mayHaveLoops ? options->getInitialBCount() : options->getInitialCount();
}
}
}
So it doesn't make sense to have GCR enabled at levels above cold.
I found the reason for this crash. This is the original IL:

n47n bstore <gcr-patch-point>[#565 Static] [flags 0x400301 0x40 ] [0x7fc849947a40] bci=[-1,0,1170] rc=0 vc=0 vn=- li=- udi=- nc=1
n46n   bconst 2 [0x7fc8499479f0] bci=[-1,0,1170] rc=1 vc=0 vn=- li=- udi=- nc=0

After PRE:

n47n istore <gcr-patch-point>[#565 Static] [flags 0x400301 0x40 ] [0x7fc849947a40] bci=[-1,0,1170] rc=0 vc=156 vn=- li=6 udi=- nc=1
n46n   bconst 2 (X!=0 X>=0 ) [0x7fc8499479f0] bci=[-1,0,1170] rc=1 vc=156 vn=- li=-1 udi=- nc=0 flg=0x104

PRE rewrites the bstore as an istore of the same byte constant, so the constant 2 ends up written to the wrong address.
I verified that with disablePRE the crash goes away.
Maybe someone who knows more about this optimization can help figure out why it does this transformation.
So I wasn't following this, but we need a traceFull,tracePRE log. The most worrying thing here is that the O^O transformations do not mention the bstore, and yet the bstore is converted to an istore without checking the child constant types. Likely a nasty issue. @cathyzhyi could you take a look - it shouldn't be a hard fix once you find the right spot in PRE.
If 9470 is merged it should be undone once this is fixed.
@dmitry-ten could you get a log with tracePRE and traceFull?
I see that the problem appears after PRE node numbering.
I think the problem is here:
https://github.com/eclipse/omr/blob/b8634d51d53c1018a8abe934a50fd35fa7df514f/compiler/optimizer/LocalAnalysis.cpp#L541-L545
The API opCodeForDirectLoad is meant for field shadow only. We should be calling the version in the OMR layer for this particular case.
It is using the version in the OMR layer. That code is in OMR. The fact that OpenJ9 overrides that API and returns something different is another story.
Actually, because IL is an extensible class, we are calling the OpenJ9 implementation here.
Right, I understand that is what is happening in OpenJ9; however, looking at it from the OMR perspective (which is where the code lives), the code is correct. It asks the IL class: given this data type, what is the opcode to do a direct load? OpenJ9 overrides this API, so when the same code is executed within a Java method compilation it returns something different.
How do you propose the OMR code should be altered? I don't see anywhere in the codebase where we call OMR::IL::opCodeForDirectLoad explicitly.
So I think this could go one of two ways. One is that the OpenJ9 override is not warranted, and whatever cases it was papering over will need to be 'fixed' in OpenJ9 so the override can be removed and this opt works correctly. The other is that the override is warranted, but the API is not sufficiently rich to let us determine the correct conversion (e.g. in this case we want the matching type, not a widened type). This latter case would be a problem in OMR's API design that would need to be addressed.
Now both of these are going to take some time, and this is causing a real functional issue for a major consumer of OMR downstream, so if these are the real solutions under consideration, we also need to consider the right tactical fix. I don't know if that fix belongs in OMR or OpenJ9, but I think we need to be open to something not too tasteful in the short term, pending a proper resolution which may be a fair bit more invasive.
@cathyzhyi is going to look at the override and see if it is warranted etc. Once she reports back we can decide on a plan of action.
So here is the current state of the opCodeFor APIs:

- The opCodeFor(In)directArrayLoad/store and opCodeForCorresponding(In)directStore/Load APIs are only used in PRE and FieldPrivatizer, and were created because of the array bload/bstorei opcodes. However, I feel this group of APIs is redundant; we could have just used the non-array APIs, since array shadows are easy to tell apart.
- The opCodeFor*store/load APIs currently all do "widening to int" magic.

I think the main point of the problem is that, putting array access aside, there is currently no distinction between callers that need the exact-type opcode and callers that want the widened one: optimizations are calling the int-widening version, which is functionally correct most of the time, until we hit the current issue.
My proposal:

- Keep the opCodeFor* APIs but don't do the widening magic in them, because the APIs don't have enough information.
- Add a new API, dataTypeToNodeType(TR::DataType), that does the widening.
- Remove the opCodeFor(In)directArrayLoad/store APIs.
- Callers that want the widened type call dataTypeToNodeType before passing to opCodeFor*store/load; callers that want the exact type call opCodeFor*store/load directly.

@andrewcraik Hey Andrew, could you share your thoughts on this?
@cathyzhyi do you see the new dataTypeToNodeType living in OMR? If so we need to discuss this with the OMR community before we can settle on a design. I guess the uses of dataTypeToNodeType will be primarily confined to ilgen - do you see a need for it in any optimizations?
The use of the other APIs seems reasonable. I guess my main concern with the added complexity of having to call the right API for the right origin is whether we can design the API to minimize misuse - e.g. dataTypeToNodeType should take a symref or something like that rather than just a type, so that it is hard to pass in the type from a node.
Overall I think it seems reasonable once we flush out the above details.
do you see the new dataTypeToNodeType living in OMR? If so we need to discuss this with the OMR community before we can settle on a design. I guess the uses of dataTypeToNodeType will be primarily confined to ilgen - do you see a need for it in any optimizations?
Actually, there are some cases where opts cook up nodes from symbols. EA and LoopVersioner do that. Now I realize it's more common than I thought. Given this, I think we need to discuss with the OMR community as well.
I guess my main concern with the added complexity of having to call the right API for the right origin is if we can design the API to minimize misuse
Yes, I agree. However, some use cases might use the descriptor string to determine the type, as in this case: https://github.com/eclipse/openj9/blob/4cbf03a3e89509b42424a4e3fa768f9c896ffe5f/runtime/compiler/ilgen/Walker.cpp#L5695. Currently, I've only found one such use.