Openj9: Agent support for exception catch information

Created on 19 Oct 2018  路  62Comments  路  Source: eclipse/openj9

This Epic tracks the work to enable agents to request exception catch information without the JVM going into Full Speed Debug (FSD) mode. Most of the work here is on the VM side but there are JIT portions as well to ensure we have a stack map at the start of every exception catch block and to ensure we don't enable throw to goto transformations when this mode is enabled.

Other than the obvious desire to enable as many features as possible without going into FSD, the motivation behind this work is the upcoming cooperation with OverOps - their agent needs to get exception events, but can not tolerate the overhead of FSD being imposed on production systems.

Epic jit vm

Most helpful comment

Many thanks to everyone who pushed this along, particularly @0dvictor for the common stack map code and initial prototype on X, and @fjeremic for the greatly-simplified platform code and general diligence throughout this process.

All 62 comments

@gacholio I've opened up the Epic to track the work but I'm not too familiar with the VM portion or the high level. I've written down what I could but I encourage you to add some more detail here if possible. I'll open up the JIT issues for us to tackle against this Epic. I've also tagged it for 0.12.0 as I asumme the work will be done for the next release. @pshipton FYI.

The main issue I ran into when doing the VM side of the work was that bytecode PC information is stored in the JIT exception handler data only in FSD mode.

  • jitExceptionHandlerSearch updates userData1 with the bytecode PC if present
  • getJitPCOffsetFromExceptionHandler assumes the bytecode PC data is there

Both of the above are used in the JVMTI Exception event (i.e. exception throw) to identify the catching location.

The JVMTI ExceptionCatch event already relies on the stack map at the catch point and gets the bytecode PC information from getCurrentByteCodeIndexAndIsSameReceiver. This works in FSD, but not in normal mode.

https://github.com/eclipse/openj9/pull/3368 changes the VM to always use the stack map approach, obviating the need for bytecode PC information in the JIT handler data completely. getJitPCOffsetFromExceptionHandler is no longer called from anywhere in the codebase, and the userData1 value from jitExceptionHandlerSearch is ignored.

The above change is safe to make today, given that ExceptionCatch forces FSD.

I modified the JIT not to force FSD for ExceptionCatch, and ran into the two issues that need to be addressed in the JIT:

1) Platform-dependent code tests FSD mode before generating stack maps at catch points (J9TreeEvalutator -> VMgenerateCatchBlockBBStartPrologue on X, unsure of other platforms)

2) Common code tests FSD mode and merges stack maps with different bytecode PCs (MetaData.cpp -> mapsAreIdentical)

Forcing the FSD test to true for both of the above gives me the results I want, but we don't need to completely stop merging stack maps. All I require is that I can always (regardless of FSD) locate an accurate stack map at the catch point which contains the correct bytecode PC.

Once the above changes are made on all platforms, I will commit the change to J9Options.cpp to prevent ExceptionCatch forcing FSD (and another related change to initialize a required function pointer in the jitConfig).

Thank you very much for the detailed writeup @gacholio. It's great to see a design forging. @andrewcraik @gita-omr @0dvictor FYI GAC's comment above.

3389 can be used to test the JIT changes. Once #3386 has been merged, the decomp003-NON-FSD test will fail if the JIT changes are not correct.

To run it:

java -Xjit:count=0,limit={*.jit*},tryToInline={*.jit*} -agentlib:jvmtitest=test:decomp003,args:nostep -cp /path/to/jvmtitest.jar com.ibm.jvmti.tests.util.TestRunner

If we want to completely remove any impact on normal production builds, the changes need only apply if the exception catch/throw events are hooked. An easy way to determine this is to check TR_DisableThrowToGoto instead of TR_FullSpeedDebug.

@gacholio Is there an API on the VM we can query - TR_disableThrowToGoto is used to control opt and there are Xjit options to enable/disable that opt which should not change the GC Map layout in cases where throwToGoto may be turned off for debug in a non-FSD mode?

You could query the hook status (See #3389), or we could add a field in the JITConfig and set it where we disable throwToGoto. The check suggested above is a bit hacky, I agree, though I'm not sure there's any real scenario where throwToGoto would be disabled other than the hooks being active.

I'd really like to have this closed out this week. Any comments on the above strategies?

3462 and #3463 introduce the support of this Issue for X86 CodeGen and common infrastructure. This first draft queries TR_disableThrowToGoto for whether to preserve bytecode info; it will be updated once we decide the proper API.

@andrewcraik - can you please comment on the new field in the jitConfig? I'd like to fix this and get t he changs from @0dvictor updated and merged.

@0dvictor Did you try the test with your changes and mine?

Did you try the test with your changes and mine?

Yes, I tested the changes in the way you described at https://github.com/eclipse/openj9/issues/3373#issuecomment-431527572, which includes #3389, #3462 and #3463.

@andrewcraik - can you please comment on the new field in the jitConfig? I'd like to fix this and get t he changs from @0dvictor updated and merged.

@0dvictor Did you try the test with your changes and mine?

I'm heavily in favor of not going the jitConfig route. The information is already available via the J9 hooks as was done in #3389. I don't think we should duplicate the information.

What about introducing a query in J9::Options, namely ReportByteCodeInfoAtCatchBlock?

Agreed, since there is no need to externalize this outside of the JIT codebase, a TR_ option or equivalent seems sensible, and can be set in the hook disable code.

~I've created a _hookStatus field in J9::Options, which allows us to cache and query the status of up to 32 different JVM hooks. Changes are in https://github.com/eclipse/openj9/pull/3463/files~

Many thanks to everyone who pushed this along, particularly @0dvictor for the common stack map code and initial prototype on X, and @fjeremic for the greatly-simplified platform code and general diligence throughout this process.

Just a heads up - I just saw a failure of the new test (decomp003-NONFSD) on AIX, which I have been unable to reproduce manually:

[OUT] *** Testing [1/1]:       testExceptions
 [OUT] Successfully caught exception in outer frame
 [OUT] FAIL: Catcher not detected in JVMTI
 [OUT] Successfully caught exception in shallowest inline frame
 [OUT] FAIL: Catcher not detected in JVMTI
 [OUT] Successfully caught exception in deepest inline frame
 [OUT] FAIL: Catcher not detected in JVMTI
 [OUT] *** Test took 12 milliseconds
 [OUT] FAILED
 [OUT] 
 [ERR] exception thrown: 
 [ERR] type: 2type: 2type: 2    Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR]  Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR]  Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR] 
 [ERR] com.ibm.jvmti.tests.util.AgentHardException: 
 [ERR] type: 2type: 2type: 2    Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR]  Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR]  Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR] 
 [ERR]  at com.ibm.jvmti.tests.util.ErrorControl.checkErrors(ErrorControl.java:71)
 [ERR]  at com.ibm.jvmti.tests.util.TestSuite.run(TestSuite.java:87)
 [ERR]  at com.ibm.jvmti.tests.util.TestRunner.main(TestRunner.java:60)
>> Success condition was not found: [Return code: 0]

Interesting. I don't see a reason why it wouldn't fail on plinux given they share the same code. @gita-omr FYI in case we need help to take a look. The common piece of code was delivered in #3598.

I'll keep an eye out for similar failures, and maybe improve the test a bit (I've done exactly what I always complain about - hidden any useful debug information when the test fails).

At a guess, the problem would most likely be that a catch map got merged away. I'm not clear on what was done to prevent that.

I'm not clear on what was done to prevent that.

The code @0dvictor delivered as part of #3463 and more specifically the line in [1] checks the same query used to generate the stack map as delivered in #3598. So _in theory_ we should not have merged GC maps belonging to catch blocks.

[1] https://github.com/eclipse/openj9/pull/3463/files#diff-4e3b1a3c145fc0a879cf6bd655b245ffR1336

3463 allows merging the catch map if the previous one has identical bytecode info and GC info. Would that cause any issue? You can remove mapCursor->isByteCodeInfoIdenticalTo(nextMapCursor) || from following line to rule it out.

https://github.com/eclipse/openj9/blob/22e0009fda531be602d728ea128f4ad9b4d289fd/runtime/compiler/runtime/MetaData.cpp#L706

I don't think that should cause an issue - if I can locate a map with the correct BCI using the catch address (well, catch address + 1 techinically), then we're fine. I assume when merging maps, the merged map will certainly cover all of the instructions covered by both maps.

Yes, since the two maps are merged only if they have identical bytecode info and GC info.

Yes map merging just eliminates one of the two maps to allow one map to cover a larger range. Since the BCI is being considered as part of the comparison I believe it should suit the purpose fine. You will get the info you need and be none the wiser how the JIT chose to deliver that information to you (eg by map merging).

@gacholio were you able to reproduce the AIX failure above?

I can only see setReportByteCodeInfoAtCatchBlock under FSD. Was not the idea to enable it without FSD?
Also, I think we wanted to disable throw->goto optimizations. I am not sure I seen that in all the commits.

10K runs on AMD64 with no errors - will try some on P.

@gita-omr Look in #3389

Thanks. Was looking at slightly older code...

10K runs on Linux PPC LE with no issues. 15 failures in 10K runs on AIX.

The failure rate is low enough that the results can't really be trusted. I'll instrument the test to gather core at the point of failure so we can see what the stack maps look like.

One question occurs - the way @0dvictor detects the unmergeable maps - is it possible that the stack map might migrate during optimizations, and would not be detected in the unmergeable map test?

It's a good question. Even without merging, how accurate BCI info really is in GC maps, especially in non-FSD mode? I think we get BCI from JIT instruction's node, which might be absent or not accurate at all.

FYI, there is an outstanding OMR Issue about the inaccuracy of a node's bytecode info: https://github.com/eclipse/omr/issues/3153

Thanks. But even if a node has a correct BCI, it can be moved around or eliminated.

Just talked to @andrewcraik and he said we ensure BCI is accurate for catch blocks even in non-FSD mode.

Note that eclipse/omr#3153 is looking at a slightly different angle on the correctness issue.

Instrumented the test - the problem is at the throw, not the catch:

[ERR] throw: 10021b00de0 0 catch: 10021b00de0 6

The thrower has detected the wrong catch PC.

The code to determine the catching method and pc is at:

https://github.com/eclipse/openj9/blob/master/runtime/codert_vm/CodertVMHelpers.cpp#L323

The method is correct, but getCurrentByteCodeIndexAndIsSameReceiver is not returning the correct location.

The particular failure in this case is in jitTest1, which throws from an inlined frame to a catch in the outer frame.

we need to make sure we mark the throw map as requiring the BCI to match for merging then to make sure both sides show the right bytecode info if I read this right...

Not quite - the throw event reports the method/pc of both the thrower and the likely catcher - I'm only testing the catcher information (ensuring the throw and catch events report the same catch point).

At this point I think we'll need a log for further investigation. Given the above test it seems only one method is being compiled and as I understand the test runs very quickly at -Xjit:count=0, so the compilation will be synchronous and deterministic, meaning logging won't affect failure rate (in theory).

My suggestion is to attempt to reproduce the problem while logging the method via

-Xjit:{the/method/with/bad/catch.bci()}(tracefull,log=log.trace)

I'm trying with this:

-Xjit:tracefull,log=log,count=0,limit={*.jit*},tryToInline={*.jit*}

I believe the tracefull,log=log has to be part of a method regex, so this would be the correct way:

-Xjit:{*}(tracefull,log=log),count=0,limit={*.jit*},tryToInline={*.jit*}

I've updated the command line and will grind the test.

@andrewcraik @gita-omr Failures with logs provided in the interlock slack. The symptom seems the same - in the failing tests, every throw in the test gets the correct method, but 0 for the bytecode index for the catcher. The catcher itself has the correct information.

I dumped internal stack traces at throw and catch which seem reasonable. At the catch point, the index is correct, so this may be down to how the code I use to find the catcher in the throw event differs from the stack walker (which is where the catch event retrieves its data).

@andrewcraik @0dvictor Further to our phone discussion, the BCI of the stack map at any throw point also needs to be correct (even outside of FSD). This includes explicit throws via the athrow bytecode, and any implicitly-thrown exceptions from NPE traps, resolves, etc.

I'm going to reopen this item while we investigate.

Just wanted to clarify...
There are two call backs - cbException and cbExceptionCatch. Both return catcher鈥檚 method and location. We are tying to test if they are the same. They are not, but which one is incorrect? The catch location returned by cbException? Does cbException use GC maps as well?

This is (I believe) explained in the comments above. Exception (e.g. throw) is incorrect, ExceptionCatch is correct. Both use the JIT metadata to get the locations, but for some reason, the throw version of the code is not getting the correct info - see https://github.com/eclipse/openj9/issues/3373#issuecomment-438381311.

I've reviewed the code, and it seems reasonable.

I'd welcome some volunteers to see if my simplified exception code in:

https://github.com/eclipse/openj9/blob/master/runtime/codert_vm/CodertVMHelpers.cpp#L323

called from:

https://github.com/eclipse/openj9/blob/master/runtime/jvmti/jvmtiHook.c#L1314

should be equivalent to the stack walk code in:

https://github.com/eclipse/openj9/blob/master/runtime/codert_vm/jswalk.c#L200

called from:

https://github.com/eclipse/openj9/blob/master/runtime/jvmti/jvmtiHook.c#L1886

@DanHeidinga @andrewcraik @fjeremic

Note the magic +1 for the handler PC is taken care of in the stackwalk case by:

https://github.com/eclipse/openj9/blob/master/runtime/codert_vm/jswalk.c#L376

The jvmtiHook.c and jswalk.c implementations look identical to me.

@gacholio The old code uses getFirstInlinedCallSite & getJitInlineDepthFromCallSite to loop over the inlined call sites. The new code only uses the getFirstInlinedCallSite with no loop.

Intentional?

@DanHeidinga Yes, the catch code is only interested in the top method in the map, since that will be the catcher. The stack walker needs to treat all of the methods in the map.

I don't believe so - it's the location that's causing issues.

Not disagreeing about the location issue, but I think there's also a bug related to the method here. If the catcher isn't the first inline, then we're returning the wrong method to the callback.

The catcher is necessarily the first inline, assuming the map I get back based on the catcher PC is correct.

The stackwalk isn't walking inlines so this line https://github.com/eclipse/openj9/blob/82275daf0c7145f8ce63878ca43961b8b2e17039/runtime/codert_vm/CodertVMHelpers.cpp#L334 should return the data for the method that has inlined the catcher?

The new code should have to loop through the inlines to find the appropriate one, no?

Interesting - the catch hook is not walking inlines, as you say, which would seem to be a bug, thought it passes the test. I'll look into this.

Stop confusing me - inlines are walked by default, and the catch code is not disabling them, so that's why it works.

50K runs on AIX with no issues, so I believe #3687 fixes this. Thanks to @gita-omr and @andrewcraik for the quick analysis of the data once I managed to gather it. Special thanks to Andrew for being able to detect the difference in the stack walk and catch code that the rest of us managed to repeatedly miss.

Was this page helpful?
0 / 5 - 0 ratings