Openj9: Agent support for exception catch information

Created on 19 Oct 2018 · 62Comments · Source: eclipse/openj9

This Epic tracks the work to enable agents to request exception catch information without the JVM going into Full Speed Debug (FSD) mode. Most of the work here is on the VM side but there are JIT portions as well to ensure we have a stack map at the start of every exception catch block and to ensure we don't enable throw to goto transformations when this mode is enabled.

Other than the obvious desire to enable as many features as possible without going into FSD, the motivation behind this work is the upcoming cooperation with OverOps - their agent needs to get exception events, but can not tolerate the overhead of FSD being imposed on production systems.

Epic jit vm

Source

fjeremic

Most helpful comment

Many thanks to everyone who pushed this along, particularly @0dvictor for the common stack map code and initial prototype on X, and @fjeremic for the greatly-simplified platform code and general diligence throughout this process.

gacholio on 8 Nov 2018

👍4

All 62 comments

@gacholio I've opened up the Epic to track the work but I'm not too familiar with the VM portion or the high level. I've written down what I could but I encourage you to add some more detail here if possible. I'll open up the JIT issues for us to tackle against this Epic. I've also tagged it for 0.12.0 as I asumme the work will be done for the next release. @pshipton FYI.

fjeremic on 19 Oct 2018

The main issue I ran into when doing the VM side of the work was that bytecode PC information is stored in the JIT exception handler data only in FSD mode.

jitExceptionHandlerSearch updates userData1 with the bytecode PC if present
getJitPCOffsetFromExceptionHandler assumes the bytecode PC data is there

Both of the above are used in the JVMTI Exception event (i.e. exception throw) to identify the catching location.

The JVMTI ExceptionCatch event already relies on the stack map at the catch point and gets the bytecode PC information from getCurrentByteCodeIndexAndIsSameReceiver. This works in FSD, but not in normal mode.

https://github.com/eclipse/openj9/pull/3368 changes the VM to always use the stack map approach, obviating the need for bytecode PC information in the JIT handler data completely. getJitPCOffsetFromExceptionHandler is no longer called from anywhere in the codebase, and the userData1 value from jitExceptionHandlerSearch is ignored.

The above change is safe to make today, given that ExceptionCatch forces FSD.

I modified the JIT not to force FSD for ExceptionCatch, and ran into the two issues that need to be addressed in the JIT:

1) Platform-dependent code tests FSD mode before generating stack maps at catch points (J9TreeEvalutator -> VMgenerateCatchBlockBBStartPrologue on X, unsure of other platforms)

2) Common code tests FSD mode and merges stack maps with different bytecode PCs (MetaData.cpp -> mapsAreIdentical)

Forcing the FSD test to true for both of the above gives me the results I want, but we don't need to completely stop merging stack maps. All I require is that I can always (regardless of FSD) locate an accurate stack map at the catch point which contains the correct bytecode PC.

Once the above changes are made on all platforms, I will commit the change to J9Options.cpp to prevent ExceptionCatch forcing FSD (and another related change to initialize a required function pointer in the jitConfig).

gacholio on 19 Oct 2018

👍1

Thank you very much for the detailed writeup @gacholio. It's great to see a design forging. @andrewcraik @gita-omr @0dvictor FYI GAC's comment above.

fjeremic on 19 Oct 2018

3389 can be used to test the JIT changes. Once #3386 has been merged, the decomp003-NON-FSD test will fail if the JIT changes are not correct.

To run it:

java -Xjit:count=0,limit={*.jit*},tryToInline={*.jit*} -agentlib:jvmtitest=test:decomp003,args:nostep -cp /path/to/jvmtitest.jar com.ibm.jvmti.tests.util.TestRunner

gacholio on 20 Oct 2018

If we want to completely remove any impact on normal production builds, the changes need only apply if the exception catch/throw events are hooked. An easy way to determine this is to check TR_DisableThrowToGoto instead of TR_FullSpeedDebug.

gacholio on 22 Oct 2018

@gacholio Is there an API on the VM we can query - TR_disableThrowToGoto is used to control opt and there are Xjit options to enable/disable that opt which should not change the GC Map layout in cases where throwToGoto may be turned off for debug in a non-FSD mode?

andrewcraik on 22 Oct 2018

You could query the hook status (See #3389), or we could add a field in the JITConfig and set it where we disable throwToGoto. The check suggested above is a bit hacky, I agree, though I'm not sure there's any real scenario where throwToGoto would be disabled other than the hooks being active.

gacholio on 22 Oct 2018

I'd really like to have this closed out this week. Any comments on the above strategies?

gacholio on 23 Oct 2018

3462 and #3463 introduce the support of this Issue for X86 CodeGen and common infrastructure. This first draft queries `TR_disableThrowToGoto` for whether to preserve bytecode info; it will be updated once we decide the proper API.

0dvictor on 26 Oct 2018

@andrewcraik - can you please comment on the new field in the jitConfig? I'd like to fix this and get t he changs from @0dvictor updated and merged.

@0dvictor Did you try the test with your changes and mine?

gacholio on 31 Oct 2018

Did you try the test with your changes and mine?

Yes, I tested the changes in the way you described at https://github.com/eclipse/openj9/issues/3373#issuecomment-431527572, which includes #3389, #3462 and #3463.

0dvictor on 31 Oct 2018

@andrewcraik - can you please comment on the new field in the jitConfig? I'd like to fix this and get t he changs from @0dvictor updated and merged.

@0dvictor Did you try the test with your changes and mine?

I'm heavily in favor of not going the jitConfig route. The information is already available via the J9 hooks as was done in #3389. I don't think we should duplicate the information.

fjeremic on 2 Nov 2018

What about introducing a query in J9::Options, namely ReportByteCodeInfoAtCatchBlock?

0dvictor on 2 Nov 2018

Agreed, since there is no need to externalize this outside of the JIT codebase, a TR_ option or equivalent seems sensible, and can be set in the hook disable code.

gacholio on 3 Nov 2018

~I've created a _hookStatus field in J9::Options, which allows us to cache and query the status of up to 32 different JVM hooks. Changes are in https://github.com/eclipse/openj9/pull/3463/files~

0dvictor on 5 Nov 2018

gacholio on 8 Nov 2018

👍4

Just a heads up - I just saw a failure of the new test (decomp003-NONFSD) on AIX, which I have been unable to reproduce manually:

[OUT] *** Testing [1/1]:       testExceptions
 [OUT] Successfully caught exception in outer frame
 [OUT] FAIL: Catcher not detected in JVMTI
 [OUT] Successfully caught exception in shallowest inline frame
 [OUT] FAIL: Catcher not detected in JVMTI
 [OUT] Successfully caught exception in deepest inline frame
 [OUT] FAIL: Catcher not detected in JVMTI
 [OUT] *** Test took 12 milliseconds
 [OUT] FAILED
 [OUT] 
 [ERR] exception thrown: 
 [ERR] type: 2type: 2type: 2    Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR]  Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR]  Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR] 
 [ERR] com.ibm.jvmti.tests.util.AgentHardException: 
 [ERR] type: 2type: 2type: 2    Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR]  Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR]  Error: 0  JVMTI_ERROR_NONE
 [ERR]          FAIL: Catching method or location inconsistent at catch point
 [ERR]          Location: com/ibm/jvmti/tests/decompResolveFrame/decomp003.c -> [cbExceptionCatch():95]
 [ERR] 
 [ERR]  at com.ibm.jvmti.tests.util.ErrorControl.checkErrors(ErrorControl.java:71)
 [ERR]  at com.ibm.jvmti.tests.util.TestSuite.run(TestSuite.java:87)
 [ERR]  at com.ibm.jvmti.tests.util.TestRunner.main(TestRunner.java:60)
>> Success condition was not found: [Return code: 0]

gacholio on 8 Nov 2018

Interesting. I don't see a reason why it wouldn't fail on plinux given they share the same code. @gita-omr FYI in case we need help to take a look. The common piece of code was delivered in #3598.

fjeremic on 8 Nov 2018

I'll keep an eye out for similar failures, and maybe improve the test a bit (I've done exactly what I always complain about - hidden any useful debug information when the test fails).

gacholio on 8 Nov 2018

At a guess, the problem would most likely be that a catch map got merged away. I'm not clear on what was done to prevent that.

gacholio on 8 Nov 2018

I'm not clear on what was done to prevent that.

The code @0dvictor delivered as part of #3463 and more specifically the line in [1] checks the same query used to generate the stack map as delivered in #3598. So _in theory_ we should not have merged GC maps belonging to catch blocks.

[1] https://github.com/eclipse/openj9/pull/3463/files#diff-4e3b1a3c145fc0a879cf6bd655b245ffR1336

fjeremic on 8 Nov 2018

3463 allows merging the catch map if the previous one has identical bytecode info and GC info. Would that cause any issue? You can remove `mapCursor->isByteCodeInfoIdenticalTo(nextMapCursor) ||` from following line to rule it out.

https://github.com/eclipse/openj9/blob/22e0009fda531be602d728ea128f4ad9b4d289fd/runtime/compiler/runtime/MetaData.cpp#L706

0dvictor on 8 Nov 2018

I don't think that should cause an issue - if I can locate a map with the correct BCI using the catch address (well, catch address + 1 techinically), then we're fine. I assume when merging maps, the merged map will certainly cover all of the instructions covered by both maps.

gacholio on 8 Nov 2018

Yes, since the two maps are merged only if they have identical bytecode info and GC info.

0dvictor on 8 Nov 2018

👍1

Yes map merging just eliminates one of the two maps to allow one map to cover a larger range. Since the BCI is being considered as part of the comparison I believe it should suit the purpose fine. You will get the info you need and be none the wiser how the JIT chose to deliver that information to you (eg by map merging).

andrewcraik on 9 Nov 2018

@gacholio were you able to reproduce the AIX failure above?

gita-omr on 9 Nov 2018

I can only see setReportByteCodeInfoAtCatchBlock under FSD. Was not the idea to enable it without FSD?
Also, I think we wanted to disable throw->goto optimizations. I am not sure I seen that in all the commits.

gita-omr on 9 Nov 2018

10K runs on AMD64 with no errors - will try some on P.

@gita-omr Look in #3389

gacholio on 9 Nov 2018

Thanks. Was looking at slightly older code...

gita-omr on 9 Nov 2018

10K runs on Linux PPC LE with no issues. 15 failures in 10K runs on AIX.

gacholio on 10 Nov 2018

The failure rate is low enough that the results can't really be trusted. I'll instrument the test to gather core at the point of failure so we can see what the stack maps look like.

gacholio on 11 Nov 2018

One question occurs - the way @0dvictor detects the unmergeable maps - is it possible that the stack map might migrate during optimizations, and would not be detected in the unmergeable map test?

gacholio on 11 Nov 2018

It's a good question. Even without merging, how accurate BCI info really is in GC maps, especially in non-FSD mode? I think we get BCI from JIT instruction's node, which might be absent or not accurate at all.

gita-omr on 11 Nov 2018

FYI, there is an outstanding OMR Issue about the inaccuracy of a node's bytecode info: https://github.com/eclipse/omr/issues/3153

0dvictor on 12 Nov 2018

Thanks. But even if a node has a correct BCI, it can be moved around or eliminated.

gita-omr on 12 Nov 2018

Just talked to @andrewcraik and he said we ensure BCI is accurate for catch blocks even in non-FSD mode.

gita-omr on 12 Nov 2018

👍1

Note that eclipse/omr#3153 is looking at a slightly different angle on the correctness issue.

andrewcraik on 12 Nov 2018

Instrumented the test - the problem is at the throw, not the catch:

[ERR] throw: 10021b00de0 0 catch: 10021b00de0 6

The thrower has detected the wrong catch PC.

The code to determine the catching method and pc is at:

https://github.com/eclipse/openj9/blob/master/runtime/codert_vm/CodertVMHelpers.cpp#L323

The method is correct, but getCurrentByteCodeIndexAndIsSameReceiver is not returning the correct location.

gacholio on 13 Nov 2018

The particular failure in this case is in jitTest1, which throws from an inlined frame to a catch in the outer frame.

gacholio on 13 Nov 2018

we need to make sure we mark the throw map as requiring the BCI to match for merging then to make sure both sides show the right bytecode info if I read this right...

andrewcraik on 13 Nov 2018

Not quite - the throw event reports the method/pc of both the thrower and the likely catcher - I'm only testing the catcher information (ensuring the throw and catch events report the same catch point).

gacholio on 13 Nov 2018

The test code:

https://github.com/eclipse/openj9/blob/master/runtime/tests/jvmtitests/src/com/ibm/jvmti/tests/decompResolveFrame/decomp003.c
https://github.com/eclipse/openj9/blob/master/test/functional/cmdLineTests/jvmtitests/src/com/ibm/jvmti/tests/decompResolveFrame/decomp003.java

gacholio on 13 Nov 2018

At this point I think we'll need a log for further investigation. Given the above test it seems only one method is being compiled and as I understand the test runs very quickly at -Xjit:count=0, so the compilation will be synchronous and deterministic, meaning logging won't affect failure rate (in theory).

My suggestion is to attempt to reproduce the problem while logging the method via

-Xjit:{the/method/with/bad/catch.bci()}(tracefull,log=log.trace)

fjeremic on 13 Nov 2018

I'm trying with this:

-Xjit:tracefull,log=log,count=0,limit={*.jit*},tryToInline={*.jit*}

gacholio on 13 Nov 2018

I believe the tracefull,log=log has to be part of a method regex, so this would be the correct way:

-Xjit:{*}(tracefull,log=log),count=0,limit={*.jit*},tryToInline={*.jit*}

fjeremic on 13 Nov 2018

I've updated the command line and will grind the test.

gacholio on 13 Nov 2018

👍1

@andrewcraik @gita-omr Failures with logs provided in the interlock slack. The symptom seems the same - in the failing tests, every throw in the test gets the correct method, but 0 for the bytecode index for the catcher. The catcher itself has the correct information.

I dumped internal stack traces at throw and catch which seem reasonable. At the catch point, the index is correct, so this may be down to how the code I use to find the catcher in the throw event differs from the stack walker (which is where the catch event retrieves its data).

gacholio on 14 Nov 2018

@andrewcraik @0dvictor Further to our phone discussion, the BCI of the stack map at any throw point also needs to be correct (even outside of FSD). This includes explicit throws via the athrow bytecode, and any implicitly-thrown exceptions from NPE traps, resolves, etc.

I'm going to reopen this item while we investigate.

gacholio on 14 Nov 2018

Just wanted to clarify...
There are two call backs - cbException and cbExceptionCatch. Both return catcher’s method and location. We are tying to test if they are the same. They are not, but which one is incorrect? The catch location returned by cbException? Does cbException use GC maps as well?

gita-omr on 14 Nov 2018

This is (I believe) explained in the comments above. Exception (e.g. throw) is incorrect, ExceptionCatch is correct. Both use the JIT metadata to get the locations, but for some reason, the throw version of the code is not getting the correct info - see https://github.com/eclipse/openj9/issues/3373#issuecomment-438381311.

gacholio on 14 Nov 2018

I've reviewed the code, and it seems reasonable.

I'd welcome some volunteers to see if my simplified exception code in:

https://github.com/eclipse/openj9/blob/master/runtime/codert_vm/CodertVMHelpers.cpp#L323

called from:

https://github.com/eclipse/openj9/blob/master/runtime/jvmti/jvmtiHook.c#L1314

should be equivalent to the stack walk code in:

https://github.com/eclipse/openj9/blob/master/runtime/codert_vm/jswalk.c#L200

called from:

https://github.com/eclipse/openj9/blob/master/runtime/jvmti/jvmtiHook.c#L1886

@DanHeidinga @andrewcraik @fjeremic

Note the magic +1 for the handler PC is taken care of in the stackwalk case by:

https://github.com/eclipse/openj9/blob/master/runtime/codert_vm/jswalk.c#L376

gacholio on 14 Nov 2018

The jvmtiHook.c and jswalk.c implementations look identical to me.

fjeremic on 14 Nov 2018

@gacholio The old code uses getFirstInlinedCallSite & getJitInlineDepthFromCallSite to loop over the inlined call sites. The new code only uses the getFirstInlinedCallSite with no loop.

Intentional?

DanHeidinga on 14 Nov 2018

@DanHeidinga Yes, the catch code is only interested in the top method in the map, since that will be the catcher. The stack walker needs to treat all of the methods in the map.

gacholio on 14 Nov 2018

Won't the method returned by https://github.com/eclipse/openj9/blob/82275daf0c7145f8ce63878ca43961b8b2e17039/runtime/codert_vm/CodertVMHelpers.cpp#L340 potentially be wrong?

DanHeidinga on 14 Nov 2018

I don't believe so - it's the location that's causing issues.

gacholio on 14 Nov 2018

Not disagreeing about the location issue, but I think there's also a bug related to the method here. If the catcher isn't the first inline, then we're returning the wrong method to the callback.

DanHeidinga on 14 Nov 2018

The catcher is necessarily the first inline, assuming the map I get back based on the catcher PC is correct.

gacholio on 14 Nov 2018

The stackwalk isn't walking inlines so this line https://github.com/eclipse/openj9/blob/82275daf0c7145f8ce63878ca43961b8b2e17039/runtime/codert_vm/CodertVMHelpers.cpp#L334 should return the data for the method that has inlined the catcher?

The new code should have to loop through the inlines to find the appropriate one, no?

DanHeidinga on 14 Nov 2018

Interesting - the catch hook is not walking inlines, as you say, which would seem to be a bug, thought it passes the test. I'll look into this.

gacholio on 14 Nov 2018

Stop confusing me - inlines are walked by default, and the catch code is not disabling them, so that's why it works.

gacholio on 14 Nov 2018

50K runs on AIX with no issues, so I believe #3687 fixes this. Thanks to @gita-omr and @andrewcraik for the quick analysis of the data once I managed to gather it. Special thanks to Andrew for being able to detect the difference in the stack walk and catch code that the rest of us managed to repeatedly miss.

gacholio on 15 Nov 2018

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

SDK archiving is very slow on zLinux machines

AdamBrousseau · 6Comments

OpenJDK8-OpenJ9 SDK throws java/lang/UnsatisfiedLinkError while attempting to access JLM password file

Mesbah-Alam · 5Comments

Question: can VM-Proxy support be removed?

JasonFengJ9 · 3Comments

Trying to figure out how to run JIT casting tests

VermaSh · 3Comments

OpenJ9 in Alpine/musl Docker containers?

hrstoyanov · 4Comments

Openj9: Agent support for exception catch information

Most helpful comment

All 62 comments

3389 can be used to test the JIT changes. Once #3386 has been merged, the decomp003-NON-FSD test will fail if the JIT changes are not correct.

3462 and #3463 introduce the support of this Issue for X86 CodeGen and common infrastructure. This first draft queries TR_disableThrowToGoto for whether to preserve bytecode info; it will be updated once we decide the proper API.

3463 allows merging the catch map if the previous one has identical bytecode info and GC info. Would that cause any issue? You can remove mapCursor->isByteCodeInfoIdenticalTo(nextMapCursor) || from following line to rule it out.

Related issues

3462 and #3463 introduce the support of this Issue for X86 CodeGen and common infrastructure. This first draft queries `TR_disableThrowToGoto` for whether to preserve bytecode info; it will be updated once we decide the proper API.

3463 allows merging the catch map if the previous one has identical bytecode info and GC info. Would that cause any issue? You can remove `mapCursor->isByteCodeInfoIdenticalTo(nextMapCursor) ||` from following line to rule it out.