Incorrect behaviour: OpenJ9 terminates abruptly or generates a core file for TRAP instructions when run in RPM package manager on RHEL8.2 ppc64le. This behaviour is not seen outside of RPM package manager. Similar behaviour is also seen on SLES15 ppc64le
The issue is introduced because of https://github.com/eclipse/omr/pull/4554, which impacts JIT's SIGTRAP signal handler.
Reverting https://github.com/eclipse/omr/pull/4554 corrects the behaviour. https://github.com/eclipse/omr/pull/4554 can't be reverted since it fixes https://github.com/eclipse/openj9/issues/7749.
There is no direct link between unblocking a signal and JIT's SIGTRAP signal handler. Also, SIGTRAP is categorized as a synchronous signal, and only asynchronous signals are unblocked.
We need to investigate how https://github.com/eclipse/omr/pull/4554 impacts JIT's SIGTRAP signal handler.
The below workaround, test case and instructions to reproduce the behaviour are provided by @klangman.
The JVM command line option, -Xjit:NoResumableTrapHandler, will avoid the above reported problem. This option prevents the JIT from using TRAP instructions.
WORK directory: mkdir /root/work; export WORK=/root/workcd $WORK$WORK/jdk8NPETest.javatest.jvmtrap.specrun.sh./run.sh should reproduce the incorrect behaviourrpmbuildrpmNPETest.javapublic class NPETest {
static String[] strings = {"Kevin", "Alice", "Phill", "Tilly", "Noel", "Anthony", "Bruce", "Julie", null, null};
public static void main(String[] args) {
for( int i=0 ; i<20000 ; i++){
throwMaybe(i%10);
}
System.out.println("Test finished without a crash!");
}
static void throwMaybe(int x){
try{
System.out.println( "Name: " + strings[x].toString() );
}catch(NullPointerException e){
// Do nothing
}
}
}
test.jvmtrap.specName: TEST
Version: 8.1.11
Release: 15
License: None
Summary: Testing if I can see a SEGV using JVM in a scriptlet
Group: Utilities/Archiving
Vendor: N/A
URL: n/a
%description
%files
%post
#!/bin/bash
echo "Hello World!"
#/etc/init.d/webserver start
$WORK/jdk8/bin/java -Xjit:verbose -cp $WORK NPETest |& tee $WORK/output.log
run.shcd $WORK
jdk8/bin/javac NPETest.java
rpm -evh TEST-8.1.11-15.ppc64le
rpmbuild -ba test.jvmtrap.spec
rpm -ivh /root/rpmbuild/RPMS/ppc64le/TEST-8.1.11-15.ppc64le.rpm
The problem also happens on SLES 15 for PPCle.
@babsingh / @klangman so which platforms are affected and what is the next step on this?
so which platforms are affected and what is the next step on this?
@andrewcraik The failure was only observed on ppc64le platforms, specifically RHEL8.2 and SLES15, since JIT's SIGTRAP signal handler only gets used on these platforms. There is a workaround, -Xjit:NoResumableTrapHandler, so resources are being prioritized towards other high priority tasks. The next step would be to investigate how eclipse/omr#4554 impacts JIT's SIGTRAP signal handler since reverting eclipse/omr#4554 resolves the issue.
I tested zLinux and was unable to recreate the problem. It's unknown if the problem exists on LinuxPPCbe because there is no up to date distribution on PPC BE. The problem only appears on newer distributions (older SLES and RHEL distributions of PPC LE do not show the problem).
OK since this is P specific - FYI @gita-omr and I am going to tag this with the arch:p label
The problem also happens on SLES 15 for PPCle.
@klangman SLES has YaST as the package manager. It also supports RPM. Was the problem recreated using YaST or RPM on SLES 15?
@klangman SLES has YaST as the package manager. It also supports RPM. Was the problem recreated using YaST or RPM on SLES 15?
@babsingh I used RPM on SLES 15 to recreate the problem.
I'm going to move this to the next release as it is unlikely that it will be resolved for the code split.
@babsingh @klangman Let me know if you think otherwise.
Looking at the signal masks in /proc/<PID>/status every ~0.02 seconds I see the following changes:
| Timestamp | java process (PID=7527) | main thread (PID=7530) | |||||
|---|---|---|---|---|---|---|---|
| SigBlk | SigIgn | SigCgt | SigBlk | SigIgn | SigCgt | ||
| 32.262 | 0xfa17 | 0x0006 | 0x0000 | --- | --- | --- | |
| 32.352 | // | 0x1006 | 0x04f8 | 0xfa17 | 0x1006 | 0x04f8 | |
| 32.448 | // | 0x1002 | 0x44fd | 0xba12 | 0x1002 | 0x44fd | |
| 33.083 | // | // | 0x44ed | 0xba02 | // | 0x44ed | |
| 33.084 | SIGTRAP received in main thread | ||||||
* SIGTRAP bit = 0x0010
What of interest here is the last change, the SIGTRAP handler is not set with the OS anymore and it was unblocked in the main thread.
The main difference between crashed runs and successful runs (by using -Xjit:NoResumableTrapHandler, reverting the causing commit, or by chance as its not failing %100 of the time for me) is not having the last change (33.083 row).
There is no related strace to that change:
java(7527)-+-{JIT Compilation}(7548)
...
|-{JIT Compilation}(7555)
|-{JIT IProfiler}(7557)
|-{JIT Sampler}(7556)
|-{Signal Reporter}(7538)
`-{main}(7530)
[pid 7530 {main}] 15:57:32.264961 rt_sigaction(SIGTRAP, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
[pid 7530 {main}] 15:57:32.265027 rt_sigaction(SIGTRAP, {sa_handler=0x7fff9c7835c0, sa_mask=[], sa_flags=SA_RESTART|SA_NODEFER|SA_SIGINFO}, NULL, 8) = 0
[pid 7530 {main}] 15:57:33.084218 --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=1929882744, si_uid=32767} ---
@rmnattas is looking at it.
Adding an OR condition || portLibrarySignalNo == OMRPORT_SIG_FLAG_SIGTRAP to line [1] allowing SIGTRAP to get unblocked works without showing the problem (finishes without crashing). For some reason when SIGTRAP is unblocked (only in the main thread) the handler does not get unset and the signals get received by the JVM not the OS.
@babsingh Any reason why would a change in a signal mask/handler like the last row change in https://github.com/eclipse/openj9/issues/10744#issuecomment-750893654 happen in OpenJ9/OMR? But then no related strace is showing which could mean it's something outside the java process and its threads.
@klangman mentioned that it could be a PPCle kernel bug which I agree that its a plausible theory given the strange circumstances of the signal mask/handler change without seeing the java process and its threads doing the change. I'm trying to test that somehow with a C process.
Also, if it's urgent, I think a temporary fix could be my last comment. Allowing synchronous signals (or maybe just SIGTRAP) to get unblocked under registerSignalHandlerWithOS should satisfy "a signal should not be unblocked if it is not used i.e. a signal handler is not registered for it."
Any reason why would a change ...
None. I agree that there is an external entity responsible for interfering with the SIGTRAP handler, which is environment specific (RPM + RHEL + PPCLE).
... temporary fix ... unblocking SIGTRAP fixes the behaviour of the SIGTRAP handler
As per the Notes in https://man7.org/linux/man-pages/man2/sigprocmask.2.html, there are only side-effects for blocking signals such as SIGBUS, SIGFPE, SIGILL and SIGSEGV . Unblocking signals doesn't have any side-effects. So, we can unblock all signals (synchronous + asynchronous) after registering a handler against the signal in OMR. This can be treated as a permanent fix. This will also add consistency to unblocking signals.
@rmnattas is working on the FIX: https://github.com/rmnattas/omr/commits/unix-signal-unblock. Currently, only asynchronous signals are unblocked. The FIX will unblock ALL signals. The FIX will also unblock after the signal handler is registered in order to avoid receiving signals between unblock and registration.
The following tests have been suggested to verify functionality:
SunMiscSignalTest (checks registration and invocation of handlers): https://github.com/ibmruntimes/openj9-openjdk-jdk15/blob/openj9/test/jdk/sun/misc/SunMiscSignalTest.javaSigIntTest (checks shutdown hooks): https://github.com/eclipse/openj9/issues/54#issuecomment-345854560On a RHEL 8.3 (Ootpa) ppc64le machine, I ran the failing test, https://github.com/eclipse/openj9/issues/10744#issue-711416727, 100 times using a nightly build (https://ci.eclipse.org/openj9/job/Build_JDK8_ppc64le_linux_Nightly/636/), which has @rmnattas's fix (https://github.com/eclipse/omr/pull/5742).
No failures were seen, which verifies that https://github.com/eclipse/omr/pull/5742 works as a fix. Thus, closing this issue.