Openj9: JIT SIGTRAP Signal Handler Issue

Created on 29 Sep 2020  路  14Comments  路  Source: eclipse/openj9

Issue

Incorrect behaviour: OpenJ9 terminates abruptly or generates a core file for TRAP instructions when run in RPM package manager on RHEL8.2 ppc64le. This behaviour is not seen outside of RPM package manager. Similar behaviour is also seen on SLES15 ppc64le

The issue is introduced because of https://github.com/eclipse/omr/pull/4554, which impacts JIT's SIGTRAP signal handler.

Reverting https://github.com/eclipse/omr/pull/4554 corrects the behaviour. https://github.com/eclipse/omr/pull/4554 can't be reverted since it fixes https://github.com/eclipse/openj9/issues/7749.

There is no direct link between unblocking a signal and JIT's SIGTRAP signal handler. Also, SIGTRAP is categorized as a synchronous signal, and only asynchronous signals are unblocked.

We need to investigate how https://github.com/eclipse/omr/pull/4554 impacts JIT's SIGTRAP signal handler.

The below workaround, test case and instructions to reproduce the behaviour are provided by @klangman.

Workaround

The JVM command line option, -Xjit:NoResumableTrapHandler, will avoid the above reported problem. This option prevents the JIT from using TRAP instructions.

Steps to reproduce

  • Perform the machine setup
  • Create a WORK directory: mkdir /root/work; export WORK=/root/work
  • cd $WORK
  • Unzip OpenJ9 JDK8 in $WORK/jdk8
  • Create the test case: NPETest.java
  • Create the RPM spec: test.jvmtrap.spec
  • Create the run script: run.sh
  • Execute ./run.sh should reproduce the incorrect behaviour

Machine setup

  • RHEL 8.2 ppc64le
  • rpmbuild
  • rpm
  • OpenJ9 JDK8 build for Linux ppc64le from AdoptOpenJDK

Create the test case: NPETest.java

public class NPETest {
    static String[] strings = {"Kevin", "Alice", "Phill", "Tilly", "Noel", "Anthony", "Bruce", "Julie", null, null};
    public static void main(String[] args) {
        for( int i=0 ; i<20000 ; i++){
            throwMaybe(i%10);
        }
        System.out.println("Test finished without a crash!");
    }
    static void throwMaybe(int x){
        try{
           System.out.println( "Name: " + strings[x].toString() );
        }catch(NullPointerException e){
            // Do nothing
        }
    }
}

Create the RPM spec: test.jvmtrap.spec

Name:       TEST
Version:        8.1.11
Release:        15
License:        None
Summary:        Testing if I can see a SEGV using JVM in a scriptlet
Group:      Utilities/Archiving
Vendor:     N/A
URL:        n/a
%description

%files

%post
#!/bin/bash
echo "Hello World!"
#/etc/init.d/webserver start
$WORK/jdk8/bin/java -Xjit:verbose -cp $WORK NPETest |& tee $WORK/output.log

Create the run script: run.sh

cd $WORK
jdk8/bin/javac NPETest.java
rpm -evh TEST-8.1.11-15.ppc64le
rpmbuild -ba test.jvmtrap.spec
rpm -ivh /root/rpmbuild/RPMS/ppc64le/TEST-8.1.11-15.ppc64le.rpm
power jit vm userRaised

All 14 comments

The problem also happens on SLES 15 for PPCle.

@babsingh / @klangman so which platforms are affected and what is the next step on this?

so which platforms are affected and what is the next step on this?

@andrewcraik The failure was only observed on ppc64le platforms, specifically RHEL8.2 and SLES15, since JIT's SIGTRAP signal handler only gets used on these platforms. There is a workaround, -Xjit:NoResumableTrapHandler, so resources are being prioritized towards other high priority tasks. The next step would be to investigate how eclipse/omr#4554 impacts JIT's SIGTRAP signal handler since reverting eclipse/omr#4554 resolves the issue.

I tested zLinux and was unable to recreate the problem. It's unknown if the problem exists on LinuxPPCbe because there is no up to date distribution on PPC BE. The problem only appears on newer distributions (older SLES and RHEL distributions of PPC LE do not show the problem).

OK since this is P specific - FYI @gita-omr and I am going to tag this with the arch:p label

The problem also happens on SLES 15 for PPCle.

@klangman SLES has YaST as the package manager. It also supports RPM. Was the problem recreated using YaST or RPM on SLES 15?

@klangman SLES has YaST as the package manager. It also supports RPM. Was the problem recreated using YaST or RPM on SLES 15?

@babsingh I used RPM on SLES 15 to recreate the problem.

I'm going to move this to the next release as it is unlikely that it will be resolved for the code split.

@babsingh @klangman Let me know if you think otherwise.

Looking at the signal masks in /proc/<PID>/status every ~0.02 seconds I see the following changes:

Timestampjava process (PID=7527)main thread (PID=7530)
SigBlkSigIgnSigCgtSigBlkSigIgnSigCgt
32.2620xfa170x00060x0000---------
32.352//0x10060x04f80xfa170x10060x04f8
32.448//0x10020x44fd0xba120x10020x44fd
33.083////0x44ed0xba02//0x44ed
33.084SIGTRAP received in main thread

* SIGTRAP bit = 0x0010

What of interest here is the last change, the SIGTRAP handler is not set with the OS anymore and it was unblocked in the main thread.
The main difference between crashed runs and successful runs (by using -Xjit:NoResumableTrapHandler, reverting the causing commit, or by chance as its not failing %100 of the time for me) is not having the last change (33.083 row).

There is no related strace to that change:

java(7527)-+-{JIT Compilation}(7548)
           ...
           |-{JIT Compilation}(7555)
           |-{JIT IProfiler}(7557)
           |-{JIT Sampler}(7556)
           |-{Signal Reporter}(7538)
           `-{main}(7530)



[pid  7530 {main}] 15:57:32.264961 rt_sigaction(SIGTRAP, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
[pid  7530 {main}] 15:57:32.265027 rt_sigaction(SIGTRAP, {sa_handler=0x7fff9c7835c0, sa_mask=[], sa_flags=SA_RESTART|SA_NODEFER|SA_SIGINFO}, NULL, 8) = 0

[pid  7530 {main}] 15:57:33.084218 --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=1929882744, si_uid=32767} ---

@rmnattas is looking at it.

Adding an OR condition || portLibrarySignalNo == OMRPORT_SIG_FLAG_SIGTRAP to line [1] allowing SIGTRAP to get unblocked works without showing the problem (finishes without crashing). For some reason when SIGTRAP is unblocked (only in the main thread) the handler does not get unset and the signals get received by the JVM not the OS.

[1]
https://github.com/eclipse/openj9-omr/blob/d19d2fff4fab1bc47f7288c029793468969e9ef3/port/unix/omrsignal.c#L1278

@babsingh Any reason why would a change in a signal mask/handler like the last row change in https://github.com/eclipse/openj9/issues/10744#issuecomment-750893654 happen in OpenJ9/OMR? But then no related strace is showing which could mean it's something outside the java process and its threads.

@klangman mentioned that it could be a PPCle kernel bug which I agree that its a plausible theory given the strange circumstances of the signal mask/handler change without seeing the java process and its threads doing the change. I'm trying to test that somehow with a C process.

Also, if it's urgent, I think a temporary fix could be my last comment. Allowing synchronous signals (or maybe just SIGTRAP) to get unblocked under registerSignalHandlerWithOS should satisfy "a signal should not be unblocked if it is not used i.e. a signal handler is not registered for it."

Any reason why would a change ...

None. I agree that there is an external entity responsible for interfering with the SIGTRAP handler, which is environment specific (RPM + RHEL + PPCLE).

... temporary fix ... unblocking SIGTRAP fixes the behaviour of the SIGTRAP handler

As per the Notes in https://man7.org/linux/man-pages/man2/sigprocmask.2.html, there are only side-effects for blocking signals such as SIGBUS, SIGFPE, SIGILL and SIGSEGV . Unblocking signals doesn't have any side-effects. So, we can unblock all signals (synchronous + asynchronous) after registering a handler against the signal in OMR. This can be treated as a permanent fix. This will also add consistency to unblocking signals.

@rmnattas is working on the FIX: https://github.com/rmnattas/omr/commits/unix-signal-unblock. Currently, only asynchronous signals are unblocked. The FIX will unblock ALL signals. The FIX will also unblock after the signal handler is registered in order to avoid receiving signals between unblock and registration.

The following tests have been suggested to verify functionality:

  1. SunMiscSignalTest (checks registration and invocation of handlers): https://github.com/ibmruntimes/openj9-openjdk-jdk15/blob/openj9/test/jdk/sun/misc/SunMiscSignalTest.java
  2. SigIntTest (checks shutdown hooks): https://github.com/eclipse/openj9/issues/54#issuecomment-345854560
  3. For checking generation of core files, you can just run sanity+extended functional testing in a personal build.

On a RHEL 8.3 (Ootpa) ppc64le machine, I ran the failing test, https://github.com/eclipse/openj9/issues/10744#issue-711416727, 100 times using a nightly build (https://ci.eclipse.org/openj9/job/Build_JDK8_ppc64le_linux_Nightly/636/), which has @rmnattas's fix (https://github.com/eclipse/omr/pull/5742).

No failures were seen, which verifies that https://github.com/eclipse/omr/pull/5742 works as a fix. Thus, closing this issue.

Was this page helpful?
0 / 5 - 0 ratings