Openj9: Add diagnostic check to detect wrong DF flag and abort

Created on 26 Jun 2020  路  20Comments  路  Source: eclipse/openj9

in #9782, we found a memory corruption caused by rep stosb executing at the wrong direction in zero initializing a newly allocated object, it turns out the DF flag is incorrectly set.

There are other memory corruption defects that fit into the pattern, i.e. fields are good but class header is zero.

We'd like to find out who set's the flag or who forgets to clear the flag. Waiting another memory corruption defect shows up and proving it to be related to wrong DF flag may be hard, but It will be easier to add a diagnostic check which will abort if DF is wrong. If the check runs frequently enough, we should be able to catch it.

When abort, this check won't tell us who leaves DF uncleared directly, but we hope using this check to at least find situations where the flag is wrong.

All 20 comments

@DanHeidinga @andrewcraik Do we want to enable the check on all tests for x86 builds? Or have a separate test suite for it? Who can help enabling the check or give me some guide on how to enable it if it's trivial.

FYI @gacholio @JamesKingdon

@liqunl I'd suggest enabling it in a WIP PR and running the test suite as a first pass. If that doesn't catch it, then we should put together some grinders of all the test suites with the flag enabled.

For the test mode, the check should be added in as many places as possible. I would suggest:

  • method enter
  • method exit
  • after any call to another method
  • return from direct JNI

Basically, we want to detect the problem as soon as possible.

Will the crash due to bad DF be in a snippet? Let's make it as easy as possible to identify the spot in the main code where the problem is detected.

@gacholio Currently I put it in the mainline code, so we have to open the core to find where it crash. If we want to identify the crash spot without having to look at the core, we can put the check in helpers, and call to the helper from the mainline. For example, we can have checkDfAtMethodEnter, checkDfAtMethodExit, checkDfAfterCall, checkDfAfterJNI. This way, we can identify the crash spot by looking at stack back trace printed to the output.

I'm not concerned about having to look at a core. I just want the problem detected as soon as possible.

I'm not concerned about having to look at a core. I just want the problem detected as soon as possible.

Ok, I'll stick to current approach and add more detections in the JITCODE

I have added the check to more places, but I'm getting failures with invalid JIT return address, Object neither in heap nor stack-allocated errors in my personal builds. The added checks seem to have messed up the stack. Looking at the failures.

The invalid JIT return address and bad o-slot problem is caused by adding the diagnostic check in out-of-line and call snippet, where instruction shape is important. Now I only add the check to mainline code, and the problems disappear.

@DanHeidinga I tried a personal build on all x86 platforms with the check enable, but it didn't detect bad DF. What should we do next?

I believe the suggestion was to add this as a command-line option and run all (or most) of our automated testing with the flag enabled.

I believe the suggestion was to add this as a command-line option and run all (or most) of our automated testing with the flag enabled.

We don't have a command-line option. What we have is a environment variable, but it should be easily enabled with a change in our test script or configuartion.

@DanHeidinga Do you agree to enable this test on all of our automated testing? I have two PRs opened for the JIT change, and will enable the test in a separate PR if we decide to do it.

@smlambert Someone from test should probably comment on the easiest way to get this into automated testing. The option can't be -Xjit due to the fact that the JIT only parses one of the strings, so it can't easily be tacked on without changing every option set for every mode.

The original suggestion was to enable it for all tests in a grinder / personal build, not enabled by default in the mainline code.

We can do that by enabling it in the special binary or setting the envvar for all the tests. Pick the easier to reproduce path as the end goal was to try to narrow the window of when the DF flag missed being reset

The original suggestion was to enable it for all tests in a grinder / personal build, not enabled by default in the mainline code.

We can do that by enabling it in the special binary or setting the envvar for all the tests. Pick the easier to reproduce path as the end goal was to try to narrow the window of when the DF flag missed being reset

Yes, the easier way is to set the envvar for all the tests. @smlambert Could you point me to places where I can set an environment variable for all tests?

Yes, if you want to blanket set it for all types of tests (system/functional/openjdk etc), then method 3

which describes adding the env vars temporarily to https://github.com/AdoptOpenJDK/TKG/blob/master/testEnv.mk in your branch of TKG, then when running tests, use that fork/branch of TKG. Note the instructions subtly tell you to use Grinder_TKG job (not Grinder job which doesn't present those TKG repo/branch parameters to you).

Update: I'm getting test failures of test_setSecurityManager jobs. Don't know how my change can cause such error, will have to look into it.

https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder_TKG/696/testReport/junit/org.openj9.test.java.lang/Test_System/test_setSecurityManager3/

Error Message

JIT: env var TR_enableBreakOnDFSet is set to 1 Exception in thread "main" java/lang/Error: JVM can't set custom SecurityManager due to java.lang.NoSuchMethodException: org.openj9.test.java.lang.Test_System$TestSecurityManagerNonPublicConstructor.<init>()  at java/lang/ClassLoader.initializeClassLoaders (java.base@9/ClassLoader.java:229)  at java/lang/Thread.initialize (java.base@9/Thread.java:430)  at java/lang/Thread.<init> (java.base@9/Thread.java:155) 

Stacktrace

      java.lang.AssertionError: JIT: env var TR_enableBreakOnDFSet is set to 1
Exception in thread "main" java/lang/Error: JVM can't set custom SecurityManager due to java.lang.NoSuchMethodException: org.openj9.test.java.lang.Test_System$TestSecurityManagerNonPublicConstructor.<init>()
at java/lang/ClassLoader.initializeClassLoaders (java.base@9/ClassLoader.java:229)
at java/lang/Thread.initialize (java.base@9/Thread.java:430)
at java/lang/Thread.<init> (java.base@9/Thread.java:155)

at org.testng.Assert.fail(Assert.java:96) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.openj9.test.support.Support_Exec.checkStderr(Support_Exec.java:124) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/jvmtest/functional/Java8andUp/GeneralTest.jar)
at org.openj9.test.support.Support_Exec.execJava(Support_Exec.java:83) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/jvmtest/functional/Java8andUp/GeneralTest.jar)
at org.openj9.test.java.lang.Test_System.test_setSecurityManager3(Test_System.java:385) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/jvmtest/functional/Java8andUp/GeneralTest.jar)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) from jrt:/java.base
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) from jrt:/java.base
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) from jrt:/java.base
at java.base/java.lang.reflect.Method.invoke(Method.java:566) from jrt:/java.base
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.internal.Invoker.invokeMethod(Invoker.java:580) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:716) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:988) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.TestRunner.privateRun(TestRunner.java:648) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.TestRunner.run(TestRunner.java:505) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:455) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:450) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:415) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.SuiteRunner.run(SuiteRunner.java:364) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:84) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1208) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1137) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.TestNG.runSuites(TestNG.java:1049) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.TestNG.run(TestNG.java:1017) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.TestNG.privateMain(TestNG.java:1354) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)
at org.testng.TestNG.main(TestNG.java:1323) from jdk.internal.loader.ClassLoaders$AppClassLoader@5950d4fe(file:/home/jenkins/workspace/Grinder_TKG/openjdk-tests/TKG/lib/testng.jar)

We have a job we know has failed with this - how about grinding that job with the option set?

We have a job we know has failed with this - how about grinding that job with the option set?

I have tried grinding that job and nothing happened. I can't run many repetitions in one grinder due to the limit, so I have to do many grinders. I can definitely try it again.

Due to other work, I haven't been able to get back this the security manager failures. I'll try to look at it this week.

Hint: If you switch to the old web UI on the farm, you can grind a much larger set of jobs.

Update, the error in security manager test is due to mismatch output. Exporting JIT ENV will cause a message to be printed out to stdout, which will cause failures in test looking at stdout. This is fixed by enabling silent env.

I also tried using the old web UI, but still can't reproduce a crash. The detect is found on Mac OS, and the internal farm doesn't run Mac build. I tried grinder on x86-64 linux with 3000 jobs.

@gacholio @andrewcraik I have all the changes needed in #10027, https://github.com/eclipse/omr/pull/5352 and https://github.com/AdoptOpenJDK/TKG/pull/104. Notice that the changes will add instructions at method entry/exit, around calls and object news only in Openj9 testing on x86. Could you review the OMR and Openj9 PRs? The Openj9 PR depends on the OMR PR.

We can get the TKG change in after the merge of OMR and Openj9 changes.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

pshipton picture pshipton  路  3Comments

dsouzai picture dsouzai  路  5Comments

xliang6 picture xliang6  路  3Comments

AdamBrousseau picture AdamBrousseau  路  6Comments

xliang6 picture xliang6  路  6Comments