Looking at a problematic pull request build https://ci.eclipse.org/openj9/job/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/336/ (due to #968), it contains the following output. The problem here is the Build finished. 1013 tests run, 351 skipped, 0 failed. message which might be mistaken has having completed all the tests. Perhaps this can be improved.
`
14:35:50 org.openj9.test.java.lang.
14:35:50 JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2018/01/18 19:35:50 - please wait.
14:35:50 JVMDUMP032I JVM requested System dump using '/home/jenkins/jenkins-agent/workspace/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/openj9/test/TestConfig/core.20180118.193550.30387.0001.dmp' in response to an event
[Pipeline] }
[Pipeline] // dir
[Pipeline] }
[Pipeline] // timestamps
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // stage
[Pipeline] stage
[Pipeline] { (Declarative: Post Actions)
[Pipeline] echo
Publish Test Results...
[Pipeline] junit
Recording test results
[Pipeline] step
TestNG Reports Processing: START
Looking for TestNG results report in workspace using pattern: **/testng-results.xml
Saving reports...
Processing '/home/hudson/genie.openj9/.jenkins/jobs/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/builds/336/testng/testng-results-1.xml'
Processing '/home/hudson/genie.openj9/.jenkins/jobs/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/builds/336/testng/testng-results-2.xml'
Processing '/home/hudson/genie.openj9/.jenkins/jobs/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/builds/336/testng/testng-results-3.xml'
Processing '/home/hudson/genie.openj9/.jenkins/jobs/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/builds/336/testng/testng-results-4.xml'
Processing '/home/hudson/genie.openj9/.jenkins/jobs/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/builds/336/testng/testng-results.xml'
TestNG Reports Processing: FINISH
[Pipeline] step
TAP Reports Processing: START
Looking for TAP results report in workspace using pattern: **/*.tap
Did not find any matching files.
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
Adding one-line test results to commit status...
Setting status of b3256579fd0340b2828c82f9ceb64674bba2b01a to FAILURE with url https://ci.eclipse.org/openj9/job/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/336/ and message: 'Build finished. 1013 tests run, 351 skipped, 0 failed.'
The problem here is the jvm crashing, followed by the POST Jenkins actions which run after the build. This includes the PR plugin updating the PullRequest build status. Since only X tests ran, that is what the PR Plugin determines. I think we would have to make our test framework more robust or be able to recover from these crashes. The plugin cannot tell that the testing stopped early so it can only ananylze the test results that are there.
Edit:
Or somehow have the test framework print out something like 1013/21152 tests ran
I think there are two problems.
1) One is with the test framework, that we should be reporting the crash or generation of the dmp files (if it is not an expected part of the test), related item #969. At present, we do not pre-count and report the number of tests before we run them, in fact, we wrote the framework to easily pick up any tests that are present or newly added in the test directory (to make it easier for developers to add tests. I had adding a step at the beginning to count all of the test before we started the run, but decided against it for a number of reasons, including trying to keep execution time to a minimum. We can look at it again, to see what it will cost us to pre-count.
2) The second part is that the plugin is printing the confusing message. Is there a way to turn off the verbosity of the plugin, that people may mistake for framework output?
Note in this case the dump files are expected (not sure the test should be creating them because this is a lot of overhead, but currently it does). The problem is the testing suddenly stops, i.e. the VM suddenly exits while in the process of creating the expected dump files (#968). This doesn't seem to result in the running test failing, and the test summary (showing no failures) could be mistaken for a passing run, although overall the jenkins build failed.
Marking this as high priority as it is negatively affecting PR and nightly builds.
FYI @renfeiw
I think there are a couple of things going on:
1) the JCL suite is likely be too large and occasionally causing an OOM error (a known testng bug, https://github.com/cbeust/testng/issues/1259, where the workaround is to break the test suites up into smaller pieces so testng doesn't store so much test result output in memory before it writes the output to files) - when OOM is seen as reported above, its often in JCL tests - so we should proactively break that suite down into a few smaller sizes (this will help with parallelism later anyway) <-- ignore this for now, I originally made an assumption that the OOM was related to this (since the test framework exits, which is not typical behaviour in the case of core dumps in tests), but the cause of the OOM is known, it is intentionally thrown by the test
2) testkitgen / resultsSummary should check if core dumps exist, and fail if so (we will need to make sure tests that create core dumps on purpose clean them up so they do not get included in the set being considered as proof of failure)
3) archive core files in Jenkins scripts in event of failure (this can be done in a separate PR)
I believe the OOM which is occurring is intentionally caused by a test, see #968. The VM is exiting unexpectedly, which should not occur, but OpenJ9 doesn't yet have the diagnostics support to pursue the issue, although it is expected within a few weeks. Breaking up the tests suites may or may not help. Running the OOM test with -Xdump:none may or may not help.
I think we should re-open this Issue. The problem described hasn't been resolved, only the current issue causing this problem may have been resolved. The PR testing could be aborted for other reasons, and the messages would still be confusing.
Yes, this issue auto-closed, but since it is describing several problems it should not be closed or better yet, separate issues should be raised.
TAP result or string (i.e., TOTAL: 102 EXECUTED: 100 PASSED: 100 FAILED: 0 SKIPPED: 2 ) from console contains all test results and is sufficient to use as success criteria for Jenkins build. testng result and junit result do not include all tests. And all test results can be fund in TAP result or console output.
I am wondering that do we really need to show testng result and junit result? Could we just remove them to avoid confusion?
do we really need to show testng result and junit result? Could we just remove them to avoid confusion?
We can. I'm not sure what the Add test result one liner to the PR will look like. Currently it is updating with the JUnit numbers. I think before we added JUnit we just had testNG which gave slightly different numbers. I can run a quick test to see.
We should definitely not be using JUnit numbers for the "Add test result one liner" value, as that is incorrect and the root cause of the confusion reported in this issue.
One potential value of showing Junit results via the Junit plugin is that its a nicer GUI for developers to view the test results of the tests that produce that type of output (which is all the testng tests, all the openjdk regression tests, but doesn't include the cmdlinetester tests or any STF / SV tests, or JCK tests).
Build finished. No test results found.
You have to specify failIfNoResults:true. Otherwise, build will pass if there is no TAP result.
step([$class: “TapPublisher”, testResults: “**/*.tap”, failIfNoResults:true])])
ok, we should add that, however there were results. I guess the ghpr plugin doesn't recognise TAP?
supposed to have been solved
https://github.com/jenkinsci/ghprb-plugin/issues/168
If ghprb-plugin#168 doesn't resolve it, I think we can/should just turn off that counter off to avoid confusion (as briefly discussed today).
I believe all other parts of this issue have been addressed.
Does anyone object to this issue being closed? I will plan to close it late tomorrow, if not.
A recent PR build https://ci.eclipse.org/openj9/job/PullRequest-Sanity-JDK9-linux_ppc-64_cmprssptrs_le-OpenJ9/183/ shows 'Build finished. 19982 tests run, 6 skipped, 0 failed.' although 1 test failed.
Right, because the failed test only generated TAP output and the 'test results one-liner' is junit results. Which is one of the reasons why we should turn it off as Shelley suggested in the previous comment.
@AdamBrousseau - can you remove the one-liner or show me where to do that?
remove the one-liner
Done
@pshipton - please confirm that this issue can be close once you confirm that the confusing message is no longer present
I'll have to see a failing PR build to be 100% sure, but a passing build no longer has a passed/failed summary at the end. I'll close this assuming the problem is resolved and reopen later if necessary.
I found a failed PR build, and it also looked good.