Openj9: Aborted pull request testing contains confusing message

Created on 18 Jan 2018 · 24Comments · Source: eclipse/openj9

Looking at a problematic pull request build https://ci.eclipse.org/openj9/job/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/336/ (due to #968), it contains the following output. The problem here is the Build finished. 1013 tests run, 351 skipped, 0 failed. message which might be mistaken has having completed all the tests. Perhaps this can be improved.

`14:35:50 org.openj9.test.java.lang. 14:35:50 JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2018/01/18 19:35:50 - please wait. 14:35:50 JVMDUMP032I JVM requested System dump using '/home/jenkins/jenkins-agent/workspace/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/openj9/test/TestConfig/core.20180118.193550.30387.0001.dmp' in response to an event [Pipeline] } [Pipeline] // dir [Pipeline] } [Pipeline] // timestamps [Pipeline] } [Pipeline] // withEnv [Pipeline] } [Pipeline] // stage [Pipeline] stage [Pipeline] { (Declarative: Post Actions) [Pipeline] echo Publish Test Results... [Pipeline] junit Recording test results [Pipeline] step TestNG Reports Processing: START Looking for TestNG results report in workspace using pattern: **/testng-results.xml Saving reports... Processing '/home/hudson/genie.openj9/.jenkins/jobs/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/builds/336/testng/testng-results-1.xml' Processing '/home/hudson/genie.openj9/.jenkins/jobs/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/builds/336/testng/testng-results-2.xml' Processing '/home/hudson/genie.openj9/.jenkins/jobs/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/builds/336/testng/testng-results-3.xml' Processing '/home/hudson/genie.openj9/.jenkins/jobs/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/builds/336/testng/testng-results-4.xml' Processing '/home/hudson/genie.openj9/.jenkins/jobs/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/builds/336/testng/testng-results.xml' TestNG Reports Processing: FINISH [Pipeline] step TAP Reports Processing: START Looking for TAP results report in workspace using pattern: **/*.tap Did not find any matching files. [Pipeline] } [Pipeline] // stage [Pipeline] } [Pipeline] // node [Pipeline] End of Pipeline Adding one-line test results to commit status... Setting status of b3256579fd0340b2828c82f9ceb64674bba2b01a to FAILURE with url https://ci.eclipse.org/openj9/job/PullRequest-Sanity-linux_ppc-64_cmprssptrs_le/336/ and message: 'Build finished. 1013 tests run, 351 skipped, 0 failed.'

test high

Source

pshipton

All 24 comments

The problem here is the jvm crashing, followed by the POST Jenkins actions which run after the build. This includes the PR plugin updating the PullRequest build status. Since only X tests ran, that is what the PR Plugin determines. I think we would have to make our test framework more robust or be able to recover from these crashes. The plugin cannot tell that the testing stopped early so it can only ananylze the test results that are there.

Edit:
Or somehow have the test framework print out something like 1013/21152 tests ran

AdamBrousseau on 18 Jan 2018

I think there are two problems.

1) One is with the test framework, that we should be reporting the crash or generation of the dmp files (if it is not an expected part of the test), related item #969. At present, we do not pre-count and report the number of tests before we run them, in fact, we wrote the framework to easily pick up any tests that are present or newly added in the test directory (to make it easier for developers to add tests. I had adding a step at the beginning to count all of the test before we started the run, but decided against it for a number of reasons, including trying to keep execution time to a minimum. We can look at it again, to see what it will cost us to pre-count.

2) The second part is that the plugin is printing the confusing message. Is there a way to turn off the verbosity of the plugin, that people may mistake for framework output?

smlambert on 19 Jan 2018

Note in this case the dump files are expected (not sure the test should be creating them because this is a lot of overhead, but currently it does). The problem is the testing suddenly stops, i.e. the VM suddenly exits while in the process of creating the expected dump files (#968). This doesn't seem to result in the running test failing, and the test summary (showing no failures) could be mistaken for a passing run, although overall the jenkins build failed.

pshipton on 19 Jan 2018

Marking this as high priority as it is negatively affecting PR and nightly builds.

DanHeidinga on 13 Feb 2018

FYI @renfeiw

I think there are a couple of things going on:

1) the JCL suite is likely be too large and occasionally causing an OOM error (a known testng bug, https://github.com/cbeust/testng/issues/1259, where the workaround is to break the test suites up into smaller pieces so testng doesn't store so much test result output in memory before it writes the output to files) - when OOM is seen as reported above, its often in JCL tests - so we should proactively break that suite down into a few smaller sizes (this will help with parallelism later anyway) <-- ignore this for now, I originally made an assumption that the OOM was related to this (since the test framework exits, which is not typical behaviour in the case of core dumps in tests), but the cause of the OOM is known, it is intentionally thrown by the test

2) testkitgen / resultsSummary should check if core dumps exist, and fail if so (we will need to make sure tests that create core dumps on purpose clean them up so they do not get included in the set being considered as proof of failure)

3) archive core files in Jenkins scripts in event of failure (this can be done in a separate PR)

smlambert on 13 Feb 2018

I believe the OOM which is occurring is intentionally caused by a test, see #968. The VM is exiting unexpectedly, which should not occur, but OpenJ9 doesn't yet have the diagnostics support to pursue the issue, although it is expected within a few weeks. Breaking up the tests suites may or may not help. Running the OOM test with -Xdump:none may or may not help.

pshipton on 13 Feb 2018

I think we should re-open this Issue. The problem described hasn't been resolved, only the current issue causing this problem may have been resolved. The PR testing could be aborted for other reasons, and the messages would still be confusing.

pshipton on 13 Feb 2018

Yes, this issue auto-closed, but since it is describing several problems it should not be closed or better yet, separate issues should be raised.

smlambert on 14 Feb 2018

TAP result or string (i.e., TOTAL: 102 EXECUTED: 100 PASSED: 100 FAILED: 0 SKIPPED: 2 ) from console contains all test results and is sufficient to use as success criteria for Jenkins build. testng result and junit result do not include all tests. And all test results can be fund in TAP result or console output.
I am wondering that do we really need to show testng result and junit result? Could we just remove them to avoid confusion?

llxia on 20 Feb 2018

do we really need to show testng result and junit result? Could we just remove them to avoid confusion?

We can. I'm not sure what the Add test result one liner to the PR will look like. Currently it is updating with the JUnit numbers. I think before we added JUnit we just had testNG which gave slightly different numbers. I can run a quick test to see.

AdamBrousseau on 20 Feb 2018

We should definitely not be using JUnit numbers for the "Add test result one liner" value, as that is incorrect and the root cause of the confusion reported in this issue.

smlambert on 20 Feb 2018

One potential value of showing Junit results via the Junit plugin is that its a nicer GUI for developers to view the test results of the tests that produce that type of output (which is all the testng tests, all the openjdk regression tests, but doesn't include the cmdlinetester tests or any STF / SV tests, or JCK tests).

smlambert on 20 Feb 2018

👍1

Build finished. No test results found.

1214

AdamBrousseau on 20 Feb 2018

👎1

You have to specify failIfNoResults:true. Otherwise, build will pass if there is no TAP result.

step([$class: “TapPublisher”, testResults: “**/*.tap”, failIfNoResults:true])])

llxia on 21 Feb 2018

ok, we should add that, however there were results. I guess the ghpr plugin doesn't recognise TAP?

AdamBrousseau on 21 Feb 2018

supposed to have been solved
https://github.com/jenkinsci/ghprb-plugin/issues/168

AdamBrousseau on 21 Feb 2018

👍1

If ghprb-plugin#168 doesn't resolve it, I think we can/should just turn off that counter off to avoid confusion (as briefly discussed today).

I believe all other parts of this issue have been addressed.

Does anyone object to this issue being closed? I will plan to close it late tomorrow, if not.

smlambert on 21 Mar 2018

A recent PR build https://ci.eclipse.org/openj9/job/PullRequest-Sanity-JDK9-linux_ppc-64_cmprssptrs_le-OpenJ9/183/ shows 'Build finished. 19982 tests run, 6 skipped, 0 failed.' although 1 test failed.

pshipton on 28 Mar 2018

Right, because the failed test only generated TAP output and the 'test results one-liner' is junit results. Which is one of the reasons why we should turn it off as Shelley suggested in the previous comment.

AdamBrousseau on 28 Mar 2018

@AdamBrousseau - can you remove the one-liner or show me where to do that?

smlambert on 28 Mar 2018

remove the one-liner

Done

AdamBrousseau on 28 Mar 2018

👍1

@pshipton - please confirm that this issue can be close once you confirm that the confusing message is no longer present

smlambert on 5 Apr 2018

I'll have to see a failing PR build to be 100% sure, but a passing build no longer has a passed/failed summary at the end. I'll close this assuming the problem is resolved and reopen later if necessary.

pshipton on 5 Apr 2018

I found a failed PR build, and it also looked good.

pshipton on 5 Apr 2018

Was this page helpful?

0 / 5 - 0 ratings