https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.openjdk_s390x_linux_Nightly/37
ub18-390-1
java/util/concurrent/tck/JSR166TestCase.java
00:15:46 JavaTest Message: JUnit Failure: testAccumulateAndGetMT(DoubleAccumulatorTest): expected:<9.99999E11> but was:<9.99993040122E11>
00:15:46 junit.framework.AssertionFailedError: expected:<9.99999E11> but was:<9.99993040122E11>
00:15:46 at junit.framework.Assert.fail(Assert.java:50)
00:15:46 at junit.framework.Assert.failNotEquals(Assert.java:287)
00:15:46 at junit.framework.Assert.assertEquals(Assert.java:67)
00:15:46 at junit.framework.Assert.assertEquals(Assert.java:74)
00:15:46 at DoubleAccumulatorTest.testAccumulateAndGetMT(DoubleAccumulatorTest.java:175)
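For context, the kind of check that fails here can be sketched as follows. This is a minimal sketch and not the TCK test itself: the class name `AccumulateSketch`, the method names, and the thread/increment counts are assumptions made for illustration. Several threads each add 0, 1, ..., incs-1 into one shared `DoubleAccumulator`, and the result is compared with the closed-form sum. With 2 threads and 1,000,000 increments (assumed values) the closed form evaluates to 9.99999E11, the expected value in the log above; every partial sum is an integer well below 2^53, so the total is exact whatever the thread interleaving.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.DoubleAccumulator;

// Hypothetical sketch class; not part of the JSR166 TCK.
class AccumulateSketch {
    // Each of nThreads threads accumulates 0, 1, ..., incs-1 into one shared accumulator.
    static double runOnce(int nThreads, int incs) {
        DoubleAccumulator acc = new DoubleAccumulator((x, y) -> x + y, 0.0);
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                try {
                    start.await(); // release all workers together to maximize contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < incs; i++) {
                    acc.accumulate(i); // goes through DoubleBinaryOperator.applyAsDouble
                }
            });
            threads[t].start();
        }
        start.countDown();
        try {
            for (Thread th : threads) {
                th.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return acc.get(); // accurate here because all updaters have finished
    }

    // Gauss sum: each thread contributes incs * (incs - 1) / 2.
    static double expected(int nThreads, int incs) {
        return nThreads * (incs / 2.0) * (incs - 1);
    }

    public static void main(String[] args) {
        int nThreads = 2;
        int incs = 1_000_000;
        System.out.println("expected=" + expected(nThreads, incs)
                + " actual=" + runOnce(nThreads, incs));
    }
}
```

On a correct JVM the two printed values match; a result below the expected value, as in the log, means some accumulated values were lost or corrupted.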
We saw a similar failure before which was resolved https://github.com/eclipse/openj9/issues/7011
10x grinder https://ci.eclipse.org/openj9/job/Grinder/854 - passed ub16-390-1
30x grinder https://ci.eclipse.org/openj9/job/Grinder/855 - passed ub16-390-1
5x grinder ub18-390-1 https://ci.eclipse.org/openj9/job/Grinder/856 - failed 4/5
Note ub18-390-1 (and -2) are new, added yesterday. Existing ub16 machines were updated.
@fjeremic can you please take a look.
We also added p and x ub18 machines. I'll try some grinders on these platforms.
plinux ub18-ppcle-1 - https://ci.eclipse.org/openj9/view/Test/job/Grinder/857 - passed
xlinux ub18-x86-1 - https://ci.eclipse.org/openj9/view/Test/job/Grinder/858 - passed
@r30shah could you please help take a look? This is a z15 machine. I've launched a grinder [1] with -Xjit:disableZ15 to see if it is z15 related.
[1] https://ci.eclipse.org/openj9/job/Grinder/859 - ran on ub16-390-1 where the problem doesn't occur
Since the problem only seems to occur on ub18, I re-launched the previous grinder to target this platform.
https://ci.eclipse.org/openj9/job/Grinder/862 - failed 2/5
Thanks Peter, I was under the assumption that the "Rerun in Grinder" link would have done that, but now that I think about it, it was a pretty bad assumption. So it seems the issue is not z15 related. That should make our search easier, as we have local z14 machines we can work on.
Rebuilding one of my ub18 previous grinders will do it.
Thanks @pshipton @fjeremic, this gives me an initial pointer to start looking at it. I am going to try reproducing this on a local Ubuntu 18 z14 machine and continue the investigation there.
Rebuilding one of my ub18 previous grinders will do it.
@pshipton that is surprising as I used the Grinder from your previous comment [1] to launch my own. [1] ran on ub18, however clicking the "Rerun in Grinder" link navigates us to page [2] which doesn't seem to put "ub18-390-1" in the LABEL field. Is this a bug?
[1] https://ci.eclipse.org/openj9/job/Grinder/856/
[2] https://ci.eclipse.org/openj9//job/Grinder/parambuild/?JDK_VERSION=11&JDK_IMPL=openj9&BUILD_LIST=openjdk&PLATFORM=s390x_linux&TARGET=jdk_custom&SDK_RESOURCE=nightly&CUSTOM_TARGET=test/jdk/java/util/concurrent/tck/JSR166TestCase.java
Not sure the "Rerun in Grinder" behavior is a bug, it's just the way it's set up atm. You could open an issue to suggest that "Rerun in Grinder" links include the machine label. Not sure how feasible that is, it depends on whether the machine name is available to the test job. The user may not want that behavior, but is free to re-configure after the link fills in the initial template.
@fjeremic perhaps you're not aware there is a "Rebuild" option on the left in the jenkins UI which duplicates the job exactly.
@pshipton Is it possible to get access to the machine where this is failing consistently? I tried reproducing it on one of our z14 machines running Ubuntu 18, but it does not fail. I also tried a grinder on internal infrastructure, but it does not fail there either. As it is failing consistently on that machine, it will be easier to debug there.
Pls open an internal infra issue to request access to ub18-390-1 and provide your ssh key.
After getting access to ub18-390-1, where this failure reproduces fairly consistently, playing with the unit test, and carefully going through https://github.com/eclipse/openj9/issues/7011, where @AlenBadel has done a great job of documenting the issue, I was able to reproduce and debug this easily. It is indeed the same issue as seen on Power. From the */DoubleAccumulator.accumulate(D)V method we have an interface call to method *applyAsDouble*, for which we enter the interfaceCallHelper assembly stub [1]. Throughout the routine, it calls out to a bunch of runtime helpers to look up/resolve interface method calls, and it also calls a runtime helper to add the class to the PIC site. All the runtime helper calls made from this glue are done via another assembly glue [2][3]. This assembly glue from znathelp.m4 takes care of the work needed to switch from private linkage to C linkage, which includes saving all the C volatile registers (GPRs, FPRs/VRFs) before calling the C function. For the jitAddPicToPatchOnClassUnload helper call [4], however, we do not use the glue from znathelp.m4 to call the C function; we only switch/store the stack pointer register and store the volatile GPRs before calling the C function. In this job, the call to jitAddPicToPatchOnClassUnload clobbers either a vector register or a floating point register, which later causes the test to fail. I made a small change so that, instead of calling the C function directly from PicBuilder.m4, it calls the glue [5], which takes care of switching from private linkage to C linkage and saving all the C volatile registers, and with that change I do not see the failures.
I also had a look at the rest of the PicBuilder code and found that we also call fast_jitInstanceOf directly for private interface methods, where again we only store the GPRs before the call [6]. Although we do not run into any issue there at this point, there is a chance that anything live in vector/floating point registers could be clobbered by the C call.
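The failure mode can be illustrated with a toy model. This is only an analogy written in Java, not the actual m4 glue: the class `LinkageSketch`, its methods, and the array standing in for volatile FPRs/VRFs are all hypothetical. It shows why a value that is live in a volatile register survives a C call only if the caller saves and restores that register around the call.

```java
import java.util.Arrays;

// Hypothetical toy model; the array stands in for volatile FPRs/VRFs.
class LinkageSketch {
    // Stand-in for a C helper, which is free to overwrite volatile registers.
    static void cHelper(double[] fprs) {
        Arrays.fill(fprs, -1.0);
    }

    // Direct call: nothing is saved, so any live value is lost.
    static double callDirect(double[] fprs, int liveReg) {
        cHelper(fprs);
        return fprs[liveReg];
    }

    // Call through glue: save the volatile registers, call, then restore them.
    static double callViaGlue(double[] fprs, int liveReg) {
        double[] saved = Arrays.copyOf(fprs, fprs.length);
        cHelper(fprs);
        System.arraycopy(saved, 0, fprs, 0, saved.length);
        return fprs[liveReg];
    }

    public static void main(String[] args) {
        double[] direct = {1.5, 2.5, 3.5, 4.5};
        double[] glued = {1.5, 2.5, 3.5, 4.5};
        System.out.println("direct call:   " + callDirect(direct, 2));
        System.out.println("call via glue: " + callViaGlue(glued, 2));
    }
}
```

In this model the direct call returns the clobbered value, while the call through the glue returns the original one, at the cost of the extra save and restore.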
Working on creating a PR with fix.
[1]. https://github.com/eclipse/openj9/blob/d80689c517e3a9a4b017cc819f0ada7f78b96a51/runtime/compiler/z/runtime/PicBuilder.m4#L1843
[2]. https://github.com/eclipse/openj9/blob/d80689c517e3a9a4b017cc819f0ada7f78b96a51/runtime/codert_vm/znathelp.m4#L25
[3]. https://github.com/eclipse/openj9/blob/d80689c517e3a9a4b017cc819f0ada7f78b96a51/runtime/codert_vm/znathelp.m4#L42
[4]. https://github.com/eclipse/openj9/blob/d80689c517e3a9a4b017cc819f0ada7f78b96a51/runtime/compiler/z/runtime/PicBuilder.m4#L2038-L2082
[5]. https://github.com/eclipse/openj9/blob/d80689c517e3a9a4b017cc819f0ada7f78b96a51/runtime/codert_vm/znathelp.m4#L417
[6]. https://github.com/eclipse/openj9/blob/d80689c517e3a9a4b017cc819f0ada7f78b96a51/runtime/compiler/z/runtime/PicBuilder.m4#L2120
[7]. https://github.com/eclipse/openj9/pull/7290
@r30shah what is the outlook for fixing this?
@pshipton I have a safe fix ready, which I have tested through internal builds.
https://github.com/r30shah/openj9/commit/6e5df2c034c3675ba29a23a266e606012f37085b
https://github.com/r30shah/omr/commit/356dc0f2c48dc87e2b5fe423d4b0e5d25c4fdd91
I am holding off on opening a PR with this, as it is not the most optimal fix (but the safest one): it adds a branch to the assembly glue from PicBuilder before calling the C function. I am testing changes that avoid calling the assembly glue, which is a little riskier but worth a try. This issue is tagged for the 21 release, correct?
Correct, the 0.21 release, for which the branch occurs June 7. The test often fails and there hadn't been any update for 18 days since "Working on creating a PR with fix."