Intermittent failure ...
Testing(sm=true) US-ASCII....
STDERR:
java.lang.RuntimeException: getBytes(csn) failed -> US-ASCII
at TestStringCoding.testGetBytes(TestStringCoding.java:164)
at TestStringCoding.main(TestStringCoding.java:82)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
at java.base/java.lang.Thread.run(Thread.java:825)
JavaTest Message: Test threw exception: java.lang.RuntimeException: getBytes(csn) failed -> US-ASCII
... of test https://github.com/ibmruntimes/openj9-openjdk-jdk12/blob/openj9/test/jdk/sun/nio/cs/TestStringCoding.java observed on zLinux.
10x run showing 20% failure rate : consoleText.txt (external)
10x run showing 0% failure rate : consoleText.txt (internal)
100x run showing 3% failure rate : consoleText.txt (internal)
100x run showing 3% failure rate : consoleText.txt (internal)
100x run with -Xint (internal) is in progress - will update this comment with results
@dchopra001 can you take a look at this guy? I'm not aware of any recent changes in this area. The test is not too informative. i.e. it does not tell us what the byte arrays contained. Thankfully the test looks easy to modify so you can alter it to print out the failing arrays and that may give us a hint of what went wrong.
The 100x run with -Xint (internal) is progressing extremely slow in comparison to the previous runs - 30+ minutes for each iteration - console log thus far : consoleText.txt
At this rate, we'll likely only get 20 iterations in before the 10 hour job time limit kicks in.
The first 4 iterations have passed but I'm not confident that tells us anything yet, given the observed failure rate without -Xint.
18 complete iterations without any failure from the requested 100x -Xint run (internal) in 10 hours. It was then aborted per the Jenkins job timeout. Final console log : consoleText.txt
@fjeremic @pshipton Let me know if there is anything else you need me to try.
I've managed to reproduce this at a 100% rate with a local/modified version of the StringCoding test. Here's a sample output of what the two byte arrays look like in a failing run:
Testing(sm=true) US-ASCII....
baSC(firstArray):
0 1 2 3 4 5 6 7 8 9 a b c d 3f 10 11 12 13 14 15 16 17 18 19 1a 1b
baNIO(secondArray):
0 1 2 3 4 5 3f 7 8 9 a b c d 3f 10 11 12 13 14 15 16 17 18 19 1a 1b
Exception in thread "main" java.lang.RuntimeException: getBytes(csn) failed -> US-ASCII
at TestStringCoding.testGetBytes(TestStringCoding.java:182)
at TestStringCoding.main(TestStringCoding.java:85)
In the second array we have a 3f instead of 6. A similar pattern is seen among other failures. Instead of an actual failures we see 3f.
I'm trying to narrow the problem down further right now.
Any update here @dchopra001 ? Do we have any better understanding of the source of the mysterious 3f ?
Status Update
I have the problem narrowed down to the sub optimization that is causing the issue. I am still not sure if this is an optimization or codegen issue. Right now I am leaning towards an edge case being missed in the codegen. More details about the problem below.
During Value Propagation, we identify the icall node inside this block:
34822 n149n BBStart <block_16> [ 0x3ff66aefe50] bci=[-1,142,194] rc=0 vc=702 vn=- li=- udi=- nc=0
34823 n161n treetop [ 0x3ff66af0210] bci=[-1,155,194] rc=0 vc=702 vn=- li=- udi=- nc=1
34824 n160n icall sun/nio/cs/US_ASCII$Encoder.encodeASCII([CII[BI)I[#477 final special Method] [flags 0x20500 0x0 ] () [ 0x3ff66af01c0] bci=[-1,155,194] rc=2 vc=702 vn=- li=- udi=- nc=6 flg=0x20
34825 n152n aload this<'this' parm Lsun/nio/cs/US_ASCII$Encoder;>[#442 Parm] [flags 0x40000107 0x0 ] (X!=0 sharedMemory ) [ 0x3ff66aeff40] bci=[-1,142,194] rc=1 vc=702 vn=- li=- udi=- nc=0 flg=0x4
34826 n153n aload sa<auto slot 3>[#447 Auto] [flags 0x7 0x0 ] [ 0x3ff66aeff90] bci=[-1,143,194] rc=1 vc=702 vn=- li=- udi=- nc=0
34827 n154n iload sp<auto slot 4>[#452 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff66aeffe0] bci=[-1,144,194] rc=1 vc=702 vn=- li=- udi=- nc=0 flg=0x1000
34828 n157n isub [ 0x3ff66af00d0] bci=[-1,150,194] rc=1 vc=702 vn=- li=- udi=- nc=2
34829 n155n iload sl<auto slot 5>[#455 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff66af0030] bci=[-1,146,194] rc=1 vc=702 vn=- li=- udi=- nc=0 flg=0x1000
34830 n156n iload sp<auto slot 4>[#452 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff66af0080] bci=[-1,148,194] rc=1 vc=702 vn=- li=- udi=- nc=0 flg=0x1000
34831 n158n aload da<auto slot 6>[#462 Auto] [flags 0x7 0x0 ] [ 0x3ff66af0120] bci=[-1,151,194] rc=1 vc=702 vn=- li=- udi=- nc=0
34832 n159n iload dp<auto slot 7>[#467 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff66af0170] bci=[-1,153,194] rc=1 vc=702 vn=- li=- udi=- nc=0 flg=0x1000
34833 n162n istore n<auto slot 9>[#478 Auto] [flags 0x3 0x0 ] [ 0x3ff66af0260] bci=[-1,158,194] rc=0 vc=702 vn=- li=- udi=- nc=1
34834 n160n ==>icall
34835 n166n istore sp<auto slot 4>[#452 Auto] [flags 0x3 0x0 ] [ 0x3ff66af03a0] bci=[-1,165,195] rc=0 vc=702 vn=- li=- udi=- nc=1
34836 n165n iadd [ 0x3ff66af0350] bci=[-1,164,195] rc=1 vc=702 vn=- li=- udi=- nc=2
34837 n163n iload sp<auto slot 4>[#452 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff66af02b0] bci=[-1,160,195] rc=1 vc=702 vn=- li=- udi=- nc=0 flg=0x1000
34838 n164n iload n<auto slot 9>[#478 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff66af0300] bci=[-1,162,195] rc=1 vc=702 vn=- li=- udi=- nc=0 flg=0x1000
34839 n170n istore dp<auto slot 7>[#467 Auto] [flags 0x3 0x0 ] [ 0x3ff66af04e0] bci=[-1,172,196] rc=0 vc=702 vn=- li=- udi=- nc=1
34840 n169n iadd [ 0x3ff66af0490] bci=[-1,171,196] rc=1 vc=702 vn=- li=- udi=- nc=2
34841 n167n iload dp<auto slot 7>[#467 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff66af03f0] bci=[-1,167,196] rc=1 vc=702 vn=- li=- udi=- nc=0 flg=0x1000
34842 n168n iload n<auto slot 9>[#478 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff66af0440] bci=[-1,169,196] rc=1 vc=702 vn=- li=- udi=- nc=0 flg=0x1000
34843 n175n ificmpge --> block_22 BBStart at n15n () [ 0x3ff66af0670] bci=[-1,178,198] rc=0 vc=702 vn=- li=- udi=- nc=2 flg=0x20
34844 n171n iload sp<auto slot 4>[#452 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff66af0530] bci=[-1,174,198] rc=1 vc=702 vn=- li=- udi=- nc=0 flg=0x1000
34845 n172n iload sl<auto slot 5>[#455 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff66af0580] bci=[-1,176,198] rc=1 vc=702 vn=- li=- udi=- nc=0 flg=0x1000
34846 n150n BBEnd </block_16> ===== [ 0x3ff66aefea0] bci=[-1,178,198] rc=0 vc=701 vn=- li=- udi=- nc=0
And inline it into an arraytranslate node. After tracing this optimization the trees show the following new arraytranslate post inlining (the tracing doesn't say much else about the inlining):
38718 n4178n BBStart <block_301> [ 0x3ff66efe9b0] bci=[-1,155,194] rc=0 vc=0 vn=- li=- udi=- nc=0
38719 n4127n treetop [ 0x3ff66d2d9b0] bci=[-1,155,194] rc=0 vc=702 vn=- li=- udi=- nc=1
38720 n4153n arraytranslate <arraytranslate>[#381 helper Method] [flags 0x400 0x0 ] (char2byteXlate byteArrayXlate tableBackedByRawStorage ) [ 0x3ff66efe1e0] bci=[-1,155,194] rc=2 vc=0 vn=- li=- udi=- nc=6 flg=0xa020
38721 n4157n aladd [ 0x3ff66efe320] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=2
38722 n4144n aload <temp slot 53>[#750 Auto] [flags 0x7 0x0 ] [ 0x3ff66d2df00] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=0
38723 n4156n ladd [ 0x3ff66efe2d0] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=2
38724 n4155n lmul [ 0x3ff66efe280] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=2
38725 n4146n i2l [ 0x3ff66d2dfa0] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=1
38726 n4145n iload <temp slot 52>[#749 Auto] [flags 0x10003 0x0 ] [ 0x3ff66d2df50] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=0
38727 n4154n lconst 2 [ 0x3ff66efe230] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=0
38728 n4152n lconst 8 [ 0x3ff66efe190] bci=[-1,155,194] rc=2 vc=0 vn=- li=- udi=- nc=0
38729 n4159n aladd [ 0x3ff66efe3c0] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=2
38730 n4149n aload da<auto slot 6>[#462 Auto] [flags 0x7 0x0 ] [ 0x3ff66efe0a0] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=0
38731 n4158n ladd [ 0x3ff66efe370] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=2
38732 n4151n i2l [ 0x3ff66efe140] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=1
38733 n4150n iload dp<auto slot 7>[#467 Auto] [flags 0x3 0x0 ] [ 0x3ff66efe0f0] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=0
38734 n4152n ==>lconst 8
38735 n4160n aconst NULL [ 0x3ff66efe410] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=0
38736 n4161n iconst 255 [ 0x3ff66efe460] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=0
38737 n4148n i2l [ 0x3ff66efe050] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=1
38738 n4147n iload <temp slot 51>[#748 Auto] [flags 0x10003 0x0 ] [ 0x3ff66efe000] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=0
38739 n4162n iconst 127 [ 0x3ff66efe4b0] bci=[-1,155,194] rc=1 vc=0 vn=- li=- udi=- nc=0
38740 n4188n istore <temp slot 55>[#752 Auto] [flags 0x3 0x0 ] [ 0x3ff66efecd0] bci=[-1,155,194] rc=0 vc=0 vn=- li=- udi=- nc=1
38741 n4153n ==>arraytranslate
38742 n4190n asynccheck jitCheckAsyncMessages[#22 helper Method] [flags 0x400 0x0 ] [ 0x3ff66efed70] bci=[-1,155,194] rc=0 vc=0 vn=- li=- udi=- nc=0
38743 n4179n BBEnd </block_301> ===== [ 0x3ff66efea00] bci=[-1,155,194] rc=0 vc=0 vn=- li=- udi=- nc=0
I'm looking into the instructions generated right now to see if we perhaps missed an edge case.
@andrewcraik FYI - I don't think this is related to optimizer but tagging you just in case you notice something in the trees above or have any other insight regarding the opt.
@dchopra001 are you able to hop in the debugger and validate what the JIT is doing at the codegen level? We will evaluate arraytranslate into some form of TRTO (translate) instruction or similar which uses a precomputed static table to do the translation. It's possible we have a codegen bug there assuming the optimizer transformed the trees correctly.
@fjeremic Yes, I'm trying to see what the JIT is doing incorrectly in a debugger.
This isn't related to TROT instructions though because we end up taking this branch and generating vector instructions to handle this workload instead:
https://github.com/eclipse/omr/blob/18846d220b2b0417d75e171b6baf5819ed8555ff/compiler/z/codegen/OMRTreeEvaluator.cpp#L10586-L10590
So the issue should be in the arraytranslateEncodeSIMDEvaluator.
Note: Currently a quick workaround to this problem is to specify the -Xjit:disableSIMDArrayTranslate option. (arraytranslateEvaluator is the only evaluator that is affected by this option in the z codegen)
For documentation purposes, here is how to reproduce the problem.
This problem can be reproduced 100% with the following Java program:
public class TestStringCoding {
public static void helper() throws Throwable
{
Charset cs = Charset.availableCharsets().get("US-ASCII");
System.out.println("temp: " + cs.name());
int[] candidate = { 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 55296}; // This input creates the failure
StringBuilder sb = new StringBuilder();
StringBuilder sb2 = new StringBuilder();
for (int ii = 0; ii < candidate.length; ii++)
{
sb.append((char)candidate[ii]);
}
testGetBytes(cs, sb.toString());
}
public static void main(String[] args) throws Throwable {
helper();
}
static byte[] testGetBytes(Charset cs, String str) throws Throwable {
byte[] baSC = str.getBytes(cs.name()); // correct
//byte[] baSC = {0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x3f}; // For the failing case, the above getBytes() routine will return this array.
CharsetEncoder enc = cs.newEncoder()
.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE);
ByteBuffer bf = enc.reset().encode(CharBuffer.wrap(str.toCharArray()));
byte[] baNIO = new byte[bf.limit()];
bf.get(baNIO, 0, baNIO.length);
if (!Arrays.equals(baSC, baNIO)) {
System.out.println();
System.out.println("- Expected Output (baSC):");
for (int i = 0; i < baSC.length; i++)
{
System.out.print(Integer.toHexString(baSC[i]));
System.out.print(", ");
}
System.out.println();
System.out.println("Actual Output (baNIO)");
for (int i = 0; i < baNIO.length; i++)
{
System.out.print(Integer.toHexString(baNIO[i]));
System.out.print(", ");
}
System.out.println();
throw new RuntimeException("getBytes(csn) failed -> " + cs.name());
}
return baSC;
}
}
The file can be compiled with the following command:
jdk-12+33/bin/javac --add-exports java.base/sun.nio.cs=ALL-UNNAMED -cp "sun/nio/" sun/nio/cs/TestStringCoding.java;
(When run from the openj9-openjdk-jdk12/test/jdk/) folder.
The command to run the test, reproduce the failure and collect log files is:
jdk-12+33/bin/java -Xcompressedrefs -Xmx512m -Xjit:verbose,vlog=tempVlog,disableSuffixLogs,{"*Encoder.encodeArrayLoop*"}\(lastOptIndex=24,traceFull,traceCG,log=trace.log,lastOptSubIndex=42,traceLocalVP,optLevel=scorching,count=0,disableAsyncCompilation\) TestStringCoding
The jdk version I'm using is:
openjdk version "12" 2019-03-19
OpenJDK Runtime Environment AdoptOpenJDK (build 12+33)
Eclipse OpenJ9 VM AdoptOpenJDK (build openj9-0.13.0, JRE 12 Linux s390x-64-Bit Compressed References 20190320_31 (JIT enabled, AOT enabled)
OpenJ9 - caeb51f87
OMR - 33a33ff2
JCL - 419f571019 based on jdk-12+33)
@pshipton I don't think this needs to be rushed into 0.14.0. This bug has existed for ~3 years now and is not limited for JDK12.
Below is some background on how this problem was exposed on Java:
The error happens when we inline sun/nio/cs/US_ASCII$Encoder.encodeASCII([CII[BI)I[ into an arraytranslate node.
This routine essentially translates an array of 2-byte chars into an array of bytes. If during the translation the function encounters a value that is saturated (i.e. val > 127), then it stops translation and returns the number of chars it was able to translate. The caller of this routine then normalizes the saturated value, and then calls this function again to translate the rest of the array. This process repeats until there are no more remaining chars left to translate.
For example, consider the following int array of length=26:
{5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 55296, 29};
(Note: this array must be cast to a char array first).
To translate this array the caller will first provide a reference to the beginning of the array where the value is 5. The encodeASCII routine will be able to translate all values up to and including index 23 (it stops at 24 because array[24] = 55296 which is a saturated value). The function will return 24 as it was able to translate 24 values. At this point, the caller will normalize 55296 to 3f (which is the 25th value, and then call encodeASCII again but with the starting reference to the 26th value (array[25]=29). This process continues until we have translated all chars in the array.
To accelerate this workload, we inline this encodeASCII call into an arraytranslate node and generate vector instructions. These are generated inside: OMR::Z::TreeEvaluator::arraytranslateEncodeSIMDEvaluator.
For the above input array, the evaluator will load the first 16 chars into 2 input vector registers: v1 and v2 (Note: Each char is 2-bytes in length and so each vector can only hold up to 8 chars at a time). It will then use the VSTRC instruction available on z13 and above architectures to determine if any of the 2 vectors contain a saturated value. For the above scenario, there are no saturated values in the first 16 chars, so we then move on to process the remaining 10 chars. This is done by again loading the first 8 chars in v1 and the last two chars in v2. We then use VSTRC to detect if v1 or v2 have any saturated chars. Since v2 does have a saturated char, VSTRC will indicate the position of this character in another vector register, v0, and branch to an internal control flow that we generate here:
https://github.com/eclipse/omr/blob/8355cf729167f73259eb113f944e808e6c1a2c8c/compiler/z/codegen/OMRTreeEvaluator.cpp#L15225-L15254
(for convenience I'm pasting a copy of the code here)
// ----------------- Incoming branch -----------------
// Case 2: The 2nd vector register contained a saturated char
generateS390LabelInstruction(cg, TR::InstOpCode::LABEL, node, processSaturatedInput2);
// Extract the index of the first saturated char in the 2nd vector register
generateVRScInstruction(cg, TR::InstOpCode::VLGV, node, inputLen, vSaturated, generateS390MemoryReference(7, cg), 0);
// Divide by 2 as the unit length of each element in the vector register is 2 bytes
generateRSInstruction(cg, TR::InstOpCode::getShiftRightLogicalSingleOpCode(), node, inputLen, inputLen, 1);
// VLL and VSTL work on indices so we must subtract 1 (add 8 as saturation happened in the second input vector)
generateRIEInstruction(cg, TR::InstOpCode::getAddLogicalRegRegImmediateOpCode(), node, inputLenMinus1, inputLen, 7);
As can be seen above, we use VLGV and SRLG (via getShiftRightLogicalSingleOpCode) to calculate the position of the saturated char in the second vector. In our case, this would make the tegister inputLen=0 to indicate the saturated character is in the first halfword of v2.
After a few instructions we generate an instruction to add the value in inputLen to the total number of translated chars:
generateRRInstruction (cg, TR::InstOpCode::getAddRegOpCode(), node, translated, inputLen);
This translated register represents the return value of the function. As is evident, we have an incorrect return value in the register. Based on the above example, translated is 16 when it should be 24 (We update translated to 16 when we process the first 16 chars). As a consequence of this, the caller will assume the 17th value in the array is a saturated value and normalize it to 3f. In our above example, the 17th value (at index 16) is not a saturated value as it is less than 127. This leads to incorrect behaviour and similar output to the one seen in this issue.
The problem then is that inputLen is being used as a total count for the number of characters we translated in the residue. However that is not what it represents when the saturated char is in the second input vector v2. In the scenario described above, inputLen didn't account for the 8 characters that were successfully translated in the first input vector v1. When such a scenario happens, we must add 8 to either inputLen or the translated register to indicate this.
I've opened up a PR to fix this here:
https://github.com/eclipse/omr/pull/3719
I don't think this needs to be rushed into 0.14.0
ok, removing the milestone target.
Is it worth adding the test case into the test suite to avoid future regressions?
@dchopra001 wow, well done! Excellent analysis and write up of the problem. Reads take note. Thanks!
Is it worth adding the test case into the test suite to avoid future regressions?
Yes I would agree. We should open up an issue to write up some unit tests for this to cover such corner cases.
I've merged the fix. @dchopra001 can you please see the linked PRs which excluded the tests and submit a change to re-enable them? We can close this issue once the tests are re-enabled.
Test reincluded: https://github.com/AdoptOpenJDK/openjdk-tests/pull/1069 (thanks @dchopra001 !)
Issue to write tril tests for the arraytranslate node:
https://github.com/eclipse/omr/issues/3729
Excellent. Thank you very much! I think we're done here now.