OpenJ9: JVM crash on s390x when running Lucene Solr

Created on 4 May 2020  ·  29 comments  ·  Source: eclipse/openj9

Issue

While testing whether some old OpenJ9 bugs were still present in Lucene Solr, I ran into an issue on s390x. The issue is not present on ppc64le or on x86, where the Lucene Solr tests pass.

Here is the syserr dump I got:

Assertion failed at /home/jenkins/workspace/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-s390x-openj9/workspace/build/src/build/linux-s390x-normal-server-release/vm/compiler/../omr/compiler/z/codegen/OMRTreeEvaluator.cpp:3119: firstChild->getDataType() == secondChild->getDataType()
VMState: 0x0005ff04
    Data type of firstChild (VectorInt32) and secondChild (Int32) should match for generating compare and branch
compiling org/apache/lucene/codecs/compressing/CompressingStoredFieldsReader$BlockState.doReset(I)V at level: scorching
#0: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x8efbe6) [0x3ff7ebefbe6]
#1: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x8fc7e8) [0x3ff7ebfc7e8]
#2: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x5614d0) [0x3ff7e8614d0]
#3: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x561706) [0x3ff7e861706]
#4: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x9cefa2) [0x3ff7eccefa2]
#5: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x9d0f38) [0x3ff7ecd0f38]
#6: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x9d148c) [0x3ff7ecd148c]
#7: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x9340e2) [0x3ff7ec340e2]
#8: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x4d0c5e) [0x3ff7e7d0c5e]
#9: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x130234) [0x3ff7e430234]
#10: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x4dfad0) [0x3ff7e7dfad0]
#11: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x4ddf6e) [0x3ff7e7ddf6e]
#12: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x4db258) [0x3ff7e7db258]
#13: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x4f4c8a) [0x3ff7e7f4c8a]
#14: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x170462) [0x3ff7e470462]
#15: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x17138a) [0x3ff7e47138a]
#16: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9prt29.so(+0x22420) [0x3ff7f8a2420]
#17: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x173070) [0x3ff7e473070]
#18: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x1736ae) [0x3ff7e4736ae]
#19: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x16f0b0) [0x3ff7e46f0b0]
#20: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x16f378) [0x3ff7e46f378]
#21: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x16f460) [0x3ff7e46f460]
#22: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9prt29.so(+0x22420) [0x3ff7f8a2420]
#23: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x16f982) [0x3ff7e46f982]
#24: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9thr29.so(+0x58f0) [0x3ff7f9858f0]
#25: /lib/s390x-linux-gnu/libpthread.so.0(+0x7934) [0x3ff80907934]
#26: /lib/s390x-linux-gnu/libc.so.6(+0xedce2) [0x3ff806edce2]
JVMDUMP039I Processing dump event "abort", detail "" at 2020/05/01 21:13:01 - please wait.
JVMDUMP032I JVM requested System dump using '/home/ubuntu/lucene-solr/lucene/build/core/test/J2/core.20200501.211301.27314.0001.dmp' in response to an event
JVMPORT030W /proc/sys/kernel/core_pattern setting "|/usr/share/apport/apport %p %s %c %d %P %E" specifies that the core dump is to be piped to an external program.  Attempting to rename either core or core.28382.
JVMDUMP010I System dump written to /home/ubuntu/lucene-solr/lucene/build/core/test/J2/core.20200501.211301.27314.0001.dmp
JVMDUMP032I JVM requested Java dump using '/home/ubuntu/lucene-solr/lucene/build/core/test/J2/javacore.20200501.211301.27314.0002.txt' in response to an event
JVMDUMP010I Java dump written to /home/ubuntu/lucene-solr/lucene/build/core/test/J2/javacore.20200501.211301.27314.0002.txt
JVMDUMP032I JVM requested Snap dump using '/home/ubuntu/lucene-solr/lucene/build/core/test/J2/Snap.20200501.211301.27314.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /home/ubuntu/lucene-solr/lucene/build/core/test/J2/Snap.20200501.211301.27314.0003.trc
JVMDUMP007I JVM Requesting JIT dump using '/home/ubuntu/lucene-solr/lucene/build/core/test/J2/jitdump.20200501.211301.27314.0004.dmp'
#JITDUMP:  vmThread=000000000006DA00 Crashed while printing out current IL.
Assertion failed at /home/jenkins/workspace/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-s390x-openj9/workspace/build/src/build/linux-s390x-normal-server-release/vm/compiler/../omr/compiler/z/codegen/OMRTreeEvaluator.cpp:3119: firstChild->getDataType() == secondChild->getDataType()
VMState: 0x0005ff04
    Data type of firstChild (VectorInt32) and secondChild (Int32) should match for generating compare and branch
compiling org/apache/lucene/codecs/compressing/CompressingStoredFieldsReader$BlockState.doReset(I)V at level: scorching
#0: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x8efbe6) [0x3ff7ebefbe6]
#1: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x8fc7e8) [0x3ff7ebfc7e8]
#2: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x5614d0) [0x3ff7e8614d0]
#3: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x561706) [0x3ff7e861706]
#4: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x9cefa2) [0x3ff7eccefa2]
#5: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x9d0f38) [0x3ff7ecd0f38]
#6: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x9d148c) [0x3ff7ecd148c]
#7: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x9340e2) [0x3ff7ec340e2]
#8: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x4d0c5e) [0x3ff7e7d0c5e]
#9: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x130234) [0x3ff7e430234]
#10: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x4dfad0) [0x3ff7e7dfad0]
#11: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x4ddf6e) [0x3ff7e7ddf6e]
#12: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x4db258) [0x3ff7e7db258]
#13: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x4f4c8a) [0x3ff7e7f4c8a]
#14: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x170462) [0x3ff7e470462]
#15: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x17138a) [0x3ff7e47138a]
#16: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9prt29.so(+0x22420) [0x3ff7f8a2420]
#17: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x1731c4) [0x3ff7e4731c4]
#18: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x1736ae) [0x3ff7e4736ae]
#19: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x16f0b0) [0x3ff7e46f0b0]
#20: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x16f378) [0x3ff7e46f378]
#21: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x16f460) [0x3ff7e46f460]
#22: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9prt29.so(+0x22420) [0x3ff7f8a2420]
#23: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9jit29.so(+0x16f982) [0x3ff7e46f982]
#24: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/lib/compressedrefs/libj9thr29.so(+0x58f0) [0x3ff7f9858f0]
#25: /lib/s390x-linux-gnu/libpthread.so.0(+0x7934) [0x3ff80907934]
#26: /lib/s390x-linux-gnu/libc.so.6(+0xedce2) [0x3ff806edce2]

System Information

OS: Ubuntu 16.04.6 LTS
Architecture: s390x
Output of java --version:

openjdk 11.0.7 2020-04-14
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+10)
Eclipse OpenJ9 VM AdoptOpenJDK (build openj9-0.20.0, JRE 11 Linux s390x-64-Bit Compressed References 20200416_564 (JIT enabled, AOT enabled)
OpenJ9   - 05fa2d361
OMR      - d4365f371
JCL      - 838028fc9d9 based on jdk-11.0.7+10)

Steps to Reproduce

On an s390x Ubuntu 16.04 system, these are the commands I used:

sudo apt-get update -y && sudo apt-get upgrade -y && sudo wget -qO - https://adoptopenjdk.jfrog.io/adoptopenjdk/api/gpg/key/public | sudo apt-key add - && sudo add-apt-repository --yes https://adoptopenjdk.jfrog.io/adoptopenjdk/deb/

sudo apt-get update && sudo apt-get install adoptopenjdk-11-openj9  ant -y

git clone https://github.com/apache/lucene-solr.git

cd lucene-solr/lucene/

ant ivy-bootstrap && ant test -Dtests.nightly=true -Dtests.slow=true

The total execution time was 57 minutes 28 seconds before this error appeared.

Additional Output

This was the output from the console:

   [junit4] JVM J0:     0.59 ..  3448.79 =  3448.21s
   [junit4] Execution time total: 57 minutes 28 seconds
   [junit4] ERROR: JVM J1 ended with an exception, command line: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/bin/java -XX:TieredStopAtLevel=1 -ea -esa --illegal-access=deny -Dtests.prefix=tests -Dtests.seed=153D11360BC84029 -Xmx512M -Dtests.iters= -Dtests.verbose=false -Dtests.infostream=false -Dtests.codec=random -Dtests.postingsformat=random -Dtests.docvaluesformat=random -Dtests.locale=random -Dtests.timezone=random -Dt$
sts.directory=random -Dtests.linedocsfile=europarl.lines.txt.gz -Dtests.luceneMatchVersion=9.0.0 -Dtests.cleanthreads=perMethod -Djava.util.logging.config.file=/home/ubuntu/lucene-solr/lucene/tools/junit4/logging.properties -Dtests.nightly=true -Dtests.weekly=false -Dtests.monster=false -Dtests.slow=true -Dtests.asserts=true -Dtests.multiplier=1 -DtempDir=./temp -Djava.io.tmpdir=./temp -Dcommon.dir=/home/ubuntu/lucene-so$
r/lucene -Dclover.db.dir=/home/ubuntu/lucene-solr/lucene/build/clover/db -Djava.security.policy=/home/ubuntu/lucene-solr/lucene/tools/junit4/tests.policy -Dtests.LUCENE_VERSION=9.0.0 -Djetty.testMode=1 -Djetty.insecurerandom=1 -Dsolr.directoryFactory=org.apache.solr.core.MockDirectoryFactory -Djava.awt.headless=true -Djdk.map.althashing.threshold=0 -Dtests.src.home=/home/ubuntu/lucene-solr -Djava.security.egd=file:/dev/.$
urandom -Djunit4.childvm.cwd=/home/ubuntu/lucene-solr/lucene/build/core/test/J1 -Djunit4.tempDir=/home/ubuntu/lucene-solr/lucene/build/core/test/temp -Djunit4.childvm.id=1 -Djunit4.childvm.count=3 -Dfile.encoding=US-ASCII -Djava.security.manager=org.apache.lucene.util.TestSecurityManager -Dtests.filterstacks=true -Dtests.leaveTemporary=false -classpath /home/ubuntu/lucene-solr/lucene/build/codecs/classes/java:/home/ubunt$
/lucene-solr/lucene/build/test-framework/classes/java:/home/ubuntu/lucene-solr/lucene/test-framework/lib/hamcrest-core-1.3.jar:/home/ubuntu/lucene-solr/lucene/test-framework/lib/junit-4.12.jar:/home/ubuntu/lucene-solr/lucene/test-framework/lib/randomizedtesting-runner-2.7.6.jar:/home/ubuntu/lucene-solr/lucene/build/core/classes/java:/home/ubuntu/lucene-solr/lucene/build/core/classes/test:/home/ubuntu/.ivy2/cache/com.carr$
tsearch.randomizedtesting/junit4-ant/jars/junit4-ant-2.7.6.jar com.carrotsearch.ant.tasks.junit4.slave.SlaveMainSafe -eventsfile /home/ubuntu/lucene-solr/lucene/build/core/test/temp/junit4-J1-20200501_210841_51614402896291358795987.events @/home/ubuntu/lucene-solr/lucene/build/core/test/temp/junit4-J1-20200501_210841_51617518222344466159397.suites -stdin
   [junit4] ERROR: JVM J1 ended with an exception: Forked process returned with error code: 134. Very likely a JVM crash.  See process stderr at: /home/ubuntu/lucene-solr/lucene/build/core/test/temp/junit4-J1-20200501_210841_516967293774046938498.syserr
   [junit4]     at com.carrotsearch.ant.tasks.junit4.JUnit4.executeSlave(JUnit4.java:1542)
   [junit4]     at com.carrotsearch.ant.tasks.junit4.JUnit4.access$000(JUnit4.java:123)
   [junit4]     at com.carrotsearch.ant.tasks.junit4.JUnit4$2.call(JUnit4.java:997)
   [junit4]     at com.carrotsearch.ant.tasks.junit4.JUnit4$2.call(JUnit4.java:994)
   [junit4]     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   [junit4]     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   [junit4]     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   [junit4]     at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4] ERROR: JVM J2 ended with an exception, command line: /usr/lib/jvm/adoptopenjdk-11-openj9-s390x/bin/java -XX:TieredStopAtLevel=1 -ea -esa --illegal-access=deny -Dtests.prefix=tests -Dtests.seed=153D11360BC84029 -Xmx512M -Dtests.iters= -Dtests.verbose=false -Dtests.infostream=false -Dtests.codec=random -Dtests.postingsformat=random -Dtests.docvaluesformat=random -Dtests.locale=random -Dtests.timezone=random -Dt$
sts.directory=random -Dtests.linedocsfile=europarl.lines.txt.gz -Dtests.luceneMatchVersion=9.0.0 -Dtests.cleanthreads=perMethod -Djava.util.logging.config.file=/home/ubuntu/lucene-solr/lucene/tools/junit4/logging.properties -Dtests.nightly=true -Dtests.weekly=false -Dtests.monster=false -Dtests.slow=true -Dtests.asserts=true -Dtests.multiplier=1 -DtempDir=./temp -Djava.io.tmpdir=./temp -Dcommon.dir=/home/ubuntu/lucene-so$
r/lucene -Dclover.db.dir=/home/ubuntu/lucene-solr/lucene/build/clover/db -Djava.security.policy=/home/ubuntu/lucene-solr/lucene/tools/junit4/tests.policy -Dtests.LUCENE_VERSION=9.0.0 -Djetty.testMode=1 -Djetty.insecurerandom=1 -Dsolr.directoryFactory=org.apache.solr.core.MockDirectoryFactory -Djava.awt.headless=true -Djdk.map.althashing.threshold=0 -Dtests.src.home=/home/ubuntu/lucene-solr -Djava.security.egd=file:/dev/.$
urandom -Djunit4.childvm.cwd=/home/ubuntu/lucene-solr/lucene/build/core/test/J2 -Djunit4.tempDir=/home/ubuntu/lucene-solr/lucene/build/core/test/temp -Djunit4.childvm.id=2 -Djunit4.childvm.count=3 -Dfile.encoding=US-ASCII -Djava.security.manager=org.apache.lucene.util.TestSecurityManager -Dtests.filterstacks=true -Dtests.leaveTemporary=false -classpath /home/ubuntu/lucene-solr/lucene/build/codecs/classes/java:/home/ubunt$
/lucene-solr/lucene/build/test-framework/classes/java:/home/ubuntu/lucene-solr/lucene/test-framework/lib/hamcrest-core-1.3.jar:/home/ubuntu/lucene-solr/lucene/test-framework/lib/junit-4.12.jar:/home/ubuntu/lucene-solr/lucene/test-framework/lib/randomizedtesting-runner-2.7.6.jar:/home/ubuntu/lucene-solr/lucene/build/core/classes/java:/home/ubuntu/lucene-solr/lucene/build/core/classes/test:/home/ubuntu/.ivy2/cache/com.carr$
tsearch.randomizedtesting/junit4-ant/jars/junit4-ant-2.7.6.jar com.carrotsearch.ant.tasks.junit4.slave.SlaveMainSafe -eventsfile /home/ubuntu/lucene-solr/lucene/build/core/test/temp/junit4-J2-20200501_210841_5163001900503949437802.events @/home/ubuntu/lucene-solr/lucene/build/core/test/temp/junit4-J2-20200501_210841_51613718426740564116325.suites -stdin
   [junit4] ERROR: JVM J2 ended with an exception: Forked process returned with error code: 134. Very likely a JVM crash.  See process stderr at: /home/ubuntu/lucene-solr/lucene/build/core/test/temp/junit4-J2-20200501_210841_5163752315207595302421.syserr
   [junit4]     at com.carrotsearch.ant.tasks.junit4.JUnit4.executeSlave(JUnit4.java:1542)
   [junit4]     at com.carrotsearch.ant.tasks.junit4.JUnit4.access$000(JUnit4.java:123)
   [junit4]     at com.carrotsearch.ant.tasks.junit4.JUnit4$2.call(JUnit4.java:997)
   [junit4]     at com.carrotsearch.ant.tasks.junit4.JUnit4$2.call(JUnit4.java:994)
   [junit4]     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   [junit4]     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   [junit4]     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   [junit4]     at java.base/java.lang.Thread.run(Thread.java:834)

BUILD FAILED
/home/ubuntu/lucene-solr/build.xml:59: The following error occurred while executing this line:
/home/ubuntu/lucene-solr/lucene/build.xml:50: The following error occurred while executing this line:
/home/ubuntu/lucene-solr/lucene/common-build.xml:1521: The following error occurred while executing this line:
/home/ubuntu/lucene-solr/lucene/common-build.xml:1048: At least one slave process threw an exception, first: Forked process returned with error code: 134. Very likely a JVM crash.  See process stderr at: /home/ubuntu/lucene-solr/lucene/build/core/test/temp/junit4-J1-20200501_210841_516967293774046938498.syserr

Total time: 57 minutes 31 seconds
Labels: jit, segfault, userRaised

All 29 comments

@mstoodle FYI ^^

fyi @fjeremic

To close the loop: @fjeremic this is the problem I mentioned to you earlier this morning

@r30shah can you please take a look? We are hitting the assert added in eclipse/omr#4763. The assert seems valid, and the tree appears to be ill-formed here. We need a log to find the source of this tree.

We should also take this opportunity to improve that assert by using the TR_ASSERT_FATAL_WITH_NODE API [1], which would have printed the tree in question as part of the assert.

Also, please work with me to give feedback on whether the jitdump generated here produced a useful trace.

[1] https://github.com/eclipse/omr/blob/ac5f4ec2cb7cebf898408bc23734427583b81543/compiler/infra/Assert.hpp#L57-L59

I tried the steps described by @james-crowley and was able to reproduce this fairly consistently on a local dev machine running Ubuntu 18.04. Looking into the jitdump produced by the failure, I am seeing the following trees in the dump.

 n13624n  (  0)  ificmplt --> block_1170 BBStart at n19856n ()                                        [     0x3fec2e89240] bci=[-1,351,442] rc=0 vc=12 vn=- li=1080 udi=- nc=3 flg=0x20
 n21566n  (  4)    ==>viRegLoad (in VRF_1559) (SeenRealReference )
 n21560n  (  2)    ==>iRegLoad (in GPR_1553) (cannotOverflow SeenRealReference )

The first child is viRegLoad while the second child is iRegLoad. This is indeed illegal for the comparison node ificmplt, whose children should both have the same type (TR::Int32) [1].

@fjeremic I have just run this with the driver available on AdoptOpenJDK, but I will be happy to give it a spin with your JIT dump improvement changes. As for further investigation, we need to see where that tree is generated, since the tree itself is incorrect in the first place. Does the latest OpenJ9 and OMR contain your changes, or do I need to add some uncommitted changes? I guess for this one, the change to trace all the opts is the one I need.

[1]. https://github.com/eclipse/omr/blob/ebc1043968e90172da6fc7d45f9f942ffb017b53/compiler/il/OMRILOpCodeProperties.hpp#L4696-L4710

Does the latest OpenJ9 and OMR contain your changes, or do I need to add some uncommitted changes?

I can work with you offline on this one so we can stay focused on the issue at hand. Yeah that tree looks ill-formed.

@r30shah we're coming up on the v0.20.0 code split on June 7th. Will this make the release?

@DanHeidinga I have not actively looked into this issue as I got caught up with other issues for the 0.21 release. The trees in question were generated by the SPMDKernelParallelization optimization. I am currently looking into the method log with tracing for that optimization; so far it seems we simply failed to convert one of the trees to a vectorized node, leaving us with an ill-formed tree. As I am done with the other issues, I am looking into this today to see if it is an easy fix that can go into the current release.
FYI @gita-omr

After digging into the failure, I found the issue with the trees generated by the SIMD optimization.
Consider the following trees:

n39n      BBStart <block_4> (freq 9866) (in loop 4)
n64n      istorei  <array-shadow>[#241  Shadow] [flags 0x80000603 0x0 ]
n63n        aladd (X>=0 internalPtr sharedMemory )
n43n          aload  <temp slot 23>[#533  Auto] [flags 0x7 0x0 ] (X!=0 sharedMemory )
n62n          lsub (highWordZero X>=0 cannotOverflow )
n60n            lmul (X>=0 cannotOverflow )
n59n              i2l (highWordZero X>=0 )
n48n                isub
n47n                  iload  <auto slot 4>[#381  Auto] [flags 0x3 0x0 ] (cannotOverflow )
n46n                  iconst -1 (X!=0 X<=0 )
n58n              lconst 4 (highWordZero X!=0 X>=0 )
n61n            lconst -8 (X!=0 X<=0 )
n53n        imul
n48n          ==>isub
n52n          iload  <temp slot 18>[#509  Auto] [flags 0x3 0x0 ] (cannotOverflow )
n68n      istore  <auto slot 4>[#381  Auto] [flags 0x3 0x0 ]
n48n        ==>isub
n1311n    ificmplt --> block_4 BBStart at n39n ()
n48n        ==>isub
n1313n      iload  <temp slot 25>[#535  Auto] [flags 0x3 0x0 ] (X>=0 cannotOverflow )
n1310n    BBEnd </block_4> =====

In the above trees (which form a loop), we vectorize the loop and redefine n48n to a vector type (VectorInt32). n48n is also used as the loop induction variable, so we end up with the trees that cause the assert to be thrown. Another issue I observed in the log file is that when n48n is redefined to VectorInt32, a parent node in the store tree is still i2l, so in the codegen we end up with a tree that looks like the following.

...
nXn      i2l (GPR_0001)
nYn         vsub (VRF_0001)
... 

And we end up with an instruction that looks like the following:

...
LGFR GPR_0001, VRF_0001
...

@gita-omr @IBMJimmyk I played with the failure and was able to extract the Java code from Lucene into a unit test that reproduces the failure. I will let you take the investigation further as this falls into your area of expertise. Let me know if you need any assistance.

Here is the test:

import java.util.Random;
public class UnitTest {
    public static final Random rand = new Random();
    private int[] offsets = new int[0];
    private int cachedRandomNumber = -1;
    public static int[] grow(int[] array, int minSize) {
        int totalLength = array.length + minSize;
        int[] returnArray = new int[totalLength];
        System.arraycopy(array, 0, returnArray, 0, array.length);
        return returnArray;
    }
    private int returnCachedRandomNumber() {
        if (cachedRandomNumber == - 1)
            cachedRandomNumber = 100+rand.nextInt(100);
        return cachedRandomNumber;
    }
    private void unitTest(int numberOfChunks) {
        final int token = returnCachedRandomNumber();
        int addnArrayLength = token >>> 1;
        offsets = grow(offsets, addnArrayLength);
        for (int i = 0; i < addnArrayLength-1; ++i)
            offsets[1 + i] = (1 + i) * token;
    }
    public static void main(String[] args) {
        System.out.println("Unit Test Startin, running it 100 times to make sure JIT compiled");
        for (int i = 0; i < 10; ++i) {
            UnitTest obj = new UnitTest();
            for (int j=0; j < 10; ++j)
                obj.unitTest(i);
        }
        System.out.println("Unit Test Starting");
        for (int i = 0; i < 100; ++i) {
            UnitTest obj = new UnitTest();
            for (int j=0; j < 100; ++j)
                obj.unitTest(i);
        }
        System.out.println("Unit Test Finished");
    }
}

I downloaded JDK 11 for Linux on IBM Z from the AdoptOpenJDK site, and using that I was able to reproduce this with the following steps:

$ jdk-11.0.8+5/bin/javac UnitTest.java
$ jdk-11.0.8+5/bin/java -Xshareclasses:none -Xnoaot -Xjit:count=100,disableAsyncCompilation,verbose,vlog=jitV,"{*unitTest*}(optLevel=scorching,traceFull,log=methodLog)"  UnitTest
Unit Test Startin, running it 100 times to make sure JIT compiled
Unit Test Starting
Assertion failed at /home/jenkins/workspace/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-s390x-openj9/workspace/build/src/build/linux-s390x-normal-server-release/vm/compiler/../omr/compiler/z/codegen/OMRTreeEvaluator.cpp:3119: firstChild->getDataType() == secondChild->getDataType()
VMState: 0x0005ff04
    Data type of firstChild (VectorInt32) and secondChild (Int32) should match for generating compare and branch
compiling UnitTest.unitTest(I)V at level: scorching
#0: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x901fe6) [0x3ff83b01fe6]
#1: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x90e6b0) [0x3ff83b0e6b0]
#2: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x56c3e0) [0x3ff8376c3e0]
#3: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x56e136) [0x3ff8376e136]
#4: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x9e1ad2) [0x3ff83be1ad2]
#5: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x9e3a68) [0x3ff83be3a68]
#6: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x9e3fbc) [0x3ff83be3fbc]
#7: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x947aca) [0x3ff83b47aca]
#8: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x4db8ba) [0x3ff836db8ba]
#9: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x131672) [0x3ff83331672]
#10: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x4ea5e8) [0x3ff836ea5e8]
#11: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x4e8ade) [0x3ff836e8ade]
#12: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x4e5dc8) [0x3ff836e5dc8]
#13: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x4ff822) [0x3ff836ff822]
#14: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x171f22) [0x3ff83371f22]
#15: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x172e52) [0x3ff83372e52]
#16: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9prt29.so(+0x22478) [0x3ff887a2478]
#17: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x174b70) [0x3ff83374b70]
#18: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x1751be) [0x3ff833751be]
#19: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x170b70) [0x3ff83370b70]
#20: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x170e38) [0x3ff83370e38]
#21: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x170f20) [0x3ff83370f20]
#22: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9prt29.so(+0x22478) [0x3ff887a2478]
#23: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9jit29.so(+0x171442) [0x3ff83371442]
#24: /sandbox/rahil/9446/jdk-11.0.8+5/lib/compressedrefs/libj9thr29.so(+0x58f0) [0x3ff888858f0]
#25: /lib/s390x-linux-gnu/libpthread.so.0(+0x7aa8) [0x3ff89807aa8]
#26: /lib/s390x-linux-gnu/libc.so.6(+0xfa936) [0x3ff895fa936]
#27: [(nil)]

Excellent job @r30shah. Whoever ends up fixing the actual issue here: as part of the fix, if possible, I'd like to see a variation of the above unit test added to our testing bucket to ensure such problems never arise again, since this came from a user-raised issue and we would like to avoid regressing users in the future.

Whoever ends up fixing the actual issue here: as part of the fix, if possible, I'd like to see a variation of the above unit test added to our testing bucket to ensure such problems never arise again, since this came from a user-raised issue and we would like to avoid regressing users in the future.

Sure, I can refactor the above unit test (with some functionality-checking enhancements) and contribute it to the OpenJ9 tests.

@gita-omr @IBMJimmyk wondering if there may be a better person to take a look at the SPMDParallelizer issue noted by Rahil above. Let us know if you will have time to take a look before the 0.21 release.

@fjeremic I am currently looking at the issue; sorry for not updating it. The root cause of the failure is wrong IL generation due to commoning. This happens in cases where the loop control variable is commoned with the vectorized IL nodes.
On Power and x86 this would also have ended in a crash if the loop were vectorized; on those platforms the loop is not vectorized because the dependency requirements are not met.
I did come up with an initial solution, and so far, with that change, it seems to generate correct IL. However, I cannot access a Z system yet to test it. I will update here once I can check the solution on a Z machine.

Might be worth opening a follow-up issue to investigate the lack of vectorization on x86 and Power; it seems odd that a loop dependency would be platform specific, particularly given the loop in question (assuming it's the one in unitTest()).

@mstoodle this is a very specific optimization where we are trying to vectorize the induction variable itself. It's something like this:

for (i = 0; i < N; i++)
{
    a[i] = (i + 1) * X;
}

We are creating a vector temp for i on the RHS, [0,1,2,3], and then incrementing it by [1,1,1,1]. In order to do that we use vsetelem and some other opcodes that are not supported on Power and Intel.

The bug on Z happens because the (i+1) on the RHS is commoned with the i++ in the loop control expression. Note that we do handle the case where just i is commoned between the two properly. But it's true that this particular optimization currently effectively exists only on Z.
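The transformation described above can be mimicked in plain Java as a rough illustration (this is not OpenJ9 code; the 4-lane width and all names are assumptions for the sketch). The "vectorized" version keeps a lane counter seeded with [1,2,3,4] and bumps every lane by the stride each iteration, which is exactly the per-lane induction-variable state that must stay in sync with the scalar loop counter:

```java
// Hedged sketch of induction-variable vectorization, lane width 4 assumed.
public class IvVectorizationSketch {
    // Scalar reference loop: a[i] = (i + 1) * x
    static int[] scalar(int n, int x) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++)
            a[i] = (i + 1) * x;
        return a;
    }

    // Conceptual vectorized version: a 4-lane counter stands in for (i + 1).
    static int[] vectorized(int n, int x) {
        int[] a = new int[n];
        int[] iv = {1, 2, 3, 4};          // vector temp seeded with lane indices
        int i = 0;
        for (; i + 4 <= n; i += 4) {      // main vectorized loop
            for (int lane = 0; lane < 4; lane++)
                a[i + lane] = iv[lane] * x;
            for (int lane = 0; lane < 4; lane++)
                iv[lane] += 4;            // bump every lane by the stride
        }
        for (; i < n; i++)                // scalar residue loop
            a[i] = (i + 1) * x;
        return a;
    }

    public static void main(String[] args) {
        if (!java.util.Arrays.equals(scalar(11, 7), vectorized(11, 7)))
            throw new AssertionError("vectorized result diverges from scalar");
        System.out.println("ok");
    }
}
```

The hazard is that the scalar counter i and the per-lane counter iv encode the same induction variable in two forms; if the IL commons the scalar (i+1) used by the loop-control compare with the vectorized counter, the compare ends up with one VectorInt32 child and one Int32 child, which is the assert seen above.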

@fjeremic @r30shah maybe to speed things up, let's not burden @mnalam-p with getting access to Z? It's strictly speaking a common code issue. We can give you the code to test after it's fixed in the common code.

Sure! @r30shah would you be able to get in touch with @mnalam-p once he has a fix and you can give it a spin on a Z machine?

to do that we use vsetelem and some other opcodes that are not supported on Power and Intel.

Sadness. Presumably that's a gap in the code generators rather than in the architectures. So maybe filling that gap is the missing issue: implementing the vsetelem evaluators (maybe even as a "beginner" issue)? Or do we already have an issue for that?

The missing issue I mentioned is somewhat related to #3898, so mentioning it here to cross-link.

I ran a custom build containing my changes through the sanity, regression, and promotion tests, and they all passed. I have encountered some test failures with the OpenJ9 build and suspect they are not related to my changes; I am trying to get some suggestions from @r30shah on them.
Meanwhile, I am discussing with @gita-omr how to make the fix more generic.

Thanks @mnalam-p for the investigation and the prototype. I've been thinking about the problem and wrote down my thoughts here: https://github.com/eclipse/openj9/issues/9928

Considering that the most general solution will take time, I would suggest we disable vectorization of the induction variable for now. Please let me know what you think.

@gita-omr With vectorization of the induction variable disabled, how much performance impact do you expect? (Considering it is one type of transformation from the umbrella of different SIMD transformations.)

It is a small subset of what auto-SIMD generally does. There is of course a chance that somewhere there is a small loop that was vectorized before and would not be now, but I'm just afraid the correctness issues outweigh the possible (but most likely rare) benefits. Note that vectorization of the induction variable (IV) is currently effectively only enabled on Z.

If we see a degradation somewhere, it would be a great incentive to fix it properly.

@gita-omr Thanks, I agree. Considering the tree it generated and the instructions generated later on, I have serious doubts about the functional correctness. I would like to get the opinion of @fjeremic as well on disabling vectorization of the IV.

I'm ok with disabling as well until we are certain of functional correctness.

I found something interesting: with my changes and @r30shah's test case, the output was functionally incorrect although the generated IL seemed correct. I have now disabled auto-vectorization for the problematic situations but left it enabled for the other situations (where it generates correct IL). However, to my surprise, the vectorization was still functionally incorrect (it failed with a different test case similar to @r30shah's). Maybe we need to look at the codegen as well, because I can see the following message from z/codegen in the trace file:
ArtificiallyInflateReferenceCountWhenNecessary: bug potential with node 000003FF67EF0780 (reg 000003FF65B92760 &GPR_0320): ref count after OMR::Z::MemoryReference::tryBaseIndexDispl is 1. _stackOfArtificiallyInflatedNodes.pop() 000003FF67EF0780, decReferenceCount(...) called. reg=&GPR_0320 _stackOfMemoryReferencesCreatedDuringEvaluation.pop() 000003FF65B93B70, stopUsingMemRefRegister called.
It seems there was another bug even before the incorrect IL issue.

Considering this test case:
for (int i = 0; i < addnArrayLength-1; ++i) offsets[i+1] = i * token;
The IL generation should be correct, and it appears to be. This makes the iload of i + 1 be commoned with the loop control variable, yielding the following IL (one instance of the unrolled loop):
          BBStart <block_4> (freq 9857) (in loop 177)
n62n      vstorei  <array-shadow>[#255  Shadow] [flags 0x8000060a 0x0 ]
n61n        aladd (X>=0 internalPtr sharedMemory )
n43n          aload  <temp slot 22>[#539  Auto] [flags 0x7 0x0 ] (X!=0 sharedMemory )
n60n          lsub (highWordZero X>=0 cannotOverflow )
n58n            lmul (X>=0 cannotOverflow )
n57n              i2l (highWordZero X>=0 )
n48n                isub
n46n                  iload  <auto slot 3>[#387  Auto] [flags 0x3 0x0 ] (cannotOverflow )
n2500n                iconst -4 (X!=0 X<=0 )
n56n              lconst 4 (highWordZero X!=0 X>=0 )
n59n            lconst -8 (X!=0 X<=0 )
n51n        vmul (VectorInt32)
n2566n        vload  <auto slot -1>[#546  Auto] [flags 0xa 0x0 ] (cannotOverflow )
n2569n        vsplats (VectorInt32)
n2568n          iload  <temp slot 17>[#515  Auto] [flags 0x3 0x0 ] (cannotOverflow )
n2571n    vstore  <auto slot -1>[#546  Auto] [flags 0xa 0x0 ]
n2572n      vsub (VectorInt32)
n2576n        vload  <auto slot -1>[#546  Auto] [flags 0xa 0x0 ] (cannotOverflow )
n2579n        vsplats (VectorInt32)
n2578n          iconst -4 (X!=0 X<=0 )
n66n      istore  <auto slot 3>[#387  Auto] [flags 0x3 0x0 ]
n48n        ==>isub
n1306n    ificmplt --> block_177 BBStart at n2414n ()
n48n        ==>isub
n2475n      iload  <temp slot 28>[#545  Auto] [flags 0x3 0x0 ]
n1305n    BBEnd </block_4> =====
This fails the functionality test.

@mnalam-p Maybe you can continue the discussion on the issue @gita-omr created (https://github.com/eclipse/openj9/issues/9928), where you put the full log as well. This may also help: when I ran my unit test with your potential fix, it was the very first vectorized iteration that was wrong; the first four elements, for [0,1,2,3], should be [1a,2a,3a,4a] but were [0,0,0,0], so something is off when you enter the loop.

@mstoodle I will add codegen todos into #9928

