OpenJ9: Building OpenJ9 for an e500v2-core-equipped SoC

Created on 12 Aug 2018 · 123 comments · Source: eclipse/openj9

Dear All,

I'm trying to build OpenJ9 on a PPC SoC equipped with an e500v2 core. This core doesn't have the AltiVec IP block (instead it uses the SPE extension for floating-point calculation).

The problem seems to be the OpenJ9 assumption that all supported cores support AltiVec instructions. One of the assembly-tuned files:
./openj9/runtime/compiler/p/runtime/J9PPCCRC32.spp

This is the optimized __crc32_vpmsum [1] implementation of CRC32 calculation for 16-byte data blocks.

Is there any C implementation of this function available? Or maybe one for SPE assembler?

Please correct me if I'm wrong, but it seems to me that one would need to:

  • Rewrite [1] in C and then let gcc optimize it for the e500v2 core

or

  • Rewrite [1] from scratch to use SPE assembler instructions instead of AltiVec

Personally, I would prefer the first option (C), but I'm not sure what the performance impact
on OpenJ9 would be.
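To illustrate what option one could look like, here is a minimal sketch of a plain-C bitwise CRC-32 (reflected, polynomial 0xEDB88320, which is, if I understand correctly, the variant java.util.zip.CRC32 computes). The function name is mine, not OpenJ9's; gcc can optimize this freely, and a table-driven or slice-by-8 variant would be the usual next step for speed:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical portable fallback (not OpenJ9 code): bit-at-a-time
   reflected CRC-32 with polynomial 0xEDB88320. */
static uint32_t crc32_sw(uint32_t crc, const uint8_t *buf, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0u);
    }
    return ~crc;
}
```

Feeding the standard test vector "123456789" with an initial CRC of 0 should give the well-known check value 0xCBF43926.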

Has anybody tried to run OpenJ9 on e500_v2?

Thanks in advance,
Łukasz

Labels: jit, userRaised

Most helpful comment

@PTamis I was able to recreate your latest crash. I've pushed a commit to https://github.com/eclipse/openj9/pull/2764 to fix it. We can do better than this fix performance-wise, but it's OK for the moment.

All 123 comments

@gita-omr @ymanton this seems like your area of expertise. Could you help answer OP's questions?

The problem seems to be with the OpenJ9 assumption that all supported cores support AltiVec instructions.

We only use AltiVec if we detect the processor at runtime and know that it supports AltiVec. The same applies to VSX and various other hardware features. The __crc32_vpmsum routine for example will only be called if we detected that the processor is an IBM POWER8 or later, otherwise we will not use it.

We don't detect the e500 so we will assume we are running on a basic PPC chip that has no support for AltiVec, VSX, crypto instructions, transactional memory, etc. If those sorts of instructions get executed on your chip that's a bug in the JIT that can be fixed.

Does that mean OpenJ9 should be compiled for a very basic PPC ISA if no supported architecture is detected?

Why do I ask?
The guess-platform.sh script checks the system we are running on. On Linux it seems that
x86_64, ppc64, and ppc64le are supported.
Plain ppc (32-bit, as on the e500v2) is not supported out of the box.

The guess-platform.sh script checks the system on which we do run.

This script just attempts to guess the platform you're compiling OpenJ9 on. The compiler options (gcc or xlC) used when compiling OpenJ9 will target the minimum supported architecture level. I'm not sure what that is on Power, but presumably it is a very old processor.

What @ymanton is talking about is what happens at runtime. At runtime OpenJ9 will detect what processor you are running under and the JIT compiler will generate calls to __crc32_vpmsum for example if we detected you are running on IBM POWER8 or later.

As @fjeremic said, guess-platform.sh is checking at build-time, not run-time. Since we don't compile OpenJ9 in 32-bit environments there is currently no support for it in the code, but feel free to add it.

If you want to port OpenJ9 to the e500 then most of your work will be in making changes to the build system to work in a 32-bit ppc environment. Once you have a successful build you shouldn't have much trouble running OpenJ9 except for one issue related to using 64-bit instructions -- we assume that ldarx and stdcx are available, which is not true on 32-bit systems so that will need to be fixed.

If you have not already seen issue #2399 please take a look at it, it discusses problems that are very similar to yours.

I'm not sure what that is on Power, but presumably it is a very old processor.

No, it is not. This is quite a powerful embedded system: 2 cores, 1.5 GHz, 1 GiB RAM.
It just doesn't support AltiVec and has SPE instead.

I will look into the thread you pointed to. Thanks for the reply.

I've started the porting.

Why: OpenJ9 claims to be much faster than other JVMs.
Goal: To get OpenJ9 built on PPC (e500v2 core).

For the sake of simplicity I've decided to use the zero variant (to avoid AltiVec issues) and build it in a native environment.

I've followed: https://www.eclipse.org/openj9/oj9_build.html
Side question: why is gcc 4.8 used (recommended)? I'm using gcc 6.4.0.
After having the source code (and all prerequisites) the configure passes:
./configure --with-freemarker-jar=/lib/freemarker.jar --with-jobs=2 --with-debug-level=fastdebug --without-freetype --without-x --without-cups --without-alsa --disable-headful --with-jvm-variants=zero

====================================================
A new configuration has been successfully created in
/root/openj9-openjdk-jdk8/build/linux-ppc-normal-zero-fastdebug
using configure arguments '--with-freemarker-jar=/lib/freemarker.jar --with-jobs=2 --with-debug-level=fastdebug --without-freetype --without-x --without-cups --without-alsa --disable-headful --with-jvm-variants=zero'.

Configuration summary:

  • Debug level: fastdebug
  • JDK variant: normal
  • JVM variants: zero
  • OpenJDK target: OS: linux, CPU architecture: ppc, address length: 32

Tools summary:

  • Boot JDK: openjdk version "1.8.0_102-internal" OpenJDK Runtime Environment (build 1.8.0_102-internal-b14) OpenJDK Zero VM (build 25.102-b14, interpreted mode) (at /usr/lib/jvm/openjdk-8)
  • C Compiler: powerpc-poky-linux-gnuspe-gcc (GCC) version powerpc-poky-linux-gnuspe-gcc (GCC) 6.4.0 (at /usr/bin/powerpc-poky-linux-gnuspe-gcc)
  • C++ Compiler: powerpc-poky-linux-gnuspe-g++ (GCC) version powerpc-poky-linux-gnuspe-g++ (GCC) 6.4.0 (at /usr/bin/powerpc-poky-linux-gnuspe-g++)

Build performance summary:

  • Cores to use: 2
  • Memory limit: 1008 MB
  • ccache status: installed, but disabled (version older than 3.1.4)

Then I've decided to build it with:
make CONF=linux-ppc-normal-zero-fastdebug LOG=trace JOBS=2 images

The build errors popped up in:
javac: file not found: /root/openj9-openjdk-jdk8/jdk/src/solaris/classes/sun/awt/org/xml/generator/WrapperGenerator.java [1]

This file has been appended to the end of:
/root/openj9-openjdk-jdk8/build/linux-ppc-normal-zero-fastdebug/jdk/btclasses/_the.BUILD_TOOLS_batch
as part of BUILD_TOOLS generation:

SetupJavaCompilation(BUILD_TOOLS)
    SETUP := GENERATE_OLDBYTECODE
    SRC := /root/openj9-openjdk-jdk8/jdk/make/src/classes /root/openj9-openjdk-jdk8/jdk/src/solaris/classes/sun/awt/X11/generator
    BIN := /root/openj9-openjdk-jdk8/build/linux-ppc-normal-zero-fastdebug/jdk/btclasses
Tools.gmk:38: Running shell command
    /usr/bin/find /root/openj9-openjdk-jdk8/jdk/make/src/classes /root/openj9-openjdk-jdk8/jdk/src/solaris/classes/sun/awt/X11/generator -type f -o -type l
gensrc/GensrcProperties.gmk:

When I replace /org/xml -> /X11, the file (WrapperGenerator.java) is present.
Another strange thing - why is AWT built/needed at all? I've asked ./configure to build a headless VM without X.

Any idea why it is like that? Maybe some explanation, which could shed some light?

Regarding the debug infrastructure of OpenJ9 build:

  • makefile's LOG=trace and -d option

Are there any others available?

Side question: Why gcc 4.8 is used (recommended) ?

There was work needed to get higher versions working. The JIT specifically made use of a slightly modified CRTP which works on gcc 4.8 but not on 5+ due to spec conformance. We should be able to build with gcc 7.3 now though, and will be moving to that compiler level soon. See #1684.

For the sake of simplicity I've decided to use the zero variant (to avoid AltiVec issues) and build it in a native environment.

I don't know how the zero parts of OpenJDK are built for OpenJ9, but OpenJ9 itself doesn't have a "zero" VM so unfortunately it will be the same as building a non-zero VM and various assembly files and the JIT will have to be built.

When I replace /org/xml -> /X11, the file (WrapperGenerator.java) is present.
Another strange thing - why is AWT built/needed at all? I've asked ./configure to build a headless VM without X.

Any idea why it is like that? Maybe some explanation, which could shed some light?

I don't know if it is a bug in the OpenJDK build system or just the OpenJ9 parts, but the --without-x flag is not respected. I just install all the needed libs and headers and build with the default config. I don't even know why a Solaris Java class is being built on other platforms; this might be another bug in the build system.

--without-x flag is not respected

OK, so this is a dead option.

I just install all the needed libs and headers and build with the default config

I assume that you use PPC64? Have you ever tried to cross compile the OpenJ9?

Is there any way to improve the debug output? I'm having a hard time finding the places where files (like _the.BUILD_TOOLS_batch) are generated.

Also please correct me if I'm wrong, but it seems to me like the ./configure is already created in the repository (and downloaded). Maybe I do need to regenerate it?

I assume that you use PPC64? Have you ever tried to cross compile the OpenJ9?

No, OpenJ9 only builds on ppc64le, not ppc64 or ppc (the IBM JDK builds on ppc64 in both 32- and 64-bit modes). I have not tried to cross-compile OpenJ9 myself, but I think we may support that for ARM targets, but I'm not sure.

Is there any way to improve the debug output? I'm having a hard time finding the places where files (like _the.BUILD_TOOLS_batch) are generated.

Unfortunately not that I know of, OpenJ9 had to make changes to the OpenJDK build system in order to integrate, but some things are still less than perfect. The only thing I can suggest is that if you're building jdk8 that you set VERBOSE="" in your env for make, which should echo commands so you can better see what's being invoked.

Also please correct me if I'm wrong, but it seems to me like the ./configure is already created in the repository (and downloaded). Maybe I do need to regenerate it?

The version that's checked in should be in sync with configure.ac, but it doesn't hurt to regenerate it. The file you care about is actually common/autoconf/configure, the top-level just calls this one.

I have not tried to cross-compile OpenJ9 myself, but I think we may support that for ARM targets, but I'm not sure.

Do you maybe have the build system adjustments to cross-compile OpenJ9 on ARM? I mean, ARM is also not supported (at all), so I could reuse some of its code for the ppc port.

Unfortunately I don't, I haven't spent any time on ARM. @JamesKingdon might have some info on how to get OpenJ9 to cross compile and/or some patches for that on ARM.

If I may ask about OMR's tools - namely tracemerge, hookgen, etc.

What is their purpose? In my native build, for example, tracemerge is used during the build:
./tracemerge -majorversion 5 -minorversion 1 -root .

Why do we need to merge trace information during the build?
Moreover, this means that it has to be cross-compiled on the HOST (x86_64 | PPC64).
Why does OpenJ9 need it?

I've also noticed OMR_CROSS_CONFIG="yes", which allows the tools to be cross-compiled.
This might be quite useful, as omr/tools/tracegen/makefile calls:
include $(top_srcdir)/tools/toolconfigure.mk

However, it seems to be tuned for PPC64 (-m64).

OMR and OpenJ9 use a trace engine to record diagnostic info on how the code is executing into a circular buffer on the thread. The descriptions of these trace points need to be converted into binary forms and then merged into a single data file that can be used by the runtime. That's roughly tracemerge.

hookgen is used to generate the appropriate macros for the low overhead pub/sub system used in OMR / OpenJ9 to communicate events across the system.

Ok, so those are components which will be used by the running JVM instance and hence must be either cross-compiled or built natively.

They're only needed as part of the build and not at runtime.

I think that I've misunderstood you in some way.

Are they only used when OpenJ9 is compiled (so they could be compiled for x86_64)?
Or do they need to be available on the target (and cross-compiled for PPC)?

Sorry I wasn't clear. Most of the tools - like hookgen & tracemerge - are only used when OpenJ9 is compiled and can be compiled as x86_64.

There is one that depends on the right architecture: constgen

If you support DDR (used for debugging jvm crashes), it will also need to run on the right architecture.

With the current version of the OpenJ9 build system (scripts), a successful configure gives the following output:

  • ccache status: installed, but disabled (version older than 3.1.4)

Build performance tip: ccache gives a tremendous speedup for C++ recompilations.
You have ccache installed, but it is a version prior to 3.1.4. Try upgrading.

The problem is that on my system:
/openj9-openjdk-jdk8# ccache -V
ccache version 3.2.5+dirty

Is there any workaround to fix this? Or is the ./configure script logic just wrong, determining the version incorrectly?

@dnakamura Any thoughts on the ccache question?

I believe the OpenJDK code assumes that the version is < 3.1.4 if it fails to parse the version. It's been a while since I looked at the relevant code, but I think they fail to parse when they see anything other than digits or a decimal point. Will look into it.

OK, no, my bad. It will handle alphabetic characters in the version string. However, to check the version number they just match against the regex 3.1.[456789], which means anything > 3.1.9 will fail.

If I may ask again about gcc 4.8 (which is recommended for a native build of this VM):

I've backported the gcc 4.8.2 to my setup. Unfortunately during the ./configure execution, it wants to check if gcc is working:

configure:22215: /usr/bin/powerpc-poky-linux-gnuspe-gcc -O2 -pipe -g -feliminate-unused-debug-types -Wno-error=deprecated-declarations -fno-lifetime-dse -fno-delete-null-pointer-checks -m32 -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=double --sysroot=/ -Wl,-O1 -Wl,--hash-style=gnu -Wl,--as-needed -fPIC conftest.c >&5
powerpc-poky-linux-gnuspe-gcc: error: unrecognized command line option '-fno-lifetime-dse'

The problem is that this particular optimization option is NOT supported in 4.8.[12345].
It first shows up in 4.9 - see e.g.
https://gcc.gnu.org/onlinedocs/gcc-4.9.3/gcc/Optimize-Options.html

Why is it like that? Is '-fno-lifetime-dse' only needed on PPC (since it is possible to compile J9 on x86_64)?

From the other reply, the problem with compiling proper code only shows up on gcc 5+, so I guess that 4.9.x can be used?

Looks like that issue comes from OpenJDK code, not OpenJ9. If you look here

https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/2b004fdb6829f287eaa464a57a8680377886ca75/common/autoconf/toolchain.m4#L1425-L1440

you'll see that they're trying to disable that opt under GCC 6, so it should not be used when you build using GCC 4.8. Is your default host compiler GCC 6 or later? Perhaps the configure scripts are invoking that in some places instead of your powerpc-poky-linux-gnuspe-gcc cross compiler and getting confused. You can look in the various config.log files that are generated to see what's going on.

You should also note there is a runtime check you need to disable to work on 32-bit (see #2399).
Note: in that issue they also discuss problems with 32-bit Power missing certain instructions; however, I don't think that's an issue for the e500 cores. You may still run into other issues where bits of our code assume we are running on a 64-bit chip.

Do you have maybe the build system adjustments to cross-compile the OpenJ9 on ARM? I mean the arm is also not supported (at all), so I could reuse some of its code on ppc port.

I recently followed James' instructions and successfully cross-compiled from Ubuntu/AMD64 to the RPi and the resulting VM works fine. Caveat: you may want to read the recent conversation on Slack about back-contributing directly to the master repo, not via James' fork.

I am also actively trying to cross-compile to the e500. I am approaching it differently though, I am trying to start from (pieces of) the OMR testcompiler which kind of looks more within reach. What I understood however is that its build system is quite disconnected from the other two i.e. from both TR's and J9's. And I have a feeling that it's less actively being looked at, as while the other parts cross-compile just fine, I had to dance around things to get the tc/tril/etc to cross-compile to ARM. I'll keep you posted on the progress with tc/tril on e500.

Thanks Boris for your input.

I recently followed James' instructions and successfully cross-compiled from Ubuntu/AMD64 to the RPi and the resulting VM works fine.

I've looked at your GitHub repositories and couldn't find the ARM port for J9. Would it be possible to upload it somewhere?

Slack about back-contributing directly to the master repo, not via James' fork.

Do you have any reference/logs to those conversations?

I had to dance around things to get the tc/tril/etc to cross-compile to ARM.

Could you share the steps (or repository), which were needed on ARM to get it working?

I'll keep you posted on the progress with tc/tril on e500.

Thanks.

I've moved a bit further with native compilation. The gcc 4.8.2 compiles the 'images/j2re-image/bin/java'
binary.
However, I'm experiencing "Illegal instruction" aborts. One was caused by 'lwsync' not being available in the e500(v2) ISA (https://www.nxp.com/docs/en/reference-manual/E500CORERM.pdf).

This issue has been fixed by replacing 'lwsync' calls with 'sync' - mostly in the OMR code generator (e500 supports msync, which should probably be used; this will be fixed once it all works).

Now I have a problem with 'cmpl' being an "Illegal instruction":
0x0fb9808c in loop () from /mnt/openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/images/j2re-image/lib/ppc/default/libj9vm29.so
(gdb) x/20i $pc-32
0xfb9806c : ori r8,r4,0
0xfb98070 : lwarx r9,0,r12
0xfb98074 : rotlwi r3,r9,0
0xfb98078 : ori r4,r9,0
0xfb9807c : ori r10,r8,0
0xfb98080 : ori r11,r6,0
0xfb98084 : rlwimi r10,r5,0,0,0
0xfb98088 : rlwimi r11,r7,0,0,0
=> 0xfb9808c : cmpl cr0,1,r9,r10
0xfb98090 : bne- 0xfb980a0
0xfb98094 : stwcx. r11,0,r12
0xfb98098 : bne+ 0xfb98070
0xfb9809c : blr
0xfb980a0 : stwcx. r9,0,r12
0xfb980a4 : bne+ 0xfb98070

This is strange, as 'cmpl' _is_ supported in the e500 ISA. Some other threads point to checking the cache line size (32 B on e500) - though that would be strange, as OpenJDK8 works on this machine with the same setup (and its 'zero' variant is used for the J9 compilation).

That's the 64-bit version of cmpl you're crashing on. You can change it to cmpl cr0,0,r9,r10 for the 32-bit instruction to go along with your changes for ldarx and stdcx., however it's incorrect and will probably give you bad results and even more mysterious crashes. We really need to exchange a full 64-bit value atomically here.

https://github.com/eclipse/omr/pull/2930 and https://github.com/eclipse/openj9/pull/2764 are patches I started putting together last week to get ppc32 working, you can give them a try instead. You can export VMDEBUG="-DOMR_NO_64BIT_LCE" in your build env to build the VM with these changes.

@ymanton This is great; thanks for sharing it. By the way, what's the copyright check problem with #2764? (I'm not used to Jenkins.) Is it missing a copyright header in one of the files?

I'll try it on my PowerBook sometime later today, time allowing (assuming the G4's ISA supports everything that your system does).

I've looked at your GitHub repositories and couldn't find the ARM port for J9.

That's the "arm" branch in JamesKingdon's repo, and he also has nice instructions here

Do you have any reference/logs to those conversations?

It's in the #general channel on Aug 9th.

Could you share the steps (or repository), which were needed on ARM to get it working?

The JVM works as-is out of James' branch. What doesn't work are the simple tests in testcompiler and tril. I care about those because the goal of these exercises for me is a riscv port, and the JVM is definitely the wrong level of complexity for approaching that. So I got them to work and am trying to shape the change into a nice branch that can be pulled into master, but after a day of cursing I am starting to think that maybe doing it in one go is too ambitious and maybe it warrants a little conversation in today's community call.

@wyatt8740 there are only patches to get rid of the 64-bit CAS in that tree unfortunately. You still need to make changes to the build files in OMR, OpenJ9, and OpenJDK to get a VM built, but since @lmajewski and/or @shingarov are making progress on that I decided to tackle other things and built/ran the IBM JDK to test them. I'll see if I can find the lwsyncs @lmajewski mentioned above as well when time permits.

@ymanton After applying your patches I do see following error:

unix/linux/ppc/32/cas8help.s: Assembler messages:
unix/linux/ppc/32/cas8help.s:74: Error: unrecognized opcode: `rldimi'
unix/linux/ppc/32/cas8help.s:75: Error: unrecognized opcode: `rldimi'
unix/linux/ppc/32/cas8help.s:77: Error: unrecognized opcode: `ldarx'
unix/linux/ppc/32/cas8help.s:78: Error: unrecognized opcode: `cmpld'
unix/linux/ppc/32/cas8help.s:80: Error: unrecognized opcode: `stdcx.'
unix/linux/ppc/32/cas8help.s:84: Error: unrecognized opcode: `srdi'

For example, 'rldimi' is marked in Table 3-44 of https://www.nxp.com/docs/en/reference-manual/E500CORERM.pdf
as PowerPC AIM-specific, not available on e500.

I've uploaded my branches for PPC32 e500 to github:
https://github.com/lmajewski/ppc32_j9_omr
https://github.com/lmajewski/ppc32_j9_openj9
https://github.com/lmajewski/ppc32_j9_openj9-openjdk-jdk8

There are lukma_* files to configure and run it - those are the same as for the zero variant of OpenJDK8 (which seems to work on the platform).

For those who want to build it natively - there is a qemu-system-ppc port. One can use -M ppce500 or -M mpc8544.

However, only up to 2 cores are supported and a maximum of 512 MiB of memory (with more RAM and cores some strange errors emerge).

qemu-system-ppc -M ppce500 -m 512M -nographic -d guest_errors \
./arch/powerpc/boot/uImage \
-drive file=core-image-qoriq-qoriq-20180626070914.rootfs.ext2,if=virtio \
-append "root=/dev/vda rw rootwait rootfs=ext2"

It works with a Linux 4.18 kernel - but IS extremely slow for compiling.

Thanks for testing. It makes sense that your assembler doesn't want to deal with unsupported instructions. I was on ppc64 so I didn't see these errors. I'll fix it shortly.

@ymanton As a side question - is there any way to test only OMR (or another separate J9 component)?

I mean, it is very time-consuming to build it all. The J9 makefile has make targets, but the smallest one is "jvm".

any way to test only OMR
That's what the discussion on yesterday's call was about. Basically, the set of native makefiles is screwed up: they are wrong, duplicated in several places, and some of them have been neglected for a while. On the call Mark made the point that they aren't worth fixing because they will be deprecated in favour of CMake, which isn't quite there yet on any platform. So in the meantime I propose that we simply push temporary branches to exchange kludges to keep going -- I'll prepare and push mine when I come back from ESUG.

@shingarov Yes, I also think that the _goal_ is to make the OMR (and the whole J9) working correctly first and only then cleanup things.

@ymanton As a side question - is there any way to test only OMR (or another separate J9 component)?
I mean, it is very time-consuming to build it all. The J9 makefile has make targets, but the smallest one is "jvm".

If building from the top is too painful during development you can try this shortcut for rebuilding just the VM binaries:

VERSION_MAJOR=8 \
OPENJDK_VERSION_NUMBER_FOUR_POSITIONS=8.0.0.0 \
make -C build/<your-configured-build>/vm/

(You may need to specify some additional vars in your env, I haven't used this shortcut in a while and can't check ppc at the moment.)

It will only rebuild the VM components in that directory, but it will not compose the image. Once you're done making changes you can build one of the top level targets to get an image you can run. The makefile in the vm directory also has specific targets that you can build if you want more granularity, e.g. omr_ddrmacros omrsig j9omrport but there may be ordering dependencies with the other targets in that makefile so I've always just let it build all.

@ymanton I've poked a bit into e500 Reference manual: https://www.nxp.com/docs/en/reference-manual/E500CORERM.pdf

In section "A.1.1.6 Compare and Swap" it is stated that on this core lwarx and stwcx. only work on word-sized data (32 bits).
It seems to me that a "simple" replacement of instructions will not provide a proper atomic
"compare and swap of a 64-bit value on a 32-bit system" for the J9CAS8Helper function:

uint64_t J9CAS8Helper(volatile uint64_t *addr, uint32_t compareLo, uint32_t compareHi, uint32_t swapLo, uint32_t swapHi);

IMHO we would need to use lwarx/stwcx. functions to read and operate separately on compare{HI|Lo} and swap{Hi|Lo}. After the successful CAS operation we would need to repeat it and compare the results (a bit different problem described in [1]).

Have I overlooked something? Or is there any other/better solution for this? (I'm wondering how IBM's original J9 implementation handled this for 32bit PPC :-) )

[1] - https://stackoverflow.com/questions/45054323/powerpc-e500-p1020-read-64bit-2x32bit-registers-in-atomic-way
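For what it's worth, the lock-word idea later referenced in this thread can be sketched roughly as below. This is a hypothetical, simplified illustration and not the actual OMR code; it assumes every reader and writer of the guarded 64-bit location goes through the same lock, and it uses GCC __sync builtins where the real thing would use lwarx/stwcx. on a word:

```c
#include <stdint.h>

/* Hypothetical sketch: emulate a 64-bit CAS on a 32-bit CPU by
   serializing through a 32-bit spinlock word. Word-sized CAS is
   something lwarx/stwcx. can implement on e500. */
static volatile uint32_t cas8Lock = 0;

static uint64_t
emulatedCAS8(volatile uint64_t *addr, uint64_t expected, uint64_t swap)
{
    /* Acquire the spinlock with a word-sized CAS. */
    while (!__sync_bool_compare_and_swap(&cas8Lock, 0, 1)) {
        /* spin */
    }
    uint64_t old = *addr;            /* both halves read under the lock */
    if (old == expected) {
        *addr = swap;
    }
    __sync_lock_release(&cas8Lock);  /* store 0 with release semantics */
    return old;
}
```

The price is that plain 64-bit loads and stores of that location must also take the lock, otherwise they can observe a torn value.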

Yes you are correct, simply replacing ldarx/stdcx with lwarx/stwcx will not work. IBM's original implementation is the one you see currently, we really use J9CAS8Helper and ldarx/stdcx. :grin: The IBM JDK only supports 64-bit CPUs, even in 32-bit mode, since 64-bit instructions and registers can still be used in 32-bit mode.* The IBM JDK is mostly used on IBM POWER systems (we support POWER4 and later), but we've also used and tested it on chips like the PPC970 and the e5500, all of which are 64-bit capable. I don't know if we supported true 32-bit chips in the distant past, but you and @wyatt8740 are the first I've seen ask about it.

Anyway, the patches I pointed you to should solve the problem, they use lwarx/stwcx and a dedicated 32-bit lock word to synchronize on and will not call J9CAS8Helper. See here:

https://github.com/eclipse/omr/blob/e3972b55a2235a3e04b90083523e2242ff38e4aa/include_core/AtomicSupport.hpp#L414-L430

However you cannot even build OpenJ9 because your assembler will not tolerate unsupported instructions. Can you try building with -mppc64 here:

https://github.com/lmajewski/ppc32_j9_omr/blob/58a9411ebae7980c9d2cd4dbf2dcbd5e3707bda9/omrmakefiles/rules.linux.mk#L113

We will not execute 64-bit assembly routines in 32-bit mode except J9CAS8Helper (and with my patches even it will not be executed) so it should be OK to have them in your VM, we can work on excluding them from the build later. I started a build in QEMU using Debian 8 and it has finished building the VM parts so it seems to be accepted by my assembler, but like you said the parts of the build that invoke Java are incredibly slow so it has still been building for the last 12 hours. Unfortunately the #ifdefs in my original patch are broken and export VMDEBUG="-DOMR_NO_64BIT_LCE" will not work as expected in OMR. I'll fix that shortly, but can you simply replace OMR_NO_64BIT_LCE with 1 for now? You can do something like sed -i s/OMR_NO_64BIT_LCE/1/g ....

IMHO we would need to use lwarx/stwcx. functions to read and operate separately on compare{HI|Lo} and swap{Hi|Lo}. After the successful CAS operation we would need to repeat it and compare the results (a bit different problem described in [1]).

[1] - https://stackoverflow.com/questions/45054323/powerpc-e500-p1020-read-64bit-2x32bit-registers-in-atomic-way

Unfortunately I don't think a trick like the one for reading the time base will work. Reading the time base does not have to be atomic, and no software threads will ever race to write to it, so it is an easier problem to solve. Since the time base is monotonically increasing and the high 32 bits will not change within the span of a few instructions, you can always detect when you have mismatched hi/lo parts. With arbitrary 64-bit values, however, you cannot swap hi/lo and atomically detect whether your hi/lo got mixed up with the lo/hi of another thread; there will be a quantum of time where the hi/lo of two threads can both be in memory, and it will probably lead to rare but painful bugs. Perhaps it could be done with nested lwarx/stwcx, but the architecture does not allow that. You can try to come up with an algorithm if you wish, and I would be happy to look at it because it might be more convenient than my solution, but I'm not hopeful it can be done.

* In practice ldarx/stdcx are the only "safe" 64-bit instructions that can be used in 32-bit mode because kernels may not preserve the upper 32 bits of registers in 32-bit mode, but proper kernels will also force l*arx/st*cx to fail on any interrupt, so you can use them and also place other 64-bit instructions between them and be protected from context switches.
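For reference, the time-base trick being contrasted here can be sketched as follows (the volatile variables are plain-C stand-ins for the mftbu/mftb special-purpose register reads, and the names are mine). As argued above, it only works because the value is monotonically increasing, so it does not generalize to arbitrary 64-bit values:

```c
#include <stdint.h>

/* Stand-ins for the PowerPC time-base registers (TBU/TBL). */
static volatile uint32_t tb_hi, tb_lo;

/* Read a 64-bit monotonically increasing value as two 32-bit halves:
   read hi, then lo, then hi again, and retry if hi changed (a carry
   from lo into hi happened in between). */
static uint64_t
read_timebase64(void)
{
    uint32_t hi, lo, hi2;
    do {
        hi  = tb_hi;
        lo  = tb_lo;
        hi2 = tb_hi;   /* detect a carry into the high word */
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}
```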

Anyway, the patches I pointed you to should solve the problem, they use lwarx/stwcx and a dedicated 32-bit lock word to synchronize on and will not call J9CAS8Helper. See here:

It seems I wrongly defined OMR_ARCH_POWER in my build, so OMRCAS8Helper() (or J9CAS8Helper) is called.

From the code: https://github.com/eclipse/omr/blob/e3972b55a2235a3e04b90083523e2242ff38e4aa/include_core/AtomicSupport.hpp#L414-L430

This solution seems to be reusing the already available 32 bit functions. I will give it a try.

However, on line 436:
return __sync_val_compare_and_swap(address, oldValue, newValue);
This is a gcc 4.2 built-in function. Unfortunately, it has been superseded in 4.8.2 by the __atomic_* versions.
The code to implement J9CAS8Helper with built-ins:

static inline uint64_t
J9CAS8Helper(volatile uint64_t *addr, uint32_t compareLo, uint32_t compareHi, uint32_t swapLo, uint32_t swapHi)
{
    uint64_t exp = (((uint64_t)compareHi) << 32) | compareLo;
    uint64_t des = (((uint64_t)swapHi) << 32) | swapLo;
    uint64_t val;
    bool ret;
    do {
        __atomic_load(addr, &val, __ATOMIC_SEQ_CST);
        ret = __atomic_compare_exchange(addr, &exp, &des, false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    } while (!ret);
    return val;
}

The problem above is that __atomic_compare_exchange does not "return" the old *addr value (it does so only on failure - this is the difference from __sync_val_compare_and_swap available in gcc 4.2). There may be a race between __atomic_load() and __atomic_compare_exchange().
Moreover, one needs to add the -latomic switch to gcc.
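For what it's worth, the retry loop may not be needed at all: on failure __atomic_compare_exchange writes the current *addr value back into the expected operand, and on success the expected operand already held the old value, so returning it gives __sync_val_compare_and_swap-style semantics without the racy separate load. A hedged sketch, reusing the same signature as above:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch: __sync_val_compare_and_swap semantics on top of the newer
   __atomic builtins. On failure the builtin copies the current *addr
   into exp; on success exp already equals the old value. Either way
   exp is the old value, so no load/retry loop is required. */
static inline uint64_t
J9CAS8Helper(volatile uint64_t *addr, uint32_t compareLo, uint32_t compareHi,
             uint32_t swapLo, uint32_t swapHi)
{
    uint64_t exp = (((uint64_t)compareHi) << 32) | compareLo;
    uint64_t des = (((uint64_t)swapHi) << 32) | swapLo;
    __atomic_compare_exchange(addr, &exp, &des, false,
                              __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return exp;
}
```

On 32-bit targets this still needs -latomic (or a kernel/compiler-provided helper) for the 8-byte exchange itself, so it sidesteps the race but not the underlying hardware question.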

However, I will try your patches again - so I could avoid adding -mppc64 where possible.

It took me some time, but I've managed to natively compile J9 for the e500v2:

root@qoriq:/mnt/openj9-openjdk-jdk8# ./build/linux-ppc-normal-zero-release/images/j2re-image/bin/java -version
openjdk version "1.8.0_181-internal"
OpenJDK Runtime Environment (build 1.8.0_181-internal-b14)
Eclipse OpenJ9 VM (build openj9-ppc32-fixes-c5a6251, JRE 1.8.0 Linux ppc-32-Bit 20180912_000000 (JIT enabled, AOT enabled)
OpenJ9 - c5a6251
OMR - 59e927e
JCL - c186542

However, I need to run some validation tests on it. Any recommendations (besides compiling some Java code and checking that it doesn't crash)?

Please find updated repositories:
https://github.com/lmajewski/ppc32_j9_omr
https://github.com/lmajewski/ppc32_j9_openj9
https://github.com/lmajewski/ppc32_j9_openj9-openjdk-jdk8

The trick was to properly use patches from @ymanton :-)

As mentioned above - I had to export VMDEBUG="-DOMR_NO_64BIT_LCE" and also rebuild some files manually with -mppc64 (e.g. CAS8 helper). More info is in the top directory lukma_* files (as I've been using OpenJDK8 zero for compilation).

There is, however, room for improvement - I've blindly replaced lwsync with sync, which is painfully slow. The e500 does support 'msync', which should probably be used instead.

Moreover, during compilation I saw some warnings about 32-bit shifts, which don't look good on a 32-bit machine.

There are a couple of 32-bit shift-out-of-range warnings that happen during 32-bit builds, which I'm assured are on a code path that doesn't execute on 32-bit platforms. I'd be happier if we cleaned those up :)

Good to hear that you got it built. If you find that OMRCAS8Helper is still being called, it's likely because some files from OMR are built without VMDEBUG included in the command line, so the #ifdef-guarded code is removed, but you can hack around that.

@ymanton Any hint on built J9 validation process?

I guess that running https://www.spec.org/jvm2008/ would be a really good test sample.
It has lots of tests that can be performed.

@lmajewski If you want to run the OpenJ9 regression tests you can try these instructions:

export JAVA_BIN=/path/to/build/images/j2sdk-image/jre/bin
export SPEC=linux_ppc
export JAVA_VERSION=SE80

cd openj9/test/TestConfig
make -f run_configure.mk
make test

Building DDR_Test failed for me so I just disabled it via:

--- a/test/functional/build.xml
+++ b/test/functional/build.xml
@@ -63,6 +63,7 @@
                                        <fileset dir="." includes="*/build.xml" >
                                                <exclude name="Panama/build.xml" />
                                                <exclude name="Valhalla/build.xml" />
+                                               <exclude name="DDR_Test/build.xml" />
                                        </fileset>
                                </subant>
                        </else>

Documentation on testing is here if you want to know more: https://github.com/eclipse/openj9/blob/master/test/docs/OpenJ9TestUserGuide.md

I've managed to get a ppc32 JVM built with -m32 -mcpu=G4 on a ppc64 server and tested it in QEMU with a Debian 8 + G4 combination. I ran a simple program with the JIT disabled and it worked. With the JIT enabled it crashed while the JIT was compiling a method. That's probably a bug in the JIT, so I'll look at that shortly. The regression tests are currently running on ppc64, so they won't catch any illegal instructions on real 32-bit chips, but they should test functionality.

I've updated https://github.com/eclipse/omr/pull/2930 and https://github.com/eclipse/openj9/pull/2764 and the patches I used for openj9-openjdk-jdk8 are here: https://github.com/ibmruntimes/openj9-openjdk-jdk8/pull/113. Still has lots of rough edges but I'll do a bit more later, feel free to use what's there in your efforts.

I also compiled openj9 natively on an e500v2 core with the instructions given. The compile finished OK.
But when I try to run java -version I get the following crash in the JIT library.

```
(gdb) bt
#0  0x0f1a0d60 in OMR::CodeGenerator::addAllocatedRegisterPair(TR::RegisterPair*) ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#1  0x0f1a1284 in OMR::CodeGenerator::allocateRegisterPair(TR::Register*, TR::Register*) ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#2  0x0f0505dc in TR::PPCPrivateLinkage::buildDirectDispatch(TR::Node*) ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#3  0x0f034f10 in J9::Power::TreeEvaluator::directCallEvaluator(TR::Node*, TR::CodeGenerator*) ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#4  0x0f198ca8 in OMR::CodeGenerator::evaluate(TR::Node*) ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#5  0x0f57cbe8 in OMR::Power::TreeEvaluator::treetopEvaluator(TR::Node*, TR::CodeGenerator*) ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#6  0x0f198ca8 in OMR::CodeGenerator::evaluate(TR::Node*) ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#7  0x0eea5570 in J9::CodeGenerator::doInstructionSelection() ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#8  0x0f1a9a68 in OMR::CodeGenPhase::performInstructionSelectionPhase(TR::CodeGenerator*, TR::CodeGenPhase*) ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#9  0x0f1a570c in OMR::CodeGenPhase::performAll() ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#10 0x0f1a33d4 in OMR::CodeGenerator::generateCode() ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#11 0x0f1c2498 in OMR::Compilation::compile() ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#12 0x0eed8fc0 in TR::CompilationInfoPerThreadBase::compile(J9VMThread*, TR::Compilation*, TR_ResolvedMethod*, TR_J9VMBase&, TR_OptimizationPlan*, TR::SegmentAllocator const&) ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#13 0x0eed9e00 in TR::CompilationInfoPerThreadBase::wrappedCompile(J9PortLibrary*, void*) ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
#14 0x0fa0a4c8 in omrsig_protect ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9prt29.so
#15 0x0eedb440 in TR::CompilationInfoPerThreadBase::compile(J9VMThread*, TR_MethodToBeCompiled*, J9::J9SegmentProvider&) ()
   from /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/build/linux-ppc-normal-zero-fastdebug/images/j2re-image/lib/ppc/default/libj9jit29.so
```

@ymanton is the stack the same as yours? Also, if I disable the JIT with -Xnojit, ./java -version does not return :(

Yes that looks like the same crash. Try -Xint, it behaves differently than -Xnojit and worked for me.

I'm now wondering why I don't experience such an error...?

@ymanton Which CONF do you use for your build?
I do use the "release" configuration -> CONF=linux-ppc-normal-zero-release

@lmajewski I use the release conf as well. JIT compiler problems are sometimes intermittent or hard to recreate in different environments so it would not be surprising that you have not experienced it. I'll have some time later in the week to look into it.

@PTamis I've pushed a fix for that JIT crash to https://github.com/eclipse/openj9/pull/2764. The bug would only be seen on true 32-bit systems, not 64-bit capable machines, so that explains why I only saw it on QEMU and not on a ppc64 server. With that fix I can run some small programs on a G4 QEMU setup without any issue.

Hello @ymanton and thanks a lot for your patch. Your patch worked, but java crashed a bit later with illegal instructions again:
0x95fbd554: li r3,8
0x95fbd558: msync
=> 0x95fbd55c: .long 0x7c0418a8
0x95fbd560: .long 0x78e8000e
0x95fbd564: .long 0x78a6000e
0x95fbd568: cmp cr0,1,r0,r8

I am using the same HW (e500v2) as @lmajewski, so I was wondering why it did not work.
After a lot of tries I made it work on my system as well.

The problem was the kernel. I am using 4.9.35-rt kernel while @lmajewski was using 3.12.19.
I used his kernel version and defconfig and it worked also for me.

So now I am trying to find the problem. I tend to believe that the problem might not be the kernel itself, but rather a configuration option - or, at worst, a mix of the two.

The crash involved illegal instructions - .long entries with strange opcodes.
For example, I grepped for the opcode in the kernel sources and I couldn't find what instruction 0x7c0418a8 is.
https://elixir.bootlin.com/linux/v4.9.35/source/arch/powerpc/include/asm/ppc-opcode.h does not help a lot.

What kind of opcode is that, and why would the compilation produce those?

I'm not familiar with e500 so this may be irrelevant, but on ARM some older instructions are emulated by the kernel when running on processors (or in modes) where those instructions are not available or work differently. The emulation is configurable in the kernel so it is possible to go from one machine to another and suddenly run into unexpected crashes. Perhaps something similar is done on e500?

The first one is ldarx, which is the same instruction we fixed in J9CAS8Helper. The JIT compiler is generating a CAS8 sequence without checking if the CPU is 64-bit capable -- basically the same bug as before. I'll try to fix it shortly.
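On a core with no 64-bit load-reserve/store-conditional, one possible fallback (besides a proper lwarx/stwcx.-based emulation) is a lock-guarded compare-and-swap. The sketch below only illustrates that idea - it is not what the JIT or OMR actually emit - and it is only sound if every 64-bit atomic access to the location goes through the same lock:

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical lock-guarded 64-bit CAS for cores without ldarx/stdcx.
 * Correct but slow: every CAS8 in the process serializes on one mutex. */
static pthread_mutex_t cas8_mutex = PTHREAD_MUTEX_INITIALIZER;

static uint64_t
cas8_locked(volatile uint64_t *addr, uint64_t expected, uint64_t swap)
{
    pthread_mutex_lock(&cas8_mutex);
    uint64_t old = *addr;   /* read and (maybe) write under the lock */
    if (old == expected)
        *addr = swap;
    pthread_mutex_unlock(&cas8_mutex);
    return old;             /* CAS convention: return the old value */
}
```

A 32-bit lwarx/stwcx. emulation (or avoiding 64-bit atomics altogether on such targets) would perform far better, which is presumably why the JIT needs a real fix rather than a fallback like this.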

@JamesKingdon is right about kernel emulation, it's done on PPC as well, but not for ldarx. You are probably seeing this bug due to chance, but both I and Lukas will likely see it as well if we run some different applications.

@PTamis I was able to recreate your latest crash. I've pushed a commit to https://github.com/eclipse/openj9/pull/2764 to fix it. We can do better than this fix performance-wise, but it's OK for the moment.

@ymanton your patch worked again :) . On kernel v4, ./java -version no longer crashes.
Though, as I wrote in some comments above, the call does not return at all.

I will try to understand why this is happening. I will recompile again from scratch, because I just deleted the vm and images folders.

From gdb the ./java -version seems to stop at:
bt
#0 0x0fd98248 in pthread_join () from /lib/libpthread.so.0
#1 0x0ff853b0 in ContinueInNewThread0 (continuation=continuation@entry=0xff7fc08 <JavaMain>, stack_size=<optimized out>, args=args@entry=0xbfffb1c8) at /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/jdk/src/solaris/bin/java_md_solinux.c:1044
#2 0x0ff81a24 in ContinueInNewThread (ifn=0xbffff278, threadStackSize=<optimized out>, argc=1, argv=<optimized out>, mode=<optimized out>, what=<optimized out>, ret=0) at /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/jdk/src/share/bin/java.c:2033
#3 0x0ff854d8 in JVMInit (ifn=ifn@entry=0xbffff278, threadStackSize=<optimized out>, argc=<optimized out>, argv=<optimized out>, mode=mode@entry=0, what=what@entry=0x0, ret=ret@entry=0) at /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/jdk/src/solaris/bin/java_md_solinux.c:1091
#4 0x0ff82628 in JLI_Launch (argc=1, argv=0x100110a4, jargc=jargc@entry=1, jargv=jargv@entry=0x0, appclassc=appclassc@entry=1, appclassv=appclassv@entry=0x0, fullversion=fullversion@entry=0x100007bc "1.8.0_181-internal-b14", dotversion=dotversion@entry=0x100007d4 "1.8", pname=pname@entry=0x100007d8 "java", lname=lname@entry=0x100007e0 "openjdk", javaargs=javaargs@entry=0 '\000', cpwildcard=cpwildcard@entry=1 '\001', javaw=javaw@entry=0 '\000', ergo=ergo@entry=0) at /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/jdk/src/share/bin/java.c:304
#5 0x1000046c in main (argc=<optimized out>, argv=<optimized out>) at /mnt/persistent/tamis-openj9/openj9-openjdk-jdk8-lucasz/jdk/src/share/bin/main.c:125

And it never comes back. The CPU load is 100%, so something is running - but who knows what. An infinite loop?

Now, on kernel v3, where ./java -version works, I tried to execute some more complex programs,
like https://www.spec.org/jvm2008/docs/UserGuide.html#TrialRun

Again I faced some more illegal instructions there.
Here is the output:

Thread 16 "GC Slave" received signal SIGILL, Illegal instruction.
[Switching to LWP 469]
0x0ebaffec in OMRCAS8Helper () from /mnt/openj9-openjdk-jdk8-jit-fix/build/linux-ppc-normal-zero-release/images/j2re-image/lib/ppc/default/libj9gc29.so
(gdb) bt
#0 0x0ebaffec in OMRCAS8Helper () from /mnt/openj9-openjdk-jdk8-jit-fix/build/linux-ppc-normal-zero-release/images/j2re-image/lib/ppc/default/libj9gc29.so
#1 0x0eb74474 in MM_Scavenger::scavengeRememberedSetList(MM_EnvironmentStandard*) () from /mnt/openj9-openjdk-jdk8-jit-fix/build/linux-ppc-normal-zero-release/images/j2re-image/lib/ppc/default/libj9gc29.so
#2 0x0eb800b8 in MM_Scavenger::workThreadGarbageCollect(MM_EnvironmentStandard*) () from /mnt/openj9-openjdk-jdk8-jit-fix/build/linux-ppc-normal-zero-release/images/j2re-image/lib/ppc/default/libj9gc29.so
#3 0x0eb34588 in MM_ParallelDispatcher::slaveEntryPoint(MM_EnvironmentBase*) () from /mnt/openj9-openjdk-jdk8-jit-fix/build/linux-ppc-normal-zero-release/images/j2re-image/lib/ppc/default/libj9gc29.so
#4 0x0eb341e0 in dispatcher_thread_proc2(OMRPortLibrary*, void*) () from /mnt/openj9-openjdk-jdk8-jit-fix/build/linux-ppc-normal-zero-release/images/j2re-image/lib/ppc/default/libj9gc29.so
#5 0x0fa0b4c8 in omrsig_protect () from /mnt/openj9-openjdk-jdk8-jit-fix/build/linux-ppc-normal-zero-release/images/j2re-image/lib/ppc/default/libj9prt29.so
#6 0x0eb33ff8 in dispatcher_thread_proc () from /mnt/openj9-openjdk-jdk8-jit-fix/build/linux-ppc-normal-zero-release/images/j2re-image/lib/ppc/default/libj9gc29.so
#7 0x0fa947f4 in thread_wrapper (arg=0x943b6128) at ../omr/thread/common/omrthread.c:1596
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) disassemble
Dump of assembler code for function OMRCAS8Helper:
=> 0x0ebaffec <+0>: .long 0x78a4000e
0x0ebafff0 <+4>: .long 0x78e6000e
0x0ebafff4 <+8>: .long 0x7d0018a8
0x0ebafff8 <+12>: cmpl cr0,1,r8,r4
0x0ebafffc <+16>: bne- 0xebb0008 <OMRCAS8Helper+28>
0x0ebb0000 <+20>: .long 0x7cc019ad
0x0ebb0004 <+24>: bne- 0xebafff4 <OMRCAS8Helper+8>
0x0ebb0008 <+28>: mr r4,r8
0x0ebb000c <+32>: .long 0x79030022
0x0ebb0010 <+36>: blr


Are you using older versions of https://github.com/eclipse/omr/pull/2930/commits/0b769dc5c0f406ff41adc0f353c23d1070b2bb80 and https://github.com/eclipse/openj9/pull/2764/commits/7c0253f2974f8aa2a0573f3a30946960f450b8b3? If so apply the current versions instead and rebuild from scratch because it looks like a problem that I fixed. Perhaps apply the latest version of all of the patches you're using just to be safe. :grin:

The hang in java -version might also be the same problem.

The build system could be more "friendly".
It turned out that a simple make clean did not wipe openj9-openjdk-jdk8 completely.
The "only" solution was cd ./build/linux-ppc-normal-zero-release/ && rm -rf *

I've also updated the openJ9 repository to include new fixes from @ymanton :
https://github.com/lmajewski/ppc32_j9_openj9

@ymanton If I may have a little request for you:

Could you check if we are missing some code? With @PTamis we do have following state of the source code repositories:

https://github.com/lmajewski/ppc32_j9_openj9
https://github.com/lmajewski/ppc32_j9_omr
https://github.com/lmajewski/ppc32_j9_openj9-openjdk-jdk8

When build:
export VMDEBUG="-DOMR_NO_64BIT_LCE"
make BUILD_HEADLESS_ONLY=true BUILD_HEADLESS=true X11_NOT_NEEDED=yes CUPS_NOT_NEEDED=yes ALSA_NOT_NEEDED=yes PULSE_NOT_NEEDED=yes LOG=trace LOG_LEVEL=trace OPENJ9_BUILDSPEC=linux_ppc_gcc JOBS=2 CONF=linux-ppc-normal-zero-release images

Do we miss some switch?

And I can confirm that on 3.12.13 when we run:
java -jar SPECjvm2008.jar -wt 5s -it 5s -bt 2 compress

It crashes at:
OMRCAS8Helper () from /mnt/openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/images/j2re-image/lib/ppc/default/libj9gc29.so

(The other story is that we cannot link debug info to dynamic libraries as then the build crashes..... are they too large for our embedded systems?
FIX: +#$(OBJCOPY) [email protected] $@ at omrmakefiles/rules.linux.mk)

@lmajewski I think if you replace https://github.com/lmajewski/ppc32_j9_omr/commit/e3972b55a2235a3e04b90083523e2242ff38e4aa with https://github.com/eclipse/omr/commit/0b769dc5c0f406ff41adc0f353c23d1070b2bb80 and rebuild everything you will be OK. You will also no longer need export VMDEBUG="-DOMR_NO_64BIT_LCE" because it doesn't actually work with the OMR part of the code. That was my mistake earlier. :sweat_smile:

(The other story is that we cannot link debug info to dynamic libraries as then the build crashes..... are they too large for our embedded systems?
FIX: +#$(OBJCOPY) [email protected] $@ at omrmakefiles/rules.linux.mk)

Might be a bug in your objcopy, adding a debug link does not combine the debug info with the library, it just places the path to the debug info in the library. Try copying the debug file and the library to another system and then run objcopy --add-gnu-debuglink=... ... and see if it works there.

I've noticed that the problem was with the GC, which with https://github.com/lmajewski/ppc32_j9_omr/commit/e3972b55a2235a3e04b90083523e2242ff38e4aa was still calling OMRCAS8Helper directly
(MM_Scavenger::scavengeRememberedSetList -> OMRCAS8Helper), although according to the code it should use the higher-level implementation.

I've removed the #ifdefs and now recompiling the code.

I've "hacked" a bit the OMRCAS8Helper handling - removed the #ifdefs / #endifs as it seems like
flag(s) are not passed properly to e.g. garbage collector when build.
(The J9 build infrastructure is a "challenge" on its own :-) )

Please find the branch: https://github.com/lmajewski/ppc32_j9_omr/commits/working_hack_OMRCAS8Helper

With 3.12.13 kernel I can run some SPECjvm2008 tests:
/mnt/openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/images/j2re-image/bin/java -jar SPECjvm2008.jar -wt 5s -it 5s -bt 2 compress

However, I got a problem with the log() calculation test:
[018] checkMathFcts: log(0.7) evaluated to: 0.7, expected: -0.356675

This is progress, as we no longer see the "illegal instruction" issues.
I suspect that the J9 code expects FPU HW, not the SPE available on e500_v2, and hence the different result.

I've updated the omr 'devel' branch to test @ymanton's changes (the correct ones for OMRCAS8Helper), as mentioned above:
https://github.com/lmajewski/ppc32_j9_omr/commits/devel

Now, I'm re-building J9 on my HW.

@ymanton The above changes seem to work (at least J9 doesn't crash).

However, I'm experiencing some other (strange) issues :-) .
I'm testing the build with SPEC tests:

openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/images/j2sdk-image/bin/java -jar SPECjvm2008.jar -wt 5s -it 5s -bt 2 compress -ikv -ctf false -chf false

It crashes with error:
Received output:
[017] checkRemainders: long double OK
[018] checkMathFcts: log(0.7) evaluated to: 0.7, expected: -0.356675

A similar issue was discussed long ago in this thread:
http://mail.openjdk.java.net/pipermail/core-libs-dev/2007-August.txt

@ymanton Could you check if on your QEMU PPC32 setup this test passes?

Some thoughts/observations:

  1. I've debugged java with gdb, and it seems that sqrt() [jsqrt], sin(), log() [jlog] and others provide correct results.
  2. The e_log.c file (part of fdlibm) is built with the following flags:
    /usr/bin/powerpc-poky-linux-gnuspe-gcc -m32 -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=double -fPIC --sysroot=/ -DHEADLESS -W -Wall -Wno-unused -Wno-parentheses -pipe -D_GNU_SOURCE -D_REENTRANT -D_LARGEFILE64_SOURCE -D_BIG_ENDIAN -DLINUX -DNDEBUG -DARCH='"ppc"' -Dppc -DRELEASE='"1.8.0_181-internal"' -I ....... -O2 -pipe -fPIC -g -feliminate-unused-debug-types -Wno-error=deprecated-declarations -fno-delete-null-pointer-checks -fPIC -I...... -g -O0 -DTHIS_FILE='"e_log.c"' -c -MMD -MF /mnt/openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/jdk/objs/libfdlibm/e_log.d -o /mnt/openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/jdk/objs/libfdlibm/e_log.o /mnt/openj9-openjdk-jdk8/jdk/src/share/native/java/lang/fdlibm/src/e_log.c

Two optimization parameters are passed (-O0 and -O2), but the -O0 is passed last, so there should be no issue with over-optimization.
Also, the -mspe is passed, which is required for e500_v2 to generate proper code.

  3. Even more strange - I wanted to check whether jdb would help me with debugging the tests.
    ./openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/images/j2sdk-image/bin/jdb
    > run -jar SPECjvm2008.jar -wt 5s -it 5s -bt 2 compress -ikv -ctf false -chf false

And from jdb the test passes without any issues. Why?

Thanks in advance for any hints / thoughts.

I also ran the same test with -Xint, -Xnojit, and -Xnoaot.
With -Xint it took some time, but the test passed OK.
With -Xnojit, the same thing - the test was OK.
With -Xnoaot the test failed.

So the problem is definitely in the JIT compiler.
It reports wrong results:
log(0.7) evaluated to: 0.7, expected: -0.356675

Excellent! That fits with the test working in the debugger, which likely also pushed execution into the interpreter instead of running JIT-compiled code. You could use -Xjit:verbose={compile*} to get a list of methods that are compiled during the run and see if we can identify the method that is being mis-compiled. If we can figure out which one is causing problems, we can get a compilation log and see what it is doing.

If it is difficult to tell which method is responsible you can use the output of the verbose log as a limit file using
-Xjit:limitFile=(<filename>, <m>, <n>)
(see https://www.ibm.com/support/knowledgecenter/en/SSYKE2_8.0.0/com.ibm.java.vm.80.doc/docs/xjit.html). m and n are line numbers in the file specifiying the subset of methods that should be compiled and you can use a binary search to find the problem method that when compiled causes the test to fail.

I can run SPECjvm2008 just fine, but I'm running QEMU configured as a G4 Mac not as an e500v2 board.

The JIT compiler knows nothing about the SPE instruction set and will blindly generate FP instructions. I was expecting you to hit a SIGILL when you executed them, but it sounds like they are being executed somehow (kernel emulation?) but calculating the wrong results. Do you know how FP instructions are handled in your env?

@ymanton There are two issues here:

  1. Lack of FPU on e500 core:
    root@qoriq:/mnt/SPECjvm2008# ../openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/images/j2sdk-image/bin/java -Xjit:disableInlining,verbose -jar SPECjvm2008.jar -wt 5s -it 5s -bt 2 compress -ikv -ctf false -chf false
#INFO:  _______________________________________
#INFO:  Version Information:
#INFO:       JIT Level  - f799cc2
#INFO:       JVM Level  - 20181006_000000
#INFO:       GC Level   - f799cc2
#INFO:
#INFO:  Processor Information:
#INFO:       Platform Info:Unknown PPC processor
                  ^^^^^^ - this also needs to be fixed to show PPC e500_v2
#INFO:       Supports HardwareSQRT:0
#INFO:       Supports HardwareRound:0
#INFO:       Supports HardwareCopySign:0
#INFO:       Supports FPU:1
                  ^^^^^^^^ - the e500 by design doesn't have an FPU (the e5500 has one).
                  Instead it has SPE (which is NOT compatible with the FPU)
#INFO:       Supports DFP:0
#INFO:       Supports VMX:0
#INFO:       Supports VSX:0
#INFO:       Supports AES:0
#INFO:       Supports  TM:0
#INFO:       Vendor:Unknown
#INFO:       numProc=2

  2. The test above also crashes with -Xjit:optLevel=noOpt,
    which points to the JIT generating the math-related code using wrongly emulated (stubbed?) PPC ASM instructions.

  3. Excluding StrictMath.log from the JIT makes the test pass:
    -Xjit:exclude={java/lang/StrictMath.log\(D\)D}

The kernel (3.12.13) doesn't have CONFIG_FPU enabled, nor CONFIG_ALTIVEC.
Only CONFIG_SPE is enabled.

TO DO:

  • Find where the log() for FPU compatible PPC is emulated (libc? kernel?).

QUESTION @ymanton :

  • Maybe it would suffice to compile e_log.c with -O{2,3} (now it is compiled with -O0) and then prevent the JIT from compiling it on the fly, using the toolchain-optimized version instead (in interpreter mode)?
    Would that work? Or is there any other, better solution?

Do you have CONFIG_MATH_EMULATION?

The kernel emulation code is here: https://github.com/torvalds/linux/tree/master/arch/powerpc/math-emu

There are a bunch of fixes for the e500 SPE but they are from 2013. Does your kernel have those fixes? If not it might be worth getting them to see if it's a problem that is already fixed.

Maybe it would suffice to compile e_log.c with -O{2,3} (now it is compiled with -O0) and then prevent the JIT from compiling it on the fly, using the toolchain-optimized version instead (in interpreter mode)?
Would that work? Or is there any other, better solution?

That is one solution, but if there are bugs in the FP emulation you will probably run into more problems later. You can disable JIT compilation of any method that has floating point bytecodes but there is no option for that, you would have to change the compiler a bit.

I looked at the code the JIT generates for StrictMath.log(D)D and there are no calculations being performed, there are only three FP instructions, lfd, stfd, and fmr because the JIT compiled method simply calls the C-compiled version.

Can you try this simple program?

class LogTest {
        public static void main(String args[]) {
                System.out.println(java.lang.Math.log(0.7));
                System.out.println(java.lang.Math.log(0.7));
        }
}

java '-Xjit:disableAsyncCompilation,limit={java/lang/StrictMath.log(D)D}(count=0)' LogTest

If it fails the same way we can try to examine the method with gdb and see what's going on.

We do use CONFIG_MATH_EMULATION_FULL=y

The kernel emulation code is here: https://github.com/torvalds/linux/tree/master/arch/powerpc/math-emu
There are a bunch of fixes for the e500 SPE but they are from 2013. Does your kernel have those fixes? If not it might be worth getting them to see if it's a problem that is already fixed.

It turns out that 3.12.13 is missing the following (important, IMHO) patches for SPE emulation on e500:

powerpc: fix exception clearing in e500 SPE float emulation
powerpc: fix e500 SPE float rounding inexactness detection
powerpc: fix e500 SPE float to integer and fixed-point conversions
powerpc: fix e500 SPE float SIGFPE generation
powerpc: Correct emulated mtfsf instruction

After applying (porting) them to 3.12.13 the error is the same.

I looked at the code the JIT generates for StrictMath.log(D)D

Could you share how you examined (and obtained) the JIT-generated assembly? Is there any link documenting how to do it?

root@qoriq:/mnt/SPECjvm2008# ../openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/images/j2sdk-image/bin/java '-Xjit:disableAsyncCompilation,limit={java/lang/StrictMath.log(D)D}(count=0)' LogTest
-0.35667494393873245
0.7

The result is the same as in the SPECjvm2008 test. I will examine this program with JDB/GDB.

Question:
I assume that disableAsyncCompilation forces J9 to always use the JIT to generate code?

Here are some useful -Xjit options.

  • disableAsyncCompilation - Disables background/asynchronous JIT compilation and instead does compilation in application threads by pausing the threads between method calls. This is useful to get more deterministic behaviour and make debugging easier.
  • verbose, vlog=/path/to/file - Prints JIT compiler activity log to stderr or to a file if vlog= is used.
  • limit={method} - Limits the JIT compiler to only compile the named method.
  • exclude={method} - Like limit= but prevents the JIT compiler from compiling the named method. You may also want to use dontInline={method} to prevent the JIT from inlining the named method into other methods that are compiled.
  • traceFull, log=/path/to/file - Generates compilation logs, which show all the optimization passes and code generation. You can use this to examine which instructions the JIT compiler emitted. There are also more specific ones like traceCG (trace code generation only), traceILGen, and so on, but traceFull gives you everything. IIRC log= must be provided, otherwise the output will be suppressed.
  • breakAfterCompile - Calls raise(SIGTRAP) after a method is compiled; useful if you have a debugger attached. The compiler will print the start address of the compiled method so you can disassemble it or set a breakpoint.
  • breakOnEntry - Same as above, except it inserts a trapping instruction at the beginning of compiled methods so that you will break to the debugger whenever the method is executed. On PPC this inserts an illegal instruction so you have to be prepared to suppress the resulting SIGILL and step over the instruction. I find using breakAfterCompile and setting a breakpoint to be more convenient than this.
  • count=n - Causes methods to be compiled after n invocations. count=0 means the method will be compiled the first time it is called. Note however that compiling a method before it has ever been executed will sometimes cause the compiler to generate code that is very different than if the method has been executed before.

Some options like trace* and break* can apply to specific methods. You can do this by -Xjit:{method}(traceFull,log=trace.log),.... Others like disableAsyncCompilation must be applied globally.

If you want to debug this in GDB you can try something like this:

gdb --args java '-Xjit:disableAsyncCompilation,limit={java/lang/StrictMath.log(D)D},{java/lang/StrictMath.log(D)D}(count=0,breakAfterCompile)' LogTest

(gdb) run
-0.35667494393873245

=== Finished compiling java/lang/StrictMath.log(D)D at 0x9613107c ===

Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0x96130470 (LWP 17826)]
0x0ffd7860 in raise (sig=5) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
37      ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
(gdb) x/30i 0x9613107c
   0x9613107c:  lfd     f0,0(r14)
   0x96131080:  lwz     r11,40(r13)
   0x96131084:  addi    r14,r14,-16
   0x96131088:  mflr    r0
   0x9613108c:  cmplw   r14,r11
   0x96131090:  stw     r0,12(r14)
   0x96131094:  ble     0x961310d4
   0x96131098:  stfd    f0,16(r14)
   0x9613109c:  fmr     f1,f0
   0x961310a0:  lis     r3,-27231
   0x961310a4:  ori     r3,r3,6144
   0x961310a8:  addi    r4,r3,28
   0x961310ac:  ori     r3,r13,0
   0x961310b0:  lis     r12,4031
   0x961310b4:  ori     r12,r12,4164
   0x961310b8:  mtctr   r12
   0x961310bc:  bctrl
   0x961310c0:  fmr     f0,f1
   0x961310c4:  lwz     r0,12(r14)
   0x961310c8:  mtlr    r0
   0x961310cc:  addi    r14,r14,16
   0x961310d0:  blr
   0x961310d4:  addi    r14,r14,12
   0x961310d8:  li      r12,12
   0x961310dc:  bl      0x9632f950
   0x961310e0:  addi    r14,r14,-12
   0x961310e4:  b       0x96131098
   0x961310e8:  .long 0x0
   0x961310ec:  .long 0x0
   0x961310f0:  .long 0x0
(gdb) break *0x9613107c
Breakpoint 1 at 0x9613107c
(gdb) cont
Continuing.
[Switching to Thread 0xb7fcb470 (LWP 17824)]

Breakpoint 1, 0x9613107c in ?? ()
(gdb) display/i $pc
1: x/i $pc
=> 0x9613107c:  lfd     f0,0(r14)
(gdb) x/gf $r14
0xb765b150:     0.69999999999999996
(gdb) stepi
0x96131080 in ?? ()
1: x/i $pc
=> 0x96131080:  lwz     r11,40(r13)
(gdb) p/f $f0
$1 = 0.69999999999999996
(gdb) break *0x961310c0
Breakpoint 2 at 0x961310c0
(gdb) cont
Continuing.

Breakpoint 2, 0x961310c0 in ?? ()
1: x/i $pc
=> 0x961310c0:  fmr     f0,f1
(gdb) p/f $f1
$2 = -0.35667494393873245

@ymanton Thanks for the very exhaustive explanation :-)

To start off:

  1. One "strange" thing:
    run '-Xjit:disableAsyncCompilation,limit={java/lang/StrictMath.log(D)D}(count=0,breakAfterCompile)' LogTest

we do specify count=0, but at least on my J9:
The "pure SW" implementation of log is executed from:

__j__ieee754_log @ /mnt/openj9-openjdk-jdk8/jdk/src/share/native/java/lang/fdlibm/src/e_log.c:109

It provides correct result.

  2. Regarding GDB debugging:

In my case the code is identical (excluding missing ASM mnemonics for FPU instructions):

(gdb) x/30i 0x95f8907c
=> 0x95f8907c:  .long 0xc80e0000
   0x95f89080:  lwz     r11,40(r13)
   0x95f89084:  addi    r14,r14,-16
   0x95f89088:  mflr    r0
   0x95f8908c:  cmplw   r14,r11
   0x95f89090:  stw     r0,12(r14)
   0x95f89094:  ble-    0x95f890d4
   0x95f89098:  .long 0xd80e0010
   0x95f8909c:  .long 0xfc200090
   0x95f890a0:  lis     r3,-27180
   0x95f890a4:  ori     r3,r3,18944
   0x95f890a8:  addi    r4,r3,28
   0x95f890ac:  ori     r3,r13,0
   0x95f890b0:  lis     r12,4042
   0x95f890b4:  ori     r12,r12,47368
   0x95f890b8:  mtctr   r12
   0x95f890bc:  bctrl
   0x95f890c0:  .long 0xfc000890
   0x95f890c4:  lwz     r0,12(r14)
   0x95f890c8:  mtlr    r0
   0x95f890cc:  addi    r14,r14,16
   0x95f890d0:  blr
   0x95f890d4:  addi    r14,r14,12
   0x95f890d8:  li      r12,12
   0x95f890dc:  bl      0x96187950
   0x95f890e0:  addi    r14,r14,-12
   0x95f890e4:  b       0x95f89098

However, some instructions are not supported (as expected):
lfd f0,0(r14) -> .long 0xc80e0000 , but those have emulation functions in kernel (as expected)

From the context I don't quite get the algorithm:

0x95f8908c:  cmplw   r14,r11
(gdb) x/gf $r14
0xb7460a40:     1.2291371401990921e-314
(gdb) x/gf $r13
0xb7460c00:     9.682528029950874e-233

we compare those values with cmplw which on e500 is Compare Logical Word (just word - double precision IEEE-754 is 64 bit).
and we jump to:
0x95f890d4: addi r14,r14,12 ----> is it correct to add anything in this case?
And in the end we jump to 0x96187950 ----> It looks like more code generated by the JIT.

(gdb) x/30i 0x96187950
=> 0x96187950:  li      r11,31868
   0x96187954:  oris    r11,r11,3932
   0x96187958:  mtctr   r11
   0x9618795c:  bctr
   0x96187960:  .long 0x0

After the bctr:
0x0f5c7c7c <jitStackOverflow+0>: 90 01 00 88 stw r0,136(r1)

And then we end in:
Java_java_lang_StrictMath_log (env=0xb7460c00, unused=0x95d44a1c, d=3.1226928498684416e-313) at /mnt/openj9-openjdk-jdk8/jdk/src/share/native/java/lang/StrictMath.c:76

The 3.1226928498684416e-313 is quite interesting ----> I would expect 0.699999999999.
The above function exits with broken result: Value returned is $4 = -719.57043838465881
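A note on the cmplw question above: r14 and r11 hold stack addresses at that point, not floating-point data, so the unsigned word compare reads like the JIT's stack-overflow check (the slow path ends up calling jitStackOverflow, as the trace shows). A plausible sketch of the logic, with assumed register roles that are not confirmed anywhere in the thread (r14 = Java stack pointer, r11 = overflow limit loaded from the VM thread struct at r13+40):

```java
public class StackCheckSketch {
    // addi r14,r14,-16 ; cmplw r14,r11 ; ble slowPath
    static boolean needsSlowPath(long sp, long overflowLimit, int frameSize) {
        long newSp = sp - frameSize;                            // addi r14,r14,-frameSize
        return Long.compareUnsigned(newSp, overflowLimit) <= 0; // cmplw + ble
    }

    public static void main(String[] args) {
        System.out.println(needsSlowPath(0x10000L, 0x8000L, 16)); // false: fast path
        System.out.println(needsSlowPath(0x8000L, 0x8000L, 16));  // true: grow the stack first
    }
}
```

Under that reading, the `addi r14,r14,12` on the slow path is just frame bookkeeping before calling the stack-growth helper, not part of the log computation.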

Comments/Thoughts:

  1. I may not have correctly understood the algorithm for log calculation (and the "pieces" of ASM generated by the JIT).
  2. It is quite surprising that only a few FPU instructions are used (fmr, lfd, stfd) - I would expect more work done with the help of FPU operations.
  3. The 0x95f890dc: bl 0x96187950 also looks like a code generated by JIT - I will check if it shows up in the traceFull log.

And then we end in:
Java_java_lang_StrictMath_log (env=0xb7460c00, unused=0x95d44a1c, d=3.1226928498684416e-313) at /mnt/openj9-openjdk-jdk8/jdk/src/share/native/java/lang/StrictMath.c:76
The 3.1226928498684416e-313 is quite interesting ----> I would expect 0.699999999999.
The above function exits with broken result: Value returned is $4 = -719.57043838465881

d=3.1226928498684416e-313 looks wrong indeed and should be what you expected. The result -719.57043838465881 is actually correct for 3.1226928498684416e-313, so Java_java_lang_StrictMath_log is working, but somehow the argument is not what we wanted.
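That can be double-checked directly; the expected value below is the one gdb printed, and the tolerance allows for printing/rounding differences:

```java
public class LogCheck {
    public static void main(String[] args) {
        double arg = 3.1226928498684416e-313;  // the garbled argument seen in gdb
        double expected = -719.57043838465881; // the value the JNI call returned
        System.out.println(Math.abs(StrictMath.log(arg) - expected) < 1e-9); // true
    }
}
```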

Can you set the following breakpoints and print some data?

   0x9613107c:  lfd     f0,0(r14)
   0x96131080:  lwz     r11,40(r13)  ; break, p/f $f0, x/gf $r14
   0x96131084:  addi    r14,r14,-16
   0x96131088:  mflr    r0
   0x9613108c:  cmplw   r14,r11
   0x96131090:  stw     r0,12(r14)
   0x96131094:  ble     0x961310d4
   0x96131098:  stfd    f0,16(r14)
   0x9613109c:  fmr     f1,f0
   0x961310a0:  lis     r3,-27231    ; break, p/f $f1
   0x961310a4:  ori     r3,r3,6144
   0x961310a8:  addi    r4,r3,28
   0x961310ac:  ori     r3,r13,0
   0x961310b0:  lis     r12,4031
   0x961310b4:  ori     r12,r12,4164
   0x961310b8:  mtctr   r12
   0x961310bc:  bctrl
   0x961310c0:  fmr     f0,f1
   0x961310c4:  lwz     r0,12(r14)   ; break, p/f $f0, p/f $f1
   0x961310c8:  mtlr    r0
   0x961310cc:  addi    r14,r14,16
   0x961310d0:  blr
   0x961310d4:  addi    r14,r14,12
   0x961310d8:  li      r12,12
   0x961310dc:  bl      0x9632f950
   0x961310e0:  addi    r14,r14,-12
   0x961310e4:  b       0x96131098

The breakpoint results:

(gdb) si
0x95f89084 in ?? ()
(gdb) p/f $f0
$2 = Value can't be converted to integer.
(gdb) x/gf $r14
0xb7460a50:     0.69999999999999996

The 'f0' is not read as expected - the e500 doesn't have an FPU at all. We only read it through emulation (by servicing the proper exception).
Value passed to the kernel:

(gdb) si
-->lfd: D ef0cf710, ea b7460a50:
0x3fe66666 0x66666666
(gdb) x/1gf $r14
0xb7460a50:     0.69999999999999996
(gdb) x/2wx $r14
0xb7460a50:     0x3fe66666      0x66666666

Which is correct (as expected)
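As a cross-check, those two words decode to exactly the double we expect (a minimal snippet; BitsCheck is just an illustrative name):

```java
public class BitsCheck {
    // reassemble the high/low words the kernel trace shows into a double
    static double decode(int hi, int lo) {
        return Double.longBitsToDouble(((long) hi << 32) | (lo & 0xFFFFFFFFL));
    }

    public static void main(String[] args) {
        System.out.println(decode(0x3fe66666, 0x66666666)); // 0.7
    }
}
```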

After 'fmr' instruction:

(gdb) b *0x95f890a0
Breakpoint 2 at 0x95f890a0
(gdb) c
Continuing.-->fmr: D ef110d18, B ef110d10:

frD[0]: 0x3fe66666 frD[1]: 0x66666666
frB[0]: 0x3fe66666 frB[1]: 0x66666666

And the last break:

Thread 2 "main" hit Breakpoint 5, 0x95f890c0 in ?? ()
(gdb) si
-->fmr: D ef17c7d0, B ef17c7d8:
frD[0]: 0x3fe66666 frD[1]: 0x66666666
frB[0]: 0x3fe66666 frB[1]: 0x66666666
0x95f890c4 in ?? ()

It seems to me that some FPU operations were generated but not executed ...

Side question:

  • For testing - I'm trying to force J9 to not generate FPU instructions at all. To achieve that I've set (at openj9/buildspecs/linux_ppc_gcc.spec):
    <flag id="env_hasFPU" value="false"/>

But it seems not to be enough, as I still see (with -Xjit:verbose) that FPU is supported:
#INFO: Supports FPU:1

Is it possible/feasible to build OpenJ9 with no FPU support at all?

For testing - I'm trying to force J9 to not generate FPU instructions at all. To achieve that I've set (at openj9/buildspecs/linux_ppc_gcc.spec):

But it seems to not be enough as I still see (with -Xjit:verbose) that FPU is supported:

INFO: Supports FPU:1

Is it possible/feasible to build OpenJ9 with no FPU support at all?

No, it will not work. The env_hasFPU build configuration exists, but it is ignored by the JIT for PPC and probably also ignored by the rest of the VM. It was probably used for other architectures that J9 used to support.

Anyway, I think I know what the problem is, you built the VM and native parts of the class library with -mabi=spe -mspe -mfloat-gprs=double however the JIT compiler is still generating FPU code. Specifically, when the JIT generates code to call Java_java_lang_StrictMath_log it will pass arguments in FP registers and expect the return value to be in f0, however since that function was compiled for the SPE ABI it will expect and return data in GPRs. Can you rebuild the class library code with FP support instead and see what happens?

I've checked whether changing -mfloat-gprs=double to no would change anything.
Unfortunately, I receive the following error:

configure:22215: /usr/bin/powerpc-poky-linux-gnuspe-gcc  -O2 -pipe -g -feliminate-unused-debug-types -Wno-error=deprecated-declarations  -fno-delete-null-pointer-checks -m32 -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=no --sysroot=/   -Wl,-O1
-Wl,--hash-style=gnu -Wl,--as-needed -fPIC conftest.c  >&5
conftest.c:1:0: error: E500 and FPRs not supported

I will check if this will work with very generic setup.

I've also removed -mcpu=8548 -mabi=spe -mspe, but the gcc is tuned for the e500 and I cannot pass
-mfloat-gprs=no to it. This is logical, though, as the e500 doesn't have an FPU at all, so it looks like the only feasible option is to pass double values via GPRs.

Questions:

  1. For the first log() run we use pure SW implementation of log(). Would it be possible to force J9 via JIT to optimize this source code ?

  2. Where will I find code responsible for generating JIT code, which forces using FPRs as input for log() (and others) function? It seems like some general issue with prolog and epilog code generation.

For the first log() run we use pure SW implementation of log(). Would it be possible to force J9 via JIT to optimize this source code ?

No, the JIT compiler will only compile Java code, it cannot do anything for C code or the interpreter.

Where will I find code responsible for generating JIT code, which forces using FPRs as input for log() (and others) function? It seems like some general issue with prolog and epilog code generation.

The code for that is mostly here:

https://github.com/eclipse/openj9/blob/master/runtime/compiler/p/codegen/PPCJNILinkage.cpp

@ymanton Do you have maybe some hints regarding the code?

Do you perhaps remember any other J9-supported arch that doesn't use an FPU and hence uses emulated code from the outset?

In principle the ARM code supports software floating point, but I've never tried to build it or looked into the code. But searching the arm codegen for SOFTFP might give some clues for how that platform handled it.
Looks like much of the relevant code is in omr/compiler/arm/codegen/FPTreeEvaluator.cpp, but hits for SOFTFP show up in a lot of other files too.

@ymanton from your comment I understand that the JIT is responsible for emitting those
.long instructions eg .long 0xc80e0000 which as it is said above is lfd f0,0(r14)

At this issue another effort has been done before to change some of those unsupported ppc instructions to ones that are supported.

eg lfd should be replaced with evldd

Is it possible to make such a change to JIT code in order to generate different FP instructions compatible with e500v2 according to the table:
https://github.com/ibmruntimes/v8ppc/issues/119#issuecomment-72793421

If yes can we have some instructions or an example and maybe help you with this change in JIT code?
Where those changes should be done?

Thanks a lot!

Is it possible to make such a change to JIT code in order to generate different FP instructions compatible with e500v2

Yes it's possible, let's call this option A. It would be the best way to support your hardware, but it's also by far the most work and requires changing the JIT compiler and a few other parts of the VM as well. This would probably be weeks or months of work by someone who had a good understanding of the VM, JIT, and PPC code generator. If you want to undertake this you can start by adding the SPE instructions to PPCOps.ops, then you can start generating SPE instructions instead of FPU instructions by following the examples in the FP tree evaluators. You will also have to change how we generate code for functions taking FP arguments and returning FP values, you can do that in PPCPrivateLinkage.cpp and PPCJNILinkage.cpp. That's all I can think of at the moment, but there is certainly even more required that I haven't thought of. You will basically be doing a subset of the kind of work @shingarov is attempting in porting OpenJ9 to RISCV.

A less difficult solution, option B, is to continue to allow the JIT to generate FPU instructions and change how we call C code compiled for the SPE ABI. This is what @lmajewski asked about earlier. To do this you have to change https://github.com/eclipse/openj9/blob/4333a9822af0846325876bd142bcf354b5f0d07d/runtime/compiler/p/codegen/PPCJNILinkage.cpp#L92 and https://github.com/eclipse/openj9/blob/4333a9822af0846325876bd142bcf354b5f0d07d/runtime/compiler/p/codegen/PPCJNILinkage.cpp#L779 to do what you want.

The way we generate code for calls is to iterate through all of the arguments and, depending on their type, put them into the registers required by the ABI. This loop in buildJNIArgs() https://github.com/eclipse/openj9/blob/4333a9822af0846325876bd142bcf354b5f0d07d/runtime/compiler/p/codegen/PPCJNILinkage.cpp#L956-L1296 has a switch statement and cases for float and double types. What you would need to do is move the arguments from the FP registers they occupy into the correct GPR by generating an stfs or stfd to store the FP value to memory and then the appropriate SPE load to place them in the correct GPR. Similarly, return value would have to be moved from a GPR to an FPR. There is a handy function for this sort of thing https://github.com/eclipse/omr/blob/e8f52374ab0c86945d8bcc1cc1bbdbba87148fcd/compiler/p/codegen/GenerateInstructions.cpp#L61-L200 that will generate instructions to move data between the FPRs and GPRs; you can modify it to generate the correct sequences for SPE-based processors.
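At the data level the move described above is a bit-for-bit reinterpretation: the stfd-then-reload round trip through memory must preserve the 64-bit pattern, which on the 32-bit SPE ABI ends up split across a GPR pair. A sketch of that invariant (SpeMarshal and its method names are illustrative, not OpenJ9 code):

```java
public class SpeMarshal {
    // stand-in for: stfd f1,off(sp) followed by two lwz (or one evldd)
    static int[] toGprPair(double d) {
        long bits = Double.doubleToRawLongBits(d);
        return new int[] { (int) (bits >>> 32), (int) bits };
    }

    // reverse direction for the return value (GPR -> memory -> lfd)
    static double fromGprPair(int hi, int lo) {
        return Double.longBitsToDouble(((long) hi << 32) | (lo & 0xFFFFFFFFL));
    }

    public static void main(String[] args) {
        int[] p = toGprPair(0.7);
        System.out.printf("%08x %08x%n", p[0], p[1]); // 3fe66666 66666666
        System.out.println(fromGprPair(p[0], p[1]));  // 0.7
    }
}
```

The high/low words match what the kernel emulation traces earlier in the thread show for 0.7.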

An even less difficult solution, option C, would be to simply prevent the JIT from compiling any Java method that does any floating point operations and let them be interpreted forever. There might be some other changes required to make this work, but I think it would be the least effort. Performance might be worse, or perhaps even as good as or better than option B because kernel emulation is probably not cheap at all and the interpreter in this scenario would benefit from using SPE hardware.

@ymanton @PTamis Option C seems the most appealing way to get _any_ working solution for our HW.
I'm just wondering at which "level" we should catch (and disable) the FPU instructions.

Forbid the JIT compilation of certain Java methods? Like we did with StrictMath.log() (via -Xjit:exclude={java/lang/StrictMath.log\(D\)D})?

The problem is that we would need to be 100% sure that the JIT will not, by any chance, compile any such
instruction ...

This is the part of the compiler that first reads the Java bytecode for the method and starts generating the compiler's intermediate representation: https://github.com/eclipse/openj9/blob/f031ac4012794306c16a6ba8fcc9907ca6fbcf90/runtime/compiler/ilgen/Walker.cpp#L1184

You will see a giant switch in that function that processes each bytecode. Find all of the FP related bytecodes and call comp()->failCompilation<TR::ILGenFailure>("Floating point unsupported for ppcspe"); for them and see if that is enough to run SPECjvm2008. :crossed_fingers: You can look in the verbose JIT log to see all of the methods that fail.
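The shape of such a filter can be sketched as follows. The opcode values are the standard JVM ones for fadd/dadd, fmul/dmul, and fdiv/ddiv; this is only a subset, and the real switch in Walker.cpp is C++ and covers many more FP bytecodes:

```java
public class FpBytecodeFilter {
    // a subset of the FP bytecodes: fadd, dadd, fmul, dmul, fdiv, ddiv
    static boolean isFp(int opcode) {
        switch (opcode) {
            case 0x62: case 0x63: // fadd, dadd
            case 0x6a: case 0x6b: // fmul, dmul
            case 0x6e: case 0x6f: // fdiv, ddiv
                return true;
            default:
                return false;
        }
    }

    // mirrors comp()->failCompilation<TR::ILGenFailure>(...): abort the
    // compilation so the method stays interpreted
    static void walk(int opcode) {
        if (isFp(opcode))
            throw new UnsupportedOperationException("Floating point unsupported for ppcspe");
    }
}
```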

Actually, it turns out there is already an option to disable compiling methods that contain FP bytecodes. Can you try -Xjit:disableFPCodeGen and see if that works as expected?

I can confirm that -Xjit:disableFPCodeGen when used makes the test passing.
(As observed, the performance degradation is not so huge - less than 20%)

One thing though - when we run:
../openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/images/j2re-image/bin/java -Xjit:disableFPCodeGen -jar SPECjvm2008.jar -wt 5s -it 5s -bt 2 compress -ikv -ctf false -chf false

We do see:

--- --- --- --- --- --- --- --- ---

  Benchmark:   compress
  Run mode:    timed run
  Test type:   multi
  Threads:     2
  Warmup:      5s
  Iterations:  1
  Run length:  5s

Warmup (5s) begins: Mon Oct 15 13:57:55 UTC 2018
Warmup (5s) ends:   Mon Oct 15 14:01:09 UTC 2018
Warmup (5s) result: 0.65 ops/m

Iteration 1 (5s) begins: Mon Oct 15 14:01:09 UTC 2018
Iteration 1 (5s) ends:   Mon Oct 15 14:01:31 UTC 2018
Iteration 1 (5s) result: 8.37 ops/m

Valid run!
Score on compress: 8.37 ops/m

@ymanton Maybe you would know - what is the reason for such a long "warmup()" operation?

I think this is just the way the benchmark behaves. If it's running slower than 1 op per minute per thread it will run longer than your requested 5s because it has to complete at least 1 op per thread.

The reason it's running so slowly is probably because you don't have enough CPU resources available to both run the benchmark and do a lot of compilation. You can see that once the compiler has had a chance to do some work it runs faster during iteration 1.

I can confirm also that with -Xjit:disableFPCodeGen openJ9 seems to work fine on e500v2.
Also I resolved the kernel problem.

The configuration option:
CONFIG_MATH_EMULATION_FULL
has to be enabled in kernel. Otherwise the ./java -version does not return.

After some testing, it turned out that OpenJ9 has a very long startup time (up to 4x longer).
Is there any trick to re-use (store) some precompiled code at "cold" start of the program?

Try -Xshareclasses. It will create a cache file in /tmp and store class data and compiled code in it and use it in the future. The option has sub-options that you can use to change some settings and tweak the behaviour and you can find the documentation online.

Note that the first time you use it startup time will still be long since the cache will need to be created and filled, but after that you should see faster startups.

The scimark.fft.large benchmark took 30 minutes to "Warmup()", and a similar time to run the test.

../openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/images/j2re-image/bin/java -Xjit:disableFPCodeGen -Xshareclasses -jar SPECjvm2008.jar -wt 5s -it 5s -bt 2 scimark.fft.large  -ikv -ctf false -chf false

Unfortunately, this is too long.

I would expect the penalty from lack of FP support, but on the other hand we do emulate them on this SoC anyway.

During the compilation I caught some strange warnings.
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9codert_vm.a(cnathelp.o) uses soft float
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9util.a(fltconv.o) uses soft float
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9util.a(bcdump.o) uses soft float
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9util.a(fltmath.o) uses soft float
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9util.a(fltrem.o) uses soft float
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9utilcore.a(j9argscan.o) uses soft float

I dug into this a bit and saw that the JIT files were not compiled with the correct flags:
-mcpu=powerpc was used instead of -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=double

I saw that this variable is set by common.mk, so I changed it to the options above.

Compilation broke at only one point after that: PPCHWProfiler.cpp, because it includes PPCHWProfilerPrivate.hpp, which contains 64-bit assembly that is of course not compatible with our HW. So the compiler stopped.

The problem is at lines 250 to 270.
I commented those lines out and redefined MTSPR64 and MFSPR64 exactly like MTSPR/MFSPR inside the __GNUC__ section.
After that, compilation went all the way through and no hard/soft float warnings were emitted.

I guessed that this code is never called, otherwise we would have seen a SIGILL during some of the tests.
Any comments for a better fix would be appreciated.

Yes you are right, the code in the PPCHWProfiler files will only run on certain POWER processors. The inline assembly should be better guarded or the files should not be built at all on platforms like yours ideally, but it's a minor issue. I'll keep it in mind.

Most of the code in libj9jit is dedicated to the compiler itself, which doesn't do a lot of floating point math, but some code is runtime code that will be executed by the program. Some of those runtime routines are in object files built via VM makefiles and they get linked together with the JIT library, which explains those warnings. The files built with hardfloat will never interact with the ones built with softfloat so you were not in danger, but it's good to fix that anyway.

Is floating point performance important to you? If it is then you really need the JIT to support SPE. With -Xjit:disableFPCodeGen I see 30x slower performance on FP-intensive programs on large servers as well, so I think that is an unavoidable reality.

mcpu=powerpc was used instead of -mcpu=8548 -mabi=spe

Out of curiosity, how is your gcc configured? I explicitly invoke the OMR configure script with CC=powerpc-linux-gnuspe-gcc and that gcc just comes in a Debian package and -v says it was compiled with --with-cpu=8548 --enable-e500_double --target=powerpc-linux-gnuspe.

@ymanton FP cannot be excluded.
I tested tomcat to see the load time and it took considerable time to load along with all its apps.
I believe that FP is the bottleneck and that is why it took so long to load. FP tests alone are indeed around 30x slower.

@shingarov CC=powerpc-linux-gnuspe-gcc -m32 -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=double --sysroot=/ as an environment variable

The configuration comes from ELDK 5.6, which is based on Yocto daisy 1.6. I took the configuration from there in order to do a native build. Later I will also try a cross-compile with a Yocto recipe.

But if you explicitly pass -mcpu=powerpc to gcc, I guess the wrong ABI will be used, and that's why the warnings are produced. This is hard-coded in the common.mk file.

Either way even with the -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=double I did not notice any performance difference.

@ymanton

Is floating point performance important to you? If it is then you really need the JIT to support SPE.
With -Xjit:disableFPCodeGen I see 30x slower performance on FP-intensive programs on large servers as well, so I think that is an unavoidable reality.

Yes, the floating point support is necessary. The observed performance regression is not acceptable.
I'm now looking into the kernel to see what exactly fails with JIT generated code.

From my understanding -> as we emulate FPU instructions in-kernel, the J9 JIT which uses them shall work with fully emulated code. Performance shall be better than -Xjit:disableFPCodeGen , but worse than HW FP.

To start off:

  1. One "strange" thing:
    run '-Xjit:disableAsyncCompilation,limit={java/lang/StrictMath.log(D)D}(count=0,breakAfterCompile)' LogTest

we do specify count=0, but at least on my J9:
The "pure SW" implementation of log is executed from:

__j__ieee754_log @ /mnt/openj9-openjdk-jdk8/jdk/src/share/native/java/lang/fdlibm/src/e_log.c:109

It provides correct result.

The above behaviour puzzles me a bit.
Corresponding ASM code snippet:

116                 if (hx<0) return (x-x)/zero;        /* log(-#) = NaN */
   0x0f923ef0 <+172>:   lwz     r9,12(r31)
   0x0f923ef4 <+176>:   cmpwi   cr7,r9,0
   0x0f923ef8 <+180>:   bge-    cr7,0xf923f18 <__j__ieee754_log+212>
   0x0f923efc <+184>:   evldd   r10,104(r31)
   0x0f923f00 <+188>:   evldd   r9,104(r31)
   0x0f923f04 <+192>:   efdsub  r10,r10,r9
   0x0f923f08 <+196>:   lwz     r9,-32764(r30)
   0x0f923f0c <+200>:   evldd   r9,0(r9)

117                 k -= 54; x *= two54; /* subnormal number, scale up x */
   0x0f923f18 <+212>:   lwz     r9,8(r31)
   0x0f923f1c <+216>:   addi    r9,r9,-54
   0x0f923f20 <+220>:   stw     r9,8(r31)
   0x0f923f24 <+224>:   evldd   r10,104(r31)
   0x0f923f28 <+228>:   lwz     r9,-32768(r30)
   0x0f923f2c <+232>:   evldd   r9,0(r9)
   0x0f923f30 <+236>:   efdmul  r9,r10,r9
   0x0f923f34 <+240>:   evstdd  r9,104(r31)

It uses SPE ASM instructions ("ev*"/"ef*", like efdmul -> the same performance as an FPU but on GPRs), so this is the fastest possible code on this SoC.
This code is _always_ compiled - even when we set 'count=0' in
'-Xjit:disableAsyncCompilation,limit={java/lang/StrictMath.log(D)D}(count=0,breakAfterCompile)' LogTest

Even better, the JIT code uses a trampoline to jump to the function which provides the log (@ymanton is there a way to check where dcall java/lang/StrictMath.log(D)D[#618 final native static Method] is provided? The traceFull log doesn't show where this function's ASM representation can be found).

When we disable the -Xjit:disableFPCodeGen we shall use the above code, which is the fastest possible.

Maybe the problem with performance regression lies somewhere else? Maybe locking (as JVM uses several threads) as we use sync instead of lwsync or msync?

The above behaviour puzzles me a bit.

That is an implementation detail. java/lang/StrictMath.log(D)D is special in that it is a JNI method, not a Java method. This method is declared in a class as native and its implementation is in C, not Java. The JIT behaves a little differently for these kinds of methods, even if you use count=0.

@ymanton is there a way to check where dcall java/lang/StrictMath.log(D)D[#618 final native static Method] is provided? The traceFull doesn't show where this function's ASM representation can be found.

Notice that it is a native method, which means that its implementation will be in C. You can still find the assembly for it in gdb or by looking at the .o file, but the source code will be in https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/openj9/jdk/src/share/native/java/lang/StrictMath.c. The implementation eventually reaches the fdlibm version of log() in https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/openj9/jdk/src/share/native/java/lang/fdlibm/src/e_log.c.

When we disable the -Xjit:disableFPCodeGen we shall use the above code, which is the fastest possible.

Maybe the problem with performance regression lies somewhere else? Maybe locking (as JVM uses several threads) as we use sync instead of lwsync or msync?

You have to consider that yes you will execute a "fast" version of log() that uses the SPE hardware, however by using -Xjit:disableFPCodeGen more methods will now run in the interpreter instead of being compiled. Every method that has even a single float or double bytecode will never be JIT compiled, so if for example the main loop of the benchmark cannot be compiled and must execute in the interpreter, your overall performance will be much worse, even if log() is fast. As I said earlier, I can reproduce a 30x slowdown on some floating point benchmarks when I use -Xjit:disableFPCodeGen on a ppc64 machine, so it can be a big penalty even on server machines.

Do you maybe have the build-system adjustments to cross-compile OpenJ9 on ARM? I mean, ARM is also not supported (at all), so I could reuse some of its code in the PPC port.

I recently followed James' instructions and successfully cross-compiled from Ubuntu/AMD64 to the RPi and the resulting VM works fine. Caveat: you may want to read the recent conversation on Slack about back-contributing directly to the master repo, not via James' fork.

I am also actively trying to cross-compile to the e500. I am approaching it differently though, I am trying to start from (pieces of) the OMR testcompiler which kind of looks more within reach. What I understood however is that its build system is quite disconnected from the other two i.e. from both TR's and J9's. And I have a feeling that it's less actively being looked at, as while the other parts cross-compile just fine, I had to dance around things to get the tc/tril/etc to cross-compile to ARM. I'll keep you posted on the progress with tc/tril on e500.

@shingarov - Have you managed to make any progress there (with e500 or RISC V)?
If yes - could you share your code (even development stage) on github?

@ymanton during my tests I figured out something a bit strange.
In order to make java -version work on my system (kernel v4) I had to enable CONFIG_MATH_EMULATION_FULL. I also enabled the traces in the kernel and I can see all the FPU functions being emulated by the kernel.

Now I did the following test:
./java -Xint -version disables both JIT and AOT.

The strange thing I saw in dmesg was that the JVM was emitting lots of lfd and stfd instructions.
And I was wondering how this is possible, since I compile all files with -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=double, so all JVM code should be SPE-specific and not contain -mcpu=powerpc instructions.

Who is emitting those instructions if not JIT or AOT?

Who is emitting those instructions if not JIT or AOT?

There are probably a few low-level routines written in assembly that are still being called. For example https://github.com/eclipse/openj9/blob/master/runtime/util/xp32/volatile.s

@ymanton
If I may ask - have you tried/used on any of your systems the J9 with --with-zlib=system set during
configuration?

We do experience some slow downs when decompressing files (.war/.jar). After enabling the above switch the speedup was not present (but expected as this is a native lib).
The ldd output:

root@lala:/usr/lib/jvm# ldd `which java`
    ...
    libz.so.1 => /lib/libz.so.1 (0x0f9e0000)
    ...

Is there any other way to speed up decompression on J9?

No I've never used that option. It looks like it allows you to use your system's zlib rather than the one in https://github.com/ibmruntimes/openj9-openjdk-jdk8/tree/openj9/jdk/src/share/native/java/util/zip/zlib.

Even if you don't use that option you will be getting a native zlib implementation, the only difference is which one. It sounds like there is no performance to be gained by using your system's zlib over the OpenJDK one.

Not sure how you can speed up decompression, other than reducing the number of JAR files you access or maybe changing the build so that JAR files are built with no compression (i.e. jar -0 ...). Are you using the -Xshareclasses option that I suggested earlier? I don't know if it actually allows us to skip decompressing jar files or not, but it generally helps startup.
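At the zip level, jar -0 simply writes every entry with the STORED method, so reading them back skips the inflater entirely. A small sketch of that mechanism (class name and data are illustrative, not from the JDK build):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class StoredEntrySketch {
    public static void main(String[] args) throws Exception {
        byte[] data = "hello hello hello hello".getBytes("UTF-8");

        // Write one entry with the STORED method, i.e. no compression,
        // which is what `jar -0` does for every entry in the archive.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            ZipEntry e = new ZipEntry("a.txt");
            e.setMethod(ZipEntry.STORED);
            e.setSize(data.length);            // STORED entries must declare
            e.setCompressedSize(data.length);  // size, compressed size
            CRC32 crc = new CRC32();
            crc.update(data);
            e.setCrc(crc.getValue());          // ...and CRC up front
            zos.putNextEntry(e);
            zos.write(data);
            zos.closeEntry();
        }

        // Read it back: no inflation happens for a STORED entry.
        try (ZipInputStream zis =
                 new ZipInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            ZipEntry e = zis.getNextEntry();
            System.out.println(e.getName() + " method="
                + (e.getMethod() == ZipEntry.STORED ? "STORED" : "DEFLATED"));
        }
    }
}
```

The trade-off is exactly the one mentioned above: the archive gets larger, but opening it costs no CPU for decompression.
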

@ymanton

Not sure how you can speed up decompression, other than reducing the number of JAR files you access or maybe changing the build so that JAR files are built with no compression (i.e. jar -0 ...)

I do need to check whether those files can be converted.

Are you using the -Xshareclasses option that I suggested earlier? I don't know if it actually allows us to skip decompressing jar files or not, but it generally helps startup.

I've tried it and the results are promising: with the cache placed in /tmp I can see a startup speedup of around 15%.

I did some detailed tests on the target production application (with -Xjit:disableFPCodeGen):

The interesting parts of the JIT log:

root@lala:~# grep -E "^! " /tmp/JIT_log.20181025.143321.6552 | grep -i Hash
! java/util/WeakHashMap.<init>(IF)V time=3741us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=815 MB
! java/util/HashMap.resize()[Ljava/util/HashMap$Node; time=3498us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=806 MB
! java/util/Hashtable.rehash()V time=2038us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=774 MB
! java/util/WeakHashMap.<init>()V time=1249us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=746 MB
! java/util/HashMap.<init>()V time=1279us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=746 MB
! java/util/HashMap.<init>(IF)V time=2877us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=735 MB
! java/util/HashMap.putMapEntries(Ljava/util/Map;Z)V time=2132us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=729 MB
! java/util/HashMap.<init>(I)V time=1588us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=728 MB
! java/util/Hashtable.<init>(IF)V time=4472us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=723 MB
! java/util/Hashtable.<init>()V time=1270us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=713 MB
! java/util/LinkedHashMap.<init>(IF)V time=1085us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=695 MB
! java/util/HashSet.<init>(IFZ)V time=1327us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=695 MB
! java/util/LinkedHashSet.<init>()V time=1038us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=695 MB

And maybe the most important:

grep -E "^! " /tmp/JIT_log.20181025.143321.6552 | grep -i Zip
! java/util/zip/ZipCoder.getBytes(Ljava/lang/String;)[B time=3326us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=815 MB
! java/util/zip/ZipCoder.toString([BI)Ljava/lang/String; time=3622us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=746 MB

The code responsible for unzip'ing.
I guess that compilationRestrictedMethod is caused by adding -Xjit:disableFPCodeGen.

It seems like this is the main slowdown factor (the 'prod' application is 3x slower and the CPU usage is very high). Hence my question above, whether unzipping can be replaced with an on-system library.

It seems like the zip decompression is the bottleneck, at least from the FPU point of view.

I took the webapp (*.war) and re-archived it with jar:
usr/bin/fastjar -c0f web.war web/ -> the size increased a bit (23 MiB -> 27 MiB),

but the execution time (for this part) was reduced from 60 seconds to 2.5 seconds !!!

but the execution time (for this part) was reduced from 60 seconds to 2.5 seconds !!!

That's surprising. I don't see how disableFPCodeGen fits into this since I don't think the default zlib compression algorithm uses floating point, but maybe I'm wrong.

Is the perf tool or OProfile available to you? If so can you collect some profiles?

With perf you should use the -Xjit:perfTool option, with OProfile use -agentlib:jvmti_oprofile.

Output from the perf report (for the part performing the zlib decompression)

# Overhead   Command        Shared Object
 20.24%  jsvc.ppc  libj9vm29.so         [.] bytecodeLoop
  4.45%  jsvc.ppc  [kernel.kallsyms]    [k] program_check_exception
  3.10%  jsvc.ppc  libj9thr29.so        [.] omrthread_spinlock_acquire
  2.14%  jsvc.ppc  [kernel.kallsyms]    [k] do_mathemu
  1.45%  jsvc.ppc  [kernel.kallsyms]    [k] do_resched
  1.05%  jsvc.ppc  [kernel.kallsyms]    [k] __do_softirq
  0.89%  jsvc.ppc  [kernel.kallsyms]    [k] finish_task_switch

...
  0.19%  jsvc.ppc  [kernel.kallsyms]    [k] lfd
  0.19%  jsvc.ppc  [kernel.kallsyms]    [k] stfd
  0.15%  jsvc.ppc  [kernel.kallsyms]    [k] fsub
  (and also fdiv, fmuls, etc.)

It is apparent that some FPU instructions have slipped into the zip "decompression code".

However, neither images/j2sdk-image/jre/lib/ppc/libzip.so nor ./jdk/lib/ppc/default/libj9zlib29.so contains FPU ASM instructions.

I've also grep'ed the libj9*.so libs (in the build directory) and lfd, stfd, etc. appear there very often.

OK thanks. bytecodeLoop is the Java interpreter doing a lot of work because of disableFPCodeGen, and program_check_exception, do_mathemu, lfd, stfd, etc. are FP emulation in the kernel. do_resched, __do_softirq, etc. are possibly also emulation related. So it looks like 20% of your time is in the interpreter and 10% or more is in FP emulation.

I'll look into some of this when I have some time in the next couple of days and get back to you.

I took a quick look at why ZipCoder.getBytes() and ZipCoder.toString() were not being compiled with -Xjit:disableFPCodeGen and it is indeed because they use floating point operations; for example float CharsetEncoder.maxBytesPerChar() and CharsetDecoder.maxCharsPerByte(), which is unfortunate because the calculation isn't all that interesting. There are probably lots of other places where some minor FP usage is causing methods to fail compilation.

If you want you can try the following change to your JCL to see how much performance you can get back for unzipping:

diff --git a/jdk/src/share/classes/java/util/zip/ZipCoder.java b/jdk/src/share/classes/java/util/zip/ZipCoder.java
index b920b82..cc449e6 100644
--- a/jdk/src/share/classes/java/util/zip/ZipCoder.java
+++ b/jdk/src/share/classes/java/util/zip/ZipCoder.java
@@ -45,7 +45,7 @@ final class ZipCoder {

     String toString(byte[] ba, int length) {
         CharsetDecoder cd = decoder().reset();
-        int len = (int)(length * cd.maxCharsPerByte());
+        int len = (int)(length * maxCharsPerByte);
         char[] ca = new char[len];
         if (len == 0)
             return new String(ca);
@@ -76,7 +76,7 @@ final class ZipCoder {
     byte[] getBytes(String s) {
         CharsetEncoder ce = encoder().reset();
         char[] ca = s.toCharArray();
-        int len = (int)(ca.length * ce.maxBytesPerChar());
+        int len = (int)(ca.length * maxBytesPerChar);
         byte[] ba = new byte[len];
         if (len == 0)
             return ba;
@@ -127,6 +127,8 @@ final class ZipCoder {
     private Charset cs;
     private CharsetDecoder dec;
     private CharsetEncoder enc;
+    private int maxCharsPerByte;
+    private int maxBytesPerChar;
     private boolean isUTF8;
     private ZipCoder utf8;

@@ -139,11 +141,15 @@ final class ZipCoder {
         return new ZipCoder(charset);
     }

+    private int maxCharsPerByteRU() { return (int)(dec.maxCharsPerByte() + 0.5f); }
+    private int maxBytesPerCharRU() { return (int)(enc.maxBytesPerChar() + 0.5f); }
+
     private CharsetDecoder decoder() {
         if (dec == null) {
             dec = cs.newDecoder()
               .onMalformedInput(CodingErrorAction.REPORT)
               .onUnmappableCharacter(CodingErrorAction.REPORT);
+            maxCharsPerByte = maxCharsPerByteRU();
         }
         return dec;
     }
@@ -153,6 +159,7 @@ final class ZipCoder {
             enc = cs.newEncoder()
               .onMalformedInput(CodingErrorAction.REPORT)
               .onUnmappableCharacter(CodingErrorAction.REPORT);
+            maxBytesPerChar = maxBytesPerCharRU();
         }
         return enc;
     }

Thanks @ymanton for your investigation.

As one can see above, code which at first glance doesn't require FP support in fact needs it.
I think that the only feasible solution would be to:

  1. Enable full FPU support in the kernel (some FPU functions, like sqrt(), are also implemented in libc)
  2. Recompile the whole SW stack with -mcpu=powerpc (do not use SPE at all)
  3. Only then use OpenJ9 with FPU enabled.

Taking the above into consideration, we could avoid massive changes in the OpenJ9 code and just add support for 32-bit PPC to its repository.

Have you managed to make any progress there (with e500 or RISC V)? If yes - could you share your code (even development stage) on github?

@lmajewski Our immediate goals at this stage are much more modest, being currently confined to just OMR. On RISC-V, we successfully JIT some simple methods such as Fibonacci. We hope to share that initial code during this coming RISC-V summit.

On e500, I would like to understand how you were able to run so much of OpenJ9 so successfully. In my experiments so far, I have confined myself to the much simpler TestCompiler, and even for those trivial tests the generated code is sometimes incorrect. For example, I am trying to debug problems in the area of PPCSystemLinkage::calculateActualParameterOffset() and around it; sometimes the offsets are wrong, with catastrophic results: the lwz will trash the saved LR in the link area, causing the blr to segfault. I would like to understand how OpenJ9 doesn't crash in the same place. Investigating...

@shingarov PPCSystemLinkage implements the ppc64le ABI only (because OMR is only supported on ppc64le), it does not handle the AIX/ppc64be ABI or the ppc32 ABI. We don't use the native ABIs for Java, we use our own and you can find the implementations for that stuff in https://github.com/eclipse/openj9/blob/master/runtime/compiler/p/codegen/PPCPrivateLinkage.cpp and https://github.com/eclipse/openj9/blob/master/runtime/compiler/p/codegen/PPCJNILinkage.cpp

@lmajewski and @PTamis just curious if you're still pursuing this and/or still using OpenJ9 on e500?

I'm going to spend some time figuring out what we can salvage from the various patches that have been discussed in this issue that can be contributed back to OMR and OpenJ9.

Dear @ymanton please find some small update from this project:

  1. As you might have noticed, the PowerPC SPE architecture support has been removed from GCC 9 [1], hence there is no point in providing OpenJ9 support for this particular architecture (especially as it is time consuming)
  2. Considering the above, the idea was to recompile the whole rootfs and userspace binaries to target "generic" PPC32 and see what we can achieve with OpenJ9 + SW math emulation.
    Some initial investigation has been carried out, but the decision to switch hasn't been made.

[1] - https://www.phoronix.com/scan.php?page=news_item&px=GCC-Removes-PowerPCSPE

Further notes upon researching:

I'm not sure if there's enough interest in non-e500-using communities, though, since most of the (non-ancient) obtainable hardware for an average user is 64-bit POWER with AltiVec:

  • IBM machines
  • Other enterprise/special purpose high performance computers
  • Raptor Talos

With that said my PowerBook G4 would certainly be easier to justify keeping if I could run Java software at a reasonable (given the system's inherent performance) clip. OpenJDK/zero is absolutely miserable, and IBM's old J9 JVM only runs on it up to version 1.6 or so (where it performs quite well in Debian).

@ymanton sorry for my late reply.
For the time being we have stopped any further development on this issue. It is a pity since we made so much progress, but the time frame we had for completion was too narrow and the risk was high.
As @lmajewski said, we believe that it might be easier to use pure PPC32 and drop all SPE instructions. I made some tests in this direction and the results were very positive.
I want to believe that we will start working on this issue again at the start of next year, when we will have to reconsider our options for the Java being used on our PPC targets.

