Openj9: Unexpected JVM startup behavior when a release build java uses nightly build SCC

Created on 5 Nov 2020  路  21Comments  路  Source: eclipse/openj9

Raising this issue expecting : (Correct me if I'm wrong)

  • SCC's nonfatal option makes sure JVM starts up even if SCC is not accessible or unusable
  • Boot classes cache of a same major version (same/different builds) will not make any interruption for JVM startup

This issue is an extension for the issue seen at AdoptOpenJDK tests https://github.com/AdoptOpenJDK/openjdk-tests/issues/2031

Scenario:

The latest AdoptOpenJDK OpenJ9 images comes with a pre built SCC (boot classes only), As part of tests we have mounted the release build from this link on the AdoptOpenJDK OpenJ9 nightly docker image adoptopenjdk/openjdk8-openj9:nightly and we have seen that the java -version is resulting empty and the exit code is 1

vm userRaised

All 21 comments

Mounting release build on to nightly

Appalis-MacBook-Air:~ bharathappali$ docker run --rm -v /Users/bharathappali/Downloads/jdk8u272-b10:/opt/java/openjdk -it adoptopenjdk/openjdk8-openj9:nightly

Environment:

root@cba9819db184:/# env
HOSTNAME=cba9819db184
LANGUAGE=en_US:en
JAVA_HOME=/opt/java/openjdk
PWD=/
HOME=/root
LANG=en_US.UTF-8
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
OPENJ9_JAVA_OPTIONS=-Xshareclasses:name=openj9_system_scc,cacheDir=/opt/java/.scc,readonly,nonFatal
TERM=xterm
JAVA_TOOL_OPTIONS=-XX:+IgnoreUnrecognizedVMOptions -XX:+IdleTuningGcOnIdle
SHLVL=1
LC_ALL=en_US.UTF-8
PATH=/opt/java/openjdk/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
JAVA_VERSION=jdk8u
_=/usr/bin/env

So by default java takes the SCC settings as -Xshareclasses:name=openj9_system_scc,cacheDir=/opt/java/.scc,readonly,nonFatal as OPENJ9_JAVA_OPTIONS env var is set

Exploration:

root@cba9819db184:/# ls -lh /opt/java/.scc/C290M4F1A64P_openj9_system_scc_G41L00 
-rwxrwxrwx 1 root root 13M Oct 30 17:06 /opt/java/.scc/C290M4F1A64P_openj9_system_scc_G41L00

root@cba9819db184:/# java -version
root@cba9819db184:/# echo $?
1

root@cba9819db184:/# ls -lh /opt/java/.scc/C290M4F1A64P_openj9_system_scc_G41L00 
-rwxrwxrwx 1 root root 13M Oct 30 17:06 /opt/java/.scc/C290M4F1A64P_openj9_system_scc_G41L00

root@cba9819db184:/# java -Xshareclasses:name=openj9_system_scc,cacheDir=/opt/java/.scc -version
openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-b10)
Eclipse OpenJ9 VM (build openj9-0.23.0, JRE 1.8.0 Linux amd64-64-Bit Compressed References 20201022_822 (JIT enabled, AOT enabled)
OpenJ9   - 0394ef754
OMR      - 582366ae5
JCL      - 41f03341d5 based on jdk8u272-b10)

root@cba9819db184:/# echo $?
0

root@cba9819db184:/# ls -lh /opt/java/.scc/C290M4F1A64P_openj9_system_scc_G41L00 
-rw-r--r-- 1 root root 13M Nov  5 08:47 /opt/java/.scc/C290M4F1A64P_openj9_system_scc_G41L00

root@cba9819db184:/# java -version
openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-b10)
Eclipse OpenJ9 VM (build openj9-0.23.0, JRE 1.8.0 Linux amd64-64-Bit Compressed References 20201022_822 (JIT enabled, AOT enabled)
OpenJ9   - 0394ef754
OMR      - 582366ae5
JCL      - 41f03341d5 based on jdk8u272-b10)

Once readonly,nonfatal setting is removed the JVM is printing the java version. And later on with readonly,nonfatal also the java version is printed to stdout.

Initially the error code is 1 when we do a java -version and after a run without readonly,nonfatal options, error code is 0.

The permissions for the cache is changed (I feel its expected as JVM access / modifies it as root)

Initially the error code is 1 when we do a java -version and after a run without readonly,nonfatal options, error code is 0

Once you removed readOnly, the JVM deletes the old cache that is from a different build and create a new one, so error code is always 0 after that.

In your initial run that has error code 1, can you run with -Xshareclasses:name=openj9_system_scc,cacheDir=/opt/java/.scc,readonly,verbose -Xtrace:print={j9shr} ? You will see a lot of trace points printed to the console. There should be some error message explaining the error.

Once you removed readOnly, the JVM deletes the old cache that is from a different build and create a new one, so error code is always 0 after that.

Thanks for the info, I was in an impression that not specifying readOnly makes JVM add cache if the scc has free space, wasn't aware that it deletes the existing and creates a new cache.

can you run with -Xshareclasses:name=openj9_system_scc,cacheDir=/opt/java/.scc,readonly,verbose -Xtrace:print={j9shr} ?

Sure will run and update the issue.

Forgot to mention earlier that JVM is exiting with error code even with nonfatal option

This link states using nonfatal will turn off the class data sharing and starts the JVM if SCC fails.

Unfortunately the nonfatal option is not perfect. It depends on what goes wrong. If the cache can be opened and appears to be valid after the integrity checks, the JVM will use it. If there is a problem after that, it's too late to not use the cache, and any undetected problem can take down the JVM.

To put it another way, nonfatal works for normal problems, but doesn't cover bugs.

If the cache can be opened and appears to be valid after the integrity checks, the JVM will use it. If there is a problem after that, it's too late to not use the cache, and any undetected problem can take down the JVM.

Thanks @pshipton. I'm not aware of the internals of OpenJ9 CDS and SCC, Sorry I shouldn't have considered nonfatal as a perfect solution after knowing things are failing even with it. Your comment makes it a bit clear for me to diagnose at which point the problem may exist. (After cache being opened and performing integrity checks).

I have tried to get a trace as mentioned in this comment and its around 150K lines to go through. Here is what i have tried:

  • Searched for keywords ERROR EXCEPTION FAIL and haven't found any results related to the issue

I'm trying out -Xtrace for the first time. Can i please get more info on whatelse can i try out ?

As you searched for key word ERROR, and haven't found any helpful result. I guess you didn't hit the trace point:
j9shr.2272 INIT j9shr_init: exit on non-fatal error

Did you see:
j9shr.445 < INIT HOOK FIND CLASS: Exiting hookFindSharedClass
if yes, SCC is already started up and functioning as expected. Class in SCC is being returned to the class loader. Then something else is wrong.

If no, did you see
j9shr.11 < CM startup successful
If j9shr.11 is not there, the cache failed to be started up, other trace points in the failure path that contains CM startup should give some information on why it is not started up. But I see most failure CM startup trace points have word "fail" or "error" in it.

Thanks @hangshao0 for your guidance.

I have seen this when i searched for INIT j9shr_init

root@abd57009f7cb:/# java -Xshareclasses:name=openj9_system_scc,cacheDir=/opt/java/.scc,readonly,verbose -Xtrace:print={j9shr} -version 2>&1 | grep "INIT j9shr_init"
21:39:16.791 0x14e00           j9shr.1554     - INIT j9shr_init failed to enter 'write' area lock. String table will not be initialized.

There are around 106 results for INIT HOOK FIND CLASS: Exiting hookFindSharedClass

When i was going through the trace i have seen JVM is looking for 100 layers of given cache name, the first one (0) is successfully detected and the other 99 resulted printing file not found. I'm jus curious to know why JVM is searching for 100 layers (0-99) when there is only 1 layer (layer 0) in the cache directory.

EDIT: I could see CM startup successful as well

root@abd57009f7cb:/# java -Xshareclasses:name=openj9_system_scc,cacheDir=/opt/java/.scc,readonly,verbose -Xtrace:print={j9shr} -version 2>&1 | grep "CM startup successful"
21:46:48.368 0x14e00           j9shr.11       < CM startup successful

j9shr.1554 - INIT j9shr_init failed to enter 'write' area lock. String table will not be initialized.

This is an expected trace point as you are using readonly, string table in the SCC won't be used.

why JVM is searching for 100 layers (0-99) when there is only 1 layer (layer 0) in the cache directory.

The JVM always start up the cache from the top-most layer down to layer 0. The code is searching for the top-most layer.

j9shr.11 < CM startup successful
There are around 106 results for INIT HOOK FIND CLASS: Exiting hookFindSharedClass

As you are seeing these trace points, the SCC is already functioning, i.e. classes are being found in the cache. Then something else is wrong.

I tried reproducing your issue, but things are working fine on my side.

  1. Create a cache using a nightly build (from yesterday):
java -Xshareclasses:cacheDir=/tmp,name=Cache1 -version              
openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-202011041720-b10)
Eclipse OpenJ9 VM (build master-a236ce7ee, JRE 1.8.0 Linux amd64-64-Bit Compressed References 20201104_839 (JIT enabled, AOT enabled)
OpenJ9   - a236ce7ee
OMR      - e5f5e201d
JCL      - 289ae443 based on jdk8u272-b10)
  1. Run the same release build as yours (openj9-0.23.0) to startup the cache:
java -Xshareclasses:cacheDir=/tmp,name=Cache1,readonly,nonfatal,verbose -version
[-Xshareclasses persistent cache enabled]
[-Xshareclasses verbose output enabled]
JVMSHRC285I Opened shared class persistent cache Cache1 read-only
JVMSHRC246I Attached shared classes persistent cache Cache1
JVMSHRC274I Memory page protection disabled for cache
JVMJITM005W AOT code in the shared class cache cannot run on current JVM release. Ignoring AOT code in shared class cache.
openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-b10)
Eclipse OpenJ9 VM (build openj9-0.23.0, JRE 1.8.0 Linux amd64-64-Bit Compressed References 20201022_822 (JIT enabled, AOT enabled)
OpenJ9   - 0394ef754
OMR      - 582366ae5
JCL      - 41f03341d5 based on jdk8u272-b10)
JVMSHRC168I Total shared class bytes read=272. Total bytes stored=0
JVMSHRC818I Total unstored bytes due to the setting of shared cache soft max is 0. Unstored AOT bytes due to the setting of -Xscmaxaot is 0. Unstored JIT bytes due to the setting of -Xscmaxjitdata is 0.

Well in my case I'm using the cache generated by nightly build (which comes pre built in adoptopenjdk/openjdk8-openj9:nightly image) and mounting the release build to jdk path in the nightly image and i see this.

Appalis-MacBook-Air:~ bharathappali$ docker run --rm -v /Users/bharathappali/Downloads/jdk8u272-b10:/opt/java/openjdk -it adoptopenjdk/openjdk8-openj9:nightly
root@def96433fb23:/#  java -Xshareclasses:name=openj9_system_scc,cacheDir=/opt/java/.scc,readonly,nonFatal,verbose -version
[-Xshareclasses persistent cache enabled]
[-Xshareclasses verbose output enabled]
JVMSHRC285I Opened shared class persistent cache openj9_system_scc read-only
JVMSHRC246I Attached shared classes persistent cache openj9_system_scc
JVMSHRC274I Memory page protection disabled for cache
JVMJITM005W AOT code in the shared class cache cannot run on current JVM release. Ignoring AOT code in shared class cache.
root@def96433fb23:/# echo $?
1
root@def96433fb23:/# 

The JVM always start up the cache from the top-most layer down to layer 0. The code is searching for the top-most layer.

Thanks for the info @hangshao0

mounting the release build to jdk path in the nightly image

OK, if I put the release JDK to the same location where the nightly build JDK was, I am able to re-produce this issue.

I turned on -Xshareclasses sub-option verboseIO and -verbose:init, it always stops at:

...
Found class java/lang/invoke/ConstantDoubleHandle in shared cache for class-loader id 0.
Found class java/lang/invoke/InvokeGenericHandle in shared cache for class-loader id 0.
Found class java/lang/invoke/MethodHandleCache in shared cache for class-loader id 0.
Found class java/lang/InternalRamClass in shared cache for class-loader id 0.
Found class com/ibm/oti/util/WeakReferenceNode in shared cache for class-loader id 0.
Found class java/lang/ref/WeakReference in shared cache for class-loader id 0.
                completed with rc=-2 in 2592573807 nano sec.
        for library zeroInitStages...
                completed with rc=0 in 592 nano sec.
        for library VMInitStages...
                completed with rc=0 in 93888 nano sec.
        for library threadInitStages...
                completed with rc=0 in 353 nano sec.

Checking results for stage JCL_INITIALIZED

The issue here is having one JDK populate the SCC with its bootstrap classes, then use another JDK (from a different build) to load such classes (by mounting the second JDK at the same location and turn on SCC in readOnly mode). This is basically equivalent to copying the bootstrap jars, like rt.jar, from the first JDK to the second JDK. Things won't work this way.

Is there a reason why you need to use readOnly in the second JDK (the release build) ?

For the JVM side, I think we could do a better job.
In readOnly mode, the JVM is not able to delete the SCC and recreate a new one, so the SCC buildID check is currently disabled. I guess we should turn on buildID check and let SCC startup fail on a buildID mis-match error. The bootstrap classes and AOT are not usable for this JVM. This is a change of behaviour to some existing users.

The jar timestamp check should fail, classes should only be re-used if they match. Is that not working? I can understand an AOT mismatch causing a problem, but mismatched classes shouldn't be loaded from the cache.

I'm not against the proposal. It means we don't need to worry about updating the cache generation ever again. There is limited benefit to sharing a cache between different JVM builds anyway. Without using readonly, caches would only be shared on Windows and zOS since the mismatched cache can't be deleted when in use.

The jar timestamp check should fail, classes should only be re-used if they match. Is that not working?

The timestamp of each jar in the bootstrap class path is checked once, and I see the timestamp change has been detected. Normally the JVM will mark all the classes in this jar as stale so that they won't be returned to class loader. However, in this issue, the JVM opened SCC in readOnly mode, it cannot do stale making. So only the first class from a bootstrap jar is not returned to the class loader, all the following classes are returned.

Ya, that is a bug.

Although it can't update the cache, the running JVM should recognize everything in the classpath as stale.

Yes, there is a bug here:

MARKED_STALE_FLAG is set by this line inside the else block (line 1123) if we detect a timestamp change:
https://github.com/eclipse/openj9/blob/master/runtime/shared_common/ClasspathManagerImpl2.cpp#L1145

But when reaching this function for the first time, as there is no local class path, we always go into the if block:
https://github.com/eclipse/openj9/blob/master/runtime/shared_common/ClasspathManagerImpl2.cpp#L1096
The only timestamp check is done by localValidate_CheckAndTimestampManually(), but I don't see MARKED_STALE_FLAG is set if timestamp change is detected. After that, each time we enter this function, we will always go into the else block (line 1123). However, timestamp won't be checked again, so MARKED_STALE_FLAG is never set.

Was this page helpful?
0 / 5 - 0 ratings