OpenJ9: OutOfMemoryErrors on AIX with openj9-0.15.1

Created on 18 Oct 2019 · 44 Comments · Source: eclipse/openj9

I'm currently trying to get to the bottom of why an application running in WebSphere Liberty Application Server on AIX 7.2 can be started using an IBM VM, but not with an OpenJ9 VM. I believe the problem is caused by JIT. The 2 VMs I'm using are:

Java(TM) SE Runtime Environment (build 8.0.5.20 - pap6480sr5fp20-20180802_01(SR5 FP20))
IBM J9 VM (build 2.9, JRE 1.8.0 AIX ppc64-64-Bit Compressed References 20180731_393394 (JIT enabled, AOT enabled)
OpenJ9 - bd23af8
OMR - ca1411c
IBM - 98805ca)
JCL - 20180719_01 based on Oracle jdk8u181-b12

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-b10)
Eclipse OpenJ9 VM (build openj9-0.15.1, JRE 1.8.0 AIX ppc64-64-Bit Compressed References 20190717_374 (JIT enabled, AOT enabled)
OpenJ9 - 0f66c64
OMR - ec782f2
JCL - f147086 based on jdk8u222-b10)

Using OpenJ9, I'm seeing OutOfMemoryErrors being thrown. The only way I've found to prevent this is to specify:

-Xnojit

I've also experimented with LDR_CNTRL and -Xlp:codecache:pagesize but with no success.

JIT memory usage in the javacore file doesn't look excessive:

1MEMUSER JRE: 1,017,715,352 bytes / 23398 allocations
1MEMUSER |
:
2MEMUSER +--JIT: 48,773,848 bytes / 2469 allocations
2MEMUSER | |
3MEMUSER | +--JIT Code Cache: 18,874,944 bytes / 9 allocations
2MEMUSER | |
3MEMUSER | +--JIT Data Cache: 6,291,648 bytes / 3 allocations
2MEMUSER | |
3MEMUSER | +--Other: 23,607,256 bytes / 2457 allocations

All 44 comments

On an OutOfMemory, OpenJ9 generates some diagnostic files - a javacore, system core, and heapdump - that can be used to investigate the cause of the OOM.

Can you share those files? We'll have an easier time resolving this if we can see what was happening at the time it occurred.

Hi Dan,

Thanks for your response. I finally got the go-ahead to upload these files, they can be found here:

https://github.com/JonDGH/j9-aix-oom

Regards,
Jon

~You made reference to the javacore, but I don't see it in the collection of diagnostic files.~

Nm, I see it now.

Seems pthread_create() is returning EAGAIN:
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/p_bostechref/pthread_create.html
EAGAIN | If WLM is running, the limit on the number of threads in the class is reached.
EAGAIN | The limit on the number of threads per process has been reached.

Thanks for this. I won't have access to the AIX box until Wednesday now, but iirc the core.dmp file was showing that there were around 180 threads running. I don't believe that WLM is running or that there are any limits set on the number of threads a process can start.

Anyway, thanks again - I'll do some more digging on Wednesday.

The javacore file shows:

2CIUSERLIMIT RLIMIT_THREADS unlimited unlimited
2CIUSERLIMIT RLIMIT_NPROC unlimited unlimited

I've also managed to get the server to start by setting the native thread stack size (-Xmso) down from 256K to 176K (the IBM VM runs the server happily with this set to 1024K).

Even though I can get the application to start, I don't have confidence that it will run successfully.

There looks to be a more general problem with the creation of threads using that J9 VM. I've written some simple code that starts a number of threads that all just sleep "forever". Running with:

java -Xmx1024m -Xss256K -Xmso256K StartThreads 1000

This completes successfully for the IBM VM, but the J9 VM fails around the 300 mark. The code, class file and traces are in:

https://github.com/JonDGH/j9-aix-oom

Regards,
Jon
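
For readers without access to the repository, a test along the lines described above might look like the following. This is only a minimal sketch, not the original StartThreads source: the method name startSleepers, the use of daemon threads, and the catch of OutOfMemoryError around Thread.start() are all assumptions made for illustration.

```java
// Hypothetical reconstruction of the kind of test described above.
// It is NOT the original StartThreads code from the linked repository.
final class StartThreads {

    // Starts up to `count` threads that just sleep "forever" and returns
    // how many were actually started before thread creation failed.
    static int startSleepers(int count) {
        int started = 0;
        for (int i = 0; i < count; i++) {
            Thread t = new Thread(() -> {
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException e) {
                    // Interrupted: let the thread end.
                }
            });
            // Assumption: daemon threads, so the JVM can exit once main returns.
            t.setDaemon(true);
            try {
                t.start();
            } catch (OutOfMemoryError e) {
                // Thread.start() throws OutOfMemoryError when the native
                // thread cannot be created.
                System.out.println("Failed after " + started + " threads: " + e.getMessage());
                break;
            }
            started++;
            System.out.println("Started: " + started);
        }
        return started;
    }

    public static void main(String[] args) {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        startSleepers(count);
    }
}
```

It would be run with the same command line as above, for example java -Xmx1024m -Xss256K -Xmso256K StartThreads 1000, and the last "Started:" line shows how far thread creation got.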

@zl-wang can you spot anything in the executable header which is different between IBM and OpenJ9 builds that would explain this behavior?

Not sure it matters, but there are two different compiler versions used.
IBM Java uses xlc 12.0.1
OpenJ9 uses xlc 13.1.3

Nothing stood out to me except the fact that the OpenJ9 text/data are in the way of the java heap allocation (but that is irrelevant to this issue with a 1GB heap).

It appears to me to be a ulimit issue. Before I set various limits to unlimited, I could recreate this on my machine even with -Xint. After I set them to unlimited, it ran through without a problem:

openj9_8.0.15/bin/java -Xmx1024m -Xss256K -Xmso256K StartThreads 1000

Started: 1
Started: 2
[...]
Started: 999
Started: 1000
root at abruzzo - /ppc_abruzzo/zlwang [!]

openj9_8.0.15/bin/java -version

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-b10)
Eclipse OpenJ9 VM (build openj9-0.15.1, JRE 1.8.0 AIX ppc64-64-Bit Compressed References 20190717_374 (JIT enabled, AOT enabled)
OpenJ9 - 0f66c64
OMR - ec782f2
JCL - f147086 based on jdk8u222-b10)
root at abruzzo - /ppc_abruzzo/zlwang [!]

ulimit -a

core file size (blocks, -c) 1048575
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 2000
pipe size (512 bytes, -p) 64
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
root at abruzzo - /ppc_abruzzo/zlwang [!]

On AIX 7.2 using the same JVM parameters, I am also able to create 1000 threads with OpenJ9 0.15.1. I also tried with 10,000 and it worked.

This seems unrelated to the java heap placement, as in both the original javacore from https://github.com/eclipse/openj9/issues/7503#issuecomment-546297230 and on my machine, the object heap is located at 0xE0000000.

These are the hard limits on the AIX box that I'm using:

-> ulimit -aH

time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 4194304
memory(kbytes) unlimited
coredump(blocks) unlimited
nofiles(descriptors) 65536
threads(per process) unlimited
processes(per user) unlimited

And the soft limits:

-> ulimit -aS

time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 131072
stack(kbytes) 32768
memory(kbytes) 32768
coredump(blocks) 2097151
nofiles(descriptors) 32768
threads(per process) unlimited
processes(per user) unlimited

If I set the soft limit for data to unlimited the J9 VM runs in the same way as the IBM VM. Shouldn't the VM be setting the soft limit based on the -Xmx value?

https://www.ibm.com/developerworks/community/blogs/troubleshootingjava/entry/ulimit_and_xmx?lang=en

The machine I'm using has the same soft limits, but no problem creating 10,000 threads.
bin/java -Xmx1024m -Xss256K -Xmso256K StartThreads 10000

time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         131072
stack(kbytes)        32768
memory(kbytes)       32768
coredump(blocks)     unlimited
nofiles(descriptors) unlimited
threads(per process) unlimited
processes(per user)  unlimited
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-b10)
Eclipse OpenJ9 VM (build openj9-0.15.1, JRE 1.8.0 AIX ppc64-64-Bit Compressed References 20190717_374 (JIT enabled, AOT enabled)
OpenJ9   - 0f66c64
OMR      - ec782f2
JCL      - f147086 based on jdk8u222-b10)



core file size        (blocks, -c) unlimited
data seg size         (kbytes, -d) unlimited
file size             (blocks, -f) unlimited
max memory size       (kbytes, -m) unlimited
open files                    (-n) unlimited
pipe size          (512 bytes, -p) 64
stack size            (kbytes, -s) hard
cpu time             (seconds, -t) unlimited
max user processes            (-u) 128
virtual memory        (kbytes, -v) unlimited

The only change to the limits made by OpenJ9 (and the IBM Java VM) is to the file descriptors.
https://github.com/eclipse/openj9/blob/200fc549e0400c6d25820866220d85507df77394/runtime/vm/jvminit.c#L1855

It seems that IBM Java does set the RLIMIT_DATA limit in the JCL code, but this change is specific to IBM Java and isn't in OpenJDK. OpenJ9 builds use OpenJDK.

@andrew-m-leonard can you please look at this, perhaps we can contribute this change to OpenJDK. The IBM setrlimit() code is in jdk/src/java.base/unix/native/libjli/java_md_solinux.c

I suppose we could also consider making this change in OpenJ9.
@DanHeidinga

Assuming the data limits code is along the lines of the file descriptor limits, it seems like the workaround for us is to add:

ulimit -d -S $(ulimit -d -H)

to our launch script.

Am I correct in thinking that this isn't specific to AIX?

If there's no code to adjust the limits, it seems odd that this is the case: "The machine I'm using has the same soft limits, but no problem creating 10,000 threads."

Agreed, I don't understand why it works on the machines we have, but not yours.
@andrew-m-leonard can you get any history of why the change was added to IBM Java?

The IBM Java change to adjust the RLIMIT_DATA limit is specific to AIX.

@JonDGH I do notice that you have a stack limit stack(kbytes) 4194304, while the other machines do not. Not sure if this is related. @zl-wang any ideas?

I see the same issue even with the stack size set to unlimited, so seems like it's solely down to the data limit:

-> ulimit -aH

time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) unlimited
memory(kbytes) unlimited
coredump(blocks) unlimited
nofiles(descriptors) 65536
threads(per process) unlimited
processes(per user) unlimited

-> ulimit -aS

time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 131072
stack(kbytes) 32768
memory(kbytes) 32768
coredump(blocks) 2097151
nofiles(descriptors) 32768
threads(per process) unlimited
processes(per user) unlimited

> I suppose we could also consider making this change in OpenJ9.

+1 I support fixing this in OpenJ9 rather than having a patch to the Extensions repo. Given there's a patch in the IBM JDK for this, we've clearly identified a need for it in the past.

What is the DATA-limit fix (in JCL)? Set it to accommodate the java heap? That would miss the target, since the java heap is typically not allocated from the C/system heap, where the DATA limit applies. It looks to me like it just happens to work out.

> What is the DATA-limit fix (in JCL)? Set it to accommodate the java heap?

@zl-wang the JCL fix sets RLIMIT_DATA to RLIM64_INFINITY if it isn't already set to that. If RLIM64_INFINITY can't be set, it prints the message below. I don't know that this change is related to the java heap. In this case the limit affects the number of threads which can be created, not the size of the heap which can be used. When the JVM fails to create a thread, an OOM error occurs: "java/lang/OutOfMemoryError" "Failed to create a thread: retVal -1073741830, errno 11" received.

WARNING: JVM is running with hard ulimit not set to unlimited, out of memory errors may occur
If an out of memory error occurs try running "chuser data_hard=-1 username" as root
before running java

@zl-wang if you didn't catch it, see https://github.com/eclipse/openj9/issues/7503#issuecomment-549781070: "If I set the soft limit for data to unlimited the J9 VM runs in the same way as the IBM VM", meaning that 1000 threads can be created with OpenJ9, rather than getting an OOM after 300 threads.

That makes sense. It should be applied to all platforms (with resource limits), shouldn't it?

> That makes sense. It should be applied to all platforms (with resource limits), shouldn't it?

@zl-wang It has never been applied to all platforms before, and afaik nobody has reported a problem on other platforms. We'll see if Andrew can get any more information from investigating why the limit change was introduced on AIX.

What we still don't understand is why the machine I'm using, and the one you are using, don't have the same limitation, even though at least my machine and the problematic machine have the same soft limit for data. Likely we have different revisions of AIX 7.2, but I'm not sure what to check. I did notice the order of lines in the hard limit output (ulimit -aH) is different between the machines. The uname -a output on my machine is AIX aix72p7vm14 2 7 00F72C294C00

I'm not sure if any of this is helpful, but on my machine:
$ uname -a
AIX paix303 2 7 00FB112A4C00

$ oslevel -q

Known Maintenance Levels

7.2.0.0

@pshipton on my machine, I did indeed need to (re)set the limits in order to run that test case, as I described in https://github.com/eclipse/openj9/issues/7503#issuecomment-549493444.

ok, so seems only my machine is wonky. It does seem to be a different level (00F72C294C00 instead of 00FB112A4C00), perhaps there was a bug.

Thanks for all the assistance on this. I don't think the dumps and code in my repo (https://github.com/JonDGH/j9-aix-oom) are of any value now, so I'll be deleting it shortly. I can always make anything it contained available again if it's needed.

@pshipton Looking into rt-patch and the jazz db for setrlimit() and why/where it was implemented in IBM Java

~setrlimit() was implemented in IBM Java as a merge from Oracle.~ Still looking through jazz to see if there is a reason on why this was ported over.

MERCURIAL COMMIT
jdk8u20-b08 merge: merge Oracle aix porting changes into this patch.

EDIT: Full list of ports for JDK8u20-b08 http://mail.openjdk.java.net/pipermail/jdk8u-dev/2014-April/000532.html. I am currently running through these as jazz doesn't appear to state why this change was included in IBM Java

My suspicion is that the full details on why this was included in OpenJDK in the first place are behind Oracle's closed doors.

@pshipton Other than jazz and the reviews at http://cr.openjdk.java.net/~amurillo/8u20/hs25.20-b08-jdk8u20-b08.webrev/ are there any other places worth looking into if I wanted to dig deeper into this?

There should be history in OpenJDK for the addition, did you track that down?

i.e. if you figure which of the changes in http://cr.openjdk.java.net/~amurillo/8u20/hs25.20-b08-jdk8u20-b08.webrev/ added the code, you can look at the individual change. Like https://bugs.openjdk.java.net/browse/JDK-8037915

@M-Davies are you sure the setrlimit came from OpenJDK? It seems unlikely, because it's not in OpenJDK now. The IBM Java code is tagged IBM-aix-bringup. I think this can be traced back in the IBM Java 8 code stream.

> i.e. if you figure which of the changes in http://cr.openjdk.java.net/~amurillo/8u20/hs25.20-b08-jdk8u20-b08.webrev/ added the code, you can look at the individual change. Like https://bugs.openjdk.java.net/browse/JDK-8037915

I looked through those changes already and couldn't see it. Looking into hotspot's source, they also don't seem to use setrlimit either. I believe you're right that I'm looking in the wrong place

> @M-Davies are you sure the setrlimit came from OpenJDK? It seems unlikely, because it's not in OpenJDK now. The IBM Java code is tagged IBM-aix-bringup. I think this can be traced back in the IBM Java 8 code stream.

@andrew-m-leonard is looking into getting me access to the IBM Java 8 source over IBM Java 9 at the moment so I can track this change back

Src code itself doesn't reveal much, only that the changes in Java 9 are identical to the ones in Java 8 so it's possible they were introduced to both versions at the same time. I'll have a look through the changelogs at https://developer.ibm.com/javasdk/downloads/sdk8/, see if I can find any references to OutOfMemory errors or setting rlimits

Related articles:

@M-Davies the title of 70929 is "basic class library bringup changes" which means these changes have been forward ported from an older version of Java. You need to look back further at the version of Java where the change was first introduced.

Jazz/rtc doesn't have any more relevant information on why this was included. @andrew-m-leonard and I had a look through Jira but it didn't reveal much information there either. I'm certain that this was either an IBM or Oracle change that is documented in a closed location.

Perhaps the Current Release JCL team can help track it down.

Paul Cheeseman found a small extract back from 2001. From sov defect 29024 in CMVC

Below is the information found in old reports from 2001, which seems to correspond to Java 1.3. I could no longer find a setLibAndExec() function, or any mention of LDR_CNTRL or MAXDATA. We did recently make this change https://github.com/ibmruntimes/openj9-openjdk-jdk8/pull/349

The [thread and threadgroups] tests fail since they run out of malloc'ed space to allocate thread stacks. Test succeeds
if ulimit -d unlimited is run first. Test also succeeds with smaller stack sizes. The behaviour of
malloc on 64-bit is due to be changed so that MAXDATA can be set to indicate
the soft data rlimit.

java_md.c setLibAndExec() needs to have the
#if !defined(__64BIT__) and corresponding
#endif
removed so that the MAXDATA setting is used for
both 32- and 64-bit JVMs.
...from around the testing and setting of the environment
variable LDR_CNTRL=MAXDATA=0x80000000
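
If thread stacks really are carved out of the malloc'ed data segment, as the 2001 report suggests, the soft data limit gives a rough ceiling on how many stacks can fit. The sketch below is only back-of-envelope arithmetic using numbers quoted earlier in this thread (soft data limit of 131072 KB, -Xmso of 256K and 176K). The class and method names are hypothetical, and the result is an upper bound that ignores every other allocation in the data segment, so it is not a prediction of the exact failure point.

```java
// Back-of-envelope estimate only; names are hypothetical.
// Assumption: each native thread stack is allocated from the data segment
// that the RLIMIT_DATA soft limit applies to.
final class DataLimitEstimate {

    // Upper bound on the number of stacks of stackSizeKb that fit in dataLimitKb.
    static long maxThreadStacks(long dataLimitKb, long stackSizeKb) {
        if (dataLimitKb < 0 || stackSizeKb <= 0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        return dataLimitKb / stackSizeKb;
    }

    public static void main(String[] args) {
        long softDataLimitKb = 131072; // data(kbytes) soft limit from ulimit -aS above

        // -Xmso256K: at most 512 stacks fit
        System.out.println(maxThreadStacks(softDataLimitKb, 256));

        // -Xmso176K: at most 744 stacks fit
        System.out.println(maxThreadStacks(softDataLimitKb, 176));
    }
}
```

A ceiling of 512 stacks at 256K is at least consistent with the test failing around the 300 mark, since the same data segment also has to hold the JVM's other native allocations, and with smaller stacks allowing the server to start.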
