omrsysinfo_get_addressable_physical_memory()/omrsysinfo_get_limit() usage on z/OS

Created on 25 Mar 2021  ·  23 Comments  ·  Source: eclipse/omr

Problem summary:

A recent internal investigation discovered that on z/OS the OMR GC doesn't size the heap according to the ulimit -M value, which sets the limit for memory above the bar [1].

GC uses IARV64/__moservices to create the heap, so the heap can only be allocated in memory above the bar.

The root cause is omrsysinfo_get_addressable_physical_memory(), which invokes sysinfo_get_limit(portLibrary, OMRPORT_RESOURCE_ADDRESS_SPACE, &memoryLimit) https://github.com/eclipse/omr/blob/0d0662bc6f31cbfd93035aaa567a9e17740a33c5/port/unix/omrsysinfo.c#L3151. There, OMRPORT_RESOURCE_ADDRESS_SPACE is mapped to RLIMIT_AS, which returns the maximum address space size for the process, rather than to RLIMIT_MEMLIMIT, which covers the memory above the bar.

As per z/OS API doc [2]:

int getrlimit(int resource, struct rlimit *rlp);
...
The resource argument specifies which resource to get the hard and/or soft limits for, and may be one of the following values:
...
 RLIMIT_MEMLIMIT
The maximum amount of usable storage above the 2 gigabyte bar (in 1 megabyte segments) that can be allocated.
...
 RLIMIT_AS
The maximum address space size for the process, in bytes. If the limit is exceeded, malloc() and mmap() functions will fail with an errno of ENOMEM. Automatic stack growth will also fail.
...

Solution proposed

omrsysinfo_get_limit() maps OMRPORT_RESOURCE_ADDRESS_SPACE to RLIMIT_MEMLIMIT on 64-bit z/OS, and to RLIMIT_AS on 31-bit z/OS and on other non-z/OS *nix-like platforms.

If a user asks for a way to query the z/OS RLIMIT_AS value, a new port library API will be added.

fyi @joransiu @pshipton @dmitripivkine @babsingh @0xdaryl

[1] https://www.ibm.com/support/knowledgecenter/SSLTBW_2.4.0/com.ibm.zos.v2r4.ieaa500/ieaa500109.htm
[2] https://www.ibm.com/support/knowledgecenter/SSLTBW_2.2.0/com.ibm.zos.v2r2.bpxbd00/rgrlmt.htm

z bug zos

All 23 comments

https://github.com/eclipse/omr/pull/5886#issuecomment-815170074
_"I still think that omrsysinfo_get_addressable_physical_memory() should subtract 2GB from the usableMemory on z/OS"_

@dmitripivkine any comments, since GC is the main (and so far only) user of this API?

To fill in the details, the first 2GB is not usable since IARV64 is used to allocate above the bar, so including it in the omrsysinfo_get_addressable_physical_memory() result seems wrong.

I agree the first 2GB should not be included in the result. Is it correct that, with the proposed solution, RLIMIT_MEMLIMIT reports the available memory above the 2GB bar? If so, the first 2GB should also be excluded in the case where the limit is not set.

To fill in the details, the first 2GB is not usable since IARV64 is used to allocate above the bar, so including it in the omrsysinfo_get_addressable_physical_memory() result seems wrong.

From the internal experiment and investigation, the first 2GB is not included in the result for RLIMIT_MEMLIMIT.
@joransiu please correct if this is not the case.

"I still think that omrsysinfo_get_addressable_physical_memory() should subtract 2GB from the usableMemory on z/OS"

I think this confusion came from an initial idea of returning 2GB + the RLIMIT_MEMLIMIT value and then having GC subtract the 2GB again, which was never actually implemented.

I'm saying that when RLIMIT_MEMLIMIT is not set, usableMemory is returned instead, and this value should have 2GB subtracted.

or in the case where usableMemory - 2GB is less than the RLIMIT_MEMLIMIT.

i.e. on z/OS the code should be

    /* exclude the 2GB below the bar */
    uint64_t usableMemory = portLibrary->sysinfo_get_physical_memory(portLibrary) - ((uint64_t)2 * 1024 * 1024 * 1024);

Isn't sysinfo_get_limit(portLibrary, OMRPORT_RESOURCE_ADDRESS_SPACE, &memoryLimit) to retrieve the memory above bar instead?

Isn't sysinfo_get_limit(portLibrary, OMRPORT_RESOURCE_ADDRESS_SPACE, &memoryLimit) to retrieve the memory above bar instead?

The limit may not be set, i.e. set to unlimited, or perhaps it can be set larger than the available memory. In which case the API should still return a correct answer.

usablePhysicalMemory = omrsysinfo_get_addressable_physical_memory(), which returns the minimum of (sysinfo_get_physical_memory(), sysinfo_get_limit()).
sysinfo_get_physical_memory() computes its result with the following code snippet:

    J9CVT * __ptr32 cvtp = ((J9PSA * __ptr32)0)->flccvt;
    J9RCE * __ptr32 rcep = cvtp->cvtrcep;
    result = ((U_64)rcep->rcepool * J9BYTES_PER_PAGE);

This returns 67447922688 on the target z/OS LPARs.
sysinfo_get_limit() is affected by ulimit -M, i.e., memory above the bar, and it is this setting that caused the GC heap size calculation problem.

Right, but omrsysinfo_get_addressable_physical_memory() can return the wrong answer when ulimit -M is unlimited. Although the ulimit reports memory above the bar, I'm not sure ((U_64)rcep->rcepool * J9BYTES_PER_PAGE) does, so we should subtract 2GB on z/OS so that omrsysinfo_get_addressable_physical_memory() always returns the memory above the bar on z/OS. And add a comment to explain that.

@joransiu do you know if ((U_64)rcep->rcepool * J9BYTES_PER_PAGE) is only reporting memory above the bar?

do you know if ((U_64)rcep->rcepool * J9BYTES_PER_PAGE) is only reporting memory above the bar?

From the documentation [1] RCEPOOL specifies:

Total number of frames in 4k units currently obtainable by the workload, including those already in use. Frames excluded are those backing permanent storage, frames offline, bad frames once they are marked offline, frames reserved for system use, and frames in the 2G LFAREA.

As such, the description suggests it accounts for the number of frames that can be allocated to back virtual pages, and is not limited to above the bar. I'm not sure if we need to subtract 2GB from this value either, as frames should be allocated only for the pages that are touched.

Note the point about the 2GB LFAREA, in which real storage is pre-assigned to back those large pages. RCEPOOL won't capture the frames available for that.

[1] https://www.ibm.com/docs/en/zos/2.4.0?topic=information-rce-mapping

It seems we should subtract something, but 2GB would be too much and we don't know how much of RCEPOOL is below the bar.

Yeah, what we need to subtract really comes down to how much we allocate and actually touch below the bar. I guess subtracting 2GB would give us a conservative amount.

Seems we can make a better guess if we take these into account.

RCEBELPL - THE SAME AS RCEPOOL EXCEPT THAT ONLY FRAMES BELOW 16M REAL ARE COUNTED.
RCEABVPL - Same as RCEPOOL, but only counts frames from 16M to 2G

I'm checking with RSM team to get clarity on this, and whether it makes sense to go with: RCEPOOL - RCEBELPL - RCEABVPL.

After chatting with the RSM team:

  • rcepool - rcebelpl - rceabvpl will give us the available number of 4k frames above the bar defined on the system. This counts both in-use and available frames, minus the excluded list mentioned in RCEPOOL description [1].
  • STGTEST service [2] may be a better approach, as it returns info about available storage. It returns a set of 3 values:

    • Use of the first number affects system performance little, if at all.
    • Use of the second number might affect system performance to some degree.
    • Use of the third number might substantially affect system performance.

    Using the 2nd of the 3 returned values might be a good starting point, but we'll need to experiment if we go down this path.


Note: 2GB large pages that are pre-allocated in LFAREA are not recognized by either of these approaches.

[1] https://www.ibm.com/docs/en/zos/2.4.0?topic=information-rce-mapping
[2] https://www.ibm.com/docs/en/zos/2.4.0?topic=event-obtain-system-measurement-information-stgtest

Thanks Joran for the updates. I built a small C program similar to yours in another issue, and have some questions.

void showLimits() {
    J9CVT * __ptr32 cvtp = ((J9PSA * __ptr32)0)->flccvt;
    J9RCE * __ptr32 rcep = cvtp->cvtrcep;
    uint64_t result = (rcep->rcepool * J9BYTES_PER_PAGE);
    printf("rcep->rcepool = %llu (%x) \n", rcep->rcepool, rcep->rcepool);

    struct rlimit rlp;
    int rc = getrlimit(RLIMIT_AS, &rlp);
    if (0 == rc) {
        printf("getrlimit RLIMIT_AS cur: %d max: %d\n", rlp.rlim_cur, rlp.rlim_max);
    }
    rc = getrlimit(RLIMIT_MEMLIMIT, &rlp);
    if (0 == rc) {
        printf("getrlimit RLIMIT_MEMLIMIT cur: %d max: %d\n", rlp.rlim_cur, rlp.rlim_max);
    }
}
void zosmalloc2(int num) {
    long *array = (long *)malloc(num * sizeof(long));
    if (NULL != array) {
        long *index = array;
        int i = 0;
        for (i = 0; i < num; ++i) {
            *index++ = i % 10;
        }
        printf("malloc succeeded with array address 0x%p \n", array);
    }
}
void moservices2(int num) {
    __mopl_t mymopl = {0};
    void *mymoptr = NULL;
    memset(&mymopl, 0, sizeof(__mopl_t));
    mymopl.__moplrequestsize = num;   /* units are in 1MB chunks */
    mymopl.__mopldumppriority = __MO_DUMP_PRIORITY_HEAP;

    int rc = __moservices(__MO_GETSTOR, sizeof(mymopl), &mymopl, &mymoptr);
    if (0 == rc) {
        printf("moservices successful, mymoptr address: 0x%p\n", mymoptr);
    }
}






    showLimits();
    zosmalloc2(99);
    showLimits();
    moservices2(512);
    zosmalloc2(999999);
    showLimits();






rcep->rcepool = 1027626 (fae2a) 
getrlimit RLIMIT_AS cur: 2147483647 max: 2147483647
getrlimit RLIMIT_MEMLIMIT cur: 0 max: 0
malloc succeeded with array address 0x5008601230 
rcep->rcepool = 1027626 (fae2a) 
getrlimit RLIMIT_AS cur: 2147483647 max: 2147483647
getrlimit RLIMIT_MEMLIMIT cur: 0 max: 0
moservices successful, mymoptr address: 0x5008800000
malloc succeeded with array address 0x5028800050 
rcep->rcepool = 1027626 (fae2a) 
getrlimit RLIMIT_AS cur: 2147483647 max: 2147483647
getrlimit RLIMIT_MEMLIMIT cur: 0 max: 0






core file         8192b
cpu time          10800 
data size         unlimited 
file size         unlimited 
stack size        unlimited 
file descriptors  10000 
address space     unlimited 
memory above bar  17592186040320m

It appears the rcepool result wasn't affected by either malloc or __moservices.
Could you try it on STLABA0?

Regarding STGTEST, is there an API for such service?

It appears the rcepool result wasn't affected by either malloc or __moservices.

rcepool captures the number of frames defined on the system, and accounts for both in-use and free frames. Here's the doc, which is admittedly awkwardly phrased:

Total number of frames in 4k units currently obtainable by the workload, including those already in use.

That might explain why the value doesn't get affected by malloc or __moservices.

Regarding STGTEST, is there an API for such service?

The doc only shows a HLASM example at the end. I wasn't able to find if there's an equivalent C API available. I'll check with the team.

Total number of frames in 4k units currently obtainable by the workload, including those already in use.
That might explain why the value doesn't get affected by malloc or __moservices.

Aha, that makes sense.

Additionally

Frames excluded are those backing permanent storage, frames offline, bad frames once they are marked offline, frames reserved for system use, and frames in the 2G LFAREA.

What are the frames in the 2G LFAREA? Is that the memory between the line and the bar? Is it excluded from RCEPOOL?

What are the frames in the 2G LFAREA? Is that the memory between the line and the bar? Is it excluded from RCEPOOL?

The LFAREA parameter[1] allows a sys prog to specify how much fixed 1MB and 2GB large pages will be made available on the system. For 2GB large pages, the equivalent real storage is reserved to back those 2GB frames, which is likely (my guess) why it's excluded from RCEPOOL calculation.

In other words, if we were to allocate a Java heap backed by 2GB fixed (non-pageable) large pages, that memory will come out of the real storage reserved for 2GB large pages specified by LFAREA, rather than the frames accounted by RCEPOOL.

[1] https://www.ibm.com/docs/en/zos/2.4.0?topic=lfarea-parameter

Regarding STGTEST, is there an API for such service?

STGTEST only provides an assembler interface as a SYSEVENT service. We can however generate inline asm from C to invoke it. It takes two parameters, R0: coded as SYSEVENT ID 75 and R1: pointing to a 12-byte below-the-bar buffer. SVC clobbers R14 / R15.

Here's some sample code to invoke it:

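#define _XOPEN_SOURCE_EXTENDED 1
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
        /* STGTEST needs a 12-byte buffer below the bar, so use 31-bit storage. */
        int* stgtestbuffer = (int*)__malloc31(sizeof(int)*3);
        if (NULL == stgtestbuffer) {
                return 1;
        }

        /* Clear all three words; sizeof(stgtestbuffer) is only the pointer size. */
        memset(stgtestbuffer, 0, sizeof(int)*3);
        __asm(" LGHI 0,75\n");                  /* R0 = SYSEVENT ID 75 (STGTEST) */
        __asm(" LG 1,%0"::"m"(stgtestbuffer));  /* R1 = address of the buffer */
        __asm(" SVC 95\n":::"r14","r15");       /* SVC clobbers R14 / R15 */

        printf("STGTEST:\n[Raw: %x Size: %lld]\n[Raw: %x Size: %lld]\n[Raw: %x Size: %lld]\n",
                stgtestbuffer[0], ((long long)stgtestbuffer[0]) * 4096,
                stgtestbuffer[1], ((long long)stgtestbuffer[1]) * 4096,
                stgtestbuffer[2], ((long long)stgtestbuffer[2]) * 4096);
        free(stgtestbuffer);
        return 0;
}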

Here's the output on my local machine.
Raw is the value returned by service in Hex, Size is in bytes (Raw * 4096 for 4k frames).

$ ./stgtest
STGTEST:
[Raw: e4b2f8 Size: 61390946304]
[Raw: e5a46c Size: 61644128256]
[Raw: e5a46c Size: 61644128256]
