This is opened for 3 reasons:
$ cat foo.cc
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <malloc.h>

void *list[1024 * 1024];

int r() {
  // add one so as to never return 0
  return 1 + random() % 137;
}

int main() {
  // use calloc so that the pages are 'touched'.
  for (int i = 0; i < 1024 * 1024; i++)
    list[i] = calloc(r(), 1);
  for (int i = 0; i < 1024 * 1024; i++)
    free(list[i]);
  // sleep long enough to inspect the process with top / pmap
  sleep(60 * 1000);
}
Observations on Linux (RHEL 8):
The static memory usage of this code (without the callocs) is 12 MB.
top output after the program has run and slept for long enough:
$ top -p 6777
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6777 root 20 0 99772 96344 932 S 0.0 5.1 0:00.16 a.out
$ cat /proc/meminfo | grep MemFree
MemFree: 1637932 kB
$
top output after the system is heavily loaded:
$ top -p 6777
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6777 root 20 0 99772 0 0 S 0.0 0.0 0:00.16 a.out
$ cat /proc/meminfo | grep MemFree
MemFree: 70444 kB
pmap shows that 85 MB is still mapped into the process, consistent with top:
0000000002c7c000 87384K rw--- [ anon ]
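For anyone wanting to reproduce the VIRT and RES figures from inside the process rather than through top, here is a minimal sketch, assuming Linux, that reads the VmSize and VmRSS fields of /proc/self/status. The helper name proc_status_kb is made up for illustration and is not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns the value (in kB) of a field from /proc/self/status,
 * e.g. "VmRSS" or "VmSize", or -1 if the field cannot be read.
 * Linux-specific: other platforms expose this information differently. */
long proc_status_kb(const char *key) {
  assert(key != NULL);
  FILE *fp = fopen("/proc/self/status", "r");
  if (fp == NULL) return -1;

  char line[256];
  size_t key_len = strlen(key);
  long value = -1;
  while (fgets(line, sizeof line, fp) != NULL) {
    /* Lines look like "VmRSS:\t   96344 kB". */
    if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
      if (sscanf(line + key_len + 1, "%ld", &value) != 1) value = -1;
      break;
    }
  }
  fclose(fp);
  return value;
}
```

Calling proc_status_kb("VmRSS") before the calloc loop, after the free loop, and again once the system is under memory pressure should track the RES column shown by top.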
Observations:
rss stays high (~ virtual memory) forever, if there is no load on the system
Problems / questions:
rss is a function not only of the program's own memory activity but also of what the rest of the system is doing. Virtual memory, on the other hand, appeared to be the sum of the program's allocations (malloc/calloc/mmap...), but it is not: it depends on glibc's memory management implementation and its discretion as to when to retain chunks and when to release them back to the OS, which chunk sizes qualify for release, and possibly on memory usage heuristics as well. When I used 4K chunks for calloc/free, I got different results. In Cloud workloads this has implications: a user would want to minimize the memory charged against their process and would do everything in their application to achieve that, but because of the dependency on system load and allocation caching behavior, that effort may not translate into lower reported usage?
The definition of a memory leak becomes loose, and detection becomes less comprehensive. A plurality of tools profiling the app from different angles would bring in mutually contradicting results? For example, in a batch process, what is the difference, seen from outside, between code that genuinely leaks an MB and an MB that is cached in malloc arenas, and why should I worry more about the former than about the latter?
I read about wss (working set size), which is a truer measure of how much memory the program is actively engaged with. The idea looks like tracking paging-in, as that would reflect the program's actual needs. I could not find an API for this, nor a platform-neutral way to obtain it. Even if we could, there is an issue: the value is transient, as it depends on the history of usage, not the current state?
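On Linux, the method described in the Brendan Gregg article referenced in this thread can be approximated from inside the process: clear the per-page referenced bits through /proc/self/clear_refs, let the program run for an interval, then sum the Referenced fields in /proc/self/smaps. The sketch below assumes a kernel that exposes both files; the names wss_reset and wss_referenced_kb are illustrative only. It is not portable, and the result depends on the chosen interval, which is exactly the transience concern raised above.

```c
#include <assert.h>
#include <stdio.h>

/* Resets the kernel's per-page "referenced" bits for this process.
 * Returns 0 on success, -1 if /proc/self/clear_refs is not writable. */
int wss_reset(void) {
  FILE *fp = fopen("/proc/self/clear_refs", "w");
  if (fp == NULL) return -1;
  int ok = (fputs("1\n", fp) >= 0);
  if (fclose(fp) != 0) ok = 0;
  return ok ? 0 : -1;
}

/* Sums the "Referenced:" fields (kB) over all mappings listed in
 * /proc/self/smaps. Returns -1 if the file cannot be opened. */
long wss_referenced_kb(void) {
  FILE *fp = fopen("/proc/self/smaps", "r");
  if (fp == NULL) return -1;

  char line[512];
  long total = 0;
  long kb;
  while (fgets(line, sizeof line, fp) != NULL) {
    /* Lines of interest look like "Referenced:          4 kB". */
    if (sscanf(line, "Referenced: %ld kB", &kb) == 1) {
      assert(kb >= 0);
      total += kb;
    }
  }
  fclose(fp);
  return total;
}
```

A caller would invoke wss_reset(), run the workload for the interval of interest, and then read wss_referenced_kb(); pages touched in between are the ones counted.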
I tested on macOS and Windows and saw the same behavior, with minor differences in the numbers (probably owing to system configuration and to differences in memory management in the C library / kernel), but the pattern is the same.
I found references that confirm my theory:
http://www.brendangregg.com/wss.html
https://www.networkworld.com/article/2298360/kernel-space--how-much-memory-am-i-really-using-.html
relevant excerpt:
The problem is that the numbers exported by the current kernels are nearly meaningless.
The reported virtual size of an application is nearly irrelevant; it says nothing about
how much of that virtual space is actually being used. The resident set size (RSS)
number is a little better, but there is no information on sharing of pages there. The
/proc/pid/smaps file gives a bit of detail, but also lacks sharing information. And the
presence of memory pressure can change the situation significantly.
With that, I would like to convert this into a (libuv?) feature request:
Use case: the ability for an application to determine its own memory usage in a reasonable manner, as one of:
virtual memory minus chunks residing in malloc cache
working set size
with a preference for the former.
/cc @nodejs/libuv
affected module: @nodejs/workers
virtual memory minus chunks residing in malloc cache
I'm not sure what you're asking for here. What's a "malloc cache"?
working set size
How would you measure that? The approach in Gregg's article is... creative.... but not something that can be generalized across platforms, possibly not even architectures.
@bnoordhuis - malloc cache is a term I coined, for lack of a better qualifier. If you look at my numbers in the first comment, 94 MB is still attributed to the process long after the malloc/free activity is over. Only 12 MB of that belongs to the program; the rest should have been released and no longer counted against virtual memory, but it has not been, so my guess is that the malloc subsystem caches it for future requests.
How would you measure that? The approach in Gregg's article is... creative.... but not something that can be generalized across platforms, possibly not even architectures.
I agree that it is complicated, and might need OS support to achieve that; and hence I realistically opted for the first one.
Oh, you mean arena allocations. Libuv has no insight into libc's bookkeeping though.
You could write an add-on that calls glibc's mtrace() but at the risk of stating the obvious: that's specific to glibc.
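As a rough illustration of what glibc itself reports about arena usage: mallinfo() (or mallinfo2() on glibc 2.33 and later) exposes the number of bytes in use and the number held in free chunks, and malloc_trim(0) asks the allocator to hand free memory back to the OS. This is a sketch that assumes glibc; the helper names are hypothetical, and the mallinfo man page notes that the figures may cover only the main arena.

```c
#include <assert.h>
#include <malloc.h>
#include <stdlib.h>

/* glibc-specific. mallinfo2() was added in glibc 2.33; older releases
 * only have mallinfo(), whose fields are int and can wrap on large heaps. */
#if __GLIBC_PREREQ(2, 33)
#define ARENA_INFO() mallinfo2()
typedef struct mallinfo2 arena_info_t;
#else
#define ARENA_INFO() mallinfo()
typedef struct mallinfo arena_info_t;
#endif

/* Fails loudly if an int field has wrapped negative instead of
 * silently reporting a garbage value. */
static size_t checked(long long v) {
  assert(v >= 0);
  return (size_t)v;
}

/* Bytes currently handed out to the program (arena chunks + mmap'd blocks). */
size_t malloc_in_use_bytes(void) {
  arena_info_t mi = ARENA_INFO();
  return checked(mi.uordblks) + checked(mi.hblkhd);
}

/* Bytes obtained from the OS that now sit in free chunks inside the arena. */
size_t malloc_retained_bytes(void) {
  arena_info_t mi = ARENA_INFO();
  return checked(mi.fordblks);
}

/* Asks glibc to return free memory to the OS.
 * Returns 1 if some memory was released, 0 otherwise. */
int malloc_release(void) {
  return malloc_trim(0);
}
```

In the test program from the first comment, printing malloc_retained_bytes() after the free loop would be one way to check whether the memory that remains mapped is in fact sitting in free chunks.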
@bnoordhuis - ok, but Libuv has malloc wrappers that can potentially keep track of active allocations?
https://github.com/libuv/libuv/blob/18ff13c71e5e68066729bc90664f9550bfe5ead4/src/uv-common.c#L75-L79
I am not saying the entire app's memory demand funnels through Libuv. My take is that, given Node.js is a language runtime that abstracts many implementation details of the underlying system, one or more (or all) of libuv / v8 / node should be able to measure the real memory allocation either proactively (through malloc wrapping) or reactively (through reflecting on the process's memory, with help from the underlying platform if need be).
In the absence of that, the closest we get is rss and virtual memory. As illustrated in the Linux example, those are too far from the real memory usage (96 MB instead of 12 MB) to serve any meaningful accounting purpose?
Libuv has malloc wrappers that can potentially keep track of active allocations?
It does, but those only track what libuv itself allocates, which isn't much compared to Node or V8.
An LD_PRELOAD wrapper will let you track all malloc/realloc/posix_memalign/free calls, however.
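To make the wrapper idea concrete, here is a minimal sketch of counting allocator wrappers. It assumes glibc's malloc_usable_size() to learn the real size of each block, and the counted_* names are invented for this example. It only sees allocations that go through these functions, which is the limitation noted above for libuv's own wrappers.

```c
#include <assert.h>
#include <malloc.h>    /* malloc_usable_size(): glibc extension */
#include <stdatomic.h>
#include <stdlib.h>

/* Total usable bytes of all blocks currently live through the wrappers. */
static atomic_size_t live_bytes;

size_t counted_live_bytes(void) {
  return atomic_load(&live_bytes);
}

void *counted_malloc(size_t size) {
  void *p = malloc(size);
  if (p != NULL) atomic_fetch_add(&live_bytes, malloc_usable_size(p));
  return p;
}

void *counted_calloc(size_t count, size_t size) {
  void *p = calloc(count, size);
  if (p != NULL) atomic_fetch_add(&live_bytes, malloc_usable_size(p));
  return p;
}

void *counted_realloc(void *ptr, size_t size) {
  size_t old = (ptr != NULL) ? malloc_usable_size(ptr) : 0;
  void *p = realloc(ptr, size);
  if (p == NULL && size != 0) return NULL; /* failure: old block is untouched */
  atomic_fetch_sub(&live_bytes, old);
  if (p != NULL) atomic_fetch_add(&live_bytes, malloc_usable_size(p));
  return p;
}

void counted_free(void *ptr) {
  if (ptr == NULL) return;
  size_t n = malloc_usable_size(ptr);
  assert(counted_live_bytes() >= n); /* catches frees of untracked blocks */
  atomic_fetch_sub(&live_bytes, n);
  free(ptr);
}
```

The four functions have the same shapes as the callbacks accepted by libuv's uv_replace_allocator(), so they could in principle be plugged in there; that wiring is left out here and would still account only for libuv's allocations.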
@bnoordhuis - unfortunately, the LD_PRELOAD approach does not help with the immediate issue (#23277) or the wider need (accurate metering in Cloud deployments).
An in-built, callable API will be the best way, and for the long term.
or the wider need (accurate metering in Cloud deployments)
I'm not sure we can influence metering in cloud deployments. Even if we find a way to provide working set size instead of resident set size, cloud providers would probably still consider rss for metering since the memory allocated in rss can't be allocated for other processes. While I agree it would be awesome to have a better way to visualize actual memory usage, we should also be aware that introducing yet another metric can lead to more confusion.
An in-built, callable API will be the best way, and for the long term.
There's no guarantee native modules will use it, which can also lead to further confusion (as wss will be imprecise). If a native module depends on a third-party library (not uncommon), the library will definitely not use this callable API.
Have you looked into how other languages and runtimes deal with this? It's hard to believe we're the first runtime trying to incorporate a more precise memory reporting solution.
@mmarchini: thanks for the views!
I'm not sure we can influence metering in cloud deployments. Even if we find a way to provide working set size instead of resident set size, cloud providers would probably still consider rss for metering since the memory allocated in rss can't be allocated for other processes.
There's no guarantee native modules will use it
Sorry for the confusion. I meant an API to get accurate memory usage, not an allocation wrapper. Agreed that we can't force third-party native code to invoke the wrappers, so we might need to override the allocators.
openj9 shows the memory layout of the program at this level of granularity:
NULL ------------------------------------------------------------------------
0SECTION NATIVEMEMINFO subcomponent dump routine
NULL =================================
0MEMUSER
1MEMUSER JRE: 1,063,311,424 bytes / 5393 allocations
1MEMUSER |
2MEMUSER +--VM: 788,956,312 bytes / 4855 allocations
2MEMUSER | |
3MEMUSER | +--Classes: 3,509,464 bytes / 124 allocations
2MEMUSER | |
3MEMUSER | +--Memory Manager (GC): 551,046,224 bytes / 2346 allocations
3MEMUSER | | |
4MEMUSER | | +--Java Heap: 536,932,352 bytes / 1 allocation
3MEMUSER | | |
4MEMUSER | | +--Other: 14,113,872 bytes / 2345 allocations
2MEMUSER | |
3MEMUSER | +--Threads: 25,027,616 bytes / 681 allocations
3MEMUSER | | |
4MEMUSER | | +--Java Stack: 902,280 bytes / 73 allocations
3MEMUSER | | |
4MEMUSER | | +--Native Stack: 23,199,744 bytes / 74 allocations
3MEMUSER | | |
4MEMUSER | | +--Other: 925,592 bytes / 534 allocations
2MEMUSER | |
3MEMUSER | +--Trace: 687,920 bytes / 455 allocations
2MEMUSER | |
3MEMUSER | +--JVMTI: 17,784 bytes / 13 allocations
2MEMUSER | |
3MEMUSER | +--JNI: 31,800 bytes / 49 allocations
2MEMUSER | |
3MEMUSER | +--Port Library: 207,028,232 bytes / 67 allocations
3MEMUSER | | |
4MEMUSER | | +--Unused <32bit allocation regions: 207,018,312 bytes / 1 allocation
3MEMUSER | | |
4MEMUSER | | +--Other: 9,920 bytes / 66 allocations
2MEMUSER | |
3MEMUSER | +--Other: 1,607,272 bytes / 1120 allocations
1MEMUSER |
2MEMUSER +--JIT: 273,076,160 bytes / 184 allocations
2MEMUSER | |
3MEMUSER | +--JIT Code Cache: 268,435,456 bytes / 1 allocation
2MEMUSER | |
3MEMUSER | +--JIT Data Cache: 2,097,216 bytes / 1 allocation
2MEMUSER | |
3MEMUSER | +--Other: 2,543,488 bytes / 182 allocations
1MEMUSER |
2MEMUSER +--Class Libraries: 1,278,952 bytes / 354 allocations
2MEMUSER | |
3MEMUSER | +--Harmony Class Libraries: 2,000 bytes / 1 allocation
2MEMUSER | |
3MEMUSER | +--VM Class Libraries: 1,276,952 bytes / 353 allocations
3MEMUSER | | |
4MEMUSER | | +--sun.misc.Unsafe: 528 bytes / 6 allocations
4MEMUSER | | | |
5MEMUSER | | | +--Direct Byte Buffers: 528 bytes / 6 allocations
3MEMUSER | | |
4MEMUSER | | +--Other: 1,276,424 bytes / 347 allocations
NULL
NULL ------------------------------------------------------------------------
and I guess it wraps malloc to achieve that:
(gdb) where
#0 0x00007ffff745bad0 in malloc () from /lib64/libc.so.6
#1 0x00007ffff79b4fc5 in JLI_MemAlloc ()
from /opt/ibm/java-x86_64-80/jre/bin/../lib/amd64/jli/libjli.so
#2 0x00007ffff79b26ed in JLI_Launch () from /opt/ibm/java-x86_64-80/jre/bin/../lib/amd64/jli/libjli.so
#3 0x000000000040065a in main ()
@gireeshpunathil Node.js could provide allocation wrappers to libuv, but V8 (which most likely performs more native-heap allocations than libuv in a typical Node.js scenario) and Node.js itself use operator new a lot to allocate memory. I think that currently means that heap snapshots are the only way to track memory usage in a very detailed way.
@addaleax in the report above I think only this section
3MEMUSER | +--Memory Manager (GC): 551,046,224 bytes / 2346 allocations
3MEMUSER | | |
4MEMUSER | | +--Java Heap: 536,932,352 bytes / 1 allocation
3MEMUSER | | |
4MEMUSER | | +--Other: 14,113,872 bytes / 2345 allocations
is memory that would be reported in a heap snapshot. The rest is off-heap memory. Having said that it might not change your point that a lot of the memory is being allocated by V8 versus Node.js. Information on the Node.js allocations might be interesting anyway but would only be a subset of the overall.