A memory leak surfaced after upgrading our clients from .NET Core 2.2 to 3.1. Using the example below, I've narrowed the regression (hopefully the only one) down to somewhere between 2.2 and 3.0:
C#
for (int i = 0; i < int.MaxValue; i++)
{
    using (var process = Process.Start("ls"))
        Console.WriteLine(i);
}
The above is memory-stable when targeting 2.2 but leaks on 3.0 and 3.1. Although the example looks silly, I consider this serious because it breaks any long-running process that cycles child processes.
When fixing this, resources such as process pipes should also be considered (see https://github.com/dotnet/runtime/issues/24476).
Tested against 3.1.202 on macOS 10.15.3 (19D76) and linux-arm Raspbian GNU/Linux 8 (jessie)
cc: @tmds
@myrup your example doesn't call WaitForExit. .NET Core needs to track some things until the process exits, which can happen after Process.Dispose. If your example starts processes faster than they terminate, its memory usage will increase over time.
@tmds If that were the explanation, it would mean the processes were consistently exiting slower (EDIT: or launching faster) on .NET Core 3.0 while making the cut on 2.2. To rule that out I've inserted a WaitForExit() call, and the leak persists. The code is now:
C#
for (int i = 0; i < int.MaxValue; i++)
{
    using (var process = Process.Start("ls"))
        process.WaitForExit();
    Console.WriteLine(i);
}
I slightly changed the code:
for (int i = 0; i < int.MaxValue; i++)
{
    using (var process = Process.Start("ls"))
        process.WaitForExit();
    Console.WriteLine(i);
    if ((i % 100) == 0)
    {
        GC.Collect();
    }
}
When I run this app and monitor it using dotnet-counters:
GC Heap Size (MB) is low and stays low.
Working Set (MB) is incrementing monotonically.
I think the first means there are no managed memory leaks.
I set export COMPlus_GCHeapHardLimit=10000000 (10MB). The behavior is unchanged.
Maybe the increasing working set is indicative of a native memory leak?
@janvorli does this make sense? What can I do next? Maybe mtrace?
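The same comparison of managed heap size against working set can also be sampled from inside the test program, without dotnet-counters. This is only a minimal sketch: the MemorySampler class, the 1000-iteration loop, and the sampling interval are made up for illustration and are not part of the original repro.

```csharp
using System;
using System.Diagnostics;

static class MemorySampler
{
    // Returns the managed heap size and the process working set, both in bytes.
    public static (long ManagedBytes, long WorkingSetBytes) Sample()
    {
        // Force a full collection first so that garbage does not inflate the managed number.
        long managed = GC.GetTotalMemory(forceFullCollection: true);
        using (var self = Process.GetCurrentProcess())
        {
            return (managed, self.WorkingSet64);
        }
    }

    static void Main()
    {
        for (int i = 1; i <= 1000; i++)
        {
            using (var process = Process.Start("ls"))
                process.WaitForExit();

            if (i % 100 == 0)
            {
                var (managed, workingSet) = Sample();
                // Written to stderr so the numbers are not interleaved with the output of ls.
                Console.Error.WriteLine(
                    $"{i}: managed={managed / 1024} KiB, workingSet={workingSet / 1024} KiB");
            }
        }
    }
}
```

If the managed number stays flat while the working set keeps climbing, that points away from a managed leak, which matches the counters above.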
@tmds you could try to use mtrace or Valgrind to see if it detects native leaks from the native heap. Please note that you'll see false positives as we don't do any cleanup at exit. However, mtrace doesn't track mmap allocated memory and I think that neither does Valgrind.
One more idea - let the process run until it accumulates a substantially large working set and then dump the /proc/PID/maps. That might help us to figure out where the memory comes from.
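Taking such a dump can be scripted so that two snapshots are easy to diff later. A minimal sketch, Linux only; the MapsSnapshot class and the output file naming are hypothetical choices, not something from the runtime or its tooling.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class MapsSnapshot
{
    // Linux only: returns the current memory map of the given process as text.
    public static string Read(int pid) => File.ReadAllText($"/proc/{pid}/maps");

    static void Main(string[] args)
    {
        // Use the pid given on the command line, or fall back to this process.
        int pid = args.Length > 0 ? int.Parse(args[0]) : Process.GetCurrentProcess().Id;

        // One file per snapshot, named after the pid and the UTC time it was taken.
        string path = $"maps-{pid}-{DateTime.UtcNow:yyyyMMddHHmmss}.txt";
        File.WriteAllText(path, Read(pid));
        Console.WriteLine(path);
    }
}
```

Running it once shortly after startup and again after the working set has grown gives two files that can be compared with `diff -u`.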
I tried debugging this.
Using mtrace didn't make me much wiser.
I used this:
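using System.Diagnostics;
using System.Runtime.InteropServices;

class Program
{
    static void Main(string[] args)
    {
        // The first run is untraced; only the second run is recorded by mtrace.
        Run(trace: false);
        Run(trace: true);
    }

    static void Run(bool trace)
    {
        if (trace)
        {
            mtrace();
        }
        using (var process = Process.Start("ls"))
            process.WaitForExit();
        if (trace)
        {
            muntrace();
        }
    }

    // glibc malloc tracing; the log is written to the file named by $MALLOC_TRACE.
    [DllImport("libc")]
    static extern void mtrace();

    [DllImport("libc")]
    static extern void muntrace();
}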
and resulting mtrace output was:
$ ./bin/Debug/netcoreapp3.1/console
bin console.csproj obj Program.cs
bin console.csproj obj Program.cs
$ mtrace ./bin/Debug/netcoreapp3.1/console $MALLOC_TRACE
- 0x00007f2154034c40 Free 3 was never alloc'd 0x7f21fe3de36e
- 0x00007f2154033080 Free 8 was never alloc'd 0x7f21fe3de36e
- 0x000055fc9d268b60 Free 11 was never alloc'd 0x7f21fe3de36e
- 0x00007f2154008d30 Free 14 was never alloc'd 0x7f21fe3de36e
- 0x000055fc9d28b780 Free 25 was never alloc'd 0x7f21fe3de36e
- 0x000055fc9d362220 Free 27 was never alloc'd 0x7f21fe3de36e
- 0x000055fc9d35e440 Free 154 was never alloc'd 0x7f21fe3de36e
Memory not freed:
-----------------
Address Size Caller
0x000055fc9d181df0 0x288 at 0x7f21fe3de307
0x000055fc9d353920 0x1d0 at 0x7f21fe3e3d60
0x000055fc9d357c90 0x20 at 0x7f21fe3de307
0x000055fc9d35b690 0x58 at 0x7f21fe3de307
0x000055fc9d35ba80 0x4f8 at 0x7f21fe3de307
0x000055fc9d35c130 0x58 at 0x7f21fe3f8c2a
0x000055fc9d35c190 0x428 at 0x7f21fe3de307
0x000055fc9d35c690 0x38 at 0x7f21fe3de307
0x000055fc9d35d620 0x38 at 0x7f21fe3de307
0x000055fc9d35d660 0x38 at 0x7f21fe3de307
0x000055fc9d35d6a0 0x428 at 0x7f21fe3de307
0x000055fc9d35dad0 0x38 at 0x7f21fe3de307
0x000055fc9d35e070 0xb8 at 0x7f21fe3de307
0x000055fc9d3623d0 0x20 at 0x7f21fe3de307
Unfortunately no symbol names showed up here. Using lldb/gdb I was also not able to figure out the symbol names.
I realized that it would not be so useful because the symbol names would most likely be things like PAL_malloc which doesn't help find the location.
valgrind doesn't work for me with 3.1. I've reported an issue in bugzilla: https://bugs.kde.org/show_bug.cgi?id=422174.
@tmds If it's at all manageable in size, perhaps it's possible to diff the changes between 2.2 and 3.0 in the code behind launching and cleaning up processes, to look for any obvious culprits.
@myrup on the native side these are the changes: https://github.com/dotnet/corefx/commits/release/3.1/src/Native/Unix/System.Native/pal_process.c.
Big disclaimer: I have not had a look at any of the code behind, and never used dotnet counters before, but I think GC Heap Size (MB) staying low is not a guarantee that there are no leaks caused by managed code calling native code in an unintended way. In other words, the regression can also have been introduced on the managed side. But perhaps someone with more insight can weigh in.
> I think GC Heap Size (MB) staying low is not a guarantee that there are no leaks caused by managed code calling native code in an unintended way. In other words, the regression can also have been introduced on the managed side. But perhaps someone with more insight can weigh in.
Yes, this is true.
GC Heap Size (MB) staying low means the leak is not managed memory.
@janvorli this is the change to maps over 7 minutes.
@@ -3,11 +3,12 @@
55f206cba000-55f206cbf000 r--p 00011000 fd:00 20333525 /home/tmds/console/bin/Release/netcoreapp3.1/console
55f206cc0000-55f206cc1000 r--p 00016000 fd:00 20333525 /home/tmds/console/bin/Release/netcoreapp3.1/console
55f206cc1000-55f206cc2000 rw-p 00017000 fd:00 20333525 /home/tmds/console/bin/Release/netcoreapp3.1/console
-55f2077c6000-55f208bd9000 rw-p 00000000 00:00 0 [heap]
+55f2077c6000-55f212d7f000 rw-p 00000000 00:00 0 [heap]
7f7f30000000-7f7f3004a000 rw-p 00000000 00:00 0
7f7f3004a000-7f7f34000000 ---p 00000000 00:00 0
7f7f38000000-7f7f38021000 rw-p 00000000 00:00 0
7f7f38021000-7f7f3c000000 ---p 00000000 00:00 0
+7f7f3d354000-7f7f3d859000 rw-p 00000000 00:00 0
7f7f3dd4a000-7f7f3dd4b000 r--p 00000000 fd:00 17998356 /usr/lib64/libicudata.so.65.1
7f7f3dd4b000-7f7f3dd4c000 r-xp 00001000 fd:00 17998356 /usr/lib64/libicudata.so.65.1
7f7f3dd4c000-7f7f3f7fb000 r--p 00002000 fd:00 17998356 /usr/lib64/libicudata.so.65.1
@@ -181,7 +182,6 @@
7f7f65d32000-7f7f65d50000 ---p 00000000 00:00 0
7f7f65d50000-7f7f65d62000 rw-p 00000000 00:00 0
7f7f65d62000-7f7fce8ed000 ---p 00000000 00:00 0
-7f7fce9d4000-7f7fcea59000 rw-p 00000000 00:00 0
7f7fcea59000-7f7fceb45000 r--p 00000000 fd:00 17998358 /usr/lib64/libicui18n.so.65.1
7f7fceb45000-7f7fcecc5000 r-xp 000ec000 fd:00 17998358 /usr/lib64/libicui18n.so.65.1
7f7fcecc5000-7f7fced4d000 r--p 0026c000 fd:00 17998358 /usr/lib64/libicui18n.so.65.1
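The growth of the [heap] mapping in this diff can be worked out from the two address ranges. A small sketch of that arithmetic; the MapsMath class and RangeSize helper are hypothetical names, and the two input strings are the [heap] lines from the diff with the leading -/+ removed.

```csharp
using System;
using System.Globalization;

static class MapsMath
{
    // Size in bytes of one /proc/PID/maps entry, e.g.
    // "55f2077c6000-55f208bd9000 rw-p 00000000 00:00 0 [heap]".
    public static long RangeSize(string mapsLine)
    {
        string range = mapsLine.Split(' ')[0];   // "start-end", both in hex
        string[] bounds = range.Split('-');
        long start = long.Parse(bounds[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        long end = long.Parse(bounds[1], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return end - start;
    }

    static void Main()
    {
        long before = RangeSize("55f2077c6000-55f208bd9000 rw-p 00000000 00:00 0 [heap]");
        long after = RangeSize("55f2077c6000-55f212d7f000 rw-p 00000000 00:00 0 [heap]");
        long growth = after - before;
        Console.WriteLine($"heap grew by {growth} bytes ({growth / (1024.0 * 1024.0):F1} MiB)");
    }
}
```

For these two lines the growth is 169,500,672 bytes, roughly 160 MiB over the 7 minutes, and it dwarfs the other two changed mappings in the diff.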
OK, so that means the memory growth of about 150MB came from the native heap (malloc).
It is possible, though, that this is not a real leak but rather native heap fragmentation. Say you malloc many blocks that add up to 150MB, then malloc another 100 bytes, and then free all of the 150MB blocks. If those 100 bytes sit at the highest address in the heap, the heap cannot shrink after the free, because the heap is one contiguous block of memory (IIRC).
The items listed as "Memory not freed" by mtrace could be those tiny bits preventing the heap from shrinking.
The document on how GLIBC malloc works supports this theory:
https://sourceware.org/glibc/wiki/MallocInternals
Note that, in general, "freeing" memory does not actually return it to the operating system for other applications to use. The free() call marks a chunk of memory as "free to be reused" by the application, but from the operating system's point of view, the memory still "belongs" to the application. However, if the top chunk in a heap - the portion adjacent to unmapped memory - becomes large enough, some of that memory may be unmapped and returned to the operating system
@janvorli I printed backtraces for the Memory not freed: reported by mtrace using backtrace_symbols_fd:
0x0000561109c11180 0x58 at 0x7fc518c3ace1
libcoreclr.so(+0x491d59)[0x7fc518c3ad59]
libcoreclr.so(+0x4ca059)[0x7fc518c73059]
libcoreclr.so(+0x4c75c6)[0x7fc518c705c6]
libcoreclr.so(+0x4b39a9)[0x7fc518c5c9a9]
libcoreclr.so(+0x4b3fc2)[0x7fc518c5cfc2]
libcoreclr.so(+0x4c1aa4)[0x7fc518c6aaa4]
libcoreclr.so(CreateEventExW+0x66)[0x7fc518c6ac76]
[0x7fc49f06f630]
[0x7fc49f175186]
[0x7fc49f176b65]
[0x7fc49f643b40]
[0x7fc49f641dad]
[0x7fc49f63a0ba]
[0x7fc49f63430a]
[0x7fc49f633767]
[0x7fc49f63364c]
[0x7fc49f63335b]
[0x7fc49f63263e]
libcoreclr.so(CallDescrWorkerInternal+0x7b)[0x7fc5189eabd6]
libcoreclr.so(+0x175cc8)[0x7fc51891ecc8]
libcoreclr.so(+0x257289)[0x7fc518a00289]
libcoreclr.so(+0x2575d8)[0x7fc518a005d8]
libcoreclr.so(+0xcf862)[0x7fc518878862]
libcoreclr.so(coreclr_execute_assembly+0xc4)[0x7fc518864f94]
.corerun(+0x38c5)[0x5611095198c5]
.corerun(+0x269d)[0x56110951869d]
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7fc51914a041]
.corerun(+0x22cd)[0x5611095182cd]
* 0x0000561109c59280 0xf8 at 0x7fc518c3ace1
-> freed
0x0000561109c5e910 0x58 at 0x7fc518c3ace1
libcoreclr.so(+0x491d59)[0x7fc518c3ad59]
libcoreclr.so(_Znam+0x8)[0x7fc5188650e8]
libcoreclr.so(+0x162c16)[0x7fc51890bc16]
libcoreclr.so(+0x15f126)[0x7fc518908126]
libcoreclr.so(+0x160fa1)[0x7fc518909fa1]
libcoreclr.so(+0x160298)[0x7fc518909298]
libcoreclr.so(ResolveWorkerAsmStub+0x70)[0x7fc5189eb8e4]
[0x7fc49f63a7ee]
[0x7fc49f639c3e]
[0x7fc49f63430a]
[0x7fc49f633767]
[0x7fc49f63364c]
[0x7fc49f63335b]
[0x7fc49f63263e]
libcoreclr.so(CallDescrWorkerInternal+0x7b)[0x7fc5189eabd6]
libcoreclr.so(+0x175cc8)[0x7fc51891ecc8]
libcoreclr.so(+0x257289)[0x7fc518a00289]
libcoreclr.so(+0x2575d8)[0x7fc518a005d8]
libcoreclr.so(+0xcf862)[0x7fc518878862]
libcoreclr.so(coreclr_execute_assembly+0xc4)[0x7fc518864f94]
.corerun(+0x38c5)[0x5611095198c5]
.corerun(+0x269d)[0x56110951869d]
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7fc51914a041]
.corerun(+0x22cd)[0x5611095182cd]
0x0000561109c5fa90 0x1c0 at 0x7fc518c3ace1
libcoreclr.so(+0x491d59)[0x7fc518c3ad59]
libcoreclr.so(+0x4b3ef7)[0x7fc518c5cef7]
libcoreclr.so(+0x4c1aa4)[0x7fc518c6aaa4]
libcoreclr.so(CreateEventExW+0x66)[0x7fc518c6ac76]
[0x7fc49f06f630]
[0x7fc49f175186]
[0x7fc49f176b65]
[0x7fc49f643b40]
[0x7fc49f641dad]
[0x7fc49f63a0ba]
[0x7fc49f63430a]
[0x7fc49f633767]
[0x7fc49f63364c]
[0x7fc49f63335b]
[0x7fc49f63263e]
libcoreclr.so(CallDescrWorkerInternal+0x7b)[0x7fc5189eabd6]
libcoreclr.so(+0x175cc8)[0x7fc51891ecc8]
libcoreclr.so(+0x257289)[0x7fc518a00289]
libcoreclr.so(+0x2575d8)[0x7fc518a005d8]
libcoreclr.so(+0xcf862)[0x7fc518878862]
libcoreclr.so(coreclr_execute_assembly+0xc4)[0x7fc518864f94]
.corerun(+0x38c5)[0x5611095198c5]
.corerun(+0x269d)[0x56110951869d]
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7fc51914a041]
.corerun(+0x22cd)[0x5611095182cd]
0x0000561109c5fd00 0xb8 at 0x7fc518c3ace1
libcoreclr.so(+0x491d59)[0x7fc518c3ad59]
libcoreclr.so(_Znam+0x8)[0x7fc5188650e8]
libcoreclr.so(+0x11a4bb)[0x7fc5188c34bb]
libcoreclr.so(+0x11a7f5)[0x7fc5188c37f5]
libcoreclr.so(+0x3083b5)[0x7fc518ab13b5]
libcoreclr.so(+0x309348)[0x7fc518ab2348]
libcoreclr.so(+0x3196a2)[0x7fc518ac26a2]
libcoreclr.so(+0x27fff4)[0x7fc518a28ff4]
libcoreclr.so(+0x27fad7)[0x7fc518a28ad7]
libcoreclr.so(+0x280da7)[0x7fc518a29da7]
libcoreclr.so(+0x27cc4b)[0x7fc518a25c4b]
libcoreclr.so(+0x27debd)[0x7fc518a26ebd]
libcoreclr.so(+0x27edfe)[0x7fc518a27dfe]
libcoreclr.so(+0x12d132)[0x7fc5188d6132]
libcoreclr.so(+0x16b462)[0x7fc518914462]
libcoreclr.so(+0x125455)[0x7fc5188ce455]
libcoreclr.so(+0x12688a)[0x7fc5188cf88a]
libcoreclr.so(DelayLoad_Helper+0x76)[0x7fc5189eae5a]
[0x7fc49f0bf6f8]
libcoreclr.so(CallDescrWorkerInternal+0x7b)[0x7fc5189eabd6]
libcoreclr.so(+0x175cc8)[0x7fc51891ecc8]
libcoreclr.so(+0x24cb00)[0x7fc5189f5b00]
libcoreclr.so(+0x1a0bda)[0x7fc518949bda]
libcoreclr.so(+0x4d159e)[0x7fc518c7a59e]
/lib64/libpthread.so.0(+0x9431)[0x7fc519647431]
/lib64/libc.so.6(clone+0x42)[0x7fc5192249d2]
* 0x0000561109c600b0 0x4f8 at 0x7fc518c3ace1
-> freed
0x0000561109c64a40 0x38 at 0x7fc518c3ace1
libcoreclr.so(+0x491d59)[0x7fc518c3ad59]
libcoreclr.so(_Znam+0x8)[0x7fc5188650e8]
libcoreclr.so(+0x160db5)[0x7fc518909db5]
libcoreclr.so(+0x160298)[0x7fc518909298]
libcoreclr.so(ResolveWorkerAsmStub+0x70)[0x7fc5189eb8e4]
[0x7fc49f63ad4f]
[0x7fc49f63a6d9]
[0x7fc49f639c3e]
[0x7fc49f63430a]
[0x7fc49f633767]
[0x7fc49f63364c]
[0x7fc49f63335b]
[0x7fc49f63263e]
libcoreclr.so(CallDescrWorkerInternal+0x7b)[0x7fc5189eabd6]
libcoreclr.so(+0x175cc8)[0x7fc51891ecc8]
libcoreclr.so(+0x257289)[0x7fc518a00289]
libcoreclr.so(+0x2575d8)[0x7fc518a005d8]
libcoreclr.so(+0xcf862)[0x7fc518878862]
libcoreclr.so(coreclr_execute_assembly+0xc4)[0x7fc518864f94]
.corerun(+0x38c5)[0x5611095198c5]
.corerun(+0x269d)[0x56110951869d]
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7fc51914a041]
.corerun(+0x22cd)[0x5611095182cd]
0x0000561109c65ac0 0x38 at 0x7fc518c3ace1
libcoreclr.so(+0x491d59)[0x7fc518c3ad59]
libcoreclr.so(_Znam+0x8)[0x7fc5188650e8]
libcoreclr.so(+0x160db5)[0x7fc518909db5]
libcoreclr.so(+0x160298)[0x7fc518909298]
libcoreclr.so(ResolveWorkerAsmStub+0x70)[0x7fc5189eb8e4]
[0x7fc49f63a6f2]
[0x7fc49f639c3e]
[0x7fc49f63430a]
[0x7fc49f633767]
[0x7fc49f63364c]
[0x7fc49f63335b]
[0x7fc49f63263e]
libcoreclr.so(CallDescrWorkerInternal+0x7b)[0x7fc5189eabd6]
libcoreclr.so(+0x175cc8)[0x7fc51891ecc8]
libcoreclr.so(+0x257289)[0x7fc518a00289]
libcoreclr.so(+0x2575d8)[0x7fc518a005d8]
libcoreclr.so(+0xcf862)[0x7fc518878862]
libcoreclr.so(coreclr_execute_assembly+0xc4)[0x7fc518864f94]
.corerun(+0x38c5)[0x5611095198c5]
.corerun(+0x269d)[0x56110951869d]
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7fc51914a041]
.corerun(+0x22cd)[0x5611095182cd]
0x0000561109c666d0 0x38 at 0x7fc518c3ace1
libcoreclr.so(+0x491d59)[0x7fc518c3ad59]
libcoreclr.so(_Znam+0x8)[0x7fc5188650e8]
libcoreclr.so(+0x160db5)[0x7fc518909db5]
libcoreclr.so(+0x160298)[0x7fc518909298]
libcoreclr.so(ResolveWorkerAsmStub+0x70)[0x7fc5189eb8e4]
[0x7fc49f63a72b]
[0x7fc49f639c3e]
[0x7fc49f63430a]
[0x7fc49f633767]
[0x7fc49f63364c]
[0x7fc49f63335b]
[0x7fc49f63263e]
libcoreclr.so(CallDescrWorkerInternal+0x7b)[0x7fc5189eabd6]
libcoreclr.so(+0x175cc8)[0x7fc51891ecc8]
libcoreclr.so(+0x257289)[0x7fc518a00289]
libcoreclr.so(+0x2575d8)[0x7fc518a005d8]
libcoreclr.so(+0xcf862)[0x7fc518878862]
libcoreclr.so(coreclr_execute_assembly+0xc4)[0x7fc518864f94]
.corerun(+0x38c5)[0x5611095198c5]
.corerun(+0x269d)[0x56110951869d]
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7fc51914a041]
.corerun(+0x22cd)[0x5611095182cd]
0x0000561109c66710 0x38 at 0x7fc518c3ace1
libcoreclr.so(+0x491d59)[0x7fc518c3ad59]
libcoreclr.so(_Znam+0x8)[0x7fc5188650e8]
libcoreclr.so(+0x160db5)[0x7fc518909db5]
libcoreclr.so(+0x160298)[0x7fc518909298]
libcoreclr.so(ResolveWorkerAsmStub+0x70)[0x7fc5189eb8e4]
[0x7fc49f2ee840]
[0x7fc49f6417dd]
[0x7fc49f639fba]
[0x7fc49f63430a]
[0x7fc49f633767]
[0x7fc49f63364c]
[0x7fc49f63335b]
[0x7fc49f63263e]
libcoreclr.so(CallDescrWorkerInternal+0x7b)[0x7fc5189eabd6]
libcoreclr.so(+0x175cc8)[0x7fc51891ecc8]
libcoreclr.so(+0x257289)[0x7fc518a00289]
libcoreclr.so(+0x2575d8)[0x7fc518a005d8]
libcoreclr.so(+0xcf862)[0x7fc518878862]
libcoreclr.so(coreclr_execute_assembly+0xc4)[0x7fc518864f94]
.corerun(+0x38c5)[0x5611095198c5]
.corerun(+0x269d)[0x56110951869d]
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7fc51914a041]
.corerun(+0x22cd)[0x5611095182cd]
0x0000561109c66840 0x438 at 0x7fc518c3ace1
libcoreclr.so(+0x491d59)[0x7fc518c3ad59]
libcoreclr.so(_Znwm+0x8)[0x7fc5188650b8]
libcoreclr.so(+0x1ffb92)[0x7fc5189a8b92]
libcoreclr.so(+0x29e91e)[0x7fc518a4791e]
libcoreclr.so(+0x2a71e8)[0x7fc518a501e8]
libcoreclr.so(+0x2a32fd)[0x7fc518a4c2fd]
libcoreclr.so(+0x2a3b4e)[0x7fc518a4cb4e]
libcoreclr.so(+0x2a3ebe)[0x7fc518a4cebe]
libcoreclr.so(+0x2a42a5)[0x7fc518a4d2a5]
libcoreclr.so(+0x123b2f)[0x7fc5188ccb2f]
libcoreclr.so(+0x1233eb)[0x7fc5188cc3eb]
libcoreclr.so(ThePreStub+0x5b)[0x7fc5189eb54b]
[0x7fc49f642741]
[0x7fc49f641d2d]
[0x7fc49f63a0ba]
[0x7fc49f63430a]
[0x7fc49f633767]
[0x7fc49f63364c]
[0x7fc49f63335b]
[0x7fc49f632634]
libcoreclr.so(CallDescrWorkerInternal+0x7b)[0x7fc5189eabd6]
libcoreclr.so(+0x175cc8)[0x7fc51891ecc8]
libcoreclr.so(+0x257289)[0x7fc518a00289]
libcoreclr.so(+0x2575d8)[0x7fc518a005d8]
libcoreclr.so(+0xcf862)[0x7fc518878862]
libcoreclr.so(coreclr_execute_assembly+0xc4)[0x7fc518864f94]
.corerun(+0x38c5)[0x5611095198c5]
.corerun(+0x269d)[0x56110951869d]
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7fc51914a041]
.corerun(+0x22cd)[0x5611095182cd]
I don't know why so many symbol names are unresolved.
These readable names show up:
coreclr_execute_assembly
CreateEventExW (2x)
DelayLoad_Helper (1x)
ResolveWorkerAsmStub (5x)
ThePreStub (1x)
_Znam -> operator new[](unsigned long)
_Znwm -> operator new(unsigned long)
Are some of these 'leaks' expected, like ResolveWorkerAsmStub/ThePreStub/...?
I guess you were using your own build of the runtime so that you could insert the mtrace call, right? Then it could be that mtrace couldn't find the stripped debugging symbols. You can patch the cmake build not to strip the symbols and see if that helps. You'd just delete or comment out the following two lines and do a clean build:
https://github.com/dotnet/runtime/blob/93b6c449d4f31ddd7d573d1d3769e681d5ebceb9/eng/native/functions.cmake#L294-L295
Ah, I've missed the fact that you've called the mtrace via pinvoke from the managed code. So I guess you've used the official builds of .NET. In that case, you'd need to fetch the symbol files. The dotnet-symbol tool can do that. You can find a doc on how to install it here:
https://github.com/dotnet/diagnostics/blob/master/documentation/debugging-coredump.md
While the doc describes how to get symbols from a core dump, it can also fetch symbols for shared libraries. So you can e.g. run dotnet symbol libcoreclr.so.
@stephentoub I think I may have identified the memory leak.
In https://github.com/dotnet/corefx/pull/36199 Process.{Safe}Handle became a waitable event handle on Unix.
In the constructor we set ownsHandle: false, so ReleaseHandle won't get called. I think this leaks the unmanaged memory that is allocated under CreateEventExW.
Does that make sense? I think the fix is to change to ownsHandle: true?
@janvorli fyi, there is a fix for the issue with valgrind: https://bugs.kde.org/show_bug.cgi?id=422174.
> Does that make sense? I think the fix is to change to ownsHandle: true?
Yes. The "handle" here that's owned isn't actually the IntPtr but rather the ref count on it, what ReleaseHandle would undo. So that should be true. Nice job tracking it down.
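The effect of ownsHandle described here can be reproduced with a toy SafeHandle. This is a minimal sketch under stated assumptions: CountingHandle and OwnsHandleDemo are hypothetical types written for illustration, and this is not the actual SafeProcessHandle code.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical handle type for illustration; not the real SafeProcessHandle.
sealed class CountingHandle : SafeHandle
{
    // Counts how often ReleaseHandle has run.
    public static int Released;

    public CountingHandle(bool ownsHandle)
        : base(IntPtr.Zero, ownsHandle)
    {
        // Any value other than the invalid one, so the handle counts as valid.
        SetHandle(new IntPtr(1));
    }

    public override bool IsInvalid => handle == IntPtr.Zero;

    protected override bool ReleaseHandle()
    {
        // A real handle would free its native resource (or drop a ref count) here.
        Released++;
        return true;
    }
}

static class OwnsHandleDemo
{
    // Creates and disposes one handle, then reports how often ReleaseHandle ran.
    public static int ReleasesAfterDispose(bool ownsHandle)
    {
        CountingHandle.Released = 0;
        using (new CountingHandle(ownsHandle)) { }
        return CountingHandle.Released;
    }

    static void Main()
    {
        Console.WriteLine(ReleasesAfterDispose(ownsHandle: false)); // prints 0
        Console.WriteLine(ReleasesAfterDispose(ownsHandle: true));  // prints 1
    }
}
```

With ownsHandle: false, Dispose never reaches ReleaseHandle, so whatever the handle was supposed to release stays allocated.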
@tmds thank you for the notice of the fix! And congratulations on finding the culprit. Have you managed to get symbols for the mtrace stack traces?
@tmds Great! Is the fix included in the next 3.1.x release?
> Is the fix included in the next 3.1.x release?
Not currently. There's a separate process for that, and the bar is quite high for what gets ported back to it. Can you speak to how impactful this actually is for your real workloads?
@stephentoub This is very impactful for our embedded clients. Leaving it out would mean skipping .NET Core 3 for us. The clients cycle a process exactly once per second, which amounts to a leak of ~10-20 MB every 24 hours. I would imagine that any long-running process that spawns child processes, such as an ASP.NET Core server, is also affected.
> Have you managed to get symbols for the mtrace stack traces?
Unfortunately not. I used backtrace_symbols_fd in PAL_malloc and a few symbol names showed up like CreateEventExW, but most weren't resolved. I didn't look into it further.
@myrup it sounds like you're not on 3.1 yet (on 2.2 maybe). What timeframe are you looking at to move to 3.1? If it's not immediately, then we could get this fix out in a 5.0 preview, to get it a bit more validation.
@stephentoub I think preview 6 is still open if someone feels like creating a PR against release/5.0-preview6
(I'm happy to do it if you think it makes sense)
@danmosemsft We rolled back to 2.2 when the memory leak appeared. There's no immediate need for us to upgrade, except to enjoy version alignment across our systems with the latest LTS plus perhaps any 2.2 -> 3.1 speed improvements :). I'd happily verify the leak is sealed with a preview release. EDIT: I forgot that one of the reasons we got excited about upgrading is the ability to trim the executables.
@myrup look for this in 5.0 preview 6 coming out likely later in the month.
@danmosemsft Thanks! I'll test it and report back here 👍
@danmosemsft I can confirm this bug was fixed between 5.0 preview 5 and 6 🙌🏼
@myrup thank you for verifying! Glad we could fix this quickly and do open new issues if you find any.
@danmosemsft Should it be included in 3.1.x?
@myrup ah yes. I created a request: https://github.com/dotnet/corefx/pull/42941
@danmosemsft
A similar issue was reported on the old repo https://github.com/dotnet/core/issues/3989
Maybe this should be referenced? Or does the commit not affect that?
@mdisg I've added links to this issue and the PR that fixes it on https://github.com/dotnet/core/issues/3989.
@tmds thanks. I will now upgrade to the newest 3.1 version and do a test run with our devices.