We have been investigating why .NET Core applications are killed by OpenShift because they exceed their assigned memory.
OpenShift/Kubernetes informs the app of its memory limit via the sysfs `limit_in_bytes` file, and .NET Core detects that limit. Memory is then monitored by the OOM killer based on the sysfs `usage_in_bytes` file, whereas .NET Core monitors its own usage via `/proc/self/statm`. The two disagree: `usage_in_bytes` includes RSS and CACHE, while `statm` reports only RSS (see the sketch below).
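To make the gap concrete, here is a minimal standalone sketch (not the coreclr code; it assumes cgroup v1 with the memory controller mounted at `/sys/fs/cgroup/memory`) that prints the three values side by side:

```cpp
// Illustration only: compare what the OOM killer accounts for (usage_in_bytes,
// i.e. RSS + page cache) with what /proc/self/statm reports (RSS only).
#include <cstdio>
#include <unistd.h>

// Read a single integer from a file; return -1 on failure.
static long long ReadValue(const char* path)
{
    long long value = -1;
    FILE* f = std::fopen(path, "r");
    if (f != nullptr)
    {
        if (std::fscanf(f, "%lld", &value) != 1)
            value = -1;
        std::fclose(f);
    }
    return value;
}

int main()
{
    // The limit the orchestrator configured for this container's cgroup.
    long long limit = ReadValue("/sys/fs/cgroup/memory/memory.limit_in_bytes");
    // What the OOM killer compares against that limit: RSS plus page cache.
    long long usage = ReadValue("/sys/fs/cgroup/memory/memory.usage_in_bytes");

    // What /proc/self/statm gives: the second field is resident pages (RSS only).
    long long rssPages = -1;
    FILE* statm = std::fopen("/proc/self/statm", "r");
    if (statm != nullptr)
    {
        if (std::fscanf(statm, "%*lld %lld", &rssPages) != 1)
            rssPages = -1;
        std::fclose(statm);
    }

    std::printf("limit_in_bytes:               %lld\n", limit);
    std::printf("usage_in_bytes (RSS + cache): %lld\n", usage);
    std::printf("statm RSS (this process):     %lld\n",
                rssPages >= 0 ? rssPages * sysconf(_SC_PAGESIZE) : -1LL);
    return 0;
}
```

In a container doing heavy file I/O, `usage_in_bytes` can sit well above the `statm` RSS even though the process itself has not allocated more.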
So memory sitting in the page cache can get the application OOM killed, but .NET Core does not take it into account when deciding to do a GC. We should change the implementation so it is also aware of `usage_in_bytes` when measuring the memory load of the system; a rough sketch of that direction follows.
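Purely as an illustration (not the actual patch; `GetCGroupMemoryLoad` is a hypothetical helper and the cgroup v1 paths are hard-coded here, whereas the runtime would resolve the real cgroup mount point), the load could be derived from `usage_in_bytes` relative to `limit_in_bytes`:

```cpp
// Hypothetical sketch: report memory load the way the OOM killer sees it,
// i.e. cgroup usage (RSS + cache) as a percentage of the cgroup limit.
#include <cstdint>
#include <cstdio>

static bool ReadUInt64(const char* path, uint64_t* value)
{
    unsigned long long v = 0;
    FILE* f = std::fopen(path, "r");
    if (f == nullptr)
        return false;
    bool ok = std::fscanf(f, "%llu", &v) == 1;
    std::fclose(f);
    *value = v;
    return ok;
}

// Returns false when the cgroup files are not readable. Note that a cgroup
// without a configured limit reports a huge limit_in_bytes, so the computed
// load stays near zero in that case.
static bool GetCGroupMemoryLoad(uint32_t* load)
{
    uint64_t limit = 0, usage = 0;
    if (!ReadUInt64("/sys/fs/cgroup/memory/memory.limit_in_bytes", &limit) ||
        !ReadUInt64("/sys/fs/cgroup/memory/memory.usage_in_bytes", &usage) ||
        limit == 0)
    {
        return false;
    }
    if (usage > limit)
        usage = limit;
    *load = (uint32_t)(usage * 100 / limit);
    return true;
}

int main()
{
    uint32_t load = 0;
    if (GetCGroupMemoryLoad(&load))
        std::printf("cgroup memory load: %u%%\n", load);
    else
        std::printf("cgroup memory accounting not available\n");
    return 0;
}
```

A load computed this way tracks the same number the kernel compares against the limit, so the GC would feel pressure from cached pages as well instead of only from its own RSS.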
CC @janvorli
Related to https://github.com/dotnet/coreclr/issues/18971.
cc @MichaelSimons @richlander
@tmds thank you for the investigation! I just got back from my vacation and I'll look into fixing it soon.
@janvorli I am also back from vacation :) If you want, you can assign the issue to me.
@tmds thank you, I'm gladly accepting your offer :-)
@janvorli can we backport this to 2.1? Should I do a PR targeting the dotnet:release/2.1 branch?
Yes please @tmds! And if I may be so bold, I'm looking for assistance here: https://stackoverflow.com/questions/51983312/net-core-on-linux-lldb-sos-plugin-diagnosing-memory-issue. Would this be a worthy issue on dotnet/coreclr just yet?
PR to backport to 2.1: https://github.com/dotnet/coreclr/pull/19650
@kierenj, yes, you can create an issue for that in the coreclr repo.
Excellent, this will be great for me in 2.1. On that issue: I was in fact using 2.0, and memory usage is way, way down on 2.1, so no need there. Thank you!
I tested the recent PR for 2.1 (#19650) in our application and saw a significant reduction in memory use. The charts here are from Amazon ECS and are relative to the soft memory limit of 384MB (which is why they can show more than 100%). The hard memory limit for the cgroup is 1024MB.
The background memory use has remained stable at around 300MB for the last 12h or so, compared to the unpatched application, which uses around 420MB.
The difference is more pronounced under load: in production we are regularly bouncing close to the 2048MB cgroup limit at the moment (we do significant logging and other I/O, so roughly half our prod memory use is page cache).
For ages we thought we had a memory leak, but after scratching our heads for some time trying to find one, I finally found this ticket, which seems to fix our issue.
Thanks very much for your work! 👍 👍 👍