Nomad: Nomad task got OOM killed when it was using only ~70% of its MemoryMB limit

Created on 11 Jul 2018 · 8 Comments · Source: hashicorp/nomad

This is the same issue as described in https://github.com/hashicorp/nomad/issues/4491, but filed as a bug report.

Nomad version

Nomad v0.8.3

Issue

We have a nomad job that runs an application called claimsearch-service with the exec driver.
The memory limit is set to 50MiB in the nomad job file.
The application got OOM killed when it was only using 35.36MiB RSS.
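
For reference, a minimal sketch of the relevant part of the job file (the task and binary names are taken from this report; the path, datacenter and overall layout are invented for illustration, the full job file is linked below):

```hcl
job "claimsearch-service" {
  datacenters = ["dc1"] # placeholder

  group "app" {
    task "claimsearch-service" {
      driver = "exec"

      config {
        # hypothetical install path
        command = "/usr/local/bin/claimsearch-service"
      }

      resources {
        memory = 50 # MemoryMB limit in MiB, as described above
      }
    }
  }
}
```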

The memory cgroup contained the following processes with the following RSS usage:

| Process              | RSS       |
|----------------------|-----------|
| nomad                | 13.75 MiB |
| claimsearch-service  | 35.36 MiB |
| grpc-health-check    | 4.93 MiB  |

Expected behaviour

  • The task is not OOM killed when it uses less RSS memory than configured in the MemoryMB parameter of the resources stanza in the nomad job file.
  • The configured memory limit applies only to the executed Nomad task.

That the memory consumption of other processes is counted against the memory limit is unintuitive; it is not documented, and it makes it difficult to calculate the correct memory limit for a task.
See also: https://github.com/hashicorp/nomad/issues/4491
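
In other words, with the current behaviour the limit has to be sized roughly like this; a back-of-the-envelope sketch using the RSS figures from the table above (the padded value is only illustrative):

```hcl
resources {
  # MemoryMB has to cover everything in the task's memory cgroup, not just the task itself:
  #   claimsearch-service  ~35 MiB
  # + nomad executor       ~14 MiB
  # + grpc-health-check    ~ 5 MiB
  # ------------------------------
  #   ~54 MiB  -> the configured 50 MiB is not enough, hence the OOM kill
  memory = 64
}
```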

OOM kill Kernel log

Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: Task in /nomad/ebf75298-ba47-98a8-28e5-a08daf20d60e killed as a result of limit of /nomad/ebf75298-ba47-98a8-28e5-a08daf20d60e
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: memory: usage 51200kB, limit 51200kB, failcnt 16402868
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: Memory cgroup stats for /nomad/ebf75298-ba47-98a8-28e5-a08daf20d60e: cache:36KB rss:51164KB rss_huge:16384KB mapped_file:8KB dirty:0KB writeback:0KB inactive_anon:25624KB active_anon:23492KB inactive_file:0KB active_file:0KB unevictable:0KB
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: [31962]     0 31962    82711     3521      44       5       61             0 nomad
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: [31970]    33 31970    13807     9046      31       5        0             0 claimsearch-ser
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: [30195]    33 30195    28465     1262      19       6        0             0 grpc-health-che
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: Memory cgroup out of memory: Kill process 31970 (claimsearch-ser) score 709 or sacrifice child
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: Killed process 31970 (claimsearch-ser) total-vm:55228kB, anon-rss:36184kB, file-rss:0kB
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: grpc-health-che invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
Jul 10 08:56:03 prd-sisu-nomad-client-1.localdomain kernel: grpc-health-che cpuset=ebf75298-ba47-98a8-28e5-a08daf20d60e mems_allowed=0

Job file

The full job file can be found at: http://dpaste.com/05YWFVW

theme/driver/exec type/bug

All 8 comments

@fho thanks for the details. We plan to fix executor memory utilization in the upcoming release; 13MB is rather high.

@preetapan
The amount of memory that the nomad executor consumes is not the issue.
As long as the task is in a cgroup with other processes, the task gets OOM killed when it uses less
memory than its configured memory limit.

Let's assume the memory consumption of the nomad executor were lowered from 13MB to 5MB.
Now I run a task with a low memory footprint of 2MB via nomad and configure its memory limit to 5MB to have some buffer.
The task would still get OOM killed because the cgroup memory limit is reached:
2MB task memory + 5MB nomad executor memory > 5MB memory limit
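
To make that arithmetic concrete, the hypothetical stanza would look like this (all numbers come from the example above, not from a real job):

```hcl
resources {
  # 5 MB is meant to leave ~3 MB of headroom over the task's own 2 MB footprint,
  # but because the nomad executor (assumed to use 5 MB here) lives in the same
  # memory cgroup, the cgroup sees 2 MB + 5 MB = 7 MB > 5 MB and the OOM killer fires.
  memory = 5
}
```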

@fho see the comments my co-worker and I already made about why the executor and script checks have to be in the same cgroup.

https://github.com/hashicorp/nomad/issues/4491#issuecomment-403942294
https://github.com/hashicorp/nomad/issues/4491#issuecomment-403946018

There's always going to be some amount of overhead from using the executor, and we will address that with a TBD mechanism: we will likely either account for the overhead when creating the container, or use soft limits.

@preetapan

The executor is responsible for managing the lifecycle of the application, so it's a desired feature to have it be in the same cgroup.
[..]
Allowing script checks to run outside the task's container and resource limits would be a major security and isolation issue.

I don't understand yet why they have to be in the same cgroup.
It would be great if you could elaborate on it.

  • What would be the disadvantages of other solutions, like having each check and each nomad-executor in its own memory cgroup?
  • What are the advantages of having them in the same memory cgroup?
  • What are the concrete security and isolation issues if each check and each nomad-executor is in its own memory cgroup?

thanks a lot

I'm having a hard time understanding this. I have a container with a 600MB limit and a Java process with heap+non-heap usage of ~360MB, and it's getting OOM killed every 10 minutes or so. It can't be that nomad services are using 240MB? And if not, how can I tell _why_ the process is getting killed?

I would just like to add that with nomad 0.9 the resource footprint of the nomad processes within the cgroup has increased even more.
Most of our lightweight microservices now need double the resources configured in nomad compared to 0.8.
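
As an illustration of what that doubling looks like in a job file (the numbers are hypothetical, not taken from any of our actual services):

```hcl
resources {
  # memory = 128  # was sufficient on Nomad 0.8
  memory = 256    # needed on 0.9, since the executor and log-collector overhead
                  # now counts against the same cgroup limit
}
```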

Wanted to clarify the behavior of Nomad 0.9:

  • Nomad 0.9 has a regression in the exec driver where a task requires declaring at least a 50-100MB RAM requirement and where the nomad binary overhead is reported as part of the task cgroup stats, penalizing low-memory tasks. The underlying issue was a CVE fix in runc (opencontainers/runc#1980) that was fixed in runc/libcontainer (opencontainers/runc#1984), and we picked it up in #5437. Nomad 0.9.2 should address this point.

  • Additionally, Nomad 0.9 had some client re-architecture that caused more host-level memory overhead per task; for example, we run an additional log collector process per task, which can be significant overhead when running many tasks and when the kernel is not caching binaries effectively. We plan to address this regression along with the overall nomad overhead.

I'm closing this ticket as the exec driver has been significantly changed since 0.8, and I believe the notes here are either addressed or no longer relevant. I'd encourage users experiencing memory issues to create a new issue against 0.10.

Since my May 31 comment, we made the following changes:

Please let us know of any issues you see and we will follow up. Thanks!
