Nomad 0.7.0 memory leak

Created on 27 Nov 2017 · 13 comments · Source: hashicorp/nomad

I've run into a memory leak with 0.7.0.

We have 3 server instances in the cluster. In this case one dies and the cluster keeps quorum, but after restarting the dead server it dies once more.

Attached is the log

memory.txt

Labels: stage/waiting-reply, type/bug

All 13 comments

We then ran a force-leave and terminated the instance. A replacement came back into play, but when it tried to join the cluster the remaining 2 instances lost quorum.

The solution was to kill the master servers, rebuild a new cluster, and restore the jobs from backup.

How much memory was nomad using? How much memory does your machine have?

The servers had 2 GB of RAM each with no swap. The unusual part was that the server became unresponsive when I tried to restart nomad.service, as if it were stuck in an infinite loop.

The 1-minute load average from uptime showed 7+ when I was able to run the command; otherwise the server was unresponsive. These are t2 instances in AWS, but they still had plenty of CPU credits. They sit at about 2% CPU usage until this issue is hit.

How much memory are the other two Nomad servers using? Can you show the output of curl http://localhost:4646/v1/jobs?pretty=true
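If it helps, roughly the data we're after is the job and allocation lists plus the agent's own runtime stats. A minimal sketch, assuming the default HTTP port 4646 and that you run this on one of the servers:

    # registered jobs and current allocations
    curl 'http://localhost:4646/v1/jobs?pretty=true'
    curl 'http://localhost:4646/v1/allocations?pretty=true'

    # Go runtime stats (alloc_bytes, sys_bytes, etc.) for the local agent
    nomad agent-info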

Unfortunately they are gone, as I was rushing to get the cluster back up and running. I forwarded you a gist of the job output via email. Our current replacement cluster is running at about 300 MB of memory usage with the same job specs.

@discobean how many allocations do you have, and are you constantly creating more? Can you add monitoring of the servers' memory usage? You may simply need servers with more memory for how you are using Nomad.
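A rough sketch of what would be enough for that monitoring, assuming Linux and a process named nomad (any proper metrics agent such as CloudWatch would obviously be better):

    # append the Nomad process RSS (in KB) to a log once a minute
    while true; do
      echo "$(date -Is) $(ps -o rss= -C nomad | head -n1)" >> /var/log/nomad-rss.log
      sleep 60
    done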

@discobean when you restart your cluster, can you change it to debug logging?

I will add monitoring for RAM so we have a decent history to refer to for when this comes up again. And next time I'll start the failed instance with debugging on for you.

@discobean you can change the log level in the config file to debug and send a SIGHUP to switch the current servers to debug logging!
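Roughly, that looks like the following; the config path is just an example and depends on your setup:

    # in the agent config file (e.g. /etc/nomad.d/server.hcl), set:
    #   log_level = "DEBUG"
    # then reload the running agent without restarting it
    pkill -HUP nomad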

@discobean I would also suggest increasing the servers' memory to something like 8 GB; if there is a memory leak, we will still be able to observe it.

I am going to close this until we get clear evidence that there is a leak.

A somewhat similar memory leak happened to us too.

We tried to recover a leaderless cluster with a new raft/peers.json, and two of the servers ate all the memory on their machines and started swapping (and were then terminated by the ASG...), while the first node started and then exited shortly thereafter...
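For context, the recovery we attempted was the usual outage procedure: stop all servers, drop a peers.json listing every server's RPC address into each server's raft directory, and restart them. A sketch of that file, assuming a data_dir of /opt/nomad/data (our actual path differs) and placeholder IPs for the other two servers:

    # adjust the data_dir to match your agent config
    cat > /opt/nomad/data/server/raft/peers.json <<'EOF'
    ["10.10.1.119:4647", "10.10.1.120:4647", "10.10.1.121:4647"]
    EOF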

There may have been a lot of dead jobs that were not garbage collected, since we have a fair number of periodic jobs that run every few minutes...
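If un-collected dead jobs are the culprit, forcing a server-side garbage collection and watching whether memory drops might be a quick check (assuming the system GC endpoint is available on this version):

    # ask the leader to run an immediate garbage collection pass
    curl -X PUT http://localhost:4646/v1/system/gc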

Here's a gist with the logs from one of the machines that gobbled up memory:
https://gist.github.com/slobo/8b74e70fd6013d988f2d1dc8446ee50c

In the logs you will notice a reference to 10.10.1.119, which was the node where Nomad started and then died.
