Nomad 0.7.0 memory leak

Created on 27 Nov 2017 · 13 comments · Source: hashicorp/nomad

I've run into a memory leak with 0.7.0.

We have 3 server instances in the cluster. In this case one dies and the cluster keeps quorum, but after restarting the dead server it dies once more.

Attached is the log

memory.txt

Labels: stage/waiting-reply, type/bug

All 13 comments

We then ran a force-leave and terminated the instance. A replacement came back into play, but when it tried to join the cluster the remaining 2 instances lost quorum.

The solution was to kill the master servers, rebuild a new cluster, and restore the jobs from backup.

How much memory was nomad using? How much memory does your machine have?

The servers had 2 GB of RAM each with no swap. The unusual part was that the server became unresponsive when I tried to restart nomad.service, as if it were stuck in an infinite loop.

The 1-minute load average from uptime showed 7+ when I was able to run the command; otherwise the server was unresponsive. These are t2 instances in AWS, but they still had plenty of CPU credits. They sit at about 2% CPU usage until this issue is hit.

How much memory are the other two Nomad servers using? Can you show the output of curl http://localhost:4646/v1/jobs?pretty=true
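If it helps, roughly the data we're after is the job and allocation lists plus the agent's own runtime stats. A minimal sketch, assuming the default HTTP port 4646 and that you run this on one of the servers:

    # registered jobs and current allocations
    curl 'http://localhost:4646/v1/jobs?pretty=true'
    curl 'http://localhost:4646/v1/allocations?pretty=true'

    # Go runtime stats (alloc_bytes, sys_bytes, etc.) for the local agent
    nomad agent-info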

Unfortunately they are gone, as I was rushing to get the cluster back up and running. I forwarded you a gist of the job output via email. Our current replacement cluster is running at about 300 MB of memory usage with the same job specs.

@discobean how many allocations do you have, and are you constantly creating more? Can you add monitoring of the servers' memory usage? You may simply need servers with more memory for how you are using Nomad.
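A rough sketch of what would be enough for that monitoring, assuming Linux and a process named nomad (any proper metrics agent such as CloudWatch would obviously be better):

    # append the Nomad process RSS (in KB) to a log once a minute
    while true; do
      echo "$(date -Is) $(ps -o rss= -C nomad | head -n1)" >> /var/log/nomad-rss.log
      sleep 60
    done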

@discobean when you restart your cluster, can you change it to debug logging?

I will add monitoring for RAM so we have a decent history to refer to for when this comes up again. And next time I'll start the failed instance with debugging on for you.

@discobean you can change the log level in the config file to debug and send a SIGHUP to switch the current servers to debug logging!
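Roughly, that looks like the following; the config path is just an example and depends on your setup:

    # in the agent config file (e.g. /etc/nomad.d/server.hcl), set:
    #   log_level = "DEBUG"
    # then reload the running agent without restarting it
    pkill -HUP nomad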

@discobean I would also suggest increasing the servers' memory to something like 8 GB; if there is a memory leak, we will still be able to observe it.

I am going to close this until we get clear evidence that there is a leak.

A somewhat similar memory leak happened to us too.

We tried to recover a leaderless cluster with a new raft/peers.json, and two of the servers ate all the memory on their machines and started swapping (and were then terminated by the ASG...), while the first node started and then exited shortly thereafter...
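For context, the recovery we attempted was the usual outage procedure: stop all servers, drop a peers.json listing every server's RPC address into each server's raft directory, and restart them. A sketch of that file, assuming a data_dir of /opt/nomad/data (our actual path differs) and placeholder IPs for the other two servers:

    # adjust the data_dir to match your agent config
    cat > /opt/nomad/data/server/raft/peers.json <<'EOF'
    ["10.10.1.119:4647", "10.10.1.120:4647", "10.10.1.121:4647"]
    EOF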

There may have been a lot of dead jobs that were not garbage collected, since we have a fair number of periodic jobs that run every few minutes...
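If un-collected dead jobs are the culprit, forcing a server-side garbage collection and watching whether memory drops might be a quick check (assuming the system GC endpoint is available on this version):

    # ask the leader to run an immediate garbage collection pass
    curl -X PUT http://localhost:4646/v1/system/gc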

Here's a gist with the logs from one of the machines that gobbled up memory:
https://gist.github.com/slobo/8b74e70fd6013d988f2d1dc8446ee50c

In the logs you will notice a reference to 10.10.1.119, which was the node where Nomad started and then died.
