Hello.
We are using Consul 0.5.0, and on one server (we have 3 servers) the consul data folder is really heavy:
# du -sh *
4.0K checkpoint-signature
3.2G raft
24K serf
1.8G tmp
0 ui
du -sh raft/*
3.2G raft/mdb
4.0K raft/peers.json
552K raft/snapshots
If I run strings on the db file, I always see this pattern:
Token
Index
Term
Type
Data
Datacenter
DirEnt
CreateIndex
Flags
5consul-alerts/checks/wax-prod.worker-1/_/memory_usage
LockIndex
ModifyIndex
Session
Value
o{"Current":"passing","CurrentTimestamp":"2015-03-19T07:16:19.750627946Z","Pending":"","PendingTimestamp":"0001-01-01T00:00:00Z","HealthCheck":{"Node":"wax-prod.worker-1","CheckID":"memory_usage","Name":"memory_usage","Status":"passing","Notes":"","Output":"MEM OK - usage system memory: 17% (free: 1405 MB)\n","ServiceID":"","ServiceName":""},"ForNotification":false}
Maybe it's useful for you to know that we are using consul-alerts too.
Based on the small size of the snapshots and the large size of the other folders, my guess is that you guys did some heavy benchmarking or something against it. The LMDB architecture causes the files to grow up to a "working size". Under heavy write load that file will grow until it is compacted.
We are changing some of the architecture underneath however. BoltDB is architected similarly so I'd expect roughly the same space utilization.
Actually, we did not do any heavy benchmarking. It's just our regular use of Consul.
We use it, among other things, to monitor our infrastructure, so a lot of "events" are triggered. Then, because of consul-alerts, a lot of reads/writes are done against the KV store.
I think Consul stores all events, and IMHO a background task could be implemented to purge "old" events stored in the local database.
We're considering leveraging Consul for primary monitoring and event transfer responsibilities. Is there a solution for keeping the data set manageable?
@williamsjj You should just stand up a sandbox environment and try it out. It is very unlikely that the raft size will be an issue. The file grows to accommodate a "working set" which is usually on the order of tens of megabytes.
We had a similar issue. With very little data in Consul, the snapshots can take up to 3GB of disk.
To clarify @activars's post: Our situation is subtly different, in that it's the snapshots that take up the bulk of the space:
root@consul-host:/mnt/consul_data/raft# du -had 1
20M ./raft.db
4.0K ./peers.json
1.5G ./snapshots
1.5G .
Is there a way to remove old values from the DB? Our use case does not use Consul to persist any data, and we would really like to keep the Raft DB down to a more reasonable size (ours is hitting hundreds of gigs now).
Hi @chnrxn - hundreds of gigs is super unusual. Can you summarize which files are taking up all the space in the raft directory in your Consul data-dir? It would be interesting to see if snapshots are not being cleaned up or something.
Hi @slackpad,
It's actually (just?) 100G, which is all raft.db.
Using strings tells me that it seems to be full of KV data and consul-replicate statuses. We are writing about 20 numeric values into the KV every 20 seconds and doing peer-replication with another consul DC.
Snapshots are really just 60k.
$ ls -lh /var/lib/consul/raft
total 100G
-rwxr-xr-x 1 consul consul 107 Oct 12 08:26 peers.json
-rw------- 1 consul consul 100G Nov 4 04:10 raft.db
drwxr-xr-x 17 consul consul 4.0K Nov 4 04:12 snapshots
$ ls -lh /var/lib/consul/raft/snapshots
total 60K
@chnrxn that's super weird since the raft db should get compacted periodically. Do you see any new snapshots being created in your logs and in your data-dir?
@slackpad, not any more after I cleared those huge ones and restarted Consul (and that happened on 2 out of maybe 30 machines).
Thanks for the update. I wonder if there was some issue writing the snapshots that was preventing the raft db from compacting - it would be interesting to see if there were any log entries related to that.
I got the same trouble with a big raft.db after load testing.
I used a bench tool with put/get logic and several data sizes from 1 to 100000, scaling by 10.
After several tests my raft.db uses 3G.
I have only one node in the cluster.
consul version
Consul v0.6.2
Consul Protocol: 3 (Understands back to: 1)
@freeseacher The raft.db itself is a B-Tree that only grows. Once it compacts, it manages free space internally, but the actual file is never reduced in size. Under heavy write load the file grows, and then we internally truncate and re-use space within the file. It's a limitation of the BoltDB underneath, and typically the file size settles to support the steady state of the cluster.
Once it compacts, it manages free space internally, but the actual file is never reduced in size.
Perhaps I was running into such a situation (my initial implementation had 25 concurrent replication streams).
Would the only way of resolving this be to re-create a new BoltDB file, or simply remove it if it exceeds some threshold (assuming I do not need the data in the DB)?
https://github.com/boltdb/bolt/issues/423 suggests a workaround to copy out the DB to reduce its size. If I were to write a separate tool to do this, would it suffice to create the new DB, replace the existing DB, and restart consul?
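For reference, a rough sketch of that copy-out approach using the bbolt CLI (the maintained fork of BoltDB; assumes Consul is stopped, Go is installed, and the default data path - all of which are assumptions here):
# Install the bbolt CLI, rewrite the DB into a new file containing only
# live pages, then swap it into place:
go install go.etcd.io/bbolt/cmd/bbolt@latest
bbolt compact -o /var/lib/consul/raft/raft.db.compacted /var/lib/consul/raft/raft.db
mv /var/lib/consul/raft/raft.db.compacted /var/lib/consul/raft/raft.db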
@chnrxn in theory something like that should work if you were to stop Consul first. Once Consul has done a Raft snapshot / compaction it might be easier to just swap out a server, though. Newly-added servers will only get the actual working contents of the Raft log, and shouldn't have the size bloat from the previous benchmarking.
@slackpad, if I understand you correctly, I could just have an external process control (e.g. daemontools) stop Consul, remove the Raft DB, and restart Consul (of course one host at a time)?
@chnrxn potentially, though it would be tricky because you'd need to wait for the cluster to take a snapshot after the expensive load testing / etc. If you take a server out, clear out its state in the data-dir, and join it back with the other servers then it should receive the snapshot + the much smaller compacted Raft log.
@slackpad how can I tell when a snapshot has been taken, or when a compaction has been done? Is there anything in the logs I can watch out for, or something?
Yes, you'd want to watch for your current leader to output some log lines like this:
2016/02/01 23:50:06 [INFO] snapshot: Creating new snapshot at /var/folders/q_/fvcn55l55cl01tst9rvynlmc0000gn/T/consul271175954/raft/snapshots/1-77955-1454399406743.tmp
2016/02/01 23:50:09 [INFO] raft: Compacting logs from 1 to 67723
2016/02/01 23:50:09 [INFO] raft: Snapshot to 77955 complete
And if you were going to try this you'd want to replace non-leader nodes first, or else wait for all of your servers to snapshot and compact.
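For concreteness, a minimal sketch of that rolling replacement on one non-leader server (the init system, data-dir path, and join address are assumptions):
consul leave                 # gracefully remove this server from the cluster
systemctl stop consul        # make sure the agent is fully stopped
rm -rf /var/lib/consul/raft  # clear the bloated Raft state
systemctl start consul
consul join 10.0.0.10        # rejoin; the leader sends a snapshot plus the compacted log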
I have a similar problem, is there any work to fix this or provide workarounds?
Hi @Omeryl, currently the only workaround is to replace the servers after a snapshot. It sounds like it could be tricky to compact the DB automatically, so we will need to do some investigation there.
@slackpad couldn't I just remove servers one by one from the cluster, delete the entire data directory, then add each machine back into the cluster?
We are facing a similar issue and need to resolve this quickly.
Secondly, why do we need to wait for a snapshot to occur?
Any ideas here, other than deploying a new cluster? @armon
@myusuf3 you are correct that rolling the servers one-by-one should correct this. I was thinking that you'd want to wait for it to snapshot so the newly-added servers wouldn't get a bunch of raft log entries replayed, but the leader will fall back to a snapshot fairly quickly anyway if you introduce a server with a fresh data directory.
@slackpad well I am not convinced that's how it works. I recently had to replace a server and introduced it to the cluster, and now its raft.db is just as big as the others, even though it was a new server with a clean data directory.
What exactly is the process you are describing? Also, is there a way to know when a snapshot has completed, other than grepping logs and waiting?
Do you mean to say newly added servers won't have this bloat? If so, how do you explain the new server I recently added to the cluster?
Lastly, by data directory, do you mean the parent folder containing raft.db and the snapshots directory?
@myusuf3 ok then that means there's a problem in one of two places. Either snapshots are not happening correctly or logs aren't getting truncated correctly after a snapshot. When a new server joins, it will indicate to the leader what index it knows about (for a new server this will effectively be zero), so usually the leader will send the last snapshot followed by the log entries that occurred after the snapshot. If there was something that caused the raft.db to bloat, I would not expect it to be carried over to the new server.
In your case it's not clear what happened and it seems like whatever went into the Raft log that caused the bloat got replicated to the new server. Deleting the parent folder with raft.db and snapshots is the right way to remove the data, that's what I meant by new server. Right now the logs are the only way to see the snapshots happen (0.7.1 will have an API to force a snapshot). Do you see any of those in your leader logs? Do you have any logs from when you joined the new server (logs from the leader and the new server)?
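For reference, once that API shipped in 0.7.1, forcing and saving a snapshot looks roughly like this (a local agent is assumed; both forms use the same endpoint):
# Via the HTTP API:
curl -s http://127.0.0.1:8500/v1/snapshot -o backup.snap
# Or via the CLI wrapper:
consul snapshot save backup.snap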
@slackpad hmm. logs incoming.
@slackpad anything specific we are looking for here? there are quite a few logs!
It did restore from snapshot
==> Log data will now stream in as it occurs:
2016/10/01 21:02:17 [INFO] raft: Restored from snapshot 318-5080444-1475355266300
@slackpad also getting this on the leader since the 6th
2016/10/06 16:29:07 [WARN] consul: Attempting to apply large raft entry (1049969 bytes)
Interesting - that's a snapshot from term 318, index 5080444, and from Sat, 01 Oct 2016 20:54:26 GMT so not that old relative to when the server started. That large raft entry could definitely cause bloat if it was happening often. That's over double the limit for a KV entry so the most likely cause of that is a health check that has a ton of output that's flapping. In Consul 0.7 we capped that, but in previous versions of Consul there wasn't a limit. Does that sound like it might fit anything that's happening for you?
@slackpad possibly; our health checks are comprehensive. So removing boxes from the cluster one by one, deleting the data directory consul/data, and adding them back to the cluster should reclaim disk usage? Wouldn't we see the behaviour we just saw for the newly added node?
But even adding the cap would only push out the eventual problem of the disk running out. No?
The cap is 4k so it's a lot less than the entry you are seeing. How often does that print in the server log?
@slackpad on leader? quite often
@slackpad possibly; our health checks are comprehensive. So removing boxes from the cluster one by one, deleting the data directory consul/data, and adding them back to the cluster should reclaim disk usage? Wouldn't we see the behaviour we just saw for the newly added node?
More interested in this one, though ^^
@myusuf3 yes if you have something rapidly committing ~1 MB entries to the Raft log that could use up space pretty quickly, even for a fresh set of servers. Something like this query will locate your top health checks by output size:
export CONSUL_HTTP=http://demo.consul.io:80
# List every service, then for each one fetch its health checks and print
# "<output length> <node> <check id>", sorted ascending by output size:
curl -s $CONSUL_HTTP/v1/catalog/services \
  | jq -r 'keys | .[]' \
  | xargs -n 1 -I {} curl -s $CONSUL_HTTP/v1/health/service/{} \
  | jq -r '.[] .Checks | .[] | (.Output | length | tostring) + " " + .Node + " " + .CheckID' \
  | sort -n
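(This assumes jq is installed and CONSUL_HTTP points at one of your own servers; the list is sorted ascending, so the noisiest checks land at the bottom.)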
@slackpad well, those machines have been running for over a year, and I would imagine this isn't the issue at the moment; it's totally possible, but I would imagine I would have seen it earlier. Regardless, how do I fix this without taking out production, since the cluster seems to be replicating the large raft.db to new machines?
@myusuf3 you'd have to locate the source and adjust that health check to not produce a large output because it seems like some process is repeatedly creating new, large log entries. I think a health check should be the only thing capable of making such a large log entry in Consul prior to 0.7.
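One way to bound a script check's output is a small wrapper; a hypothetical sketch (real_check.sh stands in for the offending check, and the 4KB figure mirrors the 0.7 cap mentioned above):
#!/bin/bash
# Run the real check, preserve its exit status, but emit only the last 4KB of output.
out=$(/usr/local/bin/real_check.sh 2>&1)
status=$?
printf '%s\n' "$out" | tail -c 4096
exit $status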
@slackpad cool! So we upgraded to Consul 0.7 and have removed Consul 0.5.* from our system completely. As you mentioned, adding new machines to the cluster and having them come up with a clean raft.db wasn't true in Consul 0.5.*, but it is true for Consul 0.7. When we would previously add a machine, it would copy the entire raft.db (or what looked to be the entire thing) to the new box.
Anyways that was an observation ^^
The issue I am seeing now is that the raft.db on the old boxes, which was at 70% disk usage, isn't going down, and I would like to clear that up since it's dangerously close to full.
Is it safe to remove a node from the cluster, rm -rf the data directory, restart Consul, and add it back to the cluster? Then repeat for all the remaining servers to reset disk usage, or not?
I had a similar issue: consul servers (0.7.0) running on a c4.xlarge with a fairly small (slow) 40GB GP2 EBS volume, and it had grown the raft.db file to 1.5GB, while the actual dataset is less than 50MB. It was doing constant heavy IO with a pretty much constant 30+ MB/s of writes and triggering alerts (90%+ of IO capacity). It was a follower so I stopped the server, moved the raft.db to a backup folder and started Consul up. It quickly recovered and re-created the raft.db, which is now at 41MB and doing a constant 250-400KB/s of writes. Running with 3 consul servers managing 40 nodes with around ~500 checks.
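For anyone wanting to replicate that, the steps were roughly as follows (the init system and data-dir path are assumptions; only safe on a follower):
systemctl stop consul
mv /var/lib/consul/raft/raft.db /var/lib/consul/raft.db.bak   # keep a backup, don't delete outright
systemctl start consul    # the follower re-syncs its state from the leader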
This issue is still why I have been unable to use Consul in production. Why hasn't this been at least partially resolved?
@Nomon so you did this rotation per machine until they came back up without the bloat, and that worked?
I will give it a shot, but I recently added a new box and it matched the other machines in bloat as well.
Could health checks with volatile data in the output (e.g. timestamp, process id) trigger this?
Could health checks with volatile data in the output (e.g. timestamp, process id) trigger this?
This definitely can, by creating a large number of Raft log entries for each update.
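A contrived illustration: the agent only re-syncs a check to the servers when its status or output changes (output-only churn is batched by check_update_interval), so a script like the first line below produces a steady stream of Raft writes, while the second only writes on real status transitions:
# Volatile: output differs on every run (timestamp, pid), forcing repeated updates
echo "OK at $(date) (pid $$)"
# Stable: output only changes when the underlying status changes
echo "OK"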
Just encountered this problem. Sad.
Unfortunately I was not able to see much in the logs; they were huge and had rotated.
So my setup is multi-server, multi-DC (DC1, DC2).
Consul has about 8 services with about 40 nodes registered in total in each dc.
Server cluster of 4 nodes.
consul v0.9.0
each server node is c4.large with default space of 8GB.
Only DC1 failed.
raft.db was ~6GB on each node.
This DC1 is authoritative for ACLs, DC2 is replicating data from it.
Server nodes in DC2 had no problem with raft.db, size is about 50M and growing slowly.
Pity, I can't say when raft.db started taking all the space. But it has only been 1.5 months since I upgraded it from v0.8.5 and enabled ACLs and replication.
This is a production setup but it's idle. Currently we are not using it in full production.
But we already have Vault set up using this Consul as a backend, and we are going to migrate configuration values for our configuration management system into the K/V storage.
The funniest thing is that I completely removed everything in the raft/* folder before I found this thread, so I've lost all ACLs in both DCs, and K/V data only in DC1.
Agents have the simplest health checks, which look like this:
{
  "service": {
    "checks": [
      {
        "interval": "10s",
        "script": "test $(curl -s 'localhost/health/1/ping') = 'PONG'"
      }
    ],
    "port": 80,
    ...
  }
}
{
  "check": {
    "id": "nginx",
    "interval": "10s",
    "name": "nginx",
    "script": "systemctl is-active nginx"
  }
}
::consul::check { 'nginx':
interval => '10s',
script => 'systemctl is-active nginx',
}
Could replication be the cause of such raft.db growth?
Is anything being done towards simplifying recovery from this issue, or towards automatic prevention?
How often are snapshots created?
How many snapshots exist at a given moment in time?
How often are they pushed to raft.db (if I correctly understand its purpose)? Or does raft.db get realtime writes from the environment?
Hi @den-is, sorry for the late reply - did you get any more info on this? Live writes go to the raft.db and then a periodic process runs a snapshot and truncates the Raft log. If there's an event that causes a huge number of writes it can blow up the size of raft.db, even after it gets truncated, since BoltDB can't shrink it - we do intend to fix that. We added a more aggressive snapshot polling interval in 0.7.2 that should help prevent this, though it's not a complete fix. If you've been upgrading from a version before then, you may have ended up with a large static raft.db.
@slackpad Hey,
I've not encountered this problem since then.
Still don't know what events have caused this issue.
@slackpad is it possible to configure that snapshot interval somehow? or disable it via cmd line flag maybe?
@alex-leonhardt-els not currently, and the aggressive tuning we did in 0.7.2 tries to snapshot as often as possible to prevent the Raft DB from getting large. You definitely wouldn't want to disable snapshots - can you clarify what you are thinking there?
Thanks @slackpad, oh I understand now, I think: to keep the DB (which also contains the log?) small, one would want the snapshots to be done more often, rather than the other way around.
Re why I asked: I saw snapshots happening whilst restoring from a snapshot, so I was wondering if it'd be at all possible.
Hi,
I am using Consul (v1.0.6) along with Vault to store data. The setup contains 2 nodes, with Vault and Consul instances on each node, and a third Consul instance that depends on a virtual IP, which will be present on one of the two nodes. So at any given time one node will have 2 Consul instances active.
Vault and Consul have a dedicated disk partition, which is getting filled up because of the multiple snapshots (2, plus 2 .tmp) stored by both instances, which come to 3.5GB.
First instance
drwxr-xr-x. 2 kmsu kmsg 4096 Sep 18 13:59 906-1688466-1537279189394
drwxr-xr-x. 2 kmsu kmsg 4096 Sep 18 14:00 906-1697701-1537279210112
drwxr-xr-x. 2 kmsu kmsg 4096 Sep 18 14:00 906-1707914-1537279233313.tmp
drwxr-xr-x. 2 kmsu kmsg 4096 Sep 19 09:50 906-1725730-1537350628573.tmp
Virtual IP instance
drwxr-xr-x. 2 kmsu kmsg 4096 Sep 18 13:59 906-1679018-1537279164842.tmp
drwxr-xr-x. 2 kmsu kmsg 4096 Sep 18 13:59 906-1689691-1537279194955
drwxr-xr-x. 2 kmsu kmsg 4096 Sep 18 14:00 906-1698425-1537279215051
drwxr-xr-x. 2 kmsu kmsg 4096 Sep 18 14:00 906-1708085-1537279234573.tmp
Please let me know if I can control the snapshot size and the number of snapshots that are created through some configuration in Consul v1.0.6.
Thanks in advance.
It was a follower so I stopped the server, moved the raft.db to a backup folder and started Consul up. It quickly recovered and re-created the raft.db...
@Nomon can you share how you performed checks on raft.db? we're facing the same issue in Consul 1.4
@slackpad I opened an issue that relates to this - we're having serious problems with Consul in production: https://github.com/hashicorp/consul/issues/5181
I just wanted to point out new configuration values added in 1.1.0 that are related to snapshotting.
This gives more control over the snapshotting behavior and explains the trade-off being made in some detail. We definitely don't recommend jumping into configuring this type of thing as it may have other side-effects on performance or otherwise but wanted to call attention to it if you're confident it is necessary for your use-case.
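If I read the changelog right, these are the raft_snapshot_interval and raft_snapshot_threshold keys; for illustration only, a server config stanza might look like this (the values are placeholders, not recommendations):
{
  "raft_snapshot_interval": "30s",
  "raft_snapshot_threshold": 16384
}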
@pearkes We are still puzzled as for why it is happening.
Also, there are no mentions or caveats about this in Consul's documentation. Big health checks or health checks with dynamic values are not mentioned as a possible reason for performance degradation, for instance.
There are no tools to properly debug this issue and decide if new snapshot values are needed.
@orarnon see my comments on your other issue. But the short answer is this is a "feature" of boltDB which we currently use for the raft store. The snapshot settings Jack mentioned are really just knobs that tune how many log entries are allowed to accumulate there which impacts the largest size the DB can grow to.
The real fix for this issue though involves replacing the raft log with a non-boltDB log implementation. There are other minor benefits to doing so but it's also a lot of work to do in a robust way and prove out against all forms of disk corruption etc.
We're running into an issue of having a small db but a very large snapshot. As some people previously mentioned regarding volatile health checks, the discard_check_output configuration described in https://www.consul.io/docs/agent/options.html specifically calls out volatile health check output as a potential cause of this behavior, and setting this option (at the expense of seeing your health check output) could be a solution.
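For anyone looking for the shape of that setting, a minimal agent config snippet (the option applies agent-wide):
{
  "discard_check_output": true
}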
@calebmayeux We have actually taken care of these health-checks and still encountered the same issue.
Hey there,
We wanted to check in on this request since it has been inactive for at least 60 days. If you think this is still an important issue in the latest version of Consul or its documentation, please reply with a comment here, which will cause it to stay open for investigation. If there is still no activity on this issue for 30 more days, we will go ahead and close it.
Feel free to check out the community forum as well!
Thank you!
Consul is creating a snapshot every minute or so, and this eats up free disk space.
PS C:\src\repos> docker image inspect consul
[
    {
        "Id": "sha256:48b314e920d681cf3a5161fc4fbac1e3e4cf2480ecaedd3150ae36e3b49f5944",
        "RepoTags": [
            "consul:latest"
        ],
        "RepoDigests": [
            "consul@sha256:94cdbd83f24ec406da2b5d300a112c14cf1091bed8d6abd49609e6fe3c23f181"
        ],
        "Parent": "",
        "Comment": "",
        "Created": "2019-09-13T06:21:13.106489061Z",
        "Container": "8b845730cb66cd7e32cf10001167a7ab0fe23a72601ad3ccb6767f9113ce7e7b",
        "ContainerConfig": {
            "Hostname": "8b845730cb66",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "ExposedPorts": {
                "8300/tcp": {},
                "8301/tcp": {},
                "8301/udp": {},
                "8302/tcp": {},
                "8302/udp": {},
                "8500/tcp": {},
                "8600/tcp": {},
                "8600/udp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "CONSUL_VERSION=1.6.1",
                "HASHICORP_RELEASES=https://releases.hashicorp.com"
            ],
        "DockerVersion": "18.06.1-ce",
The container was created with docker compose:
consul:
  container_name: Consul
  image: consul
  hostname: consul
  ports:
    - "8500:8500"
  networks:
    - mynet
  volumes:
    - c:/docker-data/consul-data:/consul/data
  command: agent -server -bootstrap -ui -client 0.0.0.0
In container logs:
2019/11/05 12:06:01 [ERROR] raft: Failed to take snapshot: failed to close snapshot: sync /consul/data/raft/snapshots: invalid argument
2019/11/05 12:07:00 [INFO] consul.fsm: snapshot created in 46.9µs
2019/11/05 12:07:00 [INFO] raft: Starting snapshot up to 413836
2019/11/05 12:07:00 [INFO] snapshot: Creating new snapshot at /consul/data/raft/snapshots/1072-413836-1572955620117.tmp
2019/11/05 12:07:01 [ERR] snapshot: Failed syncing parent directory /consul/data/raft/snapshots, error: sync /consul/data/raft/snapshots: invalid argument
2019/11/05 12:07:01 [ERROR] raft: Failed to take snapshot: failed to close snapshot: sync /consul/data/raft/snapshots: invalid argument
2019/11/05 12:07:55 [INFO] consul.fsm: snapshot created in 63.8µs
2019/11/05 12:07:55 [INFO] raft: Starting snapshot up to 413840
2019/11/05 12:07:55 [INFO] snapshot: Creating new snapshot at /consul/data/raft/snapshots/1072-413840-1572955675583.tmp
2019/11/05 12:07:57 [ERR] snapshot: Failed syncing parent directory /consul/data/raft/snapshots, error: sync /consul/data/raft/snapshots: invalid argument
Hey there,
We wanted to check in on this request since it has been inactive for at least 60 days. If you think this is still an important issue in the latest version of Consul or its documentation, please reply with a comment here, which will cause it to stay open for investigation. If there is still no activity on this issue for 30 more days, we will go ahead and close it.
Feel free to check out the community forum as well!
Thank you!
I don't use Consul anymore. And as the bot said, it's never going to be fixed. So I'm closing this issue. Thanks to everyone involved.