Consul: Snapshot open file descriptor permanently unusable after "filesystem full" condition

Created on 20 Feb 2016 · 9 Comments · Source: hashicorp/consul

We've run into an interesting error scenario: an attempt to write to the local snapshot file fails with a "no space left on device" error, and the process then seems to hold on to that error status and short-circuit all subsequent write requests indefinitely. Even after the filesystem issue is resolved, the consul process continues to log "no space left on device" errors to syslog for each "tryAppend" call.

Running "strace" on the process shows that no write() system calls are made to the file descriptor open on the snapshot file. Each call just results in a write to syslog, with an error returned to the application.
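
One behavior that would match these symptoms is the sticky error in Go's bufio.Writer: once a write to the underlying writer fails, every later Write and Flush returns the same error without touching the underlying writer again. The sketch below only illustrates that behavior. It assumes the snapshot file is wrapped in a bufio.Writer, and flakyDisk, newSnapshotWriter and appendLine are hypothetical names, not Consul or Serf code.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
)

// errNoSpace stands in for ENOSPC ("no space left on device").
var errNoSpace = errors.New("no space left on device")

// flakyDisk is a hypothetical io.Writer that fails while full is true
// and counts how often the underlying Write is actually invoked.
type flakyDisk struct {
	full  bool
	calls int
}

func (d *flakyDisk) Write(p []byte) (int, error) {
	d.calls++
	if d.full {
		return 0, errNoSpace
	}
	return len(p), nil
}

// newSnapshotWriter wraps the disk in a buffered writer (assumed setup).
func newSnapshotWriter(d *flakyDisk) *bufio.Writer {
	return bufio.NewWriter(d)
}

// appendLine mimics a "tryAppend"-style helper: buffer one line, then flush.
func appendLine(w *bufio.Writer, line string) error {
	if _, err := w.WriteString(line); err != nil {
		return err
	}
	return w.Flush()
}

func main() {
	disk := &flakyDisk{full: true}
	w := newSnapshotWriter(disk)

	// The flush reaches the disk once and fails.
	fmt.Println("while full:", appendLine(w, "alive: node1\n"))

	// Space is freed, but the buffered writer still returns the stored error
	// and never calls disk.Write again.
	disk.full = false
	fmt.Println("after cleanup:", appendLine(w, "alive: node2\n"))
	fmt.Println("underlying writes:", disk.calls)

	// Replacing the buffered writer clears the stored error.
	w = newSnapshotWriter(disk)
	fmt.Println("after reset:", appendLine(w, "alive: node3\n"))
}
```

If this is what happens here, it would explain why strace shows no further write() calls, and why restarting the process (which creates a fresh writer) makes the errors stop.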

Labels: theme/internal-cleanup, type/bug

All 9 comments

Bumping this one... Any thoughts?

@jjones-smug sorry haven't had a chance to chase this down yet.

Bumping again, since I have a ticket open on this one that I'd like to close out.

Could this be related?

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/consul-tool/pODmVVEy4Ds/6-LpTAZBAgAJ

Hi,

I'll test it.

Added the following line to /etc/fstab:

tmpfs   /opt/diskfull    tmpfs   defaults,noatime,size=250M 0 0

Set consul's data_dir to /opt/diskfull/consul

# systemctl stop consul
# mount -a
# systemctl start consul

And...

# dd if=/dev/zero of=/opt/diskfull/dummy.img bs=512 count=9999999999
dd: error writing ‘dummy.img’: No space left on device
511745+0 records in
511744+0 records out
262012928 bytes (262 MB) copied, 0.582111 s, 450 MB/s

It takes some time to use up the reserved area of the local.snapshot file.
After reproducing the issue:

# rm /opt/diskfull/dummy.img

Although free space is available again, the error messages continue.

I hope the problem will be fixed by the following patch.
-> https://github.com/hashicorp/consul/pull/2236

best regards,

I have a system that just ran into this problem as well:

consul: 2016/12/19 17:35:21 [ERR] serf: Failed to update snapshot: write /var/lib/consul/serf/local.snapshot: no space left on device

A df -h showed:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        72G   30G   39G  44% /

A day or so ago that same machine did indeed run out of space, but that space has obviously since been freed.

We did a systemctl restart consul and consul is happy again.

Consul version 0.7.1

Just hit this (still) with 0.7.5 where agent never recovers after a disk full.

I also see this error in the log:
==> Error starting agent: failed decoding service file "/var/lib/consul/services/00ea4b7117da453e50d3972b86dad0f8-648e8828-9969-2139-bf2a-bca0d3b17d27.tmp": unexpected end of JSON input

And the agent will not start.
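
For context, "unexpected end of JSON input" is the message Go's encoding/json returns from json.Unmarshal when the input is empty or cut off, which is what a file left half-written by a full disk would look like. A minimal sketch, where serviceDef and decodeService are hypothetical stand-ins and not Consul's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serviceDef is a hypothetical, minimal stand-in for a persisted service definition.
type serviceDef struct {
	ID   string
	Port int
}

// decodeService parses one service file's contents.
func decodeService(data []byte) error {
	var s serviceDef
	return json.Unmarshal(data, &s)
}

func main() {
	// Empty file, e.g. created but never written because the disk was full.
	fmt.Println(decodeService([]byte(``)))
	// Truncated file, cut off mid-write.
	fmt.Println(decodeService([]byte(`{"ID": "web", "Po`)))
	// Complete file decodes without error.
	fmt.Println(decodeService([]byte(`{"ID": "web", "Port": 80}`)))
}
```

The first two calls print "unexpected end of JSON input" and the third prints "<nil>". This suggests the .tmp file named in the log is an incomplete leftover from the disk-full period, though that has not been confirmed in this thread.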

Is there a workaround for this issue?
