We've run into an interesting error scenario where an attempt to write to the local snapshot file fails with a "no space left on device" error. The process seems to hold on to that error status and short-circuits all subsequent write requests indefinitely. So even after the filesystem issue is resolved, the consul process continues to SYSLOG "no space left on device" errors for each "tryAppend" call.
Running "strace" on the process shows that no write() system calls are made to the file descriptor open against the snapshot file. Each call just results in a write to SYSLOG, with an error returned to the application.
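For what it's worth, the symptom described above (an error that sticks, and no further write() calls reaching the file) matches the documented behaviour of Go's bufio.Writer: once a write fails, the writer keeps returning that same error and stops calling the underlying writer. The sketch below only illustrates that behaviour. It assumes the snapshot file sits behind a buffered writer, which this thread does not confirm, and flakyDisk, appendLine and demo are hypothetical names made up for the example.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
)

// errNoSpace stands in for the ENOSPC error returned on a full disk.
var errNoSpace = errors.New("no space left on device")

// flakyDisk is a hypothetical io.Writer that fails while full is true
// and counts how many times Write is actually reached.
type flakyDisk struct {
	full  bool
	calls int
}

func (d *flakyDisk) Write(p []byte) (int, error) {
	d.calls++
	if d.full {
		return 0, errNoSpace
	}
	return len(p), nil
}

// appendLine mimics a tryAppend-style helper: buffer a line, then flush it.
func appendLine(w *bufio.Writer, line string) error {
	if _, err := w.WriteString(line); err != nil {
		return err
	}
	return w.Flush()
}

// demo returns the errors from three appends (disk full, disk freed,
// disk freed after Reset) and the number of underlying Write calls.
func demo() (first, second, third error, calls int) {
	disk := &flakyDisk{full: true}
	w := bufio.NewWriter(disk)

	first = appendLine(w, "alive: node1\n") // fails: disk is full

	disk.full = false                        // space has been freed
	second = appendLine(w, "alive: node2\n") // still fails: the error is sticky

	w.Reset(disk)                           // drops buffered data and clears the error
	third = appendLine(w, "alive: node3\n") // reaches the disk again

	return first, second, third, disk.calls
}

func main() {
	first, second, third, calls := demo()
	fmt.Println(first, second, third, calls)
}
```

In this sketch the second append never reaches flakyDisk.Write, which would look just like the missing write() calls in the strace output. Calling Reset is one way to clear the state in the example; whether the real code path can recover the same way is an open question here.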
Bumping this one... Any thoughts?
@jjones-smug sorry haven't had a chance to chase this down yet.
Bumping again, since I have a ticket open on this one that I'd like to close out.
Could this be related?
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/consul-tool/pODmVVEy4Ds/6-LpTAZBAgAJ
Hi,
I'll test it.
I added the following line to /etc/fstab:
tmpfs /opt/diskfull tmpfs defaults,noatime,size=250M 0 0
Set consul's data_dir to /opt/diskfull/consul
# systemctl stop consul
# mount -a
# systemctl start consul
And...
# dd if=/dev/zero of=/opt/diskfull/dummy.img bs=512 count=9999999999
dd: error writing ‘dummy.img’: No space left on device
511745+0 records in
511744+0 records out
262012928 bytes (262 MB) copied, 0.582111 s, 450 MB/s
It takes some time to use up the reserved area of the local.snapshot file.
After reproducing the issue:
# rm /opt/diskfull/dummy.img
Although free space is available again, the error messages keep being logged.
I hope the problem will be fixed by the following patch.
-> https://github.com/hashicorp/consul/pull/2236
best regards,
I have a system that just ran into this problem as well:
consul: 2016/12/19 17:35:21 [ERR] serf: Failed to update snapshot: write /var/lib/consul/serf/local.snapshot: no space left on device
A df -h showed:
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 72G 30G 39G 44% /
Though a day or so ago that same machine did indeed run out of space, but obviously that space has since been cleared up.
We did a systemctl restart consul and consul is happy again.
Consul version 0.7.1
Just hit this (still) with 0.7.5 where agent never recovers after a disk full.
I also see this error in the log:
==> Error starting agent: failed decoding service file "/var/lib/consul/services/00ea4b7117da453e50d3972b86dad0f8-648e8828-9969-2139-bf2a-bca0d3b17d27.tmp": unexpected end of JSON input
And the agent will not start.
Is there a workaround for this issue?
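On the startup error: "unexpected end of JSON input" is the message Go's encoding/json returns when the input is empty or cut off before the document is complete. That is consistent with a .tmp file that was created while the disk was full and never fully written, although the thread does not confirm that this is what happened. A minimal sketch of the decoder behaviour, where decodeService is a hypothetical stand-in and not Consul's actual loader:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeService is a hypothetical stand-in for loading a persisted
// service definition from the raw bytes of a file.
func decodeService(raw []byte) error {
	var svc map[string]interface{}
	return json.Unmarshal(raw, &svc)
}

func main() {
	// An empty file, e.g. one created but never written to.
	fmt.Println(decodeService([]byte{}))

	// A file truncated part-way through the JSON document.
	fmt.Println(decodeService([]byte(`{"ID":"web"`)))

	// A complete document decodes without error.
	fmt.Println(decodeService([]byte(`{"ID":"web"}`)))
}
```

Both the empty and the truncated input fail with the same message seen in the agent log, so inspecting the size and content of the named .tmp file would be a reasonable first check.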
Fixed in https://github.com/hashicorp/consul/pull/3236