After rebooting a server running Consul agent (in server mode), attempting to restart Consul's agent with its previously used parameters throws these errors:
==> Starting Consul agent...
==> Error starting agent: unexpected end of JSON input
Looking in the directories:
The server's reboot occurred at the same time that we registered health checks to Consul, so I believe that the files are corrupt (empty actually) because they are not being written atomically.
Thanks,
Amir
Yeah this could probably done in a little safer manner to reduce the chances of corrupting them. It should be safe to delete them and restart, as long as you re-register the checks since the agent is the source of truth about what's registered.
:+1: for this. We ran into this as well:
==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting raft data migration...
==> Starting Consul agent...
==> Error starting agent: failed decoding service from /data/services/3d2e6bded046ca2e3f906f04235b279b: unexpected end of JSON input
The same problem :(
saw this as well with 0.6.0. I just made a config change then restarted consul and a couple of the clients had empty files in services/ that when I deleted and restarted, solved the problem.
This happened about half the times I made changes to node_ttl on a running consul client and then restarted consul. Let me know if I can help debug.
I did see serf: Failed to update snapshot: write /opt/consul/data/serf/local.snapshot: no space left on device before the restart. Not sure if related.
When there was 2 empty files, I actually got a useful error for the second one:
Error starting agent: failed decoding service file "/opt/consul/data/services/4e2b1752dd646044507e9ac846479a90": unexpected end of JSON input
I have a pull request up for this in #1750
This looks like the same problem that nomad had in hashicorp/nomad#1357 which IMHO is a two part solution. Firstly fix the write to be atomic like in #1750 but secondly still allow startup of corrupt state.
Out of curiosity, did you find something in my implementation that you didn't agree with, and is startup with corrupt state a desirable thing for a system responsible for dns and service discovery? I personally feel like consul's config files need to be as guaranteed as possible.
Not sure if https://github.com/hashicorp/nomad/issues/1367 is the same thing so I am also posting here.
+1 for this fix.
Occasionally I see this issue with Nomad 0.4.0
# /path/nomad agent -config /path/config.json
Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:
* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input
It turns out that some state.json file are empty
# /...path.../nomad version
Nomad v0.4.0
# find /...nomad_directory.../ -name state.json | xargs ls -la
-rw------- 1 root root 0 Aug 11 13:13 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/state.json
-rw------- 1 root root 2760 Aug 8 12:01 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/task-a519f3bf481a344e286bbdab7d3cfe5f/state.json
-rw------- 1 root root 0 Aug 11 13:13 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/state.json
-rw------- 1 root root 0 Aug 8 19:52 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/task-e71f891226c0be65a6a1d38f954cc4b8/state.json
Deleting /nomad_directory/ helped to overcome the issue.
Not sure what caused this. It might have been the shutdown of the nodes. There was no shortage of disk space.
I just realized that my previous comment was for a similar issue with Nomad. Now I have experienced it with Consul too on 0.6.3.dev.
# /path/consul agent -config-file /path/config.json
==> Starting Consul agent...
==> Error starting agent: failed decoding service file "/path/services/18501c56aa3648940425d400c66dad9a": unexpected end of JSON input
# ls -la /path/services/18501c56aa3648940425d400c66dad9a
-rw------- 1 root root 0 Aug 29 19:50 /path/services/18501c56aa3648940425d400c66dad9a
# /path/consul version
Consul 0.6.3.dev
Consul Protocol: 3 (Understands back to: 1)
This issue is still valid in 0.8.5 with an empty JSON file in /var/consul/services:
# consul version
Consul v0.8.5
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
# cat /var/consul/services/foo
# /usr/bin/consul agent -config-dir=/etc/consul.d -pid-file /tmp/consul.pid
==> WARNING: LAN keyring exists but -encrypt given, using keyring
==> Starting Consul agent...
==> Error starting agent: failed decoding service file "/var/consul/services/foo": unexpected end of JSON input
Hi @zg did you experience a crash/running out of disk space to get the empty file, or did it occur some other way?
I created the tmp file myself as a proof of concept, but previously there was an incident where consul couldn鈥檛 start due to zero byte tmp files that were in the data directory that we previously were unaware of.
I see. I'll kick this back open so we can take a look.
@zg how did you resolve this? I have an issue where I just pushed a rolling cluster update using kops and one of my consul instances failed to start as it's looking for a service file that doesn't exist.
==> Error starting agent: failed decoding service file "/var/lib/consul/services/a369186d812d2a6ea01b87b6956ef828-340fa39d-14d1-e3f2-1305-d2228422425d.tmp": unexpected end of JSON input
@pjelar delete the empty services file and restart consul
@MrMMorris there wasn't a file there. I upgraded to 1.0.0 and the problem went away.
It could be that an upgrade deletes the tmp files.
The files weren't there @zg I deleted them. I tried placing valid .json files there too. More likely that 1.0.0 handles this issue better than 0.8.5, it wasn't a normal situation as I also had to do a peers.json manual recovery.
That鈥檚 how I resolved my issue as well, by deleting the empty JSON files.