Nomad: [feature] gracefully handle corrupt startup state

Created on 28 Jun 2016 · 19 Comments · Source: hashicorp/nomad

As mentioned in https://github.com/hashicorp/nomad/issues/1170, if any of the state files are corrupted when a Nomad client starts up, it will never be able to start.

I think that on startup the client should make a best-effort attempt to recover the state of the machine, but if state files are corrupted it would be best to shut down any tasks that are still running and delete all of the allocation state information on disk. That way the client still starts up, and the scheduler can figure out a new home for the tasks.
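
A rough sketch of what that could look like, with hypothetical helpers standing in for the real restore and task-shutdown logic (this is not Nomad's actual client code):

package main

import (
	"log"
	"os"
	"path/filepath"
)

// Hypothetical stand-ins for the client's real restore and task-shutdown code.
func restoreAlloc(dir string) error   { return nil }
func stopRunningTasks(allocID string) {}

// restoreAllAllocs walks the saved allocation state and, when an
// allocation's state cannot be restored, stops any leftover tasks and
// deletes that allocation's state instead of refusing to start the client.
// The scheduler can then place the work somewhere else.
func restoreAllAllocs(stateDir string) error {
	entries, err := os.ReadDir(filepath.Join(stateDir, "alloc"))
	if err != nil {
		return err
	}
	for _, e := range entries {
		allocDir := filepath.Join(stateDir, "alloc", e.Name())
		if err := restoreAlloc(allocDir); err != nil {
			log.Printf("failed to restore alloc %s: %v; discarding local state", e.Name(), err)
			stopRunningTasks(e.Name())
			_ = os.RemoveAll(allocDir)
		}
	}
	return nil
}

func main() {
	_ = restoreAllAllocs("/var/nomad/client")
}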

Labels: theme/client, type/enhancement

Most helpful comment

Another issue today on 0.4.0 where the client did not want to start. This time it is not the JSON files. Does anybody know what could be wrong here?

# /path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

* 2 error(s) occurred:

* alloc_runner snapshot includes a nil allocation
* alloc_runner snapshot includes a nil context

Deleting the Nomad data directory helped again. I kept a copy in case you need some information from it.

All 19 comments

This would be neat! I experience this fairly occasionally when docker hangs and I have to force reboot an instance.

I've built a temporary workaround: health checks for backends that check that docker and nomad are healthy, and, if either fails, let Google Compute Engine automatically destroy and reprovision the instance.
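
A minimal sketch of such a probe (the Nomad address/port and the use of `docker info` are assumptions about a typical single-node setup, not the actual check):

package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"os/exec"
	"time"
)

// Exit 0 only if both the Docker daemon and the local Nomad agent respond.
// An orchestrator (GCE autohealing, in this case) can then recreate the
// instance when the probe keeps failing.
func main() {
	// `docker info` hangs or fails if the daemon is wedged, so bound it.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := exec.CommandContext(ctx, "docker", "info").Run(); err != nil {
		fmt.Fprintln(os.Stderr, "docker unhealthy:", err)
		os.Exit(1)
	}

	// The Nomad agent should answer its own HTTP API.
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://127.0.0.1:4646/v1/agent/self")
	if err != nil {
		fmt.Fprintln(os.Stderr, "nomad unhealthy:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintln(os.Stderr, "nomad unhealthy: status", resp.StatusCode)
		os.Exit(1)
	}
}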

I would be interested to know if you continue seeing this after 0.4.0 comes out because of #1357.


+1 for this feature.

Occasionally I see this issue with Nomad 0.4.0

# /path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input

It turns out that some state.json files are empty:

# /...path.../nomad version
Nomad v0.4.0
# find /...nomad_directory.../ -name state.json | xargs ls -la
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/state.json
-rw------- 1 root root 2760 Aug  8 12:01 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/task-a519f3bf481a344e286bbdab7d3cfe5f/state.json
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/state.json
-rw------- 1 root root    0 Aug  8 19:52 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/task-e71f891226c0be65a6a1d38f954cc4b8/state.json

Deleting /nomad_directory/ helped to overcome the issue.

Not sure what caused this. It might have been the shutdown of the nodes. There was no shortage of disk space.

+1
Experiencing the same problem.

@flegler is that usually after a hard server crash?

@a86c6f7964 I also see this with kill -9. It should be simple to do atomic writes - write to a tmp file, then do a rename operation - since renaming a file on Linux (and most fs implementations) is an atomic operation.

Ya, that is what #1357 was supposed to do. It just doesn't do an fsync, which is why I was asking if it was on a server crash. Makes me think we missed something.
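
For reference, the usual pattern looks roughly like this (a generic sketch, not the actual Nomad persistence code): write to a temp file in the same directory, fsync it, rename it over the target, then fsync the directory so the rename itself survives a crash.

package main

import (
	"os"
	"path/filepath"
)

// atomicWrite writes data to path so a reader (or a restart after a crash)
// sees either the old contents or the complete new contents, never a
// truncated file.
func atomicWrite(path string, data []byte) error {
	dir := filepath.Dir(path)
	tmp, err := os.CreateTemp(dir, ".state-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename has happened

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	// Flush the file contents to disk before making them visible.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rename is atomic on Linux within the same filesystem.
	if err := os.Rename(tmp.Name(), path); err != nil {
		return err
	}
	// Sync the directory so the new directory entry is durable too.
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

func main() {
	_ = atomicWrite("/tmp/state.json", []byte(`{"ok":true}`))
}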

@a86c6f7964 I am not sure, but it is possible that the nodes were shut down without stopping Nomad first.

Another issue today on 0.4.0 where the client did not want to start. This time it is not the JSON files. Does anybody know what could be wrong here?

# /path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

* 2 error(s) occurred:

* alloc_runner snapshot includes a nil allocation
* alloc_runner snapshot includes a nil context

Deleting the Nomad data directory helped again. I kept a copy in case you need some information from it.

Another variation today with task runner:

/path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

* 1 error(s) occurred:

* task runner snapshot include nil Task

This is happening in a cloud environment where services are usually not shut down before nodes/machines are shut down.

@flegler Yes it would be awesome if you could post those up

For my initial work on this issue I'm going to do the following:

  • Handle restoring nil contexts properly, as it's not an error to restore a nil context
  • Add some sanity checks around state saving/restoring (see the sketch below)
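
For illustration, a restore-time check might look roughly like this (hypothetical snapshot struct and field names, not the actual alloc_runner code):

package client

import "errors"

// Hypothetical stand-ins for the structs the client persists.
type Allocation struct{}
type AllocDir struct{}
type Context struct{}

// allocRunnerSnapshot mirrors the kind of state that gets written to disk.
type allocRunnerSnapshot struct {
	Alloc    *Allocation
	Context  *Context
	AllocDir *AllocDir
}

// validate rejects snapshots that would otherwise surface later as the
// "nil allocation" / "nil alloc dir" errors seen in this thread. A nil
// context is treated as recoverable and rebuilt rather than failing restore.
func (s *allocRunnerSnapshot) validate() error {
	if s.Alloc == nil {
		return errors.New("snapshot includes a nil allocation")
	}
	if s.AllocDir == nil {
		return errors.New("snapshot includes a nil alloc dir")
	}
	return nil
}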

If we determine there are unavoidable environmental factors that could lead to state files being truncated or corrupted at rest (e.g. unclean shutdowns and restarts), then we have a plan for how to deal with them gracefully. However, we'd like to ensure there are no bugs with persisting state before we focus on defensive mitigations that could cover up solvable bugs.

Dropped the milestone but keeping this open so that if anyone hits an issue we can track it here. 0.5.0 fixes some potential causes of corruption and adds sanity checks. Hopefully those were all the issues!

Hey guys,

I also got this issue in v0.4.1; the task state.json is empty.
-rw-------. 1 root root 0 Nov 28 06:47 /var/lib/nomad/client/alloc/5074d3ec-e4b8-9477-46a0-fa187300f84c/task-86a1b907d54bf7010394bf316e183e67/state.json

systemd start log:
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal systemd[1]: Started Nomad.
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal systemd[1]: Starting Nomad...
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal nomad[3624]: Loaded configuration from /etc/nomad/client.hcl
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal nomad[3624]: ==> Starting Nomad agent...
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal nomad[3624]: ==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal nomad[3624]: * failed to decode state: unexpected end of JSON input
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal systemd[1]: nomad.service: main process exited, code=exited, status=1/FAILURE
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal systemd[1]: Unit nomad.service entered failed state.
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal systemd[1]: nomad.service failed.

@yangzhares Any chance you can upgrade to 0.5.0? That's where some fixes and sanity checks were introduced to address this.

@schmichael I am encountering something related to this on Nomad v0.5.1

Nomad starts just fine.

[root@centos-7-cl1 ~]# systemctl status nomad
● nomad.service - Nomad Agent
   Loaded: loaded (/usr/lib/systemd/system/nomad.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-12-16 20:54:42 UTC; 1s ago
 Main PID: 10955 (sh)
   Memory: 2.4M
   CGroup: /system.slice/nomad.service
           ├─10955 /bin/sh -c /sbin/nomad agent -config /etc/nomad  >> /var/log/nomad/nomad.log 2>&1
           └─10956 /sbin/nomad agent -config /etc/nomad

Dec 16 20:54:42 centos-7-cl1 systemd[1]: nomad.service holdoff time over, scheduling restart.
Dec 16 20:54:42 centos-7-cl1 systemd[1]: Started Nomad Agent.
Dec 16 20:54:42 centos-7-cl1 systemd[1]: Starting Nomad Agent...

I can schedule jobs onto my clients; however, these messages still appear on the clients.

$ cat /var/log/nomad/nomad.log
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input
    Loaded configuration from /etc/nomad/client.hcl
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input
    Loaded configuration from /etc/nomad/client.hcl
==> Starting Nomad agent...

If I run these commands, I can stop the errors from appearing.

systemctl stop nomad
cd /opt/nomad/client/alloc/
rm -rf *
cd /opt/nomad/alloc/
rm -rf *
systemctl restart nomad

It would be really beneficial if the logs had a timestamp.

Same in nomad 0.5.4

Loaded configuration from /etc/nomad.d/nomad.hcl, /etc/nomad.d/nomad_client.hcl
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

* 2 error(s) occurred:

* alloc_runner snapshot includes a nil allocation
* alloc_runner snapshot includes a nil alloc dir

rm -R /opt/nomad/client/alloc/ && systemctl start nomad

I'm closing this as 0.6 substantially changes the state persistence code. Nomad now uses a single boltdb file instead of a tree of JSON files. This made atomic state changes much easier and hopefully fixed corrupt state issues (along with some other minor improvements to persistence code in 0.6).
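
For illustration, writes into a single boltdb file are transactional, along these lines (bucket, key, and path names here are made up, not Nomad's actual schema):

package main

import "github.com/boltdb/bolt"

// saveAllocState persists an allocation's serialized state in one
// transaction: either the whole write lands or none of it does, so a crash
// can no longer leave a half-written state file behind.
func saveAllocState(db *bolt.DB, allocID string, state []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("allocations")) // bucket name is illustrative
		if err != nil {
			return err
		}
		return b.Put([]byte(allocID), state)
	})
}

func main() {
	db, err := bolt.Open("/var/nomad/client/state.db", 0600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	_ = saveAllocState(db, "5612e3fc-078d-0a08-049a-b8169d50920c", []byte(`{"ok":true}`))
}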

I've also added more verbose logging when state issues are encountered in #2763

Please don't hesitate to open new issues if state restore bugs continue in 0.6!

(I opted against trying to write more sophisticated cleanup/recovery code at this time as it seems more likely to cause more problems than it fixes until we know the exact form new persistence issues take.)

Similar problems on:

$ nomad  -version
Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)

Here's a little script that helps restore the node quickly.

Please think twice before you run this on your box! Especially on production nodes.

#!/bin/bash

mount_points=$(mount | grep /var/nomad/alloc | awk '{print $3}')

echo "Stopping Nomad..."
systemctl stop nomad

echo "Kill processes using /var/nomad/alloc/..."
for pid in $(lsof | grep '/var/nomad/alloc' | awk '{print $2}' | sort -u)
do
    kill -TERM "$pid"
done

nomad_procs=$(pgrep -fc /usr/local/bin/nomad)
if [[ "$nomad_procs" == "0" ]]
then
    for mount_point in $mount_points
    do
        echo "Unmounting $mount_point"
        umount "$mount_point"
    done
fi

echo "Removing /var/nomad/alloc and /var/nomad/client..."
rm -r /var/nomad/alloc
rm -r /var/nomad/client

echo "Starting Nomad"
systemctl start nomad