Consul: Add snapshot/restore to outage recovery guide

Created on 8 Dec 2016 · 6 Comments · Source: hashicorp/consul

We definitely need to mention snapshot/restore on here. Things to cover:

  1. Mention Consul Enterprise and the snapshot agent.
  2. Show an example disaster recovery restore and mention how it works when restoring into a fresh cluster.
  3. Mention how ?stale can be used to snapshot even if there's no leader, and how consul snapshot inspect can help you figure out which snapshot is better.
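
A minimal sketch of the third item, assuming the `-stale` flag of `consul snapshot save` and the key/value header printed by `consul snapshot inspect` (a line starting with `Index`). The function names and the server addresses are hypothetical, and the exact inspect output may differ between Consul versions, so treat the parsing as an assumption rather than a stable interface.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of Consul.

# Save a snapshot from one server. -stale permits a non-leader to answer,
# so this may still work when the cluster has no leader.
save_stale_snapshot() {
  local addr="$1" out="$2"
  consul snapshot save -stale -http-addr="$addr" "$out"
}

# Print the Raft index reported by `consul snapshot inspect`.
# Assumes a header line of the form "Index <number>"; only the first
# match is used, because later table rows may also start with "Index".
snapshot_index() {
  consul snapshot inspect "$1" | awk '$1 == "Index" {print $2; exit}'
}

# Print the path of the snapshot with the highest Raft index.
newest_snapshot() {
  local best="" best_idx=-1 f idx
  for f in "$@"; do
    idx="$(snapshot_index "$f")"
    if [ -n "$idx" ] && [ "$idx" -gt "$best_idx" ]; then
      best="$f"
      best_idx="$idx"
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}
```

For example, one could call `save_stale_snapshot` once per server with a different output file, then pass all the files to `newest_snapshot` to pick the one to restore. A higher index only means the snapshot is more recent, so inspecting the candidates by hand is still worthwhile.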

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/consul-tool/_TitQGHdRSA/18mZiFnJCQAJ

typdocs

All 6 comments

It would be good to also discuss how to use the automatically created snapshots (they seem to be created regularly, every few hours).

cd /var/lib/consul/raft/snapshots/2-146226-1481465408058
[in snapshot directory]
# sha256sum * >SHA256SUMS
# tar -czf /tmp/recreated.snap *
# consul snapshot restore -token=... /tmp/recreated.snap
Restored snapshot

I'm exploring the new snapshot feature as a backup mechanism to mitigate accidental data loss.

I noticed that in the snapshot docs it says:

Restores involve a potentially dangerous low-level Raft operation that is not designed to handle server failures during a restore. This operation is primarily intended to be used when recovering from a disaster, restoring into a fresh cluster of Consul servers.

Can you add some clarification of what exactly that means? Conventional wisdom about database backups is that you should exercise them regularly. If we were to use the restore operation on our most recent snapshot weekly would we be at risk of data loss?

Probably unrelated to this issue, but is there some feasible mechanism that could be added to restore only certain keys? For example, imagine that 1000 keys were deleted 6 hours ago and many other keys were modified/updated since then. Would there be a way to restore only the 1000 deleted keys? Or do we need to keep our own separate dump of kv pairs and restore them through the normal /v1/kv API?

Edit: it looks like once we upgrade we can use kv import and kv export to replace our JSON dumping process, but they use the same /v1/kv API, so they will perform at the same speed.
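
A rough sketch of that kv export / kv import round trip, assuming the `consul kv export [PREFIX]` and `consul kv import @FILE` forms of the CLI. The function names and the prefix are hypothetical. As far as I know, an import writes every key in the file, so it would also overwrite keys modified since the dump; restoring only the deleted keys would require filtering the JSON first, which is not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of Consul.

# Dump all KV entries under a prefix to a JSON file.
backup_kv() {
  local prefix="$1" out="$2"
  consul kv export "$prefix" > "$out"
}

# Re-import a previously exported JSON file through the normal KV API.
# Every key in the file is written, including ones changed since the dump.
restore_kv() {
  local file="$1"
  consul kv import "@$file"
}
```

Usage could look like `backup_kv app/ /tmp/app-kv.json` followed later by `restore_kv /tmp/app-kv.json`, where both the prefix and the path are placeholders.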

Can you add some clarification of what exactly that means? Conventional wisdom about database backups is that you should exercise them regularly. If we were to use the restore operation on our most recent snapshot weekly would we be at risk of data loss?

There's a little more detail in the comment here. The restore is implemented by having the leader take on the state of the snapshot and then bump the Raft index, which creates a "hole" in the Raft log and causes the snapshot to go out to its followers. This means that the server commits the restore before replicating anything to its followers, which is weird from a Raft perspective and could leave the cluster in an incorrect state if the leader were to die during the restore operation. If that happened, you might have to blow away your server state and do the restore into a fresh cluster to recover. This should be a very unusual case to hit in practice (and the restore API returns success only once the followers have replicated the snapshot itself), but we wanted to fully disclose this possibility.
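
Given that risk, one possible way to exercise backups regularly is to restore into a disposable cluster instead of the live one. The sketch below is only an illustration of that idea: `restore_drill` and the `PROD_ADDR` variable are hypothetical names, and the only Consul call used is `consul snapshot restore` with the standard `-http-addr` option.

```shell
#!/usr/bin/env bash
# Sketch only: restore_drill and PROD_ADDR are hypothetical names.

# Restore a snapshot into a scratch cluster for a backup drill.
# Refuses to run against the address recorded in PROD_ADDR, so a drill
# cannot overwrite the production cluster by accident.
restore_drill() {
  local snap="$1" addr="$2"
  if [ "$addr" = "${PROD_ADDR:-}" ]; then
    echo "refusing to restore into production address $addr" >&2
    return 1
  fi
  consul snapshot restore -http-addr="$addr" "$snap"
}
```

The target could be a throwaway agent started with `consul agent -dev` or a small test cluster. An exact string match on the address is a weak guard, so it complements rather than replaces care about which cluster is being targeted.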

Where is this outage recovery guide?
The link mentioned in the Google Groups thread returned a 404.

The thing that's still missing from
https://www.consul.io/docs/guides/outage.html is the simple restore of a
snapshot. On our cluster we take and save regular snapshots; of course
Consul takes them as well. Snapshots can easily be restored to build a
cluster even from scratch.

Our disaster restore process is at
https://github.com/drud/vault-consul-on-kube/blob/master/troubleshooting.md#complete-loss-and-rebuild-with-recovery-using-a-consul-snapshot

It's worked fine in testing and development.

-Randy

On Fri, May 26, 2017 at 9:17 AM, James Phillips notifications@github.com
wrote:

@richard-mauri https://www.consul.io/docs/guides/outage.html
