Elasticsearch: Snapshot lifecycle management

Created on 5 Feb 2019 · 16 comments · Source: elastic/elasticsearch

ILM has been included in Elasticsearch, which allows us to manage the lifecycle
of an index; however, this lifecycle management does not currently include
periodic snapshots of the index.

In order to provide a full replacement for other cluster periodic management
tools out there (such as Curator), we should add snapshot management to
Elasticsearch.

Ideally this would fall under the same sort of management as ILM provides; the
difference, however, is that snapshots are multi-index, whereas index lifecycle
policies are applied to a single index (and all actions are executed on a single
index).

We need a way of specifying periodic and/or scheduled snapshots of a given set
of indices using a specific repository, perhaps something like this (all of the
API is made up):

PUT /_slm/policy/snapshot-every-day
{
  // Run this every day at 2:30am
  "schedule": "0 30 2 * * ?",

  // What the snapshot should be named, supporting date-math
  "name": "<production-snap-{now/d}>",

  // Which snapshot repository to use for the snapshot
  "repository": "my-s3-repository",

  // "config" is a map of all the options that the regular snapshot API takes
  "config": {
    "indices": ["foo-*", "important"],
    "ignore_unavailable": true,
    "include_global_state": false
  }
}

Elasticsearch will then manage taking snapshots of the given indices for the
repository on the schedule specified. The status of the snapshots would have to
be stored somewhere, likely in an index (.tasks perhaps?)
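
For illustration, retrieving a policy could also surface its stored status; something like this (again made up, and the response fields here are only a guess at what we might track):

GET /_slm/policy/snapshot-every-day

{
  "snapshot-every-day": {
    "version": 1,
    "modified_date": "2019-02-05T10:00:00.000Z",
    "policy": { ... },

    // When the policy will next fire, plus the outcome of the most recent runs
    "next_execution": "2019-02-06T02:30:00.000Z",
    "last_success": { "snapshot_name": "production-snap-2019.02.05" },
    "last_failure": null
  }
}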

Some other things that would be nice (but not required) to support:

  • Snapshots every N minutes, where the N-minute countdown only starts from the
    completion of the previous snapshot (for example, with a snapshot every 30
    minutes that takes 4 minutes to complete, one snapshot would start at 00:00
    and the next at 00:34, i.e. 30 minutes after the completion of the previous
    snapshot).
  • Retention of snapshots. Specifying something like "max_count": 10, meaning
    keep the last 10 snapshots, or "max_age": "7d", meaning keep a week's worth
    of snapshots; deletion of the old snapshots would be managed by ES (a sketch
    follows this list).
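
A retention block on the policy itself might then look like this, reusing the hypothetical field names from the list above (purely a sketch):

PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": { ... },

  // Hypothetical retention rules: keep at most 10 snapshots, none older than 7 days
  "retention": {
    "max_count": 10,
    "max_age": "7d"
  }
}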

Task Checklist

  • [x] Basic CRUD for snapshot lifecycle policies (@dakrone) #39795
  • [x] Correctly handle updates and deletes to snapshot lifecycle policies (@dakrone) #40062
  • [x] Issue snapshot request when job is triggered (@dakrone) #40383
  • [x] Persist debugging and error information about making snapshot requests (@gwbrown) #40619
  • [x] Persist a history of successful/failed snapshots in an ES index (@gwbrown) #41707
  • [x] Add validation for snapshot lifecycle policies (check repo exists and pass its validation, check snapshot name doesn't break S3, etc) (@dakrone) #40654
  • [x] Hook into the existing ILM stop/start so users can perform maintenance (@dakrone) #40871
  • [x] Change URI paths to be under /_slm/policy (currently GET|PUT|DELETE /_ilm/snapshot/<policy-id>) (@dakrone) #41320
  • [x] Add API to execute a snapshot for a policy now rather than waiting for the scheduled time (@dakrone) #41038 (see the example after this checklist)
  • [x] Display "the next time this policy will execute is: ____" with the success/failure/info when retrieving policy (@dakrone) #41221
  • [x] Ensure that SLM has a dedicated cluster privilege and that its actions are separate from ILM actions (@dakrone) #41607
  • [x] Documentation (@dakrone) #41510
    • [x] Package level javadocs (@jbaiera) #43535
    • [x] Document what security privileges are necessary for using SLM (@dakrone) #43708
  • [x] High Level Rest Client Support (@dakrone) #41767
  • Testing
    • [x] Add integration test for SLM x-pack security role (@dakrone) #42678
    • [x] Manual testing (everyone)
  • Retention
    • [x] Add support for _meta in CreateSnapshotRequest (@gwbrown) #41281
    • [x] Send _meta associating each snapshot with the policy that created it (@gwbrown) #43132
    • [ ] Implement retention (see #43663)
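
For reference, the execute-now API from the checklist might be invoked like this (a sketch; the response shape is an assumption):

POST /_slm/policy/snapshot-every-day/_execute

// Returns the name of the snapshot that was started, e.g.:
{
  "snapshot_name": "production-snap-2019.02.05"
}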

Labels: :Core/Features/ILM+SLM, :Distributed/Snapshot/Restore, >feature, release highlight, v7.4.0, v8.0.0

All 16 comments

Pinging @elastic/es-core-features

Pinging @elastic/es-distributed

@dakrone asked the cloud team to discuss our requirements, with a view to potentially replacing our snapshot logic down the road. I'm going to talk about how it currently works and then very quickly summarize in a list of (high level) requirements at the bottom:

  • We take snapshots at user-specified periods

    • actually an oft-requested user requirement (that we haven't implemented, but should definitely be part of any "blank sheet of paper" approach) is the ability to make it a schedule, e.g. "*:30" instead of "every 30 minutes"

  • Our retention policy is quite (over!) complicated; a simplified summary is:

    • The user decides how many snapshots to keep (default is 100) and the snapshot period (30 mins)

    • If a snapshot fails or partially fails, a minimum number of good snapshots is retained, regardless of any other considerations

    • (there are some other retention params, but these are not currently exposed to the cluster admin)

    • Purging of "expired" snapshots is handled as follows:



      • After a snapshot has been taken (ie at the user specified interval), we look for the oldest snapshots that do not satisfy the "retain" rules and delete 1+ of them up to an admin-configurable maximum, which defaults to 2





        • (the reason for this max number is because you can't cancel a snapshot deletion, which results in some complications described below)



        • _(there's some logic to increase this number where it is safe to do so that I haven't looked at yet!)_






  • The previous two operations (snapshot and delete) are performed by (a sidecar attached to) one of the nodes
  • One of the complications is that when a user (or admin or bot) performs a configuration change to an Elasticsearch cluster (handled by a service called the "constructor"), the constructor takes a "safety" snapshot first.

    • If the "snapshotter service" described above is taking a snapshot, the "constructor" will wait for that to complete and use that instead.

    • If the "snapshotter service" is in the middle of deleting a snapshot then the "safety snapshot" will fail after a few (3) retries, causing the entire configuration to fail (this is obviously highly undesirable)

    • Before starting, the "constructor service" will write a flag telling the "snapshotter service" to pause

  • We don't currently have any ability to control which indices are included in the snapshot (though that gets asked for a decent amount)

The Cloud requirements I infer:

  • Configurable snapshot schedule and retention policy
  • The ability to pause the SLM process during cluster configuration (see the sketch after this list)
  • The ability to control time spent deleting snapshots (i.e. to avoid conflicts with cluster configuration)
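
As a sketch of the pause requirement: assuming SLM hooks into the existing ILM stop/start (per the checklist above) rather than getting its own endpoints, the constructor could pause and resume background snapshotting like this:

// Pause ILM (and, by extension, SLM) before a cluster configuration change
POST /_ilm/stop

// Poll until background operations have actually stopped
GET /_ilm/status

// Resume once the configuration change is done
POST /_ilm/start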

cc @nordbergm / @paulcoghlan - feel free to add/correct/amend anything you think is useful. Don't reply, just edit the comment directly.

Snapshot Resiliency

One of the major things for me is the snapshot resiliency work we've been doing for the past 6 months or more. This effectively boils down to the challenges that S3's eventual consistency has caused us with corrupt snapshots, and the cool down periods we've had to introduce as a result.

  • Every write operation to the repository (create snapshot, delete snapshot) has to be followed by a cool down period (currently 10 minutes, we might experiment with reducing this to 5), to allow S3 caches to expire between operations.
  • The constructor also adheres to these cool down periods and won't create a safety snapshot until the period has passed (this is still behind a feature flag, but will be enabled once we've worked out a couple of issues).
  • We now use a ZK leader latch to coordinate this as well as shared snapshot task state in ZK. This allows the constructor and snapshot sidecar to coordinate operations against the repository and adhere to the cool down period. The flag @AlexP-Elastic mentioned will be superseded by this once the feature flag is enabled.
  • Because of cool down periods, a strict *:30 schedule is difficult to guarantee (unless you're willing to skip an interval if you run over)

These are very specific Cloud/S3 challenges. I don't believe GCP necessarily has the same issues, because GCS is way more consistent. Still, if we don't consider it, I worry we'll suffer snapshot corruption again.

Access Control

Another area I've been thinking about is access control. Snapshots in Cloud today are controlled by cloud admins, and can't easily be meddled with by the cluster admin. Cluster admins can reduce retention to a minimum of 2 snapshots, and they can disable snapshots if they go through support and understand the risks, etc.

With ILM/SLM it would be good to understand what kind of access the cluster admins would have to configuration and how we could restrict access. In case of disaster we want to be sure the cluster has snapshots and the cluster admin hasn't broken the configuration by accident.

@dakrone Would it be possible to add some metadata argument to the create policy API that would help the user figure out which policy created a snapshot? Two options come to mind:

  1. Policies have a unique name, and each snapshot gets a "created by" field that is either "API_call" or the policy name.
  2. Alternatively, a metadata field with a maximum of X characters that the user fills in as an optional description when creating the policy, and which the policy adds to each snapshot as a metadata field (a sketch of this option follows).
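
A sketch of option 2, assuming the create-snapshot request grows a metadata map (the _meta name below comes from the checklist above; the final field name and shape are open):

PUT /_snapshot/my-s3-repository/production-snap-2019.02.05
{
  "indices": ["foo-*", "important"],

  // Hypothetical: user- or SLM-supplied metadata stored alongside the snapshot
  "_meta": {
    "policy": "snapshot-every-day",
    "description": "nightly production backup"
  }
}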

@yaronp68 interesting suggestion, I do think that would be useful.

@original-brownbear what do you think about us adding something like that to the CreateSnapshotRequest? I'm not sure exactly what the backwards compatibility issues could be (I assume nothing too difficult to work around).

@dakrone @yaronp68

Technically speaking, there is no reason not to add a metadata field to the snapshot in the cluster state and store it in the repository, as far as I can see.
The question is whether it's worth the added complexity, I guess :) I'm not against it, but if we can do it without adding more complexity to the cluster state and repository, that may be better.
=> my question: If you're already planning to "Persist a history of successful/failed snapshots in an ES index", why not just add the metadata for each snapshot to the history in that index?

@dakrone maybe it's possible to persist to a metadata file in the repository and not in cluster state to avoid changes to cluster state

my question: If you're already planning to "Persist a history of successful/failed snapshots in an ES index", why not just add the metadata for each snapshot to the history in that index?

@original-brownbear I believe the idea is that when listing snapshots for a repository, you could then tell which snapshot came from what (manually triggered, triggered via policyA, policyB, etc). We will have something on the other side (for a policy, what's the last snapshot taken), I think the desire was for something the other direction.
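
To make that concrete: listing snapshots could then show the originating policy alongside the rest of the snapshot metadata, roughly like this (response trimmed; the field names are assumed, not decided):

GET /_snapshot/my-s3-repository/_all

{
  "snapshots": [
    {
      "snapshot": "production-snap-2019.02.05",
      "indices": ["foo-1", "important"],
      "state": "SUCCESS",

      // Hypothetical origin marker carried over from the CreateSnapshotRequest
      "metadata": { "policy": "snapshot-every-day" }
    }
  ]
}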

@dakrone maybe it's possible to persist to a metadata file in the repository and not in cluster state to avoid changes to cluster state

@yaronp68 we need to persist at least one end state in the cluster state, because in the event of a snapshot failure, we wouldn't be able to persist it in a metadata file in the repo, because the snapshot failed :) so we have to have a place to have something like "your snapshot failed because of XYZ" for users to see.

@yaronp68

maybe it's possible to persist to a metadata file in the repository and not in cluster state to avoid changes to cluster state

I would rather we not do this, sorry. The repository is currently undergoing some redesign to resolve issues like https://github.com/elastic/elasticsearch/issues/38941.
If we start putting custom blobs in the repo, that's gonna be one more thing to worry about when we make changes there. Plus, the eventually consistent nature of some blob stores like S3 will also create problems for a metadata file/blob that would be read and updated, I would assume?

-> the private index for the snapshot history seems like the safest bet to me still. If that's not an option for some reason the cluster state is still the better option compared to a custom repository blob.

the private index for the snapshot history seems like the safest bet to me still. If that's not an option for some reason the cluster state is still the better option compared to a custom repository blob.

We are planning to store the latest success and failure in the cluster state (only one of each), and store the result for every snapshot invocation into an index for history/alerting purposes.
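
Once every invocation lands in a history index, checking a policy's track record becomes an ordinary search, for example pulling only the failed runs for a given policy to drive alerting (the index name and document fields here are illustrative):

GET /.slm-history*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "policy": "snapshot-every-day" } },
        { "term": { "success": false } }
      ]
    }
  }
}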

I believe the idea is that when listing snapshots for a repository, you could then tell which snapshot came from what (manually triggered, triggered via policyA, policyB, etc). We will have something on the other side (for a policy, what's the last snapshot taken), I think the desire was for something the other direction

I see. In that case I think adding this information to the cluster state (and then as a result to the snapshot metadata we store in the repository) may be an option. In the end, the repository is the only place we can store that metadata to if we want to be able to use it with the snapshot list.

I think adding this information to the cluster state (and then as a result to the snapshot metadata we store in the repository) may be an option.

This is unclear to me; I think the original desire was for something in CreateSnapshotRequest (perhaps an origin String) so when SLM issued the request it could specify the policy name, which is then stored with the snapshot's metadata (just like the list of indices, start time, end time, etc). How does that involve the cluster state?

@dakrone

How does that involve the cluster state?

Sorry that was needlessly confusing :) Just by virtue of how this is implemented we'd have to add that information to the ephemeral cluster state. It's not important for the feasibility though :) -> I'm fine with adding this to the request and then storing it to the snapshot meta in the repo. That we should be able to do in a BwC manner.

I'm fine with adding this to the request and then storing it to the snapshot meta in the repo. That we should be able to do in a BwC manner.

Great! I'll open a separate issue for that so we can track it.

Going to close this as SLM has been merged to master and 7.x and will be in the 7.4 release.

Further work on retention can be found at https://github.com/elastic/elasticsearch/issues/43663

