This is 100% a duplicate of #9984; however, it's pretty old and nobody is likely to notice a comment made on it, so I figured I'd recreate it to give it some attention.
I'd be more than happy to implement this feature, as I found myself wishing it existed for the same reason the OP did all those months ago. However, I'm stuck at a bit of a crossroads when it comes to the actual implementation.
Here's a rundown of my understanding of the situation and work involved in this feature. I'd love some feedback on whether I'm right or wrong on some points, completely talking out of my ass, etc.; as well as some advice on what the ES team would consider the preferred approach:
When a snapshot of an index is taken, it dumps blobs of each shard's data, along with the entirety of the index's metadata (mappings, aliases, etc.), to a blobstore (S3, Azure Storage, etc.).
Later, when restoring, the restoration process effectively recreates the index from the snapshot's metadata and recovers each shard's data from those blobs.
In essence, if an index already existed and gets restored from a snapshot, it gets replaced with a completely different index containing the snapshot's data, one that just happens to have all the same metadata, naming, etc. as the snapshotted one.
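For context, here's roughly what a whole-index restore request looks like today (a minimal sketch; the repository, snapshot, and index names are placeholders):

```python
# Minimal sketch of today's whole-index restore. Repository, snapshot, and
# index names are placeholders.
import requests

resp = requests.post(
    "http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore",
    json={
        "indices": "my-index",          # which indices to restore from the snapshot
        "include_global_state": False,  # don't overwrite cluster-wide metadata
    },
)
print(resp.json())
```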
This means that, fundamentally, to restore specific shard(s), I just need to create "new" shards for the ones I want to restore, update the index's routing to use those new shards, and keep the shards I'm not restoring untouched.
The RestoreService is very intimately coupled with all of the conveniences that the snapshot restore feature gives you, such as renaming the index, changing the number of replicas, etc., which (IMO) do not apply to the operation of restoring specific shards. Furthermore, some of these conveniences are only possible because a restored index is really just a brand new index.
In the scenarios I can imagine for restoring specific shards, one would just want a no-bullshit replacement of the corrupt shard(s), ASAP, without affecting the rest of the index's properties.
Inside o.e.s.RestoreService.restoreSnapshot$ClusterStateUpdateTask.execute (at lines 266-268; see also lines 341-347) there's an ignoredShards Set that gets generated during the restore operation, which is used when the partial flag is enabled for a snapshot restore. The partial flag tells ES, "if you have any trouble restoring a shard from the snapshot for this restore operation, just create an empty one instead of failing the restore". These "troublesome" shards are put into ignoredShards.
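For reference, the existing partial flag is just another field on the stock restore request (a short sketch; names are placeholders):

```python
# Sketch of a restore with the existing "partial" flag: shards that can't be
# restored from the snapshot come back as empty shards instead of failing the
# whole restore. Names are placeholders.
import requests

requests.post(
    "http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore",
    json={"indices": "my-index", "partial": True},
)
```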
This means that I am left with two choices:

1. Modify RestoreService, RestoreRequest, RestoreSnapshotRequest, TransportRestoreSnapshotAction, and probably others to take an "only_restore_shards": [int] field.
If this new field is given, it prohibits you from performing metadata changes (by giving the operator a stern talking-to, via an error message, if they try to do metadata updates in a shard-only restore), which should hopefully simplify the implementation a little bit.
Furthermore, an only_restore_shards request implies a partial restore, with the major caveat that the meaning of a partial restore gets changed (see the point under 50/50). After this change, in a partial restore, if the index already existed and there are shards in ignoredShards, they're actually ignored as opposed to replaced with blanks; i.e., shard IDs that are listed in ignoredShards and exist in the index get reused. (Keep reading and this will all make sense. A sketch of what such a request could look like follows after this list.)

Benefits:

- The ignoredShards Set actually works as expected: you can fill it with the difference between the set of all shard numbers in the index and the set of shard numbers you want to restore (i.e., those in only_restore_shards), meaning it ignores (reuses) the existing, unaffected ones, and the code makes a bit more intuitive sense.
- An only_restore_shards operation implies a "partial" restore in the intuitive sense, which ties into the next point.

50/50:

- The semantics of a partial restore get inverted: rather than getting a "partially complete" index restored, you're "partially restoring" the index. (I want to say that this leans a bit towards a drawback, given that ES has had this semantic for a while now, buuuuuuuuut it's certainly more in line with what you would intuitively expect a "partial" restore to mean.)
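Here's a rough sketch of what a choice-1 request could look like. To be clear, only_restore_shards is hypothetical and does not exist in Elasticsearch today, and the shard numbers are made up:

```python
# Hypothetical choice-1 request: "only_restore_shards" is a proposed field,
# not an existing Elasticsearch API. Shard IDs and names are made up.
import requests

requests.post(
    "http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore",
    json={
        "indices": "my-index",
        "partial": True,                # implied by only_restore_shards
        "only_restore_shards": [3, 7],  # shard IDs to pull from the snapshot
    },
)

# The ignoredShards set would then just be the complement of the requested set:
all_shards = set(range(8))                 # e.g. an index with number_of_shards = 8
only_restore_shards = {3, 7}
ignored_shards = all_shards - only_restore_shards  # existing shards to reuse untouched
print(sorted(ignored_shards))              # [0, 1, 2, 4, 5, 6]
```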
2. Create a RestoreShardsService/RestoreShardsRequest/etc./etc./etc. along with a new REST endpoint such as /_snapshots/etc/etc/_restore_shards, as well as a corresponding TransportAction. (A sketch of what a request to such an endpoint could look like follows after this list.)

Benefits:

- Doesn't require touching the existing "an ignored shard is really just replaced with an empty shard" logic.
- Doesn't change the existing partial semantics (in case anyone out there uses the current partial semantics to their advantage somehow?).

50/50:

- Doesn't make the meaning of a partial restore more intuitive. On the other hand, this might not be necessary since you could now always cherry-pick the shards to restore, meaning a full index restore (which is the only way to do it as is) would only get used in the worst case that your index is totally fucked. In which case, if the shards in the snapshot are bad, you're proper fucked anyway.
- A whole new service plugged into the ClusterService seems kinda heavy for something that should really be an extension of the RestoreService anyway, but on the other hand RestoreService is a pretty damn big chunk of code.
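And a rough sketch of what a choice-2 request could look like; the endpoint and request body are entirely hypothetical:

```python
# Hypothetical choice-2 request: the _restore_shards endpoint is only a
# suggestion and does not exist in Elasticsearch.
import requests

requests.post(
    "http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore_shards",
    json={
        "index": "my-index",
        "shards": [3, 7],  # only these shard copies get recovered from the snapshot
    },
)
```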
Thanks for taking the time to read this wall of text, and apologies if it's in poor form to self-bump by making a new issue!
@imotov could you take a look at this please?
@iostat thank you for the analysis of the issue. I agree that it would be a valuable feature that would help some users who lost part of their index due to shard corruption. I also agree that RestoreService is already more complicated than we would like it to be. The good news is that a lot of the complexity in RestoreService comes from managing the restore lifecycle, and we are currently working on extracting some of this lifecycle-management logic into a separate TaskManagement service. So, my hope is that when this work is complete, it will be possible to split RestoreService into more manageable pieces. Until then, I would definitely not recommend using this task as a first foray into Elasticsearch, because it touches some of the most complex parts of Elasticsearch, such as shard allocation and the recovery/restore processes.
The placement of the new logic that you described is one issue. An even bigger issue, in my opinion, is that the metadata in a snapshot might not be compatible with the metadata of an existing index. That means it might or might not be OK to restore a few shards. When we restore all shards (even in the case of a partial restore), we can wipe out the old metadata and replace it with metadata from the snapshot without thinking about it, because it's guaranteed that the data in the existing index will be completely gone by the end of the process. When we restore only some shards, it's much trickier. What would you do if, during the restore, you discovered that the settings and/or mappings of the new index are different? In the case of mappings, we have the mapping-merging logic that we could reuse to a large degree. In the case of settings, there is no such mechanism. So, we would need to make a determination about what to do.
Some settings are ok to merge. For example, if an index in a snapshot has refresh_interval set to 10 sec and the corresponding index in the cluster has refresh_interval of 1 sec and everything else is the same, it's perfectly fine to partially restore the index. The same goes for the number_of_replicas setting. However, if the number_of_shards setting is different we definitely shouldn't restore. It gets even more complicated with analysis settings. For example, different settings for the same analyzer can break the restored index, but if we are just adding a new analyzer, it's perfectly fine to merge the settings.
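To make that distinction concrete, here is an illustrative sketch of such a settings check; the classification below is just an example, not an existing Elasticsearch mechanism:

```python
# Illustrative settings-compatibility check for a hypothetical shard-level
# restore. The classification of settings is an example, not how Elasticsearch
# actually decides anything today.
MERGEABLE = {"index.refresh_interval", "index.number_of_replicas"}
MUST_MATCH = {"index.number_of_shards"}

def can_partially_restore(snapshot_settings: dict, live_settings: dict) -> bool:
    """Allow a partial restore only if every differing setting is known to be safe."""
    for key in set(snapshot_settings) | set(live_settings):
        if snapshot_settings.get(key) == live_settings.get(key):
            continue
        if key in MUST_MATCH:
            return False  # e.g. a different number_of_shards can never be reconciled
        if key not in MERGEABLE:
            return False  # unknown difference (e.g. analysis settings): be conservative
    return True

# Only refresh_interval differs here, so a partial restore would be allowed.
print(can_partially_restore(
    {"index.number_of_shards": "8", "index.refresh_interval": "10s"},
    {"index.number_of_shards": "8", "index.refresh_interval": "1s"},
))
```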
@imotov: Thanks for the reply! I definitely agree that it's a behemoth of a task, especially for someone like me getting their feet wet with the codebase, but I thought it was at least worth starting a discussion for.
On the point of mismatched settings: I think it's fair to say that if the number of shards in the snapshot doesn't match the number of shards in the index, you shouldn't be able to restore that snapshot, period. As far as I'm aware, there's no way to reshard an index without creating a new one anyway, so in that case the snapshot being restored from is effectively from a completely different index and shouldn't be restored from in the first place. Likewise for settings such as the routing hash function, analyzer settings, etc. You could apply this to the extreme and even say that if the destination index was created after the snapshot was taken, then the restore request should be rejected completely.
I don't really see this feature as an absolute replacement for the existing restore functionality; it should really be available as a convenience if anything, for instance in the case that 1) regular backups are being taken, 2) out of nowhere a shard got corrupted for both the primary and the replica, and 3) no "major" settings changes occurred since the snapshot was taken.
As far as the TaskManagement service, any way to track the progress of that? I'd love to revisit this if/when that's implemented, but if it's a long way off (next major release or something like that), perhaps it's worth writing this feature as a separate module and then merging its functionality into the original RestoreService when that's all refactored.
You could apply this to the extreme and even say that if the destination index was created after the snapshot was taken, then the restore request should be rejected completely.
The requirement of a snapshot being created after the creation of the restored index is irrelevant to the success or failure of the partial restore. I think what you are trying to say is that we could reject the restore request completely if the index being restored is not the same index that was snapshotted. Yes, we could make this a requirement, but even then there are plenty of scenarios for a snapshot and an index to diverge. So, I don't think it's practical to make a determination about the feasibility of a restore simply by looking at the historical origins of the indices.
I don't really see this feature as an absolute replacement for the existing restore functionality; it should really be available as a convenience if anything.
Yes, and this is exactly why I wouldn't want to rush this feature in.
... no "major" settings changes occurred since the snapshot was taken.
Currently, it's hard to determine which settings change is "major" and which one is "minor".
As far as the TaskManagement service, any way to track the progress of that?
You can keep track of the task management development progress on the task management meta issue #15117.
FWIW I would love this feature and found this issue while trying to cope with an unexpected outage where two nodes in different racks were lost and 13 out of 128 shards went missing.
Because we had a snapshot from the day before, we took the following steps (note: this may not be the best way to do this) to restore only a small subset of the shards: … write.lock, and ensured that the _state/state-*.st file contained the same UUID by copying it from an existing shard.
Huge caveat: it worked for us on 2.4.x and YMMV, but it seemed relatively painless (though time-intensive) and much faster than restoring the full snapshot and then reindexing the missing writes.
This feature request is an interesting idea but since its opening we have not seen enough feedback that it is a feature we should pursue. We prefer to close this issue as a clear indication that we are not going to work on this at this time. We are always open to reconsidering this in the future based on compelling feedback; despite this issue being closed please feel free to leave feedback on the proposal (including +1s).
I would like this feature as well; it seems like the task management service is finished for 6.0.0. I'm aware of the complexities mentioned, but for clusters holding large amounts of data it could potentially be very useful. Thanks!
+1
We have a similar case: a database with 50+ terabytes of data.
For indices that do not have replication, the backup recovery process takes a lot of time. Partial recovery of just the broken shards would speed this process up a lot.
+1
+1
+1
This would be a really nice feature to see! We have a logging ("observability") cluster where we use un-replicated shards for longer-term data storage.
We take backups of this data because we classify it as "nice to have but not critical", and we have been in situations where one or more primary shards across one or more indexes were lost (EC allocator forced movements, for example).
It would be nice to selectively restore only the missing shards, which would greatly reduce restore traffic/time.
It could be specified manually (as in index-y, shard-0: cumbersome but functional) or introduced as a restore option ("restore missing shards only to existing indexes": nicely streamlined but less configurable); either would be good as a first cut.
Thanks!
@MrBones757 the feature you're looking for is searchable snapshots, which adds support for resilient zero-replica shards; the resilience is automatic too, so you don't need any manual intervention to restore a lost shard.
I should also point out that today Elasticsearch already restores only the data that is missing, so single-shard restores would not offer any further reduction in restore traffic anyway.
+1