This is 100% a duplicate of #9984; however, it's pretty old and nobody is likely to notice a comment made on it, so I figured I'd recreate it to give it some attention.
I'd be more than happy to implement this feature, as I found myself wishing it existed for the same reason the OP did all those months ago. However, I'm stuck at a bit of a crossroads when it comes to the actual implementation.
Here's a rundown of my understanding of the situation and work involved in this feature. I'd love some feedback on whether I'm right or wrong on some points, completely talking out of my ass, etc.; as well as some advice on what the ES team would consider the preferred approach:
When a snapshot of an index is taken, it dumps blobs of each shard's data, along with the entirety of the index's metadata (mappings, aliases, etc.), to a blobstore (S3, Azure Storage, etc.).
Later, when restoring, the restoration process effectively recreates the index from the snapshot's metadata and recovers each shard's data from those blobs.
In essence, if an index already existed and gets restored from a snapshot, it gets replaced with a completely different index containing the snapshot's data, one that just happens to have all the same metadata, naming, etc. as the snapshotted one.
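For context, here's roughly what a whole-index restore request looks like today (a minimal sketch; the repository, snapshot, and index names are placeholders):

```python
# Minimal sketch of today's whole-index restore. Repository, snapshot, and
# index names are placeholders.
import requests

resp = requests.post(
    "http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore",
    json={
        "indices": "my-index",          # which indices to restore from the snapshot
        "include_global_state": False,  # don't overwrite cluster-wide metadata
    },
)
print(resp.json())
```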
This means that, fundamentally, to restore specific shard(s), I just need to create "new" shards for the ones I want to restore, update the index's routing to use those new shards, and keep the shards I'm not restoring untouched.
The RestoreService is very intimately coupled with all of the conveniences that the snapshot restore feature gives you, such as renaming the index, changing the number of replicas, etc., which (IMO) do not apply to the operation of restoring specific shards. Furthermore, some of these conveniences are only possible because a restored index is really just a brand new index.
In the scenarios I can imagine for restoring specific shards, one would just want a no-bullshit replacement of the corrupt shard(s), ASAP, without affecting the rest of the index's properties.
Inside o.e.s.RestoreService.restoreSnapshot$ClusterStateUpdateTask.execute (at lines 266-268; see also lines 341-347) there's an ignoredShards Set that gets generated during the restore operation, which is used when the partial flag is enabled for a snapshot restore. The partial flag tells ES, "if you have any trouble restoring a shard from the snapshot for this restore operation, just create an empty one instead of failing the restore". These "troublesome" shards are put into ignoredShards.
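For reference, the existing partial flag is just another field on the stock restore request (a short sketch; names are placeholders):

```python
# Sketch of a restore with the existing "partial" flag: shards that can't be
# restored from the snapshot come back as empty shards instead of failing the
# whole restore. Names are placeholders.
import requests

requests.post(
    "http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore",
    json={"indices": "my-index", "partial": True},
)
```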
This means that I am left with two choices:

1. Modify RestoreService, RestoreRequest, RestoreSnapshotRequest, TransportRestoreSnapshotAction, and probably others to take an "only_restore_shards": [int] field.
If this new field is given, it prohibits you from performing metadata changes (by giving the operator a stern talking-to, via an error message, if they try to do metadata updates in a shard-only restore), which should hopefully simplify the implementation a little bit.
Furthermore, an only_restore_shards request implies a partial restore, with the major caveat that the meaning of a partial restore gets changed (see the point under 50/50). After this change, in a partial restore, if the index already existed and there are shards in ignoredShards, they're actually ignored as opposed to replaced with blanks; i.e., shard IDs that are listed in ignoredShards and exist in the index get reused. (Keep reading and this will all make sense. A sketch of what such a request could look like follows after this list.)

Benefits:

- The ignoredShards Set actually works as expected: you can fill it with the difference between the set of all shard numbers in the index and the set of shard numbers you want to restore (i.e., those in only_restore_shards), meaning it ignores (reuses) the existing, unaffected ones, and the code makes a bit more intuitive sense.
- An only_restore_shards operation implies a "partial" restore in the intuitive sense, which ties into the next point.

50/50:

- The semantics of a partial restore get inverted: rather than getting a "partially complete" index restored, you're "partially restoring" the index. (I want to say that this leans a bit towards a drawback, given that ES has had this semantic for a while now, buuuuuuuuut it's certainly more in line with what you would intuitively expect a "partial" restore to mean.)
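Here's a rough sketch of what a choice-1 request could look like. To be clear, only_restore_shards is hypothetical and does not exist in Elasticsearch today, and the shard numbers are made up:

```python
# Hypothetical choice-1 request: "only_restore_shards" is a proposed field,
# not an existing Elasticsearch API. Shard IDs and names are made up.
import requests

requests.post(
    "http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore",
    json={
        "indices": "my-index",
        "partial": True,                # implied by only_restore_shards
        "only_restore_shards": [3, 7],  # shard IDs to pull from the snapshot
    },
)

# The ignoredShards set would then just be the complement of the requested set:
all_shards = set(range(8))                 # e.g. an index with number_of_shards = 8
only_restore_shards = {3, 7}
ignored_shards = all_shards - only_restore_shards  # existing shards to reuse untouched
print(sorted(ignored_shards))              # [0, 1, 2, 4, 5, 6]
```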
2. Create a RestoreShardsService/RestoreShardsRequest/etc./etc./etc. along with a new REST endpoint such as /_snapshots/etc/etc/_restore_shards, as well as a corresponding TransportAction. (A sketch of what a request to such an endpoint could look like follows after this list.)

Benefits:

- Doesn't require touching the existing "an ignored shard is really just replaced with an empty shard" logic.
- Doesn't change the existing partial semantics (in case anyone out there uses the current partial semantics to their advantage somehow?).

50/50:

- Doesn't make the meaning of a partial restore more intuitive. On the other hand, this might not be necessary since you could now always cherry-pick the shards to restore, meaning a full index restore (which is the only way to do it as is) would only get used in the worst case that your index is totally fucked. In which case, if the shards in the snapshot are bad, you're proper fucked anyway.
- A whole new service plugged into the ClusterService seems kinda heavy for something that should really be an extension of the RestoreService anyway, but on the other hand RestoreService is a pretty damn big chunk of code.
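And a rough sketch of what a choice-2 request could look like; the endpoint and request body are entirely hypothetical:

```python
# Hypothetical choice-2 request: the _restore_shards endpoint is only a
# suggestion and does not exist in Elasticsearch.
import requests

requests.post(
    "http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore_shards",
    json={
        "index": "my-index",
        "shards": [3, 7],  # only these shard copies get recovered from the snapshot
    },
)
```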
Thanks for taking the time to read this wall of text, and apologies if it's in poor form to self-bump by making a new issue!
@imotov could you take a look at this please?
@iostat thank you for the analysis of the issue. I agree that it would be a valuable feature that would help some users who lost part of their index due to shard corruption. I also agree that RestoreService is already more complicated than we would like it to be. The good news is that a lot of the complexity in RestoreService comes from managing the restore lifecycle, and we are currently working on extracting some of this lifecycle-management logic into a separate TaskManagement service. So, my hope is that when this work is complete, it will be possible to split RestoreService into more manageable pieces. Until then, I would definitely not recommend using this task as a first foray into Elasticsearch, because it touches some of the most complex parts of Elasticsearch, such as shard allocation and the recovery/restore processes.
The placement of the new logic that you described is one issue. An even bigger issue, in my opinion, is that the metadata in a snapshot might not be compatible with the metadata of an existing index. That means it might or might not be OK to restore a few shards. When we restore all shards (even in the case of a partial restore), we can wipe out the old metadata and replace it with metadata from the snapshot without thinking about it, because it's guaranteed that the data in the existing index will be completely gone by the end of the process. When we restore only some shards, it's much trickier. What would you do if, during the restore, you discovered that the settings and/or mappings of the new index are different? In the case of mappings, we have the mapping-merging logic that we could reuse to a large degree. In the case of settings, there is no such mechanism. So, we would need to make a determination about what to do.
Some settings are ok to merge. For example, if an index in a snapshot has refresh_interval set to 10 sec and the corresponding index in the cluster has refresh_interval of 1 sec and everything else is the same, it's perfectly fine to partially restore the index. The same goes for the number_of_replicas setting. However, if the number_of_shards setting is different we definitely shouldn't restore. It gets even more complicated with analysis settings. For example, different settings for the same analyzer can break the restored index, but if we are just adding a new analyzer, it's perfectly fine to merge the settings.
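To make that distinction concrete, here is an illustrative sketch of such a settings check; the classification below is just an example, not an existing Elasticsearch mechanism:

```python
# Illustrative settings-compatibility check for a hypothetical shard-level
# restore. The classification of settings is an example, not how Elasticsearch
# actually decides anything today.
MERGEABLE = {"index.refresh_interval", "index.number_of_replicas"}
MUST_MATCH = {"index.number_of_shards"}

def can_partially_restore(snapshot_settings: dict, live_settings: dict) -> bool:
    """Allow a partial restore only if every differing setting is known to be safe."""
    for key in set(snapshot_settings) | set(live_settings):
        if snapshot_settings.get(key) == live_settings.get(key):
            continue
        if key in MUST_MATCH:
            return False  # e.g. a different number_of_shards can never be reconciled
        if key not in MERGEABLE:
            return False  # unknown difference (e.g. analysis settings): be conservative
    return True

# Only refresh_interval differs here, so a partial restore would be allowed.
print(can_partially_restore(
    {"index.number_of_shards": "8", "index.refresh_interval": "10s"},
    {"index.number_of_shards": "8", "index.refresh_interval": "1s"},
))
```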
@imotov: Thanks for the reply! I definitely agree that it's a behemoth of a task, especially for someone like me getting their feet wet with the codebase, but I thought it was at least worth starting a discussion for.
On the point of mismatched settings: I think it's fair to say that if the number of shards in the snapshot doesn't match the number of shards in the index, you shouldn't be able to restore that snapshot, period. As far as I'm aware, there's no way to reshard an index without creating a new one anyway, so in that case the snapshot being restored from is effectively from a completely different index and shouldn't be restored from in the first place. Likewise for settings such as the routing hash function, analyzer settings, etc. You could apply this to the extreme and even say that if the destination index was created after the snapshot was taken, then the restore request should be rejected completely.
I don't really see this feature as an absolute replacement for the existing restore functionality; it should really be available as a convenience if anything, for instance in the case that 1) regular backups are being taken, 2) out of nowhere a shard got corrupted for both the primary and the replica, and 3) no "major" settings changes occurred since the snapshot was taken.
As far as the TaskManagement service, any way to track the progress of that? I'd love to revisit this if/when that's implemented, but if it's a long way off (next major release or something like that), perhaps it's worth writing this feature as a separate module and then merging its functionality into the original RestoreService when that's all refactored.
You could apply this to the extreme and even say that if the destination index was created after the snapshot was taken, then the restore request should be rejected completely.
The requirement of a snapshot being created after the creation of the restored index is irrelevant to the success or failure of the partial restore. I think what you are trying to say is that we could reject the restore request completely if the index being restored is not the same index that was snapshotted. Yes, we could make this a requirement, but even then there are plenty of scenarios for a snapshot and an index to diverge. So, I don't think it's practical to make a determination about the feasibility of a restore simply by looking at the historical origins of the indices.
I don't really see this feature as an absolute replacement for the existing restore functionality; it should really be available as a convenience if anything.
Yes, and this is exactly why I wouldn't want to rush this feature in.
... no "major" settings changes occurred since the snapshot was taken.
Currently, it's hard to determine which settings change is "major" and which one is "minor".
As far as the TaskManagement service, any way to track the progress of that?
You can keep track of the task management development progress on the task management meta issue #15117.
FWIW I would love this feature and found this issue while trying to cope with an unexpected outage where two nodes in different racks were lost and 13 out of 128 shards went missing.
Because we had a snapshot from the day before, we took the following steps (note: this may not be the best way to do this) to restore only a small subset of the shards: … write.lock, and ensured that the _state/state-*.st file contained the same UUID by copying it from an existing shard.
Huge caveat: it worked for us on 2.4.x and YMMV, but it seemed relatively painless (though time-intensive) and much faster than restoring the full snapshot and then reindexing the missing writes.
This feature request is an interesting idea but since its opening we have not seen enough feedback that it is a feature we should pursue. We prefer to close this issue as a clear indication that we are not going to work on this at this time. We are always open to reconsidering this in the future based on compelling feedback; despite this issue being closed please feel free to leave feedback on the proposal (including +1s).
I would like this feature as well; it seems like the task management service is finished for 6.0.0. I'm aware of the complexities mentioned, but for clusters holding large amounts of data it could potentially be very useful. Thanks!
+1
We have a similar case: a database with 50+ terabytes of data.
For indices that do not have replication, the backup recovery process takes a lot of time. Partial recovery of just the broken shards would speed this process up a lot.
+1
+1
+1
This would be a really nice feature to see! We have a logging ("observability") cluster where we use un-replicated shards for longer-term data storage.
We take backups of this data because we classify it as "nice to have but not critical", and we have been in situations where one or more primary shards across one or more indexes were lost (EC allocator forced movements, for example).
It would be nice to selectively restore only the missing shards, which would greatly reduce restore traffic/time.
It could be specified manually (as in index-y, shard-0: cumbersome but functional) or introduced as a restore option ("restore missing shards only to existing indexes": nicely streamlined but less configurable); either would be good as a first cut.
Thanks!
@MrBones757 the feature you're looking for is searchable snapshots, which adds support for resilient zero-replica shards; the resilience is automatic too, so you don't need any manual intervention to restore a lost shard.
I should also point out that today Elasticsearch already restores only the data that is missing, so single-shard restores would not offer any further reduction in restore traffic anyway.
+1