Elasticsearch version:
6.3.0
Plugins installed:
s3-repository
x-pack
JVM version:
1.8
OS version:
Amazon Linux
Description of the problem including expected versus actual behavior:
curl -XDELETE "http://localhost:9200/_snapshot/s3-6-backup/curator-20180623143002" hangs even after a full restart of the cluster.
I've tried turning trace logging on, but it unfortunately hasn't helped.
Here is the status information on the backup:
curl -XGET "http://localhost:9200/_snapshot/s3-6-backup/_status?pretty"
Relevant excerpt:
{
  "snapshot" : "curator-20180623143002",
  "repository" : "s3-6-backup",
  "uuid" : "CbzLTfmbRCaGc5tu90x-9A",
  "state" : "ABORTED",
  "include_global_state" : true,
  "shards_stats" : {
    "initializing" : 0,
    "started" : 0,
    "finalizing" : 0,
    "done" : 187,
    "failed" : 320,
    "total" : 507
  },
Pinging @elastic/es-distributed
@tlrx can you take a look?
@TimHeckel Do you have more information about the snapshot deletion? Was it a currently running / unfinished snapshot that you tried to delete? How long did it hang before you tried to restart the cluster?
@tlrx - apologies for my delay in getting back to you here; the snapshot had been running for many days before I attempted deleting it and/or restarting the cluster. Is there anything I can do to force the removal of this aborted snapshot attempt? Thanks
Hi -- I've since upgraded from 6.3.0 to 6.4.1, but this one hanging snapshot remains. Below are all the responses I've gotten after upgrading. @tlrx - wondering if you could take another look or give me a pointer? I'd prefer not to have to migrate to a whole new cluster, but I simply cannot delete this hanging _snapshot, and that may be my last option.
GET /_snapshot/s3-6-backup/_status
{
  "snapshots": [
    {
      "snapshot": "curator-20180623143002",
      "repository": "s3-6-backup",
      "uuid": "CbzLTfmbRCaGc5tu90x-9A",
      "state": "ABORTED",
      "include_global_state": true,
      "shards_stats": {
        "initializing": 0,
        "started": 0,
        "finalizing": 0,
        "done": 187,
        "failed": 320,
        "total": 507
      },
      "stats": {
        "incremental": {
          "file_count": 0,
          "size_in_bytes": 0
        },
        "total": {
          "file_count": 0,
          "size_in_bytes": 0
        },
        "start_time_in_millis": 0,
        "time_in_millis": 0,
        "number_of_files": 0,
        "processed_files": 0,
        "total_size_in_bytes": 0,
        "processed_size_in_bytes": 0
      },
      "indices": { ...
GET /_snapshot/s3-6-backup/_current
{
  "snapshots": [
    {
      "snapshot": "curator-20180623143002",
      "uuid": "CbzLTfmbRCaGc5tu90x-9A",
      "version_id": 6040199,
      "version": "6.4.1",
      "indices": [ ... ],
      "include_global_state": true,
      "state": "IN_PROGRESS",
      "start_time": "2018-06-23T14:30:03.389Z",
      "start_time_in_millis": 1529764203389,
      "end_time": "1970-01-01T00:00:00.000Z",
      "end_time_in_millis": 0,
      "duration_in_millis": -1529764203389,
      "failures": [],
      "shards": {
        "total": 0,
        "failed": 0,
        "successful": 0
      }
    }
  ]
}
DELETE /_snapshot/s3-6-backup/curator-20180623143002
(eventually returns `curl: (52) Empty reply from server` when run on the server)
DELETE /_snapshot/s3-6-backup
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[main-2][10.0.1.61:9300][cluster:admin/repository/delete]"
      }
    ],
    "type": "illegal_state_exception",
    "reason": "trying to modify or unregister repository that is currently used"
  },
  "status": 500
}
The simplest solution to get rid of the stuck snapshot is to do a full cluster restart, i.e., take all nodes down and only then start them up again. This will clear the snapshot state, but it of course also means downtime. The more involved solution consists of the following: look at the cluster state and find the snapshot shard entries that are marked as ABORTED, then check the node id associated with each such entry. For example, let's look at:
{
  "index" : {
    "index_name" : "my_index",
    "index_uuid" : "VhUbEkzIQvggTmOXaRS3gQ"
  },
  "shard" : 0,
  "state" : "ABORTED",
  "node" : "zDggEbNfTj-DrAwJYjagcw"
},
The node id is zDggEbNfTj-DrAwJYjagcw, so in order for this entry to be properly cleaned up, the node with that id needs to be shut down. Now you might think: but isn't that what we did when we performed the rolling upgrade? Unfortunately, things are a little more complex. You need to shut down the node with id zDggEbNfTj-DrAwJYjagcw and then WAIT for the snapshot entries for that node to be cleaned up in the cluster state (i.e. marked as FAILED) before starting that node up again. The reason for this odd behavior is that the clean-up logic is scheduled when the node leaves the cluster, but so, possibly, are lots of other events (e.g. shard failures). The scheduled clean-up logic (which will appear in the pending tasks as `update snapshot state after node removal`) unfortunately runs at quite a low priority compared to some of the other tasks, which means that if the node rejoins the cluster before the task has gotten to execute, it will mistake the node for not having left and will not clean up the ABORTED state.
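A minimal sketch of how one might pull these entries out of the cluster state, assuming `jq` is available and that the in-progress snapshots appear under a top-level `snapshots` key (the exact JSON path can differ between versions):

# List in-progress snapshot shard entries still marked ABORTED, together with
# the node ids that need to be shut down. filter_path trims the response to
# the snapshots-in-progress section of the cluster state.
curl -s "http://localhost:9200/_cluster/state?filter_path=snapshots" \
  | jq '.snapshots.snapshots[].shards[]
        | select(.state == "ABORTED")
        | {index: .index.index_name, shard, state, node}'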
I believe you may also have to look for shards that are in the INIT state, not just the ABORTED state. Please correct me if I am off on this.
In the situation @TimHeckel outlined here, he had already issued a delete snapshot command. That moves the entries from INIT to ABORTED (but leaves the delete snapshot request hanging until the abort is fully completed and confirmed by the nodes). So a prerequisite to the above procedure is to issue a delete snapshot command first.
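Putting the pieces together, a rough, hedged outline of the sequence (using the host and snapshot name from this thread; the pending-task name is quoted from the description above):

# 1. Issue the delete so the in-progress entries move from INIT to ABORTED.
#    The request will hang, so run it in the background (or a separate shell).
curl -XDELETE "http://localhost:9200/_snapshot/s3-6-backup/curator-20180623143002" &

# 2. Shut down the node whose id appears on the ABORTED shard entries.

# 3. WAIT: watch for the clean-up task to run and for the entries to be marked
#    FAILED before restarting that node.
curl -s "http://localhost:9200/_cluster/pending_tasks?pretty"        # look for "update snapshot state after node removal"
curl -s "http://localhost:9200/_cluster/state?filter_path=snapshots" # shard entries should eventually show FAILED

# 4. Only then start the node back up.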
@ywelsch - thank you so much for your help. I did attempt the second scenario, where I shut down the ABORTED node(s) and waited for the snapshot entries to turn into FAILED. I shut down just one of these nodes in my three-node cluster, and unfortunately my first attempt caused the cluster to report:
{ "error" : { "root_cause" : [ { "type" : "master_not_discovered_exception", "reason" : null } ], "type" : "master_not_discovered_exception", "reason" : null }, "status" : 503 }
I think I will try for the full cluster restart tonight. At any rate, you've given me the first actionable advice, so I really appreciate it.
@ywelsch - just to close this, the FULL restart of the cluster worked. Thanks again.