Elasticsearch version:
6.3.0
Plugins installed:
s3-repository
x-pack
JVM version:
1.8
OS version:
Amazon Linux
Description of the problem including expected versus actual behavior:
curl -XDELETE "http://localhost:9200/_snapshot/s3-6-backup/curator-20180623143002" hangs even after a full restart of the cluster.
I've tried turning trace logging on, but it unfortunately hasn't helped.
Here is the status information on the backup:
curl -XGET "http://localhost:9200/_snapshot/s3-6-backup/_status?pretty"
Relevant excerpt:
{
  "snapshot" : "curator-20180623143002",
  "repository" : "s3-6-backup",
  "uuid" : "CbzLTfmbRCaGc5tu90x-9A",
  "state" : "ABORTED",
  "include_global_state" : true,
  "shards_stats" : {
    "initializing" : 0,
    "started" : 0,
    "finalizing" : 0,
    "done" : 187,
    "failed" : 320,
    "total" : 507
  },
Pinging @elastic/es-distributed
@tlrx can you take a look?
@TimHeckel Do you have more information about the snapshot deletion? Was it a currently running / unfinished snapshot that you tried to delete? How long did it hang before you tried to restart the cluster?
@tlrx - apologies for my delay in getting back to you here; the snapshot had been running for many days before I attempted deleting it and/or restarting the cluster. Is there anything I can do to force the removal of this aborted snapshot attempt? Thanks
Hi -- I've since upgraded from 6.3.0 to 6.4.1, but this one hanging snapshot remains. Below are all the responses I've gotten after upgrading. @tlrx - wondering if you could take another look or give me a pointer? I'd prefer not to have to migrate to a whole new cluster, but I simply cannot delete this hanging _snapshot, and that may be my last option.
GET /_snapshot/s3-6-backup/_status
{
  "snapshots": [
    {
      "snapshot": "curator-20180623143002",
      "repository": "s3-6-backup",
      "uuid": "CbzLTfmbRCaGc5tu90x-9A",
      "state": "ABORTED",
      "include_global_state": true,
      "shards_stats": {
        "initializing": 0,
        "started": 0,
        "finalizing": 0,
        "done": 187,
        "failed": 320,
        "total": 507
      },
      "stats": {
        "incremental": {
          "file_count": 0,
          "size_in_bytes": 0
        },
        "total": {
          "file_count": 0,
          "size_in_bytes": 0
        },
        "start_time_in_millis": 0,
        "time_in_millis": 0,
        "number_of_files": 0,
        "processed_files": 0,
        "total_size_in_bytes": 0,
        "processed_size_in_bytes": 0
      },
      "indices": { ...
GET /_snapshot/s3-6-backup/_current
{
  "snapshots": [
    {
      "snapshot": "curator-20180623143002",
      "uuid": "CbzLTfmbRCaGc5tu90x-9A",
      "version_id": 6040199,
      "version": "6.4.1",
      "indices": [ ... ],
      "include_global_state": true,
      "state": "IN_PROGRESS",
      "start_time": "2018-06-23T14:30:03.389Z",
      "start_time_in_millis": 1529764203389,
      "end_time": "1970-01-01T00:00:00.000Z",
      "end_time_in_millis": 0,
      "duration_in_millis": -1529764203389,
      "failures": [],
      "shards": {
        "total": 0,
        "failed": 0,
        "successful": 0
      }
    }
  ]
}
DELETE /_snapshot/s3-6-backup/curator-20180623143002
(eventually returns `curl: (52) Empty reply from server` when run on the server)
DELETE /_snapshot/s3-6-backup
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[main-2][10.0.1.61:9300][cluster:admin/repository/delete]"
      }
    ],
    "type": "illegal_state_exception",
    "reason": "trying to modify or unregister repository that is currently used"
  },
  "status": 500
}
The simplest solution to get rid of the stuck snapshot is to do a full cluster restart, i.e., take all nodes down and only then start them up again. This will clear the snapshot state, but it of course also means downtime. The more involved solution consists of the following: look at the cluster state and find the snapshot shard entries that are marked as ABORTED, then check the node id associated with each such entry. For example, let's look at:
{
  "index" : {
    "index_name" : "my_index",
    "index_uuid" : "VhUbEkzIQvggTmOXaRS3gQ"
  },
  "shard" : 0,
  "state" : "ABORTED",
  "node" : "zDggEbNfTj-DrAwJYjagcw"
},
The node id is zDggEbNfTj-DrAwJYjagcw, so in order for this entry to be properly cleaned up, the node with that id needs to be shut down. Now you might think: but isn't that what we did when we performed the rolling upgrade? Unfortunately, things are a little more complex. You need to shut down the node with id zDggEbNfTj-DrAwJYjagcw and then WAIT for the snapshot entries for that node to be cleaned up in the cluster state (i.e. marked as FAILED) before starting that node up again. The reason for this odd behavior is that the clean-up logic is scheduled when the node leaves the cluster, but so, possibly, are lots of other events (e.g. shard failures). The scheduled clean-up logic (which will appear in the pending tasks as `update snapshot state after node removal`) unfortunately runs at quite a low priority compared to some of the other tasks, which means that if the node rejoins the cluster before the task has gotten to execute, it will mistake the node for not having left and will not clean up the ABORTED state.
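A minimal sketch of how one might pull these entries out of the cluster state, assuming `jq` is available and that the in-progress snapshots appear under a top-level `snapshots` key (the exact JSON path can differ between versions):

# List in-progress snapshot shard entries still marked ABORTED, together with
# the node ids that need to be shut down. filter_path trims the response to
# the snapshots-in-progress section of the cluster state.
curl -s "http://localhost:9200/_cluster/state?filter_path=snapshots" \
  | jq '.snapshots.snapshots[].shards[]
        | select(.state == "ABORTED")
        | {index: .index.index_name, shard, state, node}'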
I believe you may also have to look for shards that are in the INIT state, not just the ABORTED state. Please correct me if I am off on this.
In the situation @TimHeckel outlined here, he had already issued a delete snapshot command. That moves the entries from INIT to ABORTED (but leaves the delete snapshot request hanging until the abort is fully completed and confirmed by the nodes). So a prerequisite to the above procedure is to issue a delete snapshot command first.
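Putting the pieces together, a rough, hedged outline of the sequence (using the host and snapshot name from this thread; the pending-task name is quoted from the description above):

# 1. Issue the delete so the in-progress entries move from INIT to ABORTED.
#    The request will hang, so run it in the background (or a separate shell).
curl -XDELETE "http://localhost:9200/_snapshot/s3-6-backup/curator-20180623143002" &

# 2. Shut down the node whose id appears on the ABORTED shard entries.

# 3. WAIT: watch for the clean-up task to run and for the entries to be marked
#    FAILED before restarting that node.
curl -s "http://localhost:9200/_cluster/pending_tasks?pretty"        # look for "update snapshot state after node removal"
curl -s "http://localhost:9200/_cluster/state?filter_path=snapshots" # shard entries should eventually show FAILED

# 4. Only then start the node back up.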
@ywelsch - thank you so much for your help. I did attempt the second scenario, where I shut down the ABORTED node(s) and waited for the snapshot entries to turn into FAILED. I shut down just one of these nodes in my three-node cluster, and unfortunately my first attempt caused the cluster to report:
{ "error" : { "root_cause" : [ { "type" : "master_not_discovered_exception", "reason" : null } ], "type" : "master_not_discovered_exception", "reason" : null }, "status" : 503 }
I think I will try for the full cluster restart tonight. At any rate, you've given me the first actionable advice, so I really appreciate it.
@ywelsch - just to close this, the FULL restart of the cluster worked. Thanks again.