Elasticsearch: An API for decommissioning a node or nodes

Created on 14 Nov 2019  ·  5 comments  ·  Source: elastic/elasticsearch

A common need when administering nodes is to remove one or more nodes from the cluster without interrupting any operations. I think it would be useful to have a dedicated API for doing this decommission that is smarter than our current decommissioning process.

Currently, to decommission a node, a user sets the "cluster.routing.allocation.exclude._id" setting to the node they want to remove from the cluster (alternatively _ip, _host, or _name can be used). They then have to wait for the shards to be moved off of the node, checking the /_cat/allocation or similar APIs.
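A minimal sketch of that manual workflow, assuming rows shaped like the JSON output of GET /_cat/allocation?format=json (the helper names here are hypothetical, for illustration only):

```python
import json

def exclude_node_body(node_id: str) -> str:
    """Build the transient cluster-settings body that excludes a node by _id."""
    return json.dumps({
        "transient": {"cluster.routing.allocation.exclude._id": node_id}
    })

def shards_remaining(cat_allocation: list, node_name: str) -> int:
    """Count shards still on the node, given rows shaped like the
    /_cat/allocation?format=json output (each row has 'node' and 'shards')."""
    for row in cat_allocation:
        if row.get("node") == node_name:
            return int(row["shards"])
    return 0  # node not listed: nothing left to move

# Sample poll result, as if captured from /_cat/allocation?format=json
sample = [
    {"node": "data-node-1", "shards": "7"},
    {"node": "data-node-2", "shards": "0"},
]
print(shards_remaining(sample, "data-node-1"))  # 7 -> keep waiting
print(shards_remaining(sample, "data-node-2"))  # 0 -> safe to stop this node
```

The admin has to keep re-running the second step until the count reaches zero, which is exactly the polling loop the proposed status API would replace.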

The issue with this is that during this time it's possible for mutually exclusive settings to be set that cause data not to be migrated from the node, or for other filtering settings to become unsatisfiable once the node is subsequently removed from the cluster. Additionally, there isn't a simple way to check readiness other than reading the cat APIs (which should not be relied on) or the cluster state routing table.

What would be nice is the ability to mark a node as "pending decommission" such as:

PUT /_cluster/allocation/decommission
{
  "nodes": ["node-id-1", "node-id-2"]
}

This API would act similarly to the allocation exclusion API. We could then also treat these node ids "specially" internally, for instance we could disallow setting index-level shard filtering directly to these node IDs (this would be very nice for ILM since we randomly pick a node for allocation during a shrink). We could also potentially say "when the master node fails, prefer not to elect this node because it is going to be shortly removed from the cluster".

Rather than having a user have to check our other allocation APIs for the status of the decommission, we could provide a way to check the status of decommissioning nodes:

GET /_cluster/allocation/status

Which could give status like:

{
  "nodes": [
    {
      "node": "data-node-1",
      "node_id": "node-id-1",
      "safe_to_shutdown": false,
      "data_loss_on_removal": false,
      "decommission_possible": false,
      "shards_remaining": 7,
      "time_since_decommission": "1.2h",
      "time_since_decommission_millis": 4320000
    },
    {
      "node": "data-node-2",
      "node_id": "node-id-2",
      "safe_to_shutdown": true,
      "data_loss_on_removal": false,
      "decommission_possible": true,
      "shards_remaining": 0,
      "time_since_decommission": "1.2h",
      "time_since_decommission_millis": 4320000
    }
  ]
}

These fields are all made up; they're just ideas, but here's how they'd work:

  • safe_to_shutdown - whether the node still holds any shards; if it holds 0, this would be true
  • data_loss_on_removal - in some cases (e.g. two-node clusters) you cannot evacuate all the data because a primary and its replica can't be on the same node; this could return "true" if any 0-replica indices were present on the node (so you could still remove the node if you didn't mind your cluster going yellow)
  • decommission_possible - internally ES could check whether the shards can even be moved off of the node (maybe they can't due to disk space, for instance), and return false if it's never going to be possible to decommission this node
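The field semantics above could be derived from the shards still on a node, sketched roughly as follows (the field names come from the proposal; the per-shard inputs 'replicas' and 'movable' are hypothetical stand-ins for what the allocator actually knows):

```python
def decommission_status(shards_on_node: list) -> dict:
    """Derive the proposed status fields from the shards still on a node.
    Each shard is a dict with 'replicas' (configured replica count) and
    'movable' (whether the allocator could place it elsewhere) -- both
    illustrative inputs, not real Elasticsearch structures."""
    return {
        "shards_remaining": len(shards_on_node),
        # No shards left: the node can be stopped without relocating anything.
        "safe_to_shutdown": len(shards_on_node) == 0,
        # A 0-replica shard still on the node would be lost if it is removed.
        "data_loss_on_removal": any(s["replicas"] == 0 for s in shards_on_node),
        # If any shard can never be moved (e.g. disk watermarks elsewhere),
        # draining this node will never finish.
        "decommission_possible": all(s["movable"] for s in shards_on_node),
    }

status = decommission_status([
    {"replicas": 1, "movable": True},
    {"replicas": 0, "movable": True},  # cluster goes yellow if node is removed
])
```

An empty shard list would yield safe_to_shutdown=true and data_loss_on_removal=false, matching the data-node-2 example above.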

Additionally, we'd want a way to recommission a node in the event it was decommissioned but an admin changed their mind:

PUT /_cluster/allocation/recommission
{
  "nodes": ["node-id-2"]
}

There are a lot of additional ideas around this (things like "as soon as a node has moved all its data off, automatically shut itself down"), but this is a bare-bones issue to get discussion started.

:Distributed/Allocation >feature

All 5 comments

Pinging @elastic/es-distributed (:Distributed/Allocation)

I imagine Cloud would be overjoyed to get this - shall I find someone to loop in?

@pugnascotia this partially comes from discussions I have had with @jhalterman about decommissioning on Cloud.

Ha, yes I was thinking of him 😄

At least with respect to decommissioning data nodes, the cluster settings API:

PUT /_cluster/settings 
{
  "transient": {
      "cluster.routing.allocation.exclude._name" : "<node>,<node>,..."
   }
}

can be enhanced for set-like (multi-unique-valued) settings in the following (or similar) way:

PUT /_cluster/settings 
{
  "transient": {
      "cluster.routing.allocation.exclude[.<add/remove>]._name" : "<node>,<node>,..."
   }
}

This would keep such "commands" idempotent and allow for parallel, safe decommissioning of several data nodes at a time, potentially speeding up safe rolling updates and similar scenarios significantly, without needing custom centralized orchestration outside of the master node.
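The add/remove variants above would behave like set operations on the exclusion list, in contrast to today's plain assignment. A minimal sketch of those semantics (the apply_exclude_op helper is hypothetical):

```python
def apply_exclude_op(current: set, op: str, names: str) -> set:
    """Apply a set-like update to the excluded-node list.
    op is 'add', 'remove', or 'set' (plain assignment, today's behaviour);
    names is the comma-separated value from the settings body."""
    values = {n for n in names.split(",") if n}
    if op == "add":
        return current | values       # union: re-adding a node is a no-op
    if op == "remove":
        return current - values       # removing an absent node is a no-op
    return values                     # plain assignment replaces the whole set

excluded = set()
excluded = apply_exclude_op(excluded, "add", "node-1,node-2")
excluded = apply_exclude_op(excluded, "add", "node-2")     # idempotent
excluded = apply_exclude_op(excluded, "remove", "node-1")  # now {"node-2"}
```

Because add and remove are idempotent, two orchestrators excluding different nodes concurrently cannot clobber each other's entries the way two plain assignments to the full comma-separated value can.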

