Elasticsearch: Prevent allocating shards to broken nodes

Created on 17 May 2016 · 4 comments · Source: elastic/elasticsearch

Allocating shards to a node can fail for various reasons. When an allocation fails, we currently ignore the node for that shard during the next allocation round. However, this means that:

  • subsequent rounds consider the node for allocating the shard again.
  • other shards are still allocated to the node (in particular, the balancer tries to put more shards on that node, since its weight becomes smaller without the failed shard).
    This is particularly bad if the node is permanently broken, leading to a never-ending series of failed allocations. Ultimately this affects the stability of the cluster.
Labels: :DistributeAllocation, >bug, Distributed

All 4 comments

@ywelsch I think we can approach this from multiple directions.

  • we can start bottom-up and check that a data path is writable before we allocate a shard, and skip it if possible (that would help if someone loses a disk and has multiple)
  • we can also add a simple allocation_failed counter on UnassignedInfo to prevent endless allocation of a potentially broken index (metadata / settings / whatever is broken)
  • we might also be able to use a simple counter of failed allocations per node that we reset once we have had a successful allocation on that node. We could then also have a simple allocation decider that throttles that node, or takes it out of the loop entirely, once the counter goes beyond a threshold?

I think in all of these cases simplicity wins over complex state... my $0.05
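The per-node failure counter in the last bullet could look roughly like this. This is a minimal sketch with invented names (`NodeFailureTracker`, `onAllocationFailed`, etc.), not the actual Elasticsearch `AllocationDecider` API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: track consecutive failed allocations per node and
// stop considering a node once it crosses a threshold. A successful
// allocation resets the counter, so transient failures (e.g. a network
// blip) do not permanently exclude the node.
public class NodeFailureTracker {
    private final Map<String, Integer> failuresPerNode = new HashMap<>();
    private final int maxFailures;

    public NodeFailureTracker(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    public void onAllocationFailed(String nodeId) {
        failuresPerNode.merge(nodeId, 1, Integer::sum);
    }

    public void onAllocationSucceeded(String nodeId) {
        failuresPerNode.remove(nodeId);
    }

    // A decider would return "NO" for this node once this is false.
    public boolean canAllocateTo(String nodeId) {
        return failuresPerNode.getOrDefault(nodeId, 0) < maxFailures;
    }
}
```

The simplicity argument above applies here too: a plain counter with a reset-on-success rule avoids carrying any complex per-node health state in the cluster state.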

@ywelsch @s1monw is there any news on this?

Some OSes remount a failing disk read-only, and when that happens the entire cluster ends up with RED shards and shards that won't move. Perhaps this could help on that front?
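The read-only remount case could be detected with an actual write probe rather than a permission check. A sketch of that idea (the probe file name is invented; this is not how Elasticsearch implements it):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the "check the data path is writable before allocating" idea.
// A read-only remount (as some OSes do on disk errors) makes this probe fail.
public class DataPathProbe {
    // Try to create and delete a small probe file. Files.isWritable alone
    // can be misleading because it may consult permissions rather than the
    // actual mount state, so we attempt a real write.
    public static boolean isWritable(Path dataPath) {
        Path probe = dataPath.resolve(".es-write-probe");
        try {
            Files.write(probe, new byte[0]);
            Files.deleteIfExists(probe);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```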

Pinging @elastic/es-distributed

We have another, non-trivial, instance of this in shard fetching. When fetching hard-fails on a node (rather than succeeding by finding a broken copy) we currently redo the fetch. This is an easy way around networking issues but can be poisonous with disk failures (for example).
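One way to avoid endlessly redoing the fetch is to distinguish transient failures (always retry) from hard failures (give up after a few attempts per node). A sketch with invented names and an illustrative threshold, not the real shard-fetching internals:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: cap retries of shard-metadata fetches per node when
// the failure is "hard" (e.g. a disk error), while still retrying freely
// on transient failures (e.g. a network timeout).
public class ShardFetchRetryPolicy {
    private final Map<String, Integer> hardFailures = new HashMap<>();
    private final int maxHardFailures;

    public ShardFetchRetryPolicy(int maxHardFailures) {
        this.maxHardFailures = maxHardFailures;
    }

    public boolean shouldRetry(String nodeId, boolean hardFailure) {
        if (!hardFailure) {
            return true; // transient: always worth another attempt
        }
        int count = hardFailures.merge(nodeId, 1, Integer::sum);
        return count < maxHardFailures; // hard: stop after a few attempts
    }
}
```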

