Allocating shards to a node can fail for various reasons. When an allocation fails, we currently ignore the node for that shard during the next allocation round. However, this means that:
@ywelsch I think we can approach this from multiple directions.
I think in all of these cases simplicity wins over complex state... my $0.05
@ywelsch @s1monw is there any news on this?
Some OSes remount a disk as read-only when it hits errors. When that happens, the entire cluster has issues: shards go RED and are not moved elsewhere. Perhaps this could help with that?
Pinging @elastic/es-distributed
We have another, non-trivial instance of this in shard fetching. When fetching hard fails on a node (rather than succeeding by finding a broken copy), we currently redo the fetching. This is an easy way around networking issues but can be poisonous on disk failures (for example).
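To make the network-versus-disk distinction concrete, here is a minimal sketch of how a fetch retry could be bounded. It is an illustration only and not how Elasticsearch implements shard fetching: `FetchTracker`, `FailureKind`, and `max_retries` are hypothetical names, and it assumes the caller can already classify a hard failure as network-related or disk-related.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class FailureKind(Enum):
    """Hypothetical classification of a hard fetch failure."""
    NETWORK = auto()  # assumed transient, so retrying is reasonable
    DISK = auto()     # assumed persistent, so retrying repeats the failure


@dataclass
class FetchTracker:
    """Hypothetical per-shard bookkeeping of hard fetch failures, keyed by node id."""
    max_retries: int = 3                          # assumed cap, not a real setting
    failures: dict = field(default_factory=dict)  # node id -> network failure count
    excluded: set = field(default_factory=set)    # nodes we no longer refetch from

    def record_failure(self, node: str, kind: FailureKind) -> None:
        if kind is FailureKind.DISK:
            # A disk failure is unlikely to heal on its own: stop immediately.
            self.excluded.add(node)
            return
        # Network failures are retried, but only up to the cap.
        self.failures[node] = self.failures.get(node, 0) + 1
        if self.failures[node] >= self.max_retries:
            self.excluded.add(node)

    def should_refetch(self, node: str) -> bool:
        return node not in self.excluded


tracker = FetchTracker(max_retries=2)
tracker.record_failure("node-1", FailureKind.NETWORK)
print(tracker.should_refetch("node-1"))  # still within the retry budget
tracker.record_failure("node-2", FailureKind.DISK)
print(tracker.should_refetch("node-2"))  # excluded after a single disk failure
```

The only state kept is one counter per node and a set of excluded nodes, which is in line with the preference for simplicity over complex state voiced earlier in the thread.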