In https://github.com/elastic/elasticsearch/pull/18467 we solved the problem where the failed allocation of a shard is retried in a tight loop, filling up the log file with exceptions. Now, after five failures, the allocation is no longer attempted until the user triggers it.
The downside of this approach is that it requires user intervention.
Would it be possible to add some kind of exponential backoff so that allocation attempts continue to be made, but with decreasing frequency? That way we would still avoid flooding the logs, but if the situation resolves itself, the shard would be allocated automatically.
CC @ywelsch, @bleskes
I think it's good to explore this. We can still keep the hard limit (and maybe increase it) - we built the feature for configuration mistakes - but slow down the rate of re-assignment.
@clintongormley did you run into a specific issue that triggered this?
@bleskes Just from user feedback
Pinging @elastic/es-distributed
FWIW I think we should lose the limit and just keep trying, at sufficiently low frequency for it not to be disruptive (e.g. back off until once-per-hour)
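To make that concrete, here's a minimal sketch of the kind of schedule being suggested; the class, method, and constants below are illustrative assumptions, not existing Elasticsearch settings or code:

```java
import java.util.concurrent.TimeUnit;

// Sketch only: exponential backoff between allocation retries, capped so that
// we never retry more often than once per hour once the delays grow that large.
final class AllocationBackoffSketch {
    private static final long BASE_DELAY_MILLIS = TimeUnit.SECONDS.toMillis(5);
    private static final long MAX_DELAY_MILLIS = TimeUnit.HOURS.toMillis(1);

    /** Delay before the next retry, given how many attempts have already failed. */
    static long delayBeforeNextRetryMillis(int failedAttempts) {
        // 5s, 10s, 20s, ... doubling each time until the one-hour ceiling is reached
        double delay = BASE_DELAY_MILLIS * Math.pow(2, Math.max(0, failedAttempts - 1));
        return (long) Math.min(delay, MAX_DELAY_MILLIS);
    }
}
```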
Hello,
If nobody is working on it, I would like to pick it up.
Any initial thoughts are welcome. 🙂
Thanks @dhwanilpatel. I've already started working on this. I've removed the misleading `help wanted` label.
As far as I've been able to tell, the only case where we need indefinite retries is where the allocation repeatedly fails due to a `ShardLockObtainFailedException` because the shard is already open on the node thanks to an earlier allocation, and although it's in the process of shutting down it does not do so quickly enough. Frequently this is due to a temporarily flaky network resulting in a node leaving and rejoining the cluster a few times. By default, we wait for 5 seconds and retry 5 times, but it's definitely possible today for a shard to take more than 25 seconds to shut down.
The effect of the proposal here would be to keep retrying until the shard eventually shuts down, no matter how long that takes. I would prefer that we address the underlying causes of slow shard shutdowns, because this will bring the cluster back to health much more quickly and will result in fewer full-shard recoveries after a network wobble.
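For the sake of discussion, here's a rough sketch of what "retry without limit, but only for shard-lock failures" could look like; the names are hypothetical and this is not the actual allocator code:

```java
import org.elasticsearch.env.ShardLockObtainFailedException;

// Sketch only: shard-lock failures keep being retried (ideally with backoff),
// while every other failure still respects the existing hard retry limit.
final class RetryPolicySketch {
    static boolean shouldRetry(Throwable failure, int failedAttempts, int maxRetries) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof ShardLockObtainFailedException) {
                // The previous copy of the shard is still shutting down on that
                // node; this resolves on its own, so keep trying.
                return true;
            }
        }
        // Other failures (e.g. configuration mistakes) keep the hard limit.
        return failedAttempts < maxRetries;
    }
}
```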
Another reason for failing allocations that eventually succeed is `CircuitBreakingException`s; we are discussing making recoveries more resilient to memory pressure in #44484.
A related point is that we typically only repeatedly try allocation on one or two nodes, because we only avoid the very last failed node in the `ReplicaShardAllocator`. Since #48265 we keep track of the nodes behind all failed allocations, so we could make use of this to try more nodes.
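As an illustration of that last point, candidate selection along these lines could filter against the full failure history rather than just the most recent failed node; the helper and its parameters below are my assumption, not the real allocator plumbing:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch only: `failedNodeIds` stands in for the per-shard set of nodes that
// were behind earlier failed allocations; nodes in that set are skipped.
final class CandidateNodesSketch {
    static List<String> eligibleNodes(List<String> candidateNodeIds, Set<String> failedNodeIds) {
        return candidateNodeIds.stream()
            .filter(nodeId -> failedNodeIds.contains(nodeId) == false)
            .collect(Collectors.toList());
    }
}
```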
Hi,
any update on this? Has this been picked up yet?
Work continues on making it so we no longer need this feature, yes.
I doubt it would be possible to avoid every scenario where exponential backoff on retries is needed.
Flaky networks are probably here to stay, at least some of the time.
If we're bothering to retry shard allocation anyway, why not do it right and have a backoff system?
It's an easy win (vs. fixing all the infinite possible issues).
By the way, on AWS's ES service in CN-Northeast, shards end up unassigned after exhausting their N retries more often than seems reasonable, something like once a month.
But if I just bump the retry count by 1 (so that it retries once more) some human amount of time after the cluster status goes red, it works.