In https://github.com/elastic/elasticsearch/pull/18467 we solved the problem where the failed allocation of a shard is retried in a tight loop, filling up the log file with exceptions. Now, after five failures, the allocation is no longer attempted until the user triggers it.
The downside of this approach is that it requires user intervention.
Would it be possible to add some kind of exponential backoff so that allocation attempts continue to be made, but with decreasing frequency? That way we would still avoid flooding the logs, but if the situation resolves itself, the shard would be allocated automatically.
CC @ywelsch, @bleskes
I think it's good to explore this. We can still keep the hard limit (and maybe increase it) - we built the feature for configuration mistakes - but slow down the rate of re-assignment.
@clintongormley did you run into a specific issue that triggered this?
@bleskes Just from user feedback
Pinging @elastic/es-distributed
FWIW I think we should lose the limit and just keep trying, at sufficiently low frequency for it not to be disruptive (e.g. back off until once-per-hour)
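To make that concrete, here's a minimal sketch of the kind of schedule being suggested; the class, method, and constants below are illustrative assumptions, not existing Elasticsearch settings or code:

```java
import java.util.concurrent.TimeUnit;

// Sketch only: exponential backoff between allocation retries, capped so that
// we never retry more often than once per hour once the delays grow that large.
final class AllocationBackoffSketch {
    private static final long BASE_DELAY_MILLIS = TimeUnit.SECONDS.toMillis(5);
    private static final long MAX_DELAY_MILLIS = TimeUnit.HOURS.toMillis(1);

    /** Delay before the next retry, given how many attempts have already failed. */
    static long delayBeforeNextRetryMillis(int failedAttempts) {
        // 5s, 10s, 20s, ... doubling each time until the one-hour ceiling is reached
        double delay = BASE_DELAY_MILLIS * Math.pow(2, Math.max(0, failedAttempts - 1));
        return (long) Math.min(delay, MAX_DELAY_MILLIS);
    }
}
```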
Hello,
If nobody is working on it, I would like to pick it up.
Any initial thoughts are welcome. 🙂
Thanks @dhwanilpatel. I've already started working on this. I've removed the misleading `help wanted` label.
As far as I've been able to tell, the only case where we need indefinite retries is where the allocation repeatedly fails due to a `ShardLockObtainFailedException` because the shard is already open on the node thanks to an earlier allocation, and although it's in the process of shutting down it does not do so quickly enough. Frequently this is due to a temporarily flaky network resulting in a node leaving and rejoining the cluster a few times. By default, we wait for 5 seconds and retry 5 times, but it's definitely possible today for a shard to take more than 25 seconds to shut down.
The effect of the proposal here would be to keep retrying until the shard eventually shuts down, no matter how long that takes. I would prefer that we address the underlying causes of slow shard shutdowns, because this will bring the cluster back to health much more quickly and will result in fewer full-shard recoveries after a network wobble.
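For the sake of discussion, here's a rough sketch of what "retry without limit, but only for shard-lock failures" could look like; the names are hypothetical and this is not the actual allocator code:

```java
import org.elasticsearch.env.ShardLockObtainFailedException;

// Sketch only: shard-lock failures keep being retried (ideally with backoff),
// while every other failure still respects the existing hard retry limit.
final class RetryPolicySketch {
    static boolean shouldRetry(Throwable failure, int failedAttempts, int maxRetries) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof ShardLockObtainFailedException) {
                // The previous copy of the shard is still shutting down on that
                // node; this resolves on its own, so keep trying.
                return true;
            }
        }
        // Other failures (e.g. configuration mistakes) keep the hard limit.
        return failedAttempts < maxRetries;
    }
}
```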
Another reason for failing allocations that eventually succeed is `CircuitBreakingException`s; we are discussing making recoveries more resilient to memory pressure in #44484.
A related point is that we typically only repeatedly try allocation on one or two nodes, because we only avoid the very last failed node in the `ReplicaShardAllocator`. Since #48265 we keep track of the nodes behind all failed allocations, so we could make use of this to try more nodes.
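As an illustration of that last point, candidate selection along these lines could filter against the full failure history rather than just the most recent failed node; the helper and its parameters below are my assumption, not the real allocator plumbing:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch only: `failedNodeIds` stands in for the per-shard set of nodes that
// were behind earlier failed allocations; nodes in that set are skipped.
final class CandidateNodesSketch {
    static List<String> eligibleNodes(List<String> candidateNodeIds, Set<String> failedNodeIds) {
        return candidateNodeIds.stream()
            .filter(nodeId -> failedNodeIds.contains(nodeId) == false)
            .collect(Collectors.toList());
    }
}
```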
Hi,
any update on this? Has this been picked up yet?
Work continues on making it so we no longer need this feature, yes.
I doubt it would be possible to avoid every scenario where exponential backoff on retries is needed.
Flaky networks are probably here to stay, at least some of the time.
If we're bothering to retry shard allocation anyway, why not do it right and have a backoff system?
It's an easy win (vs. fixing all the infinite possible issues).
By the way, on AWS's ES service in CN-Northeast, shards end up unassigned after exhausting their N retries more often than seems reasonable, something like once a month.
But if I just bump the retry count by 1 (so that it retries once more) some human amount of time after the cluster status goes red, it works.