With an index configured with auto_expand_replicas="0-all", the cluster will try to allocate one primary and ({number of nodes} - 1) replicas for each shard, i.e. a copy of each shard on every node.
However, with "index.routing.allocation.include.zone"="zone1" the cluster is blocked from allocating a shard (either primary or replica) to any node that is not configured with a zone attribute set to "zone1", i.e. "node.zone: zone1" in the elasticsearch.yml config file. So if a cluster has 3 nodes in "zone1" and 3 nodes in "zone2", it will allocate 3 copies of each shard (the primary plus 2 replicas) to the "zone1" nodes and leave the remaining 3 replicas unassigned. I observed this using the elasticsearch-head utility.
So the allocation behaviour is as desired, i.e. replicas auto-expand to all nodes with the specified zone attribute, but the unassigned replicas leave the cluster in yellow state.
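For reference, a minimal reproduction of the settings involved looks roughly like this (the index name my_index and the 2×3-node layout are just the example from above):

    # elasticsearch.yml on each of the three "zone1" nodes
    node.zone: zone1

    # elasticsearch.yml on each of the three "zone2" nodes
    node.zone: zone2

    # settings on the affected index
    PUT /my_index/_settings
    {
      "index.auto_expand_replicas": "0-all",
      "index.routing.allocation.include.zone": "zone1"
    }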
See https://groups.google.com/forum/#!msg/elasticsearch/95hC-wGu7GE/BPPSWsfj8UkJ
I should add that this was observed using the latest master.
I'm wondering if we should deprecate auto-expand... It just seems like the wrong solution
IMO no, auto-expand is a very useful feature and a way for clusters to automatically grow to accommodate peaks in read-heavy workloads. It's just somewhat broken when shard-allocation rules aren't trivial.
I admit, though, that auto-expand does require multicast to be enabled (or predefining the IPs of machines dynamically provisioned during peaks), so at this point it does seem more like an exotic feature.
We used it in CirrusSearch so people with only a single node or two nodes don't end up with a yellow cluster. We document the auto-expand behavior and the implications for redundancy. I still don't particularly trust it, BUT it is a nice way to make installation smooth even in small environments. We use most of the CirrusSearch defaults both in development and on the production cluster and it helps with that.
Attached an "adoptme" tag, as this needs some further looking into to establish whether this is an issue.
The auto-expand feature is used internally to keep a copy of the .scripts index (stored scripts) on each node so this does need to work.
Having reread this issue, it seems that if we were able to limit the maximum number of auto-expand replicas to the number of nodes that would allow allocation (e.g. under awareness rules), then we'd avoid the yellow status.
Not sure how feasible this is.
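In the meantime, the effect can be approximated by hand by bounding the upper end of the range to the number of nodes that match the filter rather than using all, e.g. for the 3-node "zone1" example from the original report (my_index is again a placeholder):

    PUT /my_index/_settings
    {
      "index.auto_expand_replicas": "0-2"
    }

The obvious downside is that this has to be kept in sync manually whenever the set of eligible nodes changes, which is exactly what the automatic approach above would avoid.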
@dakrone Do you know whether @clintongormley's suggestion above is feasible?
@colings86 @clintongormley I looked at where the setting is updated. It's currently updated in MetaDataUpdateSettingsService, which submits a new cluster state update to change the number of replicas if it detects that the setting must be changed.
I think it would be possible to try and use the AllocationDeciders to adjust the number of replicas to the number of nodes where it can be allocated, but I don't think it's trivial.
Maybe it should be a separate setting? Instead of 0-all it can be 0-available or something?
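To illustrate, such a setting would be used in place of the existing range values (0-available does not exist; it is only the hypothetical value being proposed here):

    PUT /my_index/_settings
    {
      "index.auto_expand_replicas": "0-available"
    }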
I noticed that in ES 1.7 the .scripts index gets the 0-all setting automatically. My cluster is yellow because it is unable to expand replicas to nodes that are already full in data capacity, bounded by:

    "cluster.routing.allocation.disk.watermark.low": "80%",
    "cluster.routing.allocation.disk.watermark.high": "85%"

Is there a solution for this?
We have run into this issue (or a closely related one) twice in the last month on Cloud, both times on 2.x clusters (#1656, details here and here in the comments)
In particular it _appears_ (it's happening on production clusters, so unfortunately there's a limit to how much debugging we can do before it is necessary to reset the cluster state) that an unallocated "auto expand" index is preventing other indexes from migrating given a cluster.routing.allocation.exclude._name directive - is that even remotely possible? (here are my notes on what I saw the most recent time) ...
We are about to deploy a workaround that will temporarily disable "auto expand" for .scripts and .security while updating a cluster (and perhaps user indexes in the future), but .security settings cannot be changed in 2.x so this workaround is incomplete.
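For reference, the settings change that workaround boils down to is roughly the following (only a sketch of the per-index calls; the exact orchestration is Cloud-internal, and as noted the .security settings cannot be changed this way on 2.x):

    # temporarily disable auto-expand on .scripts before updating the cluster...
    PUT /.scripts/_settings
    {
      "index.auto_expand_replicas": "false"
    }

    # ...and restore it once the update has finished
    PUT /.scripts/_settings
    {
      "index.auto_expand_replicas": "0-all"
    }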
cc @alexbrasetvik
We spoke about this in fixit friday to keep this issue going. We came up with a possible different way of looking at the problem: for an index that has n-all set, we could just consider it fully allocated as long as at least max(1, n) replicas are allocated, since this is all the allocator can do at this point. We haven't fully fleshed out the consequences and how hard it would be to do that, but it might be a simpler solution, independent of the allocation deciders. /cc @ywelsch
Pinging @elastic/es-distributed
This issue can also manifest when decommissioning a data node where, for example, the node is holding a .security-6 copy and cluster-level allocation filtering has been applied by _ip. Applying shard allocation filtering does not remove shard copies of indices that have auto-expand configured. While this is not a detrimental problem, it can cause confusion, and it should be verified that any remaining shards on the node are there because auto-expand replicas are configured. It also requires the user to know which indices have auto-expand configured.
Perhaps shard allocation settings + shard allocation behavior with auto-expand replicas should be documented until a technical resolution is reached given how long this issue has been open.
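To see which indices have auto-expand configured, and to confirm why a copy is still sitting on the node being drained, something like the following can be used (the allocation explain request assumes 5.0+, and .security-6 shard 0 is only an example):

    GET /_all/_settings/index.auto_expand_replicas

    GET /_cluster/allocation/explain
    {
      "index": ".security-6",
      "shard": 0,
      "primary": false
    }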
@inqueue I've opened #30531 to document this.