Elasticsearch: Remove allocation commands from the `_reroute` API

Created on 10 Jun 2016 · 18 Comments · Source: elastic/elasticsearch

The _reroute API is pretty handy if a user wants to kick off an allocation round when the cluster (due to a bug) failed to reroute shards, etc. We had some of these cases in 1.7, for instance when we added delayed allocation. Aside from this, it serves several other purposes.

I think this stuff needs to go. We can't offer APIs like this where basically nothing in the docs tells you:

  • this is an expert API
  • 99% of the time you are going to use it, you should either use a different API, report a bug, or just not mess with the cluster at all
  • use a cmd tool to repair state on disk so primaries can be allocated (we don't have that yet, I know, but you get the drift)

I spend so much time pulling folks out of the dirt after they use this that I don't think it's worth it.

Labels: :CorFeatureIndices APIs, help wanted, v6.0.3

All 18 comments

In the years I've used Elasticsearch I've had two valid uses for it:

  1. To force assign a missing primary and accept that data loss.
  2. To jiggle the allocator to make it start when it stopped.

Other than that I've only ever misused it. For those reading along: you don't usually want to move shards around with this API because it doesn't pin them where you put them. If you want to pin a shard someplace you should use allocation filtering. It is much more flexible and actually works.

I'd be quite happy with an API to force assignment of an empty primary and one to jiggle the allocator. A command line tool is less nice because I'd have to connect to the right node and stuff. Also, I'm not sure it'd work in the case where the data is totally gone?

I remember once I reformatted a few machines when I didn't have any replicas. I wasn't paying close attention...
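For those reading along, a minimal sketch of that first use case (force-assigning a missing primary and accepting the data loss), assuming a cluster on localhost:9200 and placeholder index/node names; in 5.x the command is allocate_empty_primary with an explicit accept_data_loss flag, whereas 1.x/2.x used allocate with allow_primary: true:

# WARNING: creates an empty primary and discards whatever data the lost copy held (5.x syntax)
curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "my-index",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}'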

  1. To jiggle the allocator to make it start when it stopped.

this doesn't need an allocation command right? all you need to do is to run _reroute with an empty body?

I'd be quite happy with an API to force assignment of an empty primary and one to jiggle the allocator. A command line tool is less nice because I'd have to connect to the right node and stuff. Also, I'm not sure it'd work in the case where the data is totally gone?
I remember once I reformatted a few machines when I didn't have any replicas. I wasn't paying close attention...

I think in such a case we should have a cmd tool that creates such an empty primary shard on disk and we let _reroute?fetch_stores=true go and figure out the rest?

this doesn't need an allocation command right? all you need to do is to run _reroute with an empty body?

I don't believe it is required. I think I always added some small one just in case.
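For reference, the "jiggle" really is just a reroute with no commands at all; a minimal sketch, assuming a cluster on localhost:9200:

# kick off a fresh allocation round without issuing any explicit allocation commands
curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{}'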

I think in such a case we should have a cmd tool that creates such an empty primary shard on disk and we let _reroute?fetch_stores=true go and figure out the rest?

Sorry, I don't know that API. I still don't like having to pick a machine on which to create the empty shard. That was one of the problems with the allow_primary command - you had to pick a node rather than just let Elasticsearch pick a decent choice. It was just another thing to have to think about when you've busted your cluster.

Sorry, I don't know that API. I still don't like having to pick a machine on which to create the empty shard. That was one of the problems with the allow_primary command - you had to pick a node rather than just let Elasticsearch pick a decent choice. It was just another thing to have to think about when you've busted your cluster.

I made that API up :) I think this is something that you should never use a REST endpoint for. When something is seriously fucked up, let's build one-off tools for those situations.

I have used the reroute API in many cases to get ES "unstuck" from either initializing or unassigned shards. I'm not sure how else I'd recover a cluster besides restarting it (sometimes this means a full cluster restart). Since Elasticsearch is often the primary means of searching for public services, a full cluster restart is rarely an option (in particular, replacing the cluster for a reason such as this is quite expensive from an infrastructure perspective). Restarting individual nodes gets progressively less painful with every release, but I doubt it will ever be a trivial performance cost.

In older versions of Elasticsearch, I found myself using the reroute API more frequently because it was often unclear why ES would refuse to allocate shards. I think this has also gotten better with every release, and particularly in 2.x, the state of shards (or problems allocating them) gets clearer.

I guess the reroute API gives a (false?) sense of security, since it gives the administrator the hammer with which to enforce a shard movement / allocation, and that just makes a guy feel in control.

I think this is something that you should never use a REST endpoint for. When something is seriously fucked up, let's build one-off tools for those situations.

I can see your argument for removing this from the REST API and, for the functionality that is not satisfied by existing APIs or other means, moving to a command line tool. Of course, to complement this new tool, it would be really nice if Elasticsearch could identify the node(s) that are good candidates for the allocation, or offer some API to "propose" an allocation where Elasticsearch suggests which nodes are good candidates. That's not more convenient than the REST API, but at least it doesn't suddenly make the process of such allocations go from convenient to painful.

That was one of the problems with the allow_primary command - you had to pick a node rather than just let Elasticsearch pick a decent choice. It was just another thing to have to think about when you've busted your cluster.

Hm, I would always just allocate to a random node. ES will rebalance afterwards if it isn't happy with the allocation. But I suppose ES could make the destination node optional and pick something better than "random" for you.

The reroute API has been a critical tool for keeping the cluster stable when the default allocator over-allocated a node. This was a fairly normal case for us before we switched to tempest, due to our shard sizes not being equal.

Balance aside, I've had to unallocate shards and let them "rebuild" because their translogs got out of sync and had deletes showing up on one shard but not another.

In a perfect world, maybe it's not needed, but I don't think ES is there yet. Yeah, it's a really big hammer but I can't think of another tool that would have solved our problems.

The reroute API has been a critical tool for keeping the cluster stable when the default allocator over-allocated a node. This was a fairly normal case for us before we switched to tempest, due to our shard sizes not being equal.

So you took over the entire shard allocation process and managed everything yourself? I am asking because otherwise the balancer will kick in and reverse your decisions at some random point in time.

Balance aside, I've had to unallocate shards and let them "rebuild" because their translogs got out of sync and had deletes showing up on one shard but not another.

this is a rare special case that I am against having an API for; it should be a command-line tool.

In a perfect world, maybe it's not needed, but I don't think ES is there yet. Yeah, it's a really big hammer but I can't think of another tool that would have solved our problems.

I haven't seen any reason here that convinced me to not remove it. It's too much of a hammer. If the balancer is not smart enough for a use case, we have to fix it. If we need more allocation deciders, we have to add them. We can't offer a hammer and expect the user to know how to use it.

+1 on this suggestion; a simple reroute API to kickstart a reroute is handy. A way to (1) force a primary to be allocated and (2) stop shard allocation is also very handy; I don't mind if it is an API (with an extra protection flag, or confirmation-based execution based on a random token) or a command line tool.

the balancer will kick in and reverse your decisions at some random point in time

That is not entirely true. It tries to balance by shard _count_ so as long as I swapped a large shard for a small shard the balancer left things alone.

I haven't seen any reason here that convinced me to not remove it

As long as the same functionality exists somewhere then I guess I don't really have much of an argument.

My biggest concern is that a production cluster gets into a resource-starved or hot-spot-heavy state and admins have no recourse. Balancers aren't perfect, configuration can miss edge cases, and bugs happen, so make sure the hammer exists somewhere to get the system stable.

My biggest concern is that a production cluster gets into a resource-starved or hot-spot-heavy state and admins have no recourse. Balancers aren't perfect, configuration can miss edge cases, and bugs happen, so make sure the hammer exists somewhere to get the system stable.

See, this is my argument: I can't support anybody in the community that just goes and uses that hammer. Hence I have to remove the hammer. Edge cases happen, but that doesn't mean I am going to build APIs that work around ANY safety mechanism available. The reason "something might happen" isn't valid here, given how many people get into the "something happened" situation because of that hammer.

As long as the same functionality exists somewhere then I guess I don't really have much of an argument.

yeah no it won't exist.

That is not entirely true. It tries to balance by shard count so as long as I swapped a large shard for a small shard the balancer left things alone.

A combination of allocation filtering and setting total_shards_per_node on the index level is going to be more permanent. It might not be enough to get the cluster safe, but in that case we should address that issue. Anything you do with the reroute API is going to be undone by the balancer eventually because it doesn't set up constraints. That is why it isn't a good API. It works temporarily.
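As a rough sketch of that more permanent alternative (placeholder index and node names, assuming a cluster on localhost:9200), both allocation filtering and the per-node shard cap are plain index settings:

# pin my-index to node-1 and cap how many of its shards any single node may hold
curl -XPUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '{
  "index.routing.allocation.require._name": "node-1",
  "index.routing.allocation.total_shards_per_node": 2
}'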

So, what is the hypothetical solution to the problem of a cluster with a hot spot that is killing bulk loading performance? Or the case where, since ES can't predict the size of shards, it puts two large shards on the same node and causes resource/hot-spotting issues? These are both very real issues that we have faced, not hypotheticals, and as far as I can tell they are unavoidable without a bug-free, configuration-perfect, very, very advanced balancer (or hardware overkill).

I agree with @nik9000 that the use of reroute is a temporary solution, but the alternative of having NO solution in a production environment is very scary. If a bug is discovered in a live environment, I don't see waiting on a code fix as being feasible. To me, this isn't a "might happen" but more of a "when will it happen" case.

Anyway, I rest my case, I don't want to take over or derail the conversation here.

So, what is the hypothetical solution to the problem of a cluster with a hot spot that is killing bulk loading performance? Or the case where, since ES can't predict the size of shards, it puts two large shards on the same node and causes resource/hot-spotting issues? These are both very real issues that we have faced, not hypotheticals, and as far as I can tell they are unavoidable without a bug-free, configuration-perfect, very, very advanced balancer (or hardware overkill).

I have been thinking about this for a while and I think the solution is to give the user more control over how the balancer handles individual indices. The ultimate flexibility here would be adding a weight to an index that the balancer can take into account. This weight would be updatable, such that the user can reduce the weight if the index goes read-only or raise it if bulk indexing happens. This could even be combined with a node-level max weight, where certain nodes can only hold shards up to a certain weight. I think what we won't do as a start is make the weight dynamic in terms of changing it automatically if the indexing rate drops; it's too fragile and might change too quickly. But for your situation this seems to be the right solution, and the user knows much better if an index is much bigger than another index. It could still be a function of the # of docs or so, but it's up to the user I guess.
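To make the proposal concrete, a sketch of how such a knob might look if it were exposed as an index setting; note that index.allocation.balance.weight is a hypothetical name invented here for illustration and does not exist in Elasticsearch:

# HYPOTHETICAL: "index.allocation.balance.weight" is not a real setting, shown only to illustrate the idea
curl -XPUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '{
  "index.allocation.balance.weight": 2.0
}'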

As an update, I ran into a use-case for the reroute API today in Elasticsearch 1.7.3. I had a cluster in yellow state with 2242 unassigned shards. When I looked at the shards API, I saw that they were all the third replica of all of the indices in the cluster. The day before, the cluster had a node go unhealthy and recover during heavy indexing / index creation. I restarted the node that went unhealthy to see if Elasticsearch would start assigning shards again, and it didn't do anything. So I then used the reroute API to assign one of the replicas to a random node -- I expected to get an error with a message from Elasticsearch explaining why the shard couldn't be assigned. To my surprise, though, the replica shard _was_ assigned. So I proceeded to assign the rest in the same way.

These were replica shards, so a command line tool to create empty primaries would not have helped as a replacement for the reroute API. I'm not sure what else I could have done to recover the cluster, except muck with the allocation settings and try to convince Elasticsearch to move shards around in some way and hopefully pick up and assign those other shards.
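For context, the command used here looks roughly like the following in 1.x/2.x (placeholder index, shard, and node values, assuming a cluster on localhost:9200; without allow_primary the allocate command assigns an unassigned replica copy):

# assign an unassigned replica of shard 2 of my-index to node-7 (1.x/2.x syntax)
curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
  "commands": [
    { "allocate": { "index": "my-index", "shard": 2, "node": "node-7" } }
  ]
}'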

At the very least, it would have been really helpful if ES had logged errors to indicate the reason why those replicas were never assigned so I could know in what way I needed to adjust my cluster settings and/or topology to satisfy the allocation requirements.

Here's the cluster health API output after I started assigning shards:

{
  "cluster_name": "anon_polloi_agrippasrc_lyamtestfiltalloca_elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 24,
  "number_of_data_nodes": 18,
  "active_primary_shards": 7548,
  "active_shards": 20295,
  "relocating_shards": 21,
  "initializing_shards": 2,
  "unassigned_shards": 2347,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 1459,
  "number_of_in_flight_fetch": 17040
}

Here's what it looked like before I did anything:

{
  "cluster_name": "anon_polloi_agrippasrc_lyamtestfiltalloca_elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 24,
  "number_of_data_nodes": 18,
  "active_primary_shards": 7548,
  "active_shards": 20222,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 2422,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 17572
}

You have to wait until "number_of_in_flight_fetch": 17572 goes to 0; the cluster is still looking for existing copies of those shards. See, this is exactly why I think this API is trappy and must go away.
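In other words, watching the fetch counter in cluster health (assuming a cluster on localhost:9200) tells you whether the master is still gathering shard store information before it will assign those replicas on its own:

# poll cluster health and wait for number_of_in_flight_fetch to drop to 0
while true; do
  curl -s 'localhost:9200/_cluster/health?pretty' | grep number_of_in_flight_fetch
  sleep 10
done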

I think this stuff needs to go. We can't offer APIs like this where basically nothing in the docs tells you:

  • this is an expert API
  • 99% of the time you are going to use it, you should either use a different API, report a bug, or just not mess with the cluster at all
  • use a cmd tool to repair state on disk so primaries can be allocated (we don't have that yet, I know, but you get the drift)

I spend so much time pulling folks out of the dirt after they use this that I don't think it's worth it.

Maybe changing the docs to point these things out, while still leaving users with the flexibility, is a better solution.

Reporting a bug when we find one is definitely a good idea, but if it occurs in production it's nice to have a workaround; you can't wait for the bug to be fixed. Granted, ES is now more stable than it used to be (and thanks so much for the titanic work on making it happen!), but I'm still scared of not having the option to allocate shards manually in various corner cases.

After chatting with @s1monw we prefer to close this issue as a clear indication that we are not going to work on this at this time. We are always open to reconsidering this in the future based on compelling feedback; despite this issue being closed please feel free to leave feedback on the proposal (including +1s).
