There are use cases where we need a mechanism to balance primary shards across all the nodes, so that the number of primaries is uniformly distributed.
There are situations where primary shards end up unevenly distributed across the nodes. For instance, during a rolling restart the last node to restart won't have any primary shards, because the other nodes assume the primary role while it is down.
This issue has popped up on other occasions, and the usual answer has been that the primary/replica role is not an issue because the workload a primary or a replica assumes is similar. But there are important use cases where this does not apply.
For instance, in an indexing-heavy scenario where indexing must be implemented as a scripted upsert, the execution of the upsert logic falls onto the primary shards, and the replicas just have to index the result.
In these cases, unbalanced primaries exert a bigger workload on the nodes hosting the primaries. This can even cap the cluster's capacity, because the bottleneck becomes the capacity of the nodes hosting the primaries rather than the combined capacity of all the cluster's nodes.
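As a hedged illustration of the scripted-upsert case above (the index name, document id and script are made up), the request looks roughly like this; the script runs only on the primary copy that owns the document, while the replicas merely index the document it produces:

import requests  # plain HTTP client; any client would do

ES = "http://localhost:9200"  # assumed cluster address

# Scripted upsert: the painless script executes on the primary shard that owns
# the document; the replicas only receive and index the resulting document.
body = {
    "scripted_upsert": True,
    "script": {
        "lang": "painless",
        "source": "ctx._source.counter = (ctx._source.counter == null ? 1 : ctx._source.counter + params.delta)",
        "params": {"delta": 1},
    },
    "upsert": {},  # created from scratch if the document does not exist yet
}

resp = requests.post(f"{ES}/my-index/_update/doc-1", json=body)  # _update endpoint as in ES 7.x
resp.raise_for_status()
print(resp.json()["result"])  # "created" on first call, "updated" afterwards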
Related thread in official forum
There are some workarounds for this situation today, but they are not efficient:
As an example, consider a simplified scenario with 3 nodes and 2 shards, each with 2 replicas (3 copies per shard), where rerouting is not possible:
Node 0: Shard 0 (primary), Shard 1 (primary)
Node 1: Shard 0 (replica), Shard 1 (replica)
Node 2: Shard 0 (replica), Shard 1 (replica)
Rerouting is not possible here: we cannot move shard 0 (primary) from Node 0 to Node 1 to swap it with shard 0 (replica), because a node cannot hold two copies of the same shard, and every node already holds a copy of every shard.
Compare this with a scenario like 3 nodes and 3 shards, each with 1 replica (2 copies per shard), where rerouting is possible:
Node 0: Shard 0 (primary), Shard 1 (primary)
Node 1: Shard 0 (replica), Shard 2 (primary)
Node 2: Shard 1 (replica), Shard 2 (replica)
But even when rerouting is possible, it means that shard data has to be moved from one node to another (I/O and network...).
Also, there is no automated way to detect which primaries are unbalanced and which shards can be swapped, and to execute that rerouting in small chunks so as not to overload node resources. But implementing a utility or script that does that is feasible (see possible solutions).
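A minimal sketch of the detection part, assuming the cluster is reachable on localhost:9200 (the address and column choice are assumptions): it reads _cat/shards and reports how many started primaries each node holds, which is the information such a utility would need before planning any moves.

from collections import Counter
import requests

ES = "http://localhost:9200"  # assumed cluster address

# _cat/shards lists every shard copy with its role (p = primary, r = replica),
# its state and the node currently hosting it.
shards = requests.get(
    f"{ES}/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,state,node"},
).json()

primaries = Counter(s["node"] for s in shards if s["prirep"] == "p" and s["state"] == "STARTED")
totals = Counter(s["node"] for s in shards if s["state"] == "STARTED")

for node in sorted(totals):
    print(f"{node}: {primaries.get(node, 0)} primaries out of {totals[node]} shards")

# Nodes whose primary count is well above the average are candidates to hand a
# primary over; the actual moves should still be applied in small chunks.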
So, let's consider possible solutions (take into account that I don't know the Elasticsearch internals):
Pinging @elastic/es-distributed
For what it's worth, https://github.com/datarank/tempest does this. Disclaimer: I am one of the authors of that project. I am hoping to port it to more recent versions of ES at some point.
The trickiest part of this feature is that today there is no way to demote a primary back to being a replica. The existing mechanism requires us to fail the primary, promote a specific replica, and then perform a recovery to replace the failed primary as a replica. Until the recovery is complete the shard is running with reduced redundancy, and I don't think it's acceptable for the balancer to detract from resilience like this. Therefore, I'm labelling this as high-hanging fruit.
I've studied Tempest a little and I don't think anything in that plugin changes this. I also think it does not balance the primaries out in situations like the one described in the OP:
Node 0: Shard 0 (primary), Shard 1 (primary)
Node 1: Shard 0 (replica), Shard 1 (replica)
Node 2: Shard 0 (replica), Shard 1 (replica)
At least, there doesn't seem to be a test showing that it does and I couldn't find a mechanism that would have helped here after a small amount of looking. I could be wrong, I don't know the codebase (or Kotlin) very well, so a link to some code or docs to help us understand Tempest's approach would be greatly appreciated.
Here's a real-life scenario.
Take this index:
Some spot instance types are more vulnerable than others, so over time we slowly see primaries moving over to the long-lived instances as the short-lived ones are being reclaimed by AWS.
With this setup we're "at risk" of experiencing both the skew in shard sizes from the custom routing (what Tempest tries to solve) as well as the skew in indexing load (what this issue is addressing).
The shard sizes we can deal with on the application side, but the primaries bundling up is a bigger problem. Given the number of replicas, it's hard to do any meaningful custom balancing. Taking a primary out of rotation by moving it would choose a new primary at random, so trying to balance this way could simply result in an endless cycle of shuffling. Never mind the performance impact and potentially the cross-AZ network costs too.
We could try the "bag of hardware" solution, but we'd need to add a new instance type to the spot pool for each additional instance to mitigate spot wipeout, and we'd consequently need to increase the replica count as well. Storage costs would increase, and it might not even solve the problem. We could also move to regular instances, but that's not great either for obvious reasons.
If, at the very least, we could choose which shard was primary, we could solve this problem with a custom solution (without the cost overhead). Having native support for balancing primaries would obviously solve this too.
Found out that ES doesn't promote a replica when moving the primary - it'll literally just move the primary. Duh.. Anyway, knowing that, it was much simpler to balance the primaries, so for anyone finding this and having the same issue, I wrote this tool to solve it: https://github.com/hallh/elasticsearch-primary-balancer.
I've run this on our production cluster a few times already and balancing primaries did solve the issue we had with sudden spikes of write rejections under otherwise consistent load.
Amazing work hallh.
But it's funny, because I also realized that a primary and a replica could be swapped via the cluster reroute (relocation) API as a way to balance primaries (inefficient, but effective).
What's more, I also developed such a tool to evaluate the cluster state, generate a migration plan and execute it (I'm talking 4 days ago... so our work overlapped :-( )
So now we have two primary balancer tools...
I will try to publish it too as soon as I consider it production ready, but yeah, these last days it has been rebalancing a production cluster with no issues, so if I manage to invest some more time it could be ready soon.
Also expect feedback, as we have some common points where we can improve the tools.
Now, as a long-term solution, we will try to push for an Elasticsearch capability to move the primary role from one shard copy to another. It may be slow, but being backed by paid support may help.
Once we have this functionality it would be trivial to modify the rebalancing tools to just "move the primary roles" instead of swapping the shards along with their data (that would be fast and efficient).
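As a hedged sketch of the swap approach described above (the index and node names are invented), a single _cluster/reroute call can move a primary of one shard from the overloaded node to an under-loaded node and pull back a replica of a different shard in exchange, so per-node shard counts stay level. Note that this still relocates shard data over the network, which is exactly the inefficiency a native "move the primary role" operation would avoid.

import requests

ES = "http://localhost:9200"  # assumed cluster address

# Swap roles between two nodes using plain relocations: node-a gives away the
# primary of shard 0 and receives a replica of shard 2 from node-c in exchange.
# This only works if node-c holds no copy of shard 0 and node-a no copy of shard 2.
swap = {
    "commands": [
        {"move": {"index": "my-index", "shard": 0, "from_node": "node-a", "to_node": "node-c"}},
        {"move": {"index": "my-index", "shard": 2, "from_node": "node-c", "to_node": "node-a"}},
    ]
}

resp = requests.post(f"{ES}/_cluster/reroute", json=swap)
resp.raise_for_status()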
Posting some suggestions for elasticsearch-primary-balancer here.
As an update, I will post my rebalancing tool too, sooner rather than later. I've been busy supporting cluster routing allocation awareness, as some of my clusters are using it, and that further limits the ways you can try to balance shard roles.
I also encountered the same problem. When the primary shards are all on the same node, the load on that node can be very high.
Shard allocation:
xxxx-36404-2018.11 1 r STARTED 11660265 2gb 10.234.242.30 node1
xxxx-36404-2018.11 1 p STARTED 11660265 2gb 10.234.242.31 node2
xxxx-36404-2018.11 0 r STARTED 11662979 2gb 10.234.242.30 node1
xxxx-36404-2018.11 0 p STARTED 11662979 2gb 10.234.242.31 node2
xxxx-36411-2019.06 1 r STARTED 28357804 6.6gb 10.234.242.30 node1
xxxx-36411-2019.06 1 p STARTED 28357804 6.6gb 10.234.242.31 node2
xxxx-36411-2019.06 0 r STARTED 28359462 6.4gb 10.234.242.30 node1
xxxx-36411-2019.06 0 p STARTED 28359462 6.4gb 10.234.242.31 node2
xxxx-36409-2018.12 1 r STARTED 21875206 4.6gb 10.234.242.30 node1
xxxx-36409-2018.12 1 p STARTED 21875206 4.6gb 10.234.242.31 node2
xxxx-36409-2018.12 0 r STARTED 21872417 4.6gb 10.234.242.30 node1
xxxx-36409-2018.12 0 p STARTED 21872417 4.6gb 10.234.242.31 node2
xxxx-36401-2018.11 1 r STARTED 37610364 6.9gb 10.234.242.30 node1
xxxx-36401-2018.11 1 p STARTED 37610364 6.9gb 10.234.242.31 node2
xxxx-36401-2018.11 0 r STARTED 37619043 6.9gb 10.234.242.30 node1
xxxx-36401-2018.11 0 p STARTED 37619043 6.9gb 10.234.242.31 node2
xxxx-36404-2019.01 1 r STARTED 12161838 2.9gb 10.234.242.30 node1
xxxx-36404-2019.01 1 p STARTED 12161838 2.9gb 10.234.242.31 node2
xxxx-36404-2019.01 0 r STARTED 12161757 2.9gb 10.234.242.30 node1
xxxx-36404-2019.01 0 p STARTED 12161757 2.9gb 10.234.242.31 node2
xxxx-36408-2018.10 1 r STARTED 58700020 10.1gb 10.234.242.30 node1
xxxx-36408-2018.10 1 p STARTED 58700020 10.1gb 10.234.242.31 node2
xxxx-36408-2018.10 0 r STARTED 58692558 10.1gb 10.234.242.30 node1
xxxx-36408-2018.10 0 p STARTED 58692558 10.1gb 10.234.242.31 node2
xxxx-36405-2019.11 1 r STARTED 12921855 3.1gb 10.234.242.30 node1
xxxx-36405-2019.11 1 p STARTED 12921855 3.2gb 10.234.242.31 node2
xxxx-36405-2019.11 0 r STARTED 12924625 3.1gb 10.234.242.30 node1
xxxx-36405-2019.11 0 p STARTED 12924625 3.1gb 10.234.242.31 node2
...
Task distribution is as follows:
action task_id parent_task_id type start_time timestamp running_time ip node
indices:data/read/scroll GDS6EwMwRAavA9NACh4nXg:37417628 - transport 1574045549545 10:52:29 10.6s 10.234.242.30 node1
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999023 GDS6EwMwRAavA9NACh4nXg:37417628 netty 1574045553089 10:52:33 7.2s 10.234.242.31 node2
indices:data/read/scroll GDS6EwMwRAavA9NACh4nXg:37417629 - transport 1574045550097 10:52:30 10.1s 10.234.242.30 node1
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999049 GDS6EwMwRAavA9NACh4nXg:37417629 netty 1574045554929 10:52:34 5.3s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999058 GDS6EwMwRAavA9NACh4nXg:37417629 netty 1574045555700 10:52:35 4.6s 10.234.242.31 node2
indices:data/read/scroll GDS6EwMwRAavA9NACh4nXg:37417662 - transport 1574045553203 10:52:33 7s 10.234.242.30 node1
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999235 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045557817 10:52:37 2.4s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999236 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045557819 10:52:37 2.4s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999237 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045557842 10:52:37 2.4s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999238 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045557868 10:52:37 2.4s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999239 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045557899 10:52:37 2.4s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999240 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045557933 10:52:37 2.3s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999248 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045558112 10:52:38 2.2s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999249 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045558223 10:52:38 2s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999251 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045558264 10:52:38 2s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999252 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045558268 10:52:38 2s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999253 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045558294 10:52:38 2s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999254 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045558351 10:52:38 1.9s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999255 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045558398 10:52:38 1.9s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999256 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045558408 10:52:38 1.9s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999264 GDS6EwMwRAavA9NACh4nXg:37417662 netty 1574045558515 10:52:38 1.8s 10.234.242.31 node2
indices:data/read/scroll GDS6EwMwRAavA9NACh4nXg:37417765 - transport 1574045558454 10:52:38 1.7s 10.234.242.30 node1
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999344 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045558696 10:52:38 1.6s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999345 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045558731 10:52:38 1.5s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999346 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045558758 10:52:38 1.5s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999348 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045558823 10:52:38 1.4s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999349 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045558845 10:52:38 1.4s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999351 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045558893 10:52:38 1.4s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999353 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045558948 10:52:38 1.3s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999352 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045558948 10:52:38 1.3s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999354 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045558961 10:52:38 1.3s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999356 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559141 10:52:39 1.1s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999357 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559151 10:52:39 1.1s 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999360 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559479 10:52:39 836.6ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999362 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559485 10:52:39 829.9ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999363 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559528 10:52:39 786.8ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999364 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559535 10:52:39 780.3ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999365 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559573 10:52:39 741.8ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999366 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559693 10:52:39 622.2ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999368 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559727 10:52:39 588.1ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999369 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559763 10:52:39 551.8ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999370 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559765 10:52:39 550ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999371 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559785 10:52:39 530.4ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999372 GDS6EwMwRAavA9NACh4nXg:37417765 netty 1574045559793 10:52:39 522.3ms 10.234.242.31 node2
indices:data/read/search PP7k8Oo9RIqMcRboQS3n9w:119999406 - transport 1574045559857 10:52:39 457.8ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id] PP7k8Oo9RIqMcRboQS3n9w:119999569 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct 1574045559944 10:52:39 371ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id] PP7k8Oo9RIqMcRboQS3n9w:119999573 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct 1574045559998 10:52:39 316.9ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id] PP7k8Oo9RIqMcRboQS3n9w:119999574 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct 1574045560005 10:52:40 310.2ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id] PP7k8Oo9RIqMcRboQS3n9w:119999581 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct 1574045560075 10:52:40 240.7ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id] PP7k8Oo9RIqMcRboQS3n9w:119999582 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct 1574045560155 10:52:40 159.8ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id] PP7k8Oo9RIqMcRboQS3n9w:119999584 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct 1574045560172 10:52:40 143.1ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id] PP7k8Oo9RIqMcRboQS3n9w:119999586 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct 1574045560211 10:52:40 103.8ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id] PP7k8Oo9RIqMcRboQS3n9w:119999588 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct 1574045560273 10:52:40 41.9ms 10.234.242.31 node2
indices:data/read/search[phase/fetch/id] PP7k8Oo9RIqMcRboQS3n9w:119999589 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct 1574045560296 10:52:40 19.1ms 10.234.242.31 node2
indices:data/read/search PP7k8Oo9RIqMcRboQS3n9w:119999559 - transport 1574045559942 10:52:39 372.7ms 10.234.242.31 node2
indices:data/read/scroll PP7k8Oo9RIqMcRboQS3n9w:119999570 - transport 1574045559980 10:52:39 335.2ms 10.234.242.31 node2
indices:data/read/search PP7k8Oo9RIqMcRboQS3n9w:119999572 - transport 1574045559990 10:52:39 325.2ms 10.234.242.31 node2
indices:data/read/scroll PP7k8Oo9RIqMcRboQS3n9w:119999583 - transport 1574045560158 10:52:40 157.1ms 10.234.242.31 node2
cluster:monitor/tasks/lists GDS6EwMwRAavA9NACh4nXg:37417903 - transport 1574045560189 10:52:40 43.7ms 10.234.242.30 node1
cluster:monitor/tasks/lists[n] GDS6EwMwRAavA9NACh4nXg:37417904 GDS6EwMwRAavA9NACh4nXg:37417903 direct 1574045560221 10:52:40 12.5ms 10.234.242.30 node1
cluster:monitor/tasks/lists[n] PP7k8Oo9RIqMcRboQS3n9w:119999587 GDS6EwMwRAavA9NACh4nXg:37417903 netty 1574045560239 10:52:40 75.7ms 10.234.242.31 node2
Any good solutions?
+1, we also ran into this case in our production cluster.
+1, we would like to have this functionality.
Some nodes are highly overloaded due to this very case.
And if a node in the cluster gets rebooted, it loses its primary shards, which makes the cluster even more unbalanced, and the primaries do not rebalance after the node is re-added.
+1 .. same issue here. 2-data-node cluster.. primaries evenly balanced. One node died... all primaries ended up on the live node (good). Bring the dead node back up, and we have all primaries on one node and all replicas on the other. It won't rebalance the primaries. This means all writes go to one node... plus aggregate reads all come from one node. There is no split usage across nodes.
Weirdly.. I am sure that old versions of ES did this (not sure when it stopped).. but we really do need to have that ability put in.
Scratch that... found the way.
The 'cancel' command on the primary cancels the current primary (on node 2).. promotes the replica (on node 1) to primary... and the new primary then dumps its records back to node 2... cool...
POST /_cluster/reroute
{
  "commands": [
    {
      "cancel": {
        "index": "my-index",
        "shard": 3,
        "node": "my-data-2",
        "allow_primary": true
      }
    }
  ]
}
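For anyone who wants to automate this, here is a rough, hedged sketch under the same assumptions (the cluster address and node name are placeholders): it cancels roughly half of the started primaries on the overloaded node, one at a time, waiting for the cluster to go green again in between so that only one shard at a time runs with reduced redundancy while its cancelled copy is rebuilt as a replica.

import requests

ES = "http://localhost:9200"   # assumed cluster address
OVERLOADED = "my-data-2"       # node currently holding too many primaries

shards = requests.get(
    f"{ES}/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,state,node"},
).json()

primaries = [s for s in shards
             if s["node"] == OVERLOADED and s["prirep"] == "p" and s["state"] == "STARTED"]

# In a two-node cluster, demoting about half of them evens things out; with
# more nodes the target should be the average number of primaries per node.
for s in primaries[: len(primaries) // 2]:
    cmd = {"commands": [{"cancel": {
        "index": s["index"],
        "shard": int(s["shard"]),
        "node": OVERLOADED,
        "allow_primary": True,
    }}]}
    requests.post(f"{ES}/_cluster/reroute", json=cmd).raise_for_status()
    # Wait until the cancelled copy has been rebuilt as a replica before
    # touching the next shard, to limit the window of reduced redundancy.
    requests.get(f"{ES}/_cluster/health",
                 params={"wait_for_status": "green", "timeout": "30m"})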