Elasticsearch: Moving replica shard involves primary shard, even across zones

Created on 10 Apr 2018 · 8 comments · Source: elastic/elasticsearch


Elasticsearch version: 6.1.2

JVM version: 8u131

Description of the problem including expected versus actual behavior:

It appears a MoveAllocationCommand on a replica shard copies the primary shard rather than copying the replica. I suspect this because moving a shard from B to C causes heavy network activity on A, where the primary shard is located.

Steps to reproduce:

  1. Set up 3 ES nodes, "A", "B" and "C"
  2. Create an index with 1 shard and 1 replica
  3. Disable allocation
  4. Let the primary shard be on A and the replica shard on B
  5. Send a _cluster/reroute command to move the replica from B to C (see the command sketch after this list)
  6. Notice that node A gets involved (heavy network traffic on A)
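
A rough sketch of steps 2 through 5 against the REST API. The index name test is a placeholder, and the from_node/to_node values must be the actual node names reported by _cat/shards; as far as I can tell, explicit reroute commands are still honoured even with allocation disabled:

    # Step 2: create an index with 1 shard and 1 replica ("test" is a placeholder name).
    curl -XPUT 'localhost:9200/test' -H 'Content-Type: application/json' -d '{
      "settings": { "number_of_shards": 1, "number_of_replicas": 1 }
    }'

    # Step 3: disable allocation so nothing moves except by explicit command.
    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
      "transient": { "cluster.routing.allocation.enable": "none" }
    }'

    # Step 4: check where the primary (p) and the replica (r) ended up.
    curl 'localhost:9200/_cat/shards/test?v'

    # Step 5: move the replica (shard 0) from node B to node C, using the node
    # names shown by _cat/shards above.
    curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
      "commands": [
        { "move": { "index": "test", "shard": 0, "from_node": "B", "to_node": "C" } }
      ]
    }'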

This behaviour is bad because ES has a tendency to put all primary shards on one node; since every recovery is sourced from the primary, that node gets overwhelmed and shard movement becomes slow. Also, a shard movement that happens within one zone causes network traffic between zones if the primary is elsewhere.
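
To confirm where the bytes come from, one can watch the recovery while the move is in flight. This check is not from the original report, just a straightforward way to reproduce the observation (the test index matches the sketch above):

    # While the move from B to C is running, the source_node column shows A
    # (the node holding the primary), not B where the replica copy lives.
    curl 'localhost:9200/_cat/recovery/test?v&active_only=true&h=index,shard,type,stage,source_node,target_node'

    # Count primary shards per node, to see how skewed the primary placement is.
    curl -s 'localhost:9200/_cat/shards?h=node,prirep' | awk '$2 == "p" { print $1 }' | sort | uniq -c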

All 8 comments

This behavior is by design. Today, recovery sources from the primary shard, and the act of moving a shard triggers a recovery to the target shard. The reason that recovery takes place from the primary shard is that recovery has to be coordinated carefully with replication to ensure that the newly recovered target shard does not miss any operations: there is a moment when the shard is flipped to being a replication target, and we have to ensure that all operations before that moment arrive on the replica via recovery and all operations after that moment arrive via replication. As replication is driven by the primary, this dictates that the source of the recovery is the primary shard. Trying to do this coordination across three nodes (an existing replica copy, the primary, and the target) sounds fraught with peril, and we are unlikely to invest effort in this at this time.

I can appreciate this design decision. Unfortunately, I used Elasticsearch's historical super-speed-but-lack-of-ACID-compliance as a feature. The consequence is that shard recovery and movement are now 10x or 20x slower than they used to be. Moving our many terabytes of data goes from about a day to many weeks.

Could you point me to the code that makes this decision? I will have to patch it locally, I guess.

Thank you

There is a small hope that someone may be fixing this: https://github.com/elastic/elasticsearch/issues/8369

Here is a better description of my problem: https://github.com/elastic/elasticsearch/issues/12279

I looked through some code; maybe the decision point is here: https://github.com/elastic/elasticsearch/blob/ec657109267e801024c90d144dc84ec58bd3aa1c/server/src/main/java/org/elasticsearch/cluster/routing/RoutingNodes.java#L487

I will look more tomorrow.

Bah! git blame seems to indicate that check is for something else. I am too tired.

I can appreciate this design decision. Unfortunately, I used Elasticsearch's historical super-speed-but-lack-of-ACID-compliance as a feature. The consequence is that shard recovery and movement are now 10x or 20x slower than they used to be.

This behavior is as old as time and has never been different. I suggest you open a topic on discuss.elastic.co and we can help figure out what exactly has caused this change.

This behavior is as old as time and has never been different. I suggest you open a topic on discuss.elastic.co and we can help figure out what exactly has caused this change.

@bleskes, Elasticsearch version 1.7 does not have this problem. Also, I am not sure what change you are referring to when you say "figure out what exactly has caused this change". It has recently become an issue as we upgrade our clusters. I started a discussion on this topic already: https://discuss.elastic.co/t/moving-shards-is-slow/127363
