Elasticsearch: Cluster will not automatically assign shards

Created on 12 May 2016 · 33 comments · Source: elastic/elasticsearch

Elasticsearch version: 2.2.1

JVM version: java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

OS version: CentOS release 6.7 (Final)

Description of the problem including expected versus actual behavior:

Thanks in advance for your help.

I inherited a poorly-behaving Elasticsearch cluster that I've been tuning on and off for a few months. Some things have improved, others have not.

About a month ago, another team needed to reboot the (AWS hosted) nodes. I turned off shard allocation in the cluster with an API call and got a success response. However, it didn't take. So I also turned off allocation in kopf. But, rebooting a node caused shards to be allocated anyway. So we rebooted one box per day and let the cluster rebalance each time.

Some time after the reboots, I noticed the cluster was still yellow. It had stopped allocating shards. I figured the API call had finally taken, so I turned allocation back on (and got another success message). According to kopf, allocation is enabled. But the cluster did not automatically assign shards.

I tried toggling shard allocation on and off with both kopf and the API. But the cluster would not allocate shards.

I complained about this on Twitter and an Elastic employee recommended upgrading. So, I upgraded the cluster from 1.5.2 to 2.2.1. However, the behavior persists.

If I assign shards manually with the API, they allocate just fine.

Steps to reproduce:

  1. Turn off shard allocation with API call and kopf
  2. Turn on shard allocation with API call and kopf
  3. Cry a lot

Provide logs (if relevant): I didn't see any relevant logs, but if you tell me what to look for I can grep. I'm happy to gather diagnostics for you, but please note that I no longer own this cluster as of 3pm PDT Friday.

Labels: :Distributed, feedback_needed


All 33 comments

Hi @alicegoldfuss

Could you provide the actual command that you use to enable/disable shard allocation, along with the output of:

GET _cluster/settings

and (assuming you have unassigned shards atm):

GET _shard_stores

Also, it would be good to take an unassigned shard and try to assign it to a particular node with the cluster-reroute API, adding the ?explain flag to figure out why the shard isn't being assigned.

thanks
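The reroute-with-?explain suggestion can be sketched as follows. This is only a sketch: the index, shard, and node names are placeholders, and the body uses the ES 2.x "allocate" reroute command shape.

```python
import json

def allocate_command(index, shard, node, allow_primary=False):
    """Build the body for POST /_cluster/reroute?explain using the
    ES 2.x 'allocate' command (all names here are placeholders)."""
    return {"commands": [{"allocate": {
        "index": index,
        "shard": shard,
        "node": node,
        "allow_primary": allow_primary,
    }}]}

# The serialized body would then be sent with something like:
#   curl -XPOST 'node:9200/_cluster/reroute?explain' -d "$BODY"
body = json.dumps(allocate_command("logstash-2016.02.13", 1, "node-1"))
```

The ?explain flag makes the response include each allocation decider's yes/no decision for the requested move.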

Hi Clinton!

$ curl -XPUT http://node-1.com:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.enable": "all"}}'
{"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}
$ curl -XGET http://node-1.com:9200/_cluster/settings
{"persistent":{"cluster":{"routing":{"allocation":{"enable":"all"}}}},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}

Shard stores returned a lot, so I cleaned up one of the replies:

{"indices":{"index-1":{"shards":{"2":{"stores":[{"urZEQshBQourcmigNqeG0g":{"name":"node-1","transport_address":"127.0.0.1:9300","attributes":{"ec2az":"us-east-1a","datacenter":"use1v"}},"version":40,"allocation":"primary"},{"uWmawPeUQ2KYNR1sLKvxfA":{"name":"node-2","transport_address":"127.0.0.1:9300","attributes":{"ec2az":"us-east-1d","datacenter":"use1v"}},"version":25,"allocation":"unused"}]}}},

The explain query dumped a bunch of stuff. Can I give it to you privately?

Also I should add that assigning the shard via the API worked. That has always worked. But the cluster doesn't assign automatically.
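A _shard_stores reply like the cleaned-up one above can be scanned programmatically for copies the cluster considers "unused". A minimal sketch, assuming the 2.x response shape shown, where each store entry keeps the node metadata under its node-id key:

```python
def unused_copies(shard_stores):
    """Yield (index, shard_id, node_name) for shard-store copies whose
    allocation is 'unused', i.e. on-disk copies the cluster is not using."""
    for index, idx_data in shard_stores.get("indices", {}).items():
        for shard_id, shard_data in idx_data.get("shards", {}).items():
            for store in shard_data.get("stores", []):
                if store.get("allocation") != "unused":
                    continue
                # The node's metadata sits under its node-id key, which is
                # the only dict-valued entry in the store object.
                node_meta = next(v for v in store.values() if isinstance(v, dict))
                yield index, shard_id, node_meta.get("name")
```

Run against the sample above, this would report the copy of shard 2 of index-1 sitting unused on node-2.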

The explain query dumped a bunch of stuff. Can I give it to you privately?

Sure. clinton at elastic dot co

Also I should add that assigning the shard via the API worked. That has always worked. But the cluster doesn't assign automatically.

Do you have ongoing recoveries? What is the output of:

GET _cat/recovery?v

I also see you're using node attributes, and possibly multiple data centres? (Usually this is a no-no, unless they're like EC2 availability zones, i.e. as fast as a LAN.) Could you paste the settings you're using for shard awareness/allocation?

I will send you the query dump and recovery dump.

We are using multiple AWS availability zones, not physical data centers.

@alicegoldfuss It looks like you tried to reroute an already assigned shard. Could you try one that is unassigned, eg shard 1 of index logstash-2016.02.13? (with ?explain)

Also, could you paste the settings you're using for shard awareness/allocation?

@clintongormley I assigned a shard that was unassigned. But when I reran the command to create that file, yes it was already assigned.

I will send you the new shard assignment.

Can you tell me where to find the settings for shard awareness/allocation?

@alicegoldfuss Either in index settings or in node settings, depending on what is set. Try:

GET _nodes/settings
GET logstash-2016.02.13/_settings

Okay, emailed those settings to you.

@alicegoldfuss apparently those shards can be allocated. If you just call _reroute without any body or arguments (now that I think about it, you might need to specify an empty body), will the shards get allocated? If you allocate one shard manually, will the others follow once that shard is started?

Hello!

If I manually allocate one shard, the other shards will not follow suit.

I ran the following:

curl -XPOST 'http://node-1.com:9200/_cluster/reroute'

That returned a large response, full of shard states:

{"state":"STARTED","primary":false,"node":"XXXXXXXXXXX","relocating_node":null,"shard":4,"index":"index-1","version":24,"allocation_id":{"id":"XXXXXXXXXXXX"}}

But no signs of actual allocation.

I ran it with an empty body and got the same.

Well, from here on I don't have many more ideas beyond enabling trace logging for allocation, for cluster.routing.allocation.decider as well as cluster.routing.allocation. @bleskes any ideas?

That would require either putting in a Puppet PR (I don't think I could deploy in time) or disabling Puppet and shoving the settings into the config on the boxes (not my favorite choice). So if there's a Plan B I would prefer it.

So if there's a Plan B I would prefer it.

You can do it from the settings API:

PUT /_cluster/settings
{
    "transient" : {
        "logger.cluster.routing.allocation.decider": "TRACE",
        "logger.cluster.routing.allocation": "TRACE"
    }
}

Thanks! So, put those in place and try to allocate a shard with explain enabled? Or will those settings generate log files?

Thanks! So, put those in place and try to allocate a shard with explain enabled? Or will those settings generate log files?

It will probably start generating log lines, but try to do an allocation too and share the master logs from the time you enabled the traces.

$ curl -XPUT 'http://node-1.com:9200/_cluster/settings' -d '{"transient" : {"logger.cluster.routing.allocation.decider": "TRACE","logger.cluster.routing.allocation": "TRACE"}}'
{"acknowledged":true,"persistent":{},"transient":{"logger":{"cluster":{"routing":{"allocation":"TRACE","allocation.decider":"TRACE"}}}}}

I successfully allocated two shards with the API. Elasticsearch generated no additional logs. I captured the explain dumps if you would like to see them.

Elasticsearch generated no additional logs.

Are you sure that you extracted the logs from the master node? You can check the master with GET /_cat/master against any node in the cluster.

Okay! I thought ES logs would be the same across the cluster. I have a 37 MB log file for you @jasontedor where can I send it?

Okay! I thought ES logs would be the same across the cluster.

Each node is a special snowflake, and the master is a very special snowflake and in particular is the only node making allocation decisions.

I have a 37 MB log file for you @jasontedor where can I send it?

Can you gzip compress that (if not already) and send it to my first name at the same domain as @clintongormley's email address from earlier?

Done :)

@alicegoldfuss Can you hit GET /_cluster/settings against the master node and share here?

Do you have the timestamps from when you performed the allocation? I see things like:

[2016-05-12 21:47:20,469][TRACE][cluster.routing.allocation.decider] [...] Can not allocate [...], at[2016-05-12T05:48:08.074Z], details[failed recovery, failure RecoveryFailedException[index: Recovery failed from {node}{...}{...}{...}{...} into {...}{...}{...}{...}{...} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]]] on node [.....] due to [DisableAllocationDecider]

Especially the DisableAllocationDecider part: it looks like allocation is still disabled at this point. Has it been turned back on? I noticed that your command for enabling allocation addresses the EnableAllocationDecider, but perhaps the DisableAllocationDecider (which is deprecated in 2.x) is still disabling allocation?

In 2.x, both are around: DisableAllocationDecider is deprecated but kept so people can move away from it gradually. However, they both take effect, so if allocation is disabled by the DisableAllocationDecider then it will still be disabled.

Additionally, I _only_ see logs from the cluster.routing.allocation.decider and cluster.routing.allocation.allocator packages in this file, what happened to the logs from the other packages?
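The two-decider situation described above can be checked offline against the settings responses. A minimal sketch, assuming the nested 2.x response shapes of GET _cluster/settings and GET _all/_settings (the function names are my own):

```python
def flatten(settings, prefix=""):
    """Flatten a nested settings object into dotted keys,
    e.g. {"cluster": {"routing": ...}} -> "cluster.routing...."""
    flat = {}
    for key, value in settings.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def disable_allocation_set(cluster_settings):
    """True if the deprecated disable_allocation flag is set in the
    persistent or transient cluster settings."""
    for scope in ("persistent", "transient"):
        flat = flatten(cluster_settings.get(scope, {}))
        if str(flat.get("cluster.routing.allocation.disable_allocation", "")).lower() == "true":
            return True
    return False

def indices_with_disable_allocation(index_settings):
    """Index names whose own settings still carry the deprecated flag."""
    hits = []
    for index, cfg in index_settings.items():
        flat = flatten(cfg.get("settings", {}))
        if str(flat.get("index.routing.allocation.disable_allocation", "")).lower() == "true":
            hits.append(index)
    return hits
```

Because the deprecated flag can live at either level, both checks are needed; in this thread it turned out to be set on the indices themselves.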

Ran settings against the master node:

$ curl -XGET http://node-master.com:9200/_cluster/settings
{"persistent":{"cluster":{"routing":{"allocation":{"enable":"all"}}}},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}},"logger":{"cluster":{"routing":{"allocation":"TRACE","allocation.decider":"TRACE"}}}}}

I don't have specific timestamps, but I did give you the entire log file. Everything you see in that file is what I can see on the master. The other nodes are full of the 30 minute timeout notifications from when I was allocating hundreds of shards yesterday.

How can I toggle the DisableAllocationDecider? Give me a command and I'll try running it.

I know this is probably in your docs somewhere, but I'm pretty swamped, so sorry in advance.

Okay, it's still possible that the indices have the disable allocation decider enabled on them as I don't have the settings you emailed Clint earlier (he's offline), you can try:

curl -XPUT 'node:9200/_cluster/settings' -d'{"transient": {"cluster.routing.allocation.disable_allocation": false}}'

And additionally (in case it is set on the indices themselves):

curl -XPUT 'node:9200/_all/_settings' -d'{"index.routing.allocation.disable_allocation": false}'

YOU DID IT!

It was set on the indices themselves! I issued the second command you provided and BAM shards are initializing automatically!

No idea how that thing got toggled in the first place, but I will never touch it again.

Thank you everyone!

Glad to hear that!

I'm going to close this issue, thanks for debugging with us :)

@clintongormley @dakrone @jasontedor the reason we didn't see this in the explain output is that we explicitly ignore these allocation deciders when running allocation commands. I think this is trappy and should be optional (specified in the request, but not ignored by default) - this could have saved us and the user lots of time, and I guess it would have prevented @alicegoldfuss from writing her Python tool in the first place? I think we should change that!

@alicegoldfuss By the way, you will probably want to disable the allocation trace logging that was enabled earlier, lest your master keep over-logging (as you saw, it can produce large logs very quickly). Just run the earlier command for enabling allocation trace logging, but replace TRACE with INFO.
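Concretely, that reset could look like this (a sketch; it just rebuilds the earlier transient-settings body with INFO in place of TRACE):

```python
import json

# Same transient settings body as the earlier TRACE call, with the
# allocation loggers dialed back down to INFO.
reset_body = json.dumps({"transient": {
    "logger.cluster.routing.allocation.decider": "INFO",
    "logger.cluster.routing.allocation": "INFO",
}})
# Sent the same way as before:
#   curl -XPUT 'node:9200/_cluster/settings' -d "$reset_body"
```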

@jasontedor good call on the log setting!
