Elasticsearch: Dangling indices living in non-data nodes are detected and auto-imported

Created on 21 Oct 2017 · 16 comments · Source: elastic/elasticsearch

Elasticsearch version (bin/elasticsearch --version):

Version: 5.5.3, Build: 9305a5e/2017-09-07T15:56:59.599Z, JVM: 1.8.0_151

Plugins installed: []

JVM version (java -version):

java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

OS version (uname -a if on a Unix-like system): Darwin Thiagos-MacBook-Pro.local 17.0.0 Darwin Kernel Version 17.0.0: Thu Aug 24 21:48:19 PDT 2017; root:xnu-4570.1.46~2/RELEASE_X86_64 x86_64

Description of the problem including expected versus actual behavior:

If a non-data node that contains dangling indices in its data path joins a cluster, those dangling indices will be detected and auto-imported.

IMO, a non-data node that contains index data in its data path is probably accidental and unintended. In this case, those dangling indices should not be detected; better yet, the node should not even start (perhaps a bootstrap check that fails if a non-data node contains index data in its data path).

Steps to reproduce:

This can be done in a single machine:

  1. Start node-1 with bin/elasticsearch -E path.data=/Users/thiago/data-1 -E node.name=node-1
  2. Start node-2 with bin/elasticsearch -E path.data=/Users/thiago/data-2 -E node.name=node-2
  3. Create an index test configured with 1S/0R with curl -XPUT localhost:9200/test -d '{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0 } } }' -H "Content-Type: application/json"
  4. Create a document curl -XPOST localhost:9200/test -d '{ "test": 1 }' -H "Content-Type: application/json"
  5. Stop both nodes
  6. Check which data directory, either data-1 or data-2, the shard for index test was created in, and delete the _other_, empty data directory (so we effectively create a dangling index).
  7. Consider that data-2 was deleted. So start node-2 again with bin/elasticsearch -E path.data=/Users/thiago/data-2 -E node.name=node-2
  8. Start node-1 (which contains dangling indices) as a non-data node with bin/elasticsearch -E path.data=/Users/thiago/data-1 -E node.name=node-1 -E node.data=false
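Step 6 above can be sketched as a small script. This is a sketch only: the helper name `has_shard_data` is made up for illustration, and the `nodes/0/indices/<index-uuid>/<shard-id>` layout is the 5.x on-disk layout used in this reproduction:

```shell
# Report whether an Elasticsearch data directory holds shard data.
# Assumes the 5.x layout from this reproduction:
#   <path.data>/nodes/<n>/indices/<index-uuid>/<shard-id>
# (a _state directory holds metadata only, so it is excluded).
has_shard_data() {
  find "$1"/nodes/*/indices -mindepth 2 -maxdepth 2 -type d ! -name '_state' 2>/dev/null | grep -q .
}

for dir in data-1 data-2; do
  if has_shard_data "$dir"; then
    echo "$dir holds the shard for index test"
  else
    echo "$dir is empty of shard data (the one to delete in step 6)"
  fi
done
```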

Provide logs (if relevant):

After non-data node node-1 starts, node-2 will detect and auto-import dangling indices even though node-1 is a non-data node:

[2017-10-21T18:02:14,158][INFO ][o.e.g.LocalAllocateDangledIndices] [node-2] auto importing dangled indices [[test/R2Nh9sERThmkJ-0IZ0ppwA]/OPEN] from [{node-1}{RqWMW2AeSXWOpkUm4cT1TA}{lEqpWLIhRqqU_n1DSFuv2Q}{127.0.0.1}{127.0.0.1:9301}]
Labels: :Distributed, >bug, good first issue, help wanted


All 16 comments

We discussed this on Fixit Friday and agreed to add a check that will fail:

  • starting up a non-data node that has shard data (e.g. dedicated master node or coordinating-only node)
  • starting up a coordinating-only node that has index metadata.

This means that some user action (explicitly deleting the shard data) will be required if a data node is switched to a master-only or coordinating-only node.
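Until such a startup check lands in Elasticsearch itself, the same guard can be approximated outside the process. This is a minimal sketch, assuming the 5.x on-disk layout shown in this issue (`<path.data>/nodes/<n>/indices/<uuid>/<shard-id>`); the helper name `check_no_shard_data` is made up for illustration:

```shell
# Fail fast when a node about to start with node.data=false still has
# shard data in its path.data (only _state metadata directories are allowed).
check_no_shard_data() {
  if find "$1"/nodes/*/indices -mindepth 2 -maxdepth 2 -type d ! -name '_state' 2>/dev/null | grep -q .; then
    echo "refusing to start: $1 contains shard data but node.data=false" >&2
    return 1
  fi
}

# Intended use before launching a non-data node:
#   check_no_shard_data data-1 && bin/elasticsearch -E path.data=data-1 -E node.data=false
```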

Is this taken or can I pick it?

@swethapavan sure, go ahead.

Thank you

I think we can fail earlier than the bootstrap checks, so I'm not sure this should be a bootstrap check. Isn't it enough to be a check in the node environment (we've done this in the past with the default path.data issue)?

I'm not sure if this should be a bootstrap check

Yes, I used "bootstrap check" in the larger sense here; I meant a boot/start-time check. It does not require the bootstrap-checks code infrastructure.

I have made the changes, but I get errors when I run some tests because the node fails to start due to the existence of dangling indices.

Specifically, these are the tests that fail:

  • org.elasticsearch.indices.flush.FlushIT.testSyncedFlushWithConcurrentIndexing

  • org.elasticsearch.indices.flush.FlushIT.testWaitIfOngoing
  • org.elasticsearch.indices.flush.FlushIT.testSyncedFlush
  • org.elasticsearch.search.geo.GeoShapeIntegrationIT.testOrientationPersistence
  • org.elasticsearch.search.geo.GeoShapeIntegrationIT.testIgnoreMalformed
  • org.elasticsearch.gateway.GatewayIndexStateIT.testJustMasterNode
  • org.elasticsearch.index.store.CorruptedFileIT.testReplicaCorruption

I think we can fail earlier than the bootstrap checks so I'm not sure if this should be a bootstrap check, isn't it enough to be a check in node environment (we've done this in the past with the default path data issue)?

I wonder if adding it as a bootstrap check is actually a feature (i.e. testing for it later). I can totally see starting up a node with data=false for testing in my dev env with localhost discovery etc., and I wouldn't want it to fail in that case. Just putting out my way of thinking here.

@swethapavan please open a pull request or share your code; otherwise we won't be able to help you.

@s1monw I have created a pull request. Kindly have a look.

I wonder if adding it as a bootstrap check is actually a feature (ie. testing for it later). Like I can totally see starting up a node with data=false for testing in my dev env with local host disco etc. and I don't want them to fail in that case?

My preference would be not to have this as a bootstrap check. Bootstrap checks are requirements for going to production, and we should keep them to a strict minimum so that the difference between prod and dev stays low. For this particular check, I don't see a good reason why we would not want to enforce it in development mode as well. If you want to start up a node with data=false for testing, and you happen to do that on a data folder that previously belonged to a data node, you can just as easily define a different path.data.

Is this issue still open? There seems to have been no update on it for a long time. I would like to work on it.

Is this fixed on 6.x? Ran into this issue yesterday on 5.6.10

The proposal is to detect whether a data=false node has any data and fail startup if that is the case. However, even indices without any data can be resurrected, and I wonder if we need to handle that too? I have created a slightly modified reproduction case to explain this:

  1. Clear out any previous experiments:

rm -r data-1 data-2

  2. Start two nodes:
bin/elasticsearch -E path.data=data-1 -E node.name=node-1
bin/elasticsearch -E path.data=data-2 -E node.name=node-2
  3. Create two indexes and data for them:
curl -XPUT localhost:9200/test?pretty -d '{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0 } } }' -H "Content-Type: application/json"
curl -XPOST localhost:9200/test/_doc?pretty -d '{ "test": 1 }' -H "Content-Type: application/json"

curl -XPUT localhost:9200/test2?pretty -d '{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0 } } }' -H "Content-Type: application/json"
curl -XPOST localhost:9200/test2/_doc?pretty -d '{ "test": 1 }' -H "Content-Type: application/json"
  4. Verify that the data for the two indexes is on different nodes:

ls -d data-*/nodes/0/indices/*/0

should give something like the following (note the different data folders):

data-1/nodes/0/indices/bF19AZJvREOs33p8udeD-A/0  data-2/nodes/0/indices/xpuL1YkcR1SttdAYF6zGEg/0
  5. Shut down both nodes. Remove the data folder for node-1:

rm -r data-1

  6. Start node-1 and then node-2 with node.data=false:

bin/elasticsearch -E path.data=data-1 -E node.name=node-1
bin/elasticsearch -E path.data=data-2 -E node.name=node-2 -E node.data=false

Expected log for node-2:

[2019-01-10T11:54:46,133][INFO ][o.e.g.DanglingIndicesState] [node-2] [[test2/bF19AZJvREOs33p8udeD-A]] dangling index exists on local file system, but not in cluster metadata, auto import to cluster state
[2019-01-10T11:54:46,133][INFO ][o.e.g.DanglingIndicesState] [node-2] [[test/xpuL1YkcR1SttdAYF6zGEg]] dangling index exists on local file system, but not in cluster metadata, auto import to cluster state

and for node-1:

[2019-01-10T11:54:46,308][INFO ][o.e.g.LocalAllocateDangledIndices] [node-1] auto importing dangled indices [[test2/bF19AZJvREOs33p8udeD-A]/OPEN][[test/xpuL1YkcR1SttdAYF6zGEg]/OPEN] from [{node-2}{wwM9q--3TmW0VCAHerzmNg}{OYshEsG6Rv6CvNmANivlnQ}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=33465024512, ml.max_open_jobs=20, xpack.installed=true}]

Looking at the file system, both indices now exist on node-1 too without any data:

ls -d data-1/nodes/0/indices/*/*
data-1/nodes/0/indices/bF19AZJvREOs33p8udeD-A/_state  data-1/nodes/0/indices/xpuL1YkcR1SttdAYF6zGEg/_state

and both have red status:

curl localhost:9200/_cat/indices?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
red    open   test  xpuL1YkcR1SttdAYF6zGEg   1   0                                                  
red    open   test2 bF19AZJvREOs33p8udeD-A   1   0

This makes me wonder whether the proposed change is enough, since there is still a risk of resurrecting old indices that did not have any shards allocated on the node.
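A quick way to spot such resurrected, metadata-only indices on disk is to look for index directories that contain only a `_state` folder, as in the listing above. A sketch under the same 5.x layout assumptions as this reproduction (the helper name `metadata_only_indices` is made up):

```shell
# Print index directories that hold only _state metadata and no shard data --
# the empty "resurrected" indices described in this comment.
# Assumes the <path.data>/nodes/<n>/indices/<index-uuid> layout.
metadata_only_indices() {
  for idx in "$1"/nodes/*/indices/*/; do
    [ -d "$idx" ] || continue
    if ! find "$idx" -mindepth 1 -maxdepth 1 -type d ! -name '_state' | grep -q .; then
      echo "$idx"
    fi
  done
}
```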

Had a conversation with @ywelsch on this on another channel. We came to the conclusion that the original proposal should be implemented, both to avoid resurrecting indices in clearly bad cases and to avoid having old data lying around that is invalid for the type of node.
