Elasticsearch version (bin/elasticsearch --version):
Version: 5.5.3, Build: 9305a5e/2017-09-07T15:56:59.599Z, JVM: 1.8.0_151
Plugins installed: []
JVM version (java -version):
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
OS version (uname -a if on a Unix-like system): Darwin Thiagos-MacBook-Pro.local 17.0.0 Darwin Kernel Version 17.0.0: Thu Aug 24 21:48:19 PDT 2017; root:xnu-4570.1.46~2/RELEASE_X86_64 x86_64
Description of the problem including expected versus actual behavior:
If a non-data node that contains dangling indices in its data path joins a cluster, those dangling indices will be detected and auto-imported.
IMO, a non-data node containing index data in its data path is probably accidental and unintended. In that case, the dangling indices should not be detected; better yet, the node should not even start (maybe a bootstrap check that fails if a non-data node contains index data in its data path).
Steps to reproduce:
This can be done on a single machine:
1. Start node-1 with bin/elasticsearch -E path.data=/Users/thiago/data-1 -E node.name=node-1
2. Start node-2 with bin/elasticsearch -E path.data=/Users/thiago/data-2 -E node.name=node-2
3. Create index test configured with 1S/0R with curl -XPUT localhost:9200/test -d '{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0 } } }' -H "Content-Type: application/json"
4. Index a document with curl -XPOST localhost:9200/test -d '{ "test": 1 }' -H "Content-Type: application/json"
5. Stop both nodes, check which of data-1 or data-2 the shard for index test was created in, and delete the _other_ empty data directory (so we effectively make a dangling index). In my case data-2 was deleted, so start node-2 again with bin/elasticsearch -E path.data=/Users/thiago/data-2 -E node.name=node-2
6. Start node-1 (which contains the dangling indices) as a non-data node with bin/elasticsearch -E path.data=/Users/thiago/data-1 -E node.name=node-1 -E node.data=false
Provide logs (if relevant):
After node-1 starts, node-2 will detect and auto-import the dangling indices even though node-1 is a non-data node:
[2017-10-21T18:02:14,158][INFO ][o.e.g.LocalAllocateDangledIndices] [node-2] auto importing dangled indices [[test/R2Nh9sERThmkJ-0IZ0ppwA]/OPEN] from [{node-1}{RqWMW2AeSXWOpkUm4cT1TA}{lEqpWLIhRqqU_n1DSFuv2Q}{127.0.0.1}{127.0.0.1:9301}]
We discussed this on Fixit Friday and agreed to add a check that will fail node startup if a non-data node has shard data in its data path.
This means that some user action (explicitly deleting shard data) is going to be required if a data node is switched to a master-only/coordinating node.
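For example (hypothetical path, following the repro above), switching node-1 to a master-only node would then require something like:
# Hypothetical manual step: explicitly delete the leftover index data
# from node-1's data path before restarting it with node.data=false.
rm -r /Users/thiago/data-1/nodes/0/indices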
Is this taken or can I pick it?
@swethapavan sure, go ahead.
Thank you
I think we can fail earlier than the bootstrap checks, so I'm not sure this should be a bootstrap check. Isn't it enough to have a check in the node environment (we've done this in the past with the default path.data issue)?
I'm not sure this should be a bootstrap check
Yes, I used "bootstrap check" in the larger sense here; I meant a boot/start-time check. It does not require the bootstrap checks code infrastructure.
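For illustration, this is roughly what such a start-time check has to look for on disk (a sketch in shell against the 5.x layout from the repro above; the real check would live in the node environment code, and the path is hypothetical):
# Fail startup if a node.data=false node has shard data in its data path.
# A numbered folder under an index folder (e.g. indices/<uuid>/0) is shard data.
DATA_PATH=/Users/thiago/data-1   # hypothetical, taken from the repro
if find "$DATA_PATH"/nodes/*/indices -mindepth 2 -maxdepth 2 -type d -name '[0-9]*' 2>/dev/null | grep -q .; then
  echo "node.data=false, but shard data exists under $DATA_PATH -- refusing to start"
  exit 1
fi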
I have made the changes, but I get errors when I run some tests because the node fails to start due to the existence of dangling indices.
Specifically, these are the tests that fail:
org.elasticsearch.indices.flush.FlushIT.testSyncedFlushWithConcurrentIndexing
I think we can fail earlier than the bootstrap checks, so I'm not sure this should be a bootstrap check. Isn't it enough to have a check in the node environment (we've done this in the past with the default path.data issue)?
I wonder if adding it as a bootstrap check is actually a feature (i.e. testing for it later). I can totally see starting up a node with data=false for testing in my dev env with localhost disco etc., and I don't want it to fail in that case. Just putting out my way of thinking here.
@swethapavan please open a pull request or share your code, otherwise we won't be able to help you.
@s1monw I have created a pull request. Kindly have a look.
I wonder if adding it as a bootstrap check is actually a feature (i.e. testing for it later). I can totally see starting up a node with data=false for testing in my dev env with localhost disco etc., and I don't want it to fail in that case.
My preference would be not to have this as a bootstrap check. Bootstrap checks are requirements for going to production, and we should keep them to a strict minimum so that the difference between prod and dev stays low. For this particular check, I don't see a good reason why we would not want to enforce it in development mode as well. If you want to start up a node with data=false for testing, and you happen to do that on a data folder which previously belonged to a node with data, you can just as easily define a different path.data.
Is this issue still open? There seems to have been no update on it in a long time. I would like to work on it.
Is this fixed on 6.x? Ran into this issue yesterday on 5.6.10
The proposal is to detect if a data=false node has any data and fail startup if that is the case. However, even indices without any data can be resurrected, and I wonder if we need to handle that too. I have created a slightly modified reproduction case to explain this:
rm -r data-1 data-2
bin/elasticsearch -E path.data=data-1 -E node.name=node-1
bin/elasticsearch -E path.data=data-2 -E node.name=node-2
curl -XPUT localhost:9200/test?pretty -d '{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0 } } }' -H "Content-Type: application/json"
curl -XPOST localhost:9200/test/_doc?pretty -d '{ "test": 1 }' -H "Content-Type: application/json"
curl -XPUT localhost:9200/test2?pretty -d '{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0 } } }' -H "Content-Type: application/json"
curl -XPOST localhost:9200/test2/_doc?pretty -d '{ "test": 1 }' -H "Content-Type: application/json"
ls -d data-*/nodes/0/indices/*/0
should give something like the following (note the different data folders):
data-1/nodes/0/indices/bF19AZJvREOs33p8udeD-A/0
data-2/nodes/0/indices/xpuL1YkcR1SttdAYF6zGEg/0
Stop both nodes and remove the data folder of node-1: rm -r data-1
Restart node-1 and then node-2 with node.data=false:
bin/elasticsearch -E path.data=data-1 -E node.name=node-1
bin/elasticsearch -E path.data=data-2 -E node.name=node-2 -E node.data=false
Expected log for node-2:
[2019-01-10T11:54:46,133][INFO ][o.e.g.DanglingIndicesState] [node-2] [[test2/bF19AZJvREOs33p8udeD-A]] dangling index exists on local file system, but not in cluster metadata, auto import to cluster state
[2019-01-10T11:54:46,133][INFO ][o.e.g.DanglingIndicesState] [node-2] [[test/xpuL1YkcR1SttdAYF6zGEg]] dangling index exists on local file system, but not in cluster metadata, auto import to cluster state
and for node-1:
[2019-01-10T11:54:46,308][INFO ][o.e.g.LocalAllocateDangledIndices] [node-1] auto importing dangled indices [[test2/bF19AZJvREOs33p8udeD-A]/OPEN][[test/xpuL1YkcR1SttdAYF6zGEg]/OPEN] from [{node-2}{wwM9q--3TmW0VCAHerzmNg}{OYshEsG6Rv6CvNmANivlnQ}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=33465024512, ml.max_open_jobs=20, xpack.installed=true}]
Looking at the file system, both indices now exist on node-1 too without any data:
ls -d data-1/nodes/0/indices/*/*
data-1/nodes/0/indices/bF19AZJvREOs33p8udeD-A/_state
data-1/nodes/0/indices/xpuL1YkcR1SttdAYF6zGEg/_state
and both have red status:
curl localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
red open test xpuL1YkcR1SttdAYF6zGEg 1 0
red open test2 bF19AZJvREOs33p8udeD-A 1 0
This makes me wonder whether the proposed change is enough, since there is still a risk of resurrecting old indices that did not have any shard data allocated on the node.
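To make the distinction concrete, a shell sketch against the same layout (paths from the repro above): index folders that contain only a _state directory and no numbered shard folder hold just metadata, which a shard-data-only check would not catch.
# List index folders that hold only metadata (_state) and no shard data.
for idx in data-1/nodes/0/indices/*/; do
  if [ -d "${idx}_state" ] && ! ls -d "${idx}"[0-9]* >/dev/null 2>&1; then
    echo "metadata-only index folder: $idx"
  fi
done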
Had a conversation with @ywelsch about this on another channel. We came to the conclusion that the original proposal should be implemented, to avoid resurrecting the indices in clearly bad cases and also to avoid having old data lying around that is invalid for the type of node.