Elasticsearch version: 7.3.2
Plugins installed: []
JVM version (java -version): 1.8.0
OS version: CentOS 7.4
We have been running an Elasticsearch cluster consisting of 5 nodes for quite some time now. After upgrading to v7, we have noticed that our nodes often refuse to start with an error of the form nested: IOException[failed to find metadata for existing index XXX].
The first time I encountered this error, I searched the discuss board and found a thread which talks about the stronger startup checks enforced by ES 7.x and points to the data directory getting corrupted due to external factors. Thinking it might be the same problem, I duly took the node offline and ran a disk check, which reported no errors. So I deleted the data directory, started the node, and that was that.
However, the next time I did a rolling upgrade of the cluster, a different node failed with a similar error (the index name was different). I followed the same emergency procedure (delete the data directory and restart the node) and the cluster was fixed.
Now after every rolling upgrade I seem to run into this error on at least one of my nodes. The index name always points to a closed index. The error occurs only on restart (never while Elasticsearch is running).
I find it hard to believe that all 5 of my nodes have a disk problem because:
- I have run fsck every time this error has occurred, and no errors have been reported (a sketch of the check is shown after this list).
- Yesterday we had a power issue at the data-center which led to all nodes getting power cycled. Upon restart, 4 out of 5 nodes failed to start with the same error. On all 4 nodes the index names were different (the indices in question were all "closed"). I had no option but to delete all data on those 4 nodes (thus losing about 80% of our Elasticsearch data).
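For reference, the disk check was roughly along these lines (the device name and service name are just illustrative; they differ per node):

# run with the node already removed from the cluster; service name assumed from an RPM install
systemctl stop elasticsearch
umount /data
fsck -f /dev/sdb1    # reported no errors on every node we checked
mount /data
systemctl start elasticsearch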
The errors seen were
[2019-09-30T10:36:58,205][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [esnode3] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: IOException[failed to find metadata for existing index ssl-2019.09.20 [location: GmslGWkHTLGQowmMHFut7A, generation: 11]];
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.3.2.jar:7.3.2]
Caused by: org.elasticsearch.ElasticsearchException: failed to bind service
at org.elasticsearch.node.Node.<init>(Node.java:617) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
... 6 more
Caused by: java.io.IOException: failed to find metadata for existing index ssl-2019.09.20 [location: GmslGWkHTLGQowmMHFut7A, generation: 11]
at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:99) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.gateway.GatewayMetaState.upgradeMetaData(GatewayMetaState.java:141) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:95) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:492) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
... 6 more
[2019-09-30T10:36:58,210][INFO ][o.e.x.m.p.NativeController] [esnode3] Native controller process has stopped - no new native processes can be started
and
[2019-09-30T10:39:59,737][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [esnode2] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: IOException[failed to find metadata for existing index dns-2019.09.22 [location: ZMenLry9Qxe5-2-XNrWj2A, generation: 15]];
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.3.2.jar:7.3.2]
Caused by: org.elasticsearch.ElasticsearchException: failed to bind service
at org.elasticsearch.node.Node.<init>(Node.java:617) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
... 6 more
Caused by: java.io.IOException: failed to find metadata for existing index dns-2019.09.22 [location: ZMenLry9Qxe5-2-XNrWj2A, generation: 15]
at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:99) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.gateway.GatewayMetaState.upgradeMetaData(GatewayMetaState.java:141) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:95) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:492) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
... 6 more
Is it possible that the metadata of closed indices is not being persisted properly (leading to issues at restart)? Can this be mitigated somehow (maybe by relaxing the stricter consistency checks)?
Pinging @elastic/es-distributed
Hi @redbaron4. Thanks for reporting this. A few questions:
I have now managed to reproduce this with data-only nodes:
public void testRelocatedClosedIndexIssue() throws Exception {
    final String indexName = "closed-index";
    List<String> dataNodes = internalCluster().startDataOnlyNodes(2);
    // allocate shard to first data node
    createIndex(indexName, Settings.builder()
        .put(IndexMetaData.SETTING_NUMBER_OF_SHARDS, 1)
        .put(IndexMetaData.SETTING_NUMBER_OF_REPLICAS, 0)
        .put("index.routing.allocation.include._name", String.join(",", dataNodes.get(0)))
        .build());
    indexRandom(randomBoolean(), randomBoolean(), randomBoolean(), IntStream.range(0, randomIntBetween(0, 50))
        .mapToObj(n -> client().prepareIndex(indexName, "_doc").setSource("num", n)).collect(toList()));
    assertAcked(client().admin().indices().prepareClose(indexName));
    // move single shard to second node
    client().admin().indices().prepareUpdateSettings(indexName).setSettings(Settings.builder()
        .put("index.routing.allocation.include._name", String.join(",", dataNodes.get(1)))).get();
    ensureGreen(indexName);
    // give first node a bit of time to clean up the index (including index metadata)
    Thread.sleep(100);
    internalCluster().fullRestart(); // fails as it can't find the index metadata on the first node
}
The same should not happen with master-eligible data nodes though. Can you clarify that point for us? If this only affects data-only nodes we might be able to provide instructions on how to get the node running again, without losing data.
@ywelsch Thanks for looking at this
All data nodes (the kind which show the failure) have a configuration similar to the one given below:
cluster.name: nemo-cluster
node.name: esnode1
bootstrap.memory_lock: true
network.host: _site:ipv4_
discovery.zen.ping.unicast.hosts:
- 10.44.0.43
- 10.44.0.44
- 10.44.0.45
discovery.zen.minimum_master_nodes: 1
# Fix 9201 for intra cluster comm
transport.port: 9201
path.logs: /var/log/elasticsearch
path.data: /data/elasticsearch
# Settings related to node
node.master: false
node.data: true
node.ingest: true
# Monitoring settings (6.3+)
xpack.monitoring.enabled: true
The master-eligible nodes are 10.44.0.43, 10.44.0.44 and 10.44.0.45, which don't show any failure.
Ok, this confirms my findings. First of all, this is a bug related to how "closed replicated indices" (introduced in 7.2) interact with the index metadata storage mechanism, which has special handling for closed indices. On non-master-eligible data nodes, it's possible for the node's manifest file (which tracks the relevant metadata state that the node should persist) to become out of sync with what's actually stored on disk, leading to an inconsistency that is detected at startup and causes the node to refuse to start. We will immediately start working on a bug fix for this.
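To make this concrete, here is roughly where the files in question live under each data path on a 7.x node (the data path is taken from the configuration posted above; the manifest generation number is illustrative, and the index UUID/generation are the ones from the error message):

# node-level manifest, tracking which index metadata generations the node expects to find on disk
/data/elasticsearch/nodes/0/_state/manifest-42.st
# per-index metadata file the manifest points to (UUID and generation from the error above);
# the startup check fails when this expected file is missing from the index's _state directory
/data/elasticsearch/nodes/0/indices/GmslGWkHTLGQowmMHFut7A/_state/state-11.st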
In the meantime, the following workaround is applicable to get the node running again. This workaround should not lead to any data loss. However, great care must be taken before applying it, preferably by backing up the data folder on the node before undertaking the following low-level surgery: for each data path configured in path.data, remove the file named manifest-N.st (where N is some number), which can be found under nodes/0/_state/ in the respective data path. After removing this file, the node should properly start up again.
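As a sketch only (the data path below is taken from the configuration posted above; the manifest generation differs per node, and Elasticsearch must not be running while you do this):

# back up the data folder first
cp -a /data/elasticsearch /data/elasticsearch.bak
# locate the manifest file under the node's _state directory
ls /data/elasticsearch/nodes/0/_state/manifest-*.st
# remove it, then start the node again
rm /data/elasticsearch/nodes/0/_state/manifest-*.st
systemctl start elasticsearch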
Thanks for the workaround. I'll try it the next time we face this situation. I almost had the impulse to remove the manifest file but did not do it. I tried removing the offending index entries from the manifest file first, which led to consistency check failures. So I restored the manifest file and desisted from any more tinkering with it :)
I hope the bug gets fixed soon.