Elasticsearch version: 7.3.2
Plugins installed: []
JVM version (java -version): 1.8.0
OS version: CentOS 7.4
We have been running an Elasticsearch cluster consisting of 5 nodes for quite some time now. After upgrading to v7, we have noticed that our nodes often refuse to start with an error of the form nested: IOException[failed to find metadata for existing index XXX].
The first time I encountered this error, I searched the discuss board and found a thread which talks about the stronger startup checks enforced by ES 7.x and points to the data directory getting corrupted due to external factors. Thinking it might be the same problem, I duly took the node offline and ran a disk check, which reported no errors. So I deleted the data directory, started the node, and that was that.
However, the next time I did a rolling upgrade of the cluster, a different node failed with a similar error (the index name was different). I followed the same emergency procedure (delete the data directory and restart the node) and the cluster was fixed.
Now after every rolling upgrade I seem to run into this error on at least one of my nodes. The index name always points to a closed index. The error occurs only on restart (never while Elasticsearch is running).
I find it hard to believe that all 5 of my nodes have a disk problem because:
- I have run fsck every time this error has occurred, and no errors have been reported (a sketch of the check is shown after this list).
- Yesterday we had a power issue at the data-center which led to all nodes getting power cycled. Upon restart, 4 out of 5 nodes failed to start with the same error. On all 4 nodes the index names were different (the indices in question were all "closed"). I had no option but to delete all data on those 4 nodes (thus losing about 80% of our Elasticsearch data).
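For reference, the disk check was roughly along these lines (the device name and service name are just illustrative; they differ per node):

# run with the node already removed from the cluster; service name assumed from an RPM install
systemctl stop elasticsearch
umount /data
fsck -f /dev/sdb1    # reported no errors on every node we checked
mount /data
systemctl start elasticsearch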
The errors seen were
[2019-09-30T10:36:58,205][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [esnode3] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: IOException[failed to find metadata for existing index ssl-2019.09.20 [location: GmslGWkHTLGQowmMHFut7A, generation: 11]];
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.3.2.jar:7.3.2]
Caused by: org.elasticsearch.ElasticsearchException: failed to bind service
at org.elasticsearch.node.Node.<init>(Node.java:617) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
... 6 more
Caused by: java.io.IOException: failed to find metadata for existing index ssl-2019.09.20 [location: GmslGWkHTLGQowmMHFut7A, generation: 11]
at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:99) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.gateway.GatewayMetaState.upgradeMetaData(GatewayMetaState.java:141) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:95) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:492) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
... 6 more
[2019-09-30T10:36:58,210][INFO ][o.e.x.m.p.NativeController] [esnode3] Native controller process has stopped - no new native processes can be started
and
[2019-09-30T10:39:59,737][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [esnode2] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: IOException[failed to find metadata for existing index dns-2019.09.22 [location: ZMenLry9Qxe5-2-XNrWj2A, generation: 15]];
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.3.2.jar:7.3.2]
Caused by: org.elasticsearch.ElasticsearchException: failed to bind service
at org.elasticsearch.node.Node.<init>(Node.java:617) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
... 6 more
Caused by: java.io.IOException: failed to find metadata for existing index dns-2019.09.22 [location: ZMenLry9Qxe5-2-XNrWj2A, generation: 15]
at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:99) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.gateway.GatewayMetaState.upgradeMetaData(GatewayMetaState.java:141) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:95) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:492) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
... 6 more
Is it possible that the metadata of closed indices is not being persisted properly (leading to issues at restart)? Can this be mitigated somehow (maybe by relaxing the stricter consistency checks)?
Pinging @elastic/es-distributed
Hi @redbaron4. Thanks for reporting this. A few questions:
I have now managed to reproduce this with data-only nodes:
public void testRelocatedClosedIndexIssue() throws Exception {
    final String indexName = "closed-index";
    List<String> dataNodes = internalCluster().startDataOnlyNodes(2);
    // allocate shard to first data node
    createIndex(indexName, Settings.builder()
        .put(IndexMetaData.SETTING_NUMBER_OF_SHARDS, 1)
        .put(IndexMetaData.SETTING_NUMBER_OF_REPLICAS, 0)
        .put("index.routing.allocation.include._name", String.join(",", dataNodes.get(0)))
        .build());
    indexRandom(randomBoolean(), randomBoolean(), randomBoolean(), IntStream.range(0, randomIntBetween(0, 50))
        .mapToObj(n -> client().prepareIndex(indexName, "_doc").setSource("num", n)).collect(toList()));
    assertAcked(client().admin().indices().prepareClose(indexName));
    // move single shard to second node
    client().admin().indices().prepareUpdateSettings(indexName).setSettings(Settings.builder()
        .put("index.routing.allocation.include._name", String.join(",", dataNodes.get(1)))).get();
    ensureGreen(indexName);
    // give first node a bit of time to clean up the index (including index metadata)
    Thread.sleep(100);
    internalCluster().fullRestart(); // fails as it can't find the index metadata on the first node
}
The same should not happen with master-eligible data nodes though. Can you clarify that point for us? If this only affects data-only nodes we might be able to provide instructions on how to get the node running again, without losing data.
@ywelsch Thanks for looking at this
All data nodes (the kind which show the failure) have a configuration similar to the one given below:
cluster.name: nemo-cluster
node.name: esnode1
bootstrap.memory_lock: true
network.host: _site:ipv4_
discovery.zen.ping.unicast.hosts:
- 10.44.0.43
- 10.44.0.44
- 10.44.0.45
discovery.zen.minimum_master_nodes: 1
# Fix 9201 for intra cluster comm
transport.port: 9201
path.logs: /var/log/elasticsearch
path.data: /data/elasticsearch
# Settings related to node
node.master: false
node.data: true
node.ingest: true
# Monitoring settings (6.3+)
xpack.monitoring.enabled: true
The master-eligible nodes are 10.44.0.43, 10.44.0.44 and 10.44.0.45, which don't show any failure.
Ok, this confirms my findings. First of all, this is a bug related to how "closed replicated indices" (introduced in 7.2) interact with the index metadata storage mechanism, which has special handling for closed indices. On non-master-eligible data nodes, it's possible for the node's manifest file (which tracks the relevant metadata state that the node should persist) to become out of sync with what's actually stored on disk, leading to an inconsistency that is detected at startup and causes the node to refuse to start. We will immediately start working on a bug fix for this.
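To make this concrete, here is roughly where the files in question live under each data path on a 7.x node (the data path is taken from the configuration posted above; the manifest generation number is illustrative, and the index UUID/generation are the ones from the error message):

# node-level manifest, tracking which index metadata generations the node expects to find on disk
/data/elasticsearch/nodes/0/_state/manifest-42.st
# per-index metadata file the manifest points to (UUID and generation from the error above);
# the startup check fails when this expected file is missing from the index's _state directory
/data/elasticsearch/nodes/0/indices/GmslGWkHTLGQowmMHFut7A/_state/state-11.st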
In the meantime, the following workaround is applicable to get the node running again. This workaround should not lead to any data loss. However, great care must be taken before applying it, preferably by backing up the data folder on the node before undertaking the following low-level surgery: for each data path configured in path.data, remove the file named manifest-N.st (where N is some number), which can be found under nodes/0/_state/ in the respective data path. After removing this file, the node should properly start up again.
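As a sketch only (the data path below is taken from the configuration posted above; the manifest generation differs per node, and Elasticsearch must not be running while you do this):

# back up the data folder first
cp -a /data/elasticsearch /data/elasticsearch.bak
# locate the manifest file under the node's _state directory
ls /data/elasticsearch/nodes/0/_state/manifest-*.st
# remove it, then start the node again
rm /data/elasticsearch/nodes/0/_state/manifest-*.st
systemctl start elasticsearch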
Thanks for the workaround. I'll try it the next time we face this situation. I almost had the impulse to remove the manifest file but did not do it. I tried removing the offending index entries from the manifest file first, which led to consistency check failures. So I restored the manifest file and desisted from any more tinkering with it :)
I hope the bug gets fixed soon.