On one of our servers running Elasticsearch, another process wrote so many logfiles that the disk ran out of space. After deleting these logfiles and rebooting the system, Elasticsearch did not recover.
We are running on a single server, using Elasticsearch 1.5.2.
I believe we managed to recover by deleting some of the *.recovering files in the Elasticsearch data directories, but it would be great if Elasticsearch could recover as much as possible by itself.
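For reference, here is a minimal sketch for locating those *.recovering files so they can be inspected before anything is removed. The data path is an assumption and may differ per installation.

```python
# A minimal sketch (not an official tool): list leftover *.recovering files
# under the Elasticsearch data directory so they can be inspected before
# deciding to remove or move them. The data path below is an assumption;
# adjust it to your installation.
import os

DATA_DIR = "/var/lib/elasticsearch"  # assumed data path

for root, _dirs, files in os.walk(DATA_DIR):
    for name in files:
        if name.endswith(".recovering"):
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            print(f"{size:>12}  {path}")
```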
[2015-07-03 14:09:37,196][WARN ][cluster.action.shard ] [mxserver] [abds-historic-snapshots-2015-07-03][1] received shard failed for [abds-historic-snapshots-2015-07-03][1], node[9HgooclMS6W9m-1lqKxV8Q], [P], s[INITIALIZING], indexUUID [6unWDyfbQ_yF9XTYAlMz4g], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[abds-historic-snapshots-2015-07-03][1] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchIllegalArgumentException[No version type match [116]]; ]]
[2015-07-03 14:09:37,205][WARN ][index.engine ] [mxserver] [abds-instance][0] failed to sync translog
[2015-07-03 14:09:37,206][WARN ][indices.cluster ] [mxserver] [[abds-instance][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [abds-instance][0] failed to recover shard
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
... 4 more
Caused by: org.elasticsearch.ElasticsearchException: failed to read [abdstrack][AdsbTrack-7668367]
at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:522)
at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
... 5 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [48]
at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:519)
... 6 more
[2015-07-03 14:09:37,206][WARN ][cluster.action.shard ] [mxserver] [abds-instance][0] received shard failed for [abds-instance][0], node[9HgooclMS6W9m-1lqKxV8Q], [P], s[INITIALIZING], indexUUID [H8FyNbqATmWQ6p8RYSGncw], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[abds-instance][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchException[failed to read [abdstrack][AdsbTrack-7668367]]; nested: ElasticsearchIllegalArgumentException[No version type match [48]]; ]]
[2015-07-03 14:09:37,216][WARN ][index.engine ] [mxserver] [abds-historic-snapshots-2015-07-03][1] failed to sync translog
[2015-07-03 14:09:37,217][WARN ][indices.cluster ] [mxserver] [[abds-historic-snapshots-2015-07-03][1]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [abds-historic-snapshots-2015-07-03][1] failed to recover shard
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [116]
at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:376)
at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
... 5 more
Note: This issue seems very similar to #10606 which I have reported before.
Is this the problem you solved?
This will be fixed in Elasticsearch 2.0. It is unlikely to make it into the 1.x series since it depends on a large number of changes that are only in 2.0.
When is Elasticsearch 2.0 scheduled for release?
Delete .recovery file inside the translog folder
Eg:/es/elasticsearch-1.7.1/data/[elasticsearch_clustername]/nodes/0/indices/[indexname]/2/translog/
I also had this kind of error after a partition filled up, and deleting the .recovery files as balaji006 suggested worked. I had a lot of affected index/shard directories, but after deleting each .recovery file Elasticsearch ran fine again.
Update:
Oops, spoke too soon. Now all queries give me "All shards failed for phase: [query]"
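For anyone else who ends up in this state, it can help to see which shards are actually unassigned rather than guessing. Here is a small sketch, assuming the node listens on localhost:9200; both endpoints exist in the 1.x/2.x series.

```python
# A rough sketch: query cluster health and list every shard with its state,
# so failed/unassigned shards can be identified after "All shards failed"
# errors. Assumes Elasticsearch listens on localhost:9200.
import json
import urllib.request

BASE = "http://localhost:9200"  # assumed address

with urllib.request.urlopen(BASE + "/_cluster/health") as resp:
    health = json.load(resp)
print("status:", health["status"], "- unassigned shards:", health["unassigned_shards"])

# Plain-text listing of every shard and its state (STARTED, INITIALIZING, UNASSIGNED, ...)
with urllib.request.urlopen(BASE + "/_cat/shards") as resp:
    print(resp.read().decode())
```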
I am running Elasticsearch 2.0, but am still receiving IndexShardRecoveryException failures:
[2015-11-23 18:03:32,670][WARN ][cluster.action.shard ] [The Russian] [logstash-2015.10.24][4] received shard failed for [logstash-2015.10.24][4], node[omb9PXHUTXqpKeesvkCbPw], [P], v[742647], s[INITIALIZING], a[id=XUctUOPUQLiHXyK2J9gdlg], unassigned_info[[reason=ALLOCATION_FAILED], at[2015-11-23T18:03:32.486Z], details[failed recovery, failure IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1]; ]], indexUUID [jf5m3aXaQLyH9gMhwMBuDQ], message [failed recovery], failure [IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1]; ]
[logstash-2015.10.24][[logstash-2015.10.24][4]] IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1];
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1
at org.elasticsearch.index.translog.Translog.upgradeLegacyTranslog(Translog.java:253)
at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:185)
at org.elasticsearch.index.engine.InternalEngine.<init>
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1349)
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1344)
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:889)
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:866)
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:249)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:60)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:133)
... 3 more
There has been no disk-full issue since my upgrade to 2.0, so the possibility of a recovery file getting corrupted is very low.
Any fixes / workaround would be very much appreciated.
Regards,
Sagar
Today the disk got full and Elasticsearch is not able to come back up again. Isn't there a built-in mechanism that prevents such failures? I agree that we should be monitoring disk space and not let this happen in the first place, but sometimes things happen.
My setup is a single node at present.
I don't see a clear way to recover the node. A post at https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html seemed to help, but a few indices still got corrupted and I have no way of recovering them.
In the end, I ended up deleting the indices, but that's not how it should be. Such things must ultimately be taken care of.
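On the "built-in mechanism" question: the closest safeguard I know of is the disk-based shard allocation watermarks, which stop Elasticsearch from allocating shards to a node that is running low on disk. They would not have protected against another process filling the disk, as happened here, but a rough sketch of enabling them through the cluster settings API follows; the address and thresholds are assumptions.

```python
# A sketch of enabling disk-based allocation watermarks via the cluster
# settings API. The settings and endpoint exist in the 1.x/2.x series;
# the address and threshold values below are assumptions.
import json
import urllib.request

BASE = "http://localhost:9200"  # assumed address

body = json.dumps({
    "persistent": {
        "cluster.routing.allocation.disk.threshold_enabled": True,
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
}).encode()

req = urllib.request.Request(BASE + "/_cluster/settings", data=body, method="PUT")
req.add_header("Content-Type", "application/json")
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())
```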
Same issue here. Applied the tips at https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html but it didn't solve the problem.
This is really really disappointing!
@CorbMax (and @kpcool ) which ES versions are you on?
I'm on 2.0, but going to upgrade to 2.1
Unfortunately I was obliged to delete indexes to unlock the system...
I think I've experienced the same just after upgrading from 2.1.0 to 2.2.0 (from the official stable PPA).
It's only a few devel indexes, but the recovery (after stopping Elasticsearch, growing the disk, and starting Elasticsearch again) seems to have filled up the disk very quickly with translog "stuff".
I'm just going to delete it, but this shouldn't be difficult to replicate.
@starkers can you please capture the files and logs before deleting, and share them somewhere? These things are typically not easy to reproduce :(
Should this not be reopened? I just had a disk-full and am now getting
[2016-02-06 06:01:59,643][WARN ][cluster.action.shard ] [ops-elk-1] [logstash-2016.02.05][2] received shard failed for ...
This is for 2.1.1
@systeminsightsbuild sadly there can be many reasons for this kind of failure. This specific issue is about translog corruption due to a failure to fully write an operation, which is fixed in 2.0. There might be other issues as well. It's hard to tell from the log line you sent, as it is missing the part that tells why the shard failed. If you can post that (and feel free to open a new issue), we can see what's going on.
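If it helps, here is a small, hypothetical helper for pulling the full failure reason out of the Elasticsearch log: it prints each recovery-failure warning together with the stack-trace lines that follow it. The log path is an assumption.

```python
# A sketch for extracting the full failure reason from the Elasticsearch log:
# print each recovery-failure line plus the lines that follow it (usually the
# stack trace with the "Caused by" chain). The log path is an assumption.
LOG_PATH = "/var/log/elasticsearch/elasticsearch.log"  # assumed log path

with open(LOG_PATH, errors="replace") as log:
    remaining = 0
    for line in log:
        if "failed recovery" in line or "received shard failed" in line:
            remaining = 40  # keep the next ~40 lines, enough for the stack trace
        if remaining:
            print(line, end="")
            remaining -= 1
```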
it works!!!
balaji006 commented on 1 Sep 2015
Delete .recovery file inside the translog folder
Eg:/es/elasticsearch-1.7.1/data/[elasticsearch_clustername]/nodes/0/indices/[indexname]/2/translog/
Thanks @balaji006
@simonw This is still there for 2.2.3.
@balaji006's workaround fixed the issue, but I think this needs to be addressed.
@ambodi can you open a new issue with the details of what you saw? This can come in many flavors. I'm also curious how you had a .recovering translog file, which is not used in 2.x.
@bleskes here is what I see:
2016-04-14T10:02:45.691973552Z Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
2016-04-14T10:02:45.691977952Z at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
2016-04-14T10:02:45.691982252Z at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
2016-04-14T10:02:45.691992452Z ... 4 more
@bleskes we upgraded from 1.5 to 2.2.3
@ambodi thx. That exception stack trace refers to a class that has been removed in the 2.x series. The code that generated this exception is therefore from your 1.5 version. This makes me think something went wrong with your upgrade and that that node is still on 1.5.
PS: I take it you mean an upgrade to 2.2.3 (as you wrote before) and not 2.8.
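A quick way to rule that out is to ask the node itself which version it is running; the root endpoint reports it. A minimal sketch, assuming the node is on localhost:9200:

```python
# Ask the node which Elasticsearch version it is actually running; a stale
# 1.5 node would explain the 1.x stack trace. The address is an assumption.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:9200/") as resp:
    info = json.load(resp)
print(info["version"]["number"])
```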
For reference, an original thread with a complete set of instructions for this error is at:
And to correct a mistake found above: the suffix ".recovery" is mistaken. The correct suffix is ".recovering".
For us, after stopping ES, moving these .recovering files to another filesystem, and then starting ES, our cluster was able to recover. (ES version 1.6.2)
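In case it is useful to others, here is a minimal sketch of that "move instead of delete" approach. Both paths are assumptions; run it only while Elasticsearch is stopped.

```python
# Relocate any *.recovering files to a backup directory (ideally on another
# filesystem) while Elasticsearch is stopped, preserving their relative paths
# so they could be moved back later. Both paths are assumptions.
import os
import shutil

DATA_DIR = "/var/lib/elasticsearch"       # assumed data path
BACKUP_DIR = "/mnt/backup/es-recovering"  # assumed backup location

for root, _dirs, files in os.walk(DATA_DIR):
    for name in files:
        if name.endswith(".recovering"):
            src = os.path.join(root, name)
            rel = os.path.relpath(src, DATA_DIR)
            dst = os.path.join(BACKUP_DIR, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            print("moving", src, "->", dst)
            shutil.move(src, dst)
```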
@tamsky the link doesn't work, maybe the elasticsearch group was deleted/moved?
FWIW I found this issue because I had a problem with ES 2.3.3 running out of disk space and then not recovering properly. But I guess it's not related to this issue since the .recovering file is no longer used? Sorry, I don't have logs from the ES 2.3.3 problem.
Thanks for pointing out the group is gone.
I'm disappointed the ES team invalidated (and made unsearchable by old URL) all those group links after their bulk import and announcement. I've learned my lesson: at a minimum, quote the thread subject.
A bit of spelunking later, I found a citation containing both the thread URL and the subject:
[ ES failed to recover after crash ]
Here's the migrated thread:
https://discuss.elastic.co/t/es-failed-to-recover-after-crash/8195
I guess the message I had linked to was this
https://discuss.elastic.co/t/es-failed-to-recover-after-crash/8195/5
but my comment giving corrections seems out of place or already corrected.