On one of our servers running Elasticsearch, another process wrote so many logfiles that the disk ran out of space. After deleting these logfiles and rebooting the system, Elasticsearch did not recover.
We are running on a single server, using Elasticsearch 1.5.2.
I believe we managed to recover by deleting some of the *.recovering files in the Elasticsearch data directories, but it would be great if Elasticsearch could recover as much as possible by itself.
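For reference, here is a minimal sketch for locating those *.recovering files so they can be inspected before anything is removed. The data path is an assumption and may differ per installation.

```python
# A minimal sketch (not an official tool): list leftover *.recovering files
# under the Elasticsearch data directory so they can be inspected before
# deciding to remove or move them. The data path below is an assumption;
# adjust it to your installation.
import os

DATA_DIR = "/var/lib/elasticsearch"  # assumed data path

for root, _dirs, files in os.walk(DATA_DIR):
    for name in files:
        if name.endswith(".recovering"):
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            print(f"{size:>12}  {path}")
```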
[2015-07-03 14:09:37,196][WARN ][cluster.action.shard ] [mxserver] [abds-historic-snapshots-2015-07-03][1] received shard failed for [abds-historic-snapshots-2015-07-03][1], node[9HgooclMS6W9m-1lqKxV8Q], [P], s[INITIALIZING], indexUUID [6unWDyfbQ_yF9XTYAlMz4g], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[abds-historic-snapshots-2015-07-03][1] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchIllegalArgumentException[No version type match [116]]; ]]
[2015-07-03 14:09:37,205][WARN ][index.engine ] [mxserver] [abds-instance][0] failed to sync translog
[2015-07-03 14:09:37,206][WARN ][indices.cluster ] [mxserver] [[abds-instance][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [abds-instance][0] failed to recover shard
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
... 4 more
Caused by: org.elasticsearch.ElasticsearchException: failed to read [abdstrack][AdsbTrack-7668367]
at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:522)
at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
... 5 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [48]
at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:519)
... 6 more
[2015-07-03 14:09:37,206][WARN ][cluster.action.shard ] [mxserver] [abds-instance][0] received shard failed for [abds-instance][0], node[9HgooclMS6W9m-1lqKxV8Q], [P], s[INITIALIZING], indexUUID [H8FyNbqATmWQ6p8RYSGncw], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[abds-instance][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchException[failed to read [abdstrack][AdsbTrack-7668367]]; nested: ElasticsearchIllegalArgumentException[No version type match [48]]; ]]
[2015-07-03 14:09:37,216][WARN ][index.engine ] [mxserver] [abds-historic-snapshots-2015-07-03][1] failed to sync translog
[2015-07-03 14:09:37,217][WARN ][indices.cluster ] [mxserver] [[abds-historic-snapshots-2015-07-03][1]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [abds-historic-snapshots-2015-07-03][1] failed to recover shard
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [116]
at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:376)
at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
... 5 more
Note: This issue seems very similar to #10606 which I have reported before.
Is this the problem you solved?
This will be fixed in Elasticsearch 2.0. It is unlikely to make it into the 1.x series since it depends on a large number of changes that are only in 2.0.
When is Elasticsearch 2.0 scheduled for release?
Delete .recovery file inside the translog folder
Eg:/es/elasticsearch-1.7.1/data/[elasticsearch_clustername]/nodes/0/indices/[indexname]/2/translog/
I also had this kind of error after a partition filled up, and deleting the .recovery files as balaji006 suggested worked. I had a lot of affected index/shard directories, but after deleting each .recovery file Elasticsearch ran fine again.
Update:
Oops, spoke too soon. Now all queries give me "All shards failed for phase: [query]"
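For anyone else who ends up in this state, it can help to see which shards are actually unassigned rather than guessing. Here is a small sketch, assuming the node listens on localhost:9200; both endpoints exist in the 1.x/2.x series.

```python
# A rough sketch: query cluster health and list every shard with its state,
# so failed/unassigned shards can be identified after "All shards failed"
# errors. Assumes Elasticsearch listens on localhost:9200.
import json
import urllib.request

BASE = "http://localhost:9200"  # assumed address

with urllib.request.urlopen(BASE + "/_cluster/health") as resp:
    health = json.load(resp)
print("status:", health["status"], "- unassigned shards:", health["unassigned_shards"])

# Plain-text listing of every shard and its state (STARTED, INITIALIZING, UNASSIGNED, ...)
with urllib.request.urlopen(BASE + "/_cat/shards") as resp:
    print(resp.read().decode())
```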
I am running Elasticsearch 2.0, but am still receiving IndexShardRecoveryException failures:
[2015-11-23 18:03:32,670][WARN ][cluster.action.shard ] [The Russian] [logstash-2015.10.24][4] received shard failed for [logstash-2015.10.24][4], node[omb9PXHUTXqpKeesvkCbPw], [P], v[742647], s[INITIALIZING], a[id=XUctUOPUQLiHXyK2J9gdlg], unassigned_info[[reason=ALLOCATION_FAILED], at[2015-11-23T18:03:32.486Z], details[failed recovery, failure IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1]; ]], indexUUID [jf5m3aXaQLyH9gMhwMBuDQ], message [failed recovery], failure [IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1]; ]
[logstash-2015.10.24][[logstash-2015.10.24][4]] IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1];
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1
at org.elasticsearch.index.translog.Translog.upgradeLegacyTranslog(Translog.java:253)
at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:185)
at org.elasticsearch.index.engine.InternalEngine.<init>
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1349)
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1344)
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:889)
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:866)
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:249)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:60)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:133)
... 3 more
There has been no disk-full issue since my upgrade to 2.0, so the possibility of a recovery file getting corrupted is very low.
Any fixes / workaround would be very much appreciated.
Regards,
Sagar
Today the disk got full and Elasticsearch is not able to come back up again. Isn't there a built-in mechanism that prevents such failures? I agree that we should be monitoring disk space and not let this happen in the first place, but sometimes things happen.
My setup is a single node at present.
I don't see a clear way to recover the node. A post at https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html seemed to help, but a few indices still got corrupted and I have no way of recovering them.
In the end, I ended up deleting the indices, but that's not how it should be. Such things must ultimately be taken care of.
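On the "built-in mechanism" question: the closest safeguard I know of is the disk-based shard allocation watermarks, which stop Elasticsearch from allocating shards to a node that is running low on disk. They would not have protected against another process filling the disk, as happened here, but a rough sketch of enabling them through the cluster settings API follows; the address and thresholds are assumptions.

```python
# A sketch of enabling disk-based allocation watermarks via the cluster
# settings API. The settings and endpoint exist in the 1.x/2.x series;
# the address and threshold values below are assumptions.
import json
import urllib.request

BASE = "http://localhost:9200"  # assumed address

body = json.dumps({
    "persistent": {
        "cluster.routing.allocation.disk.threshold_enabled": True,
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
}).encode()

req = urllib.request.Request(BASE + "/_cluster/settings", data=body, method="PUT")
req.add_header("Content-Type", "application/json")
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())
```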
Same issue here. Applied the tips at https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html but it didn't solve the problem.
This is really really disappointing!
@CorbMax (and @kpcool ) which ES versions are you on?
I'm on 2.0, but going to upgrade to 2.1
Unfortunately I was obliged to delete indexes to unlock the system...
I think I've experienced the same just after upgrading from 2.1.0 to 2.2.0 (from the official stable PPA).
It's only a few devel indexes, but the recovery (after stopping Elasticsearch, growing the disk, and starting Elasticsearch again) seems to have filled up the disk very quickly with translog "stuff".
I'm just going to delete it, but this shouldn't be difficult to replicate.
@starkers can you please capture the files and logs before deleting, and share them somewhere? These things are typically not easy to reproduce :(
Should this not be reopened? I just had a disk-full and am now getting
[2016-02-06 06:01:59,643][WARN ][cluster.action.shard ] [ops-elk-1] [logstash-2016.02.05][2] received shard failed for ...
This is for 2.1.1
@systeminsightsbuild sadly there can be many reasons for this kind of failure. This specific issue is about translog corruption due to a failure to fully write an operation, which is fixed in 2.0. There might be other issues as well. It's hard to tell from the log line you sent, as it is missing the part that tells why the shard failed. If you can post that (and feel free to open a new issue), we can see what's going on.
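If it helps, here is a small, hypothetical helper for pulling the full failure reason out of the Elasticsearch log: it prints each recovery-failure warning together with the stack-trace lines that follow it. The log path is an assumption.

```python
# A sketch for extracting the full failure reason from the Elasticsearch log:
# print each recovery-failure line plus the lines that follow it (usually the
# stack trace with the "Caused by" chain). The log path is an assumption.
LOG_PATH = "/var/log/elasticsearch/elasticsearch.log"  # assumed log path

with open(LOG_PATH, errors="replace") as log:
    remaining = 0
    for line in log:
        if "failed recovery" in line or "received shard failed" in line:
            remaining = 40  # keep the next ~40 lines, enough for the stack trace
        if remaining:
            print(line, end="")
            remaining -= 1
```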
it works!!!
balaji006 commented on 1 Sep 2015
Delete .recovery file inside the translog folder
Eg:/es/elasticsearch-1.7.1/data/[elasticsearch_clustername]/nodes/0/indices/[indexname]/2/translog/
Thanks @balaji006
@simonw This is still there for 2.2.3.
@balaji006's workaround fixed the issue, but I think this needs to be addressed.
@ambodi can you open a new issue with the details of what you saw? This can come in many flavors. I'm also curious how you had a .recovering translog file, which is not used in 2.x.
@bleskes here is what I see:
2016-04-14T10:02:45.691973552Z Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
2016-04-14T10:02:45.691977952Z at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
2016-04-14T10:02:45.691982252Z at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
2016-04-14T10:02:45.691992452Z ... 4 more
@bleskes we upgraded from 1.5 to 2.2.3
@ambodi thx. That exception stack trace refers to a class that has been removed in the 2.x series. The code that generated this exception is therefore from your 1.5 version. This makes me think something went wrong with your upgrade and that that node is still on 1.5.
PS: I take it you mean an upgrade to 2.2.3 (as you wrote before) and not 2.8.
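A quick way to rule that out is to ask the node itself which version it is running; the root endpoint reports it. A minimal sketch, assuming the node is on localhost:9200:

```python
# Ask the node which Elasticsearch version it is actually running; a stale
# 1.5 node would explain the 1.x stack trace. The address is an assumption.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:9200/") as resp:
    info = json.load(resp)
print(info["version"]["number"])
```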
For reference, an original thread with a complete set of instructions for this error is at:
And to correct a mistake found above: the suffix ".recovery" is mistaken. The correct suffix is ".recovering".
For us, after stopping ES, moving these .recovering files to another filesystem, and then starting ES, our cluster was able to recover. (ES version 1.6.2)
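In case it is useful to others, here is a minimal sketch of that "move instead of delete" approach. Both paths are assumptions; run it only while Elasticsearch is stopped.

```python
# Relocate any *.recovering files to a backup directory (ideally on another
# filesystem) while Elasticsearch is stopped, preserving their relative paths
# so they could be moved back later. Both paths are assumptions.
import os
import shutil

DATA_DIR = "/var/lib/elasticsearch"       # assumed data path
BACKUP_DIR = "/mnt/backup/es-recovering"  # assumed backup location

for root, _dirs, files in os.walk(DATA_DIR):
    for name in files:
        if name.endswith(".recovering"):
            src = os.path.join(root, name)
            rel = os.path.relpath(src, DATA_DIR)
            dst = os.path.join(BACKUP_DIR, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            print("moving", src, "->", dst)
            shutil.move(src, dst)
```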
@tamsky the link doesn't work, maybe the elasticsearch group was deleted/moved?
FWIW I found this issue because I had a problem with ES 2.3.3 running out of disk space and then not recovering properly. But I guess it's not related to this issue since the .recovering file is no longer used? Sorry, I don't have logs from the ES 2.3.3 problem.
Thanks for pointing out the group is gone.
I'm disappointed the ES team invalidated (and made unsearchable by old URL) all those group links after their bulk import and announcement. I've learned my lesson: at a minimum, quote the thread subject.
A bit of spelunking later, I found a citation containing both the thread URL and the subject:
[ ES failed to recover after crash ]
Here's the migrated thread:
https://discuss.elastic.co/t/es-failed-to-recover-after-crash/8195
I guess the message I had linked to was this
https://discuss.elastic.co/t/es-failed-to-recover-after-crash/8195/5
but my comment giving corrections seems out of place or already corrected.