This doesn't reproduce locally, and I'm not sure whether the root cause is something in the test itself or the node failing.
testCorruptIndex has this in its logging:
1> [2018-12-04T11:42:15,394][INFO ][o.e.i.s.RemoveCorruptedShardDataCommandIT] [testCorruptIndex] after test
ERROR 1.71s J5 | RemoveCorruptedShardDataCommandIT.testCorruptIndex <<< FAILURES!
> Throwable #1: ElasticsearchException[Shard does not seem to be corrupted at /var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob-unix-compatibility/os/ubuntu&&virtual/server/build/testrun/integTest/J5/temp/org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT_7292077B46A25BBF-001/tempDir-002/data/nodes/0/indices/-LSNcjAZQ0uqCXST_25kNA/0]
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.lambda$execute$1(RemoveCorruptedShardDataCommand.java:347)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.findAndProcessShardPath(RemoveCorruptedShardDataCommand.java:202)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.execute(RemoveCorruptedShardDataCommand.java:282)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT$2.onNodeStopped(RemoveCorruptedShardDataCommandIT.java:192)
> at org.elasticsearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:917)
> at org.elasticsearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1689)
> at org.elasticsearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1649)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT.testCorruptIndex(RemoveCorruptedShardDataCommandIT.java:188)
> at java.lang.Thread.run(Thread.java:748)
All the tests have this in their logging:
java.lang.RuntimeException: already closed
> at org.elasticsearch.test.InternalTestCluster$NodeAndClient.client(InternalTestCluster.java:849)
> at org.elasticsearch.test.InternalTestCluster.client(InternalTestCluster.java:703)
> at org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:641)
> at org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:634)
> at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:553)
> at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2191)
> at java.lang.Thread.run(Thread.java:748)
./gradlew :server:integTest -Dtests.seed=7292077B46A25BBF -Dtests.class=org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT -Dtests.method="testCorruptIndex" -Dtests.security.manager=true -Dtests.locale=ko-KR -Dtests.timezone=Pacific/Tarawa -Dcompiler.java=11 -Druntime.java=8
Pinging @elastic/es-distributed
Muted on master since this has failed a handful of times over the last month - https://github.com/elastic/elasticsearch/commit/01b8f99c17c4fcbe5d6ab9c3bca9153f21e610ee
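For anyone unfamiliar with the muting convention: a mute typically just annotates the test with the test framework's `@AwaitsFix` so the runner skips it and links back to the tracking issue. A rough sketch (the URL is a placeholder; the commit linked above shows the actual change):

```java
import org.apache.lucene.util.LuceneTestCase.AwaitsFix;

// Typical shape of a mute: the runner skips the annotated test and the
// bugUrl points back at the tracking issue. (Placeholder URL below; see
// the linked commit for the real mute.)
@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/<this-issue>")
public void testCorruptIndex() throws Exception {
    // ... test body unchanged ...
}
```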
I can reproduce this locally by running org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT#testCorruptIndex in a loop with constant seed 7292077B46A25BBF (fails about 0.1% of runs). I'll see if I can fix it :)
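In case it helps anyone else reproduce: one way to run the test in a loop with a constant seed is randomizedtesting's `@Repeat` annotation (a sketch, assuming the seed is supplied via `-Dtests.seed` as in the repro line above):

```java
import com.carrotsearch.randomizedtesting.annotations.Repeat;

// Run the test many times in one JVM. useConstantSeed stops randomizedtesting
// from picking a fresh seed per iteration, so -Dtests.seed=7292077B46A25BBF
// applies to every run.
@Repeat(iterations = 1000, useConstantSeed = true)
public void testCorruptIndex() throws Exception {
    // ... test body unchanged ...
}
```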
Fixed in https://github.com/elastic/elasticsearch/pull/36208 I think.
It looks like this also affects 6.7: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+internalClusterTest/135/console.
I don't think we'll be merging zen2 into 6.7, so I'm not sure what to do with this test. Any ideas?
@dimitris-athanasiou yeah, the reason for this was somewhat obvious: the non-atomic way zen1 used to handle some files here. I'll see what I can do to fix 6.x, but I guess we should re-mute it in 6.7 for now.
I've raised a PR to mute it.
Muted the test for 6.6.
@original-brownbear are you still investigating this one, or should we close it as an unfixed test bug in 6.x?
@ywelsch I'd close it; sorry for letting this fall off the radar. It seems to be a concurrency issue (or several) with the test cluster in 6.x, and it looks like it's not trivial to fix, since the fix in #39168 doesn't easily backport to 6.x. Closing.
I have an idea for resolving this, and a need to do so, so I'm reopening this and recording my idea here. The failure is this:
> Throwable #1: ElasticsearchException[Shard does not seem to be corrupted at /home/davidturner/elasticsearch/server/build/testrun/integTest/J0/temp/org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT_29C375145B346F30-001/tempDir-002/data/nodes/0/indices/NLjYFRooRa6kcAKiiQk-SQ/0]
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.lambda$execute$1(RemoveCorruptedShardDataCommand.java:367)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.findAndProcessShardPath(RemoveCorruptedShardDataCommand.java:211)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.execute(RemoveCorruptedShardDataCommand.java:297)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT$2.onNodeStopped(RemoveCorruptedShardDataCommandIT.java:193)
> at org.elasticsearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:952)
> at org.elasticsearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1767)
> at org.elasticsearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1720)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT.testCorruptIndex(RemoveCorruptedShardDataCommandIT.java:189)
> at jdk.internal.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:567)
> at java.base/java.lang.Thread.run(Thread.java:835)
> Throwable #2: java.lang.RuntimeException: already closed
> at org.elasticsearch.test.InternalTestCluster$NodeAndClient.client(InternalTestCluster.java:884)
> at org.elasticsearch.test.InternalTestCluster.client(InternalTestCluster.java:721)
> at org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:651)
> at org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:644)
> at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:563)
> at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2205)
> at jdk.internal.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:567)
> at java.base/java.lang.Thread.run(Thread.java:835)
The reason for this is that there's no corruption marker on the shard. Today we wait for the shard to fail to allocate and assume that this is because it's corrupt, but in fact it can fail to allocate for other reasons too:
1> [2019-10-02T14:14:43,540][WARN ][o.e.g.G.InternalPrimaryShardAllocator] [node_s0] [index42][0]: failed to list shard for shard_started on node [I6xspYPfRZ6BfvEHHGh5Fg]
1> org.elasticsearch.action.FailedNodeException: Failed node [I6xspYPfRZ6BfvEHHGh5Fg]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:236) [main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:151) [main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:210) [main/:?]
1> at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1114) [main/:?]
1> at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1226) [main/:?]
1> at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1200) [main/:?]
1> at org.elasticsearch.transport.TransportService$7.onFailure(TransportService.java:703) [main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:736) [main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) [main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
1> at java.lang.Thread.run(Thread.java:835) [?:?]
1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_s0][127.0.0.1:36643][internal:gateway/local/started_shards[n]]
1> Caused by: org.elasticsearch.ElasticsearchException: failed to load started shards
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:169) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:61) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:692) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> ... 3 more
1> Caused by: org.elasticsearch.ElasticsearchException: java.io.IOException: failed to read [id:2, file:/home/davidturner/elasticsearch/server/build/testrun/integTest/J0/temp/org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT_29C375145B346F30-001/tempDir-002/data/nodes/0/indices/NLjYFRooRa6kcAKiiQk-SQ/_state/state-2.st]
1> at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:165) ~[main/:?]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:306) ~[main/:?]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:324) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:127) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:61) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:692) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> ... 3 more
1> Caused by: java.io.IOException: failed to read [id:2, file:/home/davidturner/elasticsearch/server/build/testrun/integTest/J0/temp/org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT_29C375145B346F30-001/tempDir-002/data/nodes/0/indices/NLjYFRooRa6kcAKiiQk-SQ/_state/state-2.st]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:300) ~[main/:?]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:324) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:127) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:61) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:692) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> ... 3 more
1> Caused by: java.nio.file.NoSuchFileException: /home/davidturner/elasticsearch/server/build/testrun/integTest/J0/temp/org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT_29C375145B346F30-001/tempDir-002/data/nodes/0/indices/NLjYFRooRa6kcAKiiQk-SQ/_state/state-2.st
1> at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
1> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
1> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
1> at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219) ~[?:?]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at org.apache.lucene.mockfile.HandleTrackingFS.newByteChannel(HandleTrackingFS.java:240) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at org.apache.lucene.mockfile.HandleTrackingFS.newByteChannel(HandleTrackingFS.java:240) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at java.nio.file.Files.newByteChannel(Files.java:373) ~[?:?]
1> at java.nio.file.Files.newByteChannel(Files.java:424) ~[?:?]
1> at org.apache.lucene.store.SimpleFSDirectory.openInput(SimpleFSDirectory.java:77) ~[lucene-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:28]
1> at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:183) ~[main/:?]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:296) ~[main/:?]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:324) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:127) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:61) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:692) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> ... 3 more
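To spell out the fragile step: the test's wait amounts to something like the sketch below (illustrative only, with invented variable names; not the actual test code), which treats any allocation failure as proof of corruption:

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.cluster.health.ClusterHealthStatus;

// Illustrative sketch: wait for the index to go RED and assume the injected
// corruption is the cause. As the trace above shows, the shard can also fail
// to allocate for transient reasons (here, a state file vanishing mid-read),
// in which case there is no corruption marker on disk for the tool to find.
assertBusy(() -> {
    ClusterHealthResponse health = client().admin().cluster()
        .prepareHealth(indexName)   // indexName: the deliberately-corrupted index
        .get();
    assertEquals(ClusterHealthStatus.RED, health.getStatus());
});
```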
I think this is the concurrency issue mentioned above. We sort of expect this to happen in 6.x, and we retry the failed action a bit later if it occurs, but the test treats the failure as a signal to proceed and doesn't give it a chance to retry. I think we can let it retry until it hits a corruption exception; I'm running experiments to verify this.
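Concretely, what I'm experimenting with looks roughly like this (a hypothetical sketch, not necessarily the shape of the final fix): keep polling the allocation explanation until the unassigned details actually mention corruption, so a transient failure like the one above simply gets retried:

```java
import org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainResponse;
import org.elasticsearch.cluster.routing.UnassignedInfo;

// Hypothetical sketch: only proceed once the shard's unassigned details point
// at real corruption. A transient FailedNodeException (as in the trace above)
// fails the assertion, assertBusy retries, and the allocator gets its retry.
assertBusy(() -> {
    ClusterAllocationExplainResponse response = client().admin().cluster()
        .prepareAllocationExplain()
        .setIndex(indexName).setShard(0).setPrimary(true)
        .get();
    UnassignedInfo unassignedInfo = response.getExplanation().getUnassignedInfo();
    assertNotNull("expected the shard to be unassigned", unassignedInfo);
    assertThat(unassignedInfo.getDetails(), containsString("CorruptIndexException"));
});
```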
Fixed by #47456.