This doesn't reproduce locally, and I'm not sure whether the root cause is something in the test itself or the node failing.
testCorruptIndex has this in its logging:
1> [2018-12-04T11:42:15,394][INFO ][o.e.i.s.RemoveCorruptedShardDataCommandIT] [testCorruptIndex] after test
ERROR 1.71s J5 | RemoveCorruptedShardDataCommandIT.testCorruptIndex <<< FAILURES!
> Throwable #1: ElasticsearchException[Shard does not seem to be corrupted at /var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob-unix-compatibility/os/ubuntu&&virtual/server/build/testrun/integTest/J5/temp/org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT_7292077B46A25BBF-001/tempDir-002/data/nodes/0/indices/-LSNcjAZQ0uqCXST_25kNA/0]
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.lambda$execute$1(RemoveCorruptedShardDataCommand.java:347)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.findAndProcessShardPath(RemoveCorruptedShardDataCommand.java:202)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.execute(RemoveCorruptedShardDataCommand.java:282)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT$2.onNodeStopped(RemoveCorruptedShardDataCommandIT.java:192)
> at org.elasticsearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:917)
> at org.elasticsearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1689)
> at org.elasticsearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1649)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT.testCorruptIndex(RemoveCorruptedShardDataCommandIT.java:188)
> at java.lang.Thread.run(Thread.java:748)
All the tests have this in their logging:
java.lang.RuntimeException: already closed
> at org.elasticsearch.test.InternalTestCluster$NodeAndClient.client(InternalTestCluster.java:849)
> at org.elasticsearch.test.InternalTestCluster.client(InternalTestCluster.java:703)
> at org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:641)
> at org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:634)
> at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:553)
> at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2191)
> at java.lang.Thread.run(Thread.java:748)
./gradlew :server:integTest -Dtests.seed=7292077B46A25BBF -Dtests.class=org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT -Dtests.method="testCorruptIndex" -Dtests.security.manager=true -Dtests.locale=ko-KR -Dtests.timezone=Pacific/Tarawa -Dcompiler.java=11 -Druntime.java=8
Pinging @elastic/es-distributed
Muted on master since this has failed a handful of times over the last month - https://github.com/elastic/elasticsearch/commit/01b8f99c17c4fcbe5d6ab9c3bca9153f21e610ee
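For anyone unfamiliar with the muting convention: a mute typically just annotates the test with the test framework's `@AwaitsFix` so the runner skips it and links back to the tracking issue. A rough sketch (the URL is a placeholder; the commit linked above shows the actual change):

```java
import org.apache.lucene.util.LuceneTestCase.AwaitsFix;

// Typical shape of a mute: the runner skips the annotated test and the
// bugUrl points back at the tracking issue. (Placeholder URL below; see
// the linked commit for the real mute.)
@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/<this-issue>")
public void testCorruptIndex() throws Exception {
    // ... test body unchanged ...
}
```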
I can reproduce this locally by running org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT#testCorruptIndex in a loop with constant seed 7292077B46A25BBF (fails about 0.1% of runs). I'll see if I can fix it :)
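In case it helps anyone else reproduce: one way to run the test in a loop with a constant seed is randomizedtesting's `@Repeat` annotation (a sketch, assuming the seed is supplied via `-Dtests.seed` as in the repro line above):

```java
import com.carrotsearch.randomizedtesting.annotations.Repeat;

// Run the test many times in one JVM. useConstantSeed stops randomizedtesting
// from picking a fresh seed per iteration, so -Dtests.seed=7292077B46A25BBF
// applies to every run.
@Repeat(iterations = 1000, useConstantSeed = true)
public void testCorruptIndex() throws Exception {
    // ... test body unchanged ...
}
```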
Fixed in https://github.com/elastic/elasticsearch/pull/36208 I think.
It looks like this also affects 6.7: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+internalClusterTest/135/console.
I don't think we'll be merging zen2 into 6.7, so I'm not sure what to do with this test. Any ideas?
@dimitris-athanasiou yeah, the reason for this was somewhat obvious: the non-atomic way zen1 used to handle some files here. I'll see what I can do to fix 6.x, but I guess we should re-mute it in 6.7 for now.
I've raised a PR to mute it.
Muted the test for 6.6.
@original-brownbear are you still investigating this one, or should we close it as an unfixed test bug in 6.x?
@ywelsch I'd close it; sorry for letting this fall off the radar. It seems to be a concurrency issue (or several) with the test cluster in 6.x, and it looks like it's not trivial to fix, since the fix in #39168 doesn't easily backport to 6.x. Closing.
I have an idea for resolving this, and a need to do so, so I'm reopening this and recording my idea here. The failure is this:
> Throwable #1: ElasticsearchException[Shard does not seem to be corrupted at /home/davidturner/elasticsearch/server/build/testrun/integTest/J0/temp/org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT_29C375145B346F30-001/tempDir-002/data/nodes/0/indices/NLjYFRooRa6kcAKiiQk-SQ/0]
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.lambda$execute$1(RemoveCorruptedShardDataCommand.java:367)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.findAndProcessShardPath(RemoveCorruptedShardDataCommand.java:211)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.execute(RemoveCorruptedShardDataCommand.java:297)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT$2.onNodeStopped(RemoveCorruptedShardDataCommandIT.java:193)
> at org.elasticsearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:952)
> at org.elasticsearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1767)
> at org.elasticsearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1720)
> at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT.testCorruptIndex(RemoveCorruptedShardDataCommandIT.java:189)
> at jdk.internal.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:567)
> at java.base/java.lang.Thread.run(Thread.java:835)
> Throwable #2: java.lang.RuntimeException: already closed
> at org.elasticsearch.test.InternalTestCluster$NodeAndClient.client(InternalTestCluster.java:884)
> at org.elasticsearch.test.InternalTestCluster.client(InternalTestCluster.java:721)
> at org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:651)
> at org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:644)
> at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:563)
> at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2205)
> at jdk.internal.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:567)
> at java.base/java.lang.Thread.run(Thread.java:835)
The reason for this is that there's no corruption marker on the shard. Today we wait for the shard to fail to allocate and assume that this is because it's corrupt, but in fact it can fail to allocate for other reasons too:
1> [2019-10-02T14:14:43,540][WARN ][o.e.g.G.InternalPrimaryShardAllocator] [node_s0] [index42][0]: failed to list shard for shard_started on node [I6xspYPfRZ6BfvEHHGh5Fg]
1> org.elasticsearch.action.FailedNodeException: Failed node [I6xspYPfRZ6BfvEHHGh5Fg]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:236) [main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:151) [main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:210) [main/:?]
1> at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1114) [main/:?]
1> at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1226) [main/:?]
1> at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1200) [main/:?]
1> at org.elasticsearch.transport.TransportService$7.onFailure(TransportService.java:703) [main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:736) [main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) [main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
1> at java.lang.Thread.run(Thread.java:835) [?:?]
1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_s0][127.0.0.1:36643][internal:gateway/local/started_shards[n]]
1> Caused by: org.elasticsearch.ElasticsearchException: failed to load started shards
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:169) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:61) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:692) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> ... 3 more
1> Caused by: org.elasticsearch.ElasticsearchException: java.io.IOException: failed to read [id:2, file:/home/davidturner/elasticsearch/server/build/testrun/integTest/J0/temp/org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT_29C375145B346F30-001/tempDir-002/data/nodes/0/indices/NLjYFRooRa6kcAKiiQk-SQ/_state/state-2.st]
1> at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:165) ~[main/:?]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:306) ~[main/:?]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:324) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:127) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:61) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:692) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> ... 3 more
1> Caused by: java.io.IOException: failed to read [id:2, file:/home/davidturner/elasticsearch/server/build/testrun/integTest/J0/temp/org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT_29C375145B346F30-001/tempDir-002/data/nodes/0/indices/NLjYFRooRa6kcAKiiQk-SQ/_state/state-2.st]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:300) ~[main/:?]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:324) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:127) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:61) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:692) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> ... 3 more
1> Caused by: java.nio.file.NoSuchFileException: /home/davidturner/elasticsearch/server/build/testrun/integTest/J0/temp/org.elasticsearch.index.shard.RemoveCorruptedShardDataCommandIT_29C375145B346F30-001/tempDir-002/data/nodes/0/indices/NLjYFRooRa6kcAKiiQk-SQ/_state/state-2.st
1> at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
1> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
1> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
1> at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219) ~[?:?]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at org.apache.lucene.mockfile.HandleTrackingFS.newByteChannel(HandleTrackingFS.java:240) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at org.apache.lucene.mockfile.HandleTrackingFS.newByteChannel(HandleTrackingFS.java:240) ~[lucene-test-framework-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:32]
1> at java.nio.file.Files.newByteChannel(Files.java:373) ~[?:?]
1> at java.nio.file.Files.newByteChannel(Files.java:424) ~[?:?]
1> at org.apache.lucene.store.SimpleFSDirectory.openInput(SimpleFSDirectory.java:77) ~[lucene-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:28]
1> at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:183) ~[main/:?]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:296) ~[main/:?]
1> at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:324) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:127) ~[main/:?]
1> at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:61) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[main/:?]
1> at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:692) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> ... 3 more
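To spell out the fragile step: the test's wait amounts to something like the sketch below (illustrative only, with invented variable names; not the actual test code), which treats any allocation failure as proof of corruption:

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.cluster.health.ClusterHealthStatus;

// Illustrative sketch: wait for the index to go RED and assume the injected
// corruption is the cause. As the trace above shows, the shard can also fail
// to allocate for transient reasons (here, a state file vanishing mid-read),
// in which case there is no corruption marker on disk for the tool to find.
assertBusy(() -> {
    ClusterHealthResponse health = client().admin().cluster()
        .prepareHealth(indexName)   // indexName: the deliberately-corrupted index
        .get();
    assertEquals(ClusterHealthStatus.RED, health.getStatus());
});
```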
I think this is the concurrency issue mentioned above. We sort of expect this to happen in 6.x, and we retry the failed action a bit later if it occurs, but the test treats the failure as a signal to proceed and doesn't give it a chance to retry. I think we can let it retry until it hits a corruption exception; I'm running experiments to verify this.
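Concretely, what I'm experimenting with looks roughly like this (a hypothetical sketch, not necessarily the shape of the final fix): keep polling the allocation explanation until the unassigned details actually mention corruption, so a transient failure like the one above simply gets retried:

```java
import org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainResponse;
import org.elasticsearch.cluster.routing.UnassignedInfo;

// Hypothetical sketch: only proceed once the shard's unassigned details point
// at real corruption. A transient FailedNodeException (as in the trace above)
// fails the assertion, assertBusy retries, and the allocator gets its retry.
assertBusy(() -> {
    ClusterAllocationExplainResponse response = client().admin().cluster()
        .prepareAllocationExplain()
        .setIndex(indexName).setShard(0).setPrimary(true)
        .get();
    UnassignedInfo unassignedInfo = response.getExplanation().getUnassignedInfo();
    assertNotNull("expected the shard to be unassigned", unassignedInfo);
    assertThat(unassignedInfo.getDetails(), containsString("CorruptIndexException"));
});
```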
Fixed by #47456.