Elasticsearch: [CI] IndicesQueryCache.close assertion failure in internal cluster tests

Created on 3 Jan 2019 · 26 comments · Source: elastic/elasticsearch

The timing suggests these failures are related to the change moving the ML config out of the cluster state.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu8,nodes=virtual&&linux/152/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=virtual&&linux/149/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+periodic/57/console

The log contains an assertion failure:

java.lang.AssertionError: {org.apache.lucene.index.IndexReader$CacheKey@b62bec8=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@30fd04c3}
    at __randomizedtesting.SeedInfo.seed([374793047E6CA339:C9896ED516320174]:0)
    at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:191)
    at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:145)
    at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:130)
    at org.elasticsearch.indices.IndicesService.doClose(IndicesService.java:296)
    at org.elasticsearch.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:103)
    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61)
    at org.elasticsearch.node.NodeService.close(NodeService.java:135)
    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:85)
    at org.elasticsearch.node.Node.close(Node.java:847)
    at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:990)
    at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:145)
    at org.elasticsearch.test.InternalTestCluster.close(InternalTestCluster.java:818)
    at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:586)
    at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2225)

Then the suite fails with a thread leak error.
thread-leak.txt

Does not reproduce:

./gradlew :x-pack:plugin:ml:unitTest \
  -Dtests.seed=374793047E6CA339 \
  -Dtests.class=org.elasticsearch.license.MachineLearningLicensingTests \
  -Dtests.method="testMachineLearningStopDatafeedActionNotRestricted" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-SD \
  -Dtests.timezone=BET \
  -Dcompiler.java=11 \
  -Druntime.java=8

./gradlew :x-pack:plugin:ml:unitTest \
  -Dtests.seed=B805872D5C817758 \
  -Dtests.class=org.elasticsearch.license.MachineLearningLicensingTests \
  -Dtests.method="testAutoCloseJobWithDatafeed" \
  -Dtests.security.manager=true \
  -Dtests.locale=zh-Hans-HK \
  -Dtests.timezone=Etc/GMT-5 \
  -Dcompiler.java=11 \
  -Druntime.java=11

./gradlew :x-pack:plugin:ml:unitTest \
  -Dtests.seed=CB48E860FD3AC07E \
  -Dtests.class=org.elasticsearch.license.MachineLearningLicensingTests \
  -Dtests.method="testAutoCloseJobWithDatafeed" \
  -Dtests.security.manager=true \
  -Dtests.locale=de-GR \
  -Dtests.timezone=America/Rankin_Inlet \
  -Dcompiler.java=11 \
  -Druntime.java=8
:Search/Search :ml >test-failure

All 26 comments

Pinging @elastic/ml-core

The assertion that's failing is https://github.com/elastic/elasticsearch/blob/0000e7c8b524556d7a1bbec94e6c0114fc225bab/server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java#L191

Can anyone from @elastic/es-search offer any advice on what might cause this? Is it because ML is doing some operation on the index while it's being closed?

One more observation: many of the failures of this type seem to have occurred on the Zulu JVM.

However, I found one that isn't: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=opensuse/165/consoleText

So it may be that there's something about the Zulu JVM that makes these failures more likely to occur, but it's not the root cause of the problem.

There was a failure in BasicDistributedJobsIT that looks like it has the same root cause

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/172/console

java.lang.AssertionError: 2
    at __randomizedtesting.SeedInfo.seed([3EDB4FBFCECABCD4:D16E134409CDA259]:0)
    at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:189)
    at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:145)
    at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:130)
    at org.elasticsearch.indices.IndicesService.doClose(IndicesService.java:297)
    at org.elasticsearch.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:105)
    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61)
    at org.elasticsearch.node.NodeService.close(NodeService.java:135)
    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:85)
    at org.elasticsearch.node.Node.close(Node.java:946)
    at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1008)
    at org.elasticsearch.test.InternalTestCluster.stopNodesAndClients(InternalTestCluster.java:1651)
    at org.elasticsearch.test.InternalTestCluster.stopNodesAndClient(InternalTestCluster.java:1637)
    at org.elasticsearch.test.InternalTestCluster.stopRandomDataNode(InternalTestCluster.java:1531)
    at org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT.testFailOverBasics_withDataFeeder(BasicDistributedJobsIT.java:121)

In this test a node is shut down, tripping the assertion. The assertion failure doesn't kill the node but leaves it half running. Later there is this error message from the shut-down node:

  1> [2019-01-10T22:27:02,402][WARN ][o.e.t.TcpTransport       ] [node_t1] send message failed [channel: MockChannel{profile='default', isOpen=false, localAddress=/127.0.0.1:39548, isServerSocket=false}]
  1> java.net.SocketException: Socket is closed
  1>    at java.net.Socket.getOutputStream(Socket.java:943) ~[?:1.8.0_192]
  1>    at org.elasticsearch.transport.MockTcpTransport$MockChannel.sendMessage(MockTcpTransport.java:437) [framework-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.internalSendMessage(TcpTransport.java:758) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:844) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:815) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:64) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1>    at org.elasticsearch.action.support.HandledTransportAction$ChannelActionListener.onResponse(HandledTransportAction.java:103) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1>    at org.elasticsearch.action.support.HandledTransportAction$ChannelActionListener.onResponse(HandledTransportAction.java:87) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1>    at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:360) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1>    at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:356) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1>    at org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1116) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]

The working theory is that some request hits the node during shutdown. I've added some debug logging around the likely candidates in 92b519b9c8ccb18eda40de9eb4effb9104cffde6.

@jpountz, could you please take a look at the assertion failures in IndicesQueryCache.close thrown up by this test? Any insight you have into the cause would be much appreciated.

java.lang.AssertionError: {org.apache.lucene.index.IndexReader$CacheKey@b62bec8=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@30fd04c3}
    at __randomizedtesting.SeedInfo.seed([374793047E6CA339:C9896ED516320174]:0)
    at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:191)

and

java.lang.AssertionError: 2
    at __randomizedtesting.SeedInfo.seed([3EDB4FBFCECABCD4:D16E134409CDA259]:0)
    at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:189)

Talking to @droberts195, he thinks these failures from today look related. It's a different test though:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=amazon/223/console

 ./gradlew :x-pack:plugin:ml:internalClusterTest \
  -Dtests.seed=2D587A9A8B6E70D8 \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT \
  -Dtests.method="testFullClusterRestart" \
  -Dtests.security.manager=true \
  -Dtests.locale=es-PE \
  -Dtests.timezone=America/St_Thomas \
  -Dcompiler.java=11 \
  -Druntime.java=8

The logs contain similar failures:

13:53:28 ERROR   12.1s J1 | MlDistributedFailureIT.testFullClusterRestart <<< FAILURES!
13:53:28    > Throwable #1: java.lang.AssertionError: {org.apache.lucene.index.IndexReader$CacheKey@714f684a=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@3d9dc3df, org.apache.lucene.index.IndexReader$CacheKey@1875360d=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@46a79038}
13:53:28    >   at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:191)
13:53:28    >   at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:145)
13:53:28    >   at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:130)
13:53:28    >   at org.elasticsearch.indices.IndicesService.doClose(IndicesService.java:284)
13:53:28    >   at org.elasticsearch.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:100)
13:53:28    >   at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
13:53:28    >   at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61)
13:53:28    >   at org.elasticsearch.node.NodeService.close(NodeService.java:135)
13:53:28    >   at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
13:53:28    >   at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:85)
13:53:28    >   at org.elasticsearch.node.Node.close(Node.java:857)
13:53:28    >   at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1039)
13:53:28    >   at org.elasticsearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:961)
13:53:28    >   at org.elasticsearch.test.InternalTestCluster.fullRestart(InternalTestCluster.java:1860)
13:53:28    >   at org.elasticsearch.test.InternalTestCluster.fullRestart(InternalTestCluster.java:1745)
13:53:28    >   at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$testFullClusterRestart$5(MlDistributedFailureIT.java:113)
13:53:28    >   at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.run(MlDistributedFailureIT.java:293)

I changed the title of the issue because the things that all these failures have in common are:

  1. assert stats2.isEmpty() : stats2; fails in IndicesQueryCache.close
  2. They're all using the internal cluster test framework

The stats2 variable and the assertion were added in 38f5cc236ac6d06556d25eab81f7b05854b326ca. The code then underwent a major refactor in 44c653f5a87337a02daf4737bcb34150df509dbe. The locking was changed but the assertion in close stayed the same. Could it be a problem with visibility of changes between threads that only manifests itself in the internal cluster tests?
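
For context, the shape of close() and the assertion in question is roughly the following (a paraphrased sketch, not the exact source; the linked commits have the real code):

    // Paraphrased sketch of IndicesQueryCache.close() -- see the linked source for the exact code.
    @Override
    public void close() {
        // earlier assertions check that per-shard bookkeeping has already been cleaned up
        // (the "AssertionError: 2" failure above comes from one of those)
        assert stats2.isEmpty() : stats2;   // the assertion at line 191 quoted in this issue
        cache.clear();                      // clear the underlying Lucene LRUQueryCache
    }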

I spent all day trying to reproduce this, but I was unable to.

The only thing I could think of that would cause it was related to what @droberts195 says about thread visibility, specifically maybe making this:

https://github.com/elastic/elasticsearch/blob/431c4fd55e4f317312bb616c20518deb67e17180/server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java#L209

be volatile, since that is what determines whether stats are removed from the stats2 map. But that is a shot in the dark, and I don't have a way to test it since I haven't been able to reproduce the failure.
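
To make that concrete, a minimal sketch of what the proposed change to the nested class might look like (illustrative only, not the actual patch):

    // Sketch of IndicesQueryCache.StatsAndCount with the suggested change applied.
    private static class StatsAndCount {
        volatile int count;   // previously a plain int; volatile so the closing thread sees updates
        final Stats stats;

        StatsAndCount(Stats stats) {
            this.stats = stats;
            this.count = 0;
        }
    }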

I'm curious if anyone else has any ideas.

Thanks for looking at this @dakrone. I agree that int count in StatsAndCount not being volatile is a likely cause of the problem.

I just had a look at the base class, LRUQueryCache in Lucene, and it has the following comment:

  // these variables are volatile so that we do not need to sync reads
  // but increments need to be performed under the lock

So I guess the base class either had the same problem at one point or else its original author realised the potential problem with visibility of changes.
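
In other words, the pattern the comment describes (and that the suggested fix would copy) is roughly this, as an illustrative sketch rather than Lucene's actual code:

    // Illustrative sketch: reads are lock-free because the field is volatile,
    // while every write happens under the cache's lock.
    class HitCounter {
        private final Object lock = new Object();
        private volatile long hitCount;

        long hitCount() {             // read without synchronization -- safe because the field is volatile
            return hitCount;
        }

        void onHit() {
            synchronized (lock) {     // increments are only ever performed under the lock
                hitCount++;
            }
        }
    }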

Out of all the thousands of CI builds that run, we only see this assertion fail about twice a week. None of the 3 tests that have failed because of it have been muted, so it seems like the best way forward is just to make that int count volatile and see if the failures go away.

One more thing though. Can somebody who is more familiar with this confirm that ElasticsearchLRUQueryCache.onDocIdSetCache and ElasticsearchLRUQueryCache.onDocIdSetEviction can never get called by two different threads simultaneously? I guess that's covered by the // but increments need to be performed under the lock in the base class, but if not then there's a worse synchronization problem.

Can somebody who is more familiar with this confirm that ElasticsearchLRUQueryCache.onDocIdSetCache and ElasticsearchLRUQueryCache.onDocIdSetEviction can never get called by two different threads simultaneously?

The super methods of both assert that the current thread holds the cache's lock, so I think the behaviour here is fine.
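
For illustration, the guard being relied on is of roughly this shape (a hypothetical sketch; the real Lucene callbacks use the cache's own lock object):

    // Hypothetical sketch: the base-class callback asserts that the caller already holds the lock,
    // so two threads can never run the cache/eviction callbacks for the same cache concurrently.
    protected void onDocIdSetCache(Object readerCoreKey, long ramBytesUsed) {
        assert Thread.holdsLock(this) : "callers must hold the cache's lock";
        // ... update counters, safe because we are under the lock ...
    }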

+1 to making StatsAndCount.count volatile

+1 to making StatsAndCount.count volatile

I opened #38714 to do this

@droberts195 thanks for opening the PR, I'm going to close this and assume your fix worked (so we don't forget to resolve this). We can always re-open it if we see more of this failure.

This failed again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/61/consoleText with the same assertion failure:

   > Throwable #1: java.lang.AssertionError: {org.apache.lucene.index.IndexReader$CacheKey@6cc2c355=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@6557d469}

Before the assertion failure on closing the query cache, there was this exception:

  1> [2019-02-12T03:07:14,129][ERROR][o.e.x.m.p.MlMemoryTracker] [node_t1] [full-cluster-restart-job] failed to calculate job established model memory requirement
  1> org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
  1>    at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:296) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:139) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:259) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:105) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$3(InitialSearchPhase.java:285) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:172) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
  1> Caused by: org.elasticsearch.transport.NodeNotConnectedException: [node_t2][127.0.0.1:36011] Node not connected
  1>    at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:156) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:557) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.SearchTransportService.getConnection(SearchTransportService.java:411) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.TransportSearchAction.lambda$buildConnectionLookup$6(TransportSearchAction.java:532) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.AbstractSearchAsyncAction.getConnection(AbstractSearchAsyncAction.java:301) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:54) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$4(InitialSearchPhase.java:259) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.InitialSearchPhase.performPhaseOnShard(InitialSearchPhase.java:294) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:114) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:50) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.InitialSearchPhase$2.onFailure(InitialSearchPhase.java:273) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:441) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$6.doRun(TransportService.java:660) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    ... 5 more

In other words, a search failed on one node as the cluster was being shut down, because the other nodes in the cluster had already been shut down and those other nodes held all the shards for the index being searched.

The reason this failure started being seen in 6.6 is probably that the MlMemoryTracker is the first component ever that could be running a search at the exact moment an internal test cluster is shut down.

In terms of stats2 in IndicesQueryCache, this probably means the cache can get into an inconsistent state if a search fails because all the nodes holding the shards it wanted to search were shut down at a particular point during the search.

I think this assertion error might happen if there are ongoing search requests, because our use of ref-counting only effectively closes index readers and stores once all in-flight requests have finished. Closing the caches while there are ongoing requests feels wrong, so I opened #38958, which tries to only close caches after in-flight requests are done processing as well.
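
To illustrate the ref-counting behaviour being described, here is a minimal sketch with hypothetical names (the real implementation lives in AbstractRefCounted and the holder in IndicesService, visible in a later stack trace in this thread):

    import java.util.concurrent.atomic.AtomicInteger;

    // Minimal sketch: the last decRef() is what actually closes readers, stores and caches,
    // so closing only happens once every in-flight request has released its reference.
    abstract class RefCountedResource {
        private final AtomicInteger refCount = new AtomicInteger(1);   // 1 = the "owner" reference

        void incRef() {                          // taken by each in-flight search request
            refCount.incrementAndGet();
        }

        void decRef() {                          // released when the request finishes (or fails)
            if (refCount.decrementAndGet() == 0) {
                closeInternal();                 // only now are caches and readers really closed
            }
        }

        protected abstract void closeInternal();
    }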

Another failure:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+internalClusterTest/1412

==> Test Info: seed=407BCF06300EDC8C; jvms=8; suites=12
--
Tests with failures:
- org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testFullClusterRestart
- org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT (suite)
Slow Tests Summary:
86.36s | org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT
66.01s | org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT
49.17s | org.elasticsearch.xpack.ml.integration.TooManyJobsIT
29.09s | org.elasticsearch.xpack.ml.integration.EstablishedMemUsageIT
23.96s | org.elasticsearch.xpack.ml.integration.NetworkDisruptionIT
04:46:43   1>   at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
04:46:43   1>   at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:393) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
04:46:43   1>   at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:65) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
04:46:43   1>   at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:393) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
04:46:43   1>   at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:382) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
04:46:43   1>   ... 37 more
04:46:43   1> [2019-02-18T05:45:49,765][INFO ][o.e.x.m.i.MlDistributedFailureIT] [testFullClusterRestart] [MlDistributedFailureIT#testFullClusterRestart]: cleaning up after test
04:46:43   1> [2019-02-18T05:45:49,766][INFO ][o.e.n.Node               ] [testFullClusterRestart] stopping ...
04:46:43   1> [2019-02-18T05:45:49,785][INFO ][o.e.n.Node               ] [testFullClusterRestart] stopped
04:46:43   1> [2019-02-18T05:45:49,785][INFO ][o.e.n.Node               ] [testFullClusterRestart] closing ...
04:46:43   1> [2019-02-18T05:45:49,788][INFO ][o.e.n.Node               ] [testFullClusterRestart] closed
04:46:43   1> [2019-02-18T05:45:49,789][INFO ][o.e.n.Node               ] [testFullClusterRestart] stopping ...
04:46:43   1> [2019-02-18T05:45:49,804][INFO ][o.e.n.Node               ] [testFullClusterRestart] stopped
04:46:43   1> [2019-02-18T05:45:49,804][INFO ][o.e.n.Node               ] [testFullClusterRestart] closing ...
04:46:43   1> [2019-02-18T05:45:49,809][INFO ][o.e.n.Node               ] [testFullClusterRestart] closed
04:46:43   1> [2019-02-18T05:45:49,809][INFO ][o.e.x.m.i.MlDistributedFailureIT] [testFullClusterRestart] after test
04:46:43 ERROR   2.19s J1 | MlDistributedFailureIT.testFullClusterRestart <<< FAILURES!
04:46:43    > Throwable #1: java.lang.AssertionError: {org.apache.lucene.index.IndexReader$CacheKey@431efb34=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@94f55c5}
04:46:43    >   at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:191)
04:46:43    >   at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:145)
04:46:43    >   at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:130)
04:46:43    >   at org.elasticsearch.indices.IndicesService.doClose(IndicesService.java:284)
04:46:43    >   at org.elasticsearch.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:100)
04:46:43    >   at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
04:46:43    >   at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61)
04:46:43    >   at org.elasticsearch.node.NodeService.close(NodeService.java:135)
04:46:43    >   at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
04:46:43    >   at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:85)
04:46:43    >   at org.elasticsearch.node.Node.close(Node.java:857)
04:46:43    >   at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1044)
04:46:43    >   at org.elasticsearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:966)
04:46:43    >   at org.elasticsearch.test.InternalTestCluster.fullRestart(InternalTestCluster.java:1888)
04:46:43    >   at org.elasticsearch.test.InternalTestCluster.fullRestart(InternalTestCluster.java:1773)
04:46:43    >   at 

Since there's an approved PR that might fix this, I'm not going to mute the test, but if we can't get that fix in soon we'll need to.

Agreed, this should have been muted earlier.

Thanks @hendrikmuhs, I'll take a look.

So it looks like there was an issue with ongoing search requests, but it's not what we have seen in the above test failures (or at least not the only problem we have seen), given that the stack trace suggests all indices were already closed when the node got closed. I've opened another pull request (#39099) which fixes some minor issues that I don't expect to be the root cause of what we are seeing here, and which should also make debugging easier. Please don't mute the test yet; I'd like to collect some more information.

Since the MlMemoryTracker is the thing that's causing search requests to be started while the node is closing, it's probably best if I make MlLifeCycleService.stop() tell the MlMemoryTracker not to start any more searches and to wait for in-progress searches to complete (even though "completion" will sometimes mean failing because the other nodes in the cluster have been shut down).

Since Node.stop() is called at the very beginning of Node.close(), this should fix it.
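
A rough sketch of the intended hand-off, with hypothetical method and field names (not the actual implementation):

    // Hypothetical sketch: stop() flips a flag so no new searches start, then waits for
    // searches that are already in flight to complete (which may mean failing).
    final class MemoryTrackerShutdownSketch {
        private final Object lock = new Object();
        private boolean stopped = false;
        private int inFlightSearches = 0;

        boolean tryStartSearch() {
            synchronized (lock) {
                if (stopped) {
                    return false;                // refuse new searches once the node is stopping
                }
                inFlightSearches++;
                return true;
            }
        }

        void finishSearch() {
            synchronized (lock) {
                inFlightSearches--;
                lock.notifyAll();
            }
        }

        void stop() throws InterruptedException {    // called from MlLifeCycleService.stop()
            synchronized (lock) {
                stopped = true;
                while (inFlightSearches > 0) {
                    lock.wait();                      // wait for in-flight searches to finish
                }
            }
        }
    }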

I'll work on this today.

It looks like lots of components get closed before plugins in Node.close(), so trying to make all of them robust to outstanding searches is going to be a long-winded exercise.

Sorry for the disruption to your schedule @jpountz - I should have realised what was going on much earlier.

No worries at all @droberts195. I feel like we have an actual bug that the current behavior of MlLifeCycleService just makes more likely to surface. +1 to stop firing new requests in stop() though.

This happened again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+internalClusterTest/1179/consoleText

This included my change to stop the ML memory tracker as part of Node.stop before components get closed.

The exception has moved on from previous occurrences - now it says no master node:

  1> [2019-02-24T15:07:04,458][INFO ][o.e.x.m.i.MlDistributedFailureIT] [testFullClusterRestart] after test
ERROR   11.4s J1 | MlDistributedFailureIT.testFullClusterRestart <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: {org.apache.lucene.index.IndexReader$CacheKey@533b697c=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@31412048}
   >    at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:191)
   >    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
   >    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61)
   >    at org.elasticsearch.indices.IndicesService$2.closeInternal(IndicesService.java:265)
   >    at org.elasticsearch.common.util.concurrent.AbstractRefCounted.decRef(AbstractRefCounted.java:63)
   >    at org.elasticsearch.indices.IndicesService.doClose(IndicesService.java:309)
   >    at org.elasticsearch.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:100)
   >    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
   >    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61)
   >    at org.elasticsearch.node.NodeService.close(NodeService.java:135)
   >    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
   >    at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:85)
   >    at org.elasticsearch.node.Node.close(Node.java:857)
   >    at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1051)
   >    at org.elasticsearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:972)
   >    at org.elasticsearch.test.InternalTestCluster.fullRestart(InternalTestCluster.java:1892)
   >    at org.elasticsearch.test.InternalTestCluster.fullRestart(InternalTestCluster.java:1765)
   >    at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$testFullClusterRestart$5(MlDistributedFailureIT.java:113)
   >    at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.run(MlDistributedFailureIT.java:286)
   >    at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testFullClusterRestart(MlDistributedFailureIT.java:111)
   >    at java.lang.Thread.run(Thread.java:748)
   > Throwable #2: java.lang.RuntimeException: Had to resort to force-stopping datafeed, something went wrong?
   >    at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:294)
   >    at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.cleanupWorkaround(BaseMlIntegTestCase.java:207)
   >    at java.lang.Thread.run(Thread.java:748)
   > Caused by: java.util.concurrent.ExecutionException: MasterNotDiscoveredException[no known master node]
   >    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
   >    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:253)
   >    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
   >    at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:282)
   >    ... 36 more
   > Caused by: MasterNotDiscoveredException[no known master node]
   >    at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.doExecute(TransportStopDatafeedAction.java:108)
   >    at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.doExecute(TransportStopDatafeedAction.java:43)
   >    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:145)
   >    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:121)
   >    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:64)
   >    at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83)
   >    at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72)
   >    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:393)
   >    at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:65)
   >    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:393)
   >    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:382)
   >    at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:281)
   >    ... 36 more
   > Throwable #3: java.lang.RuntimeException: already closed
   >    at org.elasticsearch.test.InternalTestCluster$NodeAndClient.getOrBuildNodeClient(InternalTestCluster.java:925)
   >    at org.elasticsearch.test.InternalTestCluster$NodeAndClient.client(InternalTestCluster.java:904)
   >    at org.elasticsearch.test.InternalTestCluster$2.next(InternalTestCluster.java:2338)
   >    at org.elasticsearch.test.InternalTestCluster$2.next(InternalTestCluster.java:2329)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureClusterStateConsistency(ESIntegTestCase.java:1134)
   >    at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:572)
   >    at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2191)
   >    at java.lang.Thread.run(Thread.java:748)

The log confirms that the ML memory tracker is stopped before this happens:

  1> [2019-02-24T15:07:17,363][DEBUG][o.e.x.m.p.MlMemoryTracker] [testJobRelocationIsMemoryAware] ML memory tracker stopped

Since the fixes of #39099 went in 21 days ago, the build stats link only shows one failure on 7.x and above. (It has failed a few more times on 6.6 and 6.7, but those branches don't have that fix.)

That one failure was in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java8fips,nodes=immutable&&linux&&docker/82/, which was on a FIPS JVM; a number of other tests related to splitting indices also failed in the same build, so there may be some interaction there.

So it looks like the combination of #39111 and #39099 has fixed the problems that led to this issue being raised. #39111 is in 6.6.2 and above; #39099 is in 7.1.0 and above.
