Elasticsearch: [CI] RemoteClusterClientTests#testConnectAndExecuteRequest fails

Created on 2 May 2019  ·  17 comments  ·  Source: elastic/elasticsearch

On master: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-9/369/console

Couldn't reproduce with:

./gradlew :server:test --tests "org.elasticsearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest" \
  -Dtests.seed=5BFCD458F15F6DA4 \
  -Dtests.security.manager=true \
  -Dtests.locale=tr \
  -Dtests.timezone=America/Atka \
  -Dcompiler.java=12 \
  -Druntime.java=11
java.lang.AssertionError
    at __randomizedtesting.SeedInfo.seed([5BFCD458F15F6DA4:248A7562ED2FB133]:0)
    at org.junit.Assert.fail(Assert.java:86)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.junit.Assert.assertTrue(Assert.java:52)
    at org.elasticsearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest(RemoteClusterClientTests.java:60)
Labels: :Search/Search  >test-failure

All 17 comments

Pinging @elastic/es-search

The assertion that's failing checks that a remote node is connected through the remote cluster service (which comes from a mock in this case, since this is an ESTestCase). The log shows test timeouts (a sketch of the failing check follows the trace below):

 1> [2019-05-02T02:40:58,730][WARN ][o.e.t.RemoteClusterConnection] [org.elasticsearch.transport.RemoteClusterClientTests] fetching nodes from external cluster [test] failed
  1> org.elasticsearch.common.util.CancellableThreads$ExecutionCancelledException: operation was cancelled reason [connect handler is closed]
  1>    at org.elasticsearch.common.util.CancellableThreads.checkForCancel(CancellableThreads.java:65) ~[main/:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:130) ~[main/:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
  1>    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:687) [main/:?]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1>    at java.lang.Thread.run(Thread.java:834) [?:?]
  1>    Suppressed: java.lang.IllegalStateException: Future got interrupted
  1>        at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:60) ~[main/:?]
  1>        at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34) ~[main/:?]
  1>        at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:209) ~[main/:?]
  1>        at org.elasticsearch.transport.ConnectionManager.openConnection(ConnectionManager.java:80) ~[main/:?]
  1>        at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:452) ~[main/:?]
  1>        at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[main/:8.0.0-SNAPSHOT]
  1>        at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
  1>        at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
  1>        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1>        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
  1>        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
  1>        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:687) [main/:?]
  1>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1>        at java.lang.Thread.run(Thread.java:834) [?:?]
  1>    Caused by: java.lang.InterruptedException
  1>        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1040) ~[?:?]
  1>        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345) ~[?:?]
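For reference, the check that trips at RemoteClusterClientTests.java:60 is essentially the following. This is a simplified sketch, not the verbatim test code; the setting key, remoteNode and threadPool are illustrative and assumed to come from the surrounding test setup.

// Simplified sketch of the failing check (illustrative, not the verbatim test).
// The test registers a "test" remote cluster pointing at a local mock node and
// then expects the remote node to be connected through the RemoteClusterService.
Settings settings = Settings.builder()
    .put("cluster.remote.test.seeds", remoteNode.getAddress().toString())
    .build();
try (MockTransportService service =
         MockTransportService.createNewService(settings, Version.CURRENT, threadPool, null)) {
    service.start();
    service.acceptIncomingRequests();
    RemoteClusterService remoteClusterService = service.getRemoteClusterService();
    // This is the assertTrue that fails: the remote node never shows up as
    // connected before the test checks for it.
    assertTrue(remoteClusterService.isRemoteNodeConnected("test", remoteNode));
}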

I can trigger this kind of error by adding a wait of more than 30s to the IOInterruptible function passed in RemoteClusterConnection#collectRemoteNodes(), but I wonder why such a long wait could happen in practice, especially since this isn't an integration test.
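To illustrate, the reproduction hack mentioned above would look roughly like the following. This is a hypothetical change inside RemoteClusterConnection.ConnectHandler#collectRemoteNodes, not an actual patch, and the exact surrounding code is assumed.

// Hypothetical reproduction hack, not an actual patch: delay the IOInterruptible
// passed to CancellableThreads#executeIO beyond the 30s remote-cluster connect
// timeout. Once the ConnectHandler is closed, the blocked thread is interrupted,
// which reproduces the "Future got interrupted" trace above.
cancellableThreads.executeIO(() -> {
    try {
        Thread.sleep(TimeValue.timeValueSeconds(31).millis()); // exceed the 30s timeout
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    }
    // ... the original openConnection/handshake logic would follow here ...
});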

There has been another instance of this on 5.4 (https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-9/372/console), which also shows the same kind of timeout.

This part of the stack trace shows that the ConnectionManager hangs somewhere while opening the internal connection. The future that gets interrupted is a listener passed to Transport#openConnection(), which in this case is probably the MockNioTransport (a sketch of this blocking pattern follows after the snippet below).

Suppressed: java.lang.IllegalStateException: Future got interrupted
  1>        at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:60) ~[main/:?]
  1>        at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34) ~[main/:?]
  1>        at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:209) ~[main/:?]

The same problem as https://github.com/elastic/elasticsearch/issues/41745#issuecomment-490167271 occurred in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/1360/, although the failing test was RemoteClusterConnectionTests.testNodeDisconnected rather than RemoteClusterClientTests.testConnectAndExecuteRequest.

14:29:21   1> [2019-05-08T08:29:21,244][WARN ][o.e.t.RemoteClusterConnection] [org.elasticsearch.transport.RemoteClusterConnectionTests] fetching nodes from external cluster [test-cluster] failed
14:29:21   1> org.elasticsearch.common.util.CancellableThreads$ExecutionCancelledException: operation was cancelled reason [connect handler is closed]
14:29:21   1>   at org.elasticsearch.common.util.CancellableThreads.checkForCancel(CancellableThreads.java:65) ~[main/:7.2.0-SNAPSHOT]
14:29:21   1>   at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:130) ~[main/:7.2.0-SNAPSHOT]
14:29:21   1>   at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
14:29:21   1>   at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
14:29:21   1>   at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
14:29:21   1>   at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
14:29:21   1>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
14:29:21   1>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
14:29:21   1>   at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
14:29:21   1>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
14:29:21   1>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
14:29:21   1>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
14:29:21   1>   Suppressed: java.lang.IllegalStateException: Future got interrupted
14:29:21   1>       at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:60) ~[main/:?]
14:29:21   1>       at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34) ~[main/:?]
14:29:21   1>       at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:209) ~[main/:?]
14:29:21   1>       at org.elasticsearch.transport.ConnectionManager.openConnection(ConnectionManager.java:80) ~[main/:?]
14:29:21   1>       at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:452) ~[main/:?]
14:29:21   1>       at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[main/:7.2.0-SNAPSHOT]
14:29:21   1>       at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
14:29:21   1>       at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
14:29:21   1>       at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
14:29:21   1>       at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
14:29:21   1>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
14:29:21   1>       at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
14:29:21   1>       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
14:29:21   1>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
14:29:21   1>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
14:29:21   1>       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
14:29:21   1>   Caused by: java.lang.InterruptedException
14:29:21   1>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_212]
14:29:21   1>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_212]
14:29:21   1>       at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252) ~[main/:?]
14:29:21   1>       at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87) ~[main/:?]
14:29:21   1>       at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:57) ~[main/:?]
14:29:21   1>       ... 15 more

Talked to @original-brownbear, who has been investigating similar-looking problems elsewhere lately; he suggested using e.g. YourKit to identify potential locks and to determine how much time threads spend waiting on them as a next step.

I opened https://github.com/elastic/elasticsearch/pull/42000 to get us more useful logging for this situation.

This appears to have failed again on master on May 4th (https://scans.gradle.com/s/ftmdnpa26cdty#tests) and on the 7.2 branch on May 26th (https://scans.gradle.com/s/vcnmxcdmlqsdu#tests). At least the latter might have the improved logging from https://github.com/elastic/elasticsearch/pull/42572, although I'm not entirely sure that has been backported to the 7.2 branch.

Looks like the logging from #42572 wasn't backported to 7.2, unfortunately. I'll take a look to see whether the logs reveal anything interesting anyway.

The test logs from the latest failure on 7.2:

1> [2019-05-25T21:12:38,915][INFO ][o.e.t.RemoteClusterClientTests] [testEnsureWeReconnect] before test
--
1> [2019-05-25T21:12:39,114][INFO ][o.e.t.TransportService   ] [testEnsureWeReconnect] publish_address {127.0.0.1:10304}, bound_addresses {[::1]:10303}, {127.0.0.1:10304}
1> [2019-05-25T21:12:39,145][INFO ][o.e.t.TransportService   ] [testEnsureWeReconnect] publish_address {127.0.0.1:10307}, bound_addresses {[::1]:10307}, {127.0.0.1:10307}
1> [2019-05-25T21:12:39,698][INFO ][o.e.t.RemoteClusterClientTests] [testEnsureWeReconnect] after test
1> [2019-05-25T21:12:39,711][INFO ][o.e.t.RemoteClusterClientTests] [testConnectAndExecuteRequest] before test
1> [2019-05-25T21:12:39,734][INFO ][o.e.t.TransportService   ] [testConnectAndExecuteRequest] publish_address {127.0.0.1:10304}, bound_addresses {[::1]:10303}, {127.0.0.1:10304}
1> [2019-05-25T21:12:39,749][INFO ][o.e.t.TransportService   ] [testConnectAndExecuteRequest] publish_address {127.0.0.1:10303}, bound_addresses {[::1]:10307}, {127.0.0.1:10303}
1> [2019-05-25T21:13:09,752][WARN ][o.e.t.RemoteClusterService] [testConnectAndExecuteRequest] failed to connect to remote clusters within 30s
1> [2019-05-25T21:13:09,755][WARN ][o.e.t.n.TestEventHandler ] [[elasticsearch[transport_worker][T#1]]] Slow execution on network thread [29984 milliseconds]
1> java.lang.RuntimeException: Slow exception on network thread
1>  at org.elasticsearch.transport.nio.TestEventHandler.maybeLogElapsedTime(TestEventHandler.java:212) ~[framework-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1>  at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:123) ~[framework-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1>  at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:397) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1>  at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1>  at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1>  at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
1> [2019-05-25T21:13:09,851][WARN ][o.e.t.RemoteClusterConnection] [org.elasticsearch.transport.RemoteClusterClientTests] fetching nodes from external cluster [test] failed
1> org.elasticsearch.common.util.CancellableThreads$ExecutionCancelledException: operation was cancelled reason [connect handler is closed]
1>  at org.elasticsearch.common.util.CancellableThreads.checkForCancel(CancellableThreads.java:65) ~[main/:7.2.0-SNAPSHOT]
1>  at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:130) ~[main/:7.2.0-SNAPSHOT]
1>  at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
1>  at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
1>  at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
1>  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
1>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
1>  Suppressed: org.elasticsearch.transport.ConnectTransportException: [remote_node][127.0.0.1:10304] general node connection failure
1>      at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:127) ~[main/:?]
1>      at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:467) ~[main/:?]
1>      at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[main/:7.2.0-SNAPSHOT]
1>      at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
1>      at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
1>      at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
1>      at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1>      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
1>      at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
1>      at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
1>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
1>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
1>      at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
1>  Caused by: java.lang.IllegalStateException: handshake failed with {remote_node}{nt4h9qp_QbysVWlbcfs1Kg}{jeF8L4uhSvaOyCiOz8IwOw}{127.0.0.1}{127.0.0.1:10304}
1>      at org.elasticsearch.transport.TransportService.handshake(TransportService.java:421) ~[main/:?]
1>      at org.elasticsearch.transport.TransportService.lambda$connectionValidator$4(TransportService.java:352) ~[main/:?]
1>      at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:105) ~[main/:?]
1>      ... 12 more
1>  Caused by: java.lang.IllegalStateException: Future got interrupted
1>      at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:47) ~[main/:?]
1>      at org.elasticsearch.transport.TransportService.handshake(TransportService.java:419) ~[main/:?]
1>      at org.elasticsearch.transport.TransportService.lambda$connectionValidator$4(TransportService.java:352) ~[main/:?]
1>      at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:105) ~[main/:?]
1>      ... 12 more
1>  Caused by: java.lang.InterruptedException
1>      at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_212]
1>      at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_212]
1>      at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252) ~[main/:?]
1>      at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87) ~[main/:?]
1>      at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:44) ~[main/:?]
1>      at org.elasticsearch.transport.TransportService.handshake(TransportService.java:419) ~[main/:?]
1>      at org.elasticsearch.transport.TransportService.lambda$connectionValidator$4(TransportService.java:352) ~[main/:?]
1>      at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:105) ~[main/:?]
1>      ... 12 more
1> [2019-05-25T21:13:09,864][INFO ][o.e.t.RemoteClusterClientTests] [testConnectAndExecuteRequest] after test


@cbuescher I'm on PTO starting effectively right now, so I might not get to this, but feel free to backport this to 7.2 to get a better shot at catching the deadlock here. If you backport the logging improvements, please make sure to backport exactly https://github.com/elastic/elasticsearch/pull/42572 and not the underlying commit to master, since the backport contains a bug fix to this logging that is required to make it properly functional.

Backported #42572 to the 7.2 branch with 44f2db8344ea13f1 to get better logging there as well.

Just checked again, and we got another one with potentially better logging on master on 24 June 2019:
https://scans.gradle.com/s/b4qsaeqsirvbw/tests/nfxodyych4uas-6q6452dcdcklg

@cbuescher looks like a missed call to TransportService#acceptIncomingRequests somewhere; we're blocked on the latch that gets unblocked by a call to that method.
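For context, the latch in question works roughly like this. This is a simplified sketch with illustrative field and method names, not the actual TransportService code.

// Simplified sketch of the suspected blocking in TransportService
// (illustrative names, not the actual fields/methods).
private final CountDownLatch blockIncomingRequestsLatch = new CountDownLatch(1);

public void acceptIncomingRequests() {
    // Tests must call this after start(); otherwise anything that waits for
    // incoming requests to be allowed (e.g. the handshake of an inbound
    // connection) blocks on the latch forever.
    blockIncomingRequestsLatch.countDown();
}

void awaitIncomingRequestsAllowed() throws InterruptedException {
    blockIncomingRequestsLatch.await();
}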

Didn't mean to close this, only added more logging. I will revisit this in a while to see whether we get more of these failures.

@original-brownbear do you think https://github.com/elastic/elasticsearch/pull/44622 fixed this or should we keep it open for investigation for a bit longer?

Let's wait for https://github.com/elastic/elasticsearch/pull/44939 which should def. fix this (or at least have it fail very differently :D).
