Elasticsearch: RemoteClusterConnectionTests.testFetchShardsSkipUnavailable

Created on 13 Feb 2018 · 11 Comments · Source: elastic/elasticsearch

The test failed only on 6.x and 6.1 (6 times in the last 2 months),
on both Java 8 and 9, on Windows/Unix/Darwin.

REPRODUCE WITH: ./gradlew :server:test \
  -Dtests.seed=45F368B316A63C14 \
  -Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests \
  -Dtests.method="testFetchShardsSkipUnavailable" \
  -Dtests.security.manager=true \
  -Dtests.locale=hi-IN \
  -Dtests.timezone=Europe/Bucharest

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=centos/670/console

This assertion fails in the test:
at org.elasticsearch.transport.RemoteClusterConnectionTests.testFetchShardsSkipUnavailable(RemoteClusterConnectionTests.java:483)
assertTrue(responseLatch.await(1, TimeUnit.SECONDS));

:Search/Search >test-failure

All 11 comments

Is not reproducible locally for me.
I wonder if we can increase the timeout of the CountDownLatch's await to something like 10 seconds:

assertTrue(responseLatch.await(10, TimeUnit.SECONDS));
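For anyone following along, here is a minimal standalone sketch of why the timeout value matters. It is not the code from RemoteClusterConnectionTests; the class name `LatchTimeoutSketch`, the helper `awaitResponse`, and the delays are all made up for illustration. The only real API used is `CountDownLatch.await(long, TimeUnit)`, which returns `false` when the timeout elapses before the count reaches zero, and that `false` is what makes `assertTrue` fail.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchTimeoutSketch {

    // Hypothetical helper (not from the Elasticsearch test): a background thread
    // "responds" after delayMillis, and the caller waits up to timeoutMillis,
    // mirroring the shape of responseLatch.await(...) in the failing assertion.
    static boolean awaitResponse(long delayMillis, long timeoutMillis) {
        CountDownLatch responseLatch = new CountDownLatch(1);
        Thread responder = new Thread(() -> {
            try {
                Thread.sleep(delayMillis); // stands in for a slow response on a loaded CI worker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            responseLatch.countDown();
        });
        responder.setDaemon(true); // do not keep the JVM alive for a late responder
        responder.start();
        try {
            // true if countDown() happened in time, false if the timeout elapsed first
            return responseLatch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Response arrives well inside the timeout.
        System.out.println("fast response: " + awaitResponse(10, 5000));
        // Response arrives long after the timeout has elapsed.
        System.out.println("slow response: " + awaitResponse(5000, 50));
    }
}
```

A longer timeout only widens the window in which a slow response still counts as success. It does not change what the test checks, so it costs nothing when the response is fast, but it also cannot help if the response never arrives.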

@mayya-sharipova I've pushed commits to 6.x and master to make it time out after five seconds instead of one. Hopefully it stops showing up.

This failure occurred today on 6.2:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.2+multijob-windows-compatibility/101/consoleText

I will backport the fix @talevy made to the 6.2 branch too.

This test is no longer failing.

We just had another failure on master in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu&&virtual/27/console (build id: 20180927232217-8529077B). It failed with:

23:36:06    > Throwable #1: java.lang.AssertionError
23:36:06    >   at __randomizedtesting.SeedInfo.seed([C0637FEA9860D706:B4D7E2FF6FC7A22F]:0)
23:36:06    >   at org.elasticsearch.transport.RemoteClusterConnectionTests.testFetchShardsSkipUnavailable(RemoteClusterConnectionTests.java:747)
23:36:06    >   at java.lang.Thread.run(Thread.java:748)

Reproduction line (which does not reproduce locally):

./gradlew :server:test \
  -Dtests.seed=C0637FEA9860D706 \
  -Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests \
  -Dtests.method="testFetchShardsSkipUnavailable" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-QA \
  -Dtests.timezone=Asia/Ho_Chi_Minh \
  -Dcompiler.java=10 \
  -Druntime.java=8

The assertion that failed indicates that we failed to reconnect within the timeout after the node was finally back up again:

https://github.com/elastic/elasticsearch/blob/9129948f60eec362d4e62deab675c1bbe034b5c4/server/src/test/java/org/elasticsearch/transport/RemoteClusterConnectionTests.java#L747

I also noted that this test failed once more with a similar failure back on September 10 on the 6.x branch (build id: 20180910140737-9BA28F90), although in that case an earlier assertion fired:

https://github.com/elastic/elasticsearch/blob/730fb466abbb650108979e86cd89888862a3615b/server/src/test/java/org/elasticsearch/transport/RemoteClusterConnectionTests.java#L669

This is immediately after we've started the node in the test.

The reproduction line back then was:

./gradlew :server:test \
  -Dtests.seed=B8937E9B0015D381 \
  -Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests \
  -Dtests.method="testFetchShardsSkipUnavailable" \
  -Dtests.security.manager=true \
  -Dtests.locale=uk-UA \
  -Dtests.timezone=America/Argentina/Tucuman \
  -Dcompiler.java=10 \
  -Druntime.java=8

(which also does not reproduce locally)

@talevy can you please have a look at this one?

Pinging @elastic/es-search-aggs

I have increased the timeout to 10 seconds on master, 6.x and 6.5.

Just failed again:

  1> [2019-01-18T11:23:09,016][INFO ][o.e.t.RemoteClusterConnectionTests] [testFetchShardsSkipUnavailable] after test
FAILURE 19.2s J4 | RemoteClusterConnectionTests.testFetchShardsSkipUnavailable <<< FAILURES!
   > Throwable #1: java.lang.AssertionError
   >    at __randomizedtesting.SeedInfo.seed([67359315E28BE0E4:13810E00152C95CD]:0)
   >    at org.elasticsearch.transport.RemoteClusterConnectionTests.testFetchShardsSkipUnavailable(RemoteClusterConnectionTests.java:693)
   >    at java.lang.Thread.run(Thread.java:748)
  1> [2019-01-18T11:23:09,028][INFO ][o.e.t.RemoteClusterConnectionTests] [testEnsureConnected] before test

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/5145/consoleText

I would be very skeptical of tests that fail right now related to timeouts given the infrastructure problems that we are having. If this continues to fail we can look into it but for now I would assume anything timeout related is due to our general infrastructure problems.

It failed 5 times in the last 3 days. Maybe we can increase this timeout (and possibly all the others) further to make the test more resilient to infrastructure problems?

We have been having infrastructure problems since Monday. I do not think we should increase the timeout any further, as tests failing like this because of timeouts when infrastructure problems occur are a big signal that we are having infrastructure problems. Our internal infra team is aware of the infrastructure problems and is taking them seriously.
