Elasticsearch: RemoteClusterConnectionTests.testFetchShardsSkipUnavailable

Created on 13 Feb 2018 · 11 Comments · Source: elastic/elasticsearch

The test has failed only on 6.x and 6.1 (6 times in the last 2 months),
on both Java 8 and 9, across Windows, Unix, and Darwin.

REPRODUCE WITH: ./gradlew :server:test \
  -Dtests.seed=45F368B316A63C14 \
  -Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests \
  -Dtests.method="testFetchShardsSkipUnavailable" \
  -Dtests.security.manager=true \
  -Dtests.locale=hi-IN \
  -Dtests.timezone=Europe/Bucharest

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=centos/670/console

This assertion fails in the test:
at org.elasticsearch.transport.RemoteClusterConnectionTests.testFetchShardsSkipUnavailable(RemoteClusterConnectionTests.java:483)
assertTrue(responseLatch.await(1, TimeUnit.SECONDS));

:Search/Search >test-failure

mayya-sharipova

All 11 comments

It is not reproducible locally for me.
I wonder if we can increase the timeout of the CountDownLatch's await to something like 10 seconds:

assertTrue(responseLatch.await(10, TimeUnit.SECONDS));
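For illustration only, here is a minimal sketch of the pattern being discussed. It is not the actual test code: the class `LatchTimeoutSketch` and the helper `awaitResponse` are hypothetical, and the asynchronous response is simulated with a sleeping thread. It shows that `CountDownLatch.await(timeout, unit)` returns `false` when the response arrives later than the timeout, which is what makes `assertTrue(...)` fail on a slow machine.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchTimeoutSketch {

    // Hypothetical helper: simulates an asynchronous response that arrives after
    // delayMillis and reports whether it arrived within timeoutMillis.
    static boolean awaitResponse(long delayMillis, long timeoutMillis) throws InterruptedException {
        CountDownLatch responseLatch = new CountDownLatch(1);
        Thread responder = new Thread(() -> {
            try {
                Thread.sleep(delayMillis); // stand-in for a slow remote call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            responseLatch.countDown(); // signal that the response has arrived
        });
        responder.setDaemon(true); // do not keep the JVM alive for a late responder
        responder.start();
        // await returns false if the timeout elapses before the count reaches zero
        return responseLatch.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // Response well inside the timeout
        System.out.println(awaitResponse(10, 1000));
        // Response slower than the timeout: await gives up and returns false
        System.out.println(awaitResponse(500, 50));
    }
}
```

A longer timeout only costs time when the response is genuinely late, since `await` returns as soon as the latch is counted down.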

mayya-sharipova on 13 Feb 2018

@mayya-sharipova I've pushed commits to 6.x and master to make it time out after five seconds instead of one. Hopefully it stops showing up.

talevy on 13 Feb 2018

👍1

This failure occurred today on 6.2:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.2+multijob-windows-compatibility/101/consoleText

I will backport the fix @talevy made to the 6.2 branch too.

martijnvg on 15 Feb 2018

👍2

This test is no longer failing.

jasontedor on 8 Mar 2018

We just had another failure on master in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu&&virtual/27/console (build id: 20180927232217-8529077B). It failed with:

23:36:06    > Throwable #1: java.lang.AssertionError
23:36:06    >   at __randomizedtesting.SeedInfo.seed([C0637FEA9860D706:B4D7E2FF6FC7A22F]:0)
23:36:06    >   at org.elasticsearch.transport.RemoteClusterConnectionTests.testFetchShardsSkipUnavailable(RemoteClusterConnectionTests.java:747)
23:36:06    >   at java.lang.Thread.run(Thread.java:748)

Reproduction line (which does not reproduce locally):

./gradlew :server:test \
  -Dtests.seed=C0637FEA9860D706 \
  -Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests \
  -Dtests.method="testFetchShardsSkipUnavailable" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-QA \
  -Dtests.timezone=Asia/Ho_Chi_Minh \
  -Dcompiler.java=10 \
  -Druntime.java=8

The assertion that failed indicates that we failed to reconnect within the timeout after the node was finally back up again:

https://github.com/elastic/elasticsearch/blob/9129948f60eec362d4e62deab675c1bbe034b5c4/server/src/test/java/org/elasticsearch/transport/RemoteClusterConnectionTests.java#L747

I also noted that this test failed once more with a similar failure back on September 10 on the 6.x branch (build id: 20180910140737-9BA28F90) although there an earlier assertion fired:

https://github.com/elastic/elasticsearch/blob/730fb466abbb650108979e86cd89888862a3615b/server/src/test/java/org/elasticsearch/transport/RemoteClusterConnectionTests.java#L669

This is immediately after we've started the node in the test.

The reproduction line back then was:

./gradlew :server:test \
  -Dtests.seed=B8937E9B0015D381 \
  -Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests \
  -Dtests.method="testFetchShardsSkipUnavailable" \
  -Dtests.security.manager=true \
  -Dtests.locale=uk-UA \
  -Dtests.timezone=America/Argentina/Tucuman \
  -Dcompiler.java=10 \
  -Druntime.java=8

(which also does not reproduce locally)

@talevy can you please have a look at this one?

danielmitterdorfer on 28 Sep 2018

Pinging @elastic/es-search-aggs

elasticmachine on 28 Sep 2018

I have increased the timeout to 10 seconds on master, 6.x and 6.5.

javanna on 6 Nov 2018

Just failed again:

  1> [2019-01-18T11:23:09,016][INFO ][o.e.t.RemoteClusterConnectionTests] [testFetchShardsSkipUnavailable] after test
FAILURE 19.2s J4 | RemoteClusterConnectionTests.testFetchShardsSkipUnavailable <<< FAILURES!
   > Throwable #1: java.lang.AssertionError
   >    at __randomizedtesting.SeedInfo.seed([67359315E28BE0E4:13810E00152C95CD]:0)
   >    at org.elasticsearch.transport.RemoteClusterConnectionTests.testFetchShardsSkipUnavailable(RemoteClusterConnectionTests.java:693)
   >    at java.lang.Thread.run(Thread.java:748)
  1> [2019-01-18T11:23:09,028][INFO ][o.e.t.RemoteClusterConnectionTests] [testEnsureConnected] before test

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/5145/consoleText

imotov on 18 Jan 2019

I would be very skeptical of tests that fail right now related to timeouts given the infrastructure problems that we are having. If this continues to fail we can look into it but for now I would assume anything timeout related is due to our general infrastructure problems.

jasontedor on 18 Jan 2019

It failed 5 times in the last 3 days. Maybe we can increase this timeout (and possibly all the others) further to make the test more resilient to infrastructure problems?

imotov on 18 Jan 2019

We have been having infrastructure problems since Monday. I do not think we should increase the timeout any further, as tests failing like this because of timeouts when infrastructure problems occur are a big signal that we are having infrastructure problems. Our internal infra team is aware of the infrastructure problems and is taking them seriously.

jasontedor on 18 Jan 2019
