The test has failed only on 6.x and 6.1 (6 times in the last 2 months),
on both Java 8 and 9, across windows/unix/darwin.
REPRODUCE WITH: ./gradlew :server:test \
-Dtests.seed=45F368B316A63C14 \
-Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests \
-Dtests.method="testFetchShardsSkipUnavailable" \
-Dtests.security.manager=true \
-Dtests.locale=hi-IN \
-Dtests.timezone=Europe/Bucharest
This assertion fails in the test:
at org.elasticsearch.transport.RemoteClusterConnectionTests.testFetchShardsSkipUnavailable(RemoteClusterConnectionTests.java:483)
assertTrue(responseLatch.await(1, TimeUnit.SECONDS));
It does not reproduce locally for me.
I wonder if we can increase the timeout of the CountDownLatch's await to something like 10 seconds:
assertTrue(responseLatch.await(10, TimeUnit.SECONDS));
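For context, `CountDownLatch.await(timeout, unit)` returns `false` when the timeout elapses before the count reaches zero, which is what makes the `assertTrue` fail on a slow CI machine. Below is a minimal standalone sketch of that behavior. It is not the Elasticsearch test code: the class `LatchTimeoutSketch`, the method `awaitResponse`, and the delay values are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a "response" that arrives after delayMillis,
// awaited with a latch that gives up after timeoutMillis.
public class LatchTimeoutSketch {

    // Returns true if the latch was counted down before the timeout elapsed.
    static boolean awaitResponse(long delayMillis, long timeoutMillis) throws InterruptedException {
        CountDownLatch responseLatch = new CountDownLatch(1);
        Thread responder = new Thread(() -> {
            try {
                // Simulates a slow machine delivering the response late.
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            responseLatch.countDown();
        });
        responder.setDaemon(true);
        responder.start();
        // await returns false on timeout; it does not throw.
        return responseLatch.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // The same delayed response, awaited with a short and a generous timeout.
        System.out.println("short timeout: " + awaitResponse(500, 50));
        System.out.println("long timeout:  " + awaitResponse(500, 5000));
    }
}
```

With these values the short timeout should normally give up before the response arrives, while the longer one should succeed. A larger timeout costs nothing when the response is fast, since `await` returns as soon as the latch reaches zero.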
@mayya-sharipova I've pushed commits to 6.x and master to make it time out after five seconds instead of one. Hopefully it stops showing up.
This failure occurred today on 6.2:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.2+multijob-windows-compatibility/101/consoleText
I will backport the fix @talevy made to the 6.2 branch too.
This test is no longer failing.
We just had another failure on master in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu&&virtual/27/console (build id: 20180927232217-8529077B). It failed with:
23:36:06 > Throwable #1: java.lang.AssertionError
23:36:06 > at __randomizedtesting.SeedInfo.seed([C0637FEA9860D706:B4D7E2FF6FC7A22F]:0)
23:36:06 > at org.elasticsearch.transport.RemoteClusterConnectionTests.testFetchShardsSkipUnavailable(RemoteClusterConnectionTests.java:747)
23:36:06 > at java.lang.Thread.run(Thread.java:748)
Reproduction line (which does not reproduce locally):
./gradlew :server:test \
-Dtests.seed=C0637FEA9860D706 \
-Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests \
-Dtests.method="testFetchShardsSkipUnavailable" \
-Dtests.security.manager=true \
-Dtests.locale=ar-QA \
-Dtests.timezone=Asia/Ho_Chi_Minh \
-Dcompiler.java=10 \
-Druntime.java=8
The assertion that failed indicates that we failed to reconnect within the timeout after the node is finally back up again:
I also noted that this test failed once more with a similar failure back on September 10 on the 6.x branch (build id: 20180910140737-9BA28F90) although there an earlier assertion fired:
This is immediately after we've started the node in the test.
The reproduction line back then was:
./gradlew :server:test \
-Dtests.seed=B8937E9B0015D381 \
-Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests \
-Dtests.method="testFetchShardsSkipUnavailable" \
-Dtests.security.manager=true \
-Dtests.locale=uk-UA \
-Dtests.timezone=America/Argentina/Tucuman \
-Dcompiler.java=10 \
-Druntime.java=8
(which also does not reproduce locally)
@talevy can you please have a look at this one?
Pinging @elastic/es-search-aggs
I have increased the timeout to 10 seconds on master, 6.x and 6.5.
Just failed again:
1> [2019-01-18T11:23:09,016][INFO ][o.e.t.RemoteClusterConnectionTests] [testFetchShardsSkipUnavailable] after test
FAILURE 19.2s J4 | RemoteClusterConnectionTests.testFetchShardsSkipUnavailable <<< FAILURES!
> Throwable #1: java.lang.AssertionError
> at __randomizedtesting.SeedInfo.seed([67359315E28BE0E4:13810E00152C95CD]:0)
> at org.elasticsearch.transport.RemoteClusterConnectionTests.testFetchShardsSkipUnavailable(RemoteClusterConnectionTests.java:693)
> at java.lang.Thread.run(Thread.java:748)
1> [2019-01-18T11:23:09,028][INFO ][o.e.t.RemoteClusterConnectionTests] [testEnsureConnected] before test
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/5145/consoleText
I would be very skeptical of tests that fail right now related to timeouts given the infrastructure problems that we are having. If this continues to fail we can look into it but for now I would assume anything timeout related is due to our general infrastructure problems.
It failed 5 times in the last 3 days. Maybe we can increase this timeout (and possibly all the others) further to make the test more resilient to infrastructure problems?
We have been having infrastructure problems since Monday. I do not think we should increase the timeout any further, as tests failing like this because of timeouts when infrastructure problems occur are a big signal that we are having infrastructure problems. Our internal infra team is aware of the infrastructure problems and is taking them seriously.