Elasticsearch: RollupDocumentationIT.testStartRollupJob fails

Created on 6 Nov 2018 · 10 comments · Source: elastic/elasticsearch

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian/42/console

12:57:00 ERROR   0.36s | RollupDocumentationIT.testStartRollupJob <<< FAILURES!
12:57:00    > Throwable #1: org.elasticsearch.client.ResponseException: method [DELETE], host [http://[::1]:52535], URI [/_xpack/rollup/job/job_1], status line [HTTP/1.1 500 Internal Server Error]
12:57:00    > {"task_failures":[{"task_id":8905,"node_id":"FhzPJ_ibQoS6vX9AbmETzA","status":"INTERNAL_SERVER_ERROR","reason":{"type":"illegal_state_exception","reason":"Could not delete job [job_1] because indexer state is [STOPPING].  Job must be [STOPPED] before deletion."}}],"acknowledged":false}
12:57:00    >   at __randomizedtesting.SeedInfo.seed([8485FBE4F308CC81:EB88BE3030DD61E3]:0)
12:57:00    >   at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:687)
12:57:00    >   at org.elasticsearch.client.RestClient.performRequest(RestClient.java:218)
12:57:00    >   at org.elasticsearch.client.documentation.RollupDocumentationIT.deleteRollupJobs(RollupDocumentationIT.java:437)
12:57:00    >   at org.elasticsearch.client.documentation.RollupDocumentationIT.wipeRollup(RollupDocumentationIT.java:417)
12:57:00    >   at java.lang.Thread.run(Thread.java:748)

Reproduce line (though does not repro locally):

REPRODUCE WITH: ./gradlew :client:rest-high-level:integTestRunner -Dtests.seed=8485FBE4F308CC81 -Dtests.class=org.elasticsearch.client.documentation.RollupDocumentationIT -Dtests.method="testStartRollupJob" -Dtests.security.manager=true -Dtests.locale=ru-RU -Dtests.timezone=Africa/Malabo -Dcompiler.java=11 -Druntime.java=8
Labels: :AnalyticRollup, >test-failure

All 10 comments

Pinging @elastic/es-search-aggs

@polyfractal this one looks like a timing issue: in some cases we delete the rollup jobs before they are stopped, while they are still stopping. Sounds like we should adapt the test?

Yeah, sounds like it. This behavior is caused by the recent #34574, but there is an awaitBusy that is supposed to stop this from happening (or fail).

I'll investigate today, thanks for the ping.
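For context, awaitBusy is a polling helper that repeatedly checks a condition until it becomes true or a timeout elapses. Below is a minimal, self-contained sketch of that pattern (this is illustrative only, not the actual ESTestCase.awaitBusy implementation; the class name here is hypothetical). Note that it returns a boolean, which is why callers should assert on the result rather than discard it:

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of an awaitBusy-style polling helper.
public final class AwaitBusy {

    /**
     * Polls {@code condition} with a short, growing sleep until it returns
     * true or {@code timeoutMillis} elapses. Returns whether the condition
     * was observed true; callers should assert on this result.
     */
    public static boolean awaitBusy(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long sleep = 1; // start with a tiny sleep and back off
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(sleep);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            sleep = Math.min(sleep * 2, 100); // cap the backoff
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }
}
```

If the return value is ignored, a condition that never becomes true (e.g. a job stuck in STOPPING) silently lets the test proceed, which is exactly the failure mode discussed here.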

Another reproduction https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos/44/console
Build stats point to failures on 6.x also and 10-12 failures per week.
Will mute the tests.

I found the issue. RollupDocumentationIT has some old/obsolete cleanup code. It's no longer needed because the xpack cleanup code was moved into ESRestTestCase, but this class was missed during the refactor.

Removing the cleanup and using the superclass fixes the issue. Ran it for a few hundred iterations and it looks good. Unrelated, I also noticed we weren't checking the return value of an awaitBusy() call, so I added an assertion on it.

Will push a fix and unmute these. Sorry for the noise all!

This resurfaced again on 6.x. Seems to be a timeout issue.
Does NOT reproduce locally on 6.x for me.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/762/console

REPRODUCE WITH: ./gradlew :client:rest-high-level:integTestRunner \
  -Dtests.seed=F86D6F8A86509455 \
  -Dtests.class=org.elasticsearch.client.documentation.RollupDocumentationIT \
  -Dtests.method="testStartRollupJob" \
  -Dtests.security.manager=true \
  -Dtests.locale=tr \
  -Dtests.timezone=Antarctica/South_Pole \
  -Dcompiler.java=11 \
  -Druntime.java=8

REPRODUCE WITH: ./gradlew :client:rest-high-level:integTestRunner \
  -Dtests.seed=F86D6F8A86509455 \
  -Dtests.class=org.elasticsearch.client.documentation.RollupDocumentationIT \
  -Dtests.method="testDeleteRollupJob" \
  -Dtests.security.manager=true \
  -Dtests.locale=tr \
  -Dtests.timezone=Antarctica/South_Pole \
  -Dcompiler.java=11 \
  -Druntime.java=8

REPRODUCE WITH: ./gradlew :client:rest-high-level:integTestRunner \
  -Dtests.seed=F86D6F8A86509455 \
  -Dtests.class=org.elasticsearch.client.documentation.RollupDocumentationIT \
  -Dtests.method="testGetRollupCaps" \
  -Dtests.security.manager=true \
  -Dtests.locale=tr \
  -Dtests.timezone=Antarctica/South_Pole \
  -Dcompiler.java=11 \
  -Druntime.java=8

Relevant stacktrace:

17:46:20     > {"error":{"root_cause":[{"type":"timeout_exception","reason":"Timed out after [10s] while waiting for rollup job [job_1] to stop"}],"type":"timeout_exception","reason":"Timed out after [10s] while waiting for rollup job [job_1] to stop"},"status":500}
17:46:20     >  at __randomizedtesting.SeedInfo.seed([F86D6F8A86509455:97602A5E45853937]:0)
17:46:20     >  at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:929)
17:46:20     >  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229)
17:46:20     >  at org.elasticsearch.test.rest.ESRestTestCase.wipeRollupJobs(ESRestTestCase.java:570)
17:46:20     >  at org.elasticsearch.test.rest.ESRestTestCase.wipeCluster(ESRestTestCase.java:482)
17:46:20     >  at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:263)
17:46:20     >  at java.lang.Thread.run(Thread.java:748)
17:46:20     > Caused by: org.elasticsearch.client.ResponseException: method [POST], host [http://[::1]:38195], URI [/_xpack/rollup/job/job_1/_stop?wait_for_completion=true&timeout=10s], status line [HTTP/1.1 500 Internal Server Error]
17:46:20     > {"error":{"root_cause":[{"type":"timeout_exception","reason":"Timed out after [10s] while waiting for rollup job [job_1] to stop"}],"type":"timeout_exception","reason":"Timed out after [10s] while waiting for rollup job [job_1] to stop"},"status":500}
17:46:20     >  at org.elasticsearch.client.RestClient$1.completed(RestClient.java:548)
17:46:20     >  at org.elasticsearch.client.RestClient$1.completed(RestClient.java:533)
17:46:20     >  at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
17:46:20     >  at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
17:46:20     >  at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
17:46:20     >  at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
17:46:20     >  at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
17:46:20     >  at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
17:46:20     >  at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
17:46:20     >  at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
17:46:20     >  at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
17:46:20     >  at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
17:46:20     >  at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
17:46:20     >  at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
17:46:20     >  at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
17:46:20     >  at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)

Pushed a commit to master and 6.x which contains more logging info: de52ba1f78edced3060804df7301c305b384009c

There's also a fix to the test itself, to ensure that the job is stopped completely before continuing (with wait_for_completion), but I don't think this is the actual issue since the timeout is happening during the post-test cleanup.
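The stop endpoint involved is visible in the stack trace above (`/_xpack/rollup/job/job_1/_stop?wait_for_completion=true&timeout=10s`). A minimal sketch of how that request path is assembled, for reference; the helper name is hypothetical, while the path and parameter names are taken directly from the trace:

```java
// Hypothetical helper building the stop-with-wait request path seen in the
// stack trace. wait_for_completion makes the stop call block until the job
// reaches STOPPED (or the timeout elapses with a 500, as happened here).
public final class RollupEndpoints {
    public static String stopJobEndpoint(String jobId, boolean waitForCompletion, String timeout) {
        return "/_xpack/rollup/job/" + jobId + "/_stop"
            + "?wait_for_completion=" + waitForCompletion
            + "&timeout=" + timeout;
    }
}
```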

I did manage to get this to fail a few times locally but it took thousands of iterations. Will keep my machine churning on it and see if there is more logging/debug info that I can add. Current theory is a listener or something is being dropped so that once it moves to STOPPING under certain conditions, it never finishes moving to STOPPED.

Just a note: this has failed a few more times (including on a few unrelated PR builds of mine), and I've noticed it's _only_ failing on the documentation tests. Which is interesting since similar operations are done in both regular yaml tests and java integration tests.

So there must be something either about the doc tests causing it directly (bad test cleanup?), or something about the doc test environment makes this bug more likely.

Still investigating :(

Another note for myself: I think I've found the sequence of events that leads to the issue, still digging into why.

  1. Job created, test run etc
  2. In cleanup, all indices are deleted
  3. Then all rollup jobs are stopped and deleted
  4. For whatever reason, a job has a bulk in flight (despite the test being over), so it acknowledges the stop but stays in STOPPING because it is waiting for the indexer to finish its async request
  5. Stop request times out (10s) and fails the test because the job could not be cleaned up, which cascades to fail a few more tests that run right after
  6. Bulk finally times out with an error because the index was deleted
  7. Job logs the bulk error, notices the stop request and finally shuts itself down (well after the test has finished)

Need to determine why the indexer is allowed to continue running even after a test "passes" (everything should be done and quiescent at the end of the test), and perhaps alter the order of cleanup events (remove all jobs before deleting all indices).
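The sequence above can be sketched as a tiny state machine. This is illustrative only, not Rollup's actual indexer code (the class and method names are hypothetical): a stop request on a job with a bulk in flight only moves it to STOPPING, and it reaches STOPPED only once the async bulk completes or fails, which is why deleting the index first can leave the job stuck in STOPPING until the bulk times out:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the indexer lifecycle described in the comment above.
public final class IndexerSim {
    public enum State { STARTED, INDEXING, STOPPING, STOPPED }

    private final AtomicReference<State> state = new AtomicReference<>(State.STARTED);

    /** An async bulk begins: STARTED -> INDEXING. */
    public void startBulk() {
        state.compareAndSet(State.STARTED, State.INDEXING);
    }

    /** Stop request: immediate if idle, otherwise acknowledged but deferred. */
    public State stop() {
        // An idle indexer stops immediately.
        state.compareAndSet(State.STARTED, State.STOPPED);
        // A bulk is in flight: acknowledge the stop but stay in STOPPING
        // until the async request finishes.
        state.compareAndSet(State.INDEXING, State.STOPPING);
        return state.get();
    }

    /** The async bulk finishes (success or failure): honour a pending stop. */
    public State onBulkFinished() {
        if (!state.compareAndSet(State.STOPPING, State.STOPPED)) {
            // No stop pending; return to idle.
            state.compareAndSet(State.INDEXING, State.STARTED);
        }
        return state.get();
    }
}
```

Deleting the jobs before the indices (as proposed above) sidesteps the problem: the bulk can still complete normally, so the STOPPING -> STOPPED transition fires promptly instead of waiting for a bulk failure against a deleted index.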

Note: the fix was merged into master, but backports are still pending. I'll leave a note here once all backports are done so we can track if new instances of the error pop up after the fix is in.
