Kibana: Retry search requests on network errors

Created on 1 Oct 2020 · 8 Comments · Source: elastic/kibana

Loading a dashboard can put considerable strain on the whole infrastructure along the way (browser, OS, wifi, router, VPN, server, proxy, ...) because of the high number of concurrent requests and the amount of data processing happening at once.

In real-world scenarios it's common for random request failures to pop up when working with large dashboards. We can't handle all of these cases, but we can improve the current state by applying a conservative retry policy for requests.

If a request fails quickly without a status code, we could retry the request once at the lowest level of the search service and only propagate the error down the stack if the retry fails as well.
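To make the proposal concrete, here is a minimal sketch of such a policy. It is not existing Kibana code: `searchWithSingleRetry`, the `RequestError` shape, and the one-second `FAST_FAILURE_MS` threshold are all assumptions made for illustration, since the issue does not define what "quickly" means.

```typescript
// Hypothetical error shape: `statusCode` is undefined when no HTTP
// response arrived at all (e.g. a dropped connection).
interface RequestError extends Error {
  statusCode?: number;
}

// Assumed threshold for "fails quickly"; the issue does not give a value.
const FAST_FAILURE_MS = 1000;

// Runs `doSearch` and retries it exactly once, but only when the first
// attempt failed fast and without a status code.
async function searchWithSingleRetry<T>(
  doSearch: () => Promise<T>,
  now: () => number = Date.now
): Promise<T> {
  const startedAt = now();
  try {
    return await doSearch();
  } catch (e) {
    const error = e as RequestError;
    const failedFast = now() - startedAt < FAST_FAILURE_MS;
    const noStatusCode = error.statusCode === undefined;
    if (!(failedFast && noStatusCode)) {
      throw error;
    }
    // Single retry; if this attempt fails too, its error reaches the caller.
    return doSearch();
  }
}
```

The clock is injectable only so the timing rule can be exercised in tests. Errors that carry a status code, or that surface after the threshold, are rethrown unchanged.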

For reference, this is currently an issue in our functional test suite: https://github.com/elastic/kibana/issues/43963

I'm aware this is a potentially controversial feature. That's why I tagged this issue as discussion: to decide whether we even want to do this and, if we do, to figure out the details.

Search AppServices discuss

All 8 comments

Pinging @elastic/kibana-app-arch (Team:AppArch)

@elastic/kibana-platform I know the core ES client already has some built-in retry logic, but AFAICT it is just internally used by saved objects, correct?

it is just internally used by saved objects, correct?

In the case of saved objects (SO), we retry the request manually:
https://github.com/elastic/kibana/blob/2a82ff9566423a16e1f976c9d0d2db91acf006a9/src/core/server/elasticsearch/client/retry_call_cluster.ts

The new ES client provides retry functionality out of the box, configured with maxRetries:

    // @elastic/elasticsearch/lib/Transport.d.ts
    interface TransportOptions {
      maxRetries: number;
      //...
    }

Example in the platform code:
https://github.com/elastic/kibana/blob/57d10144f9d9d661257d9eb86dad78b3bffab7cc/src/core/server/saved_objects/migrations/core/migration_es_client.ts#L83

Thanks for the pointer @restrry! A quick glance at the ES client suggests that retries are only performed if the response is a 429:

    let response = null
    for (let i = 0; i <= maxRetries; i++) {
      response = await this[kClient].search(params, options)
      if (response.statusCode !== 429) break
      await sleep(wait)
    }
    if (response.statusCode === 429) {
      throw new ResponseError(response)
    }

Assuming that's the case, we'd need to implement our own retry logic.

FWIW I am supportive of this idea in general, but might be good to make it opt-in to start, with some guardrails to prevent folks from abusing it. We'd also need to clearly define the circumstances under which we would want to perform a retry.

That's interesting. I'd expect that we do not retry when there are too many requests, as per https://github.com/elastic/elasticsearch-js/blob/a064f0f357ea5797cb8a784671b85a6b0c88626d/lib/Transport.js#L254-L259

@delvedor

Ah, interesting, I missed that usage in Transport.js ... that's what I get for relying on GitHub search instead of just grepping the code 🙂 The example I found seems to be specifically from scroll search.

In that case @flash1293 what do you think about this retry logic? https://github.com/elastic/elasticsearch-js/blob/a064f0f357ea5797cb8a784671b85a6b0c88626d/lib/Transport.js#L254-L259

Basically: 502/503/504 responses are retried up to maxRetries times, assuming there isn't a 429. It sounds like your other requirement would be a retry on no status code?

Hello! The JS client retries on 502/503/504 HTTP errors, connection faults, and connection timeouts.
The client will not retry on 429 for normal API calls. Some helpers do retry on 429 as well, with a delay.
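As a rough illustration of the policy described above for normal API calls, the decision could be sketched like this. The `Failure` type and the `isRetryable` and `shouldRetry` helpers are hypothetical names for this example, not the client's actual error classes or internals.

```typescript
// Hypothetical description of a failed request; these are illustrative
// types, not the error classes exported by the JS client.
type Failure =
  | { kind: 'response'; statusCode: number }
  | { kind: 'connectionFault' }
  | { kind: 'timeout' };

// Mirrors the stated policy for normal API calls.
function isRetryable(failure: Failure): boolean {
  switch (failure.kind) {
    case 'connectionFault':
    case 'timeout':
      return true;
    case 'response':
      // 502/503/504 are retried; 429 and every other status is not.
      return [502, 503, 504].indexOf(failure.statusCode) !== -1;
  }
}

// A retry happens only while the retry budget (maxRetries) is not used up.
function shouldRetry(
  failure: Failure,
  retriesSoFar: number,
  maxRetries: number
): boolean {
  return retriesSoFar < maxRetries && isRetryable(failure);
}
```

The 429-with-delay behaviour of the helpers is deliberately left out of this sketch.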

I think it's generally uncontroversial to retry search requests that don't get any status code back, which would solve the problem we're experiencing in https://github.com/elastic/kibana/issues/43963. We could look to expand the scenarios from there if we want.
