Elasticsearch: Server-side timeout returns 408 Request Timeout

Created on 15 Nov 2018  路  6Comments  路  Source: elastic/elasticsearch

In a few places we return RestStatus.REQUEST_TIMEOUT (i.e. 408 Request Timeout) when a request times out. However, RFC 7231 says:

   The 408 (Request Timeout) status code indicates that the server did
   not receive a complete request message within the time that it was
   prepared to wait.  A server SHOULD send the "close" connection option
   (Section 6.1 of [RFC7230]) in the response, since 408 implies that
   the server has decided to close the connection rather than continue
   waiting.  If the client has an outstanding request in transit, the
   client MAY repeat that request on a new connection.

I think this does not describe the usage in the following places, all of which seem to be server-side timeouts rather than timeouts waiting for another request from the client:

https://github.com/elastic/elasticsearch/blob/bc5b1afb55c3b9a75c8d3f69c7832765918b439c/modules/reindex/src/main/java/org/elasticsearch/index/reindex/BulkIndexByScrollResponseContentListener.java#L61-L63

https://github.com/elastic/elasticsearch/blob/bc5b1afb55c3b9a75c8d3f69c7832765918b439c/x-pack/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/notification/jira/JiraIssue.java#L140-L142

https://github.com/elastic/elasticsearch/blob/bc5b1afb55c3b9a75c8d3f69c7832765918b439c/x-pack/plugin/upgrade/src/main/java/org/elasticsearch/xpack/upgrade/rest/RestIndexUpgradeAction.java#L78-L80

https://github.com/elastic/elasticsearch/blob/bc5b1afb55c3b9a75c8d3f69c7832765918b439c/server/src/main/java/org/elasticsearch/action/admin/cluster/health/ClusterHealthResponse.java#L317-L319

https://github.com/elastic/elasticsearch/blob/bc5b1afb55c3b9a75c8d3f69c7832765918b439c/client/rest-high-level/src/main/java/org/elasticsearch/client/ClusterClient.java#L117-L120

https://github.com/elastic/elasticsearch/blob/bc5b1afb55c3b9a75c8d3f69c7832765918b439c/client/rest-high-level/src/main/java/org/elasticsearch/client/ClusterClient.java#L131-L134

:CorInfrREST API CorInfra team-discuss

Most helpful comment

We chatted about this again given @HonzaKral's feedback, and settled on just a generic 500 error.

We agreed with the rationale that this should be a 5xx level error, but didn't want to use a 504 because that implies Elasticsearch is acting as a proxy. ES should never be in the position to be a proxy/gateway, so we should probably never send a 504 anywhere. None of the other 5xx seemed to make sense in this situation, so that left us with a generic 500

All 6 comments

Pinging @elastic/es-core-infra

We discussed this and, although we agreed that 408 isn't the best choice we didn't agree on an alternative. The impact on clients will be significant. Perhaps we can keep these in the 4xx series, although the most appropriate codes seem to be a general 400 or else perhaps 409 Conflict?

@elastic/es-clients do you have any opinions on a way forwards here?

Since it is a server timeout I'd say that 504 would be better than 4xx response as that indicates a client error. We just need to be careful because 504 is a retryable error and the client will, by default, attempt to retry on a different node.

Either way it is important to have a documented and stable information in the response body indicating exactly what kind of error it is, then the client code doesn't have to rely solely on the status code which has limited options.

We chatted about this again given @HonzaKral's feedback, and settled on just a generic 500 error.

We agreed with the rationale that this should be a 5xx level error, but didn't want to use a 504 because that implies Elasticsearch is acting as a proxy. ES should never be in the position to be a proxy/gateway, so we should probably never send a 504 anywhere. None of the other 5xx seemed to make sense in this situation, so that left us with a generic 500

ES should never be in the position to be a proxy/gateway, so we should probably never send a 504 anywhere.

This may have been true in the past but with cross cluster search and remote clusters, I wonder if we should reconsider the 504 as the local node may simply act as a gateway to multiple remotes? Some other items worth discussion should include when we make such a change, will our API compatibility solution be able to support making this and avoid breaking clients?

I disagree that this should be a 5xx error. The client specified the timeout, and if the only bad thing that happened to the request was that Elasticsearch didn't get to it before it the timeout elapsed then that's kinda the client's problem. Note that I'm distinguishing this case from ones where the request was actually rejected (429 overload) or genuinely failed (5xx our bad)

IMO a 504 is appropriate where the timeout is defined by the gateway, but that doesn't apply to Elasticsearch.

Was this page helpful?
0 / 5 - 0 ratings