Elasticsearch: RestClient: allow to customize retry logic.

Created on 6 Jul 2017 · 11Comments · Source: elastic/elasticsearch

Hi,

I was wondering if you were open to add customization to the retry logic which right now only check the status. I just had trouble with a 504 on a reindex and generally would want to activate the retry only on GET queries.

A function could be added to the FailureListener, such as :

public boolean shouldRetry(HttpHost host, HttpRequestBase request, HttpResponse httpResponse)

The logic of marking the host as dead when we do not retry might need to change, another function could be added to decide (we may not want to retry but still mark the node as dead).

:CorFeatureJava Low Level REST Client >enhancement CorFeatures help wanted

Source

Timshel

👍1

Most helpful comment

Sounds like we do need to allow users to configure which error codes to retry on and leave the defaults as they are, like the other clients do. Marking adoptme.

javanna on 18 Aug 2017

👍3

All 11 comments

Hi @Timshel could you elaborate on the usecase for the proposed change. What were the problems around the reindex api and why would you like to only retry GET requests?

javanna on 6 Jul 2017

My analysis of the problem is not complete but I believe I had the following scenario :

Send a reindex request for an index which took 5min.
After one minute the reindex request to the new cluster timeout (configuration problem on the load balancer).
The reindex is still in progress but not all elements are loaded
The client retry to another node and a new reindex process is started
The index is still not completed but deleted documents start appearing.
The second reindex request timeout.
The client retry again and send a new reindex request to the next node(I had 3).
The first reindex request finish and total document count is as expected.
The number of deleted documents continue to grow as the other reindex are in progress.
From the client point of view the task is not finished and there is no errors.
The ES logs contain gc overhead and throttling

In the end :

The problem was not from the RestClient or ES
I was unable to access the errors triggering the retry, the FailureListener only expose HttpHost.
(Each time I interrupted the task before the max_retry_timeout )
Even if the client received a 504 the task was still in progress and completed with success.
The retry introduced load and were unnecessary in this case.

In the last point with a 504 error the task might still be running so in general I would be conservative and retry only request which do not change the cluster state.
For a 502 or 503 I believe the request would not have been started so always retrying to a different node would seem logical.

Timshel on 6 Jul 2017

Did this 504 come because there was some kind of proxy between you and Elasticsearch? In general I think this is a good candidate for ?wait_for_completion=false on reindex.

Regarding the retries, I think it makes sense to be able to customize what you retry, yes. Especially in the presence of interesting components like firewalls and proxies.

For all this to work, though, you have to be able to detect that the response came from the proxy and not Elasticsearch.

nik9000 on 7 Jul 2017

I marked this for discussion. My current thinking on this is that we should verify when and if Elasticsearch can ever return 504 itself. 504 from a proxy should definitely not be retried. I hope that instead of making this pluggable, we can make this work the right way by default in the client (and also in all the other language clients). Thoughts @elastic/es-clients ?

javanna on 7 Jul 2017

Yes the 504 was generated by an nginx instance.
For the 504 handling I would not try to differentiate the source of the error.

My change suggestion was to allow to change this logic without changing the default handling.
But just removing the 504 from retry might be a simpler solution because if users want retry then the better solution is probably to configure timeout at the request level. With this supposition only edge cases may be left for which a retry is not pertinent.

Timshel on 7 Jul 2017

I agree with Nik, I think that some kind of customization for retries could be useful. Today the low-level client retries on 502, 503 and 504. I also see 500 as a potential candidate.

tlrx on 18 Aug 2017

as far as I remember 500 is not retried by the clients because Elasticsearch itself returns it in some cases. This is something that we should eventually fix in Elasticsearch so that 500 can be retried too. This is one of the reasons why I would prefer to better define what to retry on rather than allowing to customize it because we don't do the right thing by default.

javanna on 18 Aug 2017

We discussed this many times with clients, that's how we arrived at the trio of 502, 503, and 504.

500 wasn't selected because there is no reasonable expectation for things to work after retry - 500 literally means a generic error since otherwise it would have more precise number. In such case there shouldn't be a retry and the decision should be left for the user (either by specifying retry_on_status to include 500 or by catching the error and retrying manually) keeping up with our policy of doing the safe thing by default and allow users to customize in special circumstances.

HonzaKral on 18 Aug 2017

Sounds like we do need to allow users to configure which error codes to retry on and leave the defaults as they are, like the other clients do. Marking adoptme.

javanna on 18 Aug 2017

👍3

@HonzaKral I agree and understand your point. I didn't mean to add 500 to the default list but rather add a way for users to easily configure the error codes to be retried.

tlrx on 18 Aug 2017

👍1

+1 on opening this up, I think the current set has worked well OOTB for long enough to not change the default but there are plenty cases where the user is smarter than us :)