Hi,
I was wondering if you were open to add customization to the retry logic which right now only check the status. I just had trouble with a 504 on a reindex and generally would want to activate the retry only on GET queries.
A function could be added to the FailureListener, such as :
public boolean shouldRetry(HttpHost host, HttpRequestBase request, HttpResponse httpResponse)
The logic of marking the host as dead when we do not retry might need to change, another function could be added to decide (we may not want to retry but still mark the node as dead).
Hi @Timshel could you elaborate on the usecase for the proposed change. What were the problems around the reindex api and why would you like to only retry GET requests?
My analysis of the problem is not complete but I believe I had the following scenario :
reindex request for an index which took 5min.reindex request to the new cluster timeout (configuration problem on the load balancer).reindex is still in progress but not all elements are loadedreindex process is startedreindex request timeout.reindex request to the next node(I had 3).reindex request finish and total document count is as expected.reindex are in progress.In the end :
FailureListener only expose HttpHost.504 the task was still in progress and completed with success.In the last point with a 504 error the task might still be running so in general I would be conservative and retry only request which do not change the cluster state.
For a 502 or 503 I believe the request would not have been started so always retrying to a different node would seem logical.
Did this 504 come because there was some kind of proxy between you and Elasticsearch? In general I think this is a good candidate for ?wait_for_completion=false on reindex.
Regarding the retries, I think it makes sense to be able to customize what you retry, yes. Especially in the presence of interesting components like firewalls and proxies.
For all this to work, though, you have to be able to detect that the response came from the proxy and not Elasticsearch.
I marked this for discussion. My current thinking on this is that we should verify when and if Elasticsearch can ever return 504 itself. 504 from a proxy should definitely not be retried. I hope that instead of making this pluggable, we can make this work the right way by default in the client (and also in all the other language clients). Thoughts @elastic/es-clients ?
Yes the 504 was generated by an nginx instance.
For the 504 handling I would not try to differentiate the source of the error.
My change suggestion was to allow to change this logic without changing the default handling.
But just removing the 504 from retry might be a simpler solution because if users want retry then the better solution is probably to configure timeout at the request level. With this supposition only edge cases may be left for which a retry is not pertinent.
I agree with Nik, I think that some kind of customization for retries could be useful. Today the low-level client retries on 502, 503 and 504. I also see 500 as a potential candidate.
as far as I remember 500 is not retried by the clients because Elasticsearch itself returns it in some cases. This is something that we should eventually fix in Elasticsearch so that 500 can be retried too. This is one of the reasons why I would prefer to better define what to retry on rather than allowing to customize it because we don't do the right thing by default.
We discussed this many times with clients, that's how we arrived at the trio of 502, 503, and 504.
500 wasn't selected because there is no reasonable expectation for things to work after retry - 500 literally means a generic error since otherwise it would have more precise number. In such case there shouldn't be a retry and the decision should be left for the user (either by specifying retry_on_status to include 500 or by catching the error and retrying manually) keeping up with our policy of doing the safe thing by default and allow users to customize in special circumstances.
Sounds like we do need to allow users to configure which error codes to retry on and leave the defaults as they are, like the other clients do. Marking adoptme.
@HonzaKral I agree and understand your point. I didn't mean to add 500 to the default list but rather add a way for users to easily configure the error codes to be retried.
+1 on opening this up, I think the current set has worked well OOTB for long enough to not change the default but there are plenty cases where the user is smarter than us :)
Most helpful comment
Sounds like we do need to allow users to configure which error codes to retry on and leave the defaults as they are, like the other clients do. Marking adoptme.