When a request is sent to an upstream host just before the HTTP Keep-Alive timeout expires and the connection is closed by the upstream, Envoy returns an HTTP 503 response to the client. It is possible to overcome this by setting the retry_on policy to 5xx. However, in this particular case I do not want to retry on any HTTP 5xx response from the upstream; I would like Envoy to retry the request only if the connection was reset by the upstream. Is there currently any way to achieve this?
I have set up HAProxy as an upstream with timeout http-keep-alive set to 1000 (ms). (I have also been able to reproduce this using nginx with keepalive_timeout set to 1s.)
Then I ran the following loop to send a request via Envoy every ~1s:
while :; do curl -i http://192.168.1.2:8080/; echo; sleep 1; done
Most of the time curl returned a valid response, but from time to time it returned the following error:
HTTP/1.1 503 Service Unavailable
content-length: 57
content-type: text/plain
date: Fri, 27 Oct 2017 07:30:02 GMT
server: envoy
upstream connect error or disconnect/reset before headers
And the following line was reported to the access log:
[2017-10-27T08:31:47.894Z] "GET /geolocation HTTP/1.1" 503 UC 0 57 3 - "-" "curl/7.54.0" "79b6b41d-8bcd-4845-a0cd-99aef5d8459b" "" "192.168.1.3:80"
Below you can see output from tcpdump which shows that the connection was reset by the upstream just after the request was sent. The reason is that by the time the request reached the remote side, the Keep-Alive timeout had already expired.
14:06:24.640984 IP (tos 0x0, ttl 64, id 12205, offset 0, flags [DF], proto TCP (6), length 276)
192.168.1.2.34662 > 192.168.1.3.80: Flags [P.]
GET / HTTP/1.1
host: 192.168.1.2
user-agent: curl/7.54.0
14:06:24.645341 IP (tos 0x0, ttl 53, id 17359, offset 0, flags [DF], proto TCP (6), length 52)
192.168.1.3.80 > 192.168.1.2.34662: Flags [F.]
14:06:24.657501 IP (tos 0x0, ttl 53, id 45208, offset 0, flags [DF], proto TCP (6), length 40)
192.168.1.3.80 > 192.168.1.2.34662: Flags [R]
The issue can be worked around by disabling Keep-Alive (i.e. setting max_requests_per_connection to 1), but that is not something I want to do.
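For completeness, a minimal sketch of that workaround as a v1-style cluster definition (the cluster name and timeout are illustrative; the host matches the setup above):

"clusters": [
  {
    "name": "haproxy_upstream",
    "connect_timeout_ms": 250,
    "type": "static",
    "lb_type": "round_robin",
    "hosts": [{"url": "tcp://192.168.1.3:80"}],
    "max_requests_per_connection": 1
  }
]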
When I added the following snippet to the config file, the errors were gone:
"retry_policy": {
"retry_on": "5xx",
"num_retries": 2
}
However, I would like to retry only when the above scenario happens, not on every HTTP 5xx error returned by the upstream. Is there a way to achieve that?
I am attaching sanitized data from envoy_collect.py. The scenario outlined above happens at 2017-11-01T09:59:09.473Z.
Currently no, but it would be easy to add a new retry policy which only retries on upstream reset. (Right now we only have a "refused stream" reset policy).
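(For anyone finding this later: a reset condition was eventually added to retry_on. Assuming an Envoy version that supports it, the retry policy shown above would become something like the following, which retries only when the upstream resets the connection or never responds, rather than on every 5xx:)

"retry_policy": {
  "retry_on": "reset",
  "num_retries": 2
}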
I've also got this problem, and it might not be as easy to fix as adding a new retry policy for upstream reset.
Adding an upstream reset policy would also retry in cases where processing of the request has actually been started by the upstream, but something killed the connection (or the upstream died) before the response was fully sent. This could lead to reprocessing non-idempotent requests, but maybe the risk of that is acceptable?
Trying to "properly" handle this with retries seems hard. Retrying only idempotent requests on a reset would be fine, but how do you deal with non-idempotent requests? You could, maybe, retry only if the connection has been successfully used for a request before (i.e. is keepalive and isn't newly opened), and is reset before the first byte in response to the current request is received: that set of conditions might be considered unlikely enough to occur under circumstances that aren't a keepalive timeout that the risk of reprocessing is OK?
I suspect the best fix is not via retries at all, but to allow configuration of an idle timeout per upstream connection (see also the existing max_requests_per_connection). If you set this to less than the upstream's keep-alive timeout (taking max RTT into account!) then you should never see upstream resets due to keepalive timeouts.
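A sketch of what that could look like as a cluster-level setting (the field name is borrowed from what Envoy later shipped as common_http_protocol_options.idle_timeout; it does not exist at the time of this comment, so treat it as illustrative), set safely below the upstream's 1s keepalive so Envoy closes the idle connection before the upstream does:

"common_http_protocol_options": {
  "idle_timeout": "0.5s"
}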
Yeah, in general we have ignored idempotency in Envoy to date. The reason for this is that, no matter what various standards say, application developers ignore them, and I don't believe that Envoy can ever truly know whether a request is idempotent. This is why all retries are off by default and we ask users to pick the policy that suits them.
I'm not opposed to adding standard idempotency logic into Envoy, but it would need to be done with a big giant warning in the docs.
As per your other suggestion of an upstream idle timeout, this seems reasonable to me and would be relatively easy to implement.
I got the same issue with node.js (after upgrading to node.js v8.x, the keep-alive timeout is set to 5s by default, which makes the problem more visible).
I also agree with providing an option to set the upstream timeout: if the connection is about to time out, use a new connection instead of the old one.
Is there anything left here now that the upstream idle timeout is configurable via #2714?
Don't think so. Closing.
This is exactly what I've just spent weeks debugging in istio haha.
https://github.com/istio/istio/issues/13848#issuecomment-497874246