Title: Debugging 503s
Description:
We have a service that is returning 503s for roughly 1% of total traffic.
The target service runs 100 replicas and the calling service runs 50 replicas.
We have updated the circuit breaker configuration as below:
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 10000
    max_pending_requests: 10000
    max_requests: 10000
    max_retries: 3
The traffic would be around 10,000 requests/second.
And the latency is around 200ms, i.e. roughly 10,000 rps × 0.2 s ≈ 2,000 concurrent requests by Little's law, so I assume this configuration is sufficient to handle this traffic.
From the stats,
there is no upstream_rq_pending_overflow.
We have upstream_rq_pending_failure_eject and upstream_cx_connect_timeout.
I can understand upstream_cx_connect_timeout, since we have a connection timeout of 0.25s.
But what could be the other reasons for upstream_rq_pending_failure_eject?
Also, any suggestions for debugging these 503 issues would be really helpful.
Is envoy returning the 503, or does your service return the 503?
I have seen 503s returned by envoy when upstream silently closes an idle connection but envoy is not aware of this event and tries to send a new request on this connection. Configuring retry_policy at the route level or configuring keepalives for the upstream cluster helped.
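Roughly, a route-level retry policy for those cases might look like the following (a minimal sketch assuming the v2 route configuration; the virtual host, match, and cluster name target_service are placeholders, and retry_on / num_retries should be tuned for your traffic):

route_config:
  virtual_hosts:
  - name: backend
    domains: ["*"]
    routes:
    - match: { prefix: "/" }
      route:
        cluster: target_service              # placeholder cluster name
        retry_policy:
          retry_on: "connect-failure,reset"  # retry when the upstream connection fails or is reset
          num_retries: 2
          per_try_timeout: 0.5s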
Hi @bmadhavan, envoy is returning the 503s
We have a similar question on our side. We do NOT have any circuit breakers set explicitly and we have pretty low traffic; however, we are occasionally getting a 503 on gRPC requests between two envoys.
My application sees:
rpc error: code = FailedPrecondition desc = transport: received the unexpected content-type "text/plain"
On the "sidecar" envoy I can see:
[2018-07-03 15:26:43.991][14][debug][router] source/common/router/router.cc:298] [C151754][S2751502683535461939] 'x-request-id':'ab61454a-cdfd-4b4a-a351-c5f9182484f0'
[2018-07-03 15:26:43.991][14][debug][pool] source/common/http/http2/conn_pool.cc:97] [C151755] creating stream
[2018-07-03 15:26:43.991][14][debug][router] source/common/router/router.cc:966] [C151754][S2751502683535461939] pool ready
[2018-07-03 15:26:43.991][14][debug][http] source/common/http/conn_manager_impl.cc:790] [C151754][S2751502683535461939] request end stream
[2018-07-03 15:26:43.992][14][debug][router] source/common/router/router.cc:590] [C151757][S15483029364081032919] upstream headers complete: end_stream=false
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:963] [C151757][S15483029364081032919] encoding headers via codec (end_stream=false):
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919] ':status':'503'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919] 'content-length':'19'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919] 'content-type':'text/plain'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919] 'date':'Tue, 03 Jul 2018 15:26:43 GMT'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919] 'server':'envoy'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919] 'x-envoy-upstream-service-time':'0'
[2018-07-03 15:26:43.992][14][debug][client] source/common/http/codec_client.cc:94] [C151755] response complete
[2018-07-03 15:26:43.992][14][debug][pool] source/common/http/http2/conn_pool.cc:189] [C151755] destroying stream: 1 remaining
[2018-07-03 15:26:43.992][14][debug][http2] source/common/http/http2/codec_impl.cc:517] [C151757] stream closed: 0
[2018-07-03 15:26:43.992][14][debug][http2] source/common/http/http2/codec_impl.cc:517] [C151755] stream closed: 0
I cannot see anything corresponding on the target envoy (next hop). It looks like either the 503 was returned without any log, or the sidecar envoy responded with the 503 immediately?
I also cannot see anything in the metrics described here: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/circuit_breaking
...apart from a drop in rate(envoy_cluster_upstream_rq_total[1m]) caused by the one missed request (my app makes 2 requests every 5s).
This is interesting though: 'x-envoy-upstream-service-time':'0'
Any clue what I am hitting here?
I have seen 503s returned by envoy when upstream silently closes an idle connection but envoy is not aware of this event and tries to send a new request on this connection. Configuring retry_policy at the route level or configuring keepalives for the upstream cluster helped.
This might be related...
BTW, is there any way to improve the response, i.e. to return an actual gRPC response on such an error (or any other error)?
@dnivra26 @Bplotka You can enable HTTP access logging to get the response_flags, which typically list the reason for the error code. https://www.envoyproxy.io/docs/envoy/latest/configuration/access_log#format-rules
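For reference, a file access log on the HTTP connection manager that includes the response flags could look roughly like this (a minimal sketch assuming the v2 API, logging to stdout; the format string is just an example):

access_log:
- name: envoy.file_access_log
  config:
    path: /dev/stdout
    format: "[%START_TIME%] \"%REQ(:METHOD)% %REQ(:PATH)%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %UPSTREAM_HOST%\n"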
@bmadhavan so from the access logs, the response flags for the 503s are
UC: Upstream connection termination in addition to 503 response code
UF: Upstream connection failure in addition to 503 response code
with UC being 75%.
What could be the reasons for the connection termination? The traffic is under the limit.
Also related:
@dnivra26 Might be this: https://github.com/envoyproxy/envoy/issues/2715
Try turning on keepalives and check if that helps.
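If it helps, upstream TCP keepalives can be enabled on the cluster roughly like this (a sketch assuming an Envoy version that supports upstream_connection_options; the cluster name, endpoint, and keepalive values are placeholders to tune):

clusters:
- name: target_service                  # placeholder cluster name
  connect_timeout: 0.25s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  upstream_connection_options:
    tcp_keepalive:
      keepalive_time: 30                # seconds of idle time before the first probe
      keepalive_interval: 5             # seconds between probes
      keepalive_probes: 3               # failed probes before the connection is considered dead
  hosts:
  - socket_address: { address: target-service, port_value: 8080 }   # placeholder endpoint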
@htuch : ack
Found these in the envoy logs: "disconnect. resetting 1 pending requests", a bunch of them. Could this be the reason for the 503s? If so, what causes them and what can be done to fix them? @bmadhavan @mattklein123
[2018-07-07 22:58:34.373][126][debug][client] source/common/http/codec_client.cc:25] [C34258] connecting
[2018-07-07 22:58:34.373][126][debug][client] source/common/http/codec_client.cc:63] [C34258] connected
[2018-07-07 22:58:34.389][132][debug][client] source/common/http/codec_client.cc:94] [C23628] response complete
[2018-07-07 22:58:34.389][126][debug][client] source/common/http/codec_client.cc:94] [C34258] response complete
[2018-07-07 22:58:34.893][130][debug][client] source/common/http/codec_client.cc:94] [C23471] response complete
[2018-07-07 22:58:34.893][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:35.132][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:35.205][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:35.244][132][debug][client] source/common/http/codec_client.cc:94] [C23630] response complete
[2018-07-07 22:58:35.244][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:35.291][132][debug][client] source/common/http/codec_client.cc:94] [C23010] response complete
[2018-07-07 22:58:35.291][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:35.885][132][debug][client] source/common/http/codec_client.cc:81] [C34113] disconnect. resetting 1 pending requests
[2018-07-07 22:58:35.885][132][debug][client] source/common/http/codec_client.cc:104] [C34113] request reset
[2018-07-07 22:58:35.885][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.205][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:36.234][132][debug][client] source/common/http/codec_client.cc:94] [C23255] response complete
[2018-07-07 22:58:36.234][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.443][132][debug][client] source/common/http/codec_client.cc:81] [C34116] disconnect. resetting 1 pending requests
[2018-07-07 22:58:36.443][132][debug][client] source/common/http/codec_client.cc:104] [C34116] request reset
[2018-07-07 22:58:36.443][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:36.552][132][debug][client] source/common/http/codec_client.cc:81] [C34118] disconnect. resetting 1 pending requests
[2018-07-07 22:58:36.552][132][debug][client] source/common/http/codec_client.cc:104] [C34118] request reset
[2018-07-07 22:58:36.552][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.774][132][debug][client] source/common/http/codec_client.cc:94] [C23261] response complete
[2018-07-07 22:58:36.774][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.853][130][debug][client] source/common/http/codec_client.cc:94] [C23478] response complete
More logs below, showing "queueing request due to no available connections".
From the envoy stats, active connections did not even cross ~400, and the default circuit-breaking max connections is 1024. What could be the reason for envoy saying there are no available connections?
[2018-07-07 23:12:40.396][130][debug][pool] source/common/http/http1/conn_pool.cc:88] [C36833] using existing connection
[2018-07-07 23:12:40.397][130][debug][pool] source/common/http/http1/conn_pool.cc:206] [C36833] response complete
[2018-07-07 23:12:40.397][130][debug][pool] source/common/http/http1/conn_pool.cc:244] [C36833] moving to ready
[2018-07-07 23:12:40.398][132][debug][pool] source/common/http/http1/conn_pool.cc:88] [C35752] using existing connection
[2018-07-07 23:12:40.405][130][debug][pool] source/common/http/http1/conn_pool.cc:79] creating a new connection
[2018-07-07 23:12:40.405][130][debug][pool] source/common/http/http1/conn_pool.cc:105] queueing request due to no available connections
[2018-07-07 23:12:40.406][130][debug][pool] source/common/http/http1/conn_pool.cc:249] [C36950] attaching to next request
[2018-07-07 23:12:40.415][130][debug][pool] source/common/http/http1/conn_pool.cc:206] [C36950] response complete
[2018-07-07 23:12:40.415][130][debug][pool] source/common/http/http1/conn_pool.cc:244] [C36950] moving to ready
[2018-07-07 23:12:40.415][132][debug][pool] source/common/http/http1/conn_pool.cc:206] [C35752] response complete
[2018-07-07 23:12:40.415][132][debug][pool] source/common/http/http1/conn_pool.cc:244] [C35752] moving to ready
[2018-07-07 23:12:40.459][132][debug][pool] source/common/http/http1/conn_pool.cc:88] [C35752] using existing connection
Envoy does connection pooling to the upstream, so "queueing request due to no available connections" only means that there is no established connection to re-use, and envoy will initiate a new connection to the upstream. This should not cause any issue.
For the 503s, I did recommend a few settings earlier in the thread.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
I am still facing this issue. Has anyone found a solution for this? @dnivra26 @bwplotka
For others that may be facing similar symptoms (a low but constant rate of HTTP 503s), be sure to check out the idle_timeout_ms setting introduced in Ambassador 0.60+. https://www.getambassador.io/reference/timeouts/#idle-timeout-idle_timeout_ms
For example, we had a node.js application that appeared to have an idle timeout of 5 seconds, but Ambassador's default idle timeout is 5 _minutes_. After setting idle_timeout_ms: 4000 (4 seconds) for this service, our HTTP 503s went away. Ensure that the proxies "in front of" your services have shorter idle timeouts than the services behind them.
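For anyone applying this, the Mapping with the shorter idle timeout might look roughly like the following (a sketch using the ambassador/v1 Mapping format; the mapping name, prefix, and service are placeholders):

apiVersion: ambassador/v1
kind: Mapping
name: node_backend_mapping        # placeholder mapping name
prefix: /backend/                 # placeholder prefix
service: node-backend:3000        # placeholder service
idle_timeout_ms: 4000             # shorter than the node.js app's ~5s idle timeout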