running 100 qps through ingress in the perf setup I'm getting a small % of 503s
(only on multi-hop paths - maybe a timeout issue?) - it also only happens with >=20 connections or so
35.199.14.248/fortio2/fortio/fetch/echosrv1:8080/echo
Starting at 100 qps with 30 thread(s) [gomax 2] : exactly 3000, 100 calls each (total 3000 + 0)
Ended after 30.071664869s : 3000 calls. qps=99.762
Sleep times : count 2970 avg 0.26524946 +/- 0.01577 min 0.192739205 max 0.292777981 sum 787.790887
Aggregated Function Time : count 3000 avg 0.037497644 +/- 0.01582 min 0.009672528 max 0.10991789 sum 112.492931
# range, mid point, percentile, count
>= 0.00967253 <= 0.01 , 0.00983626 , 0.03, 1
> 0.01 <= 0.012 , 0.011 , 0.17, 4
> 0.012 <= 0.014 , 0.013 , 0.70, 16
> 0.014 <= 0.016 , 0.015 , 2.63, 58
> 0.016 <= 0.018 , 0.017 , 5.33, 81
> 0.018 <= 0.02 , 0.019 , 8.17, 85
> 0.02 <= 0.025 , 0.0225 , 19.70, 346
> 0.025 <= 0.03 , 0.0275 , 33.83, 424
> 0.03 <= 0.035 , 0.0325 , 52.87, 571
> 0.035 <= 0.04 , 0.0375 , 67.87, 450
> 0.04 <= 0.045 , 0.0425 , 76.67, 264
> 0.045 <= 0.05 , 0.0475 , 82.83, 185
> 0.05 <= 0.06 , 0.055 , 91.00, 245
> 0.06 <= 0.07 , 0.065 , 94.97, 119
> 0.07 <= 0.08 , 0.075 , 98.10, 94
> 0.08 <= 0.09 , 0.085 , 98.87, 23
> 0.09 <= 0.1 , 0.095 , 99.60, 22
> 0.1 <= 0.109918 , 0.104959 , 100.00, 12
# target 50% 0.0342469
# target 75% 0.044053
# target 99% 0.0918182
# target 99.9% 0.107438
Code 200 : 2995 (99.8 %)
Code 503 : 5 (0.2 %)
Response Header Sizes : count 3000 avg 165.72233 +/- 6.772 min 0 max 167 sum 497167
Response Body/Total Sizes : count 3000 avg 166.09833 +/- 2.432 min 165 max 226 sum 498295
Saved result to data/2018-02-08-055310_Fortio.json
All done 3000 calls 37.498 ms avg, 99.8 qps
gets worse with higher qps and/or connections:
Starting at max qps with 20 thread(s) [gomax 2] for exactly 10000 calls (500 per thread + 0)
Code 200 : 9667 (96.7 %)
Code 503 : 333 (3.3 %)
All done 10000 calls 67.028 ms avg, 293.5 qps
Starting at max qps with 40 thread(s) [gomax 2] for exactly 10000 calls (250 per thread + 0)
Aggregated Function Time : count 10000 avg 0.11053613 +/- 0.08286 min 0.002446852 max 0.70930171 sum 1105.36134
Code 200 : 9373 (93.7 %)
Code 503 : 627 (6.3 %)
All done 10000 calls 110.536 ms avg, 352.3 qps
could it be circuit breaker / outlier detection settings?
could it be circuit breaker / outlier detection settings?
is there any such by default ?
Yes, see
https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/cluster/circuit_breaker.proto.html#envoy-api-msg-cluster-circuitbreakers
thanks for the pointer, but all the default values seem to be 1024 (max connections, etc.), so it shouldn't trigger with 20 or 40 connections
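For reference, a minimal sketch of the per-cluster circuit breaker stanza from the Envoy v2 API linked above, filled in with the documented defaults (illustrative only, not the config Pilot actually pushes in this setup):
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 1024        # 20-40 test connections stay far below this
    max_pending_requests: 1024
    max_requests: 1024
    max_retries: 3
If a breaker did trip, the affected cluster's upstream_cx_overflow / upstream_rq_pending_overflow counters should show it in the Envoy stats.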
@geeknoid removed the kind/bug label a day ago
~@geeknoid, curious why are 503s for no reason not a bug ?~
Edit: It's not just this issue, looks like there are no bugs anymore :-)
This is the same as #1038, closing as duplicate.
it wasn't actually the same, #1038 was with changes, this one is without any change
will reopen as @PiotrSikora and I aren't done figuring the possible cause
(but if #1038 gets fixed and also fixes this, wonderful!)
After fixing https://github.com/istio/fortio/pull/132 the error rate drops considerably but is still present:
Code 200 : 2998 (99.9 %)
Code 503 : 2 (0.1 %)
Apologies about that bug - and maybe once I track down the logs for those 2 errors some other bug will show up
@wattli
This happens only with mTLS enabled, with traffic going from VM --> Istio Ingress --> echosrv1 --> echosrv2
Envoy stats in the Ingress show the Ingress timing out when connecting to the upstream cluster for echosrv1:
out.echosrv1.istio.svc.cluster.local|http-echo::10.4.5.9:8080::cx_connect_fail::1208
out.echosrv1.istio.svc.cluster.local|http-echo::10.4.5.9:8080::rq_error::1336
@PiotrSikora is working on improving the Envoy SSL perf and he has a PR:
@wattli It's unlikely that this has anything to do with https://github.com/envoyproxy/envoy/pull/2680. 503s are results of connection errors and not slower performance.
@PiotrSikora the hope was that, while you have a pilot-less setup to investigate the mTLS perf issue, you could check whether you also see 503s during your testing (which would rule out pilot / bad config) and confirm andra's determination
@ldemailly ah, I'm using static config and I saw exactly zero 503s.
@PiotrSikora do we have any measurements for the time spent in connecting one Envoy to an upstream Envoy cluster?
Slower performance will cause connection errors if it takes more time to connect than the connectTimeout set in the CDS config.
what is the connectTimeout in your CDS static config?
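(For anyone following along: connect_timeout is a per-cluster field in the CDS config. A hypothetical sketch of the relevant fragment, with an illustrative value rather than the config under test:)
name: in.8080
connect_timeout: 1s   # time allowed to establish the connection to the upstream endpoint
type: STATIC
hosts:
- socket_address: { address: 127.0.0.1, port_value: 8080 }
If establishing the upstream connection takes longer than connect_timeout, the request fails with a connect failure (surfacing as a 503) rather than just completing slowly.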
Can it be caused by the loading of the certificates? Is there any caching of the certificates in Envoy?
If loading of the certificates is done for every LDS and CDS, it may slow down things.
That's my question too: do we always load the cert/key for every incoming and outgoing call? Or is it loaded just the first time?
if it were reloaded on each call we wouldn't need to restart envoy when the certs change
No, it's loaded only during startup and after receiving config (LDS/CDS) update. I'm trying to investigate what happens when the update is the same as previous one.
But if the low connection timeout is the issue, then why not just raise it? Having timeout <= 1s seems quite aggressive.
As a workaround we can increase the timeout, but people consider 1s to be more than sufficient. In general, it should be in the tens of ms.
There is no reason to reload a certificate with the same name, except for when SDS will tell Envoy to do it. Instead, Envoy should re-use the in-memory certificate.
1s on internal network with no congestion is a lot
if we have any packet loss it's classically not enough as the first retry is at 1s (unless tuned) but I don't think that's the case here
I reproduced this in my env with 0.6.0. I don't think it's related to connection timeout: in the fortio result, the max response time was around 140ms, which is way lower than 1s.
In ingress access log: (see here for log format)
[2018-03-02T01:56:49.871Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 42 38 "10.4.0.1" "istio/fortio-0.7.3-pre" "e539e54b-e239-9f44-a3fa-54f6a6d1c83d" "35.197.53.122" "10.4.0.10:8080"
The response flag is -, which means the 503 is from the upstream.
In deployment1 sidecar access log:
[2018-03-02T01:56:49.896Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 UC 0 57 14 - "10.4.0.1" "istio/fortio-0.7.3-pre" "e539e54b-e239-9f44-a3fa-54f6a6d1c83d" "35.197.53.122" "127.0.0.1:8080"
the response flag is UC, which means upstream connection termination, in addition to the 503 response code.
The 503 is coming from the sidecar, and it seems the upstream of the sidecar (echosrv) is terminating the connection.
In sidecar envoy stats (I got total 3 of 503s):
cluster.in.8080.upstream_cx_destroy_remote_with_active_rq: 3
cluster.in.8080.upstream_cx_destroy_with_active_rq: 3
this means the echosrv is terminating the connection while the request is still active. Any idea why this happens?
@ldemailly is this then a fortio bug?
fortio's fetch is a pretty dumb proxy: https://github.com/istio/fortio/issues/133
yet it shouldn't change the number of simultaneous connections, or stop connecting only when mTLS is on ?
are there any stats on maximum open connections ? I can add some on the fortio side (https://github.com/istio/fortio/issues/165) or you can check indirectly through the pprof url
@lizan In my repro, I got :
out.echosrv1.istio.svc.cluster.local|http-echo::10.4.5.9:8080::cx_connect_fail::1208
out.echosrv1.istio.svc.cluster.local|http-echo::10.4.5.9:8080::rq_error::1336
How can these be explained? When are these counters incremented?
[2018-03-02T01:56:49.896Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 UC 0 57 14 - "10.4.0.1" "istio/fortio-0.7.3-pre" "e539e54b-e239-9f44-a3fa-54f6a6d1c83d" "35.197.53.122" "127.0.0.1:8080"
can I see the rest of the log around this on the middle pod
specifically do you see the outbound calls, do those have any code ?
ok so I made a few runs on my 0.6.0 cluster:
for the url : ...my-k8s-ingress.../fortio1/fortio/fetch/echosrv2:8080/echo
Starting at 100 qps with 8 thread(s) [gomax 2] : exactly 2000, 250 calls each (total 2000 + 0)
On VM:
04:38:27 W http.go:712> Parsed non ok code 503 (HTTP/1.1 503)
04:38:28 W http.go:712> Parsed non ok code 503 (HTTP/1.1 503)
04:38:34 W http.go:712> Parsed non ok code 503 (HTTP/1.1 503)
04:38:35 W http.go:712> Parsed non ok code 503 (HTTP/1.1 503)
On Ingress:
[2018-03-02T04:38:27.613Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 11 10 "10.36.0.1" "istio/fortio-0.7.2" "30725606-7fb6-9933-b0ba-9c174b5b55c8" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:28.899Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 24 23 "10.36.0.1" "istio/fortio-0.7.2" "c108341b-886e-9de3-b2c6-7c41bbf774e4" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:34.039Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 10 10 "10.150.0.7" "istio/fortio-0.7.2" "3e606abc-aac5-9f55-a97b-a702c75bbeb2" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:35.163Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 25 25 "10.36.0.1" "istio/fortio-0.7.2" "6bdc0b5e-486b-9427-9b1b-a9137e41118e" "35.199.51.155" "10.36.0.6:8080"
and with surrounding requests:
$ kis logs istio-ingress-5b78d8df9d-5srjn |grep -C1 " 503 "
--
[2018-03-02T04:38:27.613Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 10 10 "10.150.0.5" "istio/fortio-0.7.2" "1a001b0a-f847-90d6-bdeb-28a7c247a558" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:27.613Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 11 10 "10.36.0.1" "istio/fortio-0.7.2" "30725606-7fb6-9933-b0ba-9c174b5b55c8" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:27.613Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 12 12 "10.36.0.1" "istio/fortio-0.7.2" "4a4dd19f-1d1f-9cde-b9d3-a5d6e013cf8e" "35.199.51.155" "10.36.0.6:8080"
--
--
[2018-03-02T04:38:28.899Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 24 23 "10.36.0.1" "istio/fortio-0.7.2" "61fb76c1-841e-9118-8831-ebc4ea68c64e" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:28.899Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 24 23 "10.36.0.1" "istio/fortio-0.7.2" "c108341b-886e-9de3-b2c6-7c41bbf774e4" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:28.900Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 24 24 "10.150.0.4" "istio/fortio-0.7.2" "bb559ed9-db6a-91bd-b058-289954a7a778" "35.199.51.155" "10.36.0.6:8080"
--
--
[2018-03-02T04:38:33.961Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 21 21 "10.150.0.4" "istio/fortio-0.7.2" "ab80c021-0ab3-98fc-b5a8-654af7a9620d" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:34.039Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 10 10 "10.150.0.7" "istio/fortio-0.7.2" "3e606abc-aac5-9f55-a97b-a702c75bbeb2" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:34.038Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 11 11 "10.36.0.1" "istio/fortio-0.7.2" "75cd1950-813f-90d8-9222-ff693b9278d2" "35.199.51.155" "10.36.0.6:8080"
--
--
[2018-03-02T04:38:35.164Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 25 24 "10.150.0.3" "istio/fortio-0.7.2" "1eebd2d6-bd5f-9b7a-aba2-30001f7a37bb" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:35.163Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 25 25 "10.36.0.1" "istio/fortio-0.7.2" "6bdc0b5e-486b-9427-9b1b-a9137e41118e" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:35.163Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 26 26 "10.150.0.3" "istio/fortio-0.7.2" "9a3db532-eab9-9f15-95ae-2aedea537dfa" "35.199.51.155" "10.36.0.6:8080"
--- switching strategy as the log can't be correlated as is
Starting at 100 qps with 8 thread(s) [gomax 2] : exactly 1000, 125 calls each (total 1000 + 0)
Code 200 : 991 (99.1 %)
Code 503 : 9 (0.9 %)
35.199.51.155/fortio1/fortio/fetch/echosrv2:8080/echo?run=1
kis logs istio-ingress-5b78d8df9d-5srjn |grep "run=1"|wc -l
1000
kis logs istio-ingress-5b78d8df9d-5srjn |grep "run=1" | grep " 503 "
[2018-03-02T04:54:10.804Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 16 16 "10.150.0.4" "istio/fortio-0.7.2" "cf9d0df0-87c9-960d-b1c4-319f69aaf0f1" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:11.045Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 9 9 "10.150.0.5" "istio/fortio-0.7.2" "f2ba03d6-7737-91d4-aca4-2ffcd4486353" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:13.545Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 14 14 "10.150.0.4" "istio/fortio-0.7.2" "473539d2-a773-98f7-8ee7-1eac8403f8ae" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:13.706Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 13 13 "10.150.0.3" "istio/fortio-0.7.2" "6ae4e7c5-3b33-91e9-9b49-9a89f4c55545" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:16.932Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 9 9 "10.150.0.8" "istio/fortio-0.7.2" "83671c4a-8f52-9b0c-ac79-08f864520365" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:17.496Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 18 17 "10.150.0.3" "istio/fortio-0.7.2" "1b47e13c-4cb3-9bf1-b7a3-dc5ca141331a" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:17.496Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 18 17 "10.150.0.7" "istio/fortio-0.7.2" "a2b89f57-d67a-9732-9aa1-a619839c3e01" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:17.576Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 15 15 "10.150.0.8" "istio/fortio-0.7.2" "c060b39a-d97d-916f-9202-e96967af205d" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:17.576Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 15 15 "10.150.0.8" "istio/fortio-0.7.2" "b37f694a-0dba-91ca-a0f3-9e18e7469c8d" "35.199.51.155" "10.36.0.6:8080"
k logs -n istio echo-svc-deployment1-6cb54fc-vsrfc istio-proxy | grep "run=1" | wc -l
1991
so indeed 9 missing on the outcall
$ k logs -n istio echo-svc-deployment1-6cb54fc-vsrfc echosrv | grep "run=1" | wc -l
991
interesting... the fortio app never saw those 9 calls
@lizan can you clarify UC exactly - is it connection timeout, connection and empty read + eof, something else ?
I'll double check the code on fortio's side but... the printing is the first thing it does:
https://github.com/istio/fortio/blob/master/ui/uihandler.go#L576-L578
hooked to the default go http server through
https://github.com/istio/fortio/blob/master/ui/uihandler.go#L865
ps: why are envoy logs not showing the authority/host/service names anymore but just ips ?
Definitely not a timeout, which would result in a 504. Likely an empty read + EOF.
How can these be explained? When are these counters incremented?
@andraxylia Do you still have the reproduced cluster? What is the access log of that proxy? (fetched by kubectl logs).
I have it running continuously on istio-perf-0-6 project if you want to look around
here are some envoy stats I captured around a 503:
cluster.in.8080.internal.upstream_rq_200: 455
cluster.in.8080.internal.upstream_rq_2xx: 455
cluster.in.8080.internal.upstream_rq_503: 1
cluster.in.8080.internal.upstream_rq_5xx: 1
cluster.in.8080.membership_healthy: 1
cluster.in.8080.membership_total: 1
cluster.in.8080.upstream_cx_destroy_remote_with_active_rq: 1
cluster.in.8080.upstream_cx_destroy_with_active_rq: 1
cluster.in.8080.upstream_cx_http1_total: 455
cluster.in.8080.upstream_cx_rx_bytes_total: 75083
cluster.in.8080.upstream_cx_total: 455
cluster.in.8080.upstream_cx_tx_bytes_total: 234291
cluster.in.8080.upstream_rq_200: 455
cluster.in.8080.upstream_rq_2xx: 455
cluster.in.8080.upstream_rq_503: 1
cluster.in.8080.upstream_rq_5xx: 1
cluster.in.8080.upstream_rq_pending_total: 455
cluster.in.8080.upstream_rq_total: 456
if I read this correctly it means the connection was successful, but
cluster.in.8080.upstream_cx_destroy_remote_with_active_rq: 1
cluster.in.8080.upstream_cx_destroy_with_active_rq: 1
can anything be gleaned from those 2 counters ?
ps: what is upstream_rq_pending_total ? pending what ?
looking at lsof, envoy on the node has about 88-92 open sockets
fortio has 3 most of the time with some spikes
root@gke-istio-perf-default-pool-4d15d772-fngb:~# lsof -p 26900
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
fortio 26900 root cwd DIR 0,124 4096 114498 /
fortio 26900 root rtd DIR 0,124 4096 114498 /
fortio 26900 root txt REG 8,1 8293120 1828913 /usr/local/bin/fortio
fortio 26900 root 0r FIFO 0,10 0t0 114510 pipe
fortio 26900 root 1w FIFO 0,10 0t0 114511 pipe
fortio 26900 root 2w FIFO 0,10 0t0 114512 pipe
fortio 26900 root 3u sock 0,7 0t0 114603 can't identify protocol
fortio 26900 root 4u 0000 0,11 0 7212 anon_inode
fortio 26900 root 5u sock 0,7 0t0 110448 can't identify protocol
fortio 26900 root 6u sock 0,7 0t0 114611 can't identify protocol
fortio 26900 root 7u sock 0,7 0t0 2110099 can't identify protocol
fortio 26900 root 8u sock 0,7 0t0 2110112 can't identify protocol
fortio 26900 root 9u sock 0,7 0t0 2110095 can't identify protocol
fortio 26900 root 10u sock 0,7 0t0 2110096 can't identify protocol
fortio 26900 root 11u sock 0,7 0t0 2110113 can't identify protocol
fortio 26900 root 12u sock 0,7 0t0 2110114 can't identify protocol
fortio 26900 root 13u sock 0,7 0t0 2110087 can't identify protocol
fortio 26900 root 14u sock 0,7 0t0 2110088 can't identify protocol
fortio 26900 root 15u sock 0,7 0t0 2110097 can't identify protocol
fortio 26900 root 16u sock 0,7 0t0 2110117 can't identify protocol
I can try to add internal metrics but I don't see a smoking gun so far
also will try go1.10 just in case there is an issue in go1.8.7
@lizan I decreased the CDS connect_timeout to 100ms to force the repro, and I can do a simple repro by calling from Ingress to echosrv1 or to echosrv2 (no more chaining). Echosrv1 has mixer cache enabled, and echosrv2 has mixer cache disabled. The envoy stats are these:
cluster.out.echosrv1.istio.svc.cluster.local|http-echo.upstream_cx_connect_fail: 12
cluster.out.echosrv1.istio.svc.cluster.local|http-echo.upstream_rq_pending_failure_eject: 10
I would like to understand the meaning of cx_connect_fail. Does the connect_timeout in CDS include the chain from Ingress -> Envoy -> Mixer -> Echosrv1, or simply Ingress -> Envoy? If the former, we may have just a performance issue with Mixer. If the latter, it is an Envoy issue.
I emailed you my cluster info. The Envoy logs are attached; the 503s at the end are more relevant.
upstream_rq_pending_total
Once envoy creates a pending request because there is no active connection available, this is incremented (and never decremented). As long as you don't see upstream_rq_pending_active, it means those pending requests were processed.
@andraxylia from the ingress.log you attached, it doesn't look like the ingress is generating the 503s; the 503s in the log are from its upstream. For example:
[2018-03-02 07:50:31.823][17][debug][http] external/envoy/source/common/http/async_client_impl.cc:89] async http request response headers (end_stream=false):
[2018-03-02 07:50:31.824][17][debug][http] external/envoy/source/common/http/async_client_impl.cc:92] ':status':'503'
[2018-03-02 07:50:31.824][17][debug][http] external/envoy/source/common/http/async_client_impl.cc:92] 'content-length':'19'
[2018-03-02 07:50:31.824][17][debug][http] external/envoy/source/common/http/async_client_impl.cc:92] 'content-type':'text/plain'
[2018-03-02 07:50:31.824][17][trace][http] external/envoy/source/common/http/async_client_impl.cc:104] async http request response data (length=19 end_stream=true)
I'll take a look at your cluster.
got one complete stream log from echosrv1 sidecar:
[2018-03-02 22:27:32.337][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:447] [C1175][S2595374538763107673] request headers complete (end_stream=true):
[2018-03-02 22:27:32.337][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] ':authority':'35.197.53.122'
[2018-03-02 22:27:32.337][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'user-agent':'istio/fortio-0.7.3-pre'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-on-behalf-of':'104.132.11.85:40511'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-forwarded-for':'10.240.0.19'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-forwarded-proto':'http'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-envoy-internal':'true'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-request-id':'5c92dd16-2fd9-95e2-98f5-02405848440b'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-envoy-decorator-operation':'fortio1-redir'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-b3-traceid':'784a9efe5167ea81'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-b3-spanid':'784a9efe5167ea81'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-b3-sampled':'1'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-ot-span-context':'784a9efe5167ea81;784a9efe5167ea81;0000000000000000'
[2018-03-02 22:27:32.339][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-istio-attributes':'CkgKCnNvdXJjZS51aWQSOhI4a3ViZXJuZXRlczovL2lzdGlvLWluZ3Jlc3MtNzc2NGI2NjY1OS14bTQ3di5pc3Rpby1zeXN0ZW0='
[2018-03-02 22:27:32.339][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-envoy-original-path':'/fortio1/fortio/fetch/echosrv2:8080/echo'
[2018-03-02 22:27:32.339][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'content-length':'0'
[2018-03-02 22:27:32.339][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] ':path':'/fortio/fetch/echosrv2:8080/echo'
[2018-03-02 22:27:32.339][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] ':method':'GET'
[2018-03-02 22:27:32.421][23][debug][router] external/envoy/source/common/router/router.cc:249] [C1175][S2595374538763107673] cluster 'in.8080' match for URL '/fortio/fetch/echosrv2:8080/echo'
[2018-03-02 22:27:32.422][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] ':authority':'35.197.53.122'
[2018-03-02 22:27:32.422][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'user-agent':'istio/fortio-0.7.3-pre'
[2018-03-02 22:27:32.422][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-on-behalf-of':'104.132.11.85:40511'
[2018-03-02 22:27:32.422][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-forwarded-for':'10.240.0.19'
[2018-03-02 22:27:32.422][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-forwarded-proto':'http'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-request-id':'5c92dd16-2fd9-95e2-98f5-02405848440b'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-b3-traceid':'784a9efe5167ea81'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-b3-spanid':'784a9efe5167ea81'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-b3-sampled':'1'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-ot-span-context':'784a9efe5167ea81;784a9efe5167ea81;0000000000000000'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-envoy-original-path':'/fortio1/fortio/fetch/echosrv2:8080/echo'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'content-length':'0'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] ':path':'/fortio/fetch/echosrv2:8080/echo'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] ':method':'GET'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-envoy-internal':'true'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] ':scheme':'http'
[2018-03-02 22:27:32.473][23][debug][router] external/envoy/source/common/router/router.cc:948] [C1175][S2595374538763107673] pool ready
[2018-03-02 22:27:32.474][23][debug][router] external/envoy/source/common/router/router.cc:467] [C1175][S2595374538763107673] upstream reset
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:932] [C1175][S2595374538763107673] encoding headers via codec (end_stream=false):
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:937] [C1175][S2595374538763107673] ':status':'503'
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:937] [C1175][S2595374538763107673] 'content-length':'57'
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:937] [C1175][S2595374538763107673] 'content-type':'text/plain'
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:937] [C1175][S2595374538763107673] 'date':'Fri, 02 Mar 2018 22:27:32 GMT'
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:937] [C1175][S2595374538763107673] 'server':'envoy'
[2018-03-02 22:27:32.472][23][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:225] [C1261] attaching to next request
[2018-03-02 22:27:32.473][23][debug][router] external/envoy/source/common/router/router.cc:948] [C1175][S2595374538763107673] pool ready
[2018-03-02 22:27:32.473][23][debug][connection] external/envoy/source/common/network/connection_impl.cc:439] [C1261] remote close
[2018-03-02 22:27:32.473][23][debug][connection] external/envoy/source/common/network/connection_impl.cc:134] [C1261] closing socket: 0
[2018-03-02 22:27:32.473][23][debug][client] external/envoy/source/common/http/codec_client.cc:70] [C1261] disconnect. resetting 1 pending requests
[2018-03-02 22:27:32.473][23][debug][client] external/envoy/source/common/http/codec_client.cc:91] [C1261] request reset
[2018-03-02 22:27:32.474][23][debug][router] external/envoy/source/common/router/router.cc:467] [C1175][S2595374538763107673] upstream reset
Looks like it is really from echosrv, @ldemailly ?
try istio/fortio:0.7.3-pre1 as the fortio image (it contains https://github.com/istio/fortio/pull/173), with the url being .../fortio1/echo?close=1 instead of the full .../fortio1/fortio/fetch/echosrv2:8080/echo. It will simulate an http server that closes the connection on each call and may trigger the same issue.
(i'll try too but I have to run/rest)
just built a version with go1.10: istio/fortio:0.7.3-pre3
I was able to reproduce connection timeout without envoy using localhost:8080/fortio/fetch/localhost:8080/echo
on my mac at least it's related to sockets left in TIME_WAIT state, eventually I think we probably run out and get errors
In fortio logs I saw a handful of
18:34:10 E http.go:598> Unable to connect to {::1 8080 } : dial tcp [::1]:8080: connect: operation timed out
and lsof of fortio showed
fortio 7118 ldemailly 14u IPv6 0x35837db4c5f5e18f 0t0 TCP localhost:58899->localhost:http-alt (SYN_SENT)
which should never happen (syn loss on ::1)
# Before a run
ldemailly-macbookpro:fortio ldemailly$ netstat -na -f inet6|grep 8080|grep TIME |wc
0 0 0
# During a self chained run
ldemailly-macbookpro:fortio ldemailly$ netstat -na -f inet6|grep 8080|grep TIME |wc
2649 15894 209271
ldemailly-macbookpro:fortio ldemailly$ netstat -na -f inet6|grep 8080|grep TIME |wc
3297 19782 260463
ldemailly-macbookpro:fortio ldemailly$ netstat -na -f inet6|grep 8080|grep TIME |wc
3817 22902 301543
ldemailly-macbookpro:fortio ldemailly$ netstat -na -f inet6|grep 8080|grep TIME |wc
4273 25638 337567
# keeps growing...
also thinking this relates to https://github.com/envoyproxy/envoy/issues/2066
still getting errors with go1.10 and no TIME_WAIT sockets on the node (the TIME_WAIT buildup seems to be a separate mac issue)
please use istio/fortio:0.7.3-pre5
or look at istio-perf-0-6, run=9 is running with 8 connections at 80 qps (so 10qps per connection, from the VM)
getting about 1% errors / not connected on the echosrv1 middle step
echo-svc-deployment1-85f857cd99-blwp9 on node gke-istio-perf-default-pool-4d15d772-rq7n
I'm trying to use tcpdump on the node but I'm finding it difficult to find the traffic I care about
probably a debug image on the pod would help. will try that later - but let me know if you find something
A trace log for a 503 upstream connection:
[2018-03-03 00:11:36.931][20][debug][client] external/envoy/source/common/http/codec_client.cc:23] [C581] connecting
[2018-03-03 00:11:36.931][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:559] [C581] connecting to 127.0.0.1:8080
[2018-03-03 00:11:36.932][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:568] [C581] connection in progress
[2018-03-03 00:11:37.178][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:377] [C581] socket event: 2
[2018-03-03 00:11:37.178][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:445] [C581] write ready
[2018-03-03 00:11:37.180][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:455] [C581] connected
[2018-03-03 00:11:37.180][20][debug][client] external/envoy/source/common/http/codec_client.cc:52] [C581] connected
[2018-03-03 00:11:37.181][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:225] [C581] attaching to next request
[2018-03-03 00:11:37.184][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:314] [C581] writing 506 bytes, end_stream false
[2018-03-03 00:11:37.184][20][trace][connection] external/envoy/source/common/network/raw_buffer_socket.cc:67] [C581] write returns: 506
[2018-03-03 00:11:37.459][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:377] [C581] socket event: 2
[2018-03-03 00:11:37.459][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:445] [C581] write ready
[2018-03-03 00:11:37.487][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:377] [C581] socket event: 3
[2018-03-03 00:11:37.487][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:445] [C581] write ready
[2018-03-03 00:11:37.487][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:415] [C581] read ready
[2018-03-03 00:11:37.488][20][trace][connection] external/envoy/source/common/network/raw_buffer_socket.cc:25] [C581] read returns: 166
[2018-03-03 00:11:37.488][20][trace][connection] external/envoy/source/common/network/raw_buffer_socket.cc:25] [C581] read returns: 0
[2018-03-03 00:11:37.488][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:298] [C581] parsing 166 bytes
[2018-03-03 00:11:37.489][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=content-length value=0
[2018-03-03 00:11:37.489][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=date value=Sat, 03 Mar 2018 00:11:36 GMT
[2018-03-03 00:11:37.489][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=content-type value=text/plain; charset=utf-8
[2018-03-03 00:11:37.490][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=x-envoy-upstream-service-time value=49
[2018-03-03 00:11:37.490][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:354] [C581] headers complete
[2018-03-03 00:11:37.491][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=server value=envoy
[2018-03-03 00:11:37.497][20][debug][client] external/envoy/source/common/http/codec_client.cc:81] [C581] response complete
[2018-03-03 00:11:37.501][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:200] [C581] response complete
[2018-03-03 00:11:37.502][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:225] [C581] attaching to next request
[2018-03-03 00:11:37.503][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:314] [C581] writing 509 bytes, end_stream false
[2018-03-03 00:11:37.504][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:315] [C581] parsed 166 bytes
[2018-03-03 00:11:37.504][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:439] [C581] remote close
[2018-03-03 00:11:37.505][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:134] [C581] closing socket: 0
[2018-03-03 00:11:37.505][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:298] [C581] parsing 0 bytes
[2018-03-03 00:11:37.505][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:315] [C581] parsed 0 bytes
[2018-03-03 00:11:37.506][20][debug][client] external/envoy/source/common/http/codec_client.cc:70] [C581] disconnect. resetting 1 pending requests
[2018-03-03 00:11:37.506][20][debug][client] external/envoy/source/common/http/codec_client.cc:91] [C581] request reset
[2018-03-03 00:11:37.520][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:115] [C581] client disconnected
This looks weird because envoy read 166 bytes and parsed them, but ended with resetting 1 pending requests. Will follow the source around codec_impl later.
ah ! nice ! smoking gun ! gn
Now I believe this is an issue at Envoy, filed https://github.com/envoyproxy/envoy/issues/2715.
awesome! glad it looks like another half close related issue.
thanks a lot @ldemailly @lizan this is quite a finding!
I added #3987 for more checks - it may help to find another client/server than fortio to demonstrate/repro issues in the future even if this one is fairly clearly envoy's
@lizan / @PiotrSikora do you have an idea why this issue only triggers in a small % of cases? I would expect that if we put a closed socket back in the pool we'd get 50% errors (each time it's done 1 good request, the next one would be bad)
Well, does echosrv close the socket after each request? or only after X requests?
when using the fetch/... url, after each request - which is why I'm curious how that turns into a small %
the other weirdness is that it seems to happen only with mTLS (from andra's tests), so is something related to the timing / some race involved?
I don't think that's true? Trace logs from @lizan show EOF received by the raw buffer socket and not TLS.
This 503 only happens if a connection pool is busy enough, i.e. at the moment a connection in the pool reads a response completely from echosrv, the pool already has the next request to write to the upstream immediately. You only observe this in the mTLS case because with mTLS a lower qps is enough to make the pool that busy.
but upstream has closed the connection, so why isn't it always failing every other call?
Envoy doesn't always have a next request queued. I think if you turn the qps up high enough, the rate of 503s will get close to 50%?
k so if there is no next request immediately, how come in that case it discards the dead socket ? (and doesn't in the busy case)
The RemoteClose event right after onData will trigger it.
👍 I see, thank you. So it is clearly a bug if it handles the close correctly when there is enough idle time between requests but not when there isn't. We could try increasing the qps to see whether the error rate converges toward 50% then.
I am seeing this issue in 0.8.0 installed with default helm settings (thus I believe mTLS off) and nothing else changed. If I followed this thread correctly it's currently believed that the underlying issue is this: https://github.com/envoyproxy/envoy/issues/2715. I believe the committed fix for that issue should be in 0.8.0 yet I am still seeing 503's at 100 qps.
@lizan is there any issue left in Envoy?
@andraxylia I'm not aware of any. @kinghrothgar can you paste one line of your access log for the 503? I'd like to confirm whether it is the same root cause.
envoyproxy/envoy#2715 only happens with certain upstream apps which don't send connection: close and then close the connection.
It's likely we are uncovering other issues.
I can't figure out how to turn on debug logging for istio-proxy, and I assume I need to, as currently this is the only information I get:
[hello-world-795769b484-474h4] [2018-06-14T15:24:23.623Z] "GET / HTTP/1.1" 503 UC 0 57 0 - "-" "istio/fortio-0.11.0" "49055e90-5ead-982d-aeb8-4baf2860f33d" "hello-world.default.svc.cluster.local" "127.0.0.1:80"
Edit: And this is the access log line:
{"level":"info","time":"2018-06-14T15:54:58.860458Z","instance":"accesslog.logentry.istio-system","apiClaims":"","apiKey":"","apiName":"","apiVersion":"","connectionMtls":false,"destinationIp":"10.28.0.29","destinationNamespace":"default","destinationService":"hello-world.default.svc.cluster.local","latency":"287.21µs","method":"GET","originIp":"0.0.0.0","protocol":"http","receivedBytes":507,"referer":"","requestOperation":"","requestSize":0,"responseCode":503,"responseSize":57,"responseTimestamp":"2018-06-14T15:54:58.860678Z","sentBytes":149,"sourceIp":"10.28.3.22","sourceNamespace":"default","sourceService":"nginx.default.svc.cluster.local","sourceUser":"kubernetes://nginx-8854ff995-c7x8w.default","url":"/","userAgent":"istio/fortio-0.11.0"}
503 UC is good enough info (though verbose logging would have more details)
Is there any more info from me that would be useful?
As far as the "Why isn't it failing 50% of the time" topic, the highest % of 503's I could get was a little over 1%.
@kinghrothgar Does your server side/app close the connection without giving a Connection: close header ? Because that's the current claim on the envoy side, that it only happens in that case.
otherwise what you may be seeing is #1038 (503s during changes/updates)
Team, we need to figure out what specifically can be done within the next couple of weeks (for 1.0 launch)
@ldemailly and @lizan Did either of you try increasing the QPS to see if that made a difference?
@kinghrothgar can you provide more specifics?
I got a tcpdump of the issue. It appears the TCP connection is getting closed without an HTTP connection close header being sent. The link below is a pcap with the 2 TCP streams. One is the connection between the nginx pod and envoy, the other is the one between envoy and the python app over localhost. Please let me know if there's any more information that would be helpful.
@ldemailly are you looking into this?
@PiotrSikora - it seems to be the connection-reused-after-close issue again.
HTTP close without a close header is generally ok and common - but if the close happens immediately, I think it's normal to get a 503 and I'm not aware of any way to avoid it: envoy wouldn't have time to process the close event before sending. It's also not common behavior - the close without a header typically happens when the server clears up keep-alive connections or when the server has problems (in which case a 503 is likely the correct answer).
andra I am not, as I already filed and have a repro for the original envoy problem. I was just trying to clarify whether this new report is indeed the same issue or a new one (or the route change one)
I see https://github.com/envoyproxy/envoy/issues/2715 is fixed and it's clear there will be some timing situations where we cannot avoid the 503s for applications that do not send close. There is nothing more we can do in istio, so closing this one.
which istio release is the first that has the fix?
0.8
@ldemailly
in istio 0.8, I hit the same error: for a tiny percentage of requests the sidecar envoy returns 503, without using any rule. The log is below:
[2018-07-06T06:11:12.983Z] "POST /api/ HTTP/1.1" 503 UC 16687 57 0 - "-" "Mozilla/5.0 (pc-x86_64-linux-gnu) Siege/4.0.4" "41344284-c914-9db1-a30a-983e74455227" "10.20.10.20:8888" "127.0.0.1:8888"
I saw the discussion here and in the envoy reference; it seems that the problem has not been solved, is that right?
UC means upstream connection error. Is this log from the last envoy in the chain?
@wansuiye what application do you run?
Please see this comment, you may still see 503s if your app does not send HTTP close:
https://github.com/istio/istio/issues/3295#issuecomment-400789923
@ldemailly I use istio 0.8.0, and I think that envoy is the last one in the chain
@andraxylia I run sentry, which is an open-source error tracking app. In order to locate the problem, I added nginx in front of the app, so the traffic path was envoy -> nginx -> app. Strangely enough, once nginx was added there were no more errors (nginx used the default configuration). Then I adjusted the nginx configuration: when the client_body_timeout parameter is tuned small, such as 5 seconds, envoy hits the same error, and at the same time nginx reports 408 errors.
Now I do not know if it's the impact of client request timeouts or just a coincidence.
nginx log:
{"log":"127.0.0.1 - - [10/Jul/2018:02:39:25 +0000] \"POST /api/7/store/ HTTP/1.1\" 408 0 \"-\" \"sentry-java/1.7.3-c2920\" \"10.21.10.12\"\n","stream":"stdout","time":"2018-07-10T02:39:25.912227157Z"}
{"log":"127.0.0.1 - - [10/Jul/2018:02:41:29 +0000] \"POST /api/7/store/ HTTP/1.1\" 408 0 \"-\" \"sentry-java/1.7.3-c2920\" \"10.21.10.12\"\n","stream":"stdout","time":"2018-07-10T02:41:29.760106561Z"}
{"log":"127.0.0.1 - - [10/Jul/2018:02:41:30 +0000] \"POST /api/7/store/ HTTP/1.1\" 408 0 \"-\" \"sentry-java/1.7.3-c2920\" \"10.21.10.12\"\n","stream":"stdout","time":"2018-07-10T02:41:30.695566241Z"}
envoy log:
{"log":"[2018-07-10T02:39:20.341Z] \"POST /api/7/store/ HTTP/1.1\" 503 UC 4592 57 5570 - \"10.21.10.12\" \"sentry-java/1.7.3-c2920\" \"b643df69-dbcb-9b8a-8f1b-37c0786fd4\" \"10.21.10.12:9000\" \"127.0.0.1:9000\"\n","stream":"stdout","time":"2018-07-10T02:39:28.798012917Z"}
{"log":"[2018-07-10T02:41:22.888Z] \"POST /api/7/store/ HTTP/1.1\" 503 UC 16072 57 6872 - \"10.21.10.12\" \"sentry-java/1.7.3-c2920\" \"11a6084a-00bc-9715-a687-32134212de\" \"10.21.10.12:9000\" \"127.0.0.1:9009\"\n","stream":"stdout","time":"2018-07-10T02:41:38.798207467Z"}
{"log":"[2018-07-10T02:41:25.229Z] \"POST /api/7/store/ HTTP/1.1\" 503 UC 4592 57 5465 - \"10.21.10.12\" \"sentry-java/1.7.3-c2920\" \"d61dc150-b5dc-9a06-a6d8-5097b6e8ac\" \"10.21.10.12:9000\" \"127.0.0.1:9009\"\n","stream":"stdout","time":"2018-07-10T02:41:38.798384762Z"}
From the tcpdump of https://github.com/istio/istio/issues/3295#issuecomment-400413798, it looks like the app is responding with HTTP/1.0 without connection: close, is that your case too? @wansuiye
@lizan the app is responding with HTTP/1.1 without connection: close, not HTTP/1.0
I have run into this issue as well.
[xxx] [2018-10-02T16:12:44.409Z] "GET /glb HTTP/1.1" 503 UC 0 57 1 - "-" "GoogleHC/1.0" "ecb1c797-a58f-9c53-be4f-857bc2884456" "10.100.0.100" "127.0.0.1:3000"
It appears to affect apps running Node/express but not Play. curl -v indicates that the Node app responds with Connection: keep-alive, but the Play app does not send that response header. The issue seems to go away by setting Connection: close in the response.
Hey,
I just ran into what seemed to be the same issue. Our system is running ambassador with istio 1.0.5 (mTLS enabled) and a node/express server serving some static files. Usually that service gets hit with a few requests in a short burst, and sometimes one of those requests fails with a 503 UC. I've added Connection: close to the response header and it seems to fix the 503 responses, though I have the same question as @jsw regarding the recommended approach and so forth.
I'd highly appreciate it if someone would take the time to clarify this, and maybe it should be documented somewhere?