running 100 qps through ingress in the perf setup I'm getting a small % of 503s
(only on multi-hop paths - maybe a timeout issue?) - it also only happens with >=20 connections or so
35.199.14.248/fortio2/fortio/fetch/echosrv1:8080/echo
Starting at 100 qps with 30 thread(s) [gomax 2] : exactly 3000, 100 calls each (total 3000 + 0)
Ended after 30.071664869s : 3000 calls. qps=99.762
Sleep times : count 2970 avg 0.26524946 +/- 0.01577 min 0.192739205 max 0.292777981 sum 787.790887
Aggregated Function Time : count 3000 avg 0.037497644 +/- 0.01582 min 0.009672528 max 0.10991789 sum 112.492931
# range, mid point, percentile, count
>= 0.00967253 <= 0.01 , 0.00983626 , 0.03, 1
> 0.01 <= 0.012 , 0.011 , 0.17, 4
> 0.012 <= 0.014 , 0.013 , 0.70, 16
> 0.014 <= 0.016 , 0.015 , 2.63, 58
> 0.016 <= 0.018 , 0.017 , 5.33, 81
> 0.018 <= 0.02 , 0.019 , 8.17, 85
> 0.02 <= 0.025 , 0.0225 , 19.70, 346
> 0.025 <= 0.03 , 0.0275 , 33.83, 424
> 0.03 <= 0.035 , 0.0325 , 52.87, 571
> 0.035 <= 0.04 , 0.0375 , 67.87, 450
> 0.04 <= 0.045 , 0.0425 , 76.67, 264
> 0.045 <= 0.05 , 0.0475 , 82.83, 185
> 0.05 <= 0.06 , 0.055 , 91.00, 245
> 0.06 <= 0.07 , 0.065 , 94.97, 119
> 0.07 <= 0.08 , 0.075 , 98.10, 94
> 0.08 <= 0.09 , 0.085 , 98.87, 23
> 0.09 <= 0.1 , 0.095 , 99.60, 22
> 0.1 <= 0.109918 , 0.104959 , 100.00, 12
# target 50% 0.0342469
# target 75% 0.044053
# target 99% 0.0918182
# target 99.9% 0.107438
Code 200 : 2995 (99.8 %)
Code 503 : 5 (0.2 %)
Response Header Sizes : count 3000 avg 165.72233 +/- 6.772 min 0 max 167 sum 497167
Response Body/Total Sizes : count 3000 avg 166.09833 +/- 2.432 min 165 max 226 sum 498295
Saved result to data/2018-02-08-055310_Fortio.json
All done 3000 calls 37.498 ms avg, 99.8 qps
gets worse with higher qps and/or connections:
Starting at max qps with 20 thread(s) [gomax 2] for exactly 10000 calls (500 per thread + 0)
Code 200 : 9667 (96.7 %)
Code 503 : 333 (3.3 %)
All done 10000 calls 67.028 ms avg, 293.5 qps
Starting at max qps with 40 thread(s) [gomax 2] for exactly 10000 calls (250 per thread + 0)
Aggregated Function Time : count 10000 avg 0.11053613 +/- 0.08286 min 0.002446852 max 0.70930171 sum 1105.36134
Code 200 : 9373 (93.7 %)
Code 503 : 627 (6.3 %)
All done 10000 calls 110.536 ms avg, 352.3 qps
could it be circuit breaker / outlier detection settings?
could it be circuit breaker / outlier detection settings?
is there any such by default ?
Yes, see
https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/cluster/circuit_breaker.proto.html#envoy-api-msg-cluster-circuitbreakers
thanks for the pointer, but all the default values seem to be 1024 (max connections, etc.), so it shouldn't trigger with 20 or 40 connections
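For reference, a minimal sketch of the per-cluster circuit breaker stanza from the Envoy v2 API linked above, filled in with the documented defaults (illustrative only, not the config Pilot actually pushes in this setup):
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 1024        # 20-40 test connections stay far below this
    max_pending_requests: 1024
    max_requests: 1024
    max_retries: 3
If a breaker did trip, the affected cluster's upstream_cx_overflow / upstream_rq_pending_overflow counters should show it in the Envoy stats.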
@geeknoid removed the kind/bug label a day ago
~@geeknoid, curious why are 503s for no reason not a bug ?~
Edit: It's not just this issue, looks like there are no bugs anymore :-)
This is the same as #1038, closing as duplicate.
it wasn't actually the same, #1038 was with changes, this one is without any change
will reopen as @PiotrSikora and I aren't done figuring the possible cause
(but if #1038 gets fixed and also fixes this, wonderful!)
After fixing https://github.com/istio/fortio/pull/132 the error rate drops considerably but is still present:
Code 200 : 2998 (99.9 %)
Code 503 : 2 (0.1 %)
Apologies about that bug - and maybe once I track down the logs for those 2 errors some other bug will show up
@wattli
This happens only with mTLS enabled, with traffic going from VM --> Istio Ingress --> echosrv1 --> echosrv2
Envoy stats in the Ingress show the Ingress timing out when connecting to the upstream cluster for echosrv1:
out.echosrv1.istio.svc.cluster.local|http-echo::10.4.5.9:8080::cx_connect_fail::1208
out.echosrv1.istio.svc.cluster.local|http-echo::10.4.5.9:8080::rq_error::1336
@PiotrSikora is working on improving the Envoy SSL perf and he has a PR:
@wattli It's unlikely that this has anything to do with https://github.com/envoyproxy/envoy/pull/2680. 503s are results of connection errors and not slower performance.
@PiotrSikora the hope was that, while you have a pilot-less setup to investigate the mTLS perf issue, you could check whether you also see 503s during your testing (which would rule out pilot / bad config) and confirm andra's determination
@ldemailly ah, I'm using static config and I saw exactly zero 503s.
@PiotrSikora do we have any measurements for the time spent in connecting one Envoy to an upstream Envoy cluster?
Slower performance will cause connection errors if it takes more time to connect than the connectTimeout set in the CDS config.
what is the connectTimeout in your CDS static config?
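(For anyone following along: connect_timeout is a per-cluster field in the CDS config. A hypothetical sketch of the relevant fragment, with an illustrative value rather than the config under test:)
name: in.8080
connect_timeout: 1s   # time allowed to establish the connection to the upstream endpoint
type: STATIC
hosts:
- socket_address: { address: 127.0.0.1, port_value: 8080 }
If establishing the upstream connection takes longer than connect_timeout, the request fails with a connect failure (surfacing as a 503) rather than just completing slowly.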
Can it be caused by the loading of the certificates? Is there any caching of the certificates in Envoy?
If loading of the certificates is done for every LDS and CDS, it may slow down things.
That's my question too: do we always load the cert/key for every incoming and outgoing call? Or is it loaded just the first time?
if it were reloaded on each call we wouldn't need to restart envoy when the certs change
No, it's loaded only during startup and after receiving config (LDS/CDS) update. I'm trying to investigate what happens when the update is the same as previous one.
But if the low connection timeout is the issue, then why not just raise it? Having timeout <= 1s seems quite aggressive.
As a workaround we can increase the timeout, but people consider 1s to be more than sufficient. In general, it should be in the tens of ms.
There is no reason to reload a certificate with the same name, except for when SDS will tell Envoy to do it. Instead, Envoy should re-use the in-memory certificate.
1s on internal network with no congestion is a lot
if we have any packet loss it's classically not enough as the first retry is at 1s (unless tuned) but I don't think that's the case here
I reproduced this in my env with 0.6.0. I don't think it's related to connection timeout: in the fortio result, the max response time was around 140ms, which is way lower than 1s.
In ingress access log: (see here for log format)
[2018-03-02T01:56:49.871Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 42 38 "10.4.0.1" "istio/fortio-0.7.3-pre" "e539e54b-e239-9f44-a3fa-54f6a6d1c83d" "35.197.53.122" "10.4.0.10:8080"
The response flag is -, which means the 503 is from the upstream.
In deployment1 sidecar access log:
[2018-03-02T01:56:49.896Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 UC 0 57 14 - "10.4.0.1" "istio/fortio-0.7.3-pre" "e539e54b-e239-9f44-a3fa-54f6a6d1c83d" "35.197.53.122" "127.0.0.1:8080"
the response flag is UC, which means upstream connection termination, in addition to the 503 response code.
The 503 is coming from the sidecar, and it seems the upstream of the sidecar (echosrv) is terminating the connection.
In sidecar envoy stats (I got total 3 of 503s):
cluster.in.8080.upstream_cx_destroy_remote_with_active_rq: 3
cluster.in.8080.upstream_cx_destroy_with_active_rq: 3
this means the echosrv is terminating the connection while the request is still active. Any idea why this happens?
@ldemailly is this then a fortio bug?
fortio's fetch is a pretty dumb proxy: https://github.com/istio/fortio/issues/133
yet it shouldn't change the number of simultaneous connections, or stop connecting only when mTLS is on ?
are there any stats on maximum open connections ? I can add some on the fortio side (https://github.com/istio/fortio/issues/165) or you can check indirectly through the pprof url
@lizan In my repro, I got :
out.echosrv1.istio.svc.cluster.local|http-echo::10.4.5.9:8080::cx_connect_fail::1208
out.echosrv1.istio.svc.cluster.local|http-echo::10.4.5.9:8080::rq_error::1336
How can these be explained? When are these counters incremented?
[2018-03-02T01:56:49.896Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 UC 0 57 14 - "10.4.0.1" "istio/fortio-0.7.3-pre" "e539e54b-e239-9f44-a3fa-54f6a6d1c83d" "35.197.53.122" "127.0.0.1:8080"
can I see the rest of the log around this on the middle pod
specifically do you see the outbound calls, do those have any code ?
ok so I made a few runs on my 0.6.0 cluster:
for the url : ...my-k8s-ingress.../fortio1/fortio/fetch/echosrv2:8080/echo
Starting at 100 qps with 8 thread(s) [gomax 2] : exactly 2000, 250 calls each (total 2000 + 0)
On VM:
04:38:27 W http.go:712> Parsed non ok code 503 (HTTP/1.1 503)
04:38:28 W http.go:712> Parsed non ok code 503 (HTTP/1.1 503)
04:38:34 W http.go:712> Parsed non ok code 503 (HTTP/1.1 503)
04:38:35 W http.go:712> Parsed non ok code 503 (HTTP/1.1 503)
On Ingress:
[2018-03-02T04:38:27.613Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 11 10 "10.36.0.1" "istio/fortio-0.7.2" "30725606-7fb6-9933-b0ba-9c174b5b55c8" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:28.899Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 24 23 "10.36.0.1" "istio/fortio-0.7.2" "c108341b-886e-9de3-b2c6-7c41bbf774e4" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:34.039Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 10 10 "10.150.0.7" "istio/fortio-0.7.2" "3e606abc-aac5-9f55-a97b-a702c75bbeb2" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:35.163Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 25 25 "10.36.0.1" "istio/fortio-0.7.2" "6bdc0b5e-486b-9427-9b1b-a9137e41118e" "35.199.51.155" "10.36.0.6:8080"
and with surrounding requests:
$ kis logs istio-ingress-5b78d8df9d-5srjn |grep -C1 " 503 "
--
[2018-03-02T04:38:27.613Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 10 10 "10.150.0.5" "istio/fortio-0.7.2" "1a001b0a-f847-90d6-bdeb-28a7c247a558" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:27.613Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 11 10 "10.36.0.1" "istio/fortio-0.7.2" "30725606-7fb6-9933-b0ba-9c174b5b55c8" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:27.613Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 12 12 "10.36.0.1" "istio/fortio-0.7.2" "4a4dd19f-1d1f-9cde-b9d3-a5d6e013cf8e" "35.199.51.155" "10.36.0.6:8080"
--
--
[2018-03-02T04:38:28.899Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 24 23 "10.36.0.1" "istio/fortio-0.7.2" "61fb76c1-841e-9118-8831-ebc4ea68c64e" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:28.899Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 24 23 "10.36.0.1" "istio/fortio-0.7.2" "c108341b-886e-9de3-b2c6-7c41bbf774e4" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:28.900Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 24 24 "10.150.0.4" "istio/fortio-0.7.2" "bb559ed9-db6a-91bd-b058-289954a7a778" "35.199.51.155" "10.36.0.6:8080"
--
--
[2018-03-02T04:38:33.961Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 21 21 "10.150.0.4" "istio/fortio-0.7.2" "ab80c021-0ab3-98fc-b5a8-654af7a9620d" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:34.039Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 10 10 "10.150.0.7" "istio/fortio-0.7.2" "3e606abc-aac5-9f55-a97b-a702c75bbeb2" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:34.038Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 11 11 "10.36.0.1" "istio/fortio-0.7.2" "75cd1950-813f-90d8-9222-ff693b9278d2" "35.199.51.155" "10.36.0.6:8080"
--
--
[2018-03-02T04:38:35.164Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 25 24 "10.150.0.3" "istio/fortio-0.7.2" "1eebd2d6-bd5f-9b7a-aba2-30001f7a37bb" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:35.163Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 25 25 "10.36.0.1" "istio/fortio-0.7.2" "6bdc0b5e-486b-9427-9b1b-a9137e41118e" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:38:35.163Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 200 - 0 0 26 26 "10.150.0.3" "istio/fortio-0.7.2" "9a3db532-eab9-9f15-95ae-2aedea537dfa" "35.199.51.155" "10.36.0.6:8080"
--- switching strategy as the log can't be correlated as is
Starting at 100 qps with 8 thread(s) [gomax 2] : exactly 1000, 125 calls each (total 1000 + 0)
Code 200 : 991 (99.1 %)
Code 503 : 9 (0.9 %)
35.199.51.155/fortio1/fortio/fetch/echosrv2:8080/echo?run=1
kis logs istio-ingress-5b78d8df9d-5srjn |grep "run=1"|wc -l
1000
kis logs istio-ingress-5b78d8df9d-5srjn |grep "run=1" | grep " 503 "
[2018-03-02T04:54:10.804Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 16 16 "10.150.0.4" "istio/fortio-0.7.2" "cf9d0df0-87c9-960d-b1c4-319f69aaf0f1" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:11.045Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 9 9 "10.150.0.5" "istio/fortio-0.7.2" "f2ba03d6-7737-91d4-aca4-2ffcd4486353" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:13.545Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 14 14 "10.150.0.4" "istio/fortio-0.7.2" "473539d2-a773-98f7-8ee7-1eac8403f8ae" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:13.706Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 13 13 "10.150.0.3" "istio/fortio-0.7.2" "6ae4e7c5-3b33-91e9-9b49-9a89f4c55545" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:16.932Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 9 9 "10.150.0.8" "istio/fortio-0.7.2" "83671c4a-8f52-9b0c-ac79-08f864520365" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:17.496Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 18 17 "10.150.0.3" "istio/fortio-0.7.2" "1b47e13c-4cb3-9bf1-b7a3-dc5ca141331a" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:17.496Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 18 17 "10.150.0.7" "istio/fortio-0.7.2" "a2b89f57-d67a-9732-9aa1-a619839c3e01" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:17.576Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 15 15 "10.150.0.8" "istio/fortio-0.7.2" "c060b39a-d97d-916f-9202-e96967af205d" "35.199.51.155" "10.36.0.6:8080"
[2018-03-02T04:54:17.576Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo?run=1 HTTP/1.1" 503 - 0 57 15 15 "10.150.0.8" "istio/fortio-0.7.2" "b37f694a-0dba-91ca-a0f3-9e18e7469c8d" "35.199.51.155" "10.36.0.6:8080"
k logs -n istio echo-svc-deployment1-6cb54fc-vsrfc istio-proxy | grep "run=1" | wc -l
1991
so indeed 9 missing on the outcall
$ k logs -n istio echo-svc-deployment1-6cb54fc-vsrfc echosrv | grep "run=1" | wc -l
991
interesting... the fortio app never saw those 9 calls
@lizan can you clarify UC exactly - is it connection timeout, connection and empty read + eof, something else ?
I'll double check the code on fortio's side but... the printing is the first thing it does:
https://github.com/istio/fortio/blob/master/ui/uihandler.go#L576-L578
hooked to the default go http server through
https://github.com/istio/fortio/blob/master/ui/uihandler.go#L865
ps: why are envoy logs not showing the authority/host/service names anymore but just ips ?
Definitely not a timeout, which would result in a 504. Likely an empty read + EOF.
How can these be explained? When are these counters incremented?
@andraxylia Do you still have the reproduced cluster? What is the access log of that proxy? (fetched by kubectl logs).
I have it running continuously on istio-perf-0-6 project if you want to look around
here are some envoy stats I captured around a 503:
cluster.in.8080.internal.upstream_rq_200: 455
cluster.in.8080.internal.upstream_rq_2xx: 455
cluster.in.8080.internal.upstream_rq_503: 1
cluster.in.8080.internal.upstream_rq_5xx: 1
cluster.in.8080.membership_healthy: 1
cluster.in.8080.membership_total: 1
cluster.in.8080.upstream_cx_destroy_remote_with_active_rq: 1
cluster.in.8080.upstream_cx_destroy_with_active_rq: 1
cluster.in.8080.upstream_cx_http1_total: 455
cluster.in.8080.upstream_cx_rx_bytes_total: 75083
cluster.in.8080.upstream_cx_total: 455
cluster.in.8080.upstream_cx_tx_bytes_total: 234291
cluster.in.8080.upstream_rq_200: 455
cluster.in.8080.upstream_rq_2xx: 455
cluster.in.8080.upstream_rq_503: 1
cluster.in.8080.upstream_rq_5xx: 1
cluster.in.8080.upstream_rq_pending_total: 455
cluster.in.8080.upstream_rq_total: 456
if I read this correctly it means the connection was successful, but
cluster.in.8080.upstream_cx_destroy_remote_with_active_rq: 1
cluster.in.8080.upstream_cx_destroy_with_active_rq: 1
can anything be gleaned from those 2 counters ?
ps: what is upstream_rq_pending_total ? pending what ?
looking at lsof, envoy on the node has about 88-92 open sockets
fortio has 3 most of the time with some spikes
root@gke-istio-perf-default-pool-4d15d772-fngb:~# lsof -p 26900
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
fortio 26900 root cwd DIR 0,124 4096 114498 /
fortio 26900 root rtd DIR 0,124 4096 114498 /
fortio 26900 root txt REG 8,1 8293120 1828913 /usr/local/bin/fortio
fortio 26900 root 0r FIFO 0,10 0t0 114510 pipe
fortio 26900 root 1w FIFO 0,10 0t0 114511 pipe
fortio 26900 root 2w FIFO 0,10 0t0 114512 pipe
fortio 26900 root 3u sock 0,7 0t0 114603 can't identify protocol
fortio 26900 root 4u 0000 0,11 0 7212 anon_inode
fortio 26900 root 5u sock 0,7 0t0 110448 can't identify protocol
fortio 26900 root 6u sock 0,7 0t0 114611 can't identify protocol
fortio 26900 root 7u sock 0,7 0t0 2110099 can't identify protocol
fortio 26900 root 8u sock 0,7 0t0 2110112 can't identify protocol
fortio 26900 root 9u sock 0,7 0t0 2110095 can't identify protocol
fortio 26900 root 10u sock 0,7 0t0 2110096 can't identify protocol
fortio 26900 root 11u sock 0,7 0t0 2110113 can't identify protocol
fortio 26900 root 12u sock 0,7 0t0 2110114 can't identify protocol
fortio 26900 root 13u sock 0,7 0t0 2110087 can't identify protocol
fortio 26900 root 14u sock 0,7 0t0 2110088 can't identify protocol
fortio 26900 root 15u sock 0,7 0t0 2110097 can't identify protocol
fortio 26900 root 16u sock 0,7 0t0 2110117 can't identify protocol
I can try to add internal metrics but I don't see a smoking gun so far
also will try go1.10 just in case there is an issue in go1.8.7
@lizan I decreased the CDS connect_timeout to 100ms to force the repro, and I can do a simple repro by calling from Ingress to echosrv1 or to echosrv2 (no more chaining). Echosrv1 has mixer cache enabled, and echosrv2 has mixer cache disabled. The envoy stats are these:
cluster.out.echosrv1.istio.svc.cluster.local|http-echo.upstream_cx_connect_fail: 12
cluster.out.echosrv1.istio.svc.cluster.local|http-echo.upstream_rq_pending_failure_eject: 10
I would like to understand the meaning of cx_connect_fail. Does the connect_timeout in CDS include the chain from Ingress -> Envoy -> Mixer -> Echosrv1, or simply Ingress -> Envoy? If the former, we may have just a performance issue with Mixer. If the latter, it is an Envoy issue.
I emailed you my cluster info. The Envoy logs are attached; the 503s at the end are more relevant.
upstream_rq_pending_total
Once envoy creates a pending request because there is no active connection available, this is incremented (and never decremented). As long as you don't see upstream_rq_pending_active, it means those pending requests were processed.
@andraxylia from the ingress.log you attached, it doesn't look like the ingress is generating the 503s; the 503s in the log are from its upstream. For example:
[2018-03-02 07:50:31.823][17][debug][http] external/envoy/source/common/http/async_client_impl.cc:89] async http request response headers (end_stream=false):
[2018-03-02 07:50:31.824][17][debug][http] external/envoy/source/common/http/async_client_impl.cc:92] ':status':'503'
[2018-03-02 07:50:31.824][17][debug][http] external/envoy/source/common/http/async_client_impl.cc:92] 'content-length':'19'
[2018-03-02 07:50:31.824][17][debug][http] external/envoy/source/common/http/async_client_impl.cc:92] 'content-type':'text/plain'
[2018-03-02 07:50:31.824][17][trace][http] external/envoy/source/common/http/async_client_impl.cc:104] async http request response data (length=19 end_stream=true)
I'll take a look at your cluster.
got one complete stream log from echosrv1 sidecar:
[2018-03-02 22:27:32.337][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:447] [C1175][S2595374538763107673] request headers complete (end_stream=true):
[2018-03-02 22:27:32.337][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] ':authority':'35.197.53.122'
[2018-03-02 22:27:32.337][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'user-agent':'istio/fortio-0.7.3-pre'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-on-behalf-of':'104.132.11.85:40511'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-forwarded-for':'10.240.0.19'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-forwarded-proto':'http'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-envoy-internal':'true'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-request-id':'5c92dd16-2fd9-95e2-98f5-02405848440b'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-envoy-decorator-operation':'fortio1-redir'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-b3-traceid':'784a9efe5167ea81'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-b3-spanid':'784a9efe5167ea81'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-b3-sampled':'1'
[2018-03-02 22:27:32.338][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-ot-span-context':'784a9efe5167ea81;784a9efe5167ea81;0000000000000000'
[2018-03-02 22:27:32.339][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-istio-attributes':'CkgKCnNvdXJjZS51aWQSOhI4a3ViZXJuZXRlczovL2lzdGlvLWluZ3Jlc3MtNzc2NGI2NjY1OS14bTQ3di5pc3Rpby1zeXN0ZW0='
[2018-03-02 22:27:32.339][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'x-envoy-original-path':'/fortio1/fortio/fetch/echosrv2:8080/echo'
[2018-03-02 22:27:32.339][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] 'content-length':'0'
[2018-03-02 22:27:32.339][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] ':path':'/fortio/fetch/echosrv2:8080/echo'
[2018-03-02 22:27:32.339][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:452] [C1175][S2595374538763107673] ':method':'GET'
[2018-03-02 22:27:32.421][23][debug][router] external/envoy/source/common/router/router.cc:249] [C1175][S2595374538763107673] cluster 'in.8080' match for URL '/fortio/fetch/echosrv2:8080/echo'
[2018-03-02 22:27:32.422][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] ':authority':'35.197.53.122'
[2018-03-02 22:27:32.422][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'user-agent':'istio/fortio-0.7.3-pre'
[2018-03-02 22:27:32.422][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-on-behalf-of':'104.132.11.85:40511'
[2018-03-02 22:27:32.422][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-forwarded-for':'10.240.0.19'
[2018-03-02 22:27:32.422][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-forwarded-proto':'http'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-request-id':'5c92dd16-2fd9-95e2-98f5-02405848440b'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-b3-traceid':'784a9efe5167ea81'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-b3-spanid':'784a9efe5167ea81'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-b3-sampled':'1'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-ot-span-context':'784a9efe5167ea81;784a9efe5167ea81;0000000000000000'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-envoy-original-path':'/fortio1/fortio/fetch/echosrv2:8080/echo'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'content-length':'0'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] ':path':'/fortio/fetch/echosrv2:8080/echo'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] ':method':'GET'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] 'x-envoy-internal':'true'
[2018-03-02 22:27:32.423][23][debug][router] external/envoy/source/common/router/router.cc:297] [C1175][S2595374538763107673] ':scheme':'http'
[2018-03-02 22:27:32.473][23][debug][router] external/envoy/source/common/router/router.cc:948] [C1175][S2595374538763107673] pool ready
[2018-03-02 22:27:32.474][23][debug][router] external/envoy/source/common/router/router.cc:467] [C1175][S2595374538763107673] upstream reset
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:932] [C1175][S2595374538763107673] encoding headers via codec (end_stream=false):
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:937] [C1175][S2595374538763107673] ':status':'503'
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:937] [C1175][S2595374538763107673] 'content-length':'57'
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:937] [C1175][S2595374538763107673] 'content-type':'text/plain'
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:937] [C1175][S2595374538763107673] 'date':'Fri, 02 Mar 2018 22:27:32 GMT'
[2018-03-02 22:27:32.474][23][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:937] [C1175][S2595374538763107673] 'server':'envoy'
[2018-03-02 22:27:32.472][23][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:225] [C1261] attaching to next request
[2018-03-02 22:27:32.473][23][debug][router] external/envoy/source/common/router/router.cc:948] [C1175][S2595374538763107673] pool ready
[2018-03-02 22:27:32.473][23][debug][connection] external/envoy/source/common/network/connection_impl.cc:439] [C1261] remote close
[2018-03-02 22:27:32.473][23][debug][connection] external/envoy/source/common/network/connection_impl.cc:134] [C1261] closing socket: 0
[2018-03-02 22:27:32.473][23][debug][client] external/envoy/source/common/http/codec_client.cc:70] [C1261] disconnect. resetting 1 pending requests
[2018-03-02 22:27:32.473][23][debug][client] external/envoy/source/common/http/codec_client.cc:91] [C1261] request reset
[2018-03-02 22:27:32.474][23][debug][router] external/envoy/source/common/router/router.cc:467] [C1175][S2595374538763107673] upstream reset
Looks like it is really from echosrv, @ldemailly ?
try istio/fortio:0.7.3-pre1 as the fortio image (it contains https://github.com/istio/fortio/pull/173), with the url being .../fortio1/echo?close=1 instead of the full .../fortio1/fortio/fetch/echosrv2:8080/echo. It will simulate an http server that closes the connection on each call and may trigger the same issue.
(i'll try too but I have to run/rest)
just built a version with go1.10: istio/fortio:0.7.3-pre3
I was able to reproduce connection timeout without envoy using localhost:8080/fortio/fetch/localhost:8080/echo
on my mac at least it's related to sockets left in TIME_WAIT state, eventually I think we probably run out and get errors
In fortio logs I saw a handful of
18:34:10 E http.go:598> Unable to connect to {::1 8080 } : dial tcp [::1]:8080: connect: operation timed out
and lsof of fortio showed
fortio 7118 ldemailly 14u IPv6 0x35837db4c5f5e18f 0t0 TCP localhost:58899->localhost:http-alt (SYN_SENT)
which should never happen (syn loss on ::1)
# Before a run
ldemailly-macbookpro:fortio ldemailly$ netstat -na -f inet6|grep 8080|grep TIME |wc
0 0 0
# During a self chained run
ldemailly-macbookpro:fortio ldemailly$ netstat -na -f inet6|grep 8080|grep TIME |wc
2649 15894 209271
ldemailly-macbookpro:fortio ldemailly$ netstat -na -f inet6|grep 8080|grep TIME |wc
3297 19782 260463
ldemailly-macbookpro:fortio ldemailly$ netstat -na -f inet6|grep 8080|grep TIME |wc
3817 22902 301543
ldemailly-macbookpro:fortio ldemailly$ netstat -na -f inet6|grep 8080|grep TIME |wc
4273 25638 337567
# keeps growing...
also thinking this relates to https://github.com/envoyproxy/envoy/issues/2066
still getting errors with go1.10 and no TIME_WAIT sockets on the node (the TIME_WAIT buildup seems to be a separate mac issue)
please use istio/fortio:0.7.3-pre5
or look at istio-perf-0-6, run=9 is running with 8 connections at 80 qps (so 10qps per connection, from the VM)
getting about 1% errors / not connected on the echosrv1 middle step
echo-svc-deployment1-85f857cd99-blwp9 on node gke-istio-perf-default-pool-4d15d772-rq7n
I'm trying to use tcpdump on the node but I'm finding it difficult to find the traffic I care about
probably a debug image on the pod would help. will try that later - but let me know if you find something
A trace log for a 503 upstream connection:
[2018-03-03 00:11:36.931][20][debug][client] external/envoy/source/common/http/codec_client.cc:23] [C581] connecting
[2018-03-03 00:11:36.931][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:559] [C581] connecting to 127.0.0.1:8080
[2018-03-03 00:11:36.932][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:568] [C581] connection in progress
[2018-03-03 00:11:37.178][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:377] [C581] socket event: 2
[2018-03-03 00:11:37.178][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:445] [C581] write ready
[2018-03-03 00:11:37.180][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:455] [C581] connected
[2018-03-03 00:11:37.180][20][debug][client] external/envoy/source/common/http/codec_client.cc:52] [C581] connected
[2018-03-03 00:11:37.181][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:225] [C581] attaching to next request
[2018-03-03 00:11:37.184][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:314] [C581] writing 506 bytes, end_stream false
[2018-03-03 00:11:37.184][20][trace][connection] external/envoy/source/common/network/raw_buffer_socket.cc:67] [C581] write returns: 506
[2018-03-03 00:11:37.459][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:377] [C581] socket event: 2
[2018-03-03 00:11:37.459][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:445] [C581] write ready
[2018-03-03 00:11:37.487][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:377] [C581] socket event: 3
[2018-03-03 00:11:37.487][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:445] [C581] write ready
[2018-03-03 00:11:37.487][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:415] [C581] read ready
[2018-03-03 00:11:37.488][20][trace][connection] external/envoy/source/common/network/raw_buffer_socket.cc:25] [C581] read returns: 166
[2018-03-03 00:11:37.488][20][trace][connection] external/envoy/source/common/network/raw_buffer_socket.cc:25] [C581] read returns: 0
[2018-03-03 00:11:37.488][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:298] [C581] parsing 166 bytes
[2018-03-03 00:11:37.489][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=content-length value=0
[2018-03-03 00:11:37.489][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=date value=Sat, 03 Mar 2018 00:11:36 GMT
[2018-03-03 00:11:37.489][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=content-type value=text/plain; charset=utf-8
[2018-03-03 00:11:37.490][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=x-envoy-upstream-service-time value=49
[2018-03-03 00:11:37.490][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:354] [C581] headers complete
[2018-03-03 00:11:37.491][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=server value=envoy
[2018-03-03 00:11:37.497][20][debug][client] external/envoy/source/common/http/codec_client.cc:81] [C581] response complete
[2018-03-03 00:11:37.501][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:200] [C581] response complete
[2018-03-03 00:11:37.502][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:225] [C581] attaching to next request
[2018-03-03 00:11:37.503][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:314] [C581] writing 509 bytes, end_stream false
[2018-03-03 00:11:37.504][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:315] [C581] parsed 166 bytes
[2018-03-03 00:11:37.504][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:439] [C581] remote close
[2018-03-03 00:11:37.505][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:134] [C581] closing socket: 0
[2018-03-03 00:11:37.505][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:298] [C581] parsing 0 bytes
[2018-03-03 00:11:37.505][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:315] [C581] parsed 0 bytes
[2018-03-03 00:11:37.506][20][debug][client] external/envoy/source/common/http/codec_client.cc:70] [C581] disconnect. resetting 1 pending requests
[2018-03-03 00:11:37.506][20][debug][client] external/envoy/source/common/http/codec_client.cc:91] [C581] request reset
[2018-03-03 00:11:37.520][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:115] [C581] client disconnected
This looks weird because envoy read 166 bytes and parsed them, but ended with resetting 1 pending requests. Will follow the source around codec_impl later.
ah ! nice ! smoking gun ! gn
Now I believe this is an issue at Envoy, filed https://github.com/envoyproxy/envoy/issues/2715.
awesome! glad it looks like another half close related issue.
thanks a lot @ldemailly @lizan this is quite a finding!
I added #3987 for more checks - it may help to find another client/server than fortio to demonstrate/repro issues in the future even if this one is fairly clearly envoy's
@lizan / @PiotrSikora do you have an idea why this issue only triggers in a small % of cases? I would expect that if we put a closed socket back in the pool we'd get 50% errors (each time it's done 1 good request, the next one would be bad)
Well, does echosrv close the socket after each request? or only after X requests?
when using the fetch/... url, after each request - which is why I'm curious how that turns into a small %
the other weirdness is that it seems to happen only with mTLS (from andra's tests), so is something related to the timing / some race involved?
I don't think that's true? Trace logs from @lizan show EOF received by the raw buffer socket and not TLS.
This 503 only happens if a connection pool is busy enough, i.e. at the moment a connection in the pool reads a response completely from echosrv, the pool already has the next request to write to the upstream immediately. You only observe this in the mTLS case because with mTLS a lower qps is enough to make the pool that busy.
but upstream has closed the connection, so why isn't it always failing every other call?
Envoy doesn't always have a next request queued. I think if you turn the qps up high enough, the rate of 503s will get close to 50%?
k so if there is no next request immediately, how come in that case it discards the dead socket ? (and doesn't in the busy case)
The RemoteClose event right after onData will trigger it.
👍 I see, thank you. So it is clearly a bug if it handles the close correctly when there is enough idle time between requests but not when there isn't. We could try increasing the qps to see whether the error rate converges toward 50% then.
I am seeing this issue in 0.8.0 installed with default helm settings (thus I believe mTLS off) and nothing else changed. If I followed this thread correctly it's currently believed that the underlying issue is this: https://github.com/envoyproxy/envoy/issues/2715. I believe the committed fix for that issue should be in 0.8.0 yet I am still seeing 503's at 100 qps.
@lizan is there any issue left in Envoy?
@andraxylia I'm not aware of any. @kinghrothgar can you paste one line of your access log for the 503? I'd like to confirm whether it is the same root cause.
envoyproxy/envoy#2715 only happens with certain upstream apps which don't send connection: close and then close the connection.
It's likely we are uncovering other issues.
I can't figure out how to turn on debug logging for istio-proxy, and I assume I need to, as currently this is the only information I get:
[hello-world-795769b484-474h4] [2018-06-14T15:24:23.623Z] "GET / HTTP/1.1" 503 UC 0 57 0 - "-" "istio/fortio-0.11.0" "49055e90-5ead-982d-aeb8-4baf2860f33d" "hello-world.default.svc.cluster.local" "127.0.0.1:80"
Edit: And this is the access log line:
{"level":"info","time":"2018-06-14T15:54:58.860458Z","instance":"accesslog.logentry.istio-system","apiClaims":"","apiKey":"","apiName":"","apiVersion":"","connectionMtls":false,"destinationIp":"10.28.0.29","destinationNamespace":"default","destinationService":"hello-world.default.svc.cluster.local","latency":"287.21µs","method":"GET","originIp":"0.0.0.0","protocol":"http","receivedBytes":507,"referer":"","requestOperation":"","requestSize":0,"responseCode":503,"responseSize":57,"responseTimestamp":"2018-06-14T15:54:58.860678Z","sentBytes":149,"sourceIp":"10.28.3.22","sourceNamespace":"default","sourceService":"nginx.default.svc.cluster.local","sourceUser":"kubernetes://nginx-8854ff995-c7x8w.default","url":"/","userAgent":"istio/fortio-0.11.0"}
503 UC is good enough info (though verbose logging would have more details)
Is there any more info from me that would be useful?
As far as the "Why isn't it failing 50% of the time" topic, the highest % of 503's I could get was a little over 1%.
@kinghrothgar Does your server side/app close the connection without giving a Connection: close header ? Because that's the current claim on the envoy side, that it only happens in that case.
otherwise what you may be seeing is #1038 (503s during changes/updates)
Team, we need to figure out what specifically can be done within the next couple of weeks (for 1.0 launch)
@ldemailly and @lizan Did either of you try increasing the QPS to see if that made a difference?
@kinghrothgar can you provide more specifics?
I got a tcpdump of the issue. It appears the TCP connection is getting closed without an HTTP connection close header being sent. The link below is a pcap with the 2 TCP streams. One is the connection between the nginx pod and envoy, the other is the one between envoy and the python app over localhost. Please let me know if there's any more information that would be helpful.
@ldemailly are you looking into this?
@PiotrSikora - it seems to be the connection-reused-after-close issue again.
HTTP close without a close header is generally ok and common - but if the close happens immediately, I think it's normal to get a 503 and I'm not aware of any way to avoid it: envoy wouldn't have time to process the close event before sending. It's also not common behavior - the close without a header typically happens when the server clears up keep-alive connections or when the server has problems (in which case a 503 is likely the correct answer).
andra I am not, as I already filed and have a repro for the original envoy problem. I was just trying to clarify whether this new report is indeed the same issue or a new one (or the route change one)
I see https://github.com/envoyproxy/envoy/issues/2715 is fixed and it's clear there will be some timing situations where we cannot avoid the 503s for applications that do not send close. There is nothing more we can do in istio, so closing this one.
which istio release is the first that has the fix?
0.8
@ldemailly
in istio 0.8, I hit the same error: for a tiny percentage of requests the sidecar envoy returns 503, without using any rule. The log is below:
[2018-07-06T06:11:12.983Z] "POST /api/ HTTP/1.1" 503 UC 16687 57 0 - "-" "Mozilla/5.0 (pc-x86_64-linux-gnu) Siege/4.0.4" "41344284-c914-9db1-a30a-983e74455227" "10.20.10.20:8888" "127.0.0.1:8888"
I saw the discussion here and in the envoy reference; it seems that the problem has not been solved, is that right?
UC means upstream connection error. Is this log from the last envoy in the chain?
@wansuiye what application do you run?
Please see this comment, you may still see 503s if your app does not send HTTP close:
https://github.com/istio/istio/issues/3295#issuecomment-400789923
@ldemailly I use istio 0.8.0, and I think that envoy is the last one in the chain
@andraxylia I run sentry, which is an open-source error tracking app. In order to locate the problem, I added nginx in front of the app, so the traffic path was envoy -> nginx -> app. Strangely enough, once nginx was added there were no more errors (nginx used the default configuration). Then I adjusted the nginx configuration: when the client_body_timeout parameter is tuned small, such as 5 seconds, envoy hits the same error, and at the same time nginx reports 408 errors.
Now I do not know if it's the impact of client request timeouts or just a coincidence.
nginx log:
{"log":"127.0.0.1 - - [10/Jul/2018:02:39:25 +0000] \"POST /api/7/store/ HTTP/1.1\" 408 0 \"-\" \"sentry-java/1.7.3-c2920\" \"10.21.10.12\"\n","stream":"stdout","time":"2018-07-10T02:39:25.912227157Z"}
{"log":"127.0.0.1 - - [10/Jul/2018:02:41:29 +0000] \"POST /api/7/store/ HTTP/1.1\" 408 0 \"-\" \"sentry-java/1.7.3-c2920\" \"10.21.10.12\"\n","stream":"stdout","time":"2018-07-10T02:41:29.760106561Z"}
{"log":"127.0.0.1 - - [10/Jul/2018:02:41:30 +0000] \"POST /api/7/store/ HTTP/1.1\" 408 0 \"-\" \"sentry-java/1.7.3-c2920\" \"10.21.10.12\"\n","stream":"stdout","time":"2018-07-10T02:41:30.695566241Z"}
envoy log:
{"log":"[2018-07-10T02:39:20.341Z] \"POST /api/7/store/ HTTP/1.1\" 503 UC 4592 57 5570 - \"10.21.10.12\" \"sentry-java/1.7.3-c2920\" \"b643df69-dbcb-9b8a-8f1b-37c0786fd4\" \"10.21.10.12:9000\" \"127.0.0.1:9000\"\n","stream":"stdout","time":"2018-07-10T02:39:28.798012917Z"}
{"log":"[2018-07-10T02:41:22.888Z] \"POST /api/7/store/ HTTP/1.1\" 503 UC 16072 57 6872 - \"10.21.10.12\" \"sentry-java/1.7.3-c2920\" \"11a6084a-00bc-9715-a687-32134212de\" \"10.21.10.12:9000\" \"127.0.0.1:9009\"\n","stream":"stdout","time":"2018-07-10T02:41:38.798207467Z"}
{"log":"[2018-07-10T02:41:25.229Z] \"POST /api/7/store/ HTTP/1.1\" 503 UC 4592 57 5465 - \"10.21.10.12\" \"sentry-java/1.7.3-c2920\" \"d61dc150-b5dc-9a06-a6d8-5097b6e8ac\" \"10.21.10.12:9000\" \"127.0.0.1:9009\"\n","stream":"stdout","time":"2018-07-10T02:41:38.798384762Z"}
From the tcpdump of https://github.com/istio/istio/issues/3295#issuecomment-400413798, it looks like the app is responding with HTTP/1.0 without connection: close, is that your case too? @wansuiye
@lizan the app is responding with HTTP/1.1 without connection: close, not HTTP/1.0
I have run into this issue as well.
[xxx] [2018-10-02T16:12:44.409Z] "GET /glb HTTP/1.1" 503 UC 0 57 1 - "-" "GoogleHC/1.0" "ecb1c797-a58f-9c53-be4f-857bc2884456" "10.100.0.100" "127.0.0.1:3000"
It appears to affect apps running Node/express but not Play. curl -v indicates that the Node app responds with Connection: keep-alive, but the Play app does not send that response header. The issue seems to go away by setting Connection: close in the response.
Hey,
I just ran into what seemed to be the same issue. Our system is running ambassador with istio 1.0.5 (mTLS enabled) and a node/express server serving some static files. Usually that service gets hit with a few requests in a short burst, and sometimes one of those requests fails with a 503 UC. I've added Connection: close to the response header and it seems to fix the 503 responses, though I have the same question as @jsw regarding the recommended approach and so forth.
I'd highly appreciate it if someone would take the time to clarify this, and maybe it should be documented somewhere?