Requests to ingress-nginx, reverse proxied to a backend service, consistently get 40-50% 503 Service Unavailable responses from the linkerd-proxy on the ingress-nginx side when a large number of clients are connecting. In addition, when this happens, the Linkerd dashboard still shows a 100% success rate for both ingress-nginx and the backend service.
Repro setup is ingress-nginx reverse proxying a service with multiple replicas, both injected with linkerd-proxy but skipping inbound traffic for nginx. With a large number (~1000) of clients connecting to nginx (yay for load testing tools), linkerd-proxy will start giving back 503 Service Unavailable responses to nginx for many of the requests, while the Linkerd dashboard will still show a 100.00% success rate.
Use the following script to set up a two-node Kubernetes cluster with kind and install Linkerd, ingress-nginx, and Kuard; then run the hey HTTP load generator to reproduce the issue.
https://gist.github.com/praseodym/57e99d161020aa1a2343aa1141873eed
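For anyone skimming, this is roughly what the script does; the gist is the authoritative version, and the file/resource names below are assumptions on my part:

```
# Rough sketch of the repro setup (names are illustrative; see the gist for the real script).
kind create cluster --config kind-config.yaml    # two-node cluster (config file name assumed)
linkerd install | kubectl apply -f -             # install Linkerd
kubectl apply -f kuard.yaml                      # backend service with multiple replicas (manifest name assumed)

# Inject the proxy into nginx, skipping inbound interception on its ports:
kubectl -n ingress-nginx get deploy ingress-nginx-controller -o yaml \
  | linkerd inject --skip-inbound-ports 80,443 - \
  | kubectl apply -f -

# Load test with many concurrent HTTP/1.1 clients:
hey -z 10s -c 1000 -q 1 https://localhost/
```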
Thank you for the repro! That's gonna be super helpful.
@grampelberg I can reproduce this issue. These are the errors I see in the linkerd-proxy logs:
[ 273.148855076s] DEBUG outbound:accept{peer.addr=10.244.0.2:36078}: hyper::proto::h1::io: read 0 bytes
[ 273.148888208s] DEBUG outbound:accept{peer.addr=10.244.0.2:36078}: hyper::proto::h1::conn: read eof
[ 273.148976074s] DEBUG outbound:accept{peer.addr=10.244.0.2:36078}: hyper::proto::h1::conn: error shutting down IO: Transport endpoint is not connected (os error 107)
[ 273.149029291s] DEBUG outbound:accept{peer.addr=10.244.0.2:36078}: tokio_reactor: dropping I/O source: 299
[ 273.149067314s] DEBUG outbound:accept{peer.addr=10.244.0.2:36078}: linkerd2_app_core::accept_error: Connection failed error=error shutting down connection: Transport endpoint is not connected (os error 107)
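For reference, I got these by bumping the proxy log level to debug; roughly something like this (I'm assuming the namespace-level annotation plus a restart is enough for the injector to pick it up):

```
# Raise the proxy log level for pods injected in this namespace (assumed approach),
# restart nginx so the new level takes effect, then tail the proxy logs.
kubectl annotate ns ingress-nginx config.linkerd.io/proxy-log-level=debug --overwrite
kubectl -n ingress-nginx rollout restart deploy ingress-nginx-controller
kubectl -n ingress-nginx logs deploy/ingress-nginx-controller -c linkerd-proxy -f
```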
That makes it sound like the remote pod is shutting down?
Those logs are from the outbound accept side of the proxy, and indicate that the client which the proxy accepted a connection from has closed the connection. If this is from the proxy injected into a nginx ingress controller pod, this means that nginx opened a TCP connection to Linkerd (presumably to forward a request upstream), and closed the connection before sending a request.
Please also take a look at the linkerd-proxy logs in my gist, which show timeouts generated by the proxy as the source of the 503s.
@praseodym I'm still looking into this, but I just want to confirm that for this repro (and the original situation where you ran into this) this is http1 traffic?
hey defaults to http1, and I can repro this with the steps you provided, but I cannot when passing the -h2 flag.
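For clarity, these are the two invocations I'm comparing (the URL is from the repro script; adjust for your setup):

```
# HTTP/1.1 (hey's default) — reproduces the 503s
hey -z 10s -c 1000 -q 1 https://localhost/

# HTTP/2 — no 503s observed
hey -z 10s -c 1000 -q 1 -h2 https://localhost/
```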
@kleimkuhler Yes, this was using http1 traffic to nginx.
I too see no problems when the -h2 flag is passed to hey. I find this interesting because, as far as I know, nginx by default uses http1 to talk to the backend service even when incoming connections use http2. Nginx also has --skip-inbound-ports 80,443 set, so linkerd-proxy shouldn't interfere with inbound connections (whether http1 or http2) at all.
This points to it being something connection-oriented. My assumption is that the h1 test initiates _far_ more TCP connections than the h2 test.
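If someone wants to sanity-check that assumption, something like this should show the upstream connection count during each run (assuming ss is available in the nginx container, and that kuard listens on port 8080 as in the repro):

```
# Count established upstream connections from the nginx pod while each test runs.
# Port 8080 is the kuard service port from the repro (an assumption).
kubectl -n ingress-nginx exec deploy/ingress-nginx-controller -- \
  sh -c "ss -tn state established '( dport = :8080 )' | wc -l"
```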
So, I think we've fixed at least one manifestation of this issue.
Testing the proxy, we see the following with 2.7.1:
:; hey -z 10s -q 500 -c 1000 -host h1.test.example.com http://127.0.0.1:4140
Summary:
Total: 10.5863 secs
Slowest: 2.1203 secs
Fastest: 0.1966 secs
Average: 1.0250 secs
Requests/sec: 950.8482
Response time histogram:
0.197 [1] |
0.389 [211] |■
0.581 [238] |■
0.774 [443] |■■
0.966 [596] |■■■
1.158 [7961] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
1.351 [268] |■
1.543 [81] |
1.736 [82] |
1.928 [82] |
2.120 [103] |■
Latency distribution:
10% in 0.8101 secs
25% in 1.0443 secs
50% in 1.0504 secs
75% in 1.0577 secs
90% in 1.0748 secs
95% in 1.1976 secs
99% in 1.9323 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0009 secs, 0.1966 secs, 2.1203 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0000 secs, 0.0000 secs, 0.0641 secs
resp wait: 0.9923 secs, 0.1445 secs, 1.2311 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0012 secs
Status code distribution:
[200] 8792 responses
[503] 1274 responses
But we see no such errors on master:
:; hey -z 10s -q 500 -c 1000 -host h1.test.example.com http://127.0.0.1:4140
Summary:
Total: 10.4329 secs
Slowest: 1.4775 secs
Fastest: 0.2569 secs
Average: 0.7664 secs
Requests/sec: 1281.5238
Response time histogram:
0.257 [1] |
0.379 [193] |■
0.501 [388] |■
0.623 [505] |■■
0.745 [662] |■■
0.867 [11276] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.989 [70] |
1.111 [95] |
1.233 [50] |
1.355 [70] |
1.478 [60] |
Latency distribution:
10% in 0.6754 secs
25% in 0.7702 secs
50% in 0.7803 secs
75% in 0.7982 secs
90% in 0.8132 secs
95% in 0.8222 secs
99% in 1.2147 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0013 secs, 0.2569 secs, 1.4775 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0002 secs, 0.0000 secs, 0.1513 secs
resp wait: 0.7472 secs, 0.1656 secs, 1.0301 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0009 secs
Status code distribution:
[200] 13370 responses
This change will be included in tomorrow's edge release (edge-20.5.1); once we release that, it would be good for you to test it to confirm that the repro we've been testing matches your scenario.
@olix0r Thanks! What change in master fixes the issue? Also, will upgrading just linkerd-proxy be sufficient or should the entire control plane be upgraded?
@praseodym stable-2.7.1's changes were mostly focused on changing how the proxy caches clients (and how it uses buffers to do that). We had unintentionally left an extra buffer on the inbound client and it was removed here.
That said, I remain concerned about our performance under high concurrency and, while I think we've got things manageable so that we're not hitting the 1s timeout, there's more work for us to do here.
I've tried the repro with edge-20.5.1 and unfortunately I'm still seeing a high number of 503 errors:
> hey -z 10s -q 500 -c 1000 -q 1 https://localhost/
Summary:
Total: 11.8417 secs
Slowest: 4.0662 secs
Fastest: 0.1455 secs
Average: 2.5919 secs
Requests/sec: 324.6162
Response time histogram:
0.146 [1] |
0.538 [188] |■■■
0.930 [131] |■■
1.322 [96] |■■
1.714 [100] |■■
2.106 [184] |■■■
2.498 [244] |■■■■
2.890 [418] |■■■■■■■
3.282 [2381] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
3.674 [89] |■
4.066 [12] |
Latency distribution:
10% in 1.2003 secs
25% in 2.5285 secs
50% in 2.9711 secs
75% in 3.0175 secs
90% in 3.0437 secs
95% in 3.0591 secs
99% in 3.4496 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0412 secs, 0.1455 secs, 4.0662 secs
DNS-lookup: 0.0253 secs, 0.0000 secs, 0.3061 secs
req write: 0.0001 secs, 0.0000 secs, 0.0111 secs
resp wait: 2.5146 secs, 0.0205 secs, 4.0646 secs
resp read: 0.0002 secs, 0.0000 secs, 0.0125 secs
Status code distribution:
[200] 3174 responses
[503] 670 responses
Relevant nginx linkerd-proxy logs:
[ 361.90622838s] DEBUG outbound:accept{peer.addr=10.244.0.5:52872}:source{target.addr=10.108.138.233:8080}: linkerd2_app_outbound::endpoint: using dst-override addr=kuard.default.svc.cluster.local:8080
[ 361.90632306s] DEBUG outbound:accept{peer.addr=10.244.0.5:52872}:source{target.addr=10.108.138.233:8080}: linkerd2_app_outbound::endpoint: Setting target for request headers={"host": "localhost", "x-request-id": "e2ec477de685cd978f3f814f57a52d36", "x-real-ip": "172.18.0.1", "x-forwarded-for": "172.18.0.1", "x-forwarded-host": "localhost", "x-forwarded-port": "443", "x-forwarded-proto": "https", "x-scheme": "https", "l5d-dst-override": "kuard.default.svc.cluster.local:8080", "user-agent": "hey/0.0.1", "content-type": "text/html", "accept-encoding": "gzip"} uri=/ target.addr=kuard.default.svc.cluster.local:8080 http.settings=Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false }
[ 364.91897576s] WARN outbound:accept{peer.addr=10.244.0.5:52872}:source{target.addr=10.108.138.233:8080}: linkerd2_app_core::errors: Failed to proxy request: request timed out
[ 364.91908947s] DEBUG outbound:accept{peer.addr=10.244.0.5:52872}:source{target.addr=10.108.138.233:8080}: linkerd2_app_core::errors: Handling error with HTTP response status=503 Service Unavailable version=HTTP/1.1
@praseodym Thanks for testing!
I have a followup branch in the works which should further reduce latency.
I'm curious, though, can you share the output of your hey test without the proxy involved? It would be helpful to understand the raw latency profile of the service being tested. It's wild to me that we're seeing multi-second response latencies at all.
Also, are these workloads running with CPU requests/limits? Specifically, the Linkerd proxies are, by default, only configured with CPU requests when the control plane is installed with --ha. If not, we may want to try a CPU request to ensure we're not being starved.
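If you want to try that, something like this should work, using the standard proxy-cpu-request annotation (namespace and deployment names are from the repro, and I'm assuming a restart is enough for the injector to pick it up):

```
# Give each injected proxy a 1-CPU request (assumed approach), then restart the workloads.
kubectl annotate ns ingress-nginx default config.linkerd.io/proxy-cpu-request=1 --overwrite
kubectl -n ingress-nginx rollout restart deploy ingress-nginx-controller
kubectl -n default rollout restart deploy kuard   # deployment name assumed
```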
This is without linkerd-proxy injected into nginx:
```
> hey -z 10s -q 500 -c 1000 -q 1 https://localhost/
Summary:
Total: 10.2551 secs
Slowest: 0.4851 secs
Fastest: 0.0058 secs
Average: 0.1805 secs
Requests/sec: 975.1233
Response time histogram:
0.006 [1] |
0.054 [295] |■■■
0.102 [902] |■■■■■■■■■■
0.150 [2529] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.198 [3494] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.245 [1660] |■■■■■■■■■■■■■■■■■■■
0.293 [193] |■■
0.341 [74] |■
0.389 [225] |■■■
0.437 [281] |■■■
0.485 [346] |■■■■
Latency distribution:
10% in 0.0949 secs
25% in 0.1308 secs
50% in 0.1657 secs
75% in 0.2029 secs
90% in 0.2617 secs
95% in 0.4109 secs
99% in 0.4683 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0215 secs, 0.0058 secs, 0.4851 secs
DNS-lookup: 0.0093 secs, 0.0000 secs, 0.1723 secs
req write: 0.0002 secs, 0.0000 secs, 0.1086 secs
resp wait: 0.1045 secs, 0.0031 secs, 0.2668 secs
resp read: 0.0003 secs, 0.0000 secs, 0.0546 secs
Status code distribution:
[200] 10000 responses
```
linkerd-proxy isn't configured with any requests/limits. The repro is running in kind where all nodes can use all host resources, so setting resource requests won't have any effect.
@praseodym If you'd like to test that branch, you can annotate your workloads (or namespace) with
config.linkerd.io/proxy-version: ver-noblock.2020-05-08.1
In my local tests, this branch appears to be a substantial improvement over the last few releases, though these tests do put 1-cpu requests on each of the containers being tested.
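For anyone else following along, applying it might look like this (namespace and deployment names are from the repro; I'm assuming the namespace-level annotation plus a restart is enough for the injector to pick it up):

```
kubectl annotate ns ingress-nginx \
  config.linkerd.io/proxy-version=ver-noblock.2020-05-08.1 --overwrite
kubectl -n ingress-nginx rollout restart deploy ingress-nginx-controller
```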
@olix0r Thanks! That branch seems to work a lot better, with all requests getting back status code 200. There are still some latency spikes though:
> hey -z 10s -q 500 -c 1000 -q 1 https://localhost/
Summary:
Total: 10.6458 secs
Slowest: 1.7274 secs
Fastest: 0.0545 secs
Average: 0.4624 secs
Requests/sec: 939.3391
Response time histogram:
0.055 [1] |
0.222 [695] |■■■■■■■
0.389 [2804] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.556 [3753] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.724 [2514] |■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.891 [121] |■
1.058 [0] |
1.226 [0] |
1.393 [0] |
1.560 [0] |
1.727 [112] |■
Latency distribution:
10% in 0.2539 secs
25% in 0.3484 secs
50% in 0.4577 secs
75% in 0.5673 secs
90% in 0.6368 secs
95% in 0.6840 secs
99% in 1.6479 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0499 secs, 0.0545 secs, 1.7274 secs
DNS-lookup: 0.0123 secs, 0.0000 secs, 0.2107 secs
req write: 0.0002 secs, 0.0000 secs, 0.1051 secs
resp wait: 0.3838 secs, 0.0040 secs, 1.6382 secs
resp read: 0.0007 secs, 0.0000 secs, 0.1102 secs
Status code distribution:
[200] 10000 responses
I also encountered this problem. I used JMeter for testing, and the TPS dropped by 80% while returning a large number of 503 errors.
When I uninject Linkerd, all the responses are 200.
@Fly-Coder Did you try with the latest edge release? It includes all the improvements that were mentioned in this issue.
@muenchdo I am using the latest stable version, 2.7.1; I did not use the edge version.
@muenchdo How do I use the latest edge release? Do I download the file 'https://github.com/linkerd/linkerd2/releases/download/edge-20.5.3/linkerd2-cli-edge-20.5.3-linux' and then replace the CLI executable at '/root/.linkerd2/bin/linkerd'?
@muenchdo
I tried the latest edge version with 1000 concurrent connections, looping 10 times: no 503 errors, all responses 200, but the TPS dropped significantly.
I'm going to close this out as it appears that the 503s are taken care of. We'll keep looking at latency, but it is unsurprising in these types of constrained environments to see latency go up with the addition of the proxy.
Please reopen if you see 503s on current versions!
One thing that is not yet resolved is that 503 errors generated by the proxy are not reflected in the Linkerd dashboard. Is this something that should be handled in another issue?
@praseodym yeah, that'd be great!