Requests to ingress-nginx, reverse proxied to a backend service, consistently get 40-50% 503 Service Unavailable responses from the linkerd-proxy on the ingress-nginx side when a large number of clients are connecting. In addition, when this happens, the Linkerd dashboard still shows a 100% success rate for both ingress-nginx and the backend service.
Repro setup is ingress-nginx reverse proxying a service with multiple replicas, both injected with linkerd-proxy but skipping inbound traffic for nginx. With a large number (~1000) of clients connecting to nginx (yay for load testing tools), linkerd-proxy will start giving back 503 Service Unavailable responses to nginx for many of the requests, while the Linkerd dashboard will still show a 100.00% success rate.
Use the following script to set up a two-node Kubernetes cluster with kind and install Linkerd, ingress-nginx, and Kuard; then run the hey HTTP load generator to reproduce the issue.
https://gist.github.com/praseodym/57e99d161020aa1a2343aa1141873eed
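For anyone skimming, this is roughly what the script does; the gist is the authoritative version, and the file/resource names below are assumptions on my part:

```
# Rough sketch of the repro setup (names are illustrative; see the gist for the real script).
kind create cluster --config kind-config.yaml    # two-node cluster (config file name assumed)
linkerd install | kubectl apply -f -             # install Linkerd
kubectl apply -f kuard.yaml                      # backend service with multiple replicas (manifest name assumed)

# Inject the proxy into nginx, skipping inbound interception on its ports:
kubectl -n ingress-nginx get deploy ingress-nginx-controller -o yaml \
  | linkerd inject --skip-inbound-ports 80,443 - \
  | kubectl apply -f -

# Load test with many concurrent HTTP/1.1 clients:
hey -z 10s -c 1000 -q 1 https://localhost/
```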
Thank you for the repro! That's gonna be super helpful.
@grampelberg I can reproduce this issue. These are the errors I see in the linkerd-proxy logs:
[ 273.148855076s] DEBUG outbound:accept{peer.addr=10.244.0.2:36078}: hyper::proto::h1::io: read 0 bytes
[ 273.148888208s] DEBUG outbound:accept{peer.addr=10.244.0.2:36078}: hyper::proto::h1::conn: read eof
[ 273.148976074s] DEBUG outbound:accept{peer.addr=10.244.0.2:36078}: hyper::proto::h1::conn: error shutting down IO: Transport endpoint is not connected (os error 107)
[ 273.149029291s] DEBUG outbound:accept{peer.addr=10.244.0.2:36078}: tokio_reactor: dropping I/O source: 299
[ 273.149067314s] DEBUG outbound:accept{peer.addr=10.244.0.2:36078}: linkerd2_app_core::accept_error: Connection failed error=error shutting down connection: Transport endpoint is not connected (os error 107)
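For reference, I got these by bumping the proxy log level to debug; roughly something like this (I'm assuming the namespace-level annotation plus a restart is enough for the injector to pick it up):

```
# Raise the proxy log level for pods injected in this namespace (assumed approach),
# restart nginx so the new level takes effect, then tail the proxy logs.
kubectl annotate ns ingress-nginx config.linkerd.io/proxy-log-level=debug --overwrite
kubectl -n ingress-nginx rollout restart deploy ingress-nginx-controller
kubectl -n ingress-nginx logs deploy/ingress-nginx-controller -c linkerd-proxy -f
```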
That makes it sound like the remote pod is shutting down?
Those logs are from the outbound accept side of the proxy, and indicate that the client which the proxy accepted a connection from has closed the connection. If this is from the proxy injected into a nginx ingress controller pod, this means that nginx opened a TCP connection to Linkerd (presumably to forward a request upstream), and closed the connection before sending a request.
Please also take a look at the linkerd-proxy logs in my gist, which show timeouts generated by the proxy as the source of the 503s.
@praseodym I'm still looking into this, but I just want to confirm that for this repro (and the original situation where you ran into this) this is http1 traffic?
hey defaults to http1, and I can repro this with the steps you provided, but I cannot when passing the -h2 flag.
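For clarity, these are the two invocations I'm comparing (the URL is from the repro script; adjust for your setup):

```
# HTTP/1.1 (hey's default) — reproduces the 503s
hey -z 10s -c 1000 -q 1 https://localhost/

# HTTP/2 — no 503s observed
hey -z 10s -c 1000 -q 1 -h2 https://localhost/
```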
@kleimkuhler Yes, this was using http1 traffic to nginx.
I too see no problems when the -h2 flag is passed to hey. I find this interesting because, as far as I know, nginx by default uses http1 to talk to the backend service even when incoming connections use http2. Nginx also has --skip-inbound-ports 80,443 set, so linkerd-proxy shouldn't interfere with inbound connections (whether http1 or http2) at all.
This points to it being something connection-oriented. My assumption is that the h1 test initiates _far_ more TCP connections than the h2 test.
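If someone wants to sanity-check that assumption, something like this should show the upstream connection count during each run (assuming ss is available in the nginx container, and that kuard listens on port 8080 as in the repro):

```
# Count established upstream connections from the nginx pod while each test runs.
# Port 8080 is the kuard service port from the repro (an assumption).
kubectl -n ingress-nginx exec deploy/ingress-nginx-controller -- \
  sh -c "ss -tn state established '( dport = :8080 )' | wc -l"
```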
So, I think we've fixed at least one manifestation of this issue.
Testing the proxy, we see the following with 2.7.1:
:; hey -z 10s -q 500 -c 1000 -host h1.test.example.com http://127.0.0.1:4140
Summary:
Total: 10.5863 secs
Slowest: 2.1203 secs
Fastest: 0.1966 secs
Average: 1.0250 secs
Requests/sec: 950.8482
Response time histogram:
0.197 [1] |
0.389 [211] |■
0.581 [238] |■
0.774 [443] |■■
0.966 [596] |■■■
1.158 [7961] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
1.351 [268] |■
1.543 [81] |
1.736 [82] |
1.928 [82] |
2.120 [103] |■
Latency distribution:
10% in 0.8101 secs
25% in 1.0443 secs
50% in 1.0504 secs
75% in 1.0577 secs
90% in 1.0748 secs
95% in 1.1976 secs
99% in 1.9323 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0009 secs, 0.1966 secs, 2.1203 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0000 secs, 0.0000 secs, 0.0641 secs
resp wait: 0.9923 secs, 0.1445 secs, 1.2311 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0012 secs
Status code distribution:
[200] 8792 responses
[503] 1274 responses
But we see no such errors on master:
:; hey -z 10s -q 500 -c 1000 -host h1.test.example.com http://127.0.0.1:4140
Summary:
Total: 10.4329 secs
Slowest: 1.4775 secs
Fastest: 0.2569 secs
Average: 0.7664 secs
Requests/sec: 1281.5238
Response time histogram:
0.257 [1] |
0.379 [193] |■
0.501 [388] |■
0.623 [505] |■■
0.745 [662] |■■
0.867 [11276] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.989 [70] |
1.111 [95] |
1.233 [50] |
1.355 [70] |
1.478 [60] |
Latency distribution:
10% in 0.6754 secs
25% in 0.7702 secs
50% in 0.7803 secs
75% in 0.7982 secs
90% in 0.8132 secs
95% in 0.8222 secs
99% in 1.2147 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0013 secs, 0.2569 secs, 1.4775 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0002 secs, 0.0000 secs, 0.1513 secs
resp wait: 0.7472 secs, 0.1656 secs, 1.0301 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0009 secs
Status code distribution:
[200] 13370 responses
This change will be included in tomorrow's edge release (edge-20.5.1); once we release that, it would be good for you to test it to confirm that the repro we've been testing matches your scenario.
@olix0r Thanks! What change in master fixes the issue? Also, will upgrading just linkerd-proxy be sufficient or should the entire control plane be upgraded?
@praseodym stable-2.7.1's changes were mostly focused on changing how the proxy caches clients (and how it uses buffers to do that). We had unintentionally left an extra buffer on the inbound client and it was removed here.
That said, I remain concerned about our performance under high concurrency and, while I think we've got things manageable so that we're not hitting the 1s timeout, there's more work for us to do here.
I've tried the repro with edge-20.5.1 and unfortunately I'm still seeing a high number of 503 errors:
> hey -z 10s -q 500 -c 1000 -q 1 https://localhost/
Summary:
Total: 11.8417 secs
Slowest: 4.0662 secs
Fastest: 0.1455 secs
Average: 2.5919 secs
Requests/sec: 324.6162
Response time histogram:
0.146 [1] |
0.538 [188] |■■■
0.930 [131] |■■
1.322 [96] |■■
1.714 [100] |■■
2.106 [184] |■■■
2.498 [244] |■■■■
2.890 [418] |■■■■■■■
3.282 [2381] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
3.674 [89] |■
4.066 [12] |
Latency distribution:
10% in 1.2003 secs
25% in 2.5285 secs
50% in 2.9711 secs
75% in 3.0175 secs
90% in 3.0437 secs
95% in 3.0591 secs
99% in 3.4496 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0412 secs, 0.1455 secs, 4.0662 secs
DNS-lookup: 0.0253 secs, 0.0000 secs, 0.3061 secs
req write: 0.0001 secs, 0.0000 secs, 0.0111 secs
resp wait: 2.5146 secs, 0.0205 secs, 4.0646 secs
resp read: 0.0002 secs, 0.0000 secs, 0.0125 secs
Status code distribution:
[200] 3174 responses
[503] 670 responses
Relevant nginx linkerd-proxy logs:
[ 361.90622838s] DEBUG outbound:accept{peer.addr=10.244.0.5:52872}:source{target.addr=10.108.138.233:8080}: linkerd2_app_outbound::endpoint: using dst-override addr=kuard.default.svc.cluster.local:8080
[ 361.90632306s] DEBUG outbound:accept{peer.addr=10.244.0.5:52872}:source{target.addr=10.108.138.233:8080}: linkerd2_app_outbound::endpoint: Setting target for request headers={"host": "localhost", "x-request-id": "e2ec477de685cd978f3f814f57a52d36", "x-real-ip": "172.18.0.1", "x-forwarded-for": "172.18.0.1", "x-forwarded-host": "localhost", "x-forwarded-port": "443", "x-forwarded-proto": "https", "x-scheme": "https", "l5d-dst-override": "kuard.default.svc.cluster.local:8080", "user-agent": "hey/0.0.1", "content-type": "text/html", "accept-encoding": "gzip"} uri=/ target.addr=kuard.default.svc.cluster.local:8080 http.settings=Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false }
[ 364.91897576s] WARN outbound:accept{peer.addr=10.244.0.5:52872}:source{target.addr=10.108.138.233:8080}: linkerd2_app_core::errors: Failed to proxy request: request timed out
[ 364.91908947s] DEBUG outbound:accept{peer.addr=10.244.0.5:52872}:source{target.addr=10.108.138.233:8080}: linkerd2_app_core::errors: Handling error with HTTP response status=503 Service Unavailable version=HTTP/1.1
@praseodym Thanks for testing!
I have a followup branch in the works which should further reduce latency.
I'm curious, though, can you share the output of your hey test without the proxy involved? It would be helpful to understand the raw latency profile of the service being tested. It's wild to me that we're seeing multi-second response latencies at all.
Also, are these workloads running with CPU requests/limits? Specifically, the Linkerd proxies are, by default, only configured with CPU requests when the control plane is installed with --ha. If not, we may want to try a CPU request to ensure we're not being starved.
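If you want to try that, something like this should work, using the standard proxy-cpu-request annotation (namespace and deployment names are from the repro, and I'm assuming a restart is enough for the injector to pick it up):

```
# Give each injected proxy a 1-CPU request (assumed approach), then restart the workloads.
kubectl annotate ns ingress-nginx default config.linkerd.io/proxy-cpu-request=1 --overwrite
kubectl -n ingress-nginx rollout restart deploy ingress-nginx-controller
kubectl -n default rollout restart deploy kuard   # deployment name assumed
```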
This is without linkerd-proxy injected into nginx:
```
> hey -z 10s -q 500 -c 1000 -q 1 https://localhost/
Summary:
Total: 10.2551 secs
Slowest: 0.4851 secs
Fastest: 0.0058 secs
Average: 0.1805 secs
Requests/sec: 975.1233
Response time histogram:
0.006 [1] |
0.054 [295] |■■■
0.102 [902] |■■■■■■■■■■
0.150 [2529] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.198 [3494] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.245 [1660] |■■■■■■■■■■■■■■■■■■■
0.293 [193] |■■
0.341 [74] |■
0.389 [225] |■■■
0.437 [281] |■■■
0.485 [346] |■■■■
Latency distribution:
10% in 0.0949 secs
25% in 0.1308 secs
50% in 0.1657 secs
75% in 0.2029 secs
90% in 0.2617 secs
95% in 0.4109 secs
99% in 0.4683 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0215 secs, 0.0058 secs, 0.4851 secs
DNS-lookup: 0.0093 secs, 0.0000 secs, 0.1723 secs
req write: 0.0002 secs, 0.0000 secs, 0.1086 secs
resp wait: 0.1045 secs, 0.0031 secs, 0.2668 secs
resp read: 0.0003 secs, 0.0000 secs, 0.0546 secs
Status code distribution:
[200] 10000 responses
```
linkerd-proxy isn't configured with any requests/limits. The repro is running in kind where all nodes can use all host resources, so setting resource requests won't have any effect.
@praseodym If you'd like to test that branch, you can annotate your workloads (or namespace) with
config.linkerd.io/proxy-version: ver-noblock.2020-05-08.1
In my local tests, this branch appears to be a substantial improvement over the last few releases, though these tests do put 1-cpu requests on each of the containers being tested.
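For anyone else following along, applying it might look like this (namespace and deployment names are from the repro; I'm assuming the namespace-level annotation plus a restart is enough for the injector to pick it up):

```
kubectl annotate ns ingress-nginx \
  config.linkerd.io/proxy-version=ver-noblock.2020-05-08.1 --overwrite
kubectl -n ingress-nginx rollout restart deploy ingress-nginx-controller
```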
@olix0r Thanks! That branch seems to work a lot better, with all requests getting back status code 200. There are still some latency spikes though:
> hey -z 10s -q 500 -c 1000 -q 1 https://localhost/
Summary:
Total: 10.6458 secs
Slowest: 1.7274 secs
Fastest: 0.0545 secs
Average: 0.4624 secs
Requests/sec: 939.3391
Response time histogram:
0.055 [1] |
0.222 [695] |■■■■■■■
0.389 [2804] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.556 [3753] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.724 [2514] |■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.891 [121] |■
1.058 [0] |
1.226 [0] |
1.393 [0] |
1.560 [0] |
1.727 [112] |■
Latency distribution:
10% in 0.2539 secs
25% in 0.3484 secs
50% in 0.4577 secs
75% in 0.5673 secs
90% in 0.6368 secs
95% in 0.6840 secs
99% in 1.6479 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0499 secs, 0.0545 secs, 1.7274 secs
DNS-lookup: 0.0123 secs, 0.0000 secs, 0.2107 secs
req write: 0.0002 secs, 0.0000 secs, 0.1051 secs
resp wait: 0.3838 secs, 0.0040 secs, 1.6382 secs
resp read: 0.0007 secs, 0.0000 secs, 0.1102 secs
Status code distribution:
[200] 10000 responses
I also encountered this problem. I used JMeter for testing, and the TPS dropped by 80% while returning a large number of 503 errors.
When I uninject Linkerd, all the responses are 200.
@Fly-Coder Did you try with the latest edge release? It includes all the improvements that were mentioned in this issue.
@muenchdo I am using the latest stable version, 2.7.1; I did not use the edge version.
@muenchdo How do I use the latest edge release? Do I download the file 'https://github.com/linkerd/linkerd2/releases/download/edge-20.5.3/linkerd2-cli-edge-20.5.3-linux' and then replace the CLI executable at '/root/.linkerd2/bin/linkerd'?
@muenchdo
I tried the latest edge version with 1000 concurrent connections, looping 10 times: no 503 errors, all responses 200, but the TPS dropped significantly.
I'm going to close this out as it appears that the 503s are taken care of. We'll keep looking at latency, but it is unsurprising in these types of constrained environments to see latency go up with the addition of the proxy.
Please reopen if you see 503s on current versions!
One thing that is not yet resolved is that 503 errors generated by the proxy are not reflected in the Linkerd dashboard. Is this something that should be handled in another issue?
@praseodym yeah, that'd be great!