We can reliably reproduce a proxy memory leak with the following config:
:; kubectl create ns lifecycle
:; curl -s https://raw.githubusercontent.com/linkerd/linkerd-examples/master/lifecycle/lifecycle.yml |linkerd inject - |kubectl apply -f - -n lifecycle
Then, the leak can be observed by watching the broadcast container, which slowly grows its RSS as long as the process runs:
:; while true ; do date ; kubectl top po --containers -n lifecycle -l app=bb-broadcast |awk '$2 ~ /^linkerd-proxy$/ {print $0}' ; sleep 600 ; done
Wed May 29 21:38:17 UTC 2019
bb-broadcast-8768bbf55-p62l4 linkerd-proxy 196m 9Mi
I'm running a side-by-side test of this repro with both edge-19.5.3 and a proxy image built from the latest master branch of the proxy repo. Thus far, both are showing an upward trend in RSS (the difference in memory usage between the edge and master proxies is at least partially due to the master proxy having been started a couple of minutes earlier than the edge proxy):
~ ❯ while true ; do date ; kubectl top po --containers --all-namespaces -l app=bb-broadcast |awk '$3 ~ /^linkerd-proxy$/ {print $0}' ; sleep 600 ; done
Wed May 29 15:09:38 PDT 2019
lifecycle-edge bb-broadcast-5f9744f7d4-frrln linkerd-proxy 216m 4Mi
lifecycle-master bb-broadcast-5f9744f7d4-rkdvf linkerd-proxy 141m 6Mi
Wed May 29 15:19:39 PDT 2019
lifecycle-edge bb-broadcast-5f9744f7d4-frrln linkerd-proxy 115m 8Mi
lifecycle-master bb-broadcast-5f9744f7d4-rkdvf linkerd-proxy 200m 10Mi
Wed May 29 15:29:39 PDT 2019
lifecycle-edge bb-broadcast-5f9744f7d4-frrln linkerd-proxy 173m 12Mi
lifecycle-master bb-broadcast-5f9744f7d4-rkdvf linkerd-proxy 190m 15Mi
Wed May 29 15:39:39 PDT 2019
lifecycle-edge bb-broadcast-5f9744f7d4-frrln linkerd-proxy 169m 17Mi
lifecycle-master bb-broadcast-5f9744f7d4-rkdvf linkerd-proxy 215m 20Mi
Wed May 29 15:49:40 PDT 2019
lifecycle-edge bb-broadcast-5f9744f7d4-frrln linkerd-proxy 197m 20Mi
lifecycle-master bb-broadcast-5f9744f7d4-rkdvf linkerd-proxy 130m 26Mi
Some initial testing suggests that this is _probably_ not related to metrics or router cache entries. Going to do some task-leak hunting tomorrow.
The number of entries in the metrics registry remains fairly stable over time in my tests, so it's probably not metrics-related.
I've got a much narrower reproduction:
```
---
kind: Namespace
apiVersion: v1
metadata:
  name: strest
  labels:
    buoyant.io/test: strest-leak
  annotations:
    linkerd.io/inject: enabled
# nil.strest is empty.
---
kind: Service
apiVersion: v1
metadata:
  name: nil
  namespace: strest
  labels:
    buoyant.io/test: strest-leak
    strest/role: nil
spec:
  type: ClusterIP
  selector:
    strest/role: nil
  ports:
  - name: grpc
    port: 8888
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: client-leaky
  namespace: strest
  labels:
    buoyant.io/test: strest-leak
    strest/role: client
    strest/leakiness: leaky
spec:
  replicas: 1
  template:
    metadata:
      labels:
        buoyant.io/test: strest-leak
        strest/role: client
        strest/leakiness: leaky
    spec:
      serviceAccount: client
      containers:
      - name: main
        image: buoyantio/strest-grpc:0.0.6
        imagePullPolicy: IfNotPresent
        ports:
        - name: admin-http
          containerPort: 9999
        args: [
          "client",
          "--address", "nil:8888",
          "--metricAddr", ":9999",
          "--streams", "10"]
```
This is a single client communicating with a service that has no endpoints; it leaks quickly.
The proxy emits log lines like:
client-leaky-85b44df57-8vhsk linkerd-proxy ERR! [ 2065.445768s] proxy={server=out listen=127.0.0.1:4140 remote=10.1.1.137:46826} linkerd2_proxy::app::errors unexpected error: operation was canceled: connection closed
and the proxy does not increment response counters for these messages.
The application appears to receive a Bad Gateway response from the proxy.
The above example _only_ leaks when using kube-proxy for usermode proxying. In this mode, kube-proxy terminates the TCP connection. In iptables proxying, the connection never succeeds, so the request is never dispatched into hyper, etc.
Here's an example TCP stream in the usermode case:
78110 3.466894035 10.1.1.155 → 10.1.1.1 TCP 74 39142 → 34397 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=406828503 TSecr=0 WS=128
78111 3.466903305 10.152.183.170 → 10.1.1.155 TCP 74 8888 → 39142 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=2708174506 TSecr=406828503 WS=128
78112 3.466909111 10.1.1.155 → 10.1.1.1 TCP 66 39142 → 34397 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=406828503 TSecr=2708174506
78113 3.466937435 10.1.1.155 → 10.1.1.1 TCP 90 39142 → 34397 [PSH, ACK] Seq=1 Ack=1 Win=29312 Len=24 TSval=406828503 TSecr=2708174506
78114 3.466943253 10.152.183.170 → 10.1.1.155 TCP 66 8888 → 39142 [ACK] Seq=1 Ack=25 Win=29056 Len=0 TSval=2708174506 TSecr=406828503
78115 3.466948496 10.152.183.170 → 10.1.1.155 TCP 66 8888 → 39142 [RST, ACK] Seq=1 Ack=25 Win=29056 Len=0 TSval=2708174506 TSecr=406828503
It seems likely we could reproduce this with any client that talks to a server that accepts and immediately closes a connection as soon as it reads some data...
It seems like when an outbound client connection is accepted and then closed, some Bytes associated with that request are incorrectly retained somewhere. I'm curious whether this is happening in Linkerd or somewhere in hyper/h2 --- I'd like to take a closer look at the retryable request code in hyper's client. It might be worthwhile to run hyper/h2 example proxies against a server that implements this behaviour, to see whether the leak is in Linkerd or in the client impl.
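For concreteness, here's a minimal sketch (not part of the repro above) of a server with that behaviour, which could be pointed at a hyper/h2 example proxy to test this. The port and the 24-byte read are assumptions chosen to mirror the capture above:

```rust
// Hypothetical "accept, read a little, then close" server, sketched to
// mimic the usermode kube-proxy behaviour seen in the capture above.
// The port (8888) and the 24-byte read are assumptions, not from the issue.
use std::io::Read;
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:8888")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        let mut buf = [0u8; 24];
        // Read whatever the client sends first (e.g. the start of an
        // HTTP/2 connection preface)...
        let _ = stream.read(&mut buf);
        // ...then drop the socket, closing the connection immediately.
        // (Dropping sends a FIN; reproducing the RST seen in the capture
        // would additionally require SO_LINGER with a zero timeout, but
        // either way the client sees the connection torn down right after
        // it writes.)
        drop(stream);
    }
    Ok(())
}
```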
@olix0r I'm not sure if proxy-mode: usermode is actually necessary for the leak with your strest repro. I'm seeing an apparent leak with edge-19.5.4 running in a freshly-created GKE cluster where kube-proxy is running in iptables mode.
client-leaky's RSS has grown from 1MiB to 10MiB over the last 52 minutes, while client-stable's RSS has remained stable at 1MiB:
$ while true ; do date ; kubectl top po --containers -n strest -l strest/role=client | awk '$2 ~ /^linkerd-proxy$/ {print $0}' ; sleep 300 ; done
Mon Jun 3 12:31:22 PDT 2019
client-leaky-59957d4f6b-f2csp linkerd-proxy 5m 10Mi
client-stable-5c849bcc-crtww linkerd-proxy 503m 1Mi
Mon Jun 3 12:36:22 PDT 2019
client-leaky-59957d4f6b-f2csp linkerd-proxy 5m 11Mi
client-stable-5c849bcc-crtww linkerd-proxy 422m 1Mi
Mon Jun 3 12:41:23 PDT 2019
client-leaky-59957d4f6b-f2csp linkerd-proxy 5m 12Mi
client-stable-5c849bcc-crtww linkerd-proxy 495m 1Mi
Mon Jun 3 12:46:23 PDT 2019
client-leaky-59957d4f6b-f2csp linkerd-proxy 5m 13Mi
client-stable-5c849bcc-crtww linkerd-proxy 498m 1Mi
(Note that I had prior data on memory usage in this test but lost it when my laptop crashed; the client-leaky proxy was observed to grow by roughly 2 MiB every 10 minutes or so.)
Note that kube-proxy is configured in iptables mode:
```
eliza@gke-eliza-dev-default-pool-f5677711-1gfc ~ $ cat /var/log/kube-proxy.log
Flag --resource-container has been deprecated, This feature will be removed in a later release.
I0603 17:19:38.300839 1 flags.go:33] FLAG: --alsologtostderr="false"
I0603 17:19:38.300909 1 flags.go:33] FLAG: --bind-address="0.0.0.0"
I0603 17:19:38.300917 1 flags.go:33] FLAG: --cleanup="false"
I0603 17:19:38.300924 1 flags.go:33] FLAG: --cleanup-iptables="false"
I0603 17:19:38.300929 1 flags.go:33] FLAG: --cleanup-ipvs="true"
I0603 17:19:38.300934 1 flags.go:33] FLAG: --cluster-cidr="10.52.0.0/14"
I0603 17:19:38.300941 1 flags.go:33] FLAG: --config=""
I0603 17:19:38.300946 1 flags.go:33] FLAG: --config-sync-period="15m0s"
I0603 17:19:38.300954 1 flags.go:33] FLAG: --conntrack-max="0"
I0603 17:19:38.300960 1 flags.go:33] FLAG: --conntrack-max-per-core="32768"
I0603 17:19:38.300966 1 flags.go:33] FLAG: --conntrack-min="131072"
I0603 17:19:38.300971 1 flags.go:33] FLAG: --conntrack-tcp-timeout-close-wait="1h0m0s"
I0603 17:19:38.300976 1 flags.go:33] FLAG: --conntrack-tcp-timeout-established="24h0m0s"
I0603 17:19:38.300981 1 flags.go:33] FLAG: --feature-gates="DynamicKubeletConfig=false,ExperimentalCriticalPodAnnotation=true,NodeLease=false,RotateKubeletServerCertificate=true,TaintBasedEvictions=false"
I0603 17:19:38.301007 1 flags.go:33] FLAG: --healthz-bind-address="0.0.0.0:10256"
I0603 17:19:38.301012 1 flags.go:33] FLAG: --healthz-port="10256"
I0603 17:19:38.301017 1 flags.go:33] FLAG: --help="false"
I0603 17:19:38.301022 1 flags.go:33] FLAG: --hostname-override=""
I0603 17:19:38.301026 1 flags.go:33] FLAG: --iptables-masquerade-bit="14"
I0603 17:19:38.301031 1 flags.go:33] FLAG: --iptables-min-sync-period="10s"
I0603 17:19:38.301036 1 flags.go:33] FLAG: --iptables-sync-period="1m0s"
I0603 17:19:38.301041 1 flags.go:33] FLAG: --ipvs-exclude-cidrs="[]"
I0603 17:19:38.301050 1 flags.go:33] FLAG: --ipvs-min-sync-period="10s"
I0603 17:19:38.301082 1 flags.go:33] FLAG: --ipvs-scheduler=""
I0603 17:19:38.301087 1 flags.go:33] FLAG: --ipvs-strict-arp="false"
I0603 17:19:38.301092 1 flags.go:33] FLAG: --ipvs-sync-period="1m0s"
I0603 17:19:38.301097 1 flags.go:33] FLAG: --kube-api-burst="10"
I0603 17:19:38.301101 1 flags.go:33] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I0603 17:19:38.301107 1 flags.go:33] FLAG: --kube-api-qps="5"
I0603 17:19:38.301114 1 flags.go:33] FLAG: --kubeconfig="/var/lib/kube-proxy/kubeconfig"
I0603 17:19:38.301120 1 flags.go:33] FLAG: --log-backtrace-at=":0"
I0603 17:19:38.301126 1 flags.go:33] FLAG: --log-dir=""
I0603 17:19:38.301131 1 flags.go:33] FLAG: --log-file=""
I0603 17:19:38.301135 1 flags.go:33] FLAG: --log-flush-frequency="5s"
I0603 17:19:38.301140 1 flags.go:33] FLAG: --logtostderr="true"
I0603 17:19:38.301145 1 flags.go:33] FLAG: --masquerade-all="false"
I0603 17:19:38.301149 1 flags.go:33] FLAG: --master="https://35.203.145.149"
I0603 17:19:38.301154 1 flags.go:33] FLAG: --metrics-bind-address="127.0.0.1:10249"
I0603 17:19:38.301165 1 flags.go:33] FLAG: --metrics-port="10249"
I0603 17:19:38.301170 1 flags.go:33] FLAG: --nodeport-addresses="[]"
I0603 17:19:38.301178 1 flags.go:33] FLAG: --oom-score-adj="-998"
I0603 17:19:38.301183 1 flags.go:33] FLAG: --profiling="false"
I0603 17:19:38.301188 1 flags.go:33] FLAG: --proxy-mode=""
I0603 17:19:38.301194 1 flags.go:33] FLAG: --proxy-port-range=""
I0603 17:19:38.301200 1 flags.go:33] FLAG: --resource-container=""
I0603 17:19:38.301205 1 flags.go:33] FLAG: --skip-headers="false"
I0603 17:19:38.301209 1 flags.go:33] FLAG: --stderrthreshold="2"
I0603 17:19:38.301214 1 flags.go:33] FLAG: --udp-timeout="250ms"
I0603 17:19:38.301219 1 flags.go:33] FLAG: --v="2"
I0603 17:19:38.301224 1 flags.go:33] FLAG: --version="false"
I0603 17:19:38.301231 1 flags.go:33] FLAG: --vmodule=""
I0603 17:19:38.301236 1 flags.go:33] FLAG: --write-config-to=""
W0603 17:19:38.301250 1 server.go:199] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0603 17:19:38.301289 1 feature_gate.go:206] feature gates: &{map[NodeLease:false TaintBasedEvictions:false RotateKubeletServerCertificate:true ExperimentalCriticalPodAnnotation:true DynamicKubeletConfig:false]}
W0603 17:19:38.428384 1 server_others.go:296] Flag proxy-mode="" unknown, assuming iptables proxy
I0603 17:19:38.612561 1 server_others.go:148] Using iptables Proxier.
```
However, the proxy logs from the above test indicate that requests to nil.strest are failing because they hit the dispatch timeout, not because an outbound connection attempt returned an error. Consistent with that, the proxy's metrics show that outbound dst connections are never successfully opened:
$ bin/linkerd metrics -n strest po/client-leaky-59957d4f6b-f2csp | grep tcp_
# HELP tcp_open_total Total count of opened connections
# TYPE tcp_open_total counter
tcp_open_total{direction="outbound",peer="src",tls="no_identity",no_tls_reason="loopback"} 1
tcp_open_total{direction="outbound",peer="dst",tls="no_identity",no_tls_reason="not_provided_by_service_discovery"} 0
# HELP tcp_open_connections Number of currently-open connections
# TYPE tcp_open_connections gauge
tcp_open_connections{direction="outbound",peer="src",tls="no_identity",no_tls_reason="loopback"} 1
tcp_open_connections{direction="outbound",peer="dst",tls="no_identity",no_tls_reason="not_provided_by_service_discovery"} 0
# HELP tcp_read_bytes_total Total count of bytes read from peers
# TYPE tcp_read_bytes_total counter
tcp_read_bytes_total{direction="outbound",peer="src",tls="no_identity",no_tls_reason="loopback"} 1582669
tcp_read_bytes_total{direction="outbound",peer="dst",tls="no_identity",no_tls_reason="not_provided_by_service_discovery"} 0
# HELP tcp_write_bytes_total Total count of bytes written to peers
# TYPE tcp_write_bytes_total counter
tcp_write_bytes_total{direction="outbound",peer="src",tls="no_identity",no_tls_reason="loopback"} 637714
tcp_write_bytes_total{direction="outbound",peer="dst",tls="no_identity",no_tls_reason="not_provided_by_service_discovery"} 0
# HELP tcp_close_total Total count of closed connections
# TYPE tcp_close_total counter
# HELP tcp_connection_duration_ms Connection lifetimes
# TYPE tcp_connection_duration_ms histogram
It looks like the outbound dst connections are timing out:
DBUG [ 3763.853554s] 10.55.253.34:8888 linkerd2_proxy::transport::connect connecting to 10.55.253.34:8888
WARN [ 3764.305820s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.34:41176} linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline
DBUG [ 3764.310161s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.34:41176} linkerd2_proxy::app::main outbound addr=Ok(Name(NameAddr { name: "nil", port: 8888 }))
DBUG [ 3764.310810s] nil:8888 linkerd2_proxy::app::main outbound dst=Some(DstAddr { addr: Name(NameAddr { name: "nil.strest.svc.cluster.local.", port: 8888 }), direction: Out, http_settings: Http2 })
DBUG [ 3764.312312s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.34:41176} linkerd2_proxy::app::main outbound ep=Some(Endpoint { dst_name: None, addr: V4(10.55.253.34:8888), identity: None(NoPeerName(NotProvidedByServiceDiscovery)), metadata: Metadata { weight: 10000, labels: {}, protocol_hint: Unknown, identity: None }, http_settings: Http2 })
WARN [ 3764.464902s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.34:41176} linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline
WARN [ 3764.484232s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.34:41176} linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline
WARN [ 3764.484370s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.34:41176} linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline
The leak described above appears to be HTTP/2-specific.
I've deployed an HTTP/1 version of the same test, using slow_cooker rather than strest-grpc and installing Linkerd with --disable-h2-upgrade. After about 25 minutes of this test, the memory usage of both the "leaky" and "stable" pods remains stable at 2 MiB:
$ while true ; do date ; kubectl top po --containers -n slow-cooker -l slow-cooker/role=client | awk '$2 ~ /^linkerd-proxy$/ {print $0}' ; sleep 300 ; done
Mon Jun 3 13:10:22 PDT 2019
client-leaky-7fc5d854c8-pz9f4 linkerd-proxy 8m 2Mi
client-stable-6f48ccd6fd-75fl7 linkerd-proxy 153m 2Mi
Mon Jun 3 13:15:22 PDT 2019
client-leaky-7fc5d854c8-pz9f4 linkerd-proxy 9m 2Mi
client-stable-6f48ccd6fd-75fl7 linkerd-proxy 145m 2Mi
Mon Jun 3 13:20:22 PDT 2019
client-leaky-7fc5d854c8-pz9f4 linkerd-proxy 9m 2Mi
client-stable-6f48ccd6fd-75fl7 linkerd-proxy 151m 2Mi
Mon Jun 3 13:26:49 PDT 2019
client-leaky-7fc5d854c8-pz9f4 linkerd-proxy 8m 2Mi
client-stable-6f48ccd6fd-75fl7 linkerd-proxy 150m 2Mi
Mon Jun 3 13:31:50 PDT 2019
client-leaky-7fc5d854c8-pz9f4 linkerd-proxy 9m 2Mi
client-stable-6f48ccd6fd-75fl7 linkerd-proxy 149m 2Mi
In the HTTP/2 test with strest-grpc, the leaky proxy's memory had already begun to increase by this point (and is continually increasing).
Test configuration
Linkerd installed with:
$ bin/linkerd install --disable-h2-upgrade -l linkerd-h1 | kubectl apply -f -
Test injected with:
$ bin/linkerd inject slow_cooker.yml -l linkerd-h1 --manual -v edge-19.5.4 --proxy-log-level=info,linkerd2_proxy=debug,hyper=debug | kubectl apply -f -
slow_cooker.yml:
```
---
kind: Namespace
apiVersion: v1
metadata:
  name: slow-cooker
  labels:
    buoyant.io/test: slow-cooker-leak
  annotations:
    linkerd.io/inject: enabled
# target.slow-cooker is valid.
---
kind: Service
apiVersion: v1
metadata:
  name: target
  namespace: slow-cooker
  labels:
    buoyant.io/test: slow-cooker-leak
    slow-cooker/role: server
spec:
  type: ClusterIP
  selector:
    slow-cooker/role: server
  ports:
  - name: http
    port: 8888
# nil.slow-cooker is empty.
---
kind: Service
apiVersion: v1
metadata:
  name: nil
  namespace: slow-cooker
  labels:
    buoyant.io/test: slow-cooker-leak
    slow-cooker/role: nil
spec:
  type: ClusterIP
  selector:
    slow-cooker/role: nil
  ports:
  - name: http
    port: 8888
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: server
  namespace: slow-cooker
  labels:
    buoyant.io/test: slow-cooker-leak
    slow-cooker/role: server
spec:
  replicas: 1
  template:
    metadata:
      labels:
        buoyant.io/test: slow-cooker-leak
        slow-cooker/role: server
    spec:
      serviceAccount: server
      containers:
      - name: helloworld
        image: buoyantio/helloworld:0.1.6
        args:
        - -addr=:8888
        - -text=Hello
        ports:
        - name: http
          containerPort: 8888
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: client-leaky
  namespace: slow-cooker
  labels:
    buoyant.io/test: slow-cooker-leak
    slow-cooker/role: client
    slow-cooker/leakiness: leaky
spec:
  replicas: 1
  template:
    metadata:
      labels:
        buoyant.io/test: slow-cooker-leak
        slow-cooker/role: client
        slow-cooker/leakiness: leaky
    spec:
      serviceAccount: client
      containers:
      - name: main
        image: buoyantio/slow_cooker:1.2.0
        imagePullPolicy: IfNotPresent
        ports:
        - name: admin-http
          containerPort: 9999
        args: [
          "-metric-addr=0.0.0.0:9998",
          "-qps=100",
          "-concurrency=10",
          "-interval=10s",
          "-latencyUnit=us",
          "http://nil:8888"]
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: client-stable
  namespace: slow-cooker
  labels:
    buoyant.io/test: slow-cooker-leak
    slow-cooker/role: client
    slow-cooker/leakiness: stable
spec:
  replicas: 1
  template:
    metadata:
      labels:
        buoyant.io/test: slow-cooker-leak
        slow-cooker/role: client
        slow-cooker/leakiness: stable
    spec:
      serviceAccount: client
      containers:
      - name: main
        image: buoyantio/slow_cooker:1.2.0
        imagePullPolicy: IfNotPresent
        ports:
        - name: admin-http
          containerPort: 9999
        args: [
          "-metric-addr=0.0.0.0:9998",
          "-qps=100",
          "-concurrency=10",
          "-interval=10s",
          "-latencyUnit=us",
          "http://target:8888"]
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: client
  namespace: slow-cooker
  labels:
    slow-cooker/role: client
    buoyant.io/test: slow-cooker-leak
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: server
  namespace: slow-cooker
  labels:
    slow-cooker/role: server
    buoyant.io/test: slow-cooker-leak
```
Re: the above comment, I notice that while the HTTP/2 proxy (which exhibits the leak) sees its requests fail due to the dispatch timeout, the HTTP/1 proxy logs indicate that requests are failing because the TCP connection attempt times out:
DBUG [ 2998.493812s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.56:60172} linkerd2_proxy::app::main outbound ep=Some(Endpoint { dst_name: None, addr: V4(10.55.244.209:8888), identity: None(NoPeerName(NotProvidedByServiceDiscovery)), metadata: Metadata { weight: 10000, labels: {}, protocol_hint: Unknown, identity: None }, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
DBUG [ 2998.494161s] 10.55.244.209:8888 linkerd2_proxy::proxy::http::client client request: method=GET uri=http://nil:8888/ version=HTTP/1.1 headers={"host": "nil:8888", "user-agent": "Go-http-client/1.1", "sc-req-id": "29879", "l5d-dst-canonical": "nil.slow-cooker.svc.cluster.local:8888"}
DBUG [ 2998.494441s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.56:60172} linkerd2_proxy::transport::connect connecting to 10.55.244.209:8888
ERR! [ 2998.968120s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.56:60170} linkerd2_proxy::app::errors unexpected error: error trying to connect: request timed out
@hawkw it's odd to me that we're dispatching a request to an endpoint that doesn't have a ready connection in the http/1 case.
@olix0r I don't believe that's what's happening here --- it's the connection attempt that's timing out; the error message says "request timed out" only because it comes from the tower-timeout middleware wrapping the connect service.
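For illustration only, here's a rough sketch of the shape of that failure mode. The address, the 1-second deadline, and the use of tokio::time::timeout are assumptions standing in for the proxy's actual tower-timeout-wrapped connect service:

```rust
// Simplified illustration (not the proxy's actual code) of why a stalled
// *connect* can surface as a "request timed out" error: a timeout wraps
// the connect future, so when the deadline elapses the error it reports
// looks just like a timed-out request further up the stack.
use std::time::Duration;
use tokio::net::TcpStream;
use tokio::time::timeout;

#[tokio::main]
async fn main() {
    // Assumed address; in the repro this would be the nil service's ClusterIP.
    match timeout(Duration::from_secs(1), TcpStream::connect("10.0.0.1:8888")).await {
        Ok(Ok(_stream)) => println!("connected"),
        Ok(Err(e)) => println!("connect failed: {}", e),
        // The connect never completed; all the wrapper knows is that the
        // deadline elapsed, so the error is reported as a timeout.
        Err(_elapsed) => println!("unexpected error: request timed out"),
    }
}
```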
Looks like there's more to it than requests building up in buffers faster than they're canceled by the dispatch timeout: I ran a test in which the load source was removed so that all the buffers could drain, and the proxy's memory use did not decrease after the load stopped.
I'm seeing a somewhat similar issue. Running an app in GKE v1.11.6-gke.6 with Linkerd edge-19.5.3, I constantly get 2 pods evicted because of memory leaks:
The node was low on resource: memory. Container app was using 101832Ki, which exceeds its request of 0. Container linkerd-proxy was using 10473756Ki, which exceeds its request of 0
and
The node was low on resource: memory. Container app was using 147584Ki, which exceeds its request of 0. Container linkerd-proxy was using 8689292Ki, which exceeds its request of 0.
I'm only seeing this on 2 specific pods and I get a 504 Gateway Timeout from nginx when trying to access the app (through those pods). I can't see anything useful in nginx logs in relation to this.
Running kubectl top pods --containers -n default | grep linkerd shows the linkerd-proxy container using over 900m of CPU and over 1000Mi of memory.
I have tested the networking and that is not the problem. Port-forwarding directly into those pods also works.
To make matters worse, a different GKE environment running the same version of the app and the same version of Linkerd, with the same node machine size and the same amount of resources, works as expected. The only difference between the two environments is the GKE node version: the problematic environment is running v1.11.6-gke.6, while the healthy environment is running v1.13.6-gke.0.
I'm tempted to upgrade the nodes, but please let me know if there are any more tests I can do before upgrading.
The memory leak exists above the app::errors layer in the outbound stack. In this graph, the green line corresponds to the resident set size of a proxy which fails all requests at the errors layer:
[graph: proxy resident set size over time]
I believe this issue is caused by an upstream memory leak in the h2 library's server implementation: hyperium/h2#367.
As per a Slack conversation with @calinah, this is confirmed to fix the issue described in https://github.com/linkerd/linkerd2/issues/2863#issuecomment-498706898.
W00t!