In #2922 a user reported a linkerd-prometheus pod using 30GB of memory and 10GB of ephemeral storage. Many factors contribute to Prometheus' resource usage, including:
- `prometheus_tsdb_head_series` == ~500k (== ~300 linkerd proxies x ~1700 metrics/proxy)
- `scrape_interval: 10s`
- `--storage.tsdb.retention.time=6h`
- read load (linkerd dashboard and Grafana)

Replicating the above setup with Prometheus v2.10.0 decreased steady-state memory usage from 10GB -> 5GB, and high read-load usage from 12GB -> 8GB. This change will ship in #2979.
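The head-series figure above can be sanity-checked with a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the reported head series count.
proxies = 300             # approximate number of meshed linkerd proxies
metrics_per_proxy = 1700  # approximate time series exported per proxy

head_series = proxies * metrics_per_proxy
print(head_series)  # 510000, i.e. roughly the ~500k reported
```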
Evaluate Prometheus resource usage, with the goal being one or more of these outcomes:

- decreased read load from the linkerd dashboard and Grafana (via recording rules and/or fewer queries per page)
- tunable resource settings (via `linkerd install`)

/cc @jamesallen-vol @suever @complex64 (thanks for the user reports!)
Linkerd dashboard deployment summary pages (http://127.0.0.1:50750/namespaces/foo/deployments/bar) seem to be _extremely_ resource-intensive for Prometheus if the deployment in question has multiple upstream and downstream relationships, for example:

Simply opening 4 of those pages at the same time was sufficient to drive a 4X increase in the CPU utilization of linkerd-prometheus:

As a follow-up, I replicated the same experiment, but left open pages for a deployment that is purely standalone, with neither upstreams nor downstreams. I saw no CPU usage increase for linkerd-prometheus, so this really does seem to be a question of how complex the graph is for the deployment in question.
@siggy what do you think about putting some rules in to pre-calculate the deployment pages? It's a tough tradeoff.
@grampelberg We tried recording rules a while back with mixed results, though it may be worth another look. I'm also optimistic there may be some optimizations to be had around dashboard query load.
After running a few tests on GKE and AKS, here are my two main observations so far.
When I used the lifecycle test suite to run 100 slow-cooker pods, where each pod generates traffic at 100qps, the Linkerd Prometheus pod started to experience readiness and liveness probe failures. Changing the probe timeout alone from the default of 1 second to 60 seconds got me a lot further, until eventually the node ran out of memory and evicted the pod.
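For reference, the probe change described above amounts to raising `timeoutSeconds` on the Prometheus container's probes. A sketch of the relevant deployment fragment (the probe endpoints and container name are assumptions based on standard Prometheus conventions, not taken from the Linkerd manifests):

```yaml
# Fragment of the linkerd-prometheus Deployment spec (sketch).
# Raising timeoutSeconds from the Kubernetes default of 1s to 60s
# avoided the readiness/liveness probe failures under load.
containers:
  - name: prometheus
    livenessProbe:
      httpGet:
        path: /-/healthy   # standard Prometheus health endpoint
        port: 9090
      timeoutSeconds: 60   # default is 1
    readinessProbe:
      httpGet:
        path: /-/ready     # standard Prometheus readiness endpoint
        port: 9090
      timeoutSeconds: 60   # default is 1
```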
Environment setup on AKS:
Infrastructure Resources | Count
--------------------------------- | ---
Nodes | 10
Pods | 400
Total CPU cores | 20
Total memory | 70GB

Deployment Name | Pod count
------------------------- | ----------
slow-cooker | 100
bb-broadcast | 100
bb-p2p | 100
bb-terminus | 100
Slow cooker configuration:
@ihcsim Just to confirm, does this mean we're running 100 slow-cooker pods at 100qps each? If so, recommend turning qps down to 1 (or ~10), as 100qps may put undue pressure on the kubernetes nodes and linkerd-proxy. We really only want pressure on Prometheus, which should not vary with qps (and if it does I'd love to hear about it).
@siggy Thanks for the tips.
Unrelated to qps, in my last round of tests I saw some `context canceled` logs in the public-api. Is this a query that the public-api was trying to send to Prometheus, where the context was canceled (due to a context timeout?) because Prometheus was unresponsive?
linkerd linkerd-controller-569bb9cfd8-q9r6s public-api time="2019-08-27T22:22:42Z" level=error msg="Query(sum(increase(route_response_total{direction=\"inbound\", dst=~\"(kubernetes.default.svc.cluster.local|bb-broadcast.default.svc.cluster.local)(:\\\\d+)?\", namespace=\"default\", pod=\"bb-broadcast-7b8454d865-lg9vw\"}[1m])) by (rt_route, dst, classification)) failed with: Get http://linkerd-prometheus.linkerd.svc.cluster.local:9090/api/v1/query?query=sum%28increase%28route_response_total%7Bdirection%3D%22inbound%22%2C+dst%3D~%22%28kubernetes.default.svc.cluster.local%7Cbb-broadcast.default.svc.cluster.local%29%28%3A%5C%5Cd%2B%29%3F%22%2C+namespace%3D%22default%22%2C+pod%3D%22bb-broadcast-7b8454d865-lg9vw%22%7D%5B1m%5D%29%29+by+%28rt_route%2C+dst%2C+classification%29: context canceled"
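The URL-encoded `query=` parameter in that log line can be decoded to confirm which PromQL the public-api was issuing. A quick sketch (the string below is just a prefix of the encoded query from the log above):

```python
from urllib.parse import unquote_plus

# Prefix of the encoded `query=` parameter from the error log above.
encoded = "sum%28increase%28route_response_total%7Bdirection%3D%22inbound%22"
print(unquote_plus(encoded))
# sum(increase(route_response_total{direction="inbound"
```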
Yeah, it's a TopRoutes query from the public API to Prometheus: https://github.com/linkerd/linkerd2/blob/981f5bc85dd84aa02524c3cd822bdd9a2c1c0756/controller/api/public/top_routes.go#L21
I think you're right that it's a timeout, but I'm not totally sure. We have pretty good metrics around the gRPC clients in the control-plane. Have a look at Prometheus metrics in the linkerd-controller Prometheus job. The Linkerd Health Grafana dashboard is probably a good place to start.
Just to add to this one, I'm seeing Prometheus use all the CPU available in a worker node when I open the linkerd dashboard. Memory doesn't seem to be a big issue for me, and it doesn't seem to be related to which page in the dashboard is open; even opening just the "overview" page is enough to trigger the CPU spike.
I don't notice the issue when I only open Grafana; it only happens with Linkerd's own dashboard.
I was running it on a 4-core node and it was using 100% of the CPU, starving all other pods. I edited the Prometheus deployment to add a limit of 1 core, and that makes the dashboard a bit flaky (and usually pretty slow), even with only 1 tab open.
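The limit edit described above corresponds to a fragment like the following on the prometheus container in the linkerd-prometheus Deployment (a sketch using standard Kubernetes resource fields, not the actual manifest):

```yaml
# Sketch: cap the prometheus container at 1 core.
# As noted above, this made the dashboard flaky and slow under load.
resources:
  limits:
    cpu: "1"
```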

I currently have ~100 meshed pods. This specific cluster is running 10 nodes (m5.xlarge, 4 cores and 16GB each), on Linkerd 2.5.
Please let me know if I can provide more information that could be useful.
@brianstorti Thanks for bringing this up. Can you try updating the linkerd-prometheus-config config map with the changes in https://github.com/linkerd/linkerd2/pull/3401/files#diff-26bef37e1506c3f8b33144756cb7e919R62-R68? I am very curious to see if it will resolve the intense CPU consumption you are seeing. (Note that you may need to redeploy the Linkerd control plane to see the difference, in case cadvisor is already emitting a lot of metrics.)
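For context, the linked change works by dropping high-cardinality cadvisor series at scrape time via Prometheus `metric_relabel_configs`. A sketch of the mechanism (the metric names in the regex are illustrative, not necessarily the ones in the PR):

```yaml
# Sketch of a scrape-config entry that drops high-cardinality
# cadvisor series before they are ingested. See the linked PR
# for the actual regex used by Linkerd.
- job_name: kubernetes-nodes-cadvisor
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'container_(network_tcp_usage_total|tasks_state|cpu_load_average_10s)'
      action: drop
```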
> Memory doesn't seem to be a big issue for me
I'm curious about how much memory Prometheus is consuming. `kubectl -n linkerd top po` should give us an idea.
> ~100 meshed pods
Hmm... I am a bit surprised by this number. On AKS, with 4 cores and 14GB of memory, I was able to get to about 1,000 pods before my Prometheus started to suffocate. Do you have many other workloads sharing the same node as Prometheus? For bigger clusters, I find using a node selector and taints/tolerations to isolate Prometheus to be helpful.
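A minimal sketch of that isolation setup, assuming a node that has been labeled and tainted for Prometheus (the label and taint names here are hypothetical):

```yaml
# Sketch: pin the Prometheus pod to a dedicated, tainted node.
# Assumes the node was prepared with:
#   kubectl label node <node> workload=prometheus
#   kubectl taint node <node> workload=prometheus:NoSchedule
spec:
  nodeSelector:
    workload: prometheus      # schedule only onto the labeled node
  tolerations:
    - key: workload
      operator: Equal
      value: prometheus
      effect: NoSchedule      # tolerate the taint that keeps other pods off
```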
I am also curious about your service-to-service communication pattern. Do you have X meshed clients talking to Y meshed servers, where Y >> X (by many times)?
Finally, can you port-forward to your Linkerd Prometheus and run the following PromQL for me? These queries will impose additional load on your Prometheus, so don't do this on a prod cluster. You might also have to scale down the number of meshed pods.
```
kubectl -n linkerd port-forward svc/linkerd-prometheus 9090
```

```
topk(5, count({job="linkerd-proxy"}) by (__name__))
topk(5, count({job="kubernetes-nodes-cadvisor"}) by (__name__))
```
Here are the Prometheus query results:


Here you can see the CPU and memory usage (this is Prometheus running in a "dedicated" 4-core node):

> Do you have many other workloads sharing the same node as Prometheus?
Not many, but yeah, I was not using a node selector so it was sharing the worker node with a few other pods. Now I'm running Prometheus in a dedicated 4-core node, but still seeing it use 100% of CPU.
> I am also curious about your service-to-service communication pattern. Do you have X meshed clients talking to Y meshed servers, where Y >> X (by many times)?
We have one meshed service that receives requests from ~15 clients, and a service that sends requests to ~15 other services, but other than that, things are pretty evenly distributed.
I can try the configmap change later today and let you know if it changes something.
@ihcsim I tried applying these changes to linkerd-prometheus-config and restarted all Linkerd deployments, but didn't notice any difference in CPU usage. Memory usage did drop significantly, though.
