In #2922 a user reported a linkerd-prometheus pod using 30GB of memory and 10GB of ephemeral storage. Many factors contribute to Prometheus' resource usage, including:
- `prometheus_tsdb_head_series` == ~500k (== ~300 linkerd proxies x ~1700 metrics/proxy)
- `scrape_interval: 10s`
- `--storage.tsdb.retention.time=6h`
- read load (linkerd dashboard and Grafana)

Replicating the above setup with Prometheus v2.10.0 decreased steady-state memory usage from 10GB -> 5GB, and high read-load usage from 12GB -> 8GB. This change will ship in #2979.
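The head-series figure above can be sanity-checked with a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the reported head series count.
proxies = 300             # approximate number of meshed linkerd proxies
metrics_per_proxy = 1700  # approximate time series exported per proxy

head_series = proxies * metrics_per_proxy
print(head_series)  # 510000, i.e. roughly the ~500k reported
```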
Evaluate Prometheus resource usage, with the goal being one or more of these outcomes:

- decreased read load from the linkerd dashboard and Grafana (via recording rules and/or fewer queries per page)
- tunable resource settings (via `linkerd install`)

/cc @jamesallen-vol @suever @complex64 (thanks for the user reports!)
Linkerd dashboard deployment summary pages (http://127.0.0.1:50750/namespaces/foo/deployments/bar) seem to be _extremely_ resource-intensive for Prometheus if the deployment in question has multiple upstream and downstream relationships, for example:

Simply opening 4 of those pages at the same time was sufficient to drive a 4X increase in the CPU utilization of linkerd-prometheus:

As a follow-up, I replicated the same experiment, but left open pages for a deployment that is purely standalone, with neither upstreams nor downstreams. I saw no CPU usage increase for linkerd-prometheus, so this really does seem to be a question of how complex the graph is for the deployment in question.
@siggy what do you think about putting some rules in to pre-calculate the deployment pages? It's a tough tradeoff.
@grampelberg We tried recording rules a while back with mixed results, though it may be worth another look. I'm also optimistic there may be some optimizations to be had around dashboard query load.
After running a few tests on GKE and AKS, here are my two main observations so far.
When I used the lifecycle test suite to run 100 slow-cooker pods, where each pod generates traffic at 100qps, the Linkerd Prometheus pod started to experience readiness and liveness probe failures. Changing the probe timeout alone from the default of 1 second to 60 seconds got me a lot further, until eventually the node ran out of memory and evicted the pod.
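For reference, the probe change described above amounts to raising `timeoutSeconds` on the Prometheus container's probes. A sketch of the relevant deployment fragment (the probe endpoints and container name are assumptions based on standard Prometheus conventions, not taken from the Linkerd manifests):

```yaml
# Fragment of the linkerd-prometheus Deployment spec (sketch).
# Raising timeoutSeconds from the Kubernetes default of 1s to 60s
# avoided the readiness/liveness probe failures under load.
containers:
  - name: prometheus
    livenessProbe:
      httpGet:
        path: /-/healthy   # standard Prometheus health endpoint
        port: 9090
      timeoutSeconds: 60   # default is 1
    readinessProbe:
      httpGet:
        path: /-/ready     # standard Prometheus readiness endpoint
        port: 9090
      timeoutSeconds: 60   # default is 1
```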
Environment setup on AKS:
Infrastructure Resources | Count
--------------------------------- | ---
Nodes | 10
Pods | 400
Total CPU cores | 20
Total memory | 70GB

Deployment Name | Pod count
------------------------- | ----------
slow-cooker | 100
bb-broadcast | 100
bb-p2p | 100
bb-terminus | 100
Slow cooker configuration:
@ihcsim Just to confirm, does this mean we're running 100 slow-cooker pods at 100qps each? If so, recommend turning qps down to 1 (or ~10), as 100qps may put undue pressure on the kubernetes nodes and linkerd-proxy. We really only want pressure on Prometheus, which should not vary with qps (and if it does I'd love to hear about it).
@siggy Thanks for the tips.
Unrelated to qps, in my last round of tests I saw some `context canceled` logs in the public-api. Is this a query that the public-api was trying to send to Prometheus, where the context was canceled (due to a context timeout?) because Prometheus was unresponsive?
linkerd linkerd-controller-569bb9cfd8-q9r6s public-api time="2019-08-27T22:22:42Z" level=error msg="Query(sum(increase(route_response_total{direction=\"inbound\", dst=~\"(kubernetes.default.svc.cluster.local|bb-broadcast.default.svc.cluster.local)(:\\\\d+)?\", namespace=\"default\", pod=\"bb-broadcast-7b8454d865-lg9vw\"}[1m])) by (rt_route, dst, classification)) failed with: Get http://linkerd-prometheus.linkerd.svc.cluster.local:9090/api/v1/query?query=sum%28increase%28route_response_total%7Bdirection%3D%22inbound%22%2C+dst%3D~%22%28kubernetes.default.svc.cluster.local%7Cbb-broadcast.default.svc.cluster.local%29%28%3A%5C%5Cd%2B%29%3F%22%2C+namespace%3D%22default%22%2C+pod%3D%22bb-broadcast-7b8454d865-lg9vw%22%7D%5B1m%5D%29%29+by+%28rt_route%2C+dst%2C+classification%29: context canceled"
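The URL-encoded `query=` parameter in that log line can be decoded to confirm which PromQL the public-api was issuing. A quick sketch (the string below is just a prefix of the encoded query from the log above):

```python
from urllib.parse import unquote_plus

# Prefix of the encoded `query=` parameter from the error log above.
encoded = "sum%28increase%28route_response_total%7Bdirection%3D%22inbound%22"
print(unquote_plus(encoded))
# sum(increase(route_response_total{direction="inbound"
```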
Yeah, it's a TopRoutes query from the public API to Prometheus: https://github.com/linkerd/linkerd2/blob/981f5bc85dd84aa02524c3cd822bdd9a2c1c0756/controller/api/public/top_routes.go#L21
I think you're right that it's a timeout, but I'm not totally sure. We have pretty good metrics around the gRPC clients in the control-plane. Have a look at Prometheus metrics in the linkerd-controller Prometheus job. The Linkerd Health Grafana dashboard is probably a good place to start.
Just to add to this one, I'm seeing Prometheus use all the CPU available in a worker node when I open the linkerd dashboard. Memory doesn't seem to be a big issue for me, and it doesn't seem to be related to which page in the dashboard is open; even opening just the "overview" page is enough to trigger the CPU spike.
I don't notice the issue when I only open Grafana; it only happens with Linkerd's own dashboard.
I was running it on a 4-core node and it was using 100% of the CPU, starving all other pods. I edited the Prometheus deployment to add a limit of 1 core, and that makes the dashboard a bit flaky (and usually pretty slow), even with only 1 tab open.
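The limit edit described above corresponds to a fragment like the following on the prometheus container in the linkerd-prometheus Deployment (a sketch using standard Kubernetes resource fields, not the actual manifest):

```yaml
# Sketch: cap the prometheus container at 1 core.
# As noted above, this made the dashboard flaky and slow under load.
resources:
  limits:
    cpu: "1"
```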

I currently have ~100 meshed pods. This specific cluster is running 10 nodes (m5.xlarge, 4 cores and 16GB each), on Linkerd 2.5.
Please let me know if I can provide more information that could be useful.
@brianstorti Thanks for bringing this up. Can you try updating the linkerd-prometheus-config config map with the changes in https://github.com/linkerd/linkerd2/pull/3401/files#diff-26bef37e1506c3f8b33144756cb7e919R62-R68? I am very curious to see if it will resolve the intense CPU consumption you are seeing. (Note that you may need to redeploy the Linkerd control plane to see the difference, in case cadvisor is already emitting a lot of metrics.)
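For context, the linked change works by dropping high-cardinality cadvisor series at scrape time via Prometheus `metric_relabel_configs`. A sketch of the mechanism (the metric names in the regex are illustrative, not necessarily the ones in the PR):

```yaml
# Sketch of a scrape-config entry that drops high-cardinality
# cadvisor series before they are ingested. See the linked PR
# for the actual regex used by Linkerd.
- job_name: kubernetes-nodes-cadvisor
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'container_(network_tcp_usage_total|tasks_state|cpu_load_average_10s)'
      action: drop
```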
> Memory doesn't seem to be a big issue for me
I'm curious about how much memory Prometheus is consuming. `kubectl -n linkerd top po` should give us an idea.
> ~100 meshed pods
Hmm... I am a bit surprised by this number. On AKS, with 4 cores and 14GB of memory, I was able to get to about 1,000 pods before my Prometheus started to suffocate. Do you have many other workloads sharing the same node as Prometheus? For bigger clusters, I find using a node selector and taints/tolerations to isolate Prometheus to be helpful.
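A minimal sketch of that isolation setup, assuming a node that has been labeled and tainted for Prometheus (the label and taint names here are hypothetical):

```yaml
# Sketch: pin the Prometheus pod to a dedicated, tainted node.
# Assumes the node was prepared with:
#   kubectl label node <node> workload=prometheus
#   kubectl taint node <node> workload=prometheus:NoSchedule
spec:
  nodeSelector:
    workload: prometheus      # schedule only onto the labeled node
  tolerations:
    - key: workload
      operator: Equal
      value: prometheus
      effect: NoSchedule      # tolerate the taint that keeps other pods off
```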
I am also curious about your service-to-service communication pattern. Do you have X meshed clients talking to Y meshed servers, where Y >> X (by many times)?
Finally, can you port-forward to your Linkerd Prometheus and run the following PromQL for me? These queries will impose additional load on your Prometheus, so don't do this on a prod cluster. You might also have to scale down the number of meshed pods.
```
kubectl -n linkerd port-forward svc/linkerd-prometheus 9090
```

```
topk(5, count({job="linkerd-proxy"}) by (__name__))
topk(5, count({job="kubernetes-nodes-cadvisor"}) by (__name__))
```
Here are the Prometheus query results:


Here you can see the CPU and memory usage (this is Prometheus running in a "dedicated" 4-core node):

> Do you have many other workloads sharing the same node as Prometheus?
Not many, but yeah, I was not using a node selector so it was sharing the worker node with a few other pods. Now I'm running Prometheus in a dedicated 4-core node, but still seeing it use 100% of CPU.
> I am also curious about your service-to-service communication pattern. Do you have X meshed clients talking to Y meshed servers, where Y >> X (by many times)?
We have one meshed service that receives requests from ~15 clients, and a service that sends requests to ~15 other services, but other than that, things are pretty evenly distributed.
I can try the configmap change later today and let you know if it changes something.
@ihcsim I tried applying these changes to linkerd-prometheus-config and restarted all Linkerd deployments, but didn't notice any difference in CPU usage. Memory usage did drop significantly, though.
