We are running CA 1.15.5 with k8s 1.15.7. We are seeing memory gradually grow over time. We have the memory limit set to 1Gi, but usage reaches that in about a day and the pod then gets OOM-killed.

Here is our config in the deployment:
- command:
  - ./cluster-autoscaler
  - --v=4
  - --stderrthreshold=info
  - --cloud-provider=aws
  - --skip-nodes-with-local-storage=false
  - --balance-similar-node-groups
  - --expander=random
  - --nodes=5:20:nodes-us-west-2b.cluster.foo.com
  - --nodes=5:20:nodes-us-west-2c.cluster.foo.com
  - --nodes=5:20:nodes-us-west-2d.cluster.foo.com
  - --nodes=1:18:pgpool-nodes.cluster.foo.com
  - --nodes=2:16:postgres-nodes.cluster.foo.com
  - --nodes=1:4:api-nodes-us-west-2b.cluster.foo.com
  - --nodes=1:4:api-nodes-us-west-2c.cluster.foo.com
  - --nodes=1:4:api-nodes-us-west-2d.cluster.foo.com
  - --nodes=0:5:cicd-nodes-us-west-2b.cluster.foo.com
  - --nodes=0:5:cicd-nodes-us-west-2c.cluster.foo.com
  - --nodes=0:5:cicd-nodes-us-west-2d.cluster.foo.com
  - --nodes=0:5:haproxy-nodes-us-west-2b.cluster.foo.com
  - --nodes=0:5:haproxy-nodes-us-west-2c.cluster.foo.com
  - --nodes=0:5:haproxy-nodes-us-west-2d.cluster.foo.com
  env:
  - name: AWS_REGION
    value: us-west-2
  image: k8s.gcr.io/cluster-autoscaler:v1.15.5
  imagePullPolicy: Always
  name: cluster-autoscaler
  resources:
    limits:
      cpu: 100m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 500Mi
Any thoughts?
Curious about the workload you're running. How many pods are running, and how many pending pods do you see during those spikes?
@marwanad thanks for responding. Those spikes take about a day to manifest. We have a pretty steady workflow so about 700+ pods up to 1900 if you count cron jobs.
@bradleyd interesting, I've hit weird memory behaviour with the increasing number of unschedulable pods (as logged in CA). In most cases, around 3k pods or so would hit anywhere between 500-700Mi. Haven't got to running pprof yet.
From what @MaciekPytel mentioned on Slack, it seems like this is not unexpected given that watches cache all nodes and pods in memory.
I would definitely expect CA memory to grow with cluster size (#pods, #nodes and especially #ASGs), but the number of pods mentioned is not that large, this seems like more memory than I'd expect. Note that I only run clusters in GCP/GKE, so my intuition may be way off for memory use of AWS provider.
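Since the number of ASGs is called out as a factor, it may help to tally the node groups in the config posted at the top of the thread. The sketch below parses `--nodes=<min>:<max>:<name>` arguments and sums the bounds; the helper names (`NodeGroup`, `parse_nodes_flag`, `summarize`) are made up for illustration and are not part of cluster-autoscaler. Counting by hand, the posted config has 14 node groups with a combined minimum of 21 and a combined maximum of 136 nodes.

```python
from typing import Iterable, NamedTuple


class NodeGroup(NamedTuple):
    min_size: int
    max_size: int
    name: str


def parse_nodes_flag(flag: str) -> NodeGroup:
    """Parse one --nodes=<min>:<max>:<name> argument."""
    value = flag.split("=", 1)[1]
    min_size, max_size, name = value.split(":", 2)
    return NodeGroup(int(min_size), int(max_size), name)


def summarize(args: Iterable[str]) -> dict:
    """Count node groups and sum their min/max bounds, ignoring other flags."""
    groups = [parse_nodes_flag(a) for a in args if a.startswith("--nodes=")]
    return {
        "groups": len(groups),
        "min_total": sum(g.min_size for g in groups),
        "max_total": sum(g.max_size for g in groups),
    }


# A few of the arguments from the deployment above, for illustration.
example_args = [
    "--expander=random",
    "--nodes=5:20:nodes-us-west-2b.cluster.foo.com",
    "--nodes=1:18:pgpool-nodes.cluster.foo.com",
    "--nodes=0:5:cicd-nodes-us-west-2b.cluster.foo.com",
]
print(summarize(example_args))
```

This only describes the configured bounds; it says nothing about how much memory each group actually costs.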
@MaciekPytel those were my thoughts too. I get that memory will grow in proportion to the cluster size, but this seems like a lot ™
@bradleyd do you have pods using affinity/anti-affinity rules by any chance? I ran a quick pprof and it seems that MatchInterPodAffinity predicate has some heavy footprint.

Here is the change (after an oom). Two days went 400MB to over 900MB. This seems a lot like bloat or a leak to me.
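As a rough way to reason about numbers like these, the growth can be extrapolated linearly to estimate when the limit would be hit. This is only a sketch: it assumes steady linear growth (which a real leak or cache may not follow), treats the 1G limit from the original post as 1024 MB, and the function name is hypothetical.

```python
from typing import Optional


def hours_until_limit(start_mb: float, end_mb: float,
                      elapsed_hours: float, limit_mb: float) -> Optional[float]:
    """Linearly extrapolate memory growth; return hours left before the limit.

    Returns None when usage is flat or shrinking, since no crossing is predicted.
    """
    rate_mb_per_hour = (end_mb - start_mb) / elapsed_hours
    if rate_mb_per_hour <= 0:
        return None
    return max(0.0, (limit_mb - end_mb) / rate_mb_per_hour)


# 400 MB -> 900 MB over two days, against an assumed 1024 MB limit.
remaining = hours_until_limit(400, 900, 48, 1024)
print(f"about {remaining:.1f} hours until the limit at the observed rate")
```

With the figures above, the observed rate is roughly 10 MB per hour, so the remaining headroom would last about half a day.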
I'm seeing the same issue on a 6 node cluster, Amazon EKS v1.15, k8s.gcr.io/cluster-autoscaler:v1.15.5. CA runs into the memory limit, runs out of memory, and is then restarted. Looks like this behavior is on a ~7d schedule.
We just upgraded to 1.15.6 from 1.14.x, and CA was OOMing on startup with the requests and limits we had previously set. We had to increase these significantly to get CA to start up.
Has anything changed to greatly increase the memory footprint?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
30-40 pods, 10-15 nodes (another 15 terminated nodes). AWS. cluster-autoscaler is using 800mb+ of memory.
My other two clusters which have 6-8 nodes use only 200mb.
I have another cluster with 100+ pods, 20+ nodes where cluster-autoscaler uses <200mb, running 1.14.6.
Any updates on this? CA 1.15.7 is getting killed by k8s due to high memory usage.
CA reached 1.3G of memory utilisation on a cluster of ~80 nodes before it was killed by k8s.
The option we tried is --aws-use-static-instance-list=true (Should CA fetch instance types in runtime or use a static list. AWS only), and memory utilisation has now dropped to ~250MB.
There is no issue with CA 1.14.
@Jeffwan @jaypipes Based on comments above this seems like it may be a memory leak in AWS provider.
I tried @infa-ddeore's suggestion of the static instance list. On my two clusters with minimal load I get an average of around 250mb both before and after. On my production cluster memory was ~650-750mb over the past few days, but after enabling the static instance list it jumped to 950mb.
Well now after reverting it starts at 500mb and within 15 seconds it climbs to 950mb before k8s kills it for exceeding memory requests. HELP!
I'm going to have to bisect to see what changed between the 1.14 and 1.15 CAS w/AWS release. At this point, I'm just not sure; I didn't think there were many changes actually to the AWS-specific code between 1.14 and 1.15.
We are also seeing a similar issue: at startup, memory climbs to around 1 GB, but after 5 minutes it drops to only 100 MB.
cluster autoscaler image: tried both 1.15.6 and 1.15.7
k8s version: v1.15.10
cloud-provider: aws
Cluster Size: approx 40 nodes
--aws-use-static-instance-list=true
Didn't help us.
Holy crap. Container aws-cluster-autoscaler was using 1149444Ki, which exceeds its request of 900Mi.
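To put that message into comparable units, here is a small sketch that converts binary-suffixed Kubernetes quantities to bytes. It is a simplification: it handles only integer values with Ki/Mi/Gi/Ti suffixes (real Kubernetes quantities also allow decimal suffixes and fractional values), and the function names are hypothetical.

```python
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4}


def to_bytes(quantity: str) -> int:
    """Convert an integer quantity such as '900Mi' or '1149444Ki' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already plain bytes


def to_mib(quantity: str) -> float:
    """Express a quantity in MiB for easier comparison."""
    return to_bytes(quantity) / 1024 ** 2


usage, request = "1149444Ki", "900Mi"
print(f"usage   = {to_mib(usage):.1f} MiB")
print(f"request = {to_mib(request):.1f} MiB")
print(f"over by = {to_mib(usage) - to_mib(request):.1f} MiB")
```

So 1149444Ki is about 1122.5 MiB, roughly 222.5 MiB above the 900Mi request.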
@jaypipes I'm going to reach out to you via email so we can troubleshoot this 1:1.
I upgraded cluster-autoscaler to 1.16.7 (from 1.15.7) and the memory 1min after boot is 375MiB. Will continue to monitor.
Simultaneously I was removing 55k completed Batch Jobs (cronjobs) that were just lingering on the API server as complete. Anyone else with this issue have similar latent storage? I don't see how this could be related but you never know. kubectl get jobs --all-namespaces
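For anyone wanting to check for a similar backlog, one option is to dump the jobs as JSON (for example with `kubectl get jobs --all-namespaces -o json`) and count the completed ones per namespace. The sketch below assumes the standard Job list shape, where a finished Job carries a status condition of type `Complete` with status `"True"`; the function name and the sample data are made up for illustration. Whether such a backlog affects CA memory is not established in this thread.

```python
import json
from collections import Counter


def completed_jobs_by_namespace(job_list_json: str) -> Counter:
    """Count Jobs that have a Complete=True condition, grouped by namespace."""
    counts = Counter()
    for job in json.loads(job_list_json).get("items", []):
        conditions = job.get("status", {}).get("conditions") or []
        if any(c.get("type") == "Complete" and c.get("status") == "True"
               for c in conditions):
            counts[job["metadata"]["namespace"]] += 1
    return counts


# Hypothetical sample standing in for real `kubectl ... -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "report-1", "namespace": "batch"},
         "status": {"conditions": [{"type": "Complete", "status": "True"}]}},
        {"metadata": {"name": "report-2", "namespace": "batch"},
         "status": {"conditions": [{"type": "Complete", "status": "True"}]}},
        {"metadata": {"name": "running-1", "namespace": "default"},
         "status": {"active": 1}},
    ]
})
print(dict(completed_jobs_by_namespace(SAMPLE)))
```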
System Info:
  Boot ID: 872ea78f-3818-48bc-a30b-a9c402c1bc04
  Kernel Version: 4.14.198-152.320.amzn2.x86_64
  OS Image: Amazon Linux 2
  Operating System: linux
  Architecture: amd64
  Container Runtime Version: docker://19.3.6
  Kubelet Version: v1.16.13-eks-ec92d4
  Kube-Proxy Version: v1.16.13-eks-ec92d4
--aws-use-static-instance-list=true
this does not fix it.
Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137
  Started: Wed, 09 Dec 2020 09:41:39 -0500
  Finished: Wed, 09 Dec 2020 09:43:50 -0500
Ready: True
Restart Count: 1
Limits:
  memory: 6Gi
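Two things can be read off the status above: exit code 137 follows the usual 128 + signal-number convention (signal 9, SIGKILL), and the container lived for only a couple of minutes. A minimal sketch of both calculations follows; the helper names are hypothetical, the timestamp format string is an assumption based on the output shown, and the signal lookup assumes a POSIX system.

```python
import signal
from datetime import datetime
from typing import Optional

# Matches timestamps like "Wed, 09 Dec 2020 09:41:39 -0500".
TIMESTAMP_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def killing_signal(exit_code: int) -> Optional[str]:
    """Map an exit code above 128 to the signal name it conventionally encodes."""
    if exit_code > 128:
        return signal.Signals(exit_code - 128).name
    return None


def lifetime_seconds(started: str, finished: str) -> float:
    """Seconds between the Started and Finished timestamps."""
    start = datetime.strptime(started, TIMESTAMP_FORMAT)
    end = datetime.strptime(finished, TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


print(killing_signal(137))
print(lifetime_seconds("Wed, 09 Dec 2020 09:41:39 -0500",
                       "Wed, 09 Dec 2020 09:43:50 -0500"))
```

For the status shown, that works out to SIGKILL after 131 seconds.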