Autoscaler: Memory leak?

Created on 13 Apr 2020 · 23 Comments · Source: kubernetes/autoscaler

We are running CA 1.15.5 with k8s 1.15.7 and are seeing memory grow gradually over time. The memory limit is set to 1Gi; usage reaches it in about a day, and then the pod gets OOM-killed.

[Screenshot attached: 2020-04-13, 2:07:10 PM]

Here is our config from the deployment:

    - command:
      - ./cluster-autoscaler
      - --v=4
      - --stderrthreshold=info
      - --cloud-provider=aws
      - --skip-nodes-with-local-storage=false
      - --balance-similar-node-groups
      - --expander=random
      - --nodes=5:20:nodes-us-west-2b.cluster.foo.com
      - --nodes=5:20:nodes-us-west-2c.cluster.foo.com
      - --nodes=5:20:nodes-us-west-2d.cluster.foo.com
      - --nodes=1:18:pgpool-nodes.cluster.foo.com
      - --nodes=2:16:postgres-nodes.cluster.foo.com
      - --nodes=1:4:api-nodes-us-west-2b.cluster.foo.com
      - --nodes=1:4:api-nodes-us-west-2c.cluster.foo.com
      - --nodes=1:4:api-nodes-us-west-2d.cluster.foo.com
      - --nodes=0:5:cicd-nodes-us-west-2b.cluster.foo.com
      - --nodes=0:5:cicd-nodes-us-west-2c.cluster.foo.com
      - --nodes=0:5:cicd-nodes-us-west-2d.cluster.foo.com
      - --nodes=0:5:haproxy-nodes-us-west-2b.cluster.foo.com
      - --nodes=0:5:haproxy-nodes-us-west-2c.cluster.foo.com
      - --nodes=0:5:haproxy-nodes-us-west-2d.cluster.foo.com
      env:
      - name: AWS_REGION
        value: us-west-2
      image: k8s.gcr.io/cluster-autoscaler:v1.15.5
      imagePullPolicy: Always
      name: cluster-autoscaler
      resources:
        limits:
          cpu: 100m
          memory: 1Gi
        requests:
          cpu: 100m
          memory: 500Mi
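
Each --nodes flag above registers one node group (an ASG on AWS) in the form min:max:name. As a rough illustration of how many groups and nodes this configuration spans, the flags can be parsed and totalled. This is only a sketch: the helper name parse_nodes_flag is hypothetical and not part of cluster-autoscaler.

```python
from typing import NamedTuple


class NodeGroup(NamedTuple):
    min_size: int
    max_size: int
    name: str


def parse_nodes_flag(flag: str) -> NodeGroup:
    """Parse one '--nodes=<min>:<max>:<name>' argument (hypothetical helper)."""
    value = flag.split("=", 1)[1]
    min_size, max_size, name = value.split(":", 2)
    return NodeGroup(int(min_size), int(max_size), name)


# The --nodes flags copied from the deployment config above.
flags = [
    "--nodes=5:20:nodes-us-west-2b.cluster.foo.com",
    "--nodes=5:20:nodes-us-west-2c.cluster.foo.com",
    "--nodes=5:20:nodes-us-west-2d.cluster.foo.com",
    "--nodes=1:18:pgpool-nodes.cluster.foo.com",
    "--nodes=2:16:postgres-nodes.cluster.foo.com",
    "--nodes=1:4:api-nodes-us-west-2b.cluster.foo.com",
    "--nodes=1:4:api-nodes-us-west-2c.cluster.foo.com",
    "--nodes=1:4:api-nodes-us-west-2d.cluster.foo.com",
    "--nodes=0:5:cicd-nodes-us-west-2b.cluster.foo.com",
    "--nodes=0:5:cicd-nodes-us-west-2c.cluster.foo.com",
    "--nodes=0:5:cicd-nodes-us-west-2d.cluster.foo.com",
    "--nodes=0:5:haproxy-nodes-us-west-2b.cluster.foo.com",
    "--nodes=0:5:haproxy-nodes-us-west-2c.cluster.foo.com",
    "--nodes=0:5:haproxy-nodes-us-west-2d.cluster.foo.com",
]

groups = [parse_nodes_flag(f) for f in flags]
total_min = sum(g.min_size for g in groups)
total_max = sum(g.max_size for g in groups)
print(len(groups), total_min, total_max)  # 14 21 136
```

So this deployment manages 14 node groups with between 21 and 136 nodes in total.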

Any thoughts?

All 23 comments

Curious about the workload you're running. How many pods are running, and how many pending pods do you see during those spikes?

@marwanad thanks for responding. Those spikes take about a day to manifest. We have a pretty steady workflow so about 700+ pods up to 1900 if you count cron jobs.

@bradleyd interesting, I've hit weird memory behaviour as the number of unschedulable pods (as logged in CA) increases. In most cases, around 3k pods or so would hit anywhere between 500-700Mi. Haven't got to running pprof yet.

From what @MaciekPytel mentioned on Slack, it seems like this is not unexpected given that watches cache all nodes and pods in memory.

I would definitely expect CA memory to grow with cluster size (#pods, #nodes and especially #ASGs), but the number of pods mentioned is not that large, so this seems like more memory than I'd expect. Note that I only run clusters in GCP/GKE, so my intuition may be way off for memory use of the AWS provider.

@MaciekPytel those were my thoughts too. I get that memory will grow in proportion to the cluster size, but this seems like a lot ™

@bradleyd do you have pods using affinity/anti-affinity rules by any chance? I ran a quick pprof and it seems that the MatchInterPodAffinity predicate has a heavy memory footprint.

[image attachment]

Here is the change (after an OOM). In two days it went from 400MB to over 900MB. This seems a lot like bloat or a leak to me.
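
For a rough sense of scale, the figures above can be turned into a growth rate and a time-to-limit estimate. This is a minimal sketch that assumes growth is linear, treats MB as MiB, and uses the 1Gi limit from the original config; the function name is hypothetical.

```python
from typing import Optional


def days_until_limit(start_mb: float, end_mb: float,
                     elapsed_days: float, limit_mb: float) -> Optional[float]:
    """Linear extrapolation: days until usage reaches limit_mb, or None if not growing."""
    rate_per_day = (end_mb - start_mb) / elapsed_days
    if rate_per_day <= 0:
        return None
    return (limit_mb - end_mb) / rate_per_day


# 400 MB -> 900 MB over two days, against a 1Gi (1024 MiB) limit.
rate = (900 - 400) / 2                       # 250 MB per day
remaining = days_until_limit(400, 900, 2, 1024)
print(rate, remaining)                       # 250.0 0.496
```

Under those assumptions the process gains about 250 MB per day and would hit the limit roughly half a day after reaching 900MB.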

I'm seeing the same issue on a 6 node cluster, Amazon EKS v1.15, k8s.gcr.io/cluster-autoscaler:v1.15.5. CA runs into the memory limit, runs out of memory, and is then restarted. Looks like this behavior is on a ~7d schedule.

We just upgraded to 1.15.6 from 1.14.x and CA was OOMing on startup based on our requests and limits we had previously set, had to significantly increase these to get CA to start up.

Has anything changed to greatly increase the memory footprint?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

30-40 pods, 10-15 nodes (another 15 terminated nodes). AWS. cluster-autoscaler is using 800mb+ of memory.

My other two clusters which have 6-8 nodes use only 200mb.

I have another cluster with 100+ pods, 20+ nodes where cluster-autoscaler uses <200mb, running 1.14.6.

Any updates on this? CA 1.15.7 is getting killed by Kubernetes due to high memory usage.
CA reached 1.3G memory utilisation on a cluster of ~80 nodes before it was killed.

The option we tried is --aws-use-static-instance-list=true (flag description: "Should CA fetch instance types in runtime or use a static list. AWS only"), and memory utilisation has now dropped to ~250MB.

There is no issue with CA 1.14.

@Jeffwan @jaypipes Based on comments above this seems like it may be a memory leak in AWS provider.

I tried @infa-ddeore's suggestion of the static instance list. On my two clusters with minimal load, memory averages around 250mb both before and after. On my production cluster, memory was ~650-750mb over the past few days, but after enabling the static instance list it jumped to 950mb.

Well now after reverting it starts at 500mb and within 15 seconds it climbs to 950mb before k8s kills it for exceeding memory requests. HELP!

I'm going to have to bisect to see what changed between the 1.14 and 1.15 CAS w/AWS release. At this point, I'm just not sure; I didn't think there were many changes actually to the AWS-specific code between 1.14 and 1.15.

We are seeing a similar issue: at startup, memory climbs to around 1 GB, but after 5 minutes it settles at only 100 MB.

cluster autoscaler image: tried both 1.15.6 and 1.15.7
k8s version: v1.15.10
cloud-provider: aws
Cluster Size: approx 40 nodes

--aws-use-static-instance-list=true
Didn't help us.

Holy crap. Container aws-cluster-autoscaler was using 1149444Ki, which exceeds its request of 900Mi.
@jaypipes I'm going to reach out to you via email so we can troubleshoot this 1:1.
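
To put the eviction message above into comparable units, the two quantities can be converted to bytes. This is a small sketch that handles only the binary suffixes Ki, Mi, and Gi, not the full Kubernetes quantity grammar; to_bytes is a hypothetical helper.

```python
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '900Mi' to bytes (Ki/Mi/Gi suffixes only)."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already plain bytes


usage = to_bytes("1149444Ki")
request = to_bytes("900Mi")
usage_mi = usage / 1024 ** 2
over_mi = (usage - request) / 1024 ** 2
print(round(usage_mi, 1), round(over_mi, 1))  # 1122.5 222.5
```

That is, the container was at about 1122.5Mi, roughly 222.5Mi over its 900Mi request.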

I upgraded cluster-autoscaler to 1.16.7 (from 1.15.7) and the memory 1min after boot is 375MiB. Will continue to monitor.

Simultaneously I was removing 55k completed Batch Jobs (cronjobs) that were just lingering on the API server as complete. Anyone else with this issue have similar latent storage? I don't see how this could be related but you never know. kubectl get jobs --all-namespaces
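
To check whether a cluster carries a similar backlog, the output of kubectl get jobs --all-namespaces -o json can be summarised per namespace. The sketch below counts Jobs whose status carries a Complete condition; the function name and the sample data are made up for illustration, and nothing here confirms that lingering Jobs cause the memory growth.

```python
from collections import Counter


def count_completed_jobs(job_list: dict) -> Counter:
    """Count Jobs with a 'Complete' condition set to 'True', grouped by namespace."""
    counts: Counter = Counter()
    for job in job_list.get("items", []):
        conditions = job.get("status", {}).get("conditions") or []
        done = any(
            c.get("type") == "Complete" and c.get("status") == "True"
            for c in conditions
        )
        if done:
            counts[job["metadata"]["namespace"]] += 1
    return counts


# Made-up sample in the shape of `kubectl get jobs -o json` output.
sample = {
    "items": [
        {"metadata": {"namespace": "batch", "name": "report-1"},
         "status": {"conditions": [{"type": "Complete", "status": "True"}]}},
        {"metadata": {"namespace": "batch", "name": "report-2"},
         "status": {"conditions": [{"type": "Complete", "status": "True"}]}},
        {"metadata": {"namespace": "default", "name": "running-job"},
         "status": {"active": 1}},
    ]
}

completed = count_completed_jobs(sample)
print(dict(completed))  # {'batch': 2}
```

With real data, save the kubectl output to a file, load it with json.load, and pass the resulting dict to the function.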

System Info:
 Boot ID:                    872ea78f-3818-48bc-a30b-a9c402c1bc04
 Kernel Version:             4.14.198-152.320.amzn2.x86_64
 OS Image:                   Amazon Linux 2
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://19.3.6
 Kubelet Version:            v1.16.13-eks-ec92d4
 Kube-Proxy Version:         v1.16.13-eks-ec92d4

--aws-use-static-instance-list=true does not fix it.

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 09 Dec 2020 09:41:39 -0500
      Finished:     Wed, 09 Dec 2020 09:43:50 -0500
    Ready:          True
    Restart Count:  1
    Limits:
      memory:  6Gi

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
