Autoscaler: Cluster Autoscaler on AWS is OOM killed on startup in GenerateEC2InstanceTypes

Created on 11 Sep 2020 · 14 Comments · Source: kubernetes/autoscaler

We noticed our cluster autoscaler occasionally getting OOM killed on startup or when elected as leader. The memory usage spike on startup is fairly consistent even when the pod is not OOM killed, sitting just below the default limits at 250Mi or so. When it doesn't OOM, this memory is eventually garbage collected and the autoscaler stabilizes at well under 100Mi used.

After taking a pprof trace (which required an ad-hoc upgrade to cluster-autoscaler v1.18.2 to get the --profiling flag), we noticed a large chunk of memory allocated in the GenerateEC2InstanceTypes function. We were able to trace this back to PR #2249, which fetches an updated list of EC2 instance types from an AWS-hosted JSON file. Surprisingly, this file is 94 MiB, and the entirety of it is fetched onto the heap before parsing. The data extracted is fairly small (under 43KiB per ec2_instance_types.go), but unfortunately the allocations sometimes live long enough to push the autoscaler over the (default) memory limit.
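To illustrate the kind of change that could avoid holding the whole file on the heap, here is a minimal sketch of token-by-token parsing with encoding/json. It is not the autoscaler's actual code: the document layout (a top-level "products" object keyed by SKU, each with an "attributes" object), the field names, and the helper names streamInstanceTypes, skipValue, and demo are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// instanceAttrs holds the few fields we keep. The field names are assumed.
type instanceAttrs struct {
	InstanceType string `json:"instanceType"`
	VCPU         string `json:"vcpu"`
	Memory       string `json:"memory"`
}

// product mirrors one entry of the assumed "products" object.
type product struct {
	Attributes instanceAttrs `json:"attributes"`
}

// skipValue consumes exactly one JSON value (scalar, object, or array)
// token by token, so that nested structures are never materialized.
func skipValue(dec *json.Decoder) error {
	depth := 0
	for {
		tok, err := dec.Token()
		if err != nil {
			return err
		}
		if d, ok := tok.(json.Delim); ok {
			switch d {
			case '{', '[':
				depth++
			case '}', ']':
				depth--
			}
		}
		if depth == 0 {
			return nil
		}
	}
}

// streamInstanceTypes decodes one product at a time from r and keeps only
// entries that carry an instance type; everything else is skipped.
func streamInstanceTypes(r io.Reader) (map[string]instanceAttrs, error) {
	dec := json.NewDecoder(r)
	if _, err := dec.Token(); err != nil { // opening '{' of the document
		return nil, err
	}
	out := map[string]instanceAttrs{}
	for dec.More() {
		keyTok, err := dec.Token()
		if err != nil {
			return nil, err
		}
		if key, _ := keyTok.(string); key != "products" {
			if err := skipValue(dec); err != nil {
				return nil, err
			}
			continue
		}
		if _, err := dec.Token(); err != nil { // opening '{' of "products"
			return nil, err
		}
		for dec.More() {
			if _, err := dec.Token(); err != nil { // SKU key, not needed
				return nil, err
			}
			var p product
			if err := dec.Decode(&p); err != nil {
				return nil, err
			}
			if p.Attributes.InstanceType != "" {
				out[p.Attributes.InstanceType] = p.Attributes
			}
		}
		if _, err := dec.Token(); err != nil { // closing '}' of "products"
			return nil, err
		}
	}
	return out, nil
}

// sampleOffer is a tiny made-up document in the assumed layout.
const sampleOffer = `{
  "formatVersion": "v1.0",
  "products": {
    "SKU1": {"attributes": {"instanceType": "m5.large", "vcpu": "2", "memory": "8 GiB"}},
    "SKU2": {"attributes": {"instanceType": "c5.xlarge", "vcpu": "4", "memory": "8 GiB"}},
    "SKU3": {"attributes": {"transferType": "InterRegion"}}
  },
  "terms": {"OnDemand": {"SKU1": {"price": ["ignored"]}}}
}`

// demo runs the streaming parser over the sample document.
func demo() (map[string]instanceAttrs, error) {
	return streamInstanceTypes(strings.NewReader(sampleOffer))
}

func main() {
	types, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(types), "instance types:", types)
}
```

In a real fetch, the reader would be the HTTP response body instead of a string. Peak memory would then be bounded by the decoder's buffer plus the largest single product rather than by the full download.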

Additionally, with the --aws-use-static-instance-list=true flag set, the memory spike disappears.

Is there some solution that could fetch the updated list without requiring an otherwise unnecessary memory limit increase? Given the autoscaler's special priority class, raising the limit well beyond what it actually needs at runtime feels a bit wrong.

Additional information:

autoscaler image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.6
Kubernetes version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-eks-2ba888", GitCommit:"2ba888155c7f8093a1bc06e3336333fbdb27b3da", GitTreeState:"clean", BuildDate:"2020-07-17T18:48:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

pprof svg: cluster-autoscaler-pprof.tar.gz (.svg in a tarball to satisfy GitHub)

kubectl describe pod output:

Name:                 cluster-autoscaler-7b9c56647d-9v8pr
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 ip-192-168-125-18.us-west-2.compute.internal/192.168.125.18
Start Time:           Tue, 08 Sep 2020 05:18:08 -0600
Labels:               app=cluster-autoscaler
                     app.kubernetes.io/instance=cluster-autoscaler
                     app.kubernetes.io/name=cluster-autoscaler
                     pod-template-hash=7b9c56647d
Annotations:          cluster-autoscaler.kubernetes.io/safe-to-evict: false
                     kubernetes.io/psp: psp.privileged
                     prometheus.io/path: /metrics
                     prometheus.io/port: 8085
                     prometheus.io/scrape: true
Status:               Running
IP:                   192.168.121.122
IPs:
 IP:           192.168.121.122
Controlled By:  ReplicaSet/cluster-autoscaler-7b9c56647d
Containers:
 cluster-autoscaler:
   Container ID:  docker://cbdbb11a7c20b042d79744edbb5dd0c6fde71303be697a1a773307c9d5ac442c
   Image:         k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.6
   Image ID:      docker-pullable://k8s.gcr.io/autoscaling/cluster-autoscaler@sha256:cbbe98dd8f325bef54557bc2854e48983cfc706aba126bedb0c52d593e869072
   Port:          8085/TCP
   Host Port:     0/TCP
   Command:
     ./cluster-autoscaler
     --v=4
     --stderrthreshold=info
     --cloud-provider=aws
     --skip-nodes-with-local-storage=false
     --expander=least-waste
     --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<snip>1
     --balance-similar-node-groups
     --skip-nodes-with-system-pods=false
   State:          Running
     Started:      Wed, 09 Sep 2020 16:56:37 -0600
   Last State:     Terminated
     Reason:       OOMKilled
     Exit Code:    137
     Started:      Wed, 09 Sep 2020 16:52:52 -0600
     Finished:     Wed, 09 Sep 2020 16:56:21 -0600
   Ready:          True
   Restart Count:  2
   Limits:
     cpu:     100m
     memory:  300Mi
   Requests:
     cpu:        100m
     memory:     300Mi
   Environment:  <none>
   Mounts:
     /etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
     /var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-token-bwpc6 (ro)
Conditions:
 Type              Status
 Initialized       True 
 Ready             True 
 ContainersReady   True 
 PodScheduled      True 
Volumes:
 ssl-certs:
   Type:          HostPath (bare host directory volume)
   Path:          /etc/ssl/certs/ca-bundle.crt
   HostPathType:  
 cluster-autoscaler-token-bwpc6:
   Type:        Secret (a volume populated by a Secret)
   SecretName:  cluster-autoscaler-token-bwpc6
   Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

timothyb89

👍17

All 14 comments

We experienced the same OOM issue after disabling IMDSv1 and switching purely to IRSA. The deployment was missing the AWS_REGION environment variable, which leads Cluster Autoscaler to query the pricing information for all available regions. With JSON documents of this size, OOMKills are likely to happen. With AWS_REGION specified, only the matching pricing data is retrieved: https://github.com/kubernetes/autoscaler/blob/d054bb248f9689eef39be21f796fafd236437073/cluster-autoscaler/cloudprovider/aws/aws_util.go#L71
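As a rough model of the behavior described here (a simplified sketch, not the code at the link above), the region selection can be pictured like this. The function name regionsToQuery and the region list are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// regionsToQuery models the described behavior: with no region configured,
// pricing data for every known region is fetched; otherwise only the
// configured region is used. The name and signature are hypothetical.
func regionsToQuery(configured string, all []string) []string {
	if configured == "" {
		return all
	}
	return []string{configured}
}

func main() {
	// Illustrative subset of regions, not the real list.
	all := []string{"us-east-1", "us-west-2", "eu-west-1"}
	regions := regionsToQuery(os.Getenv("AWS_REGION"), all)
	fmt.Printf("would fetch pricing data for %d region(s): %v\n", len(regions), regions)
}
```

Since each region's document is large, going from every region to a single one cuts the startup download by roughly the number of regions.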

hhamalai on 19 Oct 2020

I am seeing this error even with 6Gi of memory limits... something is wrong.

seamusabshere on 7 Dec 2020

We've seen similar issues with our AWS autoscaler. We didn't have 1.18 to capture a pprof profile, but it was taking more than 5 GB of RAM. Maybe we should default to the static list? I don't think it's been updated recently.

mcristina422 on 7 Dec 2020

I cross-commented on this related issue: https://github.com/kubernetes/autoscaler/issues/3044#issuecomment-741817868

seamusabshere on 15 Dec 2020

> I cross-commented on this related issue: #3044 (comment)

I actually don't believe that #3044 and this issue are due to the same problem. This one is pretty clearly the result of the dynamic instance type generation pulling down 100+ MB JSON files on startup. In #3044, however, you and another poster point out that using the static instance type list does not solve the memory leak issues. I believe the root causes of these issues are different.

jaypipes on 13 Jan 2021

If adding AWS_REGION to the container fixes pulling down the huge instance type file, note that it has been added to the latest https://github.com/aws/amazon-eks-pod-identity-webhook and is also enabled in 1.18 EKS clusters.

jqmichael on 13 Jan 2021

[ @jaypipes means #3044 not #3004 above ]

seamusabshere on 14 Jan 2021

> [ @jaypipes means #3044 not #3004 above ]

doh, yep, sorry about that! fixed :)

jaypipes on 14 Jan 2021

I'm trying to reproduce. How big of a cluster are you trying out? I'm running 1.18.2 on EKS 1.18 with a 100-node cluster and 400 pods, and it's sitting stable at 300 MB of memory.

ellistarn on 22 Jan 2021

Update after deep diving this. @seamusabshere's OOM was due to listwatch caches filling up on startup due to a large number of Job objects in the API Server.

@timothyb89, is there any chance your cluster is suffering a similar fate?

ellistarn on 25 Jan 2021

# courtesy https://stackoverflow.com/a/61231027/310192
kubectl delete jobs --field-selector status.successful=1

😆

i thought i was safe because we were using ttlSecondsAfterFinished... but that's an alpha feature and per @ellistarn "[EKS runs] feature gates that are in Beta."

So, I had thousands of months-old jobs.

seamusabshere on 26 Jan 2021

> @timothyb89, is there any chance your cluster is suffering a similar fate?

Our largest cluster has 250 job objects at the moment, which I'd hope isn't nearly large enough to cause any trouble.

For what it's worth, we've been using --aws-use-static-instance-list=true since September and have not seen any unexpected restarts.

timothyb89 on 26 Jan 2021

I can reproduce the issue, although, contrary to the initial ticket, the default limit is now set at 300 Mi rather than 250 Mi.

Sometimes, the cluster-autoscaler pod needs a lot more, like below (572 Mi):

NAME                                        CPU(cores)   MEMORY(bytes)
...
aws-node-m548b                              4m           40Mi
cluster-autoscaler-6478668dc5-j6gql         93m          572Mi
coredns-6d97dc4b59-dc72r                    3m           7Mi
...

Increasing this limit accordingly solves the issue on my side.

e-nalepa on 15 Feb 2021

👍4

Is there any chance we could read this list from the local filesystem? In combination with an initContainer or a static ConfigMap, that would let us keep the "controller" memory limits closer to the requests.

We have to configure requests: 96Mi and limits: 512Mi... I bet that list is going to keep growing and eventually crash the pods 😓
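
A minimal sketch of what reading the list from the local filesystem might look like, assuming a small pre-extracted JSON array mounted into the pod. The file path, the JSON field names, and loadInstanceTypes are all hypothetical and not an existing autoscaler feature.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// instanceType is the small record we need per instance type.
// The JSON field names are assumed for this sketch.
type instanceType struct {
	Name     string `json:"instanceType"`
	VCPU     int64  `json:"vcpu"`
	MemoryMb int64  `json:"memoryMb"`
}

// loadInstanceTypes reads a pre-extracted list from path. If the file does
// not exist, it returns the built-in fallback list instead of failing.
func loadInstanceTypes(path string, fallback map[string]instanceType) (map[string]instanceType, error) {
	f, err := os.Open(path)
	if err != nil {
		if os.IsNotExist(err) {
			return fallback, nil
		}
		return nil, err
	}
	defer f.Close()

	var list []instanceType
	if err := json.NewDecoder(f).Decode(&list); err != nil {
		return nil, fmt.Errorf("parsing %s: %w", path, err)
	}
	out := make(map[string]instanceType, len(list))
	for _, it := range list {
		out[it.Name] = it
	}
	return out, nil
}

func main() {
	// Stand-in for the static list compiled into the binary.
	fallback := map[string]instanceType{
		"m5.large": {Name: "m5.large", VCPU: 2, MemoryMb: 8192},
	}
	// Hypothetical mount point, e.g. populated by an initContainer or a ConfigMap.
	types, err := loadInstanceTypes("/etc/cluster-autoscaler/instance-types.json", fallback)
	if err != nil {
		panic(err)
	}
	fmt.Println("loaded", len(types), "instance types")
}
```

Because the mounted file would already contain only the extracted fields, the main container would never need to hold the full pricing document in memory.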