Autoscaler: CA does not work properly while using AWS EC2 IMDSv2 only in EKS

Created on 8 Oct 2020 · 8 Comments · Source: kubernetes/autoscaler

AWS EKS recently added support for EC2 Instance Metadata Service v2 (IMDSv2).

In my test environment, I created a worker node that allows IMDSv2 only, which requires token-backed sessions to access IMDS.

However, under this condition, CA does not seem able to unmarshal the instance identity document:

I1008 18:57:01.160950       1 aws_util.go:150] fetching http://169.254.169.254/latest/dynamic/instance-identity/document
..........
W1008 18:57:01.760556       1 aws_util.go:166] Error unmarshalling http://169.254.169.254/latest/dynamic/instance-identity/document, skip...

Checking the CA pod shows that it keeps getting OOMKilled and ends up in CrashLoopBackOff.

# kubectl get pod -n kube-system
NAME                                  READY   STATUS             RESTARTS   AGE
cluster-autoscaler-5b5489859f-2pkdt   0/1     CrashLoopBackOff   6          13m
# kubectl describe pod cluster-autoscaler-5b5489859f-2pkdt -n kube-system
Name:           cluster-autoscaler-5b5489859f-2pkdt
Namespace:      kube-system
Priority:       0
Node:           ip-172-31-23-13.ap-northeast-1.compute.internal/172.31.23.13
Start Time:     Thu, 08 Oct 2020 19:22:15 +0000
Labels:         app=cluster-autoscaler
                pod-template-hash=5b5489859f
Annotations:    kubernetes.io/psp: eks.privileged
                prometheus.io/port: 8085
                prometheus.io/scrape: true
Status:         Running
IP:             172.31.20.73
IPs:            <none>
Controlled By:  ReplicaSet/cluster-autoscaler-5b5489859f
Containers:
  cluster-autoscaler:
    Container ID:  docker://8cea864df872af960650f9f01061ca52e62855f680306238f75a12cbc798f8a5
    Image:         k8s.gcr.io/autoscaling/cluster-autoscaler:v1.15.7
    Image ID:      docker-pullable://k8s.gcr.io/autoscaling/cluster-autoscaler@sha256:6641a69b4ea5f911ccbb11b75b2675261d90bf169f612c9e960f60036336d664
    Port:          <none>
    Host Port:     <none>
    Command:
      ./cluster-autoscaler
      --v=4
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --expander=least-waste
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/LAB-EKS-15
      --balance-similar-node-groups
      --skip-nodes-with-system-pods=false
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 08 Oct 2020 19:32:27 +0000
      Finished:     Thu, 08 Oct 2020 19:33:06 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 08 Oct 2020 19:29:07 +0000
      Finished:     Thu, 08 Oct 2020 19:29:46 +0000

After switching back to IMDSv1, it works without issue, as follows:

I1008 19:05:20.256839       1 aws_util.go:150] fetching http://169.254.169.254/latest/dynamic/instance-identity/document
I1008 19:05:38.256216       1 aws_cloud_provider.go:380] Successfully load 354 EC2 Instance Types [u-9tb1 m5n.8xlarge z1d.12xlarge m5dn.12xlarge m5.12xlarge c5d.4xlarge c5d.xlarge r6g.2xlarge m4.4xlarge c5.24xlarge r3.8xlarge i3en.24xlarge i3.4xlarge a1.xlarge r5ad.large r5dn.metal x1e u-9tb1.metal m5dn.16xlarge r5n.4xlarge t3.small c5n.2xlarge m5ad.large t3.micro c5d.2xlarge c1.xlarge r5a.24xlarge t3.large r6g.metal r5a.xlarge c6g.xlarge i3en.metal g4dn.xlarge r6g.16xlarge c3.large i2.4xlarge r5d.xlarge t4g.small t3a.xlarge c3.8xlarge m5d.4xlarge r5ad.xlarge h1 c5d.18xlarge u-6tb1.metal p2.8xlarge m6g.2xlarge c5d.metal i3en.2xlarge 
........
I1008 19:05:44.609556       1 auto_scaling_groups.go:354] Regenerating instance to ASG map for ASGs: []
I1008 19:05:44.609579       1 aws_manager.go:266] Refreshed ASG list, next refresh after 2020-10-08 19:06:44.609574794 +0000 UTC m=+102.445561263
I1008 19:05:44.609801       1 main.go:271] Registered cleanup signal handler
I1008 19:05:44.610023       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I1008 19:05:44.610039       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 1.791µs
I1008 19:05:54.610015       1 static_autoscaler.go:187] Starting main loop
I1008 19:05:54.610119       1 utils.go:622] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I1008 19:05:54.610130       1 filter_out_schedulable.go:63] Filtering out schedulables
I1008 19:05:54.610168       1 filter_out_schedulable.go:80] No schedulable pods
I1008 19:05:54.610188       1 static_autoscaler.go:334] No unschedulable pods
I1008 19:05:54.610203       1 static_autoscaler.go:381] Calculating unneeded nodes

I suspect CA does not use token-backed sessions to access IMDS.

All 8 comments

Got hit with this too, EKS 1.17

We worked around this issue by injecting the AWS_REGION environment variable into the cluster-autoscaler container. Obviously this is not ideal (the real fix would be to add IMDSv2 support), but it works.

I was not able to work around this issue by injecting AWS_REGION or AWS_DEFAULT_REGION into the aws-cluster-autoscaler container. With the v1 metadata service [token optional], cluster-autoscaler does not error and has no issues.

Error log / behavior with IMDSv2 [token required]:

I1130 21:13:10.946968       1 aws_cloud_provider.go:371] Successfully load 392 EC2 Instance Types [...truncated...]
E1130 21:13:14.176281       1 aws_manager.go:262] Failed to regenerate ASG cache: cannot autodiscover ASGs: NoCredentialProviders: no valid providers in chain. Deprecated.
        For verbose messaging see aws.Config.CredentialsChainVerboseErrors
F1130 21:13:14.176302       1 aws_cloud_provider.go:376] Failed to create AWS Manager: cannot autodiscover ASGs: NoCredentialProviders: no valid providers in chain. Deprecated.
        For verbose messaging see aws.Config.CredentialsChainVerboseErrors

Here's our cluster-autoscaler Helm release [chart v9.1.0, setting awsRegion and autoDiscovery.clusterName], along with the attempt to set the environment variable:

resource "helm_release" "cluster_autoscaler" {
  depends_on = [
    module.eks, # Wait for cluster to be ready
  ]

  repository       = "https://kubernetes.github.io/autoscaler"
  chart            = "cluster-autoscaler"
  version          = "9.1.0"
  name             = "cluster-autoscaler"
  namespace        = "kube-system"

  values = [
    # Values set from terraform outputs
    <<EOL
awsRegion: ${module.eks.cluster_region}
autoDiscovery:
  clusterName: ${module.eks.cluster_name}
EOL
    ,
    # Workaround issue with IMDSv2
    # Inject AWS_DEFAULT_REGION into environment
    # https://github.com/kubernetes/autoscaler/issues/3592
    <<EOL
extraEnv:
  AWS_DEFAULT_REGION: ${module.eks.cluster_region}
EOL
    ,
  ] # End helm_release.values[]
}

and resulting pod description -- AWS_REGION is already set from the chart:

Name:         cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58-cm2d2
Namespace:    kube-system
Priority:     0
Node:         ip-10-100-1-57.us-west-2.compute.internal/10.100.1.57
Start Time:   Mon, 30 Nov 2020 13:06:38 -0800
Labels:       app.kubernetes.io/instance=cluster-autoscaler
              app.kubernetes.io/name=aws-cluster-autoscaler
              pod-template-hash=c4b7bdd58
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.100.0.110
IPs:
  IP:           10.100.0.110
Controlled By:  ReplicaSet/cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58
Containers:
  aws-cluster-autoscaler:
    Container ID:  docker://f91c44b21712ebcf385dfd687c5631dd44ceeb76d25afb765e6b9a5cfc43f96c
    Image:         us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1
    Image ID:      docker-pullable://us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler@sha256:1f5b11617389b8e4ce15eb45fdbbfd4321daeb63c234d46533449ab780b6ca9a
    Port:          8085/TCP
    Host Port:     0/TCP
    Command:
      ./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/kg-cet-917-staging-us-west-2
      --logtostderr=true
      --stderrthreshold=info
      --v=4
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Mon, 30 Nov 2020 13:10:10 -0800
      Finished:     Mon, 30 Nov 2020 13:10:16 -0800
    Ready:          False
    Restart Count:  5
    Liveness:       http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      AWS_REGION:  us-west-2
      AWS_DEFAULT_REGION:  us-west-2
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m43s                  default-scheduler  Successfully assigned kube-system/cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58-cm2d2 to ip-10-100-1-57.us-west-2.compute.internal
  Normal   Pulling    4m42s                  kubelet            Pulling image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1"
  Normal   Pulled     4m40s                  kubelet            Successfully pulled image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1"
  Warning  BackOff    2m52s (x9 over 4m10s)  kubelet            Back-off restarting failed container
  Normal   Created    2m38s (x5 over 4m40s)  kubelet            Created container aws-cluster-autoscaler
  Normal   Started    2m38s (x5 over 4m39s)  kubelet            Started container aws-cluster-autoscaler
  Normal   Pulled     2m38s (x4 over 4m16s)  kubelet            Container image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1" already present on machine

kubectl version:

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-12T01:09:16Z", GoVersion:"go1.15.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

helm version:

version.BuildInfo{Version:"v3.4.1", GitCommit:"c4e74854886b2efe3321e185578e6db9be0a6e29", GitTreeState:"dirty", GoVersion:"go1.15.4"}

I was not able to work around this issue by injecting the AWS_REGION or AWS_DEFAULT_REGION environment variable into the aws-cluster-autoscaler container either.

There are also other issues (#3276, #3216) related to loading the instance type list from the pricing API. So I upgraded to the latest version, 1.20, and added the --aws-use-static-instance-list=true flag. However, the container still keeps terminating with exit code 255 and ends up in CrashLoopBackOff.

Here is the log output with IMDSv2 [token required]:

$ kubectl -n kube-system logs deployment.apps/cluster-autoscaler
...
...
I0108 07:20:04.590454       1 reflector.go:255] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246
I0108 07:20:04.944164       1 cloud_provider_builder.go:29] Building aws cloud provider.
W0108 07:20:04.944198       1 aws_cloud_provider.go:349] Use static EC2 Instance Types and list could be outdated. Last update time: 2019-10-14
I0108 07:20:04.945035       1 reflector.go:219] Starting reflector *v1.PersistentVolumeClaim (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945051       1 reflector.go:255] Listing and watching *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945402       1 reflector.go:219] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945415       1 reflector.go:255] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945683       1 reflector.go:219] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945695       1 reflector.go:255] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945952       1 reflector.go:219] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945964       1 reflector.go:255] Listing and watching *v1.StatefulSet from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946231       1 reflector.go:219] Starting reflector *v1.PersistentVolume (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946242       1 reflector.go:255] Listing and watching *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946531       1 reflector.go:219] Starting reflector *v1.StorageClass (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946542       1 reflector.go:255] Listing and watching *v1.StorageClass from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946838       1 reflector.go:219] Starting reflector *v1.CSINode (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946850       1 reflector.go:255] Listing and watching *v1.CSINode from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.039201       1 reflector.go:219] Starting reflector *v1beta1.PodDisruptionBudget (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.039225       1 reflector.go:255] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.539276       1 reflector.go:219] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.539475       1 reflector.go:255] Listing and watching *v1.Service from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543333       1 reflector.go:219] Starting reflector *v1.ReplicationController (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543349       1 reflector.go:255] Listing and watching *v1.ReplicationController from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543835       1 reflector.go:219] Starting reflector *v1.ReplicaSet (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543850       1 reflector.go:255] Listing and watching *v1.ReplicaSet from k8s.io/client-go/informers/factory.go:134
$ kubectl get po -A -w | grep "cluster"
kube-system         cluster-autoscaler-bcbc77bc7-lcsf5              1/1     Running   0          2m7s
kube-system         cluster-autoscaler-bcbc77bc7-lcsf5              0/1     Error     0          2m21s
kube-system         cluster-autoscaler-bcbc77bc7-lcsf5              1/1     Running   1          2m23s
$ kubectl -n kube-system describe po cluster-autoscaler-bcbc77bc7-lcsf5
Name:         cluster-autoscaler-bcbc77bc7-lcsf5
Namespace:    kube-system
Priority:     0
Node:         ip-192-168-33-189.ap-northeast-1.compute.internal/192.168.33.189
Start Time:   Fri, 08 Jan 2021 07:19:44 +0000
Labels:       app=cluster-autoscaler
              pod-template-hash=bcbc77bc7
Annotations:  kubectl.kubernetes.io/restartedAt: 2021-01-08T05:40:22Z
              kubernetes.io/psp: eks.privileged
              prometheus.io/port: 8085
              prometheus.io/scrape: true
Status:       Running
IP:           192.168.43.50
IPs:
  IP:           192.168.43.50
Controlled By:  ReplicaSet/cluster-autoscaler-bcbc77bc7
Containers:
  cluster-autoscaler:
    Container ID:  docker://2f0a7f6f1f514c0c75c75499020e788886da125fe1c865cebd0647bb3bf95a64
    Image:         k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0
    Image ID:      docker-pullable://k8s.gcr.io/autoscaling/cluster-autoscaler@sha256:1c19fa17b29db548d0304e9444adf84e8a6f38ee4c0a12d2ecaf262cb10c0e50
    Port:          <none>
    Host Port:     <none>
    Command:
      ./cluster-autoscaler
      --v=4
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --expander=least-waste
      --aws-use-static-instance-list=true
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/EKS-LAB
    State:          Running
      Started:      Fri, 08 Jan 2021 07:22:07 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 08 Jan 2021 07:19:46 +0000
      Finished:     Fri, 08 Jan 2021 07:22:05 +0000
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     100m
      memory:  300Mi
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      AWS_REGION:                   ap-northeast-1
      AWS_DEFAULT_REGION:           ap-northeast-1
      AWS_ROLE_ARN:                 arn:aws:iam::561333300361:role/eksctl-EKS-LAB-addon-iamserviceaccount-kube-Role1-ZKVBFVVOBNUX
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-token-vkd8b (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl/certs/ca-bundle.crt
    HostPathType:
  cluster-autoscaler-token-vkd8b:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-autoscaler-token-vkd8b
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  ng=console
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age                  From               Message
  ----    ------     ----                 ----               -------
  Normal  Scheduled  3m37s                default-scheduler  Successfully assigned kube-system/cluster-autoscaler-bcbc77bc7-lcsf5 to ip-192-168-33-189.ap-northeast-1.compute.internal
  Normal  Pulling    77s (x2 over 3m37s)  kubelet            Pulling image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0"
  Normal  Pulled     76s (x2 over 3m36s)  kubelet            Successfully pulled image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0"
  Normal  Created    76s (x2 over 3m36s)  kubelet            Created container cluster-autoscaler
  Normal  Started    75s (x2 over 3m36s)  kubelet            Started container cluster-autoscaler

After rolling back to a worker node with IMDSv1:

$ kubectl -n kube-system logs deployment.apps/cluster-autoscaler
...
...
I0108 07:15:03.847604       1 reflector.go:255] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:134
I0108 07:15:03.847633       1 reflector.go:219] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:15:03.847640       1 reflector.go:255] Listing and watching *v1.StatefulSet from k8s.io/client-go/informers/factory.go:134
I0108 07:15:03.943862       1 request.go:591] Throttling request took 96.568619ms, request: GET:https://10.100.0.1:443/api/v1/persistentvolumes?limit=500&resourceVersion=0
I0108 07:15:04.243872       1 request.go:591] Throttling request took 396.383321ms, request: GET:https://10.100.0.1:443/api/v1/pods?limit=500&resourceVersion=0
I0108 07:15:07.069368       1 auto_scaling_groups.go:351] Regenerating instance to ASG map for ASGs: [eks-96bb7009-0e0a-3450-075d-3c7ed43c94e6]
I0108 07:15:07.180416       1 auto_scaling.go:199] 1 launch configurations already in cache
I0108 07:15:07.180443       1 auto_scaling_groups.go:136] Registering ASG eks-96bb7009-0e0a-3450-075d-3c7ed43c94e6
I0108 07:15:07.180456       1 aws_manager.go:269] Refreshed ASG list, next refresh after 2021-01-08 07:16:07.180451669 +0000 UTC m=+81.757019680
I0108 07:15:07.180599       1 main.go:279] Registered cleanup signal handler
I0108 07:15:07.180643       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0108 07:15:07.180654       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 4.43µs
I0108 07:15:17.180736       1 static_autoscaler.go:229] Starting main loop
W0108 07:15:17.181232       1 clusterstate.go:436] AcceptableRanges have not been populated yet. Skip checking
I0108 07:15:17.181367       1 filter_out_schedulable.go:65] Filtering out schedulables
I0108 07:15:17.181381       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0108 07:15:17.181390       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0108 07:15:17.181397       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0108 07:15:17.181464       1 filter_out_schedulable.go:82] No schedulable pods
I0108 07:15:17.181490       1 static_autoscaler.go:402] No unschedulable pods
I0108 07:15:17.181509       1 static_autoscaler.go:449] Calculating unneeded nodes

Hi contributors @mwielgus @losipiuk @aleksandra-malinowska @bskiba. This is preventing EKS clusters from being moved to IMDSv2, so can this issue be prioritized? I suspect CA does not use token-backed sessions to access IMDS. The CA pod keeps getting OOMKilled and ends up in CrashLoopBackOff. Thank you.

It appears there are multiple symptoms here.

  1. OOMKill
  2. CrashLoop NoCredentialProviders: no valid providers in chain.

My guess is that (1) is a spurious error, but it's difficult to tell. @hans72118, can you follow up with your memory settings? I'll take a look at how IMDSv2 works and what the path forward is here to make sure CAS can use these tokens.

It should be possible to skip this logic by using --aws-use-static-instance-list=true https://github.com/kubernetes/autoscaler/blob/43ab0309697271e6b2ad82dd4fc3a28132456399/cluster-autoscaler/main.go#L175

Alternatively, it should be possible to skip by including the AWS_REGION environment variable:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_util.go#L155
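
To make the suggestion concrete, here is a minimal Go sketch of the lookup order being described: use AWS_REGION when it is set, and only otherwise fall back to the instance identity document. This is an illustration under assumptions, not the code at the link above. resolveRegion and identityDocument are hypothetical names, and the metadata fetch is passed in as a function so the example runs without an EC2 instance.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// identityDocument (hypothetical type) holds the one field this sketch needs
// from the instance identity document.
type identityDocument struct {
	Region string `json:"region"`
}

// resolveRegion (hypothetical helper) returns AWS_REGION when set, so the
// metadata service is never contacted; otherwise it parses the document
// returned by fetchDocument.
func resolveRegion(getenv func(string) string, fetchDocument func() ([]byte, error)) (string, error) {
	if region := getenv("AWS_REGION"); region != "" {
		return region, nil
	}
	raw, err := fetchDocument()
	if err != nil {
		return "", fmt.Errorf("fetching identity document: %w", err)
	}
	var doc identityDocument
	if err := json.Unmarshal(raw, &doc); err != nil {
		return "", fmt.Errorf("unmarshalling identity document: %w", err)
	}
	if doc.Region == "" {
		return "", errors.New("identity document has no region")
	}
	return doc.Region, nil
}

func main() {
	// Assumption for the demo: a tokenless GET against a token-required
	// endpoint yields an empty body, which cannot be unmarshalled.
	emptyBody := func() ([]byte, error) { return []byte(""), nil }
	region, err := resolveRegion(os.Getenv, emptyBody)
	fmt.Println(region, err)
}
```

With AWS_REGION exported, the fallback is never invoked. That matches the workaround reported earlier in the thread, although it would not help with any other code path that still depends on IMDS, such as credential lookup.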

@focaaby, it's not clear from your logs or describe pods that this wasn't working for you. Looks like the CA started up normally and populated all listers/watchers?
