Autoscaler: Annotation cluster-autoscaler.kubernetes.io/safe-to-evict=false not working

Created on 3 Apr 2019 · 25 Comments · Source: kubernetes/autoscaler

I recently added the following annotation to some of my critical pods to prevent cluster-autoscaler from removing the nodes these pods run on: cluster-autoscaler.kubernetes.io/safe-to-evict=false
According to the documentation (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node), this should prevent cluster-autoscaler from removing the node. However, I have just seen cluster-autoscaler delete one of the nodes running a pod with this annotation.

My Kubernetes version is 1.11 and the cluster-autoscaler version is 1.3.7
According to this PR (https://github.com/kubernetes/autoscaler/pull/1054), this annotation should work as expected.

Could someone help me identify why cluster-autoscaler does not take this annotation into consideration?

Thanks!

cluster-autoscaler

mmingorance-dh

Most helpful comment

@bskiba Yes, that was exactly the problem.

mmingorance-dh on 2 May 2019

All 25 comments

I think it might be related to this: https://github.com/kubernetes/autoscaler/pull/1054/files#diff-923217ef419d47bcff68eeeca564a78dR255
It looks like the annotation is only looked up on ReplicationControllers and Jobs. In my case I'm adding this annotation to pods created by Deployments.

Is there any chance to include Deployments when searching for this annotation?

It is also worth mentioning that in my case these pods are running in different namespaces, not directly in default.

mmingorance-dh on 3 Apr 2019

I think in the meantime a workaround could be adding the following annotation to those pods: scheduler.alpha.kubernetes.io/critical-pod: ""
However, this annotation is deprecated in Kubernetes 1.13 and removed in 1.14.
I still think it would be great to have cluster-autoscaler.kubernetes.io/safe-to-evict=false working for pods managed by Deployments in multiple namespaces.

mmingorance-dh on 3 Apr 2019

/assign @Jeffwan

Jeffwan on 3 Apr 2019

@mmingorance-dh Could you tell me whether you annotated your Deployment or the pod template?

You need to make sure you annotate the right object, i.e. the pod template:

apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: cpu
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
    spec:
      containers:
      - name: nginx
        image: nginx:1.8 # Update the version of nginx from 1.7.9 to 1.8
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 1
          requests:
            cpu: 1
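
To make the "right object" point concrete, here is a minimal sketch that reports where the annotation sits in a Deployment manifest already loaded as a Python dict. The function name `safe_to_evict_locations` is hypothetical and not part of any Kubernetes or cluster-autoscaler tooling. Only annotations under spec.template.metadata are copied onto the pods; annotations on the Deployment's own metadata are not.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def safe_to_evict_locations(deployment: dict) -> list:
    """Return the manifest paths where the safe-to-evict annotation is set.

    Hypothetical helper for illustration; `deployment` is a Deployment
    manifest parsed into a plain dict (for example from YAML or JSON).
    """
    found = []

    # Annotations on the Deployment object itself are NOT propagated to pods.
    top = (deployment.get("metadata") or {}).get("annotations") or {}
    if SAFE_TO_EVICT in top:
        found.append("metadata.annotations")

    # Annotations on the pod template ARE copied onto every pod it creates.
    template = (deployment.get("spec") or {}).get("template") or {}
    pod = (template.get("metadata") or {}).get("annotations") or {}
    if SAFE_TO_EVICT in pod:
        found.append("spec.template.metadata.annotations")

    return found
```

If the result contains only "metadata.annotations", the pods themselves carry no annotation and cluster-autoscaler has nothing to find on them.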

Jeffwan on 3 Apr 2019

Then it is detected correctly:

I0330 23:45:14.848048   79263 cluster.go:95] Fast evaluation: node ip-192-168-89-117.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: cpu-7b7fb654fb-tnhx6
I0330 23:45:14.848070   79263 scale_down.go:490] 1 nodes found to be unremovable in simulation, will re-check them at 2019-03-30 23:50:14.201538 -0700 PDT m=+311.495845672

Jeffwan on 3 Apr 2019

@Jeffwan this is the definition of my deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
  creationTimestamp: 2019-02-13T15:54:19Z
  generation: 4
  labels:
    eks.amazonaws.com/component: coredns
    k8s-app: kube-dns
    kubernetes.io/name: CoreDNS
  name: coredns
  namespace: kube-system
  resourceVersion: "25237845"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/deployments/coredns
  uid: 9fd8827b-2fa7-11e9-8689-0a492027d742
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      eks.amazonaws.com/component: coredns
      k8s-app: kube-dns
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        prometheus.io/path: /metrics
        prometheus.io/port: "9153"
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        eks.amazonaws.com/component: coredns
        k8s-app: kube-dns
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: beta.kubernetes.io/os
                operator: In
                values:
                - linux
              - key: beta.kubernetes.io/arch
                operator: In
                values:
                - amd64
      containers:
      - args:
        - -conf
        - /etc/coredns/Corefile
        image: 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.1.3
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10

This deployment is running in kube-system namespace.

Also I can see the annotation in the pod:

Name:               coredns-5c9b578b9c-hsb8z
Namespace:          kube-system
Priority:           2000001000
PriorityClassName:  system-node-critical
Node:               ip-10-0-110-169.eu-central-1.compute.internal/10.0.110.169
Start Time:         Wed, 03 Apr 2019 01:54:30 +0200
Labels:             eks.amazonaws.com/component=coredns
                    k8s-app=kube-dns
                    pod-template-hash=1756134657
Annotations:        cluster-autoscaler.kubernetes.io/safe-to-evict=false
                    prometheus.io/path=/metrics
                    prometheus.io/port=9153
Status:             Running
IP:                 10.0.105.244
Controlled By:      ReplicaSet/coredns-5c9b578b9c
Containers:
  coredns:
    Container ID:  docker://238b834e0ddfb9aea72f589a6ce00fad3c3055c1ca935601185b2e63ddeff3c5
    Image:         602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.1.3
    Image ID:      docker-pullable://602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns@sha256:323985b6fd818047898e9154a47e438acf52cca8e60eda0c9acc9e5e2bec0914
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile

Thanks for your help!

mmingorance-dh on 3 Apr 2019

@Jeffwan I've just found a log line from cluster-autoscaler with the right value:
Fast evaluation: node $NODE-NAME cannot be removed: pod annotated as not safe to evict present: coredns-ddcd48977-thsxm

It appeared after I redeployed cluster-autoscaler with --skip-nodes-with-system-pods=true
Does this flag have anything to do with it?
Or was it because I created a new pod? Maybe the previous pod was not picking up that annotation for whatever reason.

mmingorance-dh on 3 Apr 2019

Actually, your log line comes from "cluster-autoscaler.kubernetes.io/safe-to-evict=false".
The check for --skip-nodes-with-system-pods=true runs before the safe-to-evict annotation check.

Here's the logic.
https://github.com/kubernetes/autoscaler/blob/40cf6e43c032815350c07cd3e7eb36a992fd785b/cluster-autoscaler/utils/drain/drain.go#L187-L201

Your coredns pod is in the kube-system namespace, so technically both ways work for coredns pods, but --skip-nodes-with-system-pods=true will also take effect on other critical pods.

I can help with coreDNS testing. BTW, since that's a core component, I'm just wondering: did you manually annotate the pods in the deployment after your cluster started up?

Jeffwan on 3 Apr 2019

Yes, the coreDNS deployment ships with EKS by default and I'm patching this deployment to add the annotation.
I'm testing 2 different scenarios in my clusters right now. In my production clusters I have the flag --skip-nodes-with-system-pods=true together with cluster-autoscaler.kubernetes.io/safe-to-evict=false. In my staging and QA clusters I have --skip-nodes-with-system-pods=false with the annotation cluster-autoscaler.kubernetes.io/safe-to-evict=false included in the pods.
So far I can see in my QA and staging clusters that cluster-autoscaler is respecting the annotation and ignoring the node: Fast evaluation: node $NODE_NAME cannot be removed: pod annotated as not safe to evict present: coredns-5c9b578b9c-nz5jt

However, I'm not seeing this in my production clusters, or at least not yet. It could be because those nodes are still being utilised, or it could be because --skip-nodes-with-system-pods=true prevents cluster-autoscaler from running this function:
https://github.com/kubernetes/autoscaler/blob/40cf6e43c032815350c07cd3e7eb36a992fd785b/cluster-autoscaler/utils/drain/drain.go#L187-L201

I'll post tomorrow morning at 11:00 CEST again and let you know the results in production.

Thanks for your help!

mmingorance-dh on 3 Apr 2019

OK, after testing overnight, it seems that the annotation is found and the cluster doesn't scale the node down.
I just redeployed cluster-autoscaler with --skip-nodes-with-system-pods=false in all my environments and I'll keep watching the logs during low-traffic periods, when the cluster is supposed to shrink.

What I saw, and what I think was the problem, is that cluster-autoscaler didn't identify the annotation while its previous pod was running, the one that was there before I patched the deployment to add the annotation.
For whatever reason, once I replaced the cluster-autoscaler pod after patching the deployment, cluster-autoscaler could find the annotation and skip the nodes.

mmingorance-dh on 4 Apr 2019

The same thing happened again 17 minutes ago.
A node was removed by cluster-autoscaler without respecting the annotation on my coreDNS pod.
I don't know how to troubleshoot this.

mmingorance-dh on 4 Apr 2019

I can't understand how cluster-autoscaler recognised the annotation in my QA cluster but not in my production and staging clusters.
All of them are running Kubernetes 1.11 and cluster-autoscaler 1.3.7.

I added --skip-nodes-with-system-pods=true in my production clusters until we can find a solution to this. Please bear this in mind when troubleshooting.

mmingorance-dh on 4 Apr 2019

Hmm, interesting. That means it was working at first in your production environment, then stopped working, and the coredns node was removed.

If you can fetch the logs from when the coredns node was scaled down, that would be super helpful. I will also run a test and keep CA running for a while.

Jeffwan on 4 Apr 2019

Actually, I never saw the message "cannot be removed: pod annotated as not safe to evict present:" in my production clusters, only in QA, which changes size more often.
But the configuration was the same in all of the clusters.

mmingorance-dh on 4 Apr 2019

It makes sense that you didn't see it in your prod cluster. --skip-nodes-with-system-pods=true is the default even if you don't specify it. Users sometimes want CA to be able to evict pods under kube-system, and then they override it with --skip-nodes-with-system-pods=false.

cluster-autoscaler.kubernetes.io/safe-to-evict=false was not designed for system-level pods. It is commonly used for other pods that must not be interrupted.

If coredns is the pod you want to protect, you don't need any extra settings. In your production cluster, you will see:

I0404 11:25:13.014475   36902 cluster.go:107] Fast evaluation: node ip-192-168-6-231.us-west-2.compute.internal cannot be removed: non-daemonset, non-mirrored, non-pdb-assigned kube-system pod present: coredns-79ddd9f658-pp7pk

You will only see the following if you use safe-to-evict=false together with --skip-nodes-with-system-pods=false, which doesn't make sense, as I said:

 Fast evaluation: node ip-192-168-6-231.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: coredns-79ddd9f658-pp7pk

Jeffwan on 4 Apr 2019

I ran one test with cluster-autoscaler.kubernetes.io/safe-to-evict=false and --skip-nodes-with-system-pods=false.

3 coredns pods are spread across 3 nodes in two node groups. None of the nodes was removed in the past two hours. They are simply recognized as unremovable every 5 minutes.

I0404 20:51:29.979804       1 utils.go:498] Skipping ip-192-168-28-143.us-west-2.compute.internal - node group min size reached
I0404 20:51:29.979858       1 scale_down.go:404] Node ip-192-168-24-179.us-west-2.compute.internal - utilization 0.026250
I0404 20:51:29.979873       1 scale_down.go:404] Node ip-192-168-6-231.us-west-2.compute.internal - utilization 0.038750
I0404 20:51:29.979926       1 scale_down.go:453] Finding additional 2 candidates for scale down.
I0404 20:51:29.979955       1 cluster.go:81] Fast evaluation: ip-192-168-24-179.us-west-2.compute.internal for removal
I0404 20:51:29.979968       1 cluster.go:95] Fast evaluation: node ip-192-168-24-179.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: coredns-79ddd9f658-vxkxc
I0404 20:51:29.979977       1 cluster.go:81] Fast evaluation: ip-192-168-6-231.us-west-2.compute.internal for removal
I0404 20:51:29.979985       1 cluster.go:95] Fast evaluation: node ip-192-168-6-231.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: coredns-79ddd9f658-pp7pk
I0404 20:51:29.979995       1 scale_down.go:490] 2 nodes found to be unremovable in simulation, will re-check them at 2019-04-04 20:56:29.696490772 +0000 UTC m=+8151.898305634

Here's the pod setting.

Name:               coredns-79ddd9f658-pp7pk
Namespace:          kube-system
Priority:           2000001000
PriorityClassName:  system-node-critical
Node:               ip-192-168-6-231.us-west-2.compute.internal/192.168.6.231
Start Time:         Thu, 04 Apr 2019 11:24:41 -0700
Labels:             eks.amazonaws.com/component=coredns
                    k8s-app=kube-dns
                    pod-template-hash=79ddd9f658
Annotations:        cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status:             Running
IP:                 192.168.14.155
Controlled By:      ReplicaSet/coredns-79ddd9f658

So far, I don't see any issues with using cluster-autoscaler.kubernetes.io/safe-to-evict=false even when --skip-nodes-with-system-pods=false is configured.

Jeffwan on 4 Apr 2019

That's interesting because in my QA cluster I can see the logs as expected:
I0404 23:50:50.339425 1 aws_manager.go:148] Refreshed ASG list, next refresh after 2019-04-04 23:51:00.339418497 +0000 UTC m=+119934.687313817
I0404 23:50:50.339562 1 utils.go:541] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0404 23:50:50.339575 1 static_autoscaler.go:260] Filtering out schedulables
I0404 23:50:50.339851 1 static_autoscaler.go:270] No schedulable pods
I0404 23:50:50.339864 1 static_autoscaler.go:274] No unschedulable pods
I0404 23:50:50.339878 1 static_autoscaler.go:315] Calculating unneeded nodes
I0404 23:50:50.339897 1 utils.go:498] Skipping $NODE_NAME - node group min size reached
I0404 23:50:50.339910 1 utils.go:498] Skipping $NODE_NAME - node group min size reached
I0404 23:50:50.339924 1 utils.go:498] Skipping $NODE_NAME - node group min size reached
I0404 23:50:50.340098 1 scale_down.go:373] Scale-down calculation: ignoring 3 nodes unremovable in the last 5m0s
I0404 23:50:50.340115 1 scale_down.go:404] Node $NODE_NAME_WITH_POD_ANNOTATED - utilization 0.152606
I0404 23:50:50.340286 1 scale_down.go:453] Finding additional 1 candidates for scale down.
I0404 23:50:50.340419 1 cluster.go:81] Fast evaluation: $NODE_NAME_WITH_POD_ANNOTATED for removal
I0404 23:50:50.340434 1 cluster.go:95] Fast evaluation: node $NODE_NAME_WITH_POD_ANNOTATED cannot be removed: pod annotated as not safe to evict present: coredns-597c9769b6-rth7b

I don't understand why this annotation is not being taken into consideration in my staging and production clusters when it is in my QA cluster.

mmingorance-dh on 5 Apr 2019

Could you list your CA configuration for the three environments? Are they the same?

Jeffwan on 5 Apr 2019

Deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: {{ include "name" . | quote }}
  namespace: kube-system
  labels:
    app: {{ include "name" . | quote }}
    {{- include "labels-standard" . | indent 4 }}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: {{ include "name" . | quote}}
  template:
    metadata:
      labels:
        app: {{ include "name" . | quote}}
        {{- include "labels-standard" . | indent 8 }}
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
    spec:
      tolerations:
      {{- if not .Values.eks }}
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      nodeSelector:
        kubernetes.io/role: master
      {{- end}}
      {{- if .Values.rbac.create }}
      serviceAccountName: {{ include "name" . }}
      {{- end}}
      containers:
        - image: "{{ .Values.deployment.image.repository }}/{{ .Values.deployment.image.imageName }}:{{ .Values.deployment.image.imageVersion }}"
          name: {{ include "name" . | quote}}
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./{{ include "name" . }}
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --skip-nodes-with-system-pods=false
            - --scale-down-enabled={{ default "false" .Values.scaleDownEnabled }}
            - --scale-down-delay-after-add={{ default "1h" .Values.scaleDownDelay }}
            - --scale-down-unneeded-time={{ default "1h" .Values.scaleDownUnneededTime }}
            - --scale-down-utilization-threshold={{ default "0.5" .Values.scaleDownUtilizationThreshold }}
            - --balance-similar-node-groups=true
            - --max-node-provision-time=5m0s
            - --expander={{ default "most-pods" .Values.expander }}
            - --max-node-provision-time=7m
            - --expendable-pods-priority-cutoff=-10
            - --node-group-auto-discovery=asg:tag=cluster-autoscaler/auto-discovery/enabled,kubernetes.io/cluster/{{ .Values.clusterName }}
            {{- if .Values.utilization }}
            - --ignore-daemonsets-utilization={{ default "false" .Values.utilization.ignoreDaemonsets }}
            {{- end }}
          env:
            - name: AWS_REGION
              value: {{ .Values.awsRegion }}
          volumeMounts:
            - name: ssl-certs
              {{- if .Values.eks }}
              mountPath: {{ .Values.eks.sslCertPath }}
              {{- else}}
              mountPath: /etc/ssl/certs/ca-certificates.crt
              {{- end}}
              readOnly: true
          imagePullPolicy: "IfNotPresent"
      tolerations:
      {{- if not .Values.eks }}
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      {{- end }}
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      {{- if not .Values.eks }}
      nodeSelector:
        kubernetes.io/role: master
      {{- end}}
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-certificates.crt"
      dnsPolicy: "Default"

Values file for Prod:

env: prod

deployment:
  image:
    imageVersion: "v1.3.7"

clusterName: "$CLUSTER_NAME"

scaleDownUtilizationThreshold: 0.5
awsRegion: eu-central-1
scaleDownDelay: 10m
scaleDownUnneededTime: 30m
scaleDownEnabled: true

rbac:
  create: true

utilization: {}
#  ignoreDaemonsets: {}

eks:
  sslCertPath: /etc/kubernetes/pki/ca.crt

Values file QA:

env: qa

deployment:
  image:
    imageVersion: "v1.3.7"

clusterName: "$CLUSTER_NAME"

scaleDownUtilizationThreshold: 0.6
awsRegion: eu-central-1
scaleDownDelay: 20m
scaleDownUnneededTime: 10m
scaleDownEnabled: true

rbac:
  create: true

utilization: {}
#  ignoreDaemonsets: {}

eks:
  sslCertPath: /etc/kubernetes/pki/ca.crt

Everything is the same.

mmingorance-dh on 5 Apr 2019

I just saw my production cluster lose another node running coreDNS, even though I have --skip-nodes-with-system-pods=true set.
Also, I can see in the CA logs that another pod is being recognised:
Fast evaluation: node $NODE_NAME cannot be removed: non-daemonset, non-mirrored, non-pdb-assigned kube-system pod present: coredns-54f4cff84d-brdjb

So it looks like this if statement is not always working: https://github.com/kubernetes/autoscaler/blob/77ebb22fea9bcfaeccf7ce60be51b7e78e294cb0/cluster-autoscaler/utils/drain/drain.go#L183

mmingorance-dh on 5 Apr 2019

CA doesn't look for all the reasons why it can't delete a node: as soon as one pod can't be moved, it won't bother checking the other pods (since the node can't be deleted anyway). So if there are multiple system pods on a single node, only one of them will be logged and the others will likely never show up in the logs. The fact that the log lines are missing is therefore not an issue by itself.

Similarly, --skip-nodes-with-system-pods=true doesn't prevent safe-to-evict from working. It just prevents the autoscaler from ever logging it: your pod was already disqualified by a different check (kube-system namespace), so CA never bothers to check for the annotation, since the pod can't be moved anyway.

Bottom line - whether the line shows in logs or not is random and doesn't mean the feature is not working.
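
The short-circuit behaviour described above can be sketched as follows. This is a simplified illustration, not the actual drain.go code: the `Pod` class and both function names are made up for the example, the real kube-system check also exempts daemonset, mirror and PDB-covered pods, and the check order is taken from this thread rather than verified against every CA version.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


@dataclass
class Pod:
    # Hypothetical stand-in for a Kubernetes pod object.
    name: str
    namespace: str
    annotations: dict = field(default_factory=dict)


def blocking_reason(pod: Pod, skip_nodes_with_system_pods: bool) -> Optional[str]:
    """Return the first reason this pod blocks node removal, or None."""
    # The kube-system check runs first (simplified: exemptions are omitted).
    if skip_nodes_with_system_pods and pod.namespace == "kube-system":
        return "non-daemonset, non-mirrored, non-pdb-assigned kube-system pod present"
    # The annotation check is only reached if the check above did not fire.
    if pod.annotations.get(SAFE_TO_EVICT) == "false":
        return "pod annotated as not safe to evict present"
    return None


def first_blocker(pods: List[Pod], skip_nodes_with_system_pods: bool) -> Optional[str]:
    """Stop at the first blocking pod, so only one reason is ever reported."""
    for pod in pods:
        reason = blocking_reason(pod, skip_nodes_with_system_pods)
        if reason is not None:
            return f"{reason}: {pod.name}"
    return None
```

For an annotated coredns pod in kube-system, the reported reason changes with the flag while the outcome (node kept) stays the same, which matches the two different log messages quoted earlier in the thread.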

The way to debug your issue would be to search the logs for the node that disappeared and find out what happened to it. If it was scaled down by CA, there will be log lines describing the scale-down. My guess would be that it was either removed by something else, or was removed after the node was drained for an unrelated reason (e.g. nodecontroller draining an unready node).
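
As a starting point for that kind of log search, here is a minimal sketch that pulls the "cannot be removed" verdicts out of CA log lines, optionally for a single node. The regular expression is based only on the log lines quoted in this thread (CA 1.3.x at --v=4); the function name is hypothetical and other versions may format these lines differently.

```python
import re

# Matches lines such as:
#   ... Fast evaluation: node <node> cannot be removed: <reason>: <pod>
UNREMOVABLE = re.compile(
    r"Fast evaluation: node (?P<node>\S+) cannot be removed: "
    r"(?P<reason>.+): (?P<pod>\S+)$"
)


def unremovable_events(lines, node=None):
    """Return (node, reason, pod) tuples for matching log lines.

    If `node` is given, only verdicts for that node name are returned.
    """
    events = []
    for line in lines:
        match = UNREMOVABLE.search(line.strip())
        if match and (node is None or match.group("node") == node):
            events.append(
                (match.group("node"), match.group("reason"), match.group("pod"))
            )
    return events
```

An empty result for a node that later vanished would suggest looking outside CA, for example at the cloud provider's own activity history.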

MaciekPytel on 5 Apr 2019

@MaciekPytel Thanks for the info.
In fact I could see in the logs of cluster-autoscaler how the specific node running the coreDNS pod was removed by cluster-autoscaler due to low utilization.
I'll check what else could be affecting this, but it looks weird to me that cluster-autoscaler decided by itself to remove that node when running a pod from kube-system and --skip-nodes-with-system-pods=true

I'll keep checking.
Thanks for your help!

mmingorance-dh on 8 Apr 2019

We found the problem in our case.
After saving all logs from cluster-autoscaler in our logging system, we could investigate in more depth, and it turned out to be an issue caused by AZRebalance on the EC2 Auto Scaling group.
Sometimes, after cluster-autoscaler started to terminate instances, we ended up with more machines in one AZ than in the others. AWS then started to terminate instances in that AZ to rebalance the number of machines per AZ, and since this is an EC2-side process, any machine could be terminated regardless of what was running on it. That was exactly the reason why instances running coreDNS pods were being terminated.
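
One possible mitigation is to suspend the AZRebalance process on the Auto Scaling group; the thread does not say whether this is what was done here. The sketch below assumes a boto3 Auto Scaling client is passed in. `suspend_az_rebalance` and the group name "my-asg" are hypothetical; `suspend_processes` with `AutoScalingGroupName` and `ScalingProcesses` is the boto3 call it wraps.

```python
def suspend_az_rebalance(autoscaling_client, asg_name: str) -> None:
    """Suspend only the AZRebalance process on one Auto Scaling group.

    `autoscaling_client` is expected to behave like
    boto3.client("autoscaling"); it is injected so that the function
    can be exercised without AWS credentials.
    """
    autoscaling_client.suspend_processes(
        AutoScalingGroupName=asg_name,
        ScalingProcesses=["AZRebalance"],
    )


# Example usage (requires boto3 and AWS credentials; "my-asg" is a placeholder):
#   import boto3
#   suspend_az_rebalance(boto3.client("autoscaling"), "my-asg")
```

With AZRebalance suspended, the group no longer terminates instances just to even out zones, at the cost of possibly staying unbalanced across AZs.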

Sorry for the inconvenience of this issue and thanks a lot for your support.

mmingorance-dh on 2 May 2019

Thanks for the follow-up!
The behaviour seems to be documented here: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#common-notes-and-gotchas
Is this in line with what you experienced?