I recently added the following annotation to some of my critical pods to prevent cluster-autoscaler from removing the nodes these pods run on: cluster-autoscaler.kubernetes.io/safe-to-evict=false
According to the documentation (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node), this should prevent cluster autoscaler from removing the node. However, I've just seen cluster autoscaler delete one of the nodes running a pod with this annotation.
My Kubernetes version is 1.11 and the cluster-autoscaler version is 1.3.7
According to this PR (https://github.com/kubernetes/autoscaler/pull/1054), this annotation should work as expected.
Could someone help me identify why cluster-autoscaler does not take this annotation into consideration?
Thanks!
I think it might be related to this: https://github.com/kubernetes/autoscaler/pull/1054/files#diff-923217ef419d47bcff68eeeca564a78dR255
This annotation is only searched in ReplicationControllers and Jobs. In my case I'm adding this annotation in pods created by deployments.
Is there any chance to include deployments when searching for this annotation?
It is also worth mentioning that in my case these pods run in several different namespaces, not only in default.
I think that in the meantime a workaround could be to add the following annotation to those pods: scheduler.alpha.kubernetes.io/critical-pod: ""
However, this annotation is deprecated in Kubernetes 1.13 and removed in 1.14.
I still think it would be great to have cluster-autoscaler.kubernetes.io/safe-to-evict=false working for pods managed by deployments in multiple namespaces.
/assign @Jeffwan
@mmingorance-dh Could you tell me whether you annotated your deployment or your containers?
You need to make sure you annotate the right objects.
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: cpu
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
    spec:
      containers:
      - name: nginx
        image: nginx:1.8 # Update the version of nginx from 1.7.9 to 1.8
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 1
          requests:
            cpu: 1
With the annotation on the pod template, it is detected correctly:
I0330 23:45:14.848048 79263 cluster.go:95] Fast evaluation: node ip-192-168-89-117.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: cpu-7b7fb654fb-tnhx6
I0330 23:45:14.848070 79263 scale_down.go:490] 1 nodes found to be unremovable in simulation, will re-check them at 2019-03-30 23:50:14.201538 -0700 PDT m=+311.495845672
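To check which object actually carries the annotation, the manifest can be inspected as plain data. The following is a minimal sketch, assuming the Deployment has already been loaded into a Python dict (for example from the JSON output of kubectl); the helper name `annotation_locations` is hypothetical and not part of any Kubernetes tooling. Only annotations under spec.template.metadata are copied onto the pods.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def annotation_locations(deployment: dict) -> dict:
    """Report where the safe-to-evict annotation is set in a Deployment manifest."""
    # Annotations on the Deployment object itself are NOT propagated to pods.
    deployment_annotations = deployment.get("metadata", {}).get("annotations") or {}
    # Annotations on the pod template end up on every pod the Deployment creates.
    template_metadata = deployment.get("spec", {}).get("template", {}).get("metadata", {})
    template_annotations = template_metadata.get("annotations") or {}
    return {
        "deployment": SAFE_TO_EVICT in deployment_annotations,
        "pod_template": SAFE_TO_EVICT in template_annotations,
    }


# Trimmed-down version of the example manifest above.
manifest = {
    "kind": "Deployment",
    "metadata": {"name": "cpu"},
    "spec": {
        "template": {
            "metadata": {
                "labels": {"app": "nginx"},
                "annotations": {SAFE_TO_EVICT: "false"},
            }
        }
    },
}

print(annotation_locations(manifest))  # {'deployment': False, 'pod_template': True}
```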
@Jeffwan this is the definition of my deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
  creationTimestamp: 2019-02-13T15:54:19Z
  generation: 4
  labels:
    eks.amazonaws.com/component: coredns
    k8s-app: kube-dns
    kubernetes.io/name: CoreDNS
  name: coredns
  namespace: kube-system
  resourceVersion: "25237845"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/deployments/coredns
  uid: 9fd8827b-2fa7-11e9-8689-0a492027d742
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      eks.amazonaws.com/component: coredns
      k8s-app: kube-dns
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        prometheus.io/path: /metrics
        prometheus.io/port: "9153"
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        eks.amazonaws.com/component: coredns
        k8s-app: kube-dns
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: beta.kubernetes.io/os
                operator: In
                values:
                - linux
              - key: beta.kubernetes.io/arch
                operator: In
                values:
                - amd64
      containers:
      - args:
        - -conf
        - /etc/coredns/Corefile
        image: 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.1.3
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
This deployment is running in kube-system namespace.
Also I can see the annotation in the pod:
Name: coredns-5c9b578b9c-hsb8z
Namespace: kube-system
Priority: 2000001000
PriorityClassName: system-node-critical
Node: ip-10-0-110-169.eu-central-1.compute.internal/10.0.110.169
Start Time: Wed, 03 Apr 2019 01:54:30 +0200
Labels: eks.amazonaws.com/component=coredns
k8s-app=kube-dns
pod-template-hash=1756134657
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict=false
prometheus.io/path=/metrics
prometheus.io/port=9153
Status: Running
IP: 10.0.105.244
Controlled By: ReplicaSet/coredns-5c9b578b9c
Containers:
coredns:
Container ID: docker://238b834e0ddfb9aea72f589a6ce00fad3c3055c1ca935601185b2e63ddeff3c5
Image: 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.1.3
Image ID: docker-pullable://602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns@sha256:323985b6fd818047898e9154a47e438acf52cca8e60eda0c9acc9e5e2bec0914
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
Thanks for your help!
@Jeffwan I've just found a log line from cluster-autoscaler with the right value:
Fast evaluation: node $NODE-NAME cannot be removed: pod annotated as not safe to evict present: coredns-ddcd48977-thsxm
It appeared after I redeployed cluster-autoscaler with --skip-nodes-with-system-pods=true
Does this flag have anything to do with it?
Or was it because I created a new pod? Maybe the previous pod was not picking up that annotation for whatever reason.
Actually, that log line comes from "cluster-autoscaler.kubernetes.io/safe-to-evict=false".
The check for --skip-nodes-with-system-pods=true runs before the check for the safe-to-evict annotation.
Here's the logic.
https://github.com/kubernetes/autoscaler/blob/40cf6e43c032815350c07cd3e7eb36a992fd785b/cluster-autoscaler/utils/drain/drain.go#L187-L201
Since your coredns pod is in the kube-system namespace, technically both approaches work for coredns pods, but --skip-nodes-with-system-pods=true will also take effect on other critical pods.
I can help with coreDNS testing. By the way, since that's a core component, I'm wondering: did you manually annotate the pods in the deployment after your cluster had started up?
Yes, the coreDNS deployment comes already in EKS by default and I'm patching this deployment to add the annotation.
I'm testing 2 different scenarios in my clusters right now. In my production clusters I have this flag --skip-nodes-with-system-pods=true together with cluster-autoscaler.kubernetes.io/safe-to-evict=false. And then in my staging and QA clusters I have --skip-nodes-with-system-pods=false with the annotation cluster-autoscaler.kubernetes.io/safe-to-evict=false included in the pods.
So far I can see in my QA and staging cluster that cluster-autoscaler is respecting the annotation and ignoring the node: Fast evaluation: node $NODE_NAME cannot be removed: pod annotated as not safe to evict present: coredns-5c9b578b9c-nz5jt
However, I'm not seeing this in my production clusters, or at least not yet. It could be because those nodes are still being utilised, or because having --skip-nodes-with-system-pods=true prevents cluster-autoscaler from running this function:
https://github.com/kubernetes/autoscaler/blob/40cf6e43c032815350c07cd3e7eb36a992fd785b/cluster-autoscaler/utils/drain/drain.go#L187-L201
I'll post tomorrow morning at 11:00 CEST again and let you know the results in production.
Thanks for your help!
Ok, after testing during last night, it seems that the annotation is found and the cluster doesn't scale the node down.
I just redeployed cluster-autoscaler with --skip-nodes-with-system-pods=false in all my environments and I'll keep watching the logs during low traffic time when the cluster is supposed to shrink.
So what I saw, and what I think was the problem, is that cluster-autoscaler didn't identify the annotation while the previous cluster-autoscaler pod was running, the one that was there before I patched the deployment to add the annotation.
For whatever reason, once I replaced the cluster-autoscaler pod after patching the deployment, cluster-autoscaler could find the annotation and skip the nodes.
Again 17m ago the same happened.
A node was removed by cluster-autoscaler without respecting the annotation in my coreDNS pod.
I don't know how to troubleshoot this.
I can't understand how cluster-autoscaler in my QA cluster recognised the annotation and not in my production and staging clusters.
All of them are running kubernetes 1.11 and cluster-autoscaler 1.3.7
I added --skip-nodes-with-system-pods=true in my production clusters until we can find a solution to this. Just to bear this in mind when troubleshooting.
Hmm, interesting. So it was working at the beginning in your production environment, then it stopped working and the node running coredns was removed.
If you can fetch the logs from when the coredns node was scaled down, that would be super helpful. I will also run a test to take a look, and I will keep CA running for a while.
Actually I never saw this message cannot be removed: pod annotated as not safe to evict present: in my production clusters, only on QA which is changing the size more often.
But configuration was the same in all of the clusters.
It makes sense that you didn't see it in your prod cluster. --skip-nodes-with-system-pods=true is the default even if you don't specify it. Users sometimes want to allow pods under kube-system to be killed, and in that case they override it with --skip-nodes-with-system-pods=false.
cluster-autoscaler.kubernetes.io/safe-to-evict=false was not designed for system-level pods. It's commonly used for other pods that must not be interrupted.
If coredns is the pod you want to protect, you don't need any extra setting. In your production cluster, you will see:
I0404 11:25:13.014475 36902 cluster.go:107] Fast evaluation: node ip-192-168-6-231.us-west-2.compute.internal cannot be removed: non-daemonset, non-mirrored, non-pdb-assigned kube-system pod present: coredns-79ddd9f658-pp7pk
You will only see the following message if you use safe-to-evict=false together with --skip-nodes-with-system-pods=false, which doesn't make sense, as I said:
Fast evaluation: node ip-192-168-6-231.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: coredns-79ddd9f658-pp7pk
I ran one test with cluster-autoscaler.kubernetes.io/safe-to-evict=false and --skip-nodes-with-system-pods=false
3 coredns pods spread across 3 nodes in two node groups. None of the nodes has been removed in the past two hours. They are just recognized as unremovable every 5 minutes.
I0404 20:51:29.979804 1 utils.go:498] Skipping ip-192-168-28-143.us-west-2.compute.internal - node group min size reached
I0404 20:51:29.979858 1 scale_down.go:404] Node ip-192-168-24-179.us-west-2.compute.internal - utilization 0.026250
I0404 20:51:29.979873 1 scale_down.go:404] Node ip-192-168-6-231.us-west-2.compute.internal - utilization 0.038750
I0404 20:51:29.979926 1 scale_down.go:453] Finding additional 2 candidates for scale down.
I0404 20:51:29.979955 1 cluster.go:81] Fast evaluation: ip-192-168-24-179.us-west-2.compute.internal for removal
I0404 20:51:29.979968 1 cluster.go:95] Fast evaluation: node ip-192-168-24-179.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: coredns-79ddd9f658-vxkxc
I0404 20:51:29.979977 1 cluster.go:81] Fast evaluation: ip-192-168-6-231.us-west-2.compute.internal for removal
I0404 20:51:29.979985 1 cluster.go:95] Fast evaluation: node ip-192-168-6-231.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: coredns-79ddd9f658-pp7pk
I0404 20:51:29.979995 1 scale_down.go:490] 2 nodes found to be unremovable in simulation, will re-check them at 2019-04-04 20:56:29.696490772 +0000 UTC m=+8151.898305634
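To review many such lines at once, the "cannot be removed" messages can be pulled out with a small parser. This is a rough sketch that assumes the message layout seen in the log lines quoted in this thread; the function name `unremovable_nodes` and the regular expression are illustrative and are not part of cluster-autoscaler.

```python
import re

# Matches lines such as:
#   ... Fast evaluation: node <node> cannot be removed: <reason>: <pod>
UNREMOVABLE = re.compile(
    r"Fast evaluation: node (?P<node>\S+) cannot be removed: (?P<reason>.+): (?P<pod>\S+)$"
)


def unremovable_nodes(lines):
    """Return one dict (node, reason, pod) per 'cannot be removed' log line."""
    results = []
    for line in lines:
        match = UNREMOVABLE.search(line.strip())
        if match:
            results.append(match.groupdict())
    return results


# Sample lines copied from the logs above.
sample = [
    "I0404 20:51:29.979955 1 cluster.go:81] Fast evaluation: ip-192-168-24-179.us-west-2.compute.internal for removal",
    "I0404 20:51:29.979968 1 cluster.go:95] Fast evaluation: node ip-192-168-24-179.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: coredns-79ddd9f658-vxkxc",
    "I0404 11:25:13.014475 36902 cluster.go:107] Fast evaluation: node ip-192-168-6-231.us-west-2.compute.internal cannot be removed: non-daemonset, non-mirrored, non-pdb-assigned kube-system pod present: coredns-79ddd9f658-pp7pk",
]

for entry in unremovable_nodes(sample):
    print(entry["node"], "|", entry["reason"], "|", entry["pod"])
```

Grouping the result by reason shows at a glance whether a node was protected by the annotation or by the kube-system check.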
Here's the pod setting.
Name: coredns-79ddd9f658-pp7pk
Namespace: kube-system
Priority: 2000001000
PriorityClassName: system-node-critical
Node: ip-192-168-6-231.us-west-2.compute.internal/192.168.6.231
Start Time: Thu, 04 Apr 2019 11:24:41 -0700
Labels: eks.amazonaws.com/component=coredns
k8s-app=kube-dns
pod-template-hash=79ddd9f658
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status: Running
IP: 192.168.14.155
Controlled By: ReplicaSet/coredns-79ddd9f658
So far, I don't see any issues with using cluster-autoscaler.kubernetes.io/safe-to-evict=false, even with --skip-nodes-with-system-pods=false configured.
That's interesting because in my QA cluster I can see the logs as expected:
I0404 23:50:50.339425 1 aws_manager.go:148] Refreshed ASG list, next refresh after 2019-04-04 23:51:00.339418497 +0000 UTC m=+119934.687313817
I0404 23:50:50.339562 1 utils.go:541] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0404 23:50:50.339575 1 static_autoscaler.go:260] Filtering out schedulables
I0404 23:50:50.339851 1 static_autoscaler.go:270] No schedulable pods
I0404 23:50:50.339864 1 static_autoscaler.go:274] No unschedulable pods
I0404 23:50:50.339878 1 static_autoscaler.go:315] Calculating unneeded nodes
I0404 23:50:50.339897 1 utils.go:498] Skipping $NODE_NAME - node group min size reached
I0404 23:50:50.339910 1 utils.go:498] Skipping $NODE_NAME - node group min size reached
I0404 23:50:50.339924 1 utils.go:498] Skipping $NODE_NAME - node group min size reached
I0404 23:50:50.340098 1 scale_down.go:373] Scale-down calculation: ignoring 3 nodes unremovable in the last 5m0s
I0404 23:50:50.340115 1 scale_down.go:404] Node $NODE_NAME_WITH_POD_ANNOTATED - utilization 0.152606
I0404 23:50:50.340286 1 scale_down.go:453] Finding additional 1 candidates for scale down.
I0404 23:50:50.340419 1 cluster.go:81] Fast evaluation: $NODE_NAME_WITH_POD_ANNOTATED for removal
I0404 23:50:50.340434 1 cluster.go:95] Fast evaluation: node $NODE_NAME_WITH_POD_ANNOTATED cannot be removed: pod annotated as not safe to evict present: coredns-597c9769b6-rth7b
I don't understand why this annotation is not being taken into consideration in my staging and production clusters but is in my QA cluster.
Could you list your CA configuration in the three environments? Are they the same?
Deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: {{ include "name" . | quote }}
namespace: kube-system
labels:
app: {{ include "name" . | quote }}
{{- include "labels-standard" . | indent 4 }}
spec:
replicas: 1
selector:
matchLabels:
app: {{ include "name" . | quote}}
template:
metadata:
labels:
app: {{ include "name" . | quote}}
{{- include "labels-standard" . | indent 8 }}
annotations:
scheduler.alpha.kubernetes.io/critical-pod: ''
prometheus.io/scrape: 'true'
prometheus.io/port: '8085'
spec:
tolerations:
{{- if not .Values.eks }}
- effect: NoSchedule
key: node-role.kubernetes.io/master
nodeSelector:
kubernetes.io/role: master
{{- end}}
{{- if .Values.rbac.create }}
serviceAccountName: {{ include "name" . }}
{{- end}}
containers:
- image: "{{ .Values.deployment.image.repository }}/{{ .Values.deployment.image.imageName }}:{{ .Values.deployment.image.imageVersion }}"
name: {{ include "name" . | quote}}
resources:
limits:
cpu: 100m
memory: 300Mi
requests:
cpu: 100m
memory: 300Mi
command:
- ./{{ include "name" . }}
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --skip-nodes-with-system-pods=false
- --scale-down-enabled={{ default "false" .Values.scaleDownEnabled }}
- --scale-down-delay-after-add={{ default "1h" .Values.scaleDownDelay }}
- --scale-down-unneeded-time={{ default "1h" .Values.scaleDownUnneededTime }}
- --scale-down-utilization-threshold={{ default "1h" .Values.scaleDownUtilizationThreshold }}
- --balance-similar-node-groups=true
- --max-node-provision-time=5m0s
- --expander={{ default "most-pods" .Values.expander }}
- --max-node-provision-time=7m
- --expendable-pods-priority-cutoff=-10
- --node-group-auto-discovery=asg:tag=cluster-autoscaler/auto-discovery/enabled,kubernetes.io/cluster/{{ .Values.clusterName }}
{{- if .Values.utilization }}
- --ignore-daemonsets-utilization={{ default "false" .Values.utilization.ignoreDaemonsets }}
{{- end }}
env:
- name: AWS_REGION
value: {{ .Values.awsRegion }}
volumeMounts:
- name: ssl-certs
{{- if .Values.eks }}
mountPath: {{ .Values.eks.sslCertPath }}
{{- else}}
mountPath: /etc/ssl/certs/ca-certificates.crt
{{- end}}
readOnly: true
imagePullPolicy: "IfNotPresent"
tolerations:
{{- if not .Values.eks }}
- effect: NoSchedule
key: node-role.kubernetes.io/master
{{- end }}
- key: CriticalAddonsOnly
operator: Exists
- effect: NoExecute
operator: Exists
- effect: NoSchedule
operator: Exists
{{- if not .Values.eks }}
nodeSelector:
kubernetes.io/role: master
{{- end}}
volumes:
- name: ssl-certs
hostPath:
path: "/etc/ssl/certs/ca-certificates.crt"
dnsPolicy: "Default"
Values file for Prod:
env: prod
deployment:
  image:
    imageVersion: "v1.3.7"
clusterName: "$CLUSTER_NAME"
scaleDownUtilizationThreshold: 0.5
awsRegion: eu-central-1
scaleDownDelay: 10m
scaleDownUnneededTime: 30m
scaleDownEnabled: true
rbac:
  create: true
utilization: {}
  # ignoreDaemonsets: {}
eks:
  sslCertPath: /etc/kubernetes/pki/ca.crt
Values file QA:
env: qa
deployment:
  image:
    imageVersion: "v1.3.7"
clusterName: "$CLUSTER_NAME"
scaleDownUtilizationThreshold: 0.6
awsRegion: eu-central-1
scaleDownDelay: 20m
scaleDownUnneededTime: 10m
scaleDownEnabled: true
rbac:
  create: true
utilization: {}
  # ignoreDaemonsets: {}
eks:
  sslCertPath: /etc/kubernetes/pki/ca.crt
Everything is the same.
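A mechanical comparison helps confirm what actually differs between the two values files. The sketch below assumes both files have been parsed into Python dicts; the keys and values are copied from the files above, while the helper `diff_values` is a hypothetical name introduced only for this illustration.

```python
def diff_values(a, b, prefix=""):
    """Return {dotted_key: (a_value, b_value)} for every leaf that differs."""
    diffs = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Recurse into nested sections such as deployment.image.
            diffs.update(diff_values(left, right, prefix=f"{path}."))
        elif left != right:
            diffs[path] = (left, right)
    return diffs


prod = {
    "env": "prod",
    "deployment": {"image": {"imageVersion": "v1.3.7"}},
    "clusterName": "$CLUSTER_NAME",
    "scaleDownUtilizationThreshold": 0.5,
    "awsRegion": "eu-central-1",
    "scaleDownDelay": "10m",
    "scaleDownUnneededTime": "30m",
    "scaleDownEnabled": True,
    "rbac": {"create": True},
    "utilization": {},
    "eks": {"sslCertPath": "/etc/kubernetes/pki/ca.crt"},
}
qa = dict(
    prod,
    env="qa",
    scaleDownUtilizationThreshold=0.6,
    scaleDownDelay="20m",
    scaleDownUnneededTime="10m",
)

for key, (prod_value, qa_value) in diff_values(prod, qa).items():
    print(f"{key}: prod={prod_value} qa={qa_value}")
```

For these two files the only differences are the environment name and the scale-down threshold and timing values; the image version and the flags hard-coded in the template are identical.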
I just saw how in my production cluster I lost another node running coreDNS even though I have the --skip-nodes-with-system-pods=true set.
Also, I can see in the logs of CA that some other pod is being recognised:
Fast evaluation: node $NODE_NAME cannot be removed: non-daemonset, non-mirrored, non-pdb-assigned kube-system pod present: coredns-54f4cff84d-brdjb
So it looks like this if statement: https://github.com/kubernetes/autoscaler/blob/77ebb22fea9bcfaeccf7ce60be51b7e78e294cb0/cluster-autoscaler/utils/drain/drain.go#L183
is not always working.
CA doesn't look for all the reasons why it can't delete the node - as soon as one pod can't be moved it won't bother checking the other pods (since the node can't be deleted anyway). So if there are multiple system pods on a single node, only one of them will be logged and the others will likely never show up in the log. So the fact that the log lines are missing is not, by itself, an issue.
Similarly, --skip-nodes-with-system-pods=true doesn't prevent safe-to-evict from working. It just prevents autoscaler from ever logging it - your pod was already disqualified by a different check (kube-system namespace), so CA never bothers to check for the annotation since the pod can't be restarted anyway.
Bottom line - whether the line shows in logs or not is random and doesn't mean the feature is not working.
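The short-circuit behaviour described above can be modelled roughly as follows. This is a simplified sketch based only on the explanations and log messages in this thread, not the actual drain.go code: the pod dicts, their field names and the function `first_blocker` are all assumptions made for illustration.

```python
from typing import Optional

SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def first_blocker(pods, skip_nodes_with_system_pods: bool) -> Optional[str]:
    """Return the first reason a node cannot be removed, or None if nothing blocks it."""
    for pod in pods:
        # DaemonSet and mirror pods are not considered blockers in this model.
        if pod.get("daemonset") or pod.get("mirror"):
            continue
        # The kube-system check comes first, so it wins for system pods.
        if (
            skip_nodes_with_system_pods
            and pod["namespace"] == "kube-system"
            and not pod.get("has_pdb")
        ):
            return (
                "non-daemonset, non-mirrored, non-pdb-assigned kube-system pod present: "
                + pod["name"]
            )
        # The annotation is only reached if no earlier check matched.
        if pod.get("annotations", {}).get(SAFE_TO_EVICT) == "false":
            return "pod annotated as not safe to evict present: " + pod["name"]
    # Evaluation stops at the first blocker, so later pods are never reported.
    return None


coredns = {
    "name": "coredns-79ddd9f658-pp7pk",
    "namespace": "kube-system",
    "annotations": {SAFE_TO_EVICT: "false"},
}

print(first_blocker([coredns], skip_nodes_with_system_pods=True))
print(first_blocker([coredns], skip_nodes_with_system_pods=False))
```

In this model the node is protected either way; only the logged reason changes with the flag, which matches the two different messages quoted earlier in the thread.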
The way to debug your issue would be to search the logs for the node that disappeared and try to find out what happened to it. If it was scaled down by CA, there will be logs describing the scale-down. My guess would be that it was either removed by something else, or was removed after the node was drained for an unrelated reason (e.g. nodecontroller draining an unready node).
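One way to follow that advice is to collect every log line that mentions the node and read them in order. The sketch below assumes the glog-style prefix visible in the lines quoted in this thread (severity letter, date, time, process id, source file); the helper `node_history` is a hypothetical name, and the code deliberately makes no assumption about the wording of scale-down messages.

```python
import re

# Example prefix: "I0404 20:51:29.979858 1 scale_down.go:404] <message>"
GLOG_LINE = re.compile(
    r"^[IWEF]\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ (?P<source>\S+)\] (?P<message>.*)$"
)


def node_history(lines, node_name):
    """Return (time, source, message) for each log line that mentions node_name."""
    history = []
    for line in lines:
        match = GLOG_LINE.match(line.strip())
        if match and node_name in match.group("message"):
            history.append(
                (match.group("time"), match.group("source"), match.group("message"))
            )
    return history


# Sample lines copied from the logs earlier in this thread.
log_lines = [
    "I0404 20:51:29.979858 1 scale_down.go:404] Node ip-192-168-24-179.us-west-2.compute.internal - utilization 0.026250",
    "I0404 20:51:29.979873 1 scale_down.go:404] Node ip-192-168-6-231.us-west-2.compute.internal - utilization 0.038750",
    "I0404 20:51:29.979955 1 cluster.go:81] Fast evaluation: ip-192-168-24-179.us-west-2.compute.internal for removal",
    "I0404 20:51:29.979968 1 cluster.go:95] Fast evaluation: node ip-192-168-24-179.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: coredns-79ddd9f658-vxkxc",
]

for time, source, message in node_history(log_lines, "ip-192-168-24-179.us-west-2.compute.internal"):
    print(time, source, message)
```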
@MaciekPytel Thanks for the info.
In fact I could see in the logs of cluster-autoscaler how the specific node running the coreDNS pod was removed by cluster-autoscaler due to low utilization.
I'll check what else could be affecting this, but it looks weird to me that cluster-autoscaler decided by itself to remove that node when running a pod from kube-system and --skip-nodes-with-system-pods=true
I'll keep checking.
Thanks for your help!
We found the problem in our case.
After saving all the logs from cluster-autoscaler in our logging system, we were able to investigate this problem in more depth, and it turned out to be caused by AZRebalance on the EC2 instances.
Sometimes, after cluster-autoscaler had terminated instances, we were left running more machines in one AZ than in the others. AWS then started to terminate instances in that AZ to rebalance the number of machines running in each AZ, and since this is an EC2 process, any machine could be terminated regardless of what was running on it. That was exactly why instances running coreDNS pods were being terminated.
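The effect can be illustrated with a toy model. This is not AWS's actual rebalancing algorithm: it only encodes the observation above that instances in the most populated AZ become termination candidates regardless of the pods they run. The instance ids, zone names and the function `rebalance_candidates` are invented for the example.

```python
from collections import Counter


def rebalance_candidates(instances, zones):
    """Return ids of instances in the most populated zone(s) when zones are uneven."""
    counts = Counter({zone: 0 for zone in zones})
    counts.update(instance["az"] for instance in instances)
    if len(set(counts.values())) <= 1:
        return []  # every zone holds the same number of instances
    busiest = max(counts.values())
    # Pod annotations play no role here: the choice is made at the EC2 level.
    return [i["id"] for i in instances if counts[i["az"]] == busiest]


zones = ["eu-central-1a", "eu-central-1b", "eu-central-1c"]
# Hypothetical state after cluster-autoscaler removed nodes from zones b and c.
instances = [
    {"id": "i-1", "az": "eu-central-1a", "pods": ["coredns"]},
    {"id": "i-2", "az": "eu-central-1a", "pods": []},
    {"id": "i-3", "az": "eu-central-1b", "pods": []},
    {"id": "i-4", "az": "eu-central-1c", "pods": []},
]

print(rebalance_candidates(instances, zones))  # ['i-1', 'i-2']
```

The instance running coredns is a candidate like any other, which is why neither the annotation nor the cluster-autoscaler flag could protect it.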
Sorry for the inconvenience of this issue and thanks a lot for your support.
Thanks for the followup!
The behaviour seems to be documented here: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#common-notes-and-gotchas
Is this in line with what you experienced?
@bskiba Yes, that was exactly the problem.